In this session, we will learn string manipulation in R. We’ll cover the basics of how strings work and how to create them by hand. The focus of this session is regular expressions. We will use the R package stringr.
Note: This lesson is based on the book: R for Data Science (Wickham and Grolemund (2016)).
We can create strings with either single quotes or double quotes.
string1 <- "Hello, the world!"
string2 <- "We can use a 'single' quote inside of a string"
string3 <- 'or use a "double" quote inside of a string'
## [1] "Hello, the world!"
## [1] "We can use a 'single' quote inside of a string"
## [1] "or use a \"double\" quote inside of a string"
To include a literal single or double quote in a string, we can use \ to escape it.
## [1] "'"
## [1] "\""
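The calls that produced the two outputs above are not shown. A minimal sketch of how such strings could be created follows; the variable names are assumptions chosen for illustration.

```r
# Escape the quote character that also delimits the string.
single_quote <- '\''   # a single quote inside single quotes
double_quote <- "\""   # a double quote inside double quotes

# Alternatively, delimit with the other quote type, so no escape is needed.
identical(single_quote, "'")
identical(double_quote, '"')

# Each string holds exactly one character; the backslash is not stored.
nchar(single_quote)
nchar(double_quote)
```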
This means if we want to include a literal backslash, we will need to double it up: "\\".
Note: The printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of a string, use the writeLines() function.
## or use a "double" quote inside of a string
## "
There are a handful of special characters. We list a few of them here:
"\n": newline
"\t": tab
"\\": backslash \
Note: A complete list can be found by requesting help on ?'"' or ?"'".
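The difference between the printed representation and the raw contents, together with the special characters listed above, can be sketched as follows. The vector x is a made-up example, not taken from the lesson.

```r
# Three strings: a double quote, a backslash, and "a", tab, "b".
x <- c("\"", "\\", "a\tb")

# print() shows the escapes; writeLines() shows the raw contents.
print(x)
writeLines(x)

# Escapes are not stored as extra characters:
# the lengths are 1, 1, and 3.
nchar(x)
```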
We may also need to write non-English characters such as Greek letters and symbols. We can use Unicode character codes, which work on all platforms. A list of Unicode characters can be found on Wikipedia. For example, the code for the Greek small letter mu is U+03BC and the code for the small letter rho is U+03C1.
## [1] "μ"
## [1] "ρ"
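One way to produce the two letters above is the "\u" escape followed by the hexadecimal code point. This is a sketch; the variable names mu and rho are assumptions.

```r
# "\uXXXX" inserts the character with the given hexadecimal code point.
mu  <- "\u03bc"
rho <- "\u03c1"

mu
rho

# utf8ToInt() recovers the code point as an integer (0x03BC is 956).
utf8ToInt(mu) == 0x03BC
utf8ToInt(rho) == 0x03C1
```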
We can use the str_length() function to get the number of characters in a string.
## [1] 1
## [1] 17
Note: The length() function returns the length of the object, not the number of characters in the string.
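The contrast between str_length() and length() can be illustrated with a small vector. The vector x is a made-up example.

```r
library(stringr)

x <- c("R", "stringr", NA)

# Number of characters in each element; a missing value stays missing.
str_length(x)

# Number of elements in the vector, regardless of their contents.
length(x)
```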
To combine two or more strings, we can use paste(), paste0(), or str_c(). Since we have already introduced the first two functions, we will focus on the third. In RStudio, typing str_ triggers autocomplete, which lets us see all stringr functions.
## [1] "PythonandR"
## [1] "Python, C++, R"
## [1] "Python,C++,R"
## [1] "Data for2015 Year" "Data for2016 Year" "Data for2017 Year"
## [4] "Data for2018 Year" "Data for2019 Year" "Data for2020 Year"
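The calls behind the outputs above are not shown. Calls like the following would reproduce them; the exact arguments are assumptions reconstructed from the printed results.

```r
library(stringr)

# With no separator, the pieces are glued together directly.
str_c("Python", "and", "R")

# sep controls what is placed between the arguments.
str_c("Python", "C++", "R", sep = ", ")

# collapse joins the elements of one vector into a single string.
str_c(c("Python", "C++", "R"), collapse = ",")

# str_c() is vectorised: shorter arguments are recycled.
str_c("Data for", 2015:2020, " Year")
```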
name <- "Tessa"
time_of_day <- "morning"
birthday <- FALSE
str_c("Good ", time_of_day, " ", name,
if (birthday) " and Happy Birthday!", ".")
## [1] "Good morning Tessa."
We can extract parts of a string using the str_sub() function, which takes start and end arguments that give the position of the substring.
x <- c("December 1, 2021", "July 7, 1969", "March 17, 2004", "January 11, 1976")
str_sub(x, nchar(x)-4, nchar(x))
## [1] " 2021" " 1969" " 2004" " 1976"
## [1] "2021" "1969" "2004" "1976"
## [1] "December 1, 2022" "July 7, 2022" "March 17, 2022" "January 11, 2022"
Note: 1. Negative numbers count backwards from the end of the string.
2. The str_sub() function won’t fail if the string is too short. It will just return as much as possible.
We can also use the assignment form of str_sub() function to modify strings.
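The three behaviours of str_sub() described above (negative positions, strings that are too short, and the assignment form) can be sketched as follows, using a shortened version of the dates vector.

```r
library(stringr)

x <- c("December 1, 2021", "July 7, 1969")

# Negative positions count from the end: the last four characters.
str_sub(x, -4, -1)

# Asking for more characters than exist returns what is available.
str_sub("abc", 1, 10)

# The assignment form replaces the selected substring in place.
str_sub(x, -4, -1) <- "2022"
x
```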
To learn regular expressions, we will use the str_view() and str_view_all() functions. We start with a simple example.
## [1] "December 1, 2022" "July 7, 2022" "March 17, 2022" "January 11, 2022"
The next step up in complexity is ., which matches any character (except a newline).
But if . matches any character, how do we match the character “.”? We escape it with a backslash, so the regular expression is \. and, because the backslash must itself be escaped inside a string, we need the string "\\.".
Try: str_view(c("abc", "a.c", "bet"), "a.c")
## \.
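The difference between an unescaped and an escaped dot can be checked with str_detect(), which returns a logical vector rather than the highlighted view that str_view() produces. This is a small sketch using the same three strings.

```r
library(stringr)

x <- c("abc", "a.c", "bet")

# Unescaped: . matches any character, so "abc" and "a.c" both match.
str_detect(x, "a.c")

# Escaped: only a literal dot matches, so only "a.c" matches.
str_detect(x, "a\\.c")

# writeLines() shows the regular expression the string actually contains.
writeLines("a\\.c")
```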
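If we want to match a literal \ in a string, we need to write "\\\\". The regular expression is \\, and each of its two backslashes must be escaped again inside the string, so four backslashes in the pattern match one \ in the string.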
## a\b
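A short sketch of matching a literal backslash; the string x is chosen to match the output shown above.

```r
library(stringr)

# The string contains three characters: a, one backslash, b.
x <- "a\\b"
writeLines(x)
str_length(x)

# The pattern string "\\\\" holds the two-character regex \\,
# which matches a single literal backslash.
str_length("\\\\")
str_detect(x, "\\\\")
```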
By default, regular expressions will match any part of a string.
There are a number of special patterns that match more than one character. Here are four useful tools:
\d: matches any digit
\s: matches any whitespace (space, tab, newline)
[abc]: matches a, b, or c
[^abc]: matches anything except a, b, or c
To create a regular expression containing \d or \s, we need to escape the \ for the string, so we type "\\d" or "\\s". We can use alternation to pick between one or more alternative patterns. For example, abc|xyz matches either abc or xyz.
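The character classes and alternation described above can be tried on a small vector. The vector x is a made-up example.

```r
library(stringr)

x <- c("room 101", "grey", "gray", "no digits")

# \d: does the string contain a digit?
str_detect(x, "\\d")

# \s: does the string contain whitespace?
str_detect(x, "\\s")

# A character class and an alternation that match the same spellings.
str_detect(x, "gr[ae]y")
str_detect(x, "gr(a|e)y")

# A negated class: is there any character other than "a"?
str_detect(c("aaa", "abc"), "[^a]")
```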
Try:
str_view(x, "CC+")
str_view(x, "CC*")
str_view(x, "C[LX]+")
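The patterns above use the repetition operators: ? (zero or one), + (one or more), and * (zero or more). The vector x is not defined in this section; the value below is an assumption, borrowed from the corresponding example in R for Data Science. str_extract() is used instead of str_view() so that the matched text is returned as a plain string.

```r
library(stringr)

# Assumed value of x (not shown in this lesson).
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# C followed by zero or one C.
str_extract(x, "CC?")

# C followed by one or more C.
str_extract(x, "CC+")

# C followed by zero or more C.
str_extract(x, "CC*")

# C followed by one or more L or X.
str_extract(x, "C[LX]+")
```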
We can also specify the number of matches precisely.
{n}: exactly n
{n,}: n or more
{,m}: at most m
{n,m}: between n and m
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
By default these matches are “greedy”: they will match the longest string possible. We can make them “lazy”, matching the shortest string possible, by putting a ? after them.
str_view(x, "C{2,3}?")
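The effect of counted repetition and of lazy matching can be seen by extracting the matched text. As before, x is not defined in this section, so the value below is an assumption taken from the corresponding example in R for Data Science.

```r
library(stringr)

# Assumed value of x (not shown in this lesson).
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# Exactly two C's.
str_extract(x, "C{2}")

# Two or more C's: greedy, so all three are taken.
str_extract(x, "C{2,}")

# Between two and three C's: greedy, so three are taken.
str_extract(x, "C{2,3}")

# The same range made lazy with ?: the match stops at two.
str_extract(x, "C{2,3}?")

# Lazy + stops after the first L.
str_extract(x, "C[LX]+?")
```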
We can use the str_detect() function to check if a character vector matches a pattern.
x <- c("Jazz Chisholm Jr", "Cedric Mullins", "Alex Colome", "LaMonte Wade Jr", "AJrich")
str_detect(x, "Jr")
## [1] TRUE FALSE FALSE TRUE TRUE
## [1] "Jazz Chisholm Jr" "LaMonte Wade Jr" "AJrich"
## [1] "Jazz Chisholm Jr" "LaMonte Wade Jr"
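The two subsets printed above are consistent with calls like the following. The calls themselves are not shown in the lesson, so the use of str_subset() and of the pattern "Jr$" (where $ anchors the match to the end of the string) is an assumption.

```r
library(stringr)

x <- c("Jazz Chisholm Jr", "Cedric Mullins", "Alex Colome",
       "LaMonte Wade Jr", "AJrich")

# Keep the elements that contain "Jr" anywhere.
str_subset(x, "Jr")

# Keep only the elements that end with "Jr".
str_subset(x, "Jr$")

# A logical vector can be summed to count the matching elements.
sum(str_detect(x, "Jr"))
```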
We can use the str_extract() function to extract the actual text of a match. The following example uses the Harvard sentences, which were designed to test VoIP systems. They are provided in the stringr package as sentences.
## [1] 720
## [1] "The birch canoe slid on the smooth planks."
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."
## [4] "These days a chicken leg is a rare dish."
Suppose that we want to find all sentences that contain a color.
colors <- c("red", "orange", "yellow", "green", "purple", "blue")
color_match <- str_c(colors, collapse = "|")
color_match
## [1] "red|orange|yellow|green|purple|blue"
has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)
head(matches)
## [1] "blue" "blue" "red" "red" "red" "blue"
The str_extract() function only extracts the first match. We can use the str_extract_all() function to get all matches.
## [1] "It is hard to erase blue or red ink."
## [2] "The green light in the brown box flickered."
## [3] "The sky in the west is tinged with orange red."
## [[1]]
## [1] "blue" "red"
##
## [[2]]
## [1] "green" "red"
##
## [[3]]
## [1] "orange" "red"
If we set simplify = TRUE, str_extract_all() returns a matrix with short matches expanded to the same length as the longest.
## [,1] [,2]
## [1,] "blue" "red"
## [2,] "green" "red"
## [3,] "orange" "red"
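The difference between str_extract(), str_extract_all(), and simplify = TRUE can be sketched on two sentences. The second sentence and the shortened pattern are made up for illustration, so that the padding of short rows is visible.

```r
library(stringr)

s <- c("It is hard to erase blue or red ink.", "The sky is blue.")
pattern <- "red|blue"

# Only the first match per string.
str_extract(s, pattern)

# All matches, returned as a list with one element per string.
str_extract_all(s, pattern)

# All matches as a character matrix; short rows are padded with "".
m <- str_extract_all(s, pattern, simplify = TRUE)
m
```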
The str_replace() and str_replace_all() functions allow us to replace matches with new strings.
languages <- c("python", "java,", "php", "javascript", "objective-c", "ruby",
"perl", " sql","kotlin", " r,", "matlab"," c#", " c++ ", "c++,", "c++/", " c,", " c ", "c/")
languages <- toupper(languages)
all_c1 <- str_c(languages[c(12:18)], collapse = "|")
str_replace(languages, all_c1, "C")
## [1] "PYTHON" "JAVA," "PHP" "JAVASCRIPT" "OBJECTIVE-C"
## [6] "RUBY" "PERL" " SQL" "KOTLIN" " R,"
## [11] "MATLAB" "C" " C++ " "C++," "C++/"
## [16] "C" "C" "C"
all_c2 <- str_c(c("C\\+\\+ ", "C\\+\\+,", "C\\+\\+/"), collapse = "|")
all_c <- str_c(c(all_c1, all_c2), collapse = "|")
str_replace(languages, all_c, "C")
## [1] "PYTHON" "JAVA," "PHP" "JAVASCRIPT" "OBJECTIVE-C"
## [6] "RUBY" "PERL" " SQL" "KOTLIN" " R,"
## [11] "MATLAB" "C" " C" "C" "C"
## [16] "C" "C" "C"
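Two points from the example above can be isolated in a smaller sketch: str_replace() changes only the first match in each string while str_replace_all() changes every match, and + must be escaped to be matched literally. The fixed() helper from stringr, which treats the pattern as a literal string, is not used in this lesson and is shown here only as an alternative to escaping. The string x is a made-up example.

```r
library(stringr)

x <- "C++ and C++"

# Only the first match is replaced.
str_replace(x, "C\\+\\+", "C")

# Every match is replaced.
str_replace_all(x, "C\\+\\+", "C")

# fixed() compares the pattern literally, so no escaping is needed.
str_replace_all(x, fixed("C++"), "C")
```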