MTH 209 Data Manipulation and Management

Lesson 16: String Manipulation with stringr

Ying-Ju Tessa Chen
ychen4@udayton.edu
University of Dayton

Overview

In this session, we will learn string manipulation in R. We’ll learn the basics of how strings work and how to create them by hand. The focus of this session will be on regular expressions. We will use the R package stringr.

library(stringr)

Note: This lesson is based on the book R for Data Science (Wickham and Grolemund 2016).

String Basics - 1

We can create strings with either single quotes or double quotes.

string1 <- "Hello, the world!"
string2 <- "We can use a 'single' quote inside of a string"
string3 <- 'or use a "double" quote inside of a string'
string1
## [1] "Hello, the world!"
string2
## [1] "We can use a 'single' quote inside of a string"
string3
## [1] "or use a \"double\" quote inside of a string"

String Basics - 2

To include a literal single or double quote in a string, we can use \ to escape it.

single_q <- '\'' # or "'"
double_q <- "\"" # or '"'
single_q
## [1] "'"
double_q
## [1] "\""

This means if we want to include a literal backslash, we will need to double it up: "\\".

Note: The printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of a string, use the writeLines() function.

writeLines(string3)
## or use a "double" quote inside of a string
writeLines(double_q)
## "
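The difference between the printed representation and the raw contents can be checked with a small sketch. The variable name and value below are illustrative and not part of the lesson; only base R functions are used.

```r
# Illustrative string (an assumption, not from the lesson): two literal backslashes.
path <- "x\\y\\z"

print(path)       # printed representation keeps the escapes: "x\\y\\z"
writeLines(path)  # raw contents: x\y\z

# nchar() counts raw characters, so each escaped backslash counts once.
n_chars <- nchar(path)
n_chars
```

Here n_chars is 5, even though the string literal was typed with seven characters between the quotes.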

Some Common Special Characters

There are a handful of special characters. Two of the most common are "\n" (newline) and "\t" (tab).

Note: A complete list can be found by requesting help on ?'"' or ?"'".
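As a minimal sketch of how special characters behave, the following uses the newline and tab escapes. The variable names are illustrative assumptions; only base R functions are used.

```r
# Illustrative strings (not from the lesson).
two_lines <- "first\nsecond"  # \n is a newline
tabbed    <- "name\tvalue"    # \t is a tab

writeLines(two_lines)  # prints "first" and "second" on separate lines
writeLines(tabbed)     # prints "name" and "value" separated by a tab

# Each escape sequence is stored as a single character.
len_two_lines <- nchar(two_lines)
len_tabbed    <- nchar(tabbed)
```

Both escapes are typed as two characters but count as one, so len_two_lines is 12 and len_tabbed is 10.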

We may also need to write non-English characters, such as Greek letters and symbols. We can use Unicode escapes of the form "\u" followed by the hexadecimal code point, which work on all platforms. A list of Unicode characters can be found on Wikipedia. For example, the code for the Greek small letter mu is U+03BC and the code for the small letter rho is U+03C1.

x1 <- "\u03bc"
x2 <- "\u03c1"
x1
## [1] "μ"
x2
## [1] "ρ"

String Length

We can use the str_length() function to get the number of characters in a string.

length(string1)
## [1] 1
str_length(string1)
## [1] 17

Note: The length() function returns the length of the object, not the number of characters in the string.
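The contrast between length() and str_length() is easier to see on a vector with several elements. This is a small sketch assuming stringr is installed; the vector words is illustrative.

```r
library(stringr)

# Illustrative vector with a missing value (not from the lesson).
words <- c("a", "R for data science", NA)

n_elements <- length(words)      # number of elements in the vector: 3
n_chars    <- str_length(words)  # characters per element: 1 18 NA
n_chars
```

str_length() is vectorized and keeps missing values as NA rather than counting them.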

Combining Strings

To combine two or more strings, we can use paste(), paste0(), or str_c(). Since we have already introduced the first two functions, we will focus on str_c(). In RStudio, typing str_ triggers autocomplete, which lets us see all stringr functions.

str_c("Python", "and", "R")
## [1] "PythonandR"
str_c("Python", "C++", "R", sep=", ")
## [1] "Python, C++, R"
str_c(c("Python", "C++", "R"), collapse = ",")
## [1] "Python,C++,R"
str_c("Data for", 2015:2020, " Year")
## [1] "Data for2015 Year" "Data for2016 Year" "Data for2017 Year"
## [4] "Data for2018 Year" "Data for2019 Year" "Data for2020 Year"
name <- "Tessa"
time_of_day <- "morning"
birthday <- FALSE
str_c("Good ", time_of_day, " ", name, 
      if (birthday) " and Happy Birthday!", ".")
## [1] "Good morning Tessa."
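One point the examples above do not cover is how str_c() treats missing values. The sketch below, adapted from the approach in R for Data Science, assumes stringr is installed; the vector x_na is illustrative.

```r
library(stringr)

# Illustrative vector with a missing value (not from the lesson).
x_na <- c("abc", NA)

# Missing values are contagious: combining with NA gives NA.
with_na <- str_c("|-", x_na, "-|")

# str_replace_na() turns NA into the string "NA" before combining.
as_text <- str_c("|-", str_replace_na(x_na), "-|")

with_na
as_text
```

This is a design difference from paste0(), which silently converts NA to the text "NA".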

Subsetting Strings

We can extract parts of a string using the str_sub() function, which takes start and end arguments that give the position of the substring.

x <- c("December 1, 2021", "July 7, 1969", "March 17, 2004", "January 11, 1976")
str_sub(x, nchar(x)-4, nchar(x))
## [1] " 2021" " 1969" " 2004" " 1976"
str_sub(x, -4, -1)
## [1] "2021" "1969" "2004" "1976"
str_sub(x, -4, -1) <- "2022"
x
## [1] "December 1, 2022" "July 7, 2022"     "March 17, 2022"   "January 11, 2022"

Note:

  1. Negative numbers count backwards from the end of the string.

  2. The str_sub() function won’t fail if the string is too short. It will just return as much as possible.

  3. We can also use the assignment form of the str_sub() function to modify strings.
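The three points in the note can be sketched as follows, assuming stringr is installed. The vector fruit_names is an illustrative assumption, not data from the lesson.

```r
library(stringr)

# Illustrative vector (not from the lesson).
fruit_names <- c("Apple", "Banana", "Pear")

# Negative positions count from the end: the last three characters.
last_three <- str_sub(fruit_names, -3, -1)   # "ple" "ana" "ear"

# A string that is too short does not cause an error.
too_short <- str_sub("a", 1, 5)              # "a"

# Assignment form: replace the first letter with its lower-case version.
str_sub(fruit_names, 1, 1) <- str_to_lower(str_sub(fruit_names, 1, 1))
fruit_names                                  # "apple" "banana" "pear"
```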

Matching Patterns with Regular Expressions

To learn regular expressions, we will use the str_view() and str_view_all() functions, which show how a pattern matches each string. We start with a simple example.

x
## [1] "December 1, 2022" "July 7, 2022"     "March 17, 2022"   "January 11, 2022"
str_view(x, "20")
#str_view(x, ", ")

Basic Matches - 1

The next step up in complexity is ., which matches any character (except a newline).

str_view(x, ".ly.")

Basic Matches - 2

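But if . matches any character, how do we match the character “.”? We need to escape it, so the regular expression is \. and, because \ is also the escape symbol in strings, we create that regular expression with the string “\\.”.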

x <- c("abc", "a.c", "bef")

Try: str_view(c("abc", "a.c", "bef"), "a.c")

dot <- "\\."
writeLines(dot)
## \.
str_view(x, "a\\.c")
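Because str_view() produces a visual display, the effect of escaping the dot can also be checked with str_detect(), which returns a logical vector. This is a minimal sketch assuming stringr is installed; fixed() is the stringr helper that treats a pattern as literal text.

```r
library(stringr)

x <- c("abc", "a.c", "bef")

# Unescaped dot: matches any character between a and c.
any_char <- str_detect(x, "a.c")            # TRUE  TRUE FALSE

# Escaped dot: matches only a literal ".".
literal_dot <- str_detect(x, "a\\.c")       # FALSE TRUE FALSE

# Alternative: fixed() compares the pattern as plain text, so no escaping is needed.
literal_fixed <- str_detect(x, fixed("a.c"))  # FALSE TRUE FALSE
```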

Basic Matches - 3

If we want to match a literal \ in a string, we need to write "\\\\": the regular expression is \\, and each of its backslashes must be escaped again in the string.

x <- "a\\b"
writeLines(x)
## a\b
str_view(x, "\\\\")
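The four backslashes can be hard to read. As a hedged alternative, raw strings (available in R 4.0.0 and later, which is an assumption about the reader's setup) let us type the regular expression without doubling the escapes. The variable names below are illustrative.

```r
library(stringr)

x <- "a\\b"   # raw contents: a\b

escaped_pattern <- "\\\\"    # ordinary string holding the regular expression \\
raw_pattern     <- r"(\\)"   # the same regular expression written as a raw string

# Both spellings produce the same two-character pattern.
same_pattern <- identical(escaped_pattern, raw_pattern)

found <- str_detect(x, raw_pattern)   # TRUE: x contains a backslash
```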

Basic Matches - 4

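By default, regular expressions will match any part of a string. We can anchor a pattern so that it matches only at the start of the string with ^ or only at the end with $.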

address <- c("tessa@udayton.edu", "abc@bgsu.edu", "def@miamioh.edu", "hello@gmail.com", "rst@mailbox.org")
# str_view(address, "^t")
str_view(address, "\\.edu$")

Basic Matches - 5

To force a regular expression to match only a complete string, we anchor it with both ^ and $.

x <- c("apple pie", "apple", "apple cake", "sweet apple cookies")
# str_view(address, "^t")
str_view(x, "^apple$")
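The anchors can also be verified without the viewer by using str_subset(), which keeps only the elements that match. This is a small sketch assuming stringr is installed; it reuses the address and apple vectors from the examples above.

```r
library(stringr)

address <- c("tessa@udayton.edu", "abc@bgsu.edu", "def@miamioh.edu",
             "hello@gmail.com", "rst@mailbox.org")

starts_with_t <- str_subset(address, "^t")       # addresses beginning with t
ends_with_edu <- str_subset(address, "\\.edu$")  # addresses ending in .edu

x <- c("apple pie", "apple", "apple cake", "sweet apple cookies")

contains_apple <- str_subset(x, "apple")    # all four elements
exactly_apple  <- str_subset(x, "^apple$")  # only the complete string "apple"
```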

Character Classes and Alternatives

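There are a number of special patterns that match more than one character. Here are four useful tools:

  1. \d matches any digit.
  2. \s matches any whitespace (e.g., space, tab, newline).
  3. [abc] matches a, b, or c.
  4. [^abc] matches anything except a, b, or c.

To create a regular expression containing \d or \s, we need to escape the \ for the string, so we type "\\d" or "\\s". We can also use alternation (|) to pick between one or more alternative patterns.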

str_view(c("grey", "gray"), "gr(e|a)y")
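A short sketch of the character classes in use, assuming stringr is installed. The strings below are illustrative and not taken from the lesson.

```r
library(stringr)

# Illustrative string (not from the lesson).
room <- "Room 12, Floor 3"

digits   <- str_extract_all(room, "\\d+")[[1]]  # runs of digits: "12" "3"
n_spaces <- str_count(room, "\\s")              # whitespace characters: 3

spellings <- c("grey", "gray", "groy")

# A character class is an alternative to gr(e|a)y for single characters.
in_class  <- str_detect(spellings, "gr[ea]y")   # TRUE  TRUE  FALSE

# A negated class matches any character except e or a.
not_class <- str_detect(spellings, "gr[^ea]y")  # FALSE FALSE TRUE
```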

Repetition - 1

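We can control how many times a pattern matches: ? matches 0 or 1 time, + matches 1 or more times, and * matches 0 or more times.

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")

Try:

  1. str_view(x, "CC+")
  2. str_view(x, "CC*")
  3. str_view(x, "C[LX]+")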
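To check the repetition patterns outside the viewer, str_extract() returns the text of the first match. This is a minimal sketch assuming stringr is installed; the names on the left-hand side are illustrative.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

zero_or_one  <- str_extract(x, "CC?")     # C followed by at most one more C
one_or_more  <- str_extract(x, "CC+")     # C followed by one or more C
zero_or_more <- str_extract(x, "CC*")     # C followed by any number of C
class_repeat <- str_extract(x, "C[LX]+")  # C followed by one or more L or X

c(zero_or_one, one_or_more, zero_or_more, class_repeat)
```

For this string the four results are "CC", "CCC", "CCC", and "CLXXX".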

Repetition - 2

We can also specify the number of matches precisely.

str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")

By default these matches are “greedy”: they match the longest string possible. We can make them “lazy”, matching the shortest string possible, by putting a ? after them.

str_view(x, "C{2,3}?")
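The difference between greedy and lazy matching can be confirmed with str_extract(). This sketch assumes stringr is installed and reuses the Roman numeral string; the result names are illustrative.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

exactly_two  <- str_extract(x, "C{2}")     # exactly two C
two_or_more  <- str_extract(x, "C{2,}")    # greedy: takes all three C
greedy_range <- str_extract(x, "C{2,3}")   # greedy: takes three C
lazy_range   <- str_extract(x, "C{2,3}?")  # lazy: stops after two C
lazy_class   <- str_extract(x, "C[LX]+?")  # lazy: stops after the first L

c(exactly_two, two_or_more, greedy_range, lazy_range, lazy_class)
```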

Detect Matches

We can use the str_detect() function to check whether each element of a character vector matches a pattern. It returns a logical vector of the same length as the input.

x <- c("Jazz Chisholm Jr", "Cedric Mullins", "Alex Colome", "LaMonte Wade Jr", "AJrich")
str_detect(x, "Jr")
## [1]  TRUE FALSE FALSE  TRUE  TRUE
x[str_detect(x, "Jr")]
## [1] "Jazz Chisholm Jr" "LaMonte Wade Jr"  "AJrich"
x[str_detect(x, "Jr$")]
## [1] "Jazz Chisholm Jr" "LaMonte Wade Jr"
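Since str_detect() returns a logical vector, its result can be summarized with sum() and mean(). The sketch below assumes stringr is installed and reuses the player names from the example above; the word-boundary pattern \\b is an addition for illustration.

```r
library(stringr)

x <- c("Jazz Chisholm Jr", "Cedric Mullins", "Alex Colome", "LaMonte Wade Jr", "AJrich")

n_with_jr    <- sum(str_detect(x, "Jr"))     # how many elements contain "Jr"
share_suffix <- mean(str_detect(x, "Jr$"))   # proportion ending in "Jr"

# str_subset() is a shortcut for x[str_detect(x, pattern)].
suffix_names <- str_subset(x, "Jr$")

# \\b marks a word boundary, so "AJrich" no longer matches.
whole_word <- str_detect(x, "\\bJr\\b")
```

Here n_with_jr is 3 and share_suffix is 0.4, and whole_word is TRUE only for the two names with the Jr suffix.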

Extract Matches - 1

We can use the str_extract() function to extract the actual text of a match. The following example uses the Harvard sentences, which were designed to test VoIP systems. They are provided in the stringr package as sentences.

length(sentences)
## [1] 720
head(sentences, n = 4)
## [1] "The birch canoe slid on the smooth planks." 
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."     
## [4] "These days a chicken leg is a rare dish."

Suppose that we want to find all sentences that contain a color.

colors <- c("red", "orange", "yellow", "green", "purple", "blue")
color_match <- str_c(colors, collapse = "|")
color_match
## [1] "red|orange|yellow|green|purple|blue"
has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)
head(matches)
## [1] "blue" "blue" "red"  "red"  "red"  "blue"

Extract Matches - 2

The str_extract() function only extracts the first match. We can use the str_extract_all() function to get all matches.

more <- sentences[str_count(sentences, color_match) > 1]
more
## [1] "It is hard to erase blue or red ink."          
## [2] "The green light in the brown box flickered."   
## [3] "The sky in the west is tinged with orange red."
str_extract_all(more, color_match)
## [[1]]
## [1] "blue" "red" 
## 
## [[2]]
## [1] "green" "red"  
## 
## [[3]]
## [1] "orange" "red"
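In the second sentence above, "red" is extracted from inside the word "flickered". One possible refinement, sketched here as an assumption rather than as part of the lesson, is to wrap the alternatives in word boundaries (\\b). The three sentences are copied from the output above so that the example does not depend on the full data set.

```r
library(stringr)

colors <- c("red", "orange", "yellow", "green", "purple", "blue")

# Hypothetical refinement: only match colors that appear as whole words.
bounded_match <- str_c("\\b(", str_c(colors, collapse = "|"), ")\\b")

more <- c("It is hard to erase blue or red ink.",
          "The green light in the brown box flickered.",
          "The sky in the west is tinged with orange red.")

whole_words <- str_extract_all(more, bounded_match)
whole_words
```

With the boundaries in place, the second sentence yields only "green".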

If we set simplify = TRUE, str_extract_all() returns a matrix, with short matches expanded to the same length as the longest.

str_extract_all(more, color_match, simplify = TRUE)
##      [,1]     [,2] 
## [1,] "blue"   "red"
## [2,] "green"  "red"
## [3,] "orange" "red"
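In the matrix above every row has two matches, so no padding is visible. The following sketch, assuming stringr is installed and using an illustrative vector, shows how shorter rows are filled with empty strings.

```r
library(stringr)

# Illustrative vector with one, two, and three matches (not from the lesson).
letters_vec <- c("a", "a b", "a b c")

padded <- str_extract_all(letters_vec, "[a-z]", simplify = TRUE)
padded
#      [,1] [,2] [,3]
# [1,] "a"  ""   ""
# [2,] "a"  "b"  ""
# [3,] "a"  "b"  "c"
```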

Replacing Matches

The str_replace() and str_replace_all() functions allow us to replace matches with new strings.

languages <- c("python", "java,", "php", "javascript", "objective-c", "ruby",
               "perl", " sql", "kotlin", " r,", "matlab", " c#", " c++ ", "c++,",
               "c++/", " c,", " c ", "c/")

languages <- toupper(languages)
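# Entries 12 to 18 are the C-like names. Used as a regular expression, the "+"
# in "C++" is a quantifier rather than a literal plus sign, so " C++ ", "C++,"
# and "C++/" are left unchanged in the first attempt below.
all_c1 <- str_c(languages[12:18], collapse = "|")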
str_replace(languages, all_c1, "C")
##  [1] "PYTHON"      "JAVA,"       "PHP"         "JAVASCRIPT"  "OBJECTIVE-C"
##  [6] "RUBY"        "PERL"        " SQL"        "KOTLIN"      " R,"        
## [11] "MATLAB"      "C"           " C++ "       "C++,"        "C++/"       
## [16] "C"           "C"           "C"
# Escape each + as \\+ so that it matches a literal plus sign.
all_c2 <- str_c(c("C\\+\\+ ", "C\\+\\+,", "C\\+\\+/"), collapse = "|")
all_c <- str_c(c(all_c1, all_c2), collapse = "|")
str_replace(languages, all_c, "C")
##  [1] "PYTHON"      "JAVA,"       "PHP"         "JAVASCRIPT"  "OBJECTIVE-C"
##  [6] "RUBY"        "PERL"        " SQL"        "KOTLIN"      " R,"        
## [11] "MATLAB"      "C"           " C"          "C"           "C"          
## [16] "C"           "C"           "C"
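Two details of the replacement functions can be sketched briefly, assuming stringr is installed: str_replace() changes only the first match in each string while str_replace_all() changes every match, and fixed() offers an alternative to escaping the + signs by hand. The vectors below are illustrative.

```r
library(stringr)

# Illustrative vector (not from the lesson).
fruit_names <- c("apple", "pear", "banana")

first_only <- str_replace(fruit_names, "[aeiou]", "-")      # first vowel only
every_one  <- str_replace_all(fruit_names, "[aeiou]", "-")  # every vowel

# fixed() treats "C++" as literal text, so the + signs need no escaping.
literal_plus <- str_replace(c(" C++ ", "C++,"), fixed("C++"), "C")

first_only    # "-pple"  "p-ar"   "b-nana"
every_one     # "-ppl-"  "p--r"   "b-n-n-"
literal_plus  # " C "    "C,"
```

Note that fixed("C++") replaces only the text C++ itself, so surrounding spaces and punctuation remain and would need a separate cleaning step.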

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.
Xie, Yihui, Joseph J Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. CRC Press.