Working with text using stringr

Author

Violet Ross and Emily Malcolm-White

artwork by @allisonhorst

A few basics

What is a string?

  • a data type we use to represent text
  • written inside quotation marks (" ")

Examples of strings:

  • "Hello world"
  • "5678"
  • "blah blah blah"

NOT a string:

  • 5678
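To see the difference for yourself, here is a small sketch in base R. The variable names text_value and number_value are made up for illustration.

```r
# The same digits, once as a string and once as a number
text_value <- "5678"
number_value <- 5678

class(text_value)            # "character" (R's name for the string type)
class(number_value)          # "numeric"
is.character("Hello world")  # TRUE

number_value + 1             # 5679
# text_value + 1 would stop with an error: R can't do arithmetic on text
```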

Using stringr

stringr is a package containing a bunch of functions that help us work with strings. We’ll discuss how to detect, remove, extract, and count words/characters/phrases from a string. We’ll also talk about how to slice a string to get only the parts (aka the substrings) of it that you want.

stringr cheat sheet

stringr is part of the tidyverse, so loading the tidyverse package loads stringr too.

library(tidyverse)

I’m registering for classes this Spring and am trying to decide what to take. Let’s look at the course catalog!

Read in the courses data.

courses <- read_csv("data/courses.csv")

str_detect

artwork by @allisonhorst

inputs:

  • string
  • pattern

output:

  • TRUE/FALSE

little example:

str_detect("Welcome to data science, look at this cool data", "data")
[1] TRUE
str_detect("Welcome to data science, look at this cool data", "pineapple")
[1] FALSE
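str_detect also works on a whole vector of strings at once, which is why it fits inside filter. A minimal sketch, assuming a made-up locations vector (not values from the courses data):

```r
library(stringr)

# Made-up room names, including one missing value
locations <- c("Warner 101", "Axinn 229", NA)

# One TRUE/FALSE answer per element; a missing string gives NA
in_warner <- str_detect(locations, "Warner")
in_warner
# [1]  TRUE FALSE    NA
```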

I only want to take classes in Warner!

courses %>% 
  filter(str_detect(location, "WNS"))
# A tibble: 48 × 9
   titles      distros department time  location professor description courseNum
   <chr>       <chr>   <chr>      <chr> <chr>    <chr>     <chr>       <chr>    
 1 Beginning … LNG     Chinese    8:40… "Warner… Hang Du … "\nThis co… CHNS0101…
 2 Beginning … LNG     Chinese    9:45… "Warner… Hang Du … "\nThis co… CHNS0101…
 3 Economic S… DED     Economics  12:4… "Warner… Erick Go… "\nAn intr… ECON0111…
 4 Introducto… SOC     Economics  9:45… "Warner… David Mu… "\nAn intr… ECON0150…
 5 Introducto… SOC     Economics  11:1… "Warner… David Mu… "\nAn intr… ECON0150…
 6 Introducto… SOC     Economics  8:15… "Warner… Cihan Ar… "\nAn intr… ECON0150…
 7 Introducto… SOC     Economics  2:15… "Warner… <NA>      "\nAn intr… ECON0150…
 8 Introducto… SOC     Economics  2:15… "Warner… Phani Wu… "\nAn intr… ECON0155…
 9 Introducti… DED     Economics  9:45… "Warner… German R… "\nIn this… ECON0211…
10 Introducti… DED     Economics  11:1… "Warner… German R… "\nIn this… ECON0211…
# ℹ 38 more rows
# ℹ 1 more variable: meet <chr>

Suppose I don’t want any classes on Friday. Let’s use str_detect to find our options.

notFriday <- courses %>% 
  filter(!str_detect(meet, "Friday"))
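The ! in front of str_detect flips TRUE and FALSE, so we keep the rows that do not mention Friday. Here is a small sketch on a made-up tibble (the titles and meeting days are invented). It also shows the negate argument of str_detect, which should give the same result, and that filter drops rows where meet is missing.

```r
library(dplyr)
library(stringr)

# Made-up schedule, not rows from courses.csv
toy <- tibble(
  title = c("Stats", "Poetry", "Independent Study"),
  meet  = c("Monday, Wednesday", "Tuesday, Friday", NA)
)

# Option 1: negate with !
no_friday <- toy %>%
  filter(!str_detect(meet, "Friday"))

# Option 2: ask str_detect to negate for us
no_friday_too <- toy %>%
  filter(str_detect(meet, "Friday", negate = TRUE))

no_friday$title
# [1] "Stats"
# The NA row is gone in both versions, because filter only keeps TRUE
```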

Perhaps I’m interested in immigration.

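stringr already treats every pattern as a regular expression. Wrapping a pattern in the regex function lets us set extra matching options, such as ignoring capitalization. Regular expressions are helpful if you want to search for a pattern rather than a specific word or phrase.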

For now, we will only use regex to ignore capitalization.

If you’re interested in using regular expressions at some point, this regex cheat sheet will be super helpful.
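A quick sketch of why ignore_case matters, using a made-up sentence rather than a real course description:

```r
library(stringr)

# Made-up text that starts with a capital "I"
blurb <- "Immigration and the American city"

# Matching is case-sensitive by default
strict_match <- str_detect(blurb, "immigration")
strict_match
# [1] FALSE

# regex(..., ignore_case = TRUE) matches either capitalization
relaxed_match <- str_detect(blurb, regex("immigration", ignore_case = TRUE))
relaxed_match
# [1] TRUE
```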

immigrationclasses <- courses %>% 
  filter(str_detect(description, regex("immigration", ignore_case=TRUE)))

immigrationclasses
# A tibble: 7 × 9
  titles distros department time  location professor description courseNum meet 
  <chr>  <chr>   <chr>      <chr> <chr>    <chr>     <chr>       <chr>     <chr>
1 Immig… AMR HIS Program i… 11:1… "Axinn … Rachael … "\nIn this… AMST0175… "Tue…
2 Globa… AMR NOR Economics  12:4… "Axinn … Erin Wol… "\nDoes gl… ECON0420… "Tue…
3 Intro… CMP     Internati… 11:1… "Twilig… Amit Pra… "\nThis is… IGST0101… "Tue…
4 An In… EUR LN… Italian    9:45… "Atwate… Thomas V… "\nIntende… ITAL0251… "Fri…
5 An In… EUR LN… Italian    11:1… "Atwate… Pat Zupan "\nIntende… ITAL0251… "Fri…
6 Globa… SOC     Political… 9:45… "Hillcr… Orion Le… "\nHow doe… PSCI0314… "Tue…
7 Chris… AMR HI… Religion   1:30… "Munroe… James Ca… "\nReligio… RELI0398… "Wed…

str_extract and str_remove

str_extract inputs:

  • string
  • pattern

str_extract output:

  • the first match of the pattern, or NA if it does not appear in the string

str_remove inputs:

  • string
  • pattern

str_remove output:

  • the string with the first match of the pattern removed (unchanged if the pattern does not appear)

little example:

str_extract("Welcome to data science, look at this cool data", "data")
[1] "data"
str_extract_all("Welcome to data science, look at this cool data", "data")
[[1]]
[1] "data" "data"
str_remove("Welcome to data science, look at this cool data", "data")
[1] "Welcome to  science, look at this cool data"
str_remove_all("Welcome to data science, look at this cool data", "data")
[1] "Welcome to  science, look at this cool "
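It helps to know what happens when the pattern is missing. A minimal sketch on a made-up vector of distribution tags (not values from the courses data):

```r
library(stringr)

# Made-up distribution tags
tags <- c("CW LIT", "AMR HIS", "CW")

# No match gives NA
extracted <- str_extract(tags, "CW")
extracted
# [1] "CW" NA   "CW"

# No match leaves the string unchanged; note the leftover space in " LIT"
removed <- str_remove(tags, "CW")
removed
# [1] " LIT"    "AMR HIS" ""
```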

CW is part of the distribution requirement column. I want CW to be its own column.

courses %>% 
  mutate(CW = str_extract(distros, "CW")) %>% 
  mutate(distros = str_remove(distros, "CW"))
# A tibble: 568 × 10
   titles      distros department time  location professor description courseNum
   <chr>       <chr>   <chr>      <chr> <chr>    <chr>     <chr>       <chr>    
 1 Cultural C… <NA>    Program i… 2:15… "Wright… Olga San… "\nIn this… AMST0121…
 2 Immigrant … AMR HIS Program i… 11:1… "Axinn … Rachael … "\nIn this… AMST0175…
 3 Introducti… AMR HI… Program i… 7:30… "Axinn … Roberto … "\nIn this… AMST0213…
 4 See the U.… AMR  H… Program i… 11:1… "Axinn … Deb Evans "\nIn this… AMST0231…
 5 Science Fi… LIT     Program i… 2:15… "Axinn … Michael … "\nTime tr… AMST0253…
 6 Music and … CMP LIT Program i… 9:45… "Axinn … William … "\nAlthoug… AMST0257…
 7 <NA>        AMR AR… Program i… 9:45… "Axinn … Ellery F…  <NA>       AMST0273…
 8 Viewer Dis… AMR AR… Program i… 2:15… "Axinn … Ellery F… "\nWhat ar… AMST0281…
 9 Posthuman … LIT     Program i… 11:1… "Axinn … Michael … "\nMedical… AMST0287…
10 Humanitari… AMR     Program i… 12:4… "Axinn … Rachael … "\nThis pu… AMST0343…
# ℹ 558 more rows
# ℹ 2 more variables: meet <chr>, CW <chr>
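Two possible refinements, sketched on made-up data: the CW column could hold TRUE/FALSE instead of "CW"/NA by using str_detect, and str_squish can tidy the extra spaces that str_remove leaves behind. This is one option, not the only way to do it.

```r
library(dplyr)
library(stringr)

# Made-up distribution tags, not rows from courses.csv
toy <- tibble(distros = c("CW LIT", "AMR CW HIS", "SOC"))

toy_clean <- toy %>%
  mutate(
    CW = str_detect(distros, "CW"),                  # TRUE/FALSE flag
    distros = str_squish(str_remove(distros, "CW"))  # drop "CW", tidy spaces
  )

toy_clean$CW
# [1]  TRUE  TRUE FALSE
toy_clean$distros
# [1] "LIT"     "AMR HIS" "SOC"
```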

str_sub

str_sub inputs:

  • string
  • starting character
  • ending character

str_sub output:

  • string with only the characters between the start and the end

little example:

str_sub("Welcome to data science, look at this cool data", start=12, end=23) 
[1] "data science"

Bounds are inclusive!
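A few more slices to make the inclusive bounds concrete. The course code "MATH0200" is a made-up example; the real courseNum values may be longer.

```r
library(stringr)

# Made-up course code
code <- "MATH0200"

dept <- str_sub(code, start = 1, end = 4)
dept
# [1] "MATH"

# start and end can be the same position: that gives one character
level_digit <- str_sub(code, start = 6, end = 6)
level_digit
# [1] "2"

# Negative positions count from the end of the string
last_four <- str_sub(code, start = -4, end = -1)
last_four
# [1] "0200"
```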

Maybe I only want 200 level math classes.

  • First we filter for just math classes.
  • Then we can create a new column called level that contains only the sixth character from the courseNum column.

We call this a substring, hence the function str_sub.

MathClasses <- courses %>% 
  filter(department == "Mathematics") %>% 
  mutate(level=str_sub(courseNum, start=6, end=6)) 

Math2Classes <- MathClasses %>% 
  filter(level== "2")

str_count

str_count inputs:

  • string
  • pattern

str_count output:

  • a count of the number of times the pattern appears in the string

little example:

str_count("Welcome to data science, look at this cool data", "data")
[1] 2

Maybe I only want my classes to meet twice a week.

courses <- courses %>% 
  mutate(dayCount = str_count(meet, "day"))

#what's the maximum number of days a week a class meets?
max(courses$dayCount)
[1] 6
#what's the mean number of days?
mean(courses$dayCount)
[1] 2.225352
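To actually pick out the classes that meet twice a week, we can filter on the count. A minimal sketch on a made-up tibble (the titles and meeting days are invented):

```r
library(dplyr)
library(stringr)

# Made-up schedule, not rows from courses.csv
toy <- tibble(
  title = c("Stats", "Chemistry", "Seminar"),
  meet  = c("Monday, Wednesday", "Tuesday, Thursday, Friday", "Friday")
)

# Count the day names in each row, then keep the rows with exactly two
twice_a_week <- toy %>%
  mutate(dayCount = str_count(meet, "day")) %>%
  filter(dayCount == 2)

twice_a_week$title
# [1] "Stats"
```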

Let’s visualize this data.

courses %>% 
  ggplot() + 
  geom_bar(aes(x=dayCount %>% as.factor()), fill="blue") + 
  xlab("Number of Days Class Meets") + 
  ylab("Number of Classes") + 
  labs(title="How many Days a Week do Classes at Middlebury Meet?")+
  theme_classic()

Another useful function: str_squish

str_squish removes leading and trailing whitespace from a string and collapses repeated interior whitespace into a single space.
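little example (a sketch with a made-up messy string, compared against str_trim, which only cleans the two ends):

```r
library(stringr)

# Made-up string with extra spaces at the ends and in the middle
messy <- "  Welcome   to  data science  "

squished <- str_squish(messy)
squished
# [1] "Welcome to data science"

# str_trim leaves the interior spaces alone
trimmed <- str_trim(messy)
trimmed
# [1] "Welcome   to  data science"
```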

artwork by @allisonhorst