library(tidyverse)
Working with text using stringr
A few basics
What is a string?
- the data type we use to represent text
- written inside quotation marks (" ")
Examples of strings:
- “Hello world”
- “5678”
- “blah blah blah”
NOT a string:
- 5678 (no quotation marks, so R treats it as a number)
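A quick way to see the difference is to ask R what type each value is. This is a small sketch using base R's class() and is.character(); the values are just the examples from above.

```r
# Quotes make the difference: "5678" is text, 5678 is a number.
class("5678")          # "character"
class(5678)            # "numeric"
is.character("5678")   # TRUE
is.character(5678)     # FALSE
```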
Using stringr
stringr is a package containing a bunch of functions that help us work with strings. We’ll discuss how to detect, remove, extract, and count words/characters/phrases from a string. We’ll also talk about how to slice a string to get only the parts (aka the substrings) of it that you want.
stringr is part of the tidyverse, so library(tidyverse) loads it for us.
I’m registering for classes this Spring and am trying to decide what to take. Let’s look at the course catalog!
Read in the courses data.
courses <- read_csv("data/courses.csv")
str_detect
inputs:
- string
- pattern
output:
- TRUE/FALSE
little example:
str_detect("Welcome to data science, look at this cool data", "data")
[1] TRUE
str_detect("Welcome to data science, look at this cool data", "pineapple")
[1] FALSE
I only want to take classes in Warner!
courses %>% 
  filter(str_detect(location, "WNS"))
# A tibble: 48 × 9
   titles      distros department time  location professor description courseNum
   <chr>       <chr>   <chr>      <chr> <chr>    <chr>     <chr>       <chr>    
 1 Beginning … LNG     Chinese    8:40… "Warner… Hang Du … "\nThis co… CHNS0101…
 2 Beginning … LNG     Chinese    9:45… "Warner… Hang Du … "\nThis co… CHNS0101…
 3 Economic S… DED     Economics  12:4… "Warner… Erick Go… "\nAn intr… ECON0111…
 4 Introducto… SOC     Economics  9:45… "Warner… David Mu… "\nAn intr… ECON0150…
 5 Introducto… SOC     Economics  11:1… "Warner… David Mu… "\nAn intr… ECON0150…
 6 Introducto… SOC     Economics  8:15… "Warner… Cihan Ar… "\nAn intr… ECON0150…
 7 Introducto… SOC     Economics  2:15… "Warner… <NA>      "\nAn intr… ECON0150…
 8 Introducto… SOC     Economics  2:15… "Warner… Phani Wu… "\nAn intr… ECON0155…
 9 Introducti… DED     Economics  9:45… "Warner… German R… "\nIn this… ECON0211…
10 Introducti… DED     Economics  11:1… "Warner… German R… "\nIn this… ECON0211…
# ℹ 38 more rows
# ℹ 1 more variable: meet <chr>
Suppose I don’t want any classes on Friday. Let’s use str_detect to find our options.
notFriday <- courses %>% 
  filter(!str_detect(meet, "Friday"))
Perhaps I’m interested in immigration.
In stringr, patterns are treated as regular expressions by default; wrapping a pattern in the regex function lets you set options for how it is matched. Regular expressions are helpful if you want to search for a pattern rather than a specific word or phrase.
For now, we will only use regex to ignore capitalization.
If you’re interested in using regular expressions at some point, this regex cheat sheet will be super helpful.
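Before applying this to the course data, here is a small sketch of what ignore_case changes. The descriptions vector is made up for illustration and is not part of the courses data.

```r
library(stringr)

# Hypothetical course descriptions, made up for illustration
descriptions <- c("Immigration policy in the U.S.",
                  "A history of immigration law",
                  "Bird migration patterns")

# Plain pattern: matching is case sensitive
case_sensitive <- str_detect(descriptions, "immigration")
case_sensitive    # FALSE  TRUE FALSE

# regex() with ignore_case = TRUE matches either capitalization
case_insensitive <- str_detect(descriptions,
                               regex("immigration", ignore_case = TRUE))
case_insensitive  #  TRUE  TRUE FALSE
```

The third description is never matched because "migration" does not contain the full word "immigration".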
immigrationclasses <- courses %>% 
  filter(str_detect(description, regex("immigration", ignore_case=TRUE)))
immigrationclasses
# A tibble: 7 × 9
  titles distros department time  location professor description courseNum meet 
  <chr>  <chr>   <chr>      <chr> <chr>    <chr>     <chr>       <chr>     <chr>
1 Immig… AMR HIS Program i… 11:1… "Axinn … Rachael … "\nIn this… AMST0175… "Tue…
2 Globa… AMR NOR Economics  12:4… "Axinn … Erin Wol… "\nDoes gl… ECON0420… "Tue…
3 Intro… CMP     Internati… 11:1… "Twilig… Amit Pra… "\nThis is… IGST0101… "Tue…
4 An In… EUR LN… Italian    9:45… "Atwate… Thomas V… "\nIntende… ITAL0251… "Fri…
5 An In… EUR LN… Italian    11:1… "Atwate… Pat Zupan "\nIntende… ITAL0251… "Fri…
6 Globa… SOC     Political… 9:45… "Hillcr… Orion Le… "\nHow doe… PSCI0314… "Tue…
7 Chris… AMR HI… Religion   1:30… "Munroe… James Ca… "\nReligio… RELI0398… "Wed…
str_extract and str_remove
str_extract inputs:
- string
- pattern
str_extract output:
- the extracted pattern, if it appears in the string (NA if it does not)
str_remove inputs:
- string
- pattern
str_remove output:
- the string without the pattern, if it appears in the string
Both functions act on the first match only; str_extract_all and str_remove_all act on every match.
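It helps to see what these two functions return when the pattern is missing or the input itself is missing. A minimal sketch, using a made-up vector of distribution codes rather than the real courses data:

```r
library(stringr)

# Hypothetical distribution codes, made up for illustration
distros <- c("AMR CW", "LIT", NA)

# str_extract returns NA where the pattern is not found
extracted <- str_extract(distros, "CW")
extracted   # "CW" NA   NA

# str_remove leaves strings without the pattern unchanged
removed <- str_remove(distros, "CW")
removed     # "AMR " "LIT"  NA
```

Notice the trailing space left behind in "AMR " after removing "CW".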
little example:
str_extract("Welcome to data science, look at this cool data", "data")
[1] "data"
str_extract_all("Welcome to data science, look at this cool data", "data")
[[1]]
[1] "data" "data"
str_remove("Welcome to data science, look at this cool data", "data")
[1] "Welcome to  science, look at this cool data"
str_remove_all("Welcome to data science, look at this cool data", "data")
[1] "Welcome to  science, look at this cool "
CW is part of the distribution requirement column. I want CW to be its own column.
courses %>% 
  mutate(CW = str_extract(distros, "CW")) %>% 
  mutate(distros = str_remove(distros, "CW"))
# A tibble: 568 × 10
   titles      distros department time  location professor description courseNum
   <chr>       <chr>   <chr>      <chr> <chr>    <chr>     <chr>       <chr>    
 1 Cultural C… <NA>    Program i… 2:15… "Wright… Olga San… "\nIn this… AMST0121…
 2 Immigrant … AMR HIS Program i… 11:1… "Axinn … Rachael … "\nIn this… AMST0175…
 3 Introducti… AMR HI… Program i… 7:30… "Axinn … Roberto … "\nIn this… AMST0213…
 4 See the U.… AMR  H… Program i… 11:1… "Axinn … Deb Evans "\nIn this… AMST0231…
 5 Science Fi… LIT     Program i… 2:15… "Axinn … Michael … "\nTime tr… AMST0253…
 6 Music and … CMP LIT Program i… 9:45… "Axinn … William … "\nAlthoug… AMST0257…
 7 <NA>        AMR AR… Program i… 9:45… "Axinn … Ellery F…  <NA>       AMST0273…
 8 Viewer Dis… AMR AR… Program i… 2:15… "Axinn … Ellery F… "\nWhat ar… AMST0281…
 9 Posthuman … LIT     Program i… 11:1… "Axinn … Michael … "\nMedical… AMST0287…
10 Humanitari… AMR     Program i… 12:4… "Axinn … Rachael … "\nThis pu… AMST0343…
# ℹ 558 more rows
# ℹ 2 more variables: meet <chr>, CW <chr>
str_sub
str_sub inputs:
- string
- starting character
- ending character
str_sub output:
- string with only the characters between the start and the end
little example:
str_sub("Welcome to data science, look at this cool data", start=12, end=23)
[1] "data science"
Bounds are inclusive!
Maybe I only want 200 level math classes.
- First we filter for just math classes.
- Then we can create a new column called level that contains only the sixth character from the courseNum column.
We call this a substring, hence the function str_sub.
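To see why the sixth character gives the level, here is a small sketch on a few made-up course numbers. It assumes a layout of four letters, four digits, and a section letter; the actual values in courseNum may differ.

```r
library(stringr)

# Hypothetical course numbers, assuming "4 letters + 4 digits + section"
course_nums <- c("MATH0200A", "MATH0116B", "MATH0223A")

# Character 6 is the hundreds digit of the course number
level <- str_sub(course_nums, start = 6, end = 6)
level     # "2" "1" "2"

# Negative positions count from the end of the string
section <- str_sub(course_nums, start = -1, end = -1)
section   # "A" "B" "A"
```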
MathClasses <- courses %>% 
  filter(department == "Mathematics") %>% 
  mutate(level=str_sub(courseNum, start=6, end=6)) 
Math2Classes <- MathClasses %>% 
  filter(level == "2")
str_count
str_count inputs:
- string
- pattern
str_count output:
- a count of the number of times the pattern appears in the string
little example:
str_count("Welcome to data science, look at this cool data", "data")
[1] 2
Maybe I only want my classes to meet twice a week.
courses <- courses %>% 
  mutate(dayCount = str_count(meet, "day"))
# What's the maximum number of days a week a class meets?
max(courses$dayCount)
[1] 6
# What's the mean number of days?
mean(courses$dayCount)
[1] 2.225352
Let’s visualize this data.
courses %>% 
  ggplot() + 
  geom_bar(aes(x=dayCount %>% as.factor()), fill="blue") + 
  xlab("Number of Days Class Meets") + 
  ylab("Number of Classes") + 
  labs(title="How many Days a Week do Classes at Middlebury Meet?")+
  theme_classic()
Another useful function: str_squish
str_squish removes leading and trailing whitespace from a string and collapses repeated interior whitespace into a single space.
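little example (a sketch reusing the sentence from earlier, with extra spaces added for illustration):

```r
library(stringr)

messy <- "  Welcome to   data science,  look at this cool data  "
squished <- str_squish(messy)
squished   # "Welcome to data science, look at this cool data"

# Removing a word can leave extra spaces behind; str_squish tidies them up
no_data <- str_remove_all("Welcome to data science, look at this cool data", "data")
tidy <- str_squish(no_data)
tidy       # "Welcome to science, look at this cool"
```

This pairs nicely with str_remove, which can leave a double space or a trailing space where the removed word used to be.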