library(tidyverse)
Working with text using stringr
A few basics
What is a string?
- the data type we use to represent text
- written inside quotation marks: " "
Examples of strings:
- "Hello world"
- "5678"
- "blah blah blah"
NOT a string:
- 5678
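To make the difference concrete, here is a small sketch using base R's class() and is.character(). The values are the same ones listed above; nothing here depends on stringr.

```r
# Quotation marks are what make a value a string (R calls this class "character").
string_class <- class("5678")
number_class <- class(5678)

string_class
# [1] "character"
number_class
# [1] "numeric"

# is.character() answers the question "is this a string?" directly.
is.character("5678")
# [1] TRUE
is.character(5678)
# [1] FALSE
```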
Using stringr
stringr is a package containing a bunch of functions that help us work with strings. We’ll discuss how to detect, remove, extract, and count words/characters/phrases in a string. We’ll also talk about how to slice a string to get only the parts (aka the substrings) that you want.
stringr is part of the tidyverse package, so library(tidyverse) loads it for us.
I’m registering for classes this Spring and am trying to decide what to take. Let’s look at the course catalog!
Read in the courses data.
courses <- read_csv("data/courses.csv")
str_detect
inputs:
- string
- pattern
output:
- TRUE/FALSE
little example:
str_detect("Welcome to data science, look at this cool data", "data")
[1] TRUE
str_detect("Welcome to data science, look at this cool data", "pineapple")
[1] FALSE
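str_detect also works on a whole vector of strings at once, which is what makes it useful inside filter. The sketch below uses a made-up locations vector (an assumption for illustration, not values from courses.csv) to show that you get one answer per element.

```r
library(stringr)

# Hypothetical locations, invented for this example (not from courses.csv).
locations <- c("Warner Hall 101 WNS 101", "Axinn Center 229 AXN 229", NA)

# str_detect() checks each element separately: one TRUE/FALSE per string.
# A missing string (NA) gives NA, because there is nothing to search.
in_warner <- str_detect(locations, "WNS")
in_warner
# [1]  TRUE FALSE    NA

# Putting ! in front flips each answer, which is how you keep the non-matches.
not_in_warner <- !in_warner
not_in_warner
# [1] FALSE  TRUE    NA
```

Inside filter, rows where the answer is NA are dropped, just like rows where it is FALSE.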
I only want to take classes in Warner!
courses %>%
  filter(str_detect(location, "WNS"))
# A tibble: 48 × 9
titles distros department time location professor description courseNum
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Beginning … LNG Chinese 8:40… "Warner… Hang Du … "\nThis co… CHNS0101…
2 Beginning … LNG Chinese 9:45… "Warner… Hang Du … "\nThis co… CHNS0101…
3 Economic S… DED Economics 12:4… "Warner… Erick Go… "\nAn intr… ECON0111…
4 Introducto… SOC Economics 9:45… "Warner… David Mu… "\nAn intr… ECON0150…
5 Introducto… SOC Economics 11:1… "Warner… David Mu… "\nAn intr… ECON0150…
6 Introducto… SOC Economics 8:15… "Warner… Cihan Ar… "\nAn intr… ECON0150…
7 Introducto… SOC Economics 2:15… "Warner… <NA> "\nAn intr… ECON0150…
8 Introducto… SOC Economics 2:15… "Warner… Phani Wu… "\nAn intr… ECON0155…
9 Introducti… DED Economics 9:45… "Warner… German R… "\nIn this… ECON0211…
10 Introducti… DED Economics 11:1… "Warner… German R… "\nIn this… ECON0211…
# ℹ 38 more rows
# ℹ 1 more variable: meet <chr>
Suppose I don’t want any classes on Friday. Let’s use str_detect to find our options.
notFriday <- courses %>%
  filter(!str_detect(meet, "Friday"))
Perhaps I’m interested in immigration.
The regex function tells stringr to treat a pattern as a regular expression and lets you set options for how it is matched. Regular expressions are helpful if you want to search for a pattern rather than a specific word or phrase.
For now, we will only use regex to ignore capitalization.
If you’re interested in using regular expressions at some point, this regex cheat sheet will be super helpful.
immigrationclasses <- courses %>%
  filter(str_detect(description, regex("immigration", ignore_case = TRUE)))
immigrationclasses
# A tibble: 7 × 9
titles distros department time location professor description courseNum meet
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Immig… AMR HIS Program i… 11:1… "Axinn … Rachael … "\nIn this… AMST0175… "Tue…
2 Globa… AMR NOR Economics 12:4… "Axinn … Erin Wol… "\nDoes gl… ECON0420… "Tue…
3 Intro… CMP Internati… 11:1… "Twilig… Amit Pra… "\nThis is… IGST0101… "Tue…
4 An In… EUR LN… Italian 9:45… "Atwate… Thomas V… "\nIntende… ITAL0251… "Fri…
5 An In… EUR LN… Italian 11:1… "Atwate… Pat Zupan "\nIntende… ITAL0251… "Fri…
6 Globa… SOC Political… 9:45… "Hillcr… Orion Le… "\nHow doe… PSCI0314… "Tue…
7 Chris… AMR HI… Religion 1:30… "Munroe… James Ca… "\nReligio… RELI0398… "Wed…
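To see what ignore_case = TRUE changes, here is a small sketch comparing the two kinds of match. The descriptions vector is made up for illustration and does not come from the course catalog.

```r
library(stringr)

# Hypothetical course descriptions, invented for this example.
descriptions <- c(
  "Immigration and the city",
  "A history of immigration law",
  "Linear algebra"
)

# By default the match is case sensitive, so "Immigration" is missed.
case_sensitive <- str_detect(descriptions, "immigration")
case_sensitive
# [1] FALSE  TRUE FALSE

# Wrapping the pattern in regex(..., ignore_case = TRUE) matches either capitalization.
case_insensitive <- str_detect(descriptions, regex("immigration", ignore_case = TRUE))
case_insensitive
# [1]  TRUE  TRUE FALSE
```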
str_extract and str_remove
str_extract inputs:
- string
- pattern
str_extract output:
- the first match of the pattern, if it appears in the string (NA otherwise)
str_remove inputs:
- string
- pattern
str_remove output:
- the string without the first match of the pattern (unchanged if the pattern does not appear)
little example:
str_extract("Welcome to data science, look at this cool data", "data")
[1] "data"
str_extract_all("Welcome to data science, look at this cool data", "data")
[[1]]
[1] "data" "data"
str_remove("Welcome to data science, look at this cool data", "data")
[1] "Welcome to  science, look at this cool data"
str_remove_all("Welcome to data science, look at this cool data", "data")
[1] "Welcome to  science, look at this cool "
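It is also worth knowing what happens when the pattern is not in the string at all. A minimal sketch, reusing the same sentence with a pattern that never appears:

```r
library(stringr)

sentence <- "Welcome to data science, look at this cool data"

# When the pattern is missing, str_extract() has nothing to return, so it gives NA.
extracted <- str_extract(sentence, "pineapple")
extracted
# [1] NA

# str_remove() has nothing to remove, so it returns the string unchanged.
removed <- str_remove(sentence, "pineapple")
removed
# [1] "Welcome to data science, look at this cool data"
```

This is why a column built with str_extract contains NA for every row where the pattern was not found.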
CW is part of the distribution requirement column. I want CW to be its own column.
courses %>%
  mutate(CW = str_extract(distros, "CW")) %>%
  mutate(distros = str_remove(distros, "CW"))
# A tibble: 568 × 10
titles distros department time location professor description courseNum
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Cultural C… <NA> Program i… 2:15… "Wright… Olga San… "\nIn this… AMST0121…
2 Immigrant … AMR HIS Program i… 11:1… "Axinn … Rachael … "\nIn this… AMST0175…
3 Introducti… AMR HI… Program i… 7:30… "Axinn … Roberto … "\nIn this… AMST0213…
4 See the U.… AMR H… Program i… 11:1… "Axinn … Deb Evans "\nIn this… AMST0231…
5 Science Fi… LIT Program i… 2:15… "Axinn … Michael … "\nTime tr… AMST0253…
6 Music and … CMP LIT Program i… 9:45… "Axinn … William … "\nAlthoug… AMST0257…
7 <NA> AMR AR… Program i… 9:45… "Axinn … Ellery F… <NA> AMST0273…
8 Viewer Dis… AMR AR… Program i… 2:15… "Axinn … Ellery F… "\nWhat ar… AMST0281…
9 Posthuman … LIT Program i… 11:1… "Axinn … Michael … "\nMedical… AMST0287…
10 Humanitari… AMR Program i… 12:4… "Axinn … Rachael … "\nThis pu… AMST0343…
# ℹ 558 more rows
# ℹ 2 more variables: meet <chr>, CW <chr>
str_sub
str_sub inputs:
- string
- position of the starting character
- position of the ending character
str_sub output:
- string with only the characters from the start through the end
little example:
str_sub("Welcome to data science, look at this cool data", start=12, end=23)
[1] "data science"
Bounds are inclusive!
Maybe I only want 200 level math classes.
- First we filter for just math classes.
- Then we can create a new column called level that contains only the sixth character from the courseNum column.
We call this a substring, hence the function str_sub.
MathClasses <- courses %>%
  filter(department == "Mathematics") %>%
  mutate(level = str_sub(courseNum, start = 6, end = 6))
Math2Classes <- MathClasses %>%
  filter(level == "2")
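Here is the same idea on a tiny vector, so you can check the sixth character by eye. The course numbers are hypothetical and only assume the "four letters, four digits, section letter" shape seen in the courseNum column.

```r
library(stringr)

# Hypothetical course numbers, invented for this example.
course_nums <- c("MATH0116A", "MATH0200A", "MATH0223B")

# start = 6 and end = 6 keep just the sixth character of each element.
level <- str_sub(course_nums, start = 6, end = 6)
level
# [1] "1" "2" "2"

# The result is a string ("2"), so we compare it with a string.
two_hundred_level <- course_nums[level == "2"]
two_hundred_level
# [1] "MATH0200A" "MATH0223B"
```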
str_count
str_count inputs:
- string
- pattern
str_count output:
- a count of the number of times the pattern appears in the string
little example:
str_count("Welcome to data science, look at this cool data", "data")
[1] 2
Maybe I only want my classes to meet twice a week.
courses <- courses %>%
  mutate(dayCount = str_count(meet, "day"))
#what's the maximum number of days a week a class meets?
max(courses$dayCount)
[1] 6
#what's the mean number of days?
mean(courses$dayCount)
[1] 2.225352
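Counting "day" works because every weekday name contains it exactly once. As a rough sketch of how the count could then be used to keep only classes that meet twice a week, here is a small example with a made-up meet vector (these meeting patterns are assumptions, not rows from courses.csv):

```r
library(stringr)

# Hypothetical meeting patterns, invented for this example.
meet <- c("Monday, Wednesday, Friday", "Tuesday, Thursday", "Friday")

# Each weekday name contains "day" once, so the count is the number of meeting days.
day_count <- str_count(meet, "day")
day_count
# [1] 3 2 1

# Keep only the patterns that meet exactly twice a week.
twice_a_week <- meet[day_count == 2]
twice_a_week
# [1] "Tuesday, Thursday"
```

With the courses data, the same condition (dayCount == 2) could go inside filter.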
Let’s visualize this data.
courses %>%
  ggplot() +
  geom_bar(aes(x = dayCount %>% as.factor()), fill = "blue") +
  xlab("Number of Days Class Meets") +
  ylab("Number of Classes") +
  labs(title = "How many Days a Week do Classes at Middlebury Meet?") +
  theme_classic()
Another useful function: str_squish
str_squish removes leading and trailing whitespace from a string and collapses repeated interior whitespace into a single space.
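A short sketch of str_squish in action. The first string is made up; the second reuses the sentence from the earlier examples to show how str_squish can tidy up the gaps that removing a word leaves behind.

```r
library(stringr)

# Extra spaces at the start, at the end, and between words all get cleaned up.
squished <- str_squish("  Welcome   to data    science  ")
squished
# [1] "Welcome to data science"

# Removing a word leaves its surrounding spaces behind ...
leftover <- str_remove_all("Welcome to data science, look at this cool data", "data")
leftover
# [1] "Welcome to  science, look at this cool "

# ... and str_squish() tidies them.
tidy <- str_squish(leftover)
tidy
# [1] "Welcome to science, look at this cool"
```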