Webscraping Text

Author: Emily Malcolm-White

Affiliation: Middlebury College

# Load packages
library(tidyverse)
library(rvest)


Let’s look at the science books shelf on Goodreads.

URL <- read_html("https://www.goodreads.com/shelf/show/science")

Notice that the data for all these books isn’t housed inside a <table> element!

Titles

For example, check out the lines of HTML code for the book “Sapiens”:

<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a title="Sapiens: A Brief History of Humankind" class="leftAlignedImage" href="/book/show/23692271-sapiens"><img alt="Sapiens: A Brief History of Humankind" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1703329310l/23692271._SY75_.jpg" /></a>
<a class="bookTitle" href="/book/show/23692271-sapiens">Sapiens: A Brief History of Humankind (Paperback)</a>
<br />
<span class='by'>by</span>
<span itemprop='author' itemscope='' itemtype='http://schema.org/Person'>
<div class='authorName__container'>
<a class="authorName" itemprop="url" href="https://www.goodreads.com/author/show/395812.Yuval_Noah_Harari"><span itemprop="name">Yuval Noah Harari</span></a>
</div>
</span>
(shelved 6592 times as <em>science</em>)
<br />
<span class="greyText smallText">
avg rating 4.35 —
1,168,542 ratings  —
published 2011
</span>
</div>


Each title sits in an <a> element with the class bookTitle, so we select on that class using html_elements(".bookTitle"). The leading dot tells the selector to match a class name.

Tip

In this case, we want ALL titles, so we use html_elements(). If we only wanted the first title, we would use html_element().

titles <- URL %>%
  html_elements(".bookTitle") %>%
  html_text()

titles
 [1] "A Short History of Nearly Everything (Paperback)"                                                                                               
 [2] "The Selfish Gene (Paperback)"                                                                                                                   
 [3] "Astrophysics for People in a Hurry (Hardcover)"                                                                                                 
 [4] "Sapiens: A Brief History of Humankind (Paperback)"                                                                                              
 [5] "Cosmos (Mass Market Paperback)"                                                                                                                 
 [6] "The Immortal Life of Henrietta Lacks (Hardcover)"                                                                                               
 [7] "What If?: Serious Scientific Answers to Absurd Hypothetical Questions (Hardcover)"                                                              
 [8] "The Origin of Species (Hardcover)"                                                                                                              
 [9] "The Elegant Universe: Superstrings, Hidden Dimensions, and the Quest for the Ultimate Theory (Paperback)"                                       
[10] "The Demon-Haunted World: Science as a Candle in the Dark (Paperback)"                                                                           
[11] "Stiff: The Curious Lives of Human Cadavers (Paperback)"                                                                                         
[12] "\"Surely You're Joking, Mr. Feynman!\": Adventures of a Curious Character (Paperback)"                                                          
[13] "Why We Sleep: Unlocking the Power of Sleep and Dreams (Hardcover)"                                                                              
[14] "The Gene: An Intimate History (Hardcover)"                                                                                                      
[15] "The Emperor of All Maladies: A Biography of Cancer (Hardcover)"                                                                                 
[16] "The Sixth Extinction: An Unnatural History (Hardcover)"                                                                                         
[17] "The Disappearing Spoon: And Other True Tales of Madness, Love, and the History of the World from the Periodic Table of the Elements (Hardcover)"
[18] "The Man Who Mistook His Wife for a Hat and Other Clinical Tales (Paperback)"                                                                    
[19] "Guns, Germs, and Steel: The Fates of Human Societies (Paperback)"                                                                               
[20] "The Grand Design (Hardcover)"                                                                                                                   
[21] "The Greatest Show on Earth: The Evidence for Evolution (Hardcover)"                                                                             
[22] "Seven Brief Lessons on Physics (Hardcover)"                                                                                                     
[23] "The Fabric of the Cosmos: Space, Time, and the Texture of Reality (Paperback)"                                                                  
[24] "Homo Deus: A History of Tomorrow (ebook)"                                                                                                       
[25] "Brief Answers to the Big Questions (Hardcover)"                                                                                                 
[26] "The God Delusion (Hardcover)"                                                                                                                   
[27] "The Structure of Scientific Revolutions (Paperback)"                                                                                            
[28] "I Contain Multitudes: The Microbes Within Us and a Grander View of Life (Hardcover)"                                                            
[29] "The Body: A Guide for Occupants (Hardcover)"                                                                                                    
[30] "Chaos: Making a New Science (Paperback)"                                                                                                        
[31] "Packing for Mars: The Curious Science of Life in the Void (Paperback)"                                                                          
[32] "Gödel, Escher, Bach: An Eternal Golden Braid (Paperback)"                                                                                       
[33] "Pale Blue Dot: A Vision of the Human Future in Space (Paperback)"                                                                               
[34] "The Universe in a Nutshell (Hardcover)"                                                                                                         
[35] "The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe Without Design (Paperback)"                                              
[36] "The Hidden Life of Trees: What They Feel, How They Communicate: Discoveries from a Secret World (Hardcover)"                                    
[37] "Behave: The Biology of Humans at Our Best and Worst (Hardcover)"                                                                                
[38] "Bad Science (Paperback)"                                                                                                                        
[39] "Physics of the Impossible (Hardcover)"                                                                                                          
[40] "The Rise and Fall of the Dinosaurs: A New History of a Lost World (Hardcover)"                                                                  
[41] "Entangled Life: How Fungi Make Our Worlds, Change Our Minds & Shape Our Futures (Hardcover)"                                                    
[42] "Death by Black Hole: And Other Cosmic Quandaries (Paperback)"                                                                                   
[43] "Genome: The Autobiography of a Species in 23 Chapters (Paperback)"                                                                              
[44] "The Order of Time (Hardcover)"                                                                                                                  
[45] "Six Easy Pieces: Essentials of Physics Explained by Its Most Brilliant Teacher (Paperback)"                                                     
[46] "A Universe from Nothing: Why There Is Something Rather Than Nothing (Hardcover)"                                                                
[47] "Your Inner Fish: a Journey into the 3.5-Billion-Year History of the Human Body (Hardcover)"                                                     
[48] "The Hot Zone: The Terrifying True Story of the Origins of the Ebola Virus (Paperback)"                                                          
[49] "Bonk: The Curious Coupling of Science and Sex (Paperback)"                                                                                      
[50] "Longitude: The True Story of a Lone Genius Who Solved the Greatest Scientific Problem of His Time (Hardcover)"                                  
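
The difference between html_elements() and html_element() described in the tip can be seen on a small example. The sketch below does not touch Goodreads: it builds a hypothetical two-book snippet with minimal_html() (the book names and links are made up) that mimics the bookTitle markup shown earlier.

```r
library(rvest)

# Hypothetical snippet modeled on the Goodreads markup (not real data)
page <- minimal_html('
  <div class="elementList"><a class="bookTitle" href="/book/1">Book One (Paperback)</a></div>
  <div class="elementList"><a class="bookTitle" href="/book/2">Book Two (Hardcover)</a></div>
')

# html_elements() returns every element that matches the selector
all_titles <- page %>%
  html_elements(".bookTitle") %>%
  html_text()

# html_element() returns only the first match
first_title <- page %>%
  html_element(".bookTitle") %>%
  html_text()

all_titles
first_title
```

Here all_titles holds both titles, while first_title holds only "Book One (Paperback)".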

Authors

Author names sit in elements with the class authorName, so we select on that class using html_elements(".authorName").

Tip

In this case, we want ALL author names, so we use html_elements(). If we only wanted the first author, we would use html_element().

authors <- URL %>%
  html_elements(".authorName") %>%
  html_text()

authors
 [1] "Bill Bryson"           "Richard Dawkins"       "Neil deGrasse Tyson"  
 [4] "Yuval Noah Harari"     "Carl Sagan"            "Rebecca Skloot"       
 [7] "Randall Munroe"        "Charles Darwin"        "Brian Greene"         
[10] "Carl Sagan"            "Mary Roach"            "Richard P. Feynman"   
[13] "(Contibutor)"          "Matthew Walker"        "Siddhartha Mukherjee" 
[16] "Siddhartha Mukherjee"  "Elizabeth Kolbert"     "Sam Kean"             
[19] "Oliver Sacks"          "Jared Diamond"         "Stephen Hawking"      
[22] "Richard Dawkins"       "Carlo Rovelli"         "Brian Greene"         
[25] "Yuval Noah Harari"     "Stephen Hawking"       "Richard Dawkins"      
[28] "Thomas S. Kuhn"        "Ed Yong"               "Bill Bryson"          
[31] "James Gleick"          "Mary Roach"            "Douglas R. Hofstadter"
[34] "Carl Sagan"            "Stephen Hawking"       "Richard Dawkins"      
[37] "Peter Wohlleben"       "Robert M. Sapolsky"    "Ben Goldacre"         
[40] "Michio Kaku"           "Steve Brusatte"        "Merlin Sheldrake"     
[43] "Neil deGrasse Tyson"   "Matt Ridley"           "Carlo Rovelli"        
[46] "Richard P. Feynman"    "Lawrence M. Krauss"    "Neil Shubin"          
[49] "Richard   Preston"     "Mary Roach"            "Dava Sobel"           

Other Info

<span class="greyText smallText">
avg rating 4.35 —
1,168,542 ratings  —
published 2011
</span>

The rating, number of ratings, and publication year all live in a single <span>, so we start by referencing the class greyText.

info <- URL %>%
  html_elements(".greyText") %>%
  html_text()

info
  [1] "\n                  avg rating 4.22 —\n                  407,535 ratings  —\n                  published 2003\n                "  
  [2] "Rate this book"                                                                                                                   
  [3] "(Goodreads Author)"                                                                                                               
  [4] "\n                  avg rating 4.16 —\n                  187,031 ratings  —\n                  published 1976\n                "  
  [5] "Rate this book"                                                                                                                   
  [6] "(Goodreads Author)"                                                                                                               
  [7] "\n                  avg rating 4.08 —\n                  195,986 ratings  —\n                  published 2017\n                "  
  [8] "Rate this book"                                                                                                                   
  [9] "\n                  avg rating 4.35 —\n                  1,168,557 ratings  —\n                  published 2011\n                "
 [10] "Rate this book"                                                                                                                   
 [11] "\n                  avg rating 4.40 —\n                  153,771 ratings  —\n                  published 1980\n                "  
 [12] "Rate this book"                                                                                                                   
 [13] "(Goodreads Author)"                                                                                                               
 [14] "\n                  avg rating 4.13 —\n                  775,348 ratings  —\n                  published 2010\n                "  
 [15] "Rate this book"                                                                                                                   
 [16] "(Goodreads Author)"                                                                                                               
 [17] "\n                  avg rating 4.14 —\n                  188,324 ratings  —\n                  published 2014\n                "  
 [18] "Rate this book"                                                                                                                   
 [19] "\n                  avg rating 4.01 —\n                  119,116 ratings  —\n                  published 1859\n                "  
 [20] "Rate this book"                                                                                                                   
 [21] "\n                  avg rating 4.10 —\n                  100,287 ratings  —\n                  published 1999\n                "  
 [22] "Rate this book"                                                                                                                   
 [23] "\n                  avg rating 4.29 —\n                  77,237 ratings  —\n                  published 1995\n                "   
 [24] "Rate this book"                                                                                                                   
 [25] "\n                  avg rating 4.06 —\n                  231,078 ratings  —\n                  published 2003\n                "  
 [26] "Rate this book"                                                                                                                   
 [27] "(Contibutor)"                                                                                                                     
 [28] "\n                  avg rating 4.27 —\n                  210,509 ratings  —\n                  published 1985\n                "  
 [29] "Rate this book"                                                                                                                   
 [30] "\n                  avg rating 4.38 —\n                  210,037 ratings  —\n                  published 2017\n                "  
 [31] "Rate this book"                                                                                                                   
 [32] "\n                  avg rating 4.36 —\n                  52,981 ratings  —\n                  published 2016\n                "   
 [33] "Rate this book"                                                                                                                   
 [34] "\n                  avg rating 4.34 —\n                  109,260 ratings  —\n                  published 2010\n                "  
 [35] "Rate this book"                                                                                                                   
 [36] "(Goodreads Author)"                                                                                                               
 [37] "\n                  avg rating 4.16 —\n                  76,010 ratings  —\n                  published 2014\n                "   
 [38] "Rate this book"                                                                                                                   
 [39] "\n                  avg rating 3.93 —\n                  54,009 ratings  —\n                  published 2010\n                "   
 [40] "Rate this book"                                                                                                                   
 [41] "\n                  avg rating 4.05 —\n                  236,212 ratings  —\n                  published 1985\n                "  
 [42] "Rate this book"                                                                                                                   
 [43] "\n                  avg rating 4.04 —\n                  438,255 ratings  —\n                  published 1997\n                "  
 [44] "Rate this book"                                                                                                                   
 [45] "\n                  avg rating 4.06 —\n                  75,945 ratings  —\n                  published 2010\n                "   
 [46] "Rate this book"                                                                                                                   
 [47] "(Goodreads Author)"                                                                                                               
 [48] "\n                  avg rating 4.16 —\n                  55,086 ratings  —\n                  published 2009\n                "   
 [49] "Rate this book"                                                                                                                   
 [50] "\n                  avg rating 3.99 —\n                  63,098 ratings  —\n                  published 2014\n                "   
 [51] "Rate this book"                                                                                                                   
 [52] "\n                  avg rating 4.13 —\n                  39,676 ratings  —\n                  published 2004\n                "   
 [53] "Rate this book"                                                                                                                   
 [54] "\n                  avg rating 4.19 —\n                  273,715 ratings  —\n                  published 2015\n                "  
 [55] "Rate this book"                                                                                                                   
 [56] "\n                  avg rating 4.27 —\n                  82,164 ratings  —\n                  published 2018\n                "   
 [57] "Rate this book"                                                                                                                   
 [58] "(Goodreads Author)"                                                                                                               
 [59] "\n                  avg rating 3.90 —\n                  277,060 ratings  —\n                  published 2006\n                "  
 [60] "Rate this book"                                                                                                                   
 [61] "\n                  avg rating 4.03 —\n                  28,670 ratings  —\n                  published 1962\n                "   
 [62] "Rate this book"                                                                                                                   
 [63] "\n                  avg rating 4.18 —\n                  29,489 ratings  —\n                  published 2016\n                "   
 [64] "Rate this book"                                                                                                                   
 [65] "\n                  avg rating 4.32 —\n                  92,736 ratings  —\n                  published 2019\n                "   
 [66] "Rate this book"                                                                                                                   
 [67] "(Goodreads Author)"                                                                                                               
 [68] "\n                  avg rating 4.04 —\n                  39,810 ratings  —\n                  published 1987\n                "   
 [69] "Rate this book"                                                                                                                   
 [70] "\n                  avg rating 3.95 —\n                  59,009 ratings  —\n                  published 2010\n                "   
 [71] "Rate this book"                                                                                                                   
 [72] "\n                  avg rating 4.29 —\n                  51,144 ratings  —\n                  published 1979\n                "   
 [73] "Rate this book"                                                                                                                   
 [74] "\n                  avg rating 4.34 —\n                  38,264 ratings  —\n                  published 1994\n                "   
 [75] "Rate this book"                                                                                                                   
 [76] "\n                  avg rating 4.19 —\n                  44,261 ratings  —\n                  published 2001\n                "   
 [77] "Rate this book"                                                                                                                   
 [78] "(Goodreads Author)"                                                                                                               
 [79] "\n                  avg rating 4.09 —\n                  40,760 ratings  —\n                  published 1986\n                "   
 [80] "Rate this book"                                                                                                                   
 [81] "\n                  avg rating 4.07 —\n                  82,665 ratings  —\n                  published 2015\n                "   
 [82] "Rate this book"                                                                                                                   
 [83] "\n                  avg rating 4.40 —\n                  28,168 ratings  —\n                  published 2017\n                "   
 [84] "Rate this book"                                                                                                                   
 [85] "\n                  avg rating 4.06 —\n                  44,114 ratings  —\n                  published 2008\n                "   
 [86] "Rate this book"                                                                                                                   
 [87] "\n                  avg rating 4.10 —\n                  40,702 ratings  —\n                  published 2008\n                "   
 [88] "Rate this book"                                                                                                                   
 [89] "\n                  avg rating 4.21 —\n                  36,920 ratings  —\n                  published 2018\n                "   
 [90] "Rate this book"                                                                                                                   
 [91] "(Goodreads Author)"                                                                                                               
 [92] "\n                  avg rating 4.34 —\n                  44,913 ratings  —\n                  published 2020\n                "   
 [93] "Rate this book"                                                                                                                   
 [94] "(Goodreads Author)"                                                                                                               
 [95] "\n                  avg rating 4.10 —\n                  31,649 ratings  —\n                  published 2006\n                "   
 [96] "Rate this book"                                                                                                                   
 [97] "\n                  avg rating 4.06 —\n                  27,448 ratings  —\n                  published 1999\n                "   
 [98] "Rate this book"                                                                                                                   
 [99] "\n                  avg rating 4.13 —\n                  35,548 ratings  —\n                  published 2017\n                "   
[100] "Rate this book"                                                                                                                   
[101] "\n                  avg rating 4.21 —\n                  29,605 ratings  —\n                  published 1994\n                "   
[102] "Rate this book"                                                                                                                   
[103] "\n                  avg rating 3.94 —\n                  30,103 ratings  —\n                  published 2012\n                "   
[104] "Rate this book"                                                                                                                   
[105] "(Goodreads Author)"                                                                                                               
[106] "\n                  avg rating 4.04 —\n                  27,664 ratings  —\n                  published 2008\n                "   
[107] "Rate this book"                                                                                                                   
[108] "\n                  avg rating 4.16 —\n                  118,332 ratings  —\n                  published 1994\n                "  
[109] "Rate this book"                                                                                                                   
[110] "\n                  avg rating 3.85 —\n                  59,273 ratings  —\n                  published 2008\n                "   
[111] "Rate this book"                                                                                                                   
[112] "\n                  avg rating 3.99 —\n                  74,237 ratings  —\n                  published 1995\n                "   
[113] "Rate this book"                                                                                                                   
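
Each rating string packs three values together with the newlines and indentation from the HTML source. As a rough sketch of how one such string could be tidied, the code below uses stringr on a single example string (the values are copied from the Sapiens snippet above; the variable names and regular expressions are illustrative assumptions, not part of the original workflow).

```r
library(stringr)

# One raw string in the shape returned by html_text();
# "\u2014" is the dash that separates the three parts
raw <- "\n                  avg rating 4.35 \u2014\n                  1,168,542 ratings  \u2014\n                  published 2011\n                "

# Collapse runs of whitespace and trim both ends
clean <- str_squish(raw)

# Pull out each piece with a regular expression
avg_rating <- as.numeric(str_extract(clean, "(?<=avg rating )[0-9.]+"))
n_ratings  <- as.numeric(str_remove_all(str_extract(clean, "[0-9,]+(?= ratings)"), ","))
year       <- as.integer(str_extract(clean, "(?<=published )[0-9]{4}"))

avg_rating
n_ratings
year
```

Because the stringr functions are vectorized, the same calls could be applied to a whole vector of rating strings once the unrelated entries have been filtered out. rvest also offers html_text2(), which normalizes whitespace as part of extraction.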

Putting it all together

length(titles)
[1] 50
length(authors)
[1] 51
length(info)
[1] 113

Oops! They aren’t all the same length, so we can’t put them in the same dataset. There should be 50 books! We need to look at this by hand…

The 13th element of authors is not an author name: it is the role label “(Contibutor)” (spelled that way in the scraped output). Let’s remove it.

authors <- authors[-13]
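
Dropping an element by position works for this snapshot of the page, but the index will shift whenever the shelf changes. Two less position-dependent options are sketched below on a hypothetical snippet. The scraped output suggests the role label shares the authorName class; that it is a <span> rather than a link, and that every book sits in its own elementList container, are assumptions about the real page.

```r
library(rvest)

# Hypothetical snippet: the second book has an author plus a role label
page <- minimal_html('
  <div class="elementList">
    <a class="bookTitle">Book One</a>
    <a class="authorName"><span>Author A</span></a>
  </div>
  <div class="elementList">
    <a class="bookTitle">Book Two</a>
    <a class="authorName"><span>Author B</span></a>
    <span class="authorName greyText smallText">(Contributor)</span>
  </div>
')

# Selecting on the class alone also picks up the role label
everything <- page %>% html_elements(".authorName") %>% html_text()

# Option 1: keep only <a> tags that carry the authorName class
authors_links <- page %>% html_elements("a.authorName") %>% html_text()

# Option 2: select each book container first, then take the first
# author inside each one; html_element() returns one result per container
books <- page %>% html_elements(".elementList")
authors_per_book <- books %>% html_element(".authorName") %>% html_text()

length(everything)
authors_links
authors_per_book
```

Option 2 keeps only the first listed author of each book, which matches the goal of one author per title; a book with no matching element would yield NA instead of silently shortening the vector.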

Our info vector also contains entries that are not rating strings (e.g., info[58] is “(Goodreads Author)”, and every book adds a “Rate this book” entry).

Sometimes webscraping requires some manual massaging. Here we can be more specific and match the full class attribute class="greyText smallText" by chaining both class names in the selector: html_elements(".greyText.smallText")
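
To see what chaining class names does, here is a small sketch on a hypothetical snippet (the three spans are made up for illustration). A selector such as ".greyText" matches any element that has that class, while ".greyText.smallText" (no space) matches only elements that have both.

```r
library(rvest)

# Hypothetical snippet with three differently classed spans
page <- minimal_html('
  <span class="greyText">(Goodreads Author)</span>
  <span class="greyText smallText">avg rating 4.35</span>
  <span class="smallText">other text</span>
')

# Matches every element that has the greyText class
one_class <- page %>% html_elements(".greyText") %>% html_text()

# Matches only elements that have BOTH classes
both_classes <- page %>% html_elements(".greyText.smallText") %>% html_text()

one_class
both_classes
```

Note that ".greyText .smallText" with a space means something different: a smallText element nested inside a greyText element. The same compound selector applied to the Goodreads page: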

info <- URL %>%
  html_elements(".greyText.smallText") %>%
  html_text()

info
 [1] "\n                  avg rating 4.22 —\n                  407,535 ratings  —\n                  published 2003\n                "  
 [2] "\n                  avg rating 4.16 —\n                  187,031 ratings  —\n                  published 1976\n                "  
 [3] "\n                  avg rating 4.08 —\n                  195,986 ratings  —\n                  published 2017\n                "  
 [4] "\n                  avg rating 4.35 —\n                  1,168,557 ratings  —\n                  published 2011\n                "
 [5] "\n                  avg rating 4.40 —\n                  153,771 ratings  —\n                  published 1980\n                "  
 [6] "\n                  avg rating 4.13 —\n                  775,348 ratings  —\n                  published 2010\n                "  
 [7] "\n                  avg rating 4.14 —\n                  188,324 ratings  —\n                  published 2014\n                "  
 [8] "\n                  avg rating 4.01 —\n                  119,116 ratings  —\n                  published 1859\n                "  
 [9] "\n                  avg rating 4.10 —\n                  100,287 ratings  —\n                  published 1999\n                "  
[10] "\n                  avg rating 4.29 —\n                  77,237 ratings  —\n                  published 1995\n                "   
[11] "\n                  avg rating 4.06 —\n                  231,078 ratings  —\n                  published 2003\n                "  
[12] "(Contibutor)"                                                                                                                     
[13] "\n                  avg rating 4.27 —\n                  210,509 ratings  —\n                  published 1985\n                "  
[14] "\n                  avg rating 4.38 —\n                  210,037 ratings  —\n                  published 2017\n                "  
[15] "\n                  avg rating 4.36 —\n                  52,981 ratings  —\n                  published 2016\n                "   
[16] "\n                  avg rating 4.34 —\n                  109,260 ratings  —\n                  published 2010\n                "  
[17] "\n                  avg rating 4.16 —\n                  76,010 ratings  —\n                  published 2014\n                "   
[18] "\n                  avg rating 3.93 —\n                  54,009 ratings  —\n                  published 2010\n                "   
[19] "\n                  avg rating 4.05 —\n                  236,212 ratings  —\n                  published 1985\n                "  
[20] "\n                  avg rating 4.04 —\n                  438,255 ratings  —\n                  published 1997\n                "  
[21] "\n                  avg rating 4.06 —\n                  75,945 ratings  —\n                  published 2010\n                "   
[22] "\n                  avg rating 4.16 —\n                  55,086 ratings  —\n                  published 2009\n                "   
[23] "\n                  avg rating 3.99 —\n                  63,098 ratings  —\n                  published 2014\n                "   
[24] "\n                  avg rating 4.13 —\n                  39,676 ratings  —\n                  published 2004\n                "   
[25] "\n                  avg rating 4.19 —\n                  273,715 ratings  —\n                  published 2015\n                "  
[26] "\n                  avg rating 4.27 —\n                  82,164 ratings  —\n                  published 2018\n                "   
[27] "\n                  avg rating 3.90 —\n                  277,060 ratings  —\n                  published 2006\n                "  
[28] "\n                  avg rating 4.03 —\n                  28,670 ratings  —\n                  published 1962\n                "   
[29] "\n                  avg rating 4.18 —\n                  29,489 ratings  —\n                  published 2016\n                "   
[30] "\n                  avg rating 4.32 —\n                  92,736 ratings  —\n                  published 2019\n                "   
[31] "\n                  avg rating 4.04 —\n                  39,810 ratings  —\n                  published 1987\n                "   
[32] "\n                  avg rating 3.95 —\n                  59,009 ratings  —\n                  published 2010\n                "   
[33] "\n                  avg rating 4.29 —\n                  51,144 ratings  —\n                  published 1979\n                "   
[34] "\n                  avg rating 4.34 —\n                  38,264 ratings  —\n                  published 1994\n                "   
[35] "\n                  avg rating 4.19 —\n                  44,261 ratings  —\n                  published 2001\n                "   
[36] "\n                  avg rating 4.09 —\n                  40,760 ratings  —\n                  published 1986\n                "   
[37] "\n                  avg rating 4.07 —\n                  82,665 ratings  —\n                  published 2015\n                "   
[38] "\n                  avg rating 4.40 —\n                  28,168 ratings  —\n                  published 2017\n                "   
[39] "\n                  avg rating 4.06 —\n                  44,114 ratings  —\n                  published 2008\n                "   
[40] "\n                  avg rating 4.10 —\n                  40,702 ratings  —\n                  published 2008\n                "   
[41] "\n                  avg rating 4.21 —\n                  36,920 ratings  —\n                  published 2018\n                "   
[42] "\n                  avg rating 4.34 —\n                  44,913 ratings  —\n                  published 2020\n                "   
[43] "\n                  avg rating 4.10 —\n                  31,649 ratings  —\n                  published 2006\n                "   
[44] "\n                  avg rating 4.06 —\n                  27,448 ratings  —\n                  published 1999\n                "   
[45] "\n                  avg rating 4.13 —\n                  35,548 ratings  —\n                  published 2017\n                "   
[46] "\n                  avg rating 4.21 —\n                  29,605 ratings  —\n                  published 1994\n                "   
[47] "\n                  avg rating 3.94 —\n                  30,103 ratings  —\n                  published 2012\n                "   
[48] "\n                  avg rating 4.04 —\n                  27,664 ratings  —\n                  published 2008\n                "   
[49] "\n                  avg rating 4.16 —\n                  118,332 ratings  —\n                  published 1994\n                "  
[50] "\n                  avg rating 3.85 —\n                  59,273 ratings  —\n                  published 2008\n                "   
[51] "\n                  avg rating 3.99 —\n                  74,237 ratings  —\n                  published 1995\n                "   
info <- info[-12]
info
 [1] "\n                  avg rating 4.22 —\n                  407,535 ratings  —\n                  published 2003\n                "  
 [2] "\n                  avg rating 4.16 —\n                  187,031 ratings  —\n                  published 1976\n                "  
 [3] "\n                  avg rating 4.08 —\n                  195,986 ratings  —\n                  published 2017\n                "  
 [4] "\n                  avg rating 4.35 —\n                  1,168,557 ratings  —\n                  published 2011\n                "
 [5] "\n                  avg rating 4.40 —\n                  153,771 ratings  —\n                  published 1980\n                "  
 [6] "\n                  avg rating 4.13 —\n                  775,348 ratings  —\n                  published 2010\n                "  
 [7] "\n                  avg rating 4.14 —\n                  188,324 ratings  —\n                  published 2014\n                "  
 [8] "\n                  avg rating 4.01 —\n                  119,116 ratings  —\n                  published 1859\n                "  
 [9] "\n                  avg rating 4.10 —\n                  100,287 ratings  —\n                  published 1999\n                "  
[10] "\n                  avg rating 4.29 —\n                  77,237 ratings  —\n                  published 1995\n                "   
[11] "\n                  avg rating 4.06 —\n                  231,078 ratings  —\n                  published 2003\n                "  
[12] "\n                  avg rating 4.27 —\n                  210,509 ratings  —\n                  published 1985\n                "  
[13] "\n                  avg rating 4.38 —\n                  210,037 ratings  —\n                  published 2017\n                "  
[14] "\n                  avg rating 4.36 —\n                  52,981 ratings  —\n                  published 2016\n                "   
[15] "\n                  avg rating 4.34 —\n                  109,260 ratings  —\n                  published 2010\n                "  
[16] "\n                  avg rating 4.16 —\n                  76,010 ratings  —\n                  published 2014\n                "   
[17] "\n                  avg rating 3.93 —\n                  54,009 ratings  —\n                  published 2010\n                "   
[18] "\n                  avg rating 4.05 —\n                  236,212 ratings  —\n                  published 1985\n                "  
[19] "\n                  avg rating 4.04 —\n                  438,255 ratings  —\n                  published 1997\n                "  
[20] "\n                  avg rating 4.06 —\n                  75,945 ratings  —\n                  published 2010\n                "   
[21] "\n                  avg rating 4.16 —\n                  55,086 ratings  —\n                  published 2009\n                "   
[22] "\n                  avg rating 3.99 —\n                  63,098 ratings  —\n                  published 2014\n                "   
[23] "\n                  avg rating 4.13 —\n                  39,676 ratings  —\n                  published 2004\n                "   
[24] "\n                  avg rating 4.19 —\n                  273,715 ratings  —\n                  published 2015\n                "  
[25] "\n                  avg rating 4.27 —\n                  82,164 ratings  —\n                  published 2018\n                "   
[26] "\n                  avg rating 3.90 —\n                  277,060 ratings  —\n                  published 2006\n                "  
[27] "\n                  avg rating 4.03 —\n                  28,670 ratings  —\n                  published 1962\n                "   
[28] "\n                  avg rating 4.18 —\n                  29,489 ratings  —\n                  published 2016\n                "   
[29] "\n                  avg rating 4.32 —\n                  92,736 ratings  —\n                  published 2019\n                "   
[30] "\n                  avg rating 4.04 —\n                  39,810 ratings  —\n                  published 1987\n                "   
[31] "\n                  avg rating 3.95 —\n                  59,009 ratings  —\n                  published 2010\n                "   
[32] "\n                  avg rating 4.29 —\n                  51,144 ratings  —\n                  published 1979\n                "   
[33] "\n                  avg rating 4.34 —\n                  38,264 ratings  —\n                  published 1994\n                "   
[34] "\n                  avg rating 4.19 —\n                  44,261 ratings  —\n                  published 2001\n                "   
[35] "\n                  avg rating 4.09 —\n                  40,760 ratings  —\n                  published 1986\n                "   
[36] "\n                  avg rating 4.07 —\n                  82,665 ratings  —\n                  published 2015\n                "   
[37] "\n                  avg rating 4.40 —\n                  28,168 ratings  —\n                  published 2017\n                "   
[38] "\n                  avg rating 4.06 —\n                  44,114 ratings  —\n                  published 2008\n                "   
[39] "\n                  avg rating 4.10 —\n                  40,702 ratings  —\n                  published 2008\n                "   
[40] "\n                  avg rating 4.21 —\n                  36,920 ratings  —\n                  published 2018\n                "   
[41] "\n                  avg rating 4.34 —\n                  44,913 ratings  —\n                  published 2020\n                "   
[42] "\n                  avg rating 4.10 —\n                  31,649 ratings  —\n                  published 2006\n                "   
[43] "\n                  avg rating 4.06 —\n                  27,448 ratings  —\n                  published 1999\n                "   
[44] "\n                  avg rating 4.13 —\n                  35,548 ratings  —\n                  published 2017\n                "   
[45] "\n                  avg rating 4.21 —\n                  29,605 ratings  —\n                  published 1994\n                "   
[46] "\n                  avg rating 3.94 —\n                  30,103 ratings  —\n                  published 2012\n                "   
[47] "\n                  avg rating 4.04 —\n                  27,664 ratings  —\n                  published 2008\n                "   
[48] "\n                  avg rating 4.16 —\n                  118,332 ratings  —\n                  published 1994\n                "  
[49] "\n                  avg rating 3.85 —\n                  59,273 ratings  —\n                  published 2008\n                "   
[50] "\n                  avg rating 3.99 —\n                  74,237 ratings  —\n                  published 1995\n                "   
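
The `info[-12]` step above relies on negative indexing: a negative index drops that position and keeps everything else in order. A minimal sketch on a made-up vector (the values are hypothetical, not scraped data):

```r
# Hypothetical vector standing in for a scraped character vector
x <- c("first", "second", "third", "fourth")

# A negative index removes that position instead of selecting it
x_trimmed <- x[-2]

x_trimmed          # "first" "third" "fourth"
length(x_trimmed)  # 3
```

Because positions on a live page can shift between scrapes, it is probably worth re-checking which element needs to be dropped each time the page is read.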

All Together Now

Now all three vectors are the same length:

length(titles)
[1] 50
length(authors)
[1] 50
length(info)
[1] 50
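
The length checks above can also be made to fail loudly before the data frame is built. This is only a sketch: the three short vectors are hypothetical stand-ins for `titles`, `authors`, and `info`, and `books_demo` is a name invented for the example.

```r
# Hypothetical stand-ins for the scraped vectors
titles  <- c("Book A", "Book B", "Book C")
authors <- c("Author A", "Author B", "Author C")
info    <- c("info A", "info B", "info C")

# Collect the lengths in one named vector
lens <- lengths(list(titles = titles, authors = authors, info = info))
lens

# Stop with an error unless every vector has the same length
stopifnot(length(unique(lens)) == 1)

books_demo <- data.frame(Title = titles, Author = authors, Info = info)
nrow(books_demo)  # 3
```

An explicit check is useful because `data.frame()` silently recycles a shorter vector when its length divides the longer one evenly, which would misalign rows without raising an error.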
books <- data.frame(Title = titles,
  Author = authors,
  Info = info
  )
books
                                                                                                                                             Title
1                                                                                                 A Short History of Nearly Everything (Paperback)
2                                                                                                                     The Selfish Gene (Paperback)
3                                                                                                   Astrophysics for People in a Hurry (Hardcover)
4                                                                                                Sapiens: A Brief History of Humankind (Paperback)
5                                                                                                                   Cosmos (Mass Market Paperback)
6                                                                                                 The Immortal Life of Henrietta Lacks (Hardcover)
7                                                                What If?: Serious Scientific Answers to Absurd Hypothetical Questions (Hardcover)
8                                                                                                                The Origin of Species (Hardcover)
9                                         The Elegant Universe: Superstrings, Hidden Dimensions, and the Quest for the Ultimate Theory (Paperback)
10                                                                            The Demon-Haunted World: Science as a Candle in the Dark (Paperback)
11                                                                                          Stiff: The Curious Lives of Human Cadavers (Paperback)
12                                                             "Surely You're Joking, Mr. Feynman!": Adventures of a Curious Character (Paperback)
13                                                                               Why We Sleep: Unlocking the Power of Sleep and Dreams (Hardcover)
14                                                                                                       The Gene: An Intimate History (Hardcover)
15                                                                                  The Emperor of All Maladies: A Biography of Cancer (Hardcover)
16                                                                                          The Sixth Extinction: An Unnatural History (Hardcover)
17 The Disappearing Spoon: And Other True Tales of Madness, Love, and the History of the World from the Periodic Table of the Elements (Hardcover)
18                                                                     The Man Who Mistook His Wife for a Hat and Other Clinical Tales (Paperback)
19                                                                                Guns, Germs, and Steel: The Fates of Human Societies (Paperback)
20                                                                                                                    The Grand Design (Hardcover)
21                                                                              The Greatest Show on Earth: The Evidence for Evolution (Hardcover)
22                                                                                                      Seven Brief Lessons on Physics (Hardcover)
23                                                                   The Fabric of the Cosmos: Space, Time, and the Texture of Reality (Paperback)
24                                                                                                        Homo Deus: A History of Tomorrow (ebook)
25                                                                                                  Brief Answers to the Big Questions (Hardcover)
26                                                                                                                    The God Delusion (Hardcover)
27                                                                                             The Structure of Scientific Revolutions (Paperback)
28                                                             I Contain Multitudes: The Microbes Within Us and a Grander View of Life (Hardcover)
29                                                                                                     The Body: A Guide for Occupants (Hardcover)
30                                                                                                         Chaos: Making a New Science (Paperback)
31                                                                           Packing for Mars: The Curious Science of Life in the Void (Paperback)
32                                                                                        Gödel, Escher, Bach: An Eternal Golden Braid (Paperback)
33                                                                                Pale Blue Dot: A Vision of the Human Future in Space (Paperback)
34                                                                                                          The Universe in a Nutshell (Hardcover)
35                                               The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe Without Design (Paperback)
36                                     The Hidden Life of Trees: What They Feel, How They Communicate: Discoveries from a Secret World (Hardcover)
37                                                                                 Behave: The Biology of Humans at Our Best and Worst (Hardcover)
38                                                                                                                         Bad Science (Paperback)
39                                                                                                           Physics of the Impossible (Hardcover)
40                                                                   The Rise and Fall of the Dinosaurs: A New History of a Lost World (Hardcover)
41                                                     Entangled Life: How Fungi Make Our Worlds, Change Our Minds & Shape Our Futures (Hardcover)
42                                                                                    Death by Black Hole: And Other Cosmic Quandaries (Paperback)
43                                                                               Genome: The Autobiography of a Species in 23 Chapters (Paperback)
44                                                                                                                   The Order of Time (Hardcover)
45                                                      Six Easy Pieces: Essentials of Physics Explained by Its Most Brilliant Teacher (Paperback)
46                                                                 A Universe from Nothing: Why There Is Something Rather Than Nothing (Hardcover)
47                                                      Your Inner Fish: a Journey into the 3.5-Billion-Year History of the Human Body (Hardcover)
48                                                           The Hot Zone: The Terrifying True Story of the Origins of the Ebola Virus (Paperback)
49                                                                                       Bonk: The Curious Coupling of Science and Sex (Paperback)
50                                   Longitude: The True Story of a Lone Genius Who Solved the Greatest Scientific Problem of His Time (Hardcover)
                  Author
1            Bill Bryson
2        Richard Dawkins
3    Neil deGrasse Tyson
4      Yuval Noah Harari
5             Carl Sagan
6         Rebecca Skloot
7         Randall Munroe
8         Charles Darwin
9           Brian Greene
10            Carl Sagan
11            Mary Roach
12    Richard P. Feynman
13        Matthew Walker
14  Siddhartha Mukherjee
15  Siddhartha Mukherjee
16     Elizabeth Kolbert
17              Sam Kean
18          Oliver Sacks
19         Jared Diamond
20       Stephen Hawking
21       Richard Dawkins
22         Carlo Rovelli
23          Brian Greene
24     Yuval Noah Harari
25       Stephen Hawking
26       Richard Dawkins
27        Thomas S. Kuhn
28               Ed Yong
29           Bill Bryson
30          James Gleick
31            Mary Roach
32 Douglas R. Hofstadter
33            Carl Sagan
34       Stephen Hawking
35       Richard Dawkins
36       Peter Wohlleben
37    Robert M. Sapolsky
38          Ben Goldacre
39           Michio Kaku
40        Steve Brusatte
41      Merlin Sheldrake
42   Neil deGrasse Tyson
43           Matt Ridley
44         Carlo Rovelli
45    Richard P. Feynman
46    Lawrence M. Krauss
47           Neil Shubin
48     Richard   Preston
49            Mary Roach
50            Dava Sobel
                                                                                                                                Info
1    \n                  avg rating 4.22 —\n                  407,535 ratings  —\n                  published 2003\n                
2    \n                  avg rating 4.16 —\n                  187,031 ratings  —\n                  published 1976\n                
3    \n                  avg rating 4.08 —\n                  195,986 ratings  —\n                  published 2017\n                
4  \n                  avg rating 4.35 —\n                  1,168,557 ratings  —\n                  published 2011\n                
5    \n                  avg rating 4.40 —\n                  153,771 ratings  —\n                  published 1980\n                
6    \n                  avg rating 4.13 —\n                  775,348 ratings  —\n                  published 2010\n                
7    \n                  avg rating 4.14 —\n                  188,324 ratings  —\n                  published 2014\n                
8    \n                  avg rating 4.01 —\n                  119,116 ratings  —\n                  published 1859\n                
9    \n                  avg rating 4.10 —\n                  100,287 ratings  —\n                  published 1999\n                
10    \n                  avg rating 4.29 —\n                  77,237 ratings  —\n                  published 1995\n                
11   \n                  avg rating 4.06 —\n                  231,078 ratings  —\n                  published 2003\n                
12   \n                  avg rating 4.27 —\n                  210,509 ratings  —\n                  published 1985\n                
13   \n                  avg rating 4.38 —\n                  210,037 ratings  —\n                  published 2017\n                
14    \n                  avg rating 4.36 —\n                  52,981 ratings  —\n                  published 2016\n                
15   \n                  avg rating 4.34 —\n                  109,260 ratings  —\n                  published 2010\n                
16    \n                  avg rating 4.16 —\n                  76,010 ratings  —\n                  published 2014\n                
17    \n                  avg rating 3.93 —\n                  54,009 ratings  —\n                  published 2010\n                
18   \n                  avg rating 4.05 —\n                  236,212 ratings  —\n                  published 1985\n                
19   \n                  avg rating 4.04 —\n                  438,255 ratings  —\n                  published 1997\n                
20    \n                  avg rating 4.06 —\n                  75,945 ratings  —\n                  published 2010\n                
21    \n                  avg rating 4.16 —\n                  55,086 ratings  —\n                  published 2009\n                
22    \n                  avg rating 3.99 —\n                  63,098 ratings  —\n                  published 2014\n                
23    \n                  avg rating 4.13 —\n                  39,676 ratings  —\n                  published 2004\n                
24   \n                  avg rating 4.19 —\n                  273,715 ratings  —\n                  published 2015\n                
25    \n                  avg rating 4.27 —\n                  82,164 ratings  —\n                  published 2018\n                
26   \n                  avg rating 3.90 —\n                  277,060 ratings  —\n                  published 2006\n                
27    \n                  avg rating 4.03 —\n                  28,670 ratings  —\n                  published 1962\n                
28    \n                  avg rating 4.18 —\n                  29,489 ratings  —\n                  published 2016\n                
29    \n                  avg rating 4.32 —\n                  92,736 ratings  —\n                  published 2019\n                
30    \n                  avg rating 4.04 —\n                  39,810 ratings  —\n                  published 1987\n                
31    \n                  avg rating 3.95 —\n                  59,009 ratings  —\n                  published 2010\n                
32    \n                  avg rating 4.29 —\n                  51,144 ratings  —\n                  published 1979\n                
33    \n                  avg rating 4.34 —\n                  38,264 ratings  —\n                  published 1994\n                
34    \n                  avg rating 4.19 —\n                  44,261 ratings  —\n                  published 2001\n                
35    \n                  avg rating 4.09 —\n                  40,760 ratings  —\n                  published 1986\n                
36    \n                  avg rating 4.07 —\n                  82,665 ratings  —\n                  published 2015\n                
37    \n                  avg rating 4.40 —\n                  28,168 ratings  —\n                  published 2017\n                
38    \n                  avg rating 4.06 —\n                  44,114 ratings  —\n                  published 2008\n                
39    \n                  avg rating 4.10 —\n                  40,702 ratings  —\n                  published 2008\n                
40    \n                  avg rating 4.21 —\n                  36,920 ratings  —\n                  published 2018\n                
41    \n                  avg rating 4.34 —\n                  44,913 ratings  —\n                  published 2020\n                
42    \n                  avg rating 4.10 —\n                  31,649 ratings  —\n                  published 2006\n                
43    \n                  avg rating 4.06 —\n                  27,448 ratings  —\n                  published 1999\n                
44    \n                  avg rating 4.13 —\n                  35,548 ratings  —\n                  published 2017\n                
45    \n                  avg rating 4.21 —\n                  29,605 ratings  —\n                  published 1994\n                
46    \n                  avg rating 3.94 —\n                  30,103 ratings  —\n                  published 2012\n                
47    \n                  avg rating 4.04 —\n                  27,664 ratings  —\n                  published 2008\n                
48   \n                  avg rating 4.16 —\n                  118,332 ratings  —\n                  published 1994\n                
49    \n                  avg rating 3.85 —\n                  59,273 ratings  —\n                  published 2008\n                
50    \n                  avg rating 3.99 —\n                  74,237 ratings  —\n                  published 1995\n                

Separating Data into Columns

We will need some help from regex (short for regular expressions), a powerful way to search, match, and extract patterns in text.

books <- books %>%
  mutate(
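    # Collapse newlines and runs of spaces into single spaces, and trim both ends
    Info = str_squish(Info),

    # Extract each piece using regex
    # Decimal number that follows "avg rating "
    avg_rating = str_extract(Info, "(?<=avg rating )\\d+\\.\\d+"),
    # Comma-grouped count that is followed by " ratings"
    num_ratings = str_extract(Info, "\\d{1,3}(,\\d{3})*(?= ratings)"),
    # Four-digit year that follows "published "
    yr_published = str_extract(Info, "(?<=published )\\d{4}")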
  ) %>% 
  select(-Info)
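
To see what each pattern picks out, it can help to run them on a single string first. The sketch below uses the Sapiens entry from the `info` output above; `sample_info` is a name made up for the example.

```r
library(stringr)

# One scraped entry, copied from the info output above (row 4)
sample_info <- "\n                  avg rating 4.35 —\n                  1,168,557 ratings  —\n                  published 2011\n                "

# str_squish() collapses all whitespace runs (including \n) and trims the ends
sample_info <- str_squish(sample_info)
sample_info

# (?<=...) is a lookbehind: match only what comes right after "avg rating "
str_extract(sample_info, "(?<=avg rating )\\d+\\.\\d+")     # "4.35"

# (?=...) is a lookahead: match digits and commas right before " ratings"
str_extract(sample_info, "\\d{1,3}(,\\d{3})*(?= ratings)")  # "1,168,557"

# Four digits right after "published "
str_extract(sample_info, "(?<=published )\\d{4}")           # "2011"
```

Each result is still a character string at this point, which is why a separate conversion step is needed afterwards.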

🌶️ Regular expressions can be challenging, depending on the complexity of the text you are trying to parse. You are encouraged to use Google and/or generative AI to help you come up with the appropriate expressions. Below is a quick guide to some of the basics. Note that inside an R string each backslash must be doubled, so `\d` in the table is written as `"\\d"` in code:

| Pattern | Meaning                                     | Example                          |
|---------|---------------------------------------------|----------------------------------|
| `.`     | Any character except newline                | `a.b` matches `acb`, `a2b`       |
| `\d`    | Any digit (0–9)                             | `\d+` matches `2023`, `55`       |
| `\w`    | Any word character (letter, number, _)      | `\w+` matches `hello`, `abc123`  |
| `\s`    | Whitespace (space, tab, newline)            | `\s+` matches spaces/tabs        |
| `+`     | One or more of the preceding                | `a+` matches `a`, `aa`, `aaa`    |
| `*`     | Zero or more of the preceding               | `a*` matches the empty string, `a`, `aa` |
| `?`     | Optional (zero or one)                      | `colou?r` matches `color`, `colour` |
| `^`     | Start of line/string                        | `^The` matches lines starting with `The` |
| `$`     | End of line/string                          | `end$` matches lines ending with `end` |
| `[]`    | Match one character inside brackets         | `[aeiou]` matches any vowel      |
| `[^]`   | Match anything **except** what's in brackets | `[^0-9]` matches non-digits      |
| `()`    | Grouping for extracting or repeating        | `(\d{4})` extracts 4-digit numbers |
| `\|`    | OR                                          | `cat\|dog` matches `cat` or `dog` |
| `{n}`   | Exactly n repetitions                       | `\d{4}` matches exactly four digits |
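
A few of the patterns in the table can be tried out directly with `stringr`. This is a small sketch on hypothetical strings chosen only to exercise the patterns.

```r
library(stringr)

# Hypothetical example strings
x <- c("color", "colour", "The end", "cat 2023")

str_detect(x, "colou?r")    # `?` makes the "u" optional:  TRUE TRUE FALSE FALSE
str_detect(x, "^The")       # `^` anchors to the start:    FALSE FALSE TRUE FALSE
str_detect(x, "end$")       # `$` anchors to the end:      FALSE FALSE TRUE FALSE
str_extract(x, "\\d{4}")    # exactly four digits:         NA NA NA "2023"
str_extract(x, "cat|dog")   # either alternative:          NA NA NA "cat"
str_extract(x, "[aeiou]")   # first vowel in each string:  "o" "o" "e" "a"
```

`str_detect()` answers whether a pattern occurs at all, while `str_extract()` returns the first match (or `NA` when there is none).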

Formatting Columns

books <- books %>% 
  mutate(
    # Strip the thousands separators before converting the counts to numbers
    num_ratings = str_remove_all(num_ratings, ",") %>% as.numeric(),
    avg_rating = as.numeric(avg_rating),
    yr_published = as.numeric(yr_published)
  )
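
As an alternative sketch (not the approach used above), `readr::parse_number()` removes the thousands separators and converts to numeric in one step. The three rows below are copied from the extracted values shown earlier; `demo` and `demo_clean` are names invented for the example.

```r
library(dplyr)
library(readr)

# Three rows of extracted values, still stored as text
demo <- data.frame(
  avg_rating   = c("4.22", "4.35", "4.01"),
  num_ratings  = c("407,535", "1,168,557", "119,116"),
  yr_published = c("2003", "2011", "1859")
)

demo_clean <- demo %>%
  mutate(
    avg_rating   = as.numeric(avg_rating),
    num_ratings  = parse_number(num_ratings),  # drops the "," grouping marks
    yr_published = as.integer(yr_published)    # years are whole numbers
  )

# Confirm the column types after conversion
sapply(demo_clean, class)
```

Storing the year as an integer is a design choice for the sketch; `as.numeric()` works equally well for plotting and summarising.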