Webscraping Tables

Author

Emily Malcolm-White

#LOAD PACKAGES 
library(tidyverse)

Data doesn’t just magically appear on your computer you need to get it from somewhere.

Sometimes we download data files (.csv, .xlsx, etc.) and save them locally. Other times, we use datasets that come bundled with R packages (like the gapminder dataset).

Obtaining Data From The Web

Say you’re interested in renting an apartment in Vermont—or studying the rental market. You might browse Craigslist’s Vermont rental listings.

You could spend hours copying and pasting each listing… or you could write code that extracts the data for you.

When you visit a website, your browser loads the HTML source code, which includes the structure and content of the page—like headings, tables, and links. We can use R to read that code and extract specific content. This is called web scraping.

🛑 Should we be scraping this data?

Before scraping, always ask:

Is it legal?
Can your specific use case violate the rules?
Even if legal, is it ethical?

In the U.S., scraping public data is typically legal if:

It’s not used for harmful purposes
It doesn’t interfere with a website’s business
It excludes personally identifiable information (PII)

Case Law Examples

eBay vs. Bidder’s Edge (2000): Bots restricted from overloading systems
Facebook vs. Power Venures (2009): Logging in on others’ behalf violated terms
Linkedin vs. hiQ Labs (2019): Scraping public profiles ruled permissible

Websites often describe scraping policies in two places:

Their robots.txt file (e.g., craigslist.org/robots.txt)
Their Terms of Service (TOS) document

Craigslist explicitly forbids scraping. Wikipedia does not.

Note

Why might Craigslist restrict scraping, while Wikipedia allows it?

How does HTML work?

HTML (HyperText Markup Language) is the language used to create web pages. HTML uses tags (like <table>, <tr>, <td>) to define page elements. If we understand this structure, we can write code that extracts tables and other elements from the page.

Tip

Typically, an HTML element is defined by a start tag, some content, and an end tag

<tagname> ...some content here... </tagname>

For example:

<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>

There are many, many different possible tag elements. In this class, it’s not important that you know the specifics of what each element is. It’s useful for you to understand the basic structure.

HTML Tables

An HTML table is used to represent data in a structured way

<table> Defines a table
<th> Defines a header cell in a table
<tr> Defines a row in a table
<td> Defines a cell in a table

Here is the HTML code:

<table>
  <tr>
    <th>Name</th>
    <th>Birth Year</th>  
    <th>Country</th>
  </tr>
  <tr>
    <td>Harry Styles</td>
    <td>Feb 1, 1994</td>
    <td>Britain</td>
  </tr>
  <tr>
    <td>Taylor Swift</td>
    <td>Dec 13, 1989</td>
    <td>USA</td>
  </tr>
  <tr>
    <td>Justin Bieber</td>
    <td>Mar 1, 1994</td>
    <td>Canada</td>
  </tr>
</table>

Here is how the HTML displays in a web browser:

Name	Birth Year	Country
Harry Styles	Feb 1, 1994	Britain
Taylor Swift	Dec 13, 1989	USA
Justin Bieber	Mar 1, 1994	Canada

Today’s class will focus on scraping data from HTML tables!

HTML class

The class attribute can be added to any HTML element. Often it is used to help customize the styling of the element (among other things).

<h2 class="city">Middlebury</h2>
<p class="city">Middlebury is a town in Vermont</p>

This can be particularly useful in web scraping – we can ask to scrape particular elements, particular classes, or both!

Web Scraping using `rvest`

We need the package rvest to help us with this.

library(rvest)

Viewing Raw HTML from a website

You can inspect the source code of any webpage by using a web browser like Firefox or Chrome.

On Firefox, navigate to the “Tools” menu item in the top menu and click on “Web Developer/Page Source”. You can also use the shortcut Command + U
On Chrome, navigate to the top menu item “View” and click on “Developer/View Source.” You can also use the keyboard shortcut Option-Command-U. It also can be useful to use the SelectorGadget Extension.

Webscraping Tables from Wikipedia

Check out the information on the (List of the Most Viewed YouTube Videos on Wikipedia)[https://en.wikipedia.org/wiki/List_of_most-viewed_YouTube_videos]. Suppose we want to scrape this data to use in R.

read_html scrapes the raw html from the webpage as text
html_element (and html_elements) selects particular elements from the HTML code
html_table formats a scraped html table as a tibble (R table)

youtube_videos <- read_html("https://en.wikipedia.org/wiki/List_of_most-viewed_YouTube_videos") %>%
  html_element(".wikitable") %>%
  html_table() 

youtube_videos

# A tibble: 31 × 6
   `Video name`                    Uploader `Views (billions)` Date  Notes ``   
   <chr>                           <chr>    <chr>              <chr> <chr> <chr>
 1 Baby Shark Dance[7]             Pinkfon… 15.65              June… "[A]" <NA> 
 2 Despacito[10]                   Luis Fo… 8.66               Janu… "[B]" <NA> 
 3 Wheels on the Bus[18]           Cocomel… 7.17               May … ""    <NA> 
 4 Johny Johny Yes Papa[19]        LooLoo … 7.02               Octo… ""    <NA> 
 5 Bath Song[20]                   Cocomel… 7.01               May … ""    <NA> 
 6 See You Again[21]               Wiz Kha… 6.58               Apri… "[C]" <NA> 
 7 Shape of You[26]                Ed Shee… 6.42               Janu… "[D]" <NA> 
 8 Phonics Song with Two Words[29] ChuChu … 6.31               Marc… ""    <NA> 
 9 Uptown Funk[30]                 Mark Ro… 5.49               Nove… ""    <NA> 
10 Gangnam Style[31]               Psy      5.48               July… "[E]" <NA> 
# ℹ 21 more rows

We could have used html_element("table") If we did this, it would have pulled the first <table> from the page.
We could have used html_elements("table") If we did this, it would have pulled all the <table> elements from the page.
If you want a specific table that isn’t the first table, scrape all the tables and apply html_table(). Then take that new object of the tables and add [[n]] to get the $n^{th}$ table. For example to call the $2^{nd}$ table,

tables <- html %>% 
  html_elements("table") %>%
  html_table() 
  
tables[[2]]

In this case, we used html_elements(".wikitable") I choose to use this because the <table> was also defined with a unique class: <table class="wikitable sortable">

Warning

Note that if we are using html_element to call a class, it is important to add a “.” before the class element name. You do not need to do this is you are calling an HTML element (like “table”)

Cleaning up with `janitor`

Web scraping doesn’t always format perfectly. Let’s clean it up!

library(janitor)

Clean up the names of the header:

youtube_videos <- clean_names(youtube_videos)

Format the views as a number using as.numeric:

youtube_videos <- youtube_videos %>% 
  mutate(views_billions = as.numeric(views_billions))

What are the top 10 most viewed YouTube Videos?

top10 <- youtube_videos %>%
  arrange(desc(views_billions)) %>%
  slice(1:10)

top10

# A tibble: 10 × 6
   video_name                      uploader     views_billions date  notes x    
   <chr>                           <chr>                 <dbl> <chr> <chr> <chr>
 1 Baby Shark Dance[7]             Pinkfong Ba…          15.6  June… "[A]" <NA> 
 2 Despacito[10]                   Luis Fonsi             8.66 Janu… "[B]" <NA> 
 3 Wheels on the Bus[18]           Cocomelon -…           7.17 May … ""    <NA> 
 4 Johny Johny Yes Papa[19]        LooLoo Kids…           7.02 Octo… ""    <NA> 
 5 Bath Song[20]                   Cocomelon -…           7.01 May … ""    <NA> 
 6 See You Again[21]               Wiz Khalifa            6.58 Apri… "[C]" <NA> 
 7 Shape of You[26]                Ed Sheeran             6.42 Janu… "[D]" <NA> 
 8 Phonics Song with Two Words[29] ChuChu TV N…           6.31 Marc… ""    <NA> 
 9 Uptown Funk[30]                 Mark Ronson            5.49 Nove… ""    <NA> 
10 Gangnam Style[31]               Psy                    5.48 July… "[E]" <NA>

Once we have this data, we can make cool plots!

top10 %>% 
  ggplot( aes(x=views_billions, y=reorder(video_name, views_billions))) +
    geom_bar(stat="identity") +
    xlab("Views (in billions)") +
    ylab("Videos") +
    ggtitle("Top 10 Most Watched YouTube Videos of All Time") +
    theme_minimal()

In this case, the list of the names is still not displaying very neatly. For example, rather than "Baby Shark Dance"[6] I might want it to just say Baby Shark Dance.

We can use the stringr package to remove symbols and numbers from the video names. We will be talking more about stringr later this semester and it’s not something I expect you to be able to do at this point in the semester.

library(stringr)

top10 %>% 
  mutate(video_name=str_replace_all(video_name, "[^[:alpha:]]", " ")) %>% 
  ggplot(aes(x=views_billions, y=reorder(video_name, views_billions))) +
    geom_bar(stat="identity") +
    xlab("Views (in billions)") +
    ylab("Videos") +
    ggtitle("Top 10 Most Watched YouTube Videos of All Time") +
    theme_minimal()

External Resources

R for Data Science, Webscraping

--- title: 'Webscraping Tables' author: 'Emily Malcolm-White' format: html: toc: TRUE code-overflow: wrap embed-resources: true code-tools: source: true toggle: false caption: none code-annotations: hover pdf: default execute: message: FALSE warning: FALSE --- ![](https://rvest.tidyverse.org/logo.png){width=30%} ```{r message=FALSE, warning=FALSE} #LOAD PACKAGES library(tidyverse) ``` Data doesn't just magically appear on your computer you need to get it from somewhere. Sometimes we download data files (`.csv`, `.xlsx`, etc.) and save them locally. Other times, we use datasets that come bundled with R packages (like the `gapminder` dataset). # Obtaining Data From The Web Say you're interested in renting an apartment in Vermont—or studying the rental market. You might browse Craigslist's Vermont rental listings. You could spend hours copying and pasting each listing... or you could write code that extracts the data for you. When you visit a website, your browser loads the HTML source code, which includes the structure and content of the page—like headings, tables, and links. We can use R to read that code and extract specific content. This is called web scraping. # 🛑 Should we be scraping this data? Before scraping, always ask: - Is it legal? - Can your specific use case violate the rules? - Even if legal, is it ethical? In the U.S., scraping public data is typically legal if: - It's not used for harmful purposes - It doesn't interfere with a website's business - It excludes personally identifiable information (PII) # Case Law Examples - [eBay vs. Bidder's Edge (2000) ](https://en.wikipedia.org/wiki/EBay_v._Bidder's_Edge#Order): Bots restricted from overloading systems - [Facebook vs. Power Venures (2009)](https://en.wikipedia.org/wiki/Facebook,_Inc._v._Power_Ventures,_Inc.#Ruling): Logging in on others' behalf violated terms - [Linkedin vs. hiQ Labs (2019)](https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn): Scraping public profiles ruled permissible Websites often describe scraping policies in two places: 1. Their robots.txt file (e.g., craigslist.org/robots.txt) 2. Their Terms of Service (TOS) document Craigslist explicitly forbids scraping. Wikipedia does not. ::: {.callout-note} Why might Craigslist restrict scraping, while Wikipedia allows it? ::: # How does HTML work? HTML (HyperText Markup Language) is the language used to create web pages. HTML uses tags (like `<table>`, `<tr>`, `<td>`) to define page elements. If we understand this structure, we can write code that extracts tables and other elements from the page. :::callout-tip Typically, an HTML element is defined by a start tag, some content, and an end tag `<tagname> ...some content here... </tagname>` ::: For example: ```{} <html> <head> <title>Page Title</title> </head> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html> ``` There are many, many different possible tag elements. In this class, it's not important that you know the specifics of what each element is. It's useful for you to understand the basic structure. ## HTML Tables An HTML table is used to represent data in a structured way - `<table>` Defines a table - `<th>` Defines a header cell in a table - `<tr>` Defines a row in a table - `<td>` Defines a cell in a table Here is the HTML code: ```{} <table> <tr> <th>Name</th> <th>Birth Year</th> <th>Country</th> </tr> <tr> <td>Harry Styles</td> <td>Feb 1, 1994</td> <td>Britain</td> </tr> <tr> <td>Taylor Swift</td> <td>Dec 13, 1989</td> <td>USA</td> </tr> <tr> <td>Justin Bieber</td> <td>Mar 1, 1994</td> <td>Canada</td> </tr> </table> ``` Here is how the HTML displays in a web browser: <table> <tr> <th>Name</th> <th>Birth Year</th> <th>Country</th> </tr> <tr> <td>Harry Styles</td> <td>Feb 1, 1994</td> <td>Britain</td> </tr> <tr> <td>Taylor Swift</td> <td>Dec 13, 1989</td> <td>USA</td> </tr> <tr> <td>Justin Bieber</td> <td>Mar 1, 1994</td> <td>Canada</td> </tr> </table> Today's class will focus on scraping data from HTML tables! ## HTML class The `class` attribute can be added to any HTML element. Often it is used to help customize the styling of the element (among other things). ``` <h2 class="city">Middlebury</h2> <p class="city">Middlebury is a town in Vermont</p> ``` This can be particularly useful in web scraping -- we can ask to scrape particular elements, particular classes, or both! # Web Scraping using `rvest` We need the package [`rvest`](https://rvest.tidyverse.org/) to help us with this. ```{r message=FALSE, warning=FALSE} library(rvest) ``` ## Viewing Raw HTML from a website You can inspect the source code of any webpage by using a web browser like Firefox or Chrome. - On Firefox, navigate to the “Tools” menu item in the top menu and click on “Web Developer/Page Source”. You can also use the shortcut `Command + U` - On Chrome, navigate to the top menu item “View” and click on “Developer/View Source.” You can also use the keyboard shortcut `Option-Command-U`. It also can be useful to use the [SelectorGadget Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb/related). ## Webscraping Tables from Wikipedia Check out the information on the (List of the Most Viewed YouTube Videos on Wikipedia)[https://en.wikipedia.org/wiki/List_of_most-viewed_YouTube_videos]. Suppose we want to scrape this data to use in R. - `read_html` scrapes the raw html from the webpage as text - `html_element` (and `html_elements`) selects particular elements from the HTML code - `html_table` formats a scraped html table as a tibble (R table) ```{r} youtube_videos <- read_html("https://en.wikipedia.org/wiki/List_of_most-viewed_YouTube_videos") %>% html_element(".wikitable") %>% html_table() youtube_videos ``` - We could have used `html_element("table")` If we did this, it would have pulled the first `<table>` from the page. - We could have used `html_elements("table")` If we did this, it would have pulled all the `<table>` elements from the page. - If you want a specific table that isn't the first table, scrape all the tables and apply `html_table()`. Then take that new object of the tables and add `[[n]]` to get the $n^{th}$ table. For example to call the $2^{nd}$ table, ```{} tables <- html %>% html_elements("table") %>% html_table() tables[[2]] ``` - In this case, we used `html_elements(".wikitable")` I choose to use this because the `<table>` was also defined with a unique `class`: `<table class="wikitable sortable">` :::callout-warning Note that if we are using `html_element` to call a class, it is important to add a "`.`" before the class element name. You do not need to do this is you are calling an HTML element (like "table") ::: ## Cleaning up with `janitor` Web scraping doesn't always format perfectly. Let's clean it up! ```{r message=FALSE, warning=FALSE} library(janitor) ``` ![Artwork by @allisonhorst](https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/79a12c01-0cc1-4643-b1e7-8ce8cafc947b_rw_1920.png?h=a3760f3882e735d7065627fc060714ce){width=50%} Clean up the names of the header: ```{r} youtube_videos <- clean_names(youtube_videos) ``` Format the views as a number using `as.numeric`: ```{r} youtube_videos <- youtube_videos %>% mutate(views_billions = as.numeric(views_billions)) ``` What are the top 10 most viewed YouTube Videos? ```{r} top10 <- youtube_videos %>% arrange(desc(views_billions)) %>% slice(1:10) top10 ``` Once we have this data, we can make cool plots! ```{r} top10 %>% ggplot( aes(x=views_billions, y=reorder(video_name, views_billions))) + geom_bar(stat="identity") + xlab("Views (in billions)") + ylab("Videos") + ggtitle("Top 10 Most Watched YouTube Videos of All Time") + theme_minimal() ``` In this case, the list of the names is still not displaying very neatly. For example, rather than `"Baby Shark Dance"[6]` I might want it to just say `Baby Shark Dance`. We can use the `stringr` package to remove symbols and numbers from the video names. We will be talking more about `stringr` later this semester and it's not something I expect you to be able to do at this point in the semester. ```{r} library(stringr) top10 %>% mutate(video_name=str_replace_all(video_name, "[^[:alpha:]]", " ")) %>% ggplot(aes(x=views_billions, y=reorder(video_name, views_billions))) + geom_bar(stat="identity") + xlab("Views (in billions)") + ylab("Videos") + ggtitle("Top 10 Most Watched YouTube Videos of All Time") + theme_minimal() ``` :::callout # External Resources - [R for Data Science, Webscraping](https://r4ds.hadley.nz/webscraping) :::

Obtaining Data From The Web

🛑 Should we be scraping this data?

Case Law Examples

How does HTML work?

HTML Tables

HTML class

Web Scraping using rvest

Viewing Raw HTML from a website

Webscraping Tables from Wikipedia

Cleaning up with janitor

Web Scraping using `rvest`

Cleaning up with `janitor`