Rvest: Web Scraping in R – Part I

Keep it simple

Rvest is a super easy way to scrape data from a website.

Why keep it this simple?

I prefer R over something like VBA because VBA relies on a web browser called Edge and at one point it relied on Internet Explorer – enough said!

Example

To demonstrate rvest’s ability, I’ll use a simple example where I scrape the historical stock price of every single company within the current S&P 500 stock index.  For this example I will follow a few steps:

  1. Deciding on an R library that will allow me to scrape data
  2. Generating a list of S&P 500 companies and their stock ticker
  3. Scraping historical stock prices

I’ll go over the first two steps in this post.

Step 1: Finding a library

I decided, of course, to go with rvest as my web scraping library for this exercise.  The rvest package was written by the same individual who wrote both the dplyr and tidyr packages, which I mentioned in a previous post.

Step 2.1: Finding a list of S&P 500 companies

Using a Google search, I found the Chicago Board Options Exchange’s (CBOE) website, which has a list of the S&P 500 companies.  Using my browser’s developer tools, I can find the “node” (HTML tag) that I need to scrape.  In this case I will be scraping a “table”, and I will use its “class” attribute (attributes are the name/value pairs written after the tag name) to distinguish the table I want to scrape.


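library(rvest)

# CBOE page listing the S&P 500 companies
url <- "http://www.cboe.com/products/snp500.aspx"

sp500 <- url %>%
  read_html() %>%                        # download and parse the page
  html_nodes("table.table.center") %>%   # <table> tags with classes "table" and "center"
  html_table(header = TRUE)              # convert each matched table to a data frame

# html_table() returns a list of tables; keep the first one
sp500 <- sp500[[1]]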

A little explanation:

The %>% (pipe operator) allows me to pass the result of one line of code into the next line of code (specifically into the first argument of the next function).  Using it is optional; rvest re-exports it from the magrittr library, and it can make the flow easier to follow.

In this case, the variable url is passed into read_html(); then the result of read_html() is passed into html_nodes().  The second argument of html_nodes(), “table.table.center”, is declared directly inside the function.

Next, I need to declare that my table has headers, which I declare as the second argument of html_table().  Finally, since html_table() returns a list of all the tables scraped by html_nodes(), I need to extract/subset the table that I want in order to produce a data frame for my new variable sp500.
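To make the pipe and the class selector a little more concrete, here is a minimal sketch that runs without touching the CBOE site.  The HTML string, the company names and the tickers in it are made up for illustration; only read_html(), html_nodes() and html_table() come from the code above.

```r
library(rvest)

# A tiny hand-written page standing in for the real one (made-up data)
page <- read_html(paste0(
  '<html><body>',
  '<table class="table center">',
  '<tr><th>Company</th><th>Ticker</th></tr>',
  '<tr><td>Example Corp</td><td>EXM</td></tr>',
  '<tr><td>Sample Inc</td><td>SMP</td></tr>',
  '</table>',
  '<table class="other">',
  '<tr><th>Ignored</th></tr><tr><td>x</td></tr>',
  '</table>',
  '</body></html>'))

# Piped form: each result becomes the first argument of the next call
piped <- page %>%
  html_nodes("table.table.center") %>%
  html_table(header = TRUE)

# The same calls written as nested functions, without the pipe
nested <- html_table(html_nodes(page, "table.table.center"), header = TRUE)

# Only one table carries both classes, so the list has a single element
demo <- piped[[1]]
```

The second table is skipped because it lacks the “table” and “center” classes.  Recent rvest releases also offer html_elements() as the newer name for html_nodes(); either should work here.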

Step 2.2: Cleaning the list

As with any mining activity, you are going to get a little dirty.  And in this case, although the first six lines of the table appear to be OK:


head(sp500, n = 6L)

The next several lines present a problem: some company names (in the column “Company”) appear to have carriage returns and line breaks nested inside them (i.e. “\r\n”).  Take, for example, the 16th company on the list, Air Products & Chemicals:

sp500[16,]

Examining the HTML code with the developer tools shows that there is indeed a line break just after “Products”.
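Rather than eyeballing the rows one at a time, the affected names can also be located programmatically.  This is only a sketch in base R; the vector below is a made-up stand-in for sp500$Company.

```r
# Made-up values imitating the scraped "Company" column
company <- c("3M Company", "Air Products\r\n & Chemicals", "Alcoa Inc")

# TRUE where a name contains a carriage return or a line feed
has_break <- grepl("[\r\n]", company)

# Positions of the names that need cleaning
needs_cleaning <- which(has_break)
```

On the real data, the same call with sp500$Company in place of company should list every row that has the problem.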

Time to clean it up.


sp500$Company <- gsub("\r\n", "", sp500$Company)

This should produce a table with cleaned-up company names, which is good enough for now.
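If the line break is surrounded by extra spaces, removing only “\r\n” can leave odd gaps behind.  A slightly more thorough variant, sketched here on made-up values and not necessarily needed for the CBOE table, collapses every run of whitespace into a single space and then trims the ends.

```r
# Made-up values imitating what the scrape returned
company <- c("3M Company",
             "Air Products\r\n      & Chemicals",
             "  Alcoa Inc\r\n")

# Collapse each run of whitespace (spaces, tabs, \r, \n) into one space,
# then strip whitespace from the start and end of every name
cleaned <- trimws(gsub("[[:space:]]+", " ", company))
```

Here cleaned holds “3M Company”, “Air Products & Chemicals” and “Alcoa Inc”.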

And finally, write the table to a CSV file so that it can be read back into R at a later date.

write.csv(sp500, "sp500.csv")
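One thing to watch for: by default write.csv() also writes the row names (1, 2, 3, ...) as an extra first column, which comes back as a column named X when the file is read again.  The sketch below uses a made-up two-row data frame and a temporary file to show the round trip with row.names = FALSE.

```r
# Made-up two-row stand-in for the scraped table
sp500_demo <- data.frame(Company = c("3M Company", "Air Products & Chemicals"),
                         Ticker  = c("MMM", "APD"),
                         stringsAsFactors = FALSE)

# Temporary location so the example leaves the working directory alone
out_file <- file.path(tempdir(), "sp500_demo.csv")

# row.names = FALSE keeps the row labels out of the file
write.csv(sp500_demo, out_file, row.names = FALSE)

# Read it back, keeping text columns as character
restored <- read.csv(out_file, stringsAsFactors = FALSE)
```

The restored data frame has the same two columns, Company and Ticker, with no extra index column.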
