Keep it simple
The rvest package is a super easy way to scrape data from a website.
Why keep it this simple?
I prefer R over something like VBA because VBA relies on a web browser called Edge and at one point it relied on Internet Explorer – enough said!
To demonstrate rvest’s ability, I’ll use a simple example where I scrape the historical stock price of every single company within the current S&P 500 stock index. For this example I will follow a few steps:
- Deciding on an R library that will allow me to scrape data
- Generating a list of S&P 500 companies and their stock ticker
- Scraping historical stock prices
I’ll go over the first two steps in this post.
Step 1: Finding a library
I decided, of course, to go with rvest as my web scraping library for this exercise. The rvest package was written by the same individual who wrote both the dplyr and tidyr packages, which I mentioned in a previous post.
Step 2.1: Finding a list of S&P 500 companies
Using a Google search, I found the Chicago Board Options Exchange’s (CBOE) website, which has a list of the S&P 500. Using my browser’s developer tools, I can find the “node” (HTML tag) that I need to scrape. In this case I will be scraping a “table”, and I will be using the “class” attribute (attributes are the variables listed after the node/tag name) to distinguish the table I want to scrape.
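library(rvest)

url <- "http://www.cboe.com/products/snp500.aspx"

# Read the page, select the table by its classes, and parse it
url %>%
  read_html() %>%
  html_nodes("table.table.center") %>%
  html_table(header = TRUE) -> sp500

# html_table() returns a list; index 1 is an assumption (the original index was lost)
sp500 <- sp500[[1]]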
A little explanation:
The %>% (pipe operator) allows me to pass the result of one step into the next step (specifically into the first argument of the next function). This is an optional feature in rvest (imported from the magrittr library); however, it can make the flow easier to follow.
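To make the pipe a little more concrete, here is a minimal sketch that compares a nested call with its piped equivalent. The vector x and the variable names are made up for illustration and have nothing to do with the S&P 500 data.

```r
library(magrittr)  # provides %>%; rvest re-exports the same operator

x <- c(4, 9, 16)  # illustrative numbers only

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: reads left to right; each result becomes the
# first argument of the next function
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms compute the same value
identical(nested, piped)
```

The piped version does nothing the nested version cannot do; it only changes the reading order.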
In this case, the variable url is passed into read_html(); then the result of read_html() is passed into html_nodes(). The second argument of html_nodes(), “table.table.center”, is declared directly inside the function. It is a CSS selector that matches any table element carrying both the “table” and “center” classes.
Next, I need to declare that my table has headers, which I declare as the second argument of html_table(). Finally, since html_table() returns a list of all the tables scraped by html_nodes(), I need to extract/subset the table that I want in order to produce a data frame for my new variable sp500.
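The selection and subsetting steps can be tried offline on a small inline page. This is only a sketch: the HTML, company names and tickers below are invented, and it assumes the table of interest is the first match, so the index 1 is an assumption rather than something taken from the CBOE page.

```r
library(rvest)

# Hypothetical page with two tables; only the second has both classes
page <- read_html('<html><body>
  <table class="table">
    <tr><th>Other</th></tr>
    <tr><td>ignore me</td></tr>
  </table>
  <table class="table center">
    <tr><th>Company</th><th>Ticker</th></tr>
    <tr><td>Example Corp</td><td>EXC</td></tr>
    <tr><td>Sample Inc</td><td>SMP</td></tr>
  </table>
</body></html>')

# Select <table> elements that have the classes "table" AND "center",
# then parse each match, treating the first row as the header
tables <- page %>%
  html_nodes("table.table.center") %>%
  html_table(header = TRUE)

# The result is a list, even though only one table matched
length(tables)

# [[1]] pulls the first (here, only) data frame out of the list
demo <- tables[[1]]
names(demo)
```

Checking length(tables) before subsetting is a cheap way to notice when a selector matches more or fewer tables than expected.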
Step 2.2: Cleaning the list
As with any mining activity, you are going to get a little dirty. And in this case, although the first six lines of the table appear to be OK:
head(sp500, n = 6L)
The next several lines present a problem: some company names (in the column “Company”) appear to have carriage returns and line breaks nested inside them (i.e. “\r\n”). Take, for example, the 16th company on the list, Air Products & Chemicals:
Examining the HTML code with the developer tool shows that indeed there is a line break just after “Products”.
Time to clean it up.
sp500$Company <- gsub("\r\n", "", sp500$Company)
This should produce a table with cleaned up company names, which is good enough for now.
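To see what the gsub() call does in isolation, here is a small sketch on a made-up vector. The exact position of the break and the spacing around it are assumptions, not values copied from the scraped table. The whitespace tidy-up step is an optional extra that is not part of the cleaning above.

```r
# Made-up values mimicking a company name with an embedded line break
dirty <- c("Air Products\r\n & Chemicals", "Example Corp")

# Remove every carriage-return + line-feed pair
clean <- gsub("\r\n", "", dirty)

# Optional extra (an assumption, not in the step above): collapse runs
# of whitespace and trim the ends in case the break left padding behind
tidy <- trimws(gsub("\\s+", " ", clean))

# Check whether any line-break characters remain
any(grepl("[\r\n]", tidy))
```

Running the same grepl() check on sp500$Company would be one way to confirm the real column is clean.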
And finally, write the table to a CSV file so that it can be read into R at a later date.
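One way this last step might look, using base R only. The data frame here is a stand-in for the cleaned sp500 table, and the file name and the use of tempdir() are assumptions made so the sketch can run anywhere.

```r
# Stand-in for the cleaned sp500 table (invented rows)
sp500_demo <- data.frame(
  Company = c("Example Corp", "Sample Inc"),
  Ticker  = c("EXC", "SMP"),
  stringsAsFactors = FALSE
)

# Hypothetical file name; tempdir() avoids touching the working directory
csv_path <- file.path(tempdir(), "sp500.csv")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(sp500_demo, csv_path, row.names = FALSE)

# Later on, read the file back into a data frame
sp500_reloaded <- read.csv(csv_path, stringsAsFactors = FALSE)
```

For the real table, sp500 would replace sp500_demo and the path would point somewhere permanent.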