Rvest: Web Scraping in R – Part II

Prior Proper Planning

Before I jump into the historical stock prices, I want to create a plan of attack. A good place to start will be to look at what I have so far:

##import the rvest library
library(rvest)

##set the url
url <- "http://www.cboe.com/products/snp500.aspx"

##fetch the table
sp500 <- url %>%
         read_html() %>%
         html_nodes("table.table.center") %>%
         html_table(header = T)

##tidy up
sp500         <- sp500[[1]]
sp500$Company <- gsub("\r\n", "", sp500$Company)

##save the table in a csv file
write.csv(sp500, "sp500.csv")

I have my list of S&P 500 companies; now I need to find a website from which I can scrape their historical stock prices. I know the perfect(ish) website: http://finance.yahoo.com.

Building Blocks

There are two key things which need to be considered when mining data from the internet:

  1. The web layout – how the data are laid out on the web (the relationships among the urls that need to be accessed and used for navigation)
  2. The data layout – the actual format the data are in (which could be HTML tables or files for download)

3.1 The Layout

Yahoo!’s finance web layout is fairly straightforward (an assembled example url follows this list):

  • The base url: https://finance.yahoo.com
  • The query url, which has three parts:
    • The start: /q/hp?s=
    • The input: the ticker symbols from the S&P 500 list
    • The end: +Historical+Prices
  • The “Next” button, which directs to the next page of data (until there is no more data, in which case the “href” attribute in the “a” tag no longer exists – which is helpful to know!)
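
Putting the base url and the three query pieces together, the address for a single company looks something like this (AAPL is just an illustrative ticker):

    https://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices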

3.1.1 Setting and Reading the URL

I can build custom urls to create a starting point to navigate to each company’s historical price data as follows:


    ##build the url and read it
    url   <- "https://finance.yahoo.com"
    start <- "/q/hp?s="
    end   <- "+Historical+Prices"
    link  <- paste(url, start, ticker, end, sep = "")
    yfp   <- read_html(link)
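
Note that ticker is not defined anywhere in this snippet yet – it becomes the argument of the function I build later on. To try the block on its own, you would first set it to a single symbol, for example (the choice of symbol is purely illustrative):

    ##pick a single ticker symbol for a standalone test
    ticker <- "AAPL"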

3.2.1 Scraping the Data
I can now load the data from the first webpage using rvest’s html_nodes() and html_table() functions. I’ll need to subset the “list” object that html_table() returns, and I’ll need to eliminate the last row (which contains information unrelated to the historical prices). Fortunately, the nodes I will be extracting are the same across every company’s web page (very convenient for recursion and looping).

        ##read the table
        y.table            <- yfp %>%
                              html_nodes("table.yfnc_datamodoutline1") %>%
                              html_nodes("table") %>%
                              html_table()
        y.table            <- y.table[[1]]
        y.table            <- y.table[-(dim(y.table)[1]), ]
        yft[[ticker]][[j]] <- y.table
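
A quick look at the result is a useful sanity check before moving on; at the time of writing, the Yahoo! table held columns along the lines of Date, Open, High, Low, Close, Volume and Adj Close (an assumption about the old page layout, not something the code depends on):

        ##inspect the scraped table
        head(y.table)
        dim(y.table)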

3.2.2 Grabbing the Next Page
Now that I’ve established the initial connection to the first page, I need to navigate to the next page once I’ve finished scraping data.

##search for the "Next" button
n         <- html_nodes(yfp, "a")
rel       <- html_attr(n, "rel")
link      <- n[grep("next", rel)]
next_link <- link %>%
             html_attr("href") %>%
             unique()
yfp       <- paste(url, next_link, sep = "") %>%
             read_html()

3.2.3 Intelligently Moving to the Next Page
I need to add some logic to make sure I’m only grabbing the “a” tag when the “href” attribute exists. Also, once I have scraped all the data for a single company, I want to concatenate the data frames into a single data frame and break the loop. I’ll use ldply() from the plyr package to collapse the list of data frames down to one. So here is the revamped code:

##search for the "Next" button
n    <- html_nodes(yfp, "a")
rel  <- html_attr(n, "rel")
link <- n[grep("next", rel)]

if (length(link) != 2){
    yfp           <- ""
    yft[[ticker]] <- yft[[ticker]] %>%
                     ldply()
    yft[[ticker]]
    break
}

##set the link if "Next" button exists
next_link <- link %>%
             html_attr("href") %>%
             unique()
yfp       <- paste(url, next_link, sep = "") %>%
             read_html()

3.3.1 Looping it all
As you may have seen, I had an undeclared variable in 3.2.1 – j. This variable is declared ahead of the while loop, which I will use to work through each page of a ticker symbol’s data. I’ll also be adding one more variable – i – which will be used to control the iteration of the while loop, and I will declare yft as a list (so it is not mishandled when I eventually wrap everything in a function in the next step). It should look something like this:

    ##build the url and read it
    url   <- "https://finance.yahoo.com"
    start <- "/q/hp?s="
    end   <- "+Historical+Prices"
    link  <- paste(url, start, ticker, end, sep = "")
    yfp   <- read_html(link)

    ##set variables for the loop
    i   <- 1
    j   <- 1
    yft <- list()

    while(i == 1){
        i <- 0

        ##read the table
        y.table            <- yfp %>%
                              html_nodes("table.yfnc_datamodoutline1") %>%
                              html_nodes("table") %>%
                              html_table()
        y.table            <- y.table[[1]]
        y.table            <- y.table[-(dim(y.table)[1]), ]
        yft[[ticker]][[j]] <- y.table

        ##search for the "Next" button
        n    <- html_nodes(yfp, "a")
        rel  <- html_attr(n, "rel")
        link <- n[grep("next", rel)]

        if (length(link) != 2){
            yfp           <- ""
            yft[[ticker]] <- yft[[ticker]] %>%
                             ldply()
            yft[[ticker]]
            break
        }

        ##set the link if "Next" button exists
        next_link <- link %>%
                     html_attr("href") %>%
                     unique()
        yfp       <- paste(url, next_link, sep = "") %>%
                     read_html()

        ##prepare for the next round
        i <- i + 1
        j <- j + 1
        Sys.sleep(3)
    }


3.4 The Finished Product
You may have noticed the “Sys.sleep(3)” call – it is there to avoid those pesky timeouts (Yahoo!’s server referee). Wrapping everything from the previous sections into a single function gives the finished product:

yahoo_recurse <- function(ticker){

    ##build the url and read it
    url   <- "https://finance.yahoo.com"
    start <- "/q/hp?s="
    end   <- "+Historical+Prices"
    link  <- paste(url, start, ticker, end, sep = "")
    yfp   <- read_html(link)

    ##set variables for the loop
    i   <- 1
    j   <- 1
    yft <- list()

    while(i == 1){
        i <- 0

        ##read the table
        y.table            <- yfp %>%
                              html_nodes("table.yfnc_datamodoutline1") %>%
                              html_nodes("table") %>%
                              html_table()
        y.table            <- y.table[[1]]
        y.table            <- y.table[-(dim(y.table)[1]), ]
        yft[[ticker]][[j]] <- y.table

        ##search for the "Next" button
        n    <- html_nodes(yfp, "a")
        rel  <- html_attr(n, "rel")
        link <- n[grep("next", rel)]

        if (length(link) != 2){
            yfp           <- ""
            yft[[ticker]] <- yft[[ticker]] %>%
                             ldply()
            yft[[ticker]]
            break
        }

        ##set the link if "Next" button exists
        next_link <- link %>%
                     html_attr("href") %>%
                     unique()
        yfp       <- paste(url, next_link, sep = "") %>%
                     read_html()

        ##prepare for the next round
        i <- i + 1
        j <- j + 1
        Sys.sleep(3)
    }
    yft[[ticker]]
}
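
For a single company, a standalone call might look like the following (the ticker is only an illustration, and rvest and plyr need to be loaded first):

    ##scrape one company's full price history
    aapl_prices <- yahoo_recurse("AAPL")
    head(aapl_prices)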

4.1 The Real Finished Product

In order to loop through all the S&P 500 stocks, I’ll be creating a separate function to call the function above. I also need a way to check that the required libraries are loaded, so I built another function (which you can optionally save to a separate R file and call via the commented-out #source() line). Also, I need a way to assign names to each of the data frames in my list.

yahoo_scrape <- function(ticker){
    stopifnot(is.character(ticker))
    source("yahoo_recurse.R")

    #source("check_pkgs.R")
    ##--check if the requested package is loaded
    check_pkgs <- function(x){
        sesh    <- sessionInfo()
        my_pack <- c(names(sesh$loadedOnly), names(sesh$otherPkgs))
        any(grepl(x, my_pack))
    }

    if (!check_pkgs("rvest")) library(rvest)
    if (!check_pkgs("plyr"))  library(plyr)

    yahoo_results        <- lapply(ticker, yahoo_recurse)
    names(yahoo_results) <- ticker
    yahoo_results
}

Example

I’m only going to try a small sample of data. You can choose to run through every single ticker symbol in the S&P 500, if you’d like.

set.seed(2016)
sp500    <- read.csv("sp500.csv")
sub      <- sample(1:504, 5)
stocks   <- as.character(sp500[sub, 'Ticker'])
test_run <- yahoo_scrape(stocks)
lapply(test_run, head)

test_run
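
If you want to hold on to the results, a minimal sketch in the spirit of the write.csv() call in the recap above could save each data frame in the list to its own file (the file naming is just one possible choice):

##save each company's prices to its own csv file
for (tk in names(test_run)){
    write.csv(test_run[[tk]], paste(tk, "csv", sep = "."))
}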
