R, FRED, and the 2016 Texas Primary: Part 1

I’m going to focus on scraping the FRED data in this post. You can keep up with the code by exploring my TexasPrimary2016 GitHub repo.

Why?

I want to add some data to my earlier analysis of the 2016 Texas Primary. The Federal Reserve Economic Data (FRED) website has hundreds of thousands of datasets to look at – many of them at the county level, including counties in Texas. I chose a few that looked interesting while browsing the site.

Meeting FRED

I saved the data I have scraped thus far from FRED as R objects in my repo. However, if you want to run the code below, you will need to acquire an API key from the Federal Reserve. DON’T get discouraged just yet. Getting access to the API key is easier than you may think:

  1. Create an account at this website: https://research.stlouisfed.org/fred2/
  2. Go to your account home page (My Account >> Account Home)
  3. Select “API key” from the navigation bar
  4. Select “Create API key”
  5. Give a description of your “app” (hint: I chose “Personal project”)
  6. Select “Request API key”

If you see a page like the one below, then congratulations – you have yourself a FRED API key!

[Screenshot: the FRED API key confirmation page (FRED_API.png)]
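
One quick aside before moving on: rather than pasting the key directly into scripts, it can live in your ~/.Renviron file under a name of your choosing and be read in at the top of a session. FRED_API_KEY below is just my choice of name, not anything FRED requires:

# in ~/.Renviron, add one line (no quotes):  FRED_API_KEY=your32characterkey
# restart R, then pull the key into the session:
myapi <- Sys.getenv("FRED_API_KEY")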

Dealing with FRED

Data Model

Some terminology first. FRED’s data hierarchy is as follows:

  • Source
    • Release
      • Series
        • Observation

Here is my explanation of the hierarchy, with an added level (Category) and some examples in parentheses; a sketch of how these levels map onto API request URLs follows the list:

  • Source = where FRED gets the data (U.S. Bureau of Economic Analysis)
    • Release = a group of data from a “source” that have something in common (Local Area Personal Income)
      • Category = a group of data within a “release” that have something in common (Per Capita Personal Income)
        • Series = an individual dataset found within a “release” (Per Capita Personal Income in Van Zandt County, TX)
          • Observation = the actual data points for the series (the data itself)
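
The Series and Observation levels are the ones this post actually requests. Roughly, the request URLs look like this (release 175 is Local Area Personal Income; the key and SOME_SERIES_ID are placeholders, not real values):

key <- "your32characterAPIkeyGoesHere000"   # placeholder, not a real key

# series within a release (one <series> node comes back per dataset)
series.url <- paste0("https://api.stlouisfed.org/fred/release/series",
                     "?release_id=175", "&api_key=", key)

# observations within a series (one <observation> node per data point);
# SOME_SERIES_ID stands in for an id returned by the call above
obs.url    <- paste0("https://api.stlouisfed.org/fred/series/observations",
                     "?series_id=SOME_SERIES_ID", "&api_key=", key)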

Constraints

There is a maximum of 1,000 results returned in a FRED API request.
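
If a query ever does need more than 1,000 series, the API’s offset parameter can be used to page through the results. The filtered queries later in this post stay under the cap, so this is only a sketch of the idea (fetch.series.page() and page.all.series() are illustrative names, not functions from the repo):

library(rvest)

# fetch one page of <series> nodes starting at 'offset'
fetch.series.page <- function(url, offset){
    paste0(url, "&offset=", offset) %>%
        read_html() %>%
        html_nodes("series")
}

# keep requesting pages of up to 1,000 results until one comes back short
page.all.series <- function(url){
    offset <- 0
    pages  <- list()
    repeat {
        page <- fetch.series.page(url, offset)
        if(length(page) == 0) break
        pages[[length(pages) + 1]] <- page
        if(length(page) < 1000) break
        offset <- offset + 1000
    }
    pages
}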

The Plan

I’m going to use a few functions to handle the gathering of the data. I’ve stored these functions in the file scraper.functions.R. These functions are focused on retrieving metadata about each series (series.scraper(), county.scraper(), category.scraper()), and collecting the data/observations (obs.scraper()).

As far as metadata goes, I want to know a few things about each series (each unique SeriesID). My focus is on:

  • Release
  • SeriesID
  • Title
  • Frequency
  • First observation
  • Most recent observation
  • Units
  • Seasonal adjustment
  • Realtime dates (when the data was blessed)

Scraping SeriesID Information

There are three functions from scraper.functions.R that will scrape the SeriesID metadata (above) that I want. The first handles the bulk of the work:

# series.scraper(key, id, filters) ----------------------------------------

# This function retrieves the metadata for each series within the specified release.
# You MUST enter a 'key' (your FRED API key)
# You MUST enter an 'id' (a number corresponding to a release)
# You MAY enter 'filters' (a character vector of 'tags' which will restrict the results)
# A data frame is returned which can act as a master/top-level table

series.scraper <- function(key = NULL, id = NULL, filters = NULL){
    
    if(is.null(key)){
        stop("An API 'key' is required!")
    }
    if(is.null(id)){
        stop("You must input a character vector corresponding to a release's 'id'")
    }
    
    root    <- "https://api.stlouisfed.org/fred/release/series?release_id="
    cred    <- "&api_key="
    
    if(!is.character(id)) {
        id <- as.character(id)
    }
    if(!is.null(filters)){
        filter <- paste("&tag_names=",
                        paste(filters, collapse = ";"),
                        sep = ""
        )
        url    <- paste(root, id, cred, key, filter, sep = "")
    }   else {
        url    <- paste(root, id, cred, key, sep = "")
    }
    
    html               <- url %>% read_html()
    series             <- html %>% html_nodes("series")
    series_id          <- series %>% html_attr("id")                        %>% as.character()
    title              <- series %>% html_attr("title")                     %>% as.character()
    frequency          <- series %>% html_attr("frequency")                 %>% as.character()
    start              <- series %>% html_attr("observation_start")         %>% as.Date()
    end                <- series %>% html_attr("observation_end")           %>% as.Date()
    units              <- series %>% html_attr("units")                     %>% as.character()
    adjustment         <- series %>% html_attr("seasonal_adjustment_short") %>% as.character()
    last_updated       <- series %>% html_attr("last_updated")              %>% as.Date()
    blessed_start      <- series %>% html_attr("realtime_start")            %>% as.Date()
    blessed_end        <- series %>% html_attr("realtime_end")              %>% as.Date()
    release            <- rep(id, length(series))                           %>% as.numeric()
    data.info          <- data.frame("Release"     = release,
                                     "SeriesID"    = series_id,
                                     "Title"       = title,
                                     "Frequency"   = frequency,
                                     "Units"       = units,
                                     "Start"       = start,
                                     "End"         = end,
                                     "Adjustment"  = adjustment,
                                     "LastUpdated" = last_updated,
                                     "Blessed"     = blessed_start
    )
    data.info$CountyName <- county.scraper(title)
    data.info$Category   <- category.scraper(title)
    data.info
}
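
As a quick sanity check (assuming your key is in myapi as described earlier), this is how I would pull the metadata for release 175 – Local Area Personal Income – restricted to Texas counties; lapi is just an illustrative name:

lapi <- series.scraper(key     = myapi,
                       id      = "175",
                       filters = c("tx", "county"))
str(lapi)   # one row per series, with the metadata columns listed earlier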

The next two are called within the function above and retrieve county and category information. They are very sloppy at this stage – their output will have to be cleaned up in the example below – but they will do for now (a quick demonstration of what they produce follows the two definitions):

# county.scraper(string) --------------------------------------------------

# Pulls a county name out of a series title by counting spaces: it grabs whatever
# sits between the third-to-last and second-to-last spaces of a title like
# "Per Capita Personal Income in Van Zandt County, TX" (here, just "Zandt").
county.scraper <- function(string){
    
    spaces.vector    <- gregexpr("\\s", string)   # positions of every space in each title
    spaceless.vector <- sapply(seq_along(spaces.vector), function(x){
        
        target.space1 <- length(spaces.vector[[x]]) - 2   # third-to-last space
        target.space2 <- length(spaces.vector[[x]]) - 1   # second-to-last space
        target.pos1   <- spaces.vector[[x]][target.space1] + 1
        target.pos2   <- spaces.vector[[x]][target.space2] - 1
        substr(string[[x]], target.pos1, target.pos2)
    })
    spaceless.vector
}


# category.scraper(string) ------------------------------------------------

# Pulls the category/measure portion of a series title with the same space-counting
# trick: it keeps everything up to (but not including) the fourth-to-last space.
category.scraper <- function(string){
    
    spaces.vector    <- gregexpr("\\s", string)   # positions of every space in each title
    spaceless.vector <- sapply(seq_along(spaces.vector), function(x){
        
        target.space1 <- length(spaces.vector[[x]]) - 3   # fourth-to-last space
        target.pos1   <- spaces.vector[[x]][target.space1] - 1
        substr(string[[x]], 1, target.pos1)
    })
    spaceless.vector
}
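
Here is what those two produce at the moment. The first pair of calls swaps in a single-word county (Anderson) for comparison; the second pair uses the Van Zandt title from the hierarchy example above. The outputs in the comments are what the space-counting logic returns:

county.scraper("Per Capita Personal Income in Anderson County, TX")
# [1] "Anderson"
category.scraper("Per Capita Personal Income in Anderson County, TX")
# [1] "Per Capita Personal Income"

# two-word county names only keep their last word, and the category keeps an
# extra "in" -- which is exactly the clean-up handled later in the example
county.scraper("Per Capita Personal Income in Van Zandt County, TX")
# [1] "Zandt"
category.scraper("Per Capita Personal Income in Van Zandt County, TX")
# [1] "Per Capita Personal Income in"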

The last remaining function in scraper.functions.R will do the actual collecting of the data (the Observations):

# obs.scraper(key, series) ------------------------------------------------

# This function will retrieve the 'observations' (the actual data inside a 'series')
# You MUST specify a 'key' (your FRED API key)
# You MUST specify a 'series' (the alphanumeric SeriesID corresponding to a data table within a release)
# A data frame will be returned with the actual data (a column of dates and a column of values)

obs.scraper <- function(key = NULL, series = NULL){
    
    if(is.null(key)){
        stop("An API 'key' is required!")
    }
    if(is.null(series)){
        stop("You must input a character vector corresponding to a series' 'id'")
    }
    
    root        <- "https://api.stlouisfed.org/fred/series/observations?series_id="
    cred        <- "&api_key="
    url         <- paste(root, series, cred, key, sep = "")
    
    html        <- url %>% read_html() %>% html_nodes("observation")
    date        <- html %>% html_attr("date")  %>% as.Date()
    value       <- html %>% html_attr("value")
    value       <- ifelse(value == ".", NA, value) %>% as.numeric()
    data.values <- data.frame("Date"  = date,
                              "Value" = value
    )
    data.values
}
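
A quick usage sketch (assuming the metadata data frame from the series.scraper() example above is sitting in lapi – any SeriesID from it will do):

one.series <- obs.scraper(key = myapi, series = lapi$SeriesID[1])
head(one.series)   # a Date column and a numeric Value column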

Notes on scraper.functions.R

Both series.scraper() and obs.scraper() require an API key; series.scraper() also needs a release ID, and obs.scraper() needs a series ID. Filters are optional but highly recommended to prevent truncation – remember, there is a MAX of 1,000 results returned. The filters must be placed inside a character vector (e.g. c(“tx”, “income”, “NSA”)).

An Example

I went through FRED and found a few series that I liked. Their release IDs and tag filters are used in the example below.

This example is verbose, and it may take some time to download all the data (even with a high-speed connection). YOU WILL HAVE TO ENTER YOUR API KEY. When you do, it should go something like this:

1. Load dependencies
2. Set the variables for the series.scraper() function (the series metadata scraper)
3. Run series.scraper() inside lapply() with a couple of tryCatch() wrappers (to handle server time-outs)
4. Clean up the County and Category information
5. Save the metadata as R objects
6. Scrape the actual observations for the SeriesIDs in those R objects with obs.scraper() inside lapply(), again with tryCatch() wrappers (to handle server time-outs)
7. Save the actual data (observations) as R objects

library(rvest)
library(plyr)
source("scraper.functions.R")

# Release Examples Info ---------------------------------------------------


# 116 = Unemployment in States and Local Areas (all other areas) 
# 346 = Small Area Income and Poverty Estimates 
# 119 = Annual Estimates of the Population for Counties 
# 330 = American Community Survey 
# 175 = Local Area Personal Income 


# Run the Metadata Function ------------------------------------------------

    ## set 1st set of variables
    myapi       <- "" # set your API key here
    fred.input1 <- c("116", "119", "330", "175")
    filter1     <- c("tx", "county")
    f.data1     <- list()
        
## run the 1st set
fred.series1 <- lapply(seq_along(fred.input1), function(x){
    
    tryCatch({
        f.data1[[x]] <- series.scraper(key     = myapi,
                                       id      = fred.input1[x],
                                       filters = filter1
        )
    }, error = function(e){
        Sys.sleep(3.5)
        f.data1[[x]] <- series.scraper(key     = myapi,
                                       id      = fred.input1[x],
                                       filters = filter1
        )
    })
}) %>% ldply()

    ## set 2nd set of variables
    fred.input2 <- c("346")
    filter2     <- c("tx", "county", "income")
    f.data2     <- list()

## run the 2nd set
fred.series2 <- lapply(seq_along(fred.input2), function(x){
    
    tryCatch({
        f.data2[[x]] <- series.scraper(key     = myapi,
                                       id      = fred.input2[x],
                                       filters = filter2
        )
    }, error = function(e){
        Sys.sleep(3.5)
        f.data2[[x]] <- series.scraper(key     = myapi,
                                       id      = fred.input2[x],
                                       filters = filter2
        )
    })
}) %>% ldply()


# Clean up the Meta Data --------------------------------------------------

    ## get the right county names
    right.counties <- c("Deaf Smith", "El Paso", "Fort Bend",
                        "Jeff Davis", "Jim Hogg", "Jim Wells",
                        "La Salle","Live Oak", "Palo Pinto",
                        "Red River", "San Augustine", "San Jacinto",
                        "San Patricio", "San Saba", "Tom Green",
                        "Val Verde", "Van Zandt")

## 1st series unique 'Category' values
uis1           <- unique(fred.series1$Category)
    
# clean up 'CountyName' and 'Category'
fred.series1[fred.series1$Category == uis1[[3]], 'CountyName'] <- right.counties
fred.series1[fred.series1$Category == uis1[[4]], 'CountyName'] <- right.counties
fred.series1[fred.series1$Category == uis1[[6]], 'CountyName'] <- right.counties
fred.series1[fred.series1$Category == uis1[[8]], 'CountyName'] <- right.counties
fred.series1[fred.series1$Category == uis1[[10]], 'CountyName'] <- right.counties
fred.series1[fred.series1$Category == uis1[[12]], 'CountyName'] <- right.counties
fred.series1[fred.series1$Category == uis1[[14]], 'CountyName'] <- right.counties

fred.series1[fred.series1$Category == uis1[[3]], 'Category'] <- uis1[[1]]
fred.series1[fred.series1$Category == uis1[[4]], 'Category'] <- uis1[[2]]
fred.series1[fred.series1$Category == uis1[[6]], 'Category'] <- uis1[[5]]
fred.series1[fred.series1$Category == uis1[[8]], 'Category'] <- uis1[[7]]
fred.series1[fred.series1$Category == uis1[[10]], 'Category'] <- uis1[[9]]
fred.series1[fred.series1$Category == uis1[[12]], 'Category'] <- uis1[[11]]

    # save as an R object
    save(fred.series1, file = "fred.series1.RData")


## 2nd series unique 'Category' names
uis2           <- unique(fred.series2$Category)

# clean up 'CountyName' and 'Category'
fred.series2[fred.series2$Category == uis2[[2]], 'CountyName'] <- right.counties
fred.series2[fred.series2$Category == uis2[[4]], 'CountyName'] <- right.counties
fred.series2[fred.series2$Category == uis2[[6]], 'CountyName'] <- right.counties    

fred.series2[fred.series2$Category == uis2[[2]], 'Category'] <- uis2[[1]]
fred.series2[fred.series2$Category == uis2[[4]], 'Category'] <- uis2[[3]]
fred.series2[fred.series2$Category == uis2[[6]], 'Category'] <- uis2[[5]]

    # save as an R object
    save(fred.series2, file = "fred.series2.RData")


# Run the Data Function ---------------------------------------------------

## 1st series
fred.obs1 <- lapply(seq_along(fred.series1$SeriesID), function(x){
    
    tryCatch({
        
        obs.scraper(myapi,fred.series1$SeriesID[x])
    }, error = function(e) {
        
        tryCatch({
            Sys.sleep(3.5)
            obs.scraper(myapi,fred.series1$SeriesID[x])
        }, error = function(e) {
            Sys.sleep(3.5)
            obs.scraper(myapi,fred.series1$SeriesID[x])
        })
    })
})
    # give it names and save as R object
    names(fred.obs1) <- fred.series1$SeriesID
    save(fred.obs1, file = "fred.obs1.RData")

## 2nd series
fred.obs2 <- lapply(seq_along(fred.series2$SeriesID), function(x){
    
    tryCatch({
        
        obs.scraper(myapi,fred.series2$SeriesID[x])
    }, error = function(e) {
        
        tryCatch({
            Sys.sleep(3.5)
            obs.scraper(myapi,fred.series2$SeriesID[x])
        }, error = function(e) {
            Sys.sleep(3.5)
            obs.scraper(myapi,fred.series2$SeriesID[x])
        })
    })
})
    # give it names and save as R object
    names(fred.obs2) <- fred.series2$SeriesID
    save(fred.obs2, file = "fred.obs2.RData")

Some Extra Notes

I plan on tidying up the category.scraper() and county.scraper() functions in order to avoid having to do all of that clean-up in the example above. I’ll most likely create a vector of Texas county names and use a grep()-like function to find where I need to split the series title (in order to create both the county and category columns); a rough sketch of the idea follows.
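
Something along these lines, with the county vector truncated to three names here for brevity – split.title() is just an illustrative name, not a function in the repo yet:

tx.counties <- c("Anderson", "Deaf Smith", "Van Zandt")   # all 254 counties in practice

split.title <- function(title, counties = tx.counties){
    pattern <- paste(counties, collapse = "|")
    county  <- regmatches(title, regexpr(pattern, title))
    if(length(county) == 0) county <- NA
    # everything before " in <county>" is treated as the category
    category <- sub(paste0("\\s+in\\s+", county, ".*$"), "", title)
    data.frame(CountyName = county, Category = category)
}

split.title("Unemployment Rate in Van Zandt County, TX")
# CountyName: "Van Zandt"    Category: "Unemployment Rate"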

The above example uses a column header called ‘CountyName’ for county information. This is so that I can more easily join data with the county’s corresponding ‘region’ id (which is what is used in choroplethr maps).
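
As a preview of that join, something like the following should work with the county.regions lookup table that ships with choroplethrMaps (the exact column names and the lower-case county names are from memory, so they are worth double-checking):

library(choroplethrMaps)
data(county.regions)

# keep the Texas rows of the lookup, then match on the lower-cased county name
tx.regions <- county.regions[county.regions$state.name == "texas", ]
fred.series1$region <- tx.regions$region[
    match(tolower(fred.series1$CountyName), tx.regions$county.name)
]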

My Next Post

My next post will focus on preparing the FRED data I have collected so far for analysis, starting with the data itself (the observations), which is currently floating around as data frames inside a list.
