Simple automated web-scraping with R CMD BATCH and Task Scheduler

With a mixture of R’s command-line tool, a batch file, and the Windows Task Scheduler, a simple automated web-scraper can be built.

Invoking R at the command-line

It is possible to invoke R from the Windows command-line by entering the full path name of the executable, such as

C:\"Program Files"\R\R-3.3.0\bin\R --vanilla

The option --vanilla is shorthand for several options which, in short, tell R not to read any startup or saved-workspace files and not to save (or ask about saving) the workspace image upon exit. If you were to invoke R from within the bin/ directory, you could enter the much simpler command

R --vanilla

And in the spirit of keeping things simple, if you place the bin/ directory in your PATH variable, then no matter the location of your current directory, you can always use the simpler command to invoke R. Here’s how to set that bin/ directory in your PATH variable:

  1. Press the Windows Key.
  2. Type systempropertiesadvanced (all one word) and press Enter. Alternatively, type sysdm.cpl, press Enter, and click the “Advanced” tab.
  3. Click “Environment Variables”.
  4. Select “Path” and click “Edit”.
  5. Place the cursor at the very end of the “Variable value” field.
  6. Type the appropriate path name to the bin/ directory with a preceding ; (path names are ; delimited); here’s an example of what I typed:
    ;C:\Program Files\R\R-3.3.0\bin\
  7. Click all the “OK” buttons until you have exited.

Congratulations. You can now invoke R from anywhere within the command-line.
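
One way to double-check the PATH change is to ask R itself. The sketch below is not from the original setup: dir_on_path() is a hypothetical helper name chosen for illustration, and it ignores case and trailing slashes on the assumption that you are on Windows, where directory names are compared that way.

```r
# Hypothetical helper (not part of R): is `dir` one of the entries in a
# PATH-style string? Case and trailing slashes are ignored, which matches
# how Windows compares directory names.
dir_on_path <- function(dir, path = Sys.getenv("PATH"),
                        sep = .Platform$path.sep) {
  clean <- function(x) {
    x <- gsub("\\\\", "/", x)  # backslashes -> forward slashes
    x <- sub("/+$", "", x)     # drop trailing slashes
    tolower(x)
  }
  entries <- strsplit(path, sep, fixed = TRUE)[[1]]
  clean(dir) %in% clean(entries)
}

# R.home("bin") is the directory holding the R executable of this session.
print(dir_on_path(R.home("bin")))
```

A FALSE here is not conclusive: R.home() may report a short (8.3) form of the path on Windows, and a program that was already open keeps the old PATH until it is restarted. Sys.which("R") is another quick check; it returns an empty string when no R executable is found on the PATH.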

The BATCH tool

The above invocation launches an interactive R session in the command-line window, just as though you were using the console in RStudio or the R GUI. From the command line, however, R also offers several “tools”, invoked through R CMD, which are not meant to be called directly (for example, from a GUI).

One such tool, BATCH, allows the user to run R files at the command-line (similar to using source() in the interactive GUI).  The command

R --vanilla CMD BATCH file.R file.Rout

will execute file.R and save the output to file.Rout, assuming that your working directory is the directory containing file.R.

The .Rout file, if not given, is created in the same directory as the .R file and is given the same name but with extension .Rout. In the above example, once R CMD BATCH has finished executing file.R, it calls proc.time() and inserts the returned value in the .Rout file — giving an indication of how long it took to execute the file. Warning messages and errors are also written to the .Rout file.
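
To see what ends up in a .Rout file without leaving R, here is a minimal sketch that writes a two-line script to a temporary directory and runs it in batch mode. It uses the form shown in R's help for BATCH, R CMD BATCH [options] infile [outfile], so --vanilla comes after BATCH here. The temporary directory and the tiny script are made up for the demonstration.

```r
# Write a tiny script to a temporary directory.
work_dir <- tempfile("batch-demo-")
dir.create(work_dir)
infile  <- file.path(work_dir, "file.R")
outfile <- file.path(work_dir, "file.Rout")
writeLines(c("x <- 1:10", "print(sum(x))"), infile)

# Run the script in batch mode.
# file.path(R.home("bin"), "R") is the R executable of the current session;
# shQuote() protects paths that contain spaces.
status <- system2(
  file.path(R.home("bin"), "R"),
  c("CMD", "BATCH", "--vanilla", shQuote(infile), shQuote(outfile))
)

# The .Rout file holds the echoed commands, their output, and the timing
# block produced by proc.time().
rout <- readLines(outfile)
cat(rout, sep = "\n")
```

The exit status returned by system2() is 0 when the script ran without error, which is handy if you want to check on a batch run from another R session.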

About batch files

Instead of repeatedly entering an R CMD BATCH command to run an R file, the command can be both stored in and executed from a batch file. Batch files, which have extension .bat, are plain text files whose content can be read and executed by the shell. These files can be created and edited using any text editing program (including RStudio).

Here is a batch file based on the above example:

@echo off
R --vanilla CMD BATCH file.R file.Rout 

where:

  • @echo off = do not print (echo) each command before it is executed.
  • The batch file is assumed to be saved to and executed from the same directory as file.R; if it is not, change the working directory in the batch file or specify the full file paths.
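
If you would rather generate the batch file than type it, something like the following could work. It is only a sketch: batch_file_lines() and run_file.bat are hypothetical names, the directory is an example, and the cd /d line is my addition so that the batch file does not depend on where it is started from. The R command uses the form from R's help for BATCH, with --vanilla placed after BATCH.

```r
# Hypothetical helper: build the lines of a batch file that changes to
# `script_dir` and runs `script` through R CMD BATCH.
batch_file_lines <- function(script_dir, script) {
  # file.R -> file.Rout
  rout <- paste0(sub("\\.[Rr]$", "", script), ".Rout")
  c(
    "@echo off",
    sprintf('cd /d "%s"', script_dir),
    sprintf('R CMD BATCH --vanilla "%s" "%s"', script, rout)
  )
}

lines <- batch_file_lines("C:\\Users\\Luke\\Documents\\R", "file.R")

# Written to a temporary directory here; save it next to your script instead.
bat_file <- file.path(tempdir(), "run_file.bat")
writeLines(lines, bat_file)
cat(readLines(bat_file), sep = "\n")
```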

Windows Task Scheduler

The Windows Task Scheduler allows users to schedule various types of tasks. One such task that can be scheduled is the execution of a batch file.

Using the GUI, it is possible to schedule an R file to execute daily by telling the scheduler to run a batch file which runs an R CMD BATCH command to execute that R file. Using the Task Scheduler GUI is a straightforward process:

  1. Press the Windows Key, type either taskschd.msc or “task scheduler”, and press enter to open the program.
  2. Click on “Create Task”.
  3. Assign a name and give a description.
  4. Create a new trigger and action to execute a batch file on a daily basis.
  5. Select additional conditions and settings as needed (such as “Wake to run” and “Run task as soon as possible after a scheduled start is missed”).

There are other features you can use such as “Hidden” or “Run whether user is logged on or not”, but the above should be good enough.
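
The same kind of daily task can also be registered from the command line with Windows' schtasks tool, which is an alternative to the GUI steps and not what I used. The sketch below only builds the arguments in R; daily_task_args(), the task name, and the batch file path are all hypothetical, and the path is assumed to contain no spaces.

```r
# Hypothetical helper: arguments for `schtasks /Create` that run a batch
# file once a day. `start_time` uses 24-hour HH:MM format.
daily_task_args <- function(task_name, bat_file, start_time = "09:00") {
  stopifnot(grepl("^([01][0-9]|2[0-3]):[0-5][0-9]$", start_time))
  c(
    "/Create",
    "/SC", "DAILY",                           # schedule type
    "/TN", shQuote(task_name, type = "cmd"),  # task name
    "/TR", shQuote(bat_file, type = "cmd"),   # program to run
    "/ST", start_time                         # start time
  )
}

args <- daily_task_args("Rig count scraper",
                        "C:\\Users\\Luke\\Documents\\R\\rigcount.bat")
cat("schtasks", args, "\n")

# On Windows, the task could then be registered with:
# system2("schtasks", args)
```

Printing the command before running it makes it easy to paste into a command prompt and try by hand first.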

Putting it all together

I have taken some web-scraping code from a previous post on scraping North Dakota rig count data, modified it, and saved it in a file called rigcount.data.R. You can find the modified code, plus some caveats about writing R files that are executed by R CMD BATCH, at the end of this post.

Here is all that is needed to create a simple automated web-scraper based on rigcount.data.R:

  1. Create a batch file to execute rigcount.data.R. The batch file will run in the C:\Windows\System32 directory, so be sure to change the directory to where your R file is located, such as
    @echo off
    cd /d "%USERPROFILE%\R"
    R --vanilla CMD BATCH rigcount.data.R rigcount.data.Rout
  2. Use the task scheduler to create a task that will execute the above batch file on a daily basis.

There you have it. With a scheduled task to execute the batch file, you have just created a simple automated web-scraper.

rigcount.data.R

Because you are executing an R file in batch mode, there will be a few changes to how R normally works when used with a program such as RStudio (which redirects standard input and output among other things).

  1. The personal library path under your user profile that is normally available when using RStudio will not be seen when using R CMD BATCH. That is why, before calling library(), it is necessary to specify that path, as in my case
    .libPaths("C:/Users/Luke/Documents/R/win-library/3.3")
  2. When using write.csv() to create a new CSV file within RStudio, you normally don’t need to create the file and open a connection to it first. Using R CMD BATCH, however, you will need to do this and write to the open connection, such as
    fname <- "C:/Users/Luke/Documents/R/newFile.csv"
    file.create(fname)
    fcon <- file(fname, open = "w")
    write.csv(some.object, fcon, row.names = FALSE)
    close(fcon)
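
For the first caveat, a slightly more defensive variant is sketched below. use_library() is a hypothetical helper, not something built into R: it stops with a clear message when the directory is missing, and since errors are written to the .Rout file, a wrong path is then easy to spot after a scheduled run. The temporary directory only stands in for a real library path.

```r
# Hypothetical helper: put `lib` at the front of the library search path,
# stopping with a clear message if the directory does not exist.
use_library <- function(lib) {
  if (!dir.exists(lib)) {
    stop("Library directory not found: ", lib)
  }
  .libPaths(c(lib, .libPaths()))
  invisible(.libPaths())
}

# Demonstration with a temporary directory standing in for the real
# library, e.g. "C:/Users/Luke/Documents/R/win-library/3.3".
demo_lib <- file.path(tempdir(), "demo-library")
dir.create(demo_lib, showWarnings = FALSE)
use_library(demo_lib)
print(.libPaths()[1])
```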
    

Here is the code for rigcount.data.R:

# Scrape Rig Count Data ---------------------------------------------------

# Load Dependencies
.libPaths("C:/Users/Luke/Documents/R/win-library/3.3")
library(rvest)

# Set today's date; to be used in file name.
today <- Sys.Date()

# Create and load URL; scrape table nodes and attributes ("summary").
url           <- "https://www.dmr.nd.gov/oilgas/riglist.asp"
html          <- url %>% read_html()
table         <- html %>% html_nodes("table")
table.summary <- table %>% html_attr("summary")

# Find the table with rig count data, which is called "results".
table.filter <- grep("results", table.summary)
rig.table    <- table[table.filter] %>% html_table()

# Extract the table from the list; find and apply the header to the table.
rig.table           <- rig.table[[1]]
rig.table.header    <- table[table.filter] %>%
  html_nodes("thead") %>%
  html_nodes("th") %>%
  html_text()
colnames(rig.table) <- rig.table.header

# Add "Publication Date" and make it the first column.
rig.table[ncol(rig.table) + 1L]   <- today
names(rig.table)[ncol(rig.table)] <- "Publication Date"
rig.table <- rig.table[, c(ncol(rig.table), 1:(ncol(rig.table) - 1L))]

# Write table to CSV file.
fname <- paste0(getwd(), "/", today, ".csv")
file.create(fname)
fcon <- file(fname, open = "w")
write.csv(rig.table, fcon, row.names = FALSE)
close(fcon)
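
The "Publication Date" step in the script can also be written as a small function, which makes it easy to try on made-up data without touching the website. This is an alternative sketch rather than the code the scraper runs; add_publication_date() and the sample rows are invented for illustration.

```r
# Hypothetical helper: add a "Publication Date" column and move it to the
# front, leaving the other columns in their original order.
add_publication_date <- function(df, date = Sys.Date()) {
  df[["Publication Date"]] <- rep(date, nrow(df))
  df[, c("Publication Date", setdiff(names(df), "Publication Date")),
     drop = FALSE]
}

# Made-up rows standing in for the scraped table.
rigs <- data.frame(Rig = c("A", "B"), County = c("X", "Y"),
                   stringsAsFactors = FALSE)
dated <- add_publication_date(rigs, as.Date("2016-06-01"))
print(names(dated))
```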
