Basic Web Scraping with R

When I first started scraping data from websites, I used the amazing Beautiful Soup library in Python. Recently I discovered the rvest package, which makes web scraping in R a breeze. I show some basic code below for scraping data from a website.

One caveat that should always be made about scraping the web: be careful not to overwhelm a site's servers. People just like you worked hard to make their website, and the data is available because of them. If your scraping code makes a bazillion calls per second on one website, you will crash the site and ruin it for the rest of us. Don't wreck the internet. I have found the simplest way to scrape ethically is to build in rests between scraping calls using something like Sys.sleep(2). That way the code runs a little closer to human speed, giving the website time to "catch up".
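For example, a tiny wrapper that pauses before every request makes it hard to forget the rest. The polite_read_html() helper below is just a sketch of the idea (the function name and the two-second default are mine); the scraping code later in this post calls read_html() and Sys.sleep() directly instead.

# Sketch of a hypothetical helper: always rest before requesting a page
polite_read_html <- function(url, pause = 2) {
  Sys.sleep(pause)        # give the server time to catch up between calls
  rvest::read_html(url)
}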

The following code scrapes the sports card website tcdb.com. This is an amazing website that I really don’t want to crash because I use it all the time.

The first step is to get the HTML from the landing page for one year.

library(rvest)       # read_html(), html_elements(), html_attr(), html_table()
library(data.table)  # as.data.table(), rbindlist()

rc_price_list <- list()  # will hold one cleaned price table per card

Sys.sleep(3)
YEAR <- 2011
print(YEAR)

year_url <- paste0("https://www.tcdb.com/ViewAll.cfm/sp/Baseball/year/", YEAR)
year_page_html <- read_html(year_url)

Next, get all the links that match a particular brand (Topps in this case) from that landing page into one list that I can loop over.

all_year_links <- year_page_html %>%
  html_elements("a") %>%   # every <a> tag on the page
  html_attr("href")        # keep just the link targets

# keep only the Topps and Topps Update set links
year_topps_links <- all_year_links[grepl("viewset.*?-topps$", tolower(all_year_links)) |
                                     grepl("viewset.*?-topps-update$", tolower(all_year_links))]

Loop over the links for that brand and repeat the process: get the links on each page for a particular subset of cards ("Rookies" in this case). Note that I had to swap part of the URL (ViewSet becomes Rookies) to get to the right page.

for (TOPPS_LINK in paste0("https://www.tcdb.com", gsub("ViewSet", "Rookies", year_topps_links))){
  Sys.sleep(3)  # rest between requests
  print(TOPPS_LINK)
  topps_page_html <- read_html(TOPPS_LINK)

  all_topps_links <- topps_page_html %>%
    html_elements("a") %>%
    html_attr("href")

Within that same loop, get the links for the individual cards. The key value here is that the code finds the card links for me, so I don't need to know the card IDs ahead of time to build them.

  # links to the individual card pages on this Rookies page
  year_topps_card_links <- paste0("https://www.tcdb.com",
                                  unique(all_topps_links[grepl("ViewCard",all_topps_links)]))


  for (CARD_LINK in year_topps_card_links){
    Sys.sleep(3)
    print(CARD_LINK)
    year_topps_card_page_html <- read_html(CARD_LINK)

    all_price_links <- year_topps_card_page_html %>%
      html_elements("a") %>%
      html_attr("href")

    # link to the card's price (sales history) page, if one exists
    price_link <- paste0("https://www.tcdb.com",
                         unique(all_price_links[grepl("Price\\.cfm",all_price_links)]))

You can see the process I'm going through here: it's just drilling down to the individual pages I want. If the URLs were predictable enough to construct directly, looping over them would be the easier approach. But in this case I didn't know the set IDs or the card IDs, so I used this drill-down method to loop over what I do know (years) and then let the rvest package find the links on each HTML page.

Once I get to the level I am interested in, I add an if statement to make sure the card actually has a price page, then use html_table to extract the table info.

    # if no price page was found, price_link is empty and there is nothing to scrape
    if (length(price_link) != 1){
      print("invalid price link")
    } else {

      price_page_html <- read_html(price_link)

      price_table <- price_page_html %>%
        html_elements("table") %>%   # every table on the price page
        html_table()

Finally, I use the data.table package to convert the data to a data.table, clean it up, and save it to the list, so that each pass through the loop adds another entry. Outside the loop, rbindlist combines everything into one table (a sketch of that last step appears after the loop below).

      price_dt <- as.data.table(price_table[[4]])    # the fourth table holds the sales data
      price_dt[, X1 := NULL]                         # drop the unneeded first column
      names(price_dt) <- as.character(price_dt[1,])  # first row holds the real column names
      price_dt <- price_dt[-1,]

      price_dt <- price_dt[Date != ""]               # drop empty rows

      # strip the "$" and "S&H" text, then split the sale price from shipping at the "+"
      price_dt[, price := as.numeric(gsub("\\+.*","",gsub("[\\$|S&H]","",Price)))]
      price_dt[grepl("S&H",Price), sh := as.numeric(gsub(".*?\\+","",gsub("[\\$|S&H]","",Price)))]
      price_dt[, tot_price := ifelse(!is.na(sh), price + sh, price)]

      # keep the cleaned columns plus a card name pulled out of the URL
      rc_price_list[[price_link]] <- price_dt[,.(date_sold = as.Date(Date, format = "%m-%d-%Y"),
                                                 price, sh, tot_price,
                                                 name = gsub("-"," ",gsub("^-","",gsub("\\d+?-","", gsub(".*?Topps","", price_link)))))]

      Sys.sleep(3)
    }
  }
}
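Outside the loops, the per-card tables collected in rc_price_list get combined into one table with rbindlist, as mentioned above. A minimal sketch of that final step (the all_prices_dt name and the idcol argument are my additions):

# combine every card's cleaned price table into one data.table
all_prices_dt <- rbindlist(rc_price_list, idcol = "price_link")
head(all_prices_dt)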

This process takes a lot of iterating and a lot of reading the page HTML to find the right text to add to the URLs.
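One trick that helps with that exploration is dumping every href a page exposes and skimming for patterns; that's how the ViewSet, Rookies, ViewCard, and Price.cfm pieces above turn up. A quick sketch of that kind of poking around (the explore_* names are just for illustration):

# list the unique links on a page so you can spot the URL pattern you need
Sys.sleep(3)
explore_html <- read_html("https://www.tcdb.com/ViewAll.cfm/sp/Baseball/year/2011")
explore_links <- explore_html %>%
  html_elements("a") %>%
  html_attr("href")
head(sort(unique(explore_links)), 20)  # skim the first 20 unique hrefs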
