How to Implement Web Scraping in R

Introduction

Data is everywhere in today’s digital age, and immense amounts of it are publicly available on the internet. To draw meaningful inferences, we need to bring this data together, and that is where web scraping comes in.

We can collect data from web pages using web scrapers. This article will teach you how to implement a web scraper in R.

Why R?

Almost any programming language can be used to create a web scraper, but R has certain advantages when it comes to manipulating the extracted data. R has many libraries for data analysis and statistical modeling, which can be used to analyze the collected data.

Installation

To install R on Linux, run the following commands in the terminal.

sudo apt update && sudo apt upgrade
sudo apt install r-base

To check the version, run the command:

R --version

Install required libraries

In order to implement web scraping in R, we need the following libraries:

  • httr
  • rvest
  • parallel

Now let’s install these libraries. For that, launch the R console and run the following command.

install.packages("httr")

Note: To launch the R console, go to the terminal and type R or R.exe (for Windows OS).

The above code installs the package httr.

Similarly, we can install the rvest and parallel packages:

install.packages("rvest")
install.packages("parallel")

Create our first R scraper

Let’s create our first scraper using R. The workflow of the scraper is mentioned below:

  • Go to the website https://scrapeme.live/shop
  • Navigate through the first 5 listing pages and collect all product URLs
  • Visit each product page and collect the following data
    • Name
    • Description
    • Price
    • Stock
    • Image URL
    • Product URL
  • Save the collected data to a CSV file

Import required libraries

First, we need to import the required libraries.

library('httr')
library('rvest')
library('parallel')

Send request to the website

We use the httr library to collect data from websites. It allows the R program to send HTTP requests and helps us handle the responses received from the website.

Let’s send a request to https://scrapeme.live/shop

headers <- c(
  "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
  "Accept-Language" = "en-US,en;q=0.5"
)
url <- 'https://scrapeme.live/shop/'
response <- httr::GET(url, add_headers(headers))

The above code sends an HTTP request to the website and stores the response in the variable response.

Now, we need to validate the response:

verify_response <- function(response) {
  if (status_code(response) == 200) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

Here we verify the website response using the status code: if the status code is 200, the response is valid; otherwise, it is invalid. If we get an invalid response, we can add retries, which often resolves the issue.

max_retry <- 3
while (max_retry >= 1) {
  response <- httr::GET(url, add_headers(headers))
  if (verify_response(response)) {
    break  # valid response received, stop retrying
  } else {
    max_retry <- max_retry - 1
  }
}
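
If we want to reuse this request-and-retry logic, it can be wrapped in a small helper function. This is a minimal sketch; the name get_with_retry is hypothetical and not part of the original code.

# Hypothetical helper that retries a GET request up to max_retry times
get_with_retry <- function(url, headers, max_retry = 3) {
  while (max_retry >= 1) {
    response <- httr::GET(url, add_headers(headers))
    if (verify_response(response)) {
      return(response)          # valid response, stop retrying
    }
    max_retry <- max_retry - 1  # invalid response, try again
  }
  return(NULL)                  # all retries failed
}

response <- get_with_retry(url, headers)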

Related: Essential HTTP Headers for Web Scraping

Collect Required Data

 Collecting required data in web scraping using R

Now we have the response from the listing page. Let’s collect the product URLs. To parse the HTML response, we will use the rvest library.
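
The snippets below query a parser object, so we first need to build it from the response. Here is a minimal sketch using httr’s content() and rvest’s read_html(); the variable name parser matches the code that follows.

# Parse the HTML body of the response into a document that rvest can query
parser <- read_html(httr::content(response, as = "text", encoding = "UTF-8"))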

As you can see in the above screenshot, the URL of the product page is in an ‘a’ node having the class name woocommerce-LoopProduct-link woocommerce-loop-product__link. The ‘a’ node comes under an ‘li’ node, so its XPath can be written as //li/a[contains(@class, "product__link")].

The URL of the product is in the “href” attribute of that node, so we access the attribute value using rvest as below:

product_urls <- html_nodes(parser, xpath='//li/a[contains(@class, "product__link")]') %>% html_attr('href')

Similarly, we can get the next page URL from the next button in the HTML.

selecting the next page URL for web scraping using R

Since there are two results for the same XPath and we only want the first one to get the next page URL from the ‘a’ node, we wrap the XPath in parentheses and index it. So the XPath //a[@class="next page-numbers"] becomes (//a[@class="next page-numbers"])[1].

Now we use html_attr() from the rvest library to collect the attribute value.

next_page_url <- html_nodes(parser, xpath='(//a[@class="next page-numbers"])[1]') %>% html_attr('href')

Now we save all the product URLs into a list. We paginate through the listing pages and add the product URLs from each page to the same list. Once pagination is done, we send requests to the product URLs using the httr library.
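
A minimal sketch of that pagination loop, reusing the pieces defined above; the variable name all_product_urls and the use of a character vector instead of a list are assumptions, and the 5-page limit follows the workflow described earlier.

# Collect product URLs from the first 5 listing pages
all_product_urls <- c()
current_url <- 'https://scrapeme.live/shop/'

for (page in 1:5) {
  response <- httr::GET(current_url, add_headers(headers))
  if (!verify_response(response)) break

  parser <- read_html(httr::content(response, as = "text", encoding = "UTF-8"))

  # Product links on the current listing page
  product_urls <- html_nodes(parser, xpath='//li/a[contains(@class, "product__link")]') %>% html_attr('href')
  all_product_urls <- c(all_product_urls, product_urls)

  # URL of the next listing page; stop if there is none
  next_page_url <- html_nodes(parser, xpath='(//a[@class="next page-numbers"])[1]') %>% html_attr('href')
  if (length(next_page_url) == 0) break
  current_url <- next_page_url
}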

Process the URLs in parallel

Now we have collected all product URLs. Let’s send the request to the product pages.

Since there are many product URLs and each request takes a few seconds to return a response over the network, the code would otherwise wait for each response before executing the remaining code, leading to a much longer execution time.

To overcome this, we use the parallel library. It has a function called mclapply(), which takes a vector as its first argument and a function as its second. It also accepts the number of cores to be used.

We assign the number of cores through the mc.cores parameter. Here the get_product_data function will be called for each URL in the product_urls vector, and mc.cores defines the number of cores used to run the function block.

If mc.cores is not specified, mclapply() falls back to the mc.cores option, which defaults to 2. Assigning a value greater than the number of available cores still works, but the extra workers simply share the available cores, so there is no additional speed-up. Note that on Windows, mclapply() only supports mc.cores = 1.

results <- mclapply(product_urls, get_product_data, mc.cores = 8)
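
Instead of hard-coding the core count, it can also be queried with detectCores() from the same parallel library:

# Use however many cores the machine actually has
num_cores <- parallel::detectCores()
results <- mclapply(product_urls, get_product_data, mc.cores = num_cores)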

Now let’s collect the required data points: name, description, price, stock, and image URL.

Name

collecting data points' title from node h1 for web scraping in R

From the image, we can see that the product’s name is inside an h1 node. Since there are no other h1 nodes on the product page, we can simply use the XPath //h1 to select that particular node.

Since the text is inside the node, we use the following code

title <- html_nodes(parser, xpath='//h1') %>% html_text2()

There are two methods in the rvest library to extract text: html_text2() and html_text(). html_text2() cleans and strips unwanted white space from the selected string, whereas html_text() returns the text as it appears on the website.
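
A quick illustration of the difference, assuming the parser object created earlier:

# html_text() keeps the text exactly as it appears in the HTML source
raw_title <- html_nodes(parser, xpath='//h1') %>% html_text()

# html_text2() normalizes white space, which is usually what we want for scraped text
clean_title <- html_nodes(parser, xpath='//h1') %>% html_text2()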

Description

collecting product description for web scraping in R

As we can see, the product description is inside a p node, which sits inside a div whose class contains the substring ‘product-details__short-description.’ We can collect the text inside it as follows.

description <- html_nodes(parser, xpath='//div[contains(@class,"product-details__short-description")]') %>% html_text2()

Stock

collecting stock data for web scraping in R

Since the stock is directly present inside a p node whose class contains the string ‘in-stock’, we can use the following code to collect the data from it.

stock <- html_nodes(parser, xpath='//p[contains(@class, "in-stock")]') %>% html_text2()

Price

collecting price data for web scraping in R

We can get the product’s price using the following code, since the price is directly available in the p node having the class price.

price <- html_nodes(parser, xpath='//p[@class="price"]') %>% html_text2()

Image URL

collecting image url for web scraping in R

We can get the image URL from the attribute href of the node ‘a’ which is selected as shown in the screenshot above.

image_url <- html_nodes(parser, xpath='//div[contains(@class, "woocommerce-product-gallery__image")]/a') %>% html_attr('href')

Now we return the collected data of each product as a new data frame and append it to a common list.

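Putting the pieces together, here is a minimal sketch of how the data frame can be built inside the get_product_data() function used with mclapply() earlier; the column names and exact structure are assumptions based on the data points listed in the workflow.

# Hypothetical sketch: fetch one product page and return its data points as a one-row data frame
get_product_data <- function(product_url) {
  response <- httr::GET(product_url, add_headers(headers))
  if (!verify_response(response)) {
    return(NULL)
  }
  parser <- read_html(httr::content(response, as = "text", encoding = "UTF-8"))

  # Take the first matching image link in case the gallery has several
  image_urls <- html_nodes(parser, xpath='//div[contains(@class, "woocommerce-product-gallery__image")]/a') %>% html_attr('href')

  data.frame(
    name        = html_nodes(parser, xpath='//h1') %>% html_text2(),
    description = html_nodes(parser, xpath='//div[contains(@class,"product-details__short-description")]') %>% html_text2(),
    price       = html_nodes(parser, xpath='//p[@class="price"]') %>% html_text2(),
    stock       = html_nodes(parser, xpath='//p[contains(@class, "in-stock")]') %>% html_text2(),
    image_url   = image_urls[1],
    product_url = product_url,
    stringsAsFactors = FALSE
  )
}

Each call returns a one-row data frame, and mclapply() collects these into a list.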

The code below can be used to combine the list of data frames (here named data_frame_list) into a single one.

single_data_frame <- do.call(rbind, data_frame_list)

To save the collected data to CSV format, we can use the following code:

write.csv(single_data_frame, file = "scrapeme_live_R_data.csv", row.names = FALSE)

You can find the complete code here – Web scraping in R

How to send GET requests using cookies and headers

Now, let’s see how to send requests using headers and cookies. First, load the httr library and initialize the required parameters.

library('httr')
# URL to which we send the request
url <- "https://httpbin.org/anything"

# Set the required headers
headers <- c(
  "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
  "Accept-Language" = "en-US,en;q=0.5"
)

# Set the cookies if any, else no need to use set_cookies()
cookies <- c(
  "cookie_name1" = "cookie_value1",
  "cookie_name2" = "cookie_value2"
)

After the initialization, send the request and save the response object to a variable.

# Send the GET request from httr library
response <- GET(url, add_headers(headers), set_cookies(cookies))

# Print the response to see if it is a valid response
print(response)
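
Beyond printing the whole response object, httr also lets us inspect the status code and the body separately, for example:

# Check the HTTP status code of the response
status_code(response)

# Read the response body as text
content(response, as = "text", encoding = "UTF-8")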

In web scraping, the usage of headers is critical to avoid getting blocked.

Sending POST requests in R

Now let’s see how to send a POST request using httr in R.

library(httr)
# Set the URL
url <- "https://httpbin.org/anything"

We initialize the POST request payload as below.

# Set the request body (payload)
payload <- list(
  key1 = "value1",
  key2 = "value2"
)
# Set the headers
headers <- c(
  "Content-Type" = "application/json",
  "Authorization" = "Bearer YOUR_TOKEN",
 "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
)

Here we set encode = "json" to convert the payload to JSON before sending the request.

# Send the POST request
response <- POST(url, body = payload, encode = "json", add_headers(headers))

# Print the response to see if it is a valid response
print(response)

Conclusion

This tutorial gives you a strong foundation for implementing web scraping in R.

Web scraping in R has many advantages, one of which is how easily the extracted data can be manipulated. R has libraries specifically designed for data analysis and statistical modeling, which can be used to analyze the gathered information.

These libraries offer a robust set of tools for managing, manipulating, and displaying data in a manner that is efficient and effective.

We’ve covered the installation of the libraries required for web scraping in R, how to create the scraper, and how to send GET and POST requests in R. We also discussed how to gather data using XPath.
