Introduction
Data is everything in today’s digital age, and immense amounts of publicly available data exist on the internet. To draw meaningful inferences, we need to bring this data together, and that is where web scraping comes in.
We can collect data from web pages using web scrapers. This article will teach you how to implement a web scraper in R.
Why R?
Almost any programming language can be used to create a web scraper, but R has certain advantages when it comes to manipulating the extracted data. R has many libraries for data analysis and statistical modeling, which can be used to analyze the collected data.
Installation
To install R in Linux, run the following commands in the terminal.
sudo apt update && sudo apt upgrade
sudo apt install r-base
To check the installed version, run the command
R --version
Install required libraries
To implement web scraping in R, we need the following libraries:
- httr
- rvest
- parallel
Now let’s install these libraries. For that, launch the R console and run the following command.
install.packages("httr")
Note: To launch the R console, go to the terminal and type R or R.exe (for Windows OS).
The above code installs the package httr.
Similarly, we can install the packages rvest and parallel.
install.packages("rvest")
install.packages("parallel")
Create our first R scraper
Let’s create our first scraper using R. The workflow of the scraper is mentioned below:
- Go to the website https://scrapeme.live/shop
- Navigate through the first 5 listing pages and collect all product URLs
- Visit each product page and collect the following data
- Name
- Description
- Price
- Stock
- Image URL
- Product URL
- Save the collected data to a CSV file
Import required libraries
First, we need to import the required libraries.
library('httr')
library('rvest')
library('parallel')
Send request to the website
We use the httr library to collect data from websites. The httr library allows an R program to send HTTP requests and also helps us handle the responses received from the website.
Let’s send a request to https://scrapeme.live/shop
headers <- c(
  "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
  "Accept-Language" = "en-US,en;q=0.5"
)
url <- 'https://scrapeme.live/shop/'
response <- httr::GET(url, add_headers(headers))
The above code sends an HTTP request to the website and stores the response in the variable response.
Now, we need to validate the response
verify_response <- function(response){
  if (status_code(response) == 200){
    return(TRUE)
  } else {
    return(FALSE)
  }
}
Here we verify the website response using the status code: if the status code is 200, the response is valid; otherwise, it is invalid. If we get an invalid response, we can add retries, which often resolves the issue.
max_retry <- 3
while (max_retry >= 1){
  response <- httr::GET(url, add_headers(headers))
  if (verify_response(response)) {
    break
  } else {
    max_retry <- max_retry - 1
  }
}
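Alternatively, httr ships a built-in RETRY() helper that retries a failed request for you, with backoff between attempts. A minimal sketch using the same url and headers as above:
# Retry the GET request up to 3 times before giving up
response <- httr::RETRY("GET", url, add_headers(headers), times = 3)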
Related: Essential HTTP Headers for Web Scraping
Collect Required Data
Now we have the response from the listing page; let’s collect the product URLs. To parse the HTML response, we will use the rvest library.
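The snippets that follow use a parsed HTML document stored in a variable named parser. Here is a minimal sketch of one way to build it from the response we received above:
library('rvest')
# Parse the HTML body of the response into a document rvest can query
parser <- read_html(content(response, as = "text", encoding = "UTF-8"))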
As you can see in the above screenshot, the URL to the product page is in an ‘a’ node having the class name woocommerce-LoopProduct-link woocommerce-loop-product__link. The ‘a’ node comes under another node ‘li’, so its XPath can be written as //li/a[contains(@class, "product__link")].
The URL to the product is in the “href” attribute of that node, so we access the attribute value using rvest as below:
product_urls <- html_nodes(parser, xpath='//li/a[contains(@class, "product__link")]') %>%
  html_attr('href')
Similarly, we can get the next page URL from the next button in the HTML.
Since there are two results for the same XPath and we only want the first one to get the next page URL from the ‘a’ node, we wrap the XPath in brackets () and index it. So the XPath //a[@class="next page-numbers"] becomes (//a[@class="next page-numbers"])[1].
Now we use html_attr from the rvest library to collect the data.
next_page_url <- html_nodes(parser, xpath='(//a[@class="next page-numbers"])[1]') %>%
  html_attr('href')
Now we have the product URLs from the listing page and save them into a list. We paginate through the listing pages and add the product URLs from each page to the same list; a sketch of this loop is shown below. Once the pagination is done, we send requests to the product URLs using the httr library.
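Here is a minimal, hypothetical sketch of that pagination loop, assembled from the snippets above. It reuses headers and verify_response() from earlier, and the product_urls vector it builds is what we pass to mclapply() in the next step.
# Collect product URLs from the first 5 listing pages
product_urls <- c()
current_url <- 'https://scrapeme.live/shop/'
for (page in 1:5) {
  response <- httr::GET(current_url, add_headers(headers))
  if (!verify_response(response)) break
  parser <- read_html(content(response, as = "text", encoding = "UTF-8"))
  # Product links on the current listing page
  page_urls <- html_nodes(parser, xpath='//li/a[contains(@class, "product__link")]') %>%
    html_attr('href')
  product_urls <- c(product_urls, page_urls)
  # URL of the next listing page; stop if there is none
  next_page_url <- html_nodes(parser, xpath='(//a[@class="next page-numbers"])[1]') %>%
    html_attr('href')
  if (length(next_page_url) == 0) break
  current_url <- next_page_url
}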
Process the URLs in parallel
Now we have collected all product URLs. Let’s send the request to the product pages.
Since there are many product URLs and each request takes a few seconds to return a response over the network, the code would otherwise wait for every response before executing the remaining code, leading to a much longer execution time.
To overcome this, we use the parallel library. It has a function called mclapply(), which takes a vector as its first argument and a function as its second argument; it also accepts the number of cores to use via the mc.cores parameter. Here, the get_product_data function is called for each URL in the product_urls vector, and mc.cores defines how many cores are used to run the function. If mc.cores is not set, mclapply() falls back to a default value, and assigning a value greater than the number of available cores does not break the processing.
results <- mclapply(product_urls, get_product_data, mc.cores = 8)
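Note that forking is not available on Windows, where mc.cores greater than 1 raises an error. A sequential fallback with base R’s lapply() produces the same list of results, just one URL at a time:
# Sequential fallback (e.g. on Windows): same result, processed one URL at a time
results <- lapply(product_urls, get_product_data)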
Now let’s collect the required data points: name, description, price, stock, and image URL.
Name
From the image, we can see that the product’s name is inside an h1 node. Since there are no other h1 nodes on the product page, we can simply use the XPath //h1 to select that particular node.
Since the text is inside the node, we use the following code
title <- html_nodes(parser, xpath='//h1') %>% html_text2()
There are two methods in the rvest library to extract text: html_text2() and html_text(). html_text2() cleans and strips unwanted white space from the selected string, whereas html_text() returns the text as it appears on the website.
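A quick illustration of the difference, using rvest’s minimal_html() helper to build a throwaway document:
node <- minimal_html("<p>  Bulbasaur \n  seed pokemon  </p>") %>% html_node("p")
html_text(node)   # raw text: the extra spaces and the line break are preserved
html_text2(node)  # "Bulbasaur seed pokemon": whitespace collapsed and trimmed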
Description
As we can see, the product description is inside a p node, which sits inside a div whose class name contains ‘product-details__short-description’.
We can collect the text inside it as follows.
description <- html_nodes(parser, xpath='//div[contains(@class,"product-details__short-description")]') %>% html_text2()
Stock
Since the stock is directly present inside a p node whose class contains the string ‘in-stock’, we can use the following code to collect it:
stock <- html_nodes(parser, xpath='//p[contains(@class, "in-stock")]') %>% html_text2()
Price
We can get the product’s price using the following code since the price is directly available in the node p having class price.
price <- html_nodes(parser, xpath='//p[@class="price"]') %>% html_text2()
Image URL
We can get the image URL from the attribute href of the node ‘a’ which is selected as shown in the screenshot above.
image_url <- html_nodes(parser, xpath='//div[contains(@class, "woocommerce-product-gallery__image")]/a') %>% html_attr('href')
Now we return the collected data of each product as a new data frame; mclapply() gathers these data frames into a common list. To create the data frame, we can use the following code (product_url here is the page URL passed into the scraping function):
product_data <- data.frame(
  name = title,
  description = description,
  price = price,
  stock = stock,
  image_url = image_url,
  product_url = product_url
)
The code below combines the individual data frames (the results list returned by mclapply()) into a single one.
single_data_frame <- do.call(rbind, results)
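For reference, here is one way the pieces above could be assembled into the get_product_data() function passed to mclapply(). Treat it as a sketch, assuming each XPath matches exactly one element on a product page, rather than the exact implementation from the complete code linked below.
get_product_data <- function(product_url) {
  response <- httr::GET(product_url, add_headers(headers))
  if (!verify_response(response)) {
    return(NULL)
  }
  parser <- read_html(content(response, as = "text", encoding = "UTF-8"))
  # Data points extracted with the XPaths shown above
  title       <- html_nodes(parser, xpath='//h1') %>% html_text2()
  description <- html_nodes(parser, xpath='//div[contains(@class,"product-details__short-description")]') %>% html_text2()
  stock       <- html_nodes(parser, xpath='//p[contains(@class, "in-stock")]') %>% html_text2()
  price       <- html_nodes(parser, xpath='//p[@class="price"]') %>% html_text2()
  image_url   <- html_nodes(parser, xpath='//div[contains(@class, "woocommerce-product-gallery__image")]/a') %>% html_attr('href')
  # Return a one-row data frame for this product
  data.frame(name = title, description = description, price = price,
             stock = stock, image_url = image_url, product_url = product_url)
}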
To save the collected data to CSV format, we can use the following code:
write.csv(single_data_frame, file = "scrapeme_live_R_data.csv", row.names = FALSE)
You can find the complete code here – Web scraping in R
How to send GET requests using cookies and headers
Now, let’s see how to send requests using headers and cookies. First, load the httr library and initialize the required parameters.
library('httr')
# URL to which we send the request
url <- "https://httpbin.org/anything"
# Set the required headers
headers <- c(
"User-Agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
"Accept-Language" = "en-US,en;q=0.5"
)
# Set the cookies if any, else no need to use set_cookies()
cookies <- c(
"cookie_name1" = "cookie_value1",
"cookie_name2" = "cookie_value2"
)
After the initialization, send the request and save the response object to a variable.
# Send the GET request from httr library
response <- GET(url, add_headers(headers), set_cookies(cookies))
# Print the response to see if it is a valid response
print(response)
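Because httpbin.org echoes the request back as JSON, we can parse the body with content() to confirm that the headers and cookies were actually sent (just a sanity check against this test endpoint):
# Parse the JSON body returned by httpbin
parsed <- content(response, as = "parsed")
# The echoed request headers should include our User-Agent and Cookie values
print(parsed$headers)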
In web scraping, the usage of headers is critical to avoid getting blocked.
Sending POST requests in R
Now let’s see how to send a POST request using httr in R.
library(httr)

# Set the URL
url <- "https://httpbin.org/anything"
We initialize the POST request payload as below.
# Set the request body (payload)
payload <- list(
  key1 = "value1",
  key2 = "value2"
)

# Set the headers
headers <- c(
  "Content-Type" = "application/json",
  "Authorization" = "Bearer YOUR_TOKEN",
  "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
)
Here we have to mention the encode type as “json” to convert the payload to JSON before sending the request.
# Send the POST request
response <- POST(url, body=payload, encode ="json", add_headers(headers))
# Print the response to see if it is a valid response
print(response)
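As with the GET example, the echoed response lets us confirm the payload was encoded as JSON; httpbin returns the decoded body under its json field:
# Inspect the payload as the server received it
parsed <- content(response, as = "parsed")
print(parsed$json)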
Conclusion
This tutorial gives you a strong foundation for implementing web scraping in R.
Web Scraping in R has many advantages, one of which is its ability to easily manipulate data. R has libraries specifically designed for data analysis and statistical modeling, which can be used to analyze the gathered information.
These libraries offer a robust set of tools for managing, manipulating, and displaying data in a manner that is efficient and effective.
We’ve covered the installation of the libraries required for web scraping in R, how to create the scraper, and how to send GET and POST requests in R. We also discussed how to gather data using XPath.