Cheerio Web Scraping: A Beginner’s Guide

Web scraping has become essential for extracting valuable data from websites in today’s data-driven world. Cheerio, a fast and lightweight web scraping library for Node.js, provides an API for parsing HTML and manipulating data, making it a popular choice for web scraping tasks.

In this article, we will delve into the fundamentals of Cheerio web scraping and learn how to extract data from websites easily.

Let’s Build Our First Cheerio Scraper

Here we are going to scrape data from https://scrapeme.live/shop, iterating through each product and extracting the necessary data from its page.

How to Set Up Cheerio for Web Scraping

For this, we will use the local installation:

Step 1: Create a directory for this Cheerio project

Step 2: Open the terminal inside the directory and type the following command:

npm init

It will create a package.json file.

Step 3: Type the following command in the same terminal:

npm install axios
npm install cheerio
npm install objects-to-csv
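
After these commands finish, the dependencies section of package.json should look roughly like the following (the version numbers here are illustrative and will vary depending on when you install):

```json
{
  "dependencies": {
    "axios": "^1.4.0",
    "cheerio": "^1.0.0-rc.12",
    "objects-to-csv": "^1.3.6"
  }
}
```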

Scraper Workflow

Let’s see the scraper workflow:

  1. Navigate to the listing page
  2. Extract the product page URLs of each product on the listing page
  3. Navigate to the product page
  4. Extract the required data field from the product page
  5. Repeat steps 1 – 4 for each listing page URL
  6. Save data into a CSV file

Now, let’s see how to build a web scraper using Axios and Cheerio.

First, import the required libraries

const axios = require('axios');
const cheerio = require('cheerio');
const ObjectsToCsv = require("objects-to-csv");

Now, let’s navigate to the listing page. We can use the following line of code to perform the navigation:

const { data } = await axios.get(listingUrl);

On a successful request (status code 200), the above code returns the HTML content of the listing page. The axios.get() method takes the URL string as a parameter and returns the response. It also supports passing URL parameters, headers, proxies, etc., through an optional config object.

The await keyword is used to wait for a promise to settle. await is only valid inside an async function, so the execution context must be asynchronous. Just put async before the declaration of the function in which the asynchronous operation will be executed, as shown below:

async function main() {
    const { data } = await axios.get(listingUrl);
    // ...work with the HTML content here
}
Now we need to extract the data from the HTML content. For that, we can use the cheerio.load() method.

const parser = cheerio.load(data);

The load() method is the easiest way to parse HTML or XML documents with Cheerio. It takes HTML content as an argument and returns an object that can be queried with CSS selectors.

Now, let’s select all listed products. First, to get data for each product, find the HTML element that contains the required data. If we inspect the listed products, we can see that every product is listed inside a <li> tag, with a common class name product.

We can select all such products by looking for all <li> tags with a class name product, which can be represented as the CSS selector li.product.

[Image: inspecting a product element on the listing page]

const products = parser("li.product");

From each product listing, let’s extract the below data points:

  1. Product Name
  2. Product URL
  3. Price
  4. Image URL


[Image: inspecting the product URL on the listing page]

If you inspect the HTML elements, you can see that the product URL is present in the <a> tag.

const productPageUrl = parser(product).find("a").attr("href");

The find("a") method finds all a (anchor) tags that are descendants of the current Cheerio object. In this case, it finds all anchor elements within the li.product element.

The attr("href") method retrieves the value of the href attribute of the first element in the Cheerio object. In this case, it retrieves the href value of the first anchor (a) tag within the li.product element.

Now, let’s navigate to the product page. As with the listing page, send the request to the product URL using Axios and use the Cheerio library to parse the response:

const { data } = await axios.get(productPageUrl);
const parser = cheerio.load(data);

We need to extract the following data points from each product:

  • Description
  • Title
  • Price
  • Stock
  • SKU
  • Image URL

[Image: inspecting the product page data]

By inspecting the data, we can see that the selectors are:

  • Description: div.woocommerce-product-details__short-description
  • Title: h1.product_title
  • Price: p.price>span.woocommerce-Price-amount.amount
  • Stock: p.stock
  • SKU: span.sku
  • Image URL: figure>div>a (href attribute)

const description = parser("div.woocommerce-product-details__short-description").text();
const title = parser("h1.product_title").text();
const price = parser("p.price>span.woocommerce-Price-amount.amount").text();
const stock = parser("p.stock").text(); // p.stock is the default WooCommerce stock markup
const sku = parser("span.sku").text();
const imageUrl = parser("figure>div>a").attr("href");

The text() method retrieves the text content of the matched elements. Now save these data fields into an object on each iteration and push it to productDataFields[]:

productDataFields.push({ title, price, stock, sku, imageUrl, description })

After completing the iterations, we need to save the data into a CSV file. Here we just need to pass productDataFields[] to the ObjectsToCsv() constructor and pass the file path as a string to the toDisk() method, as shown below:

const csv = new ObjectsToCsv(productDataFields);
await csv.toDisk('./products.csv'); // path of the output CSV file

Complete code:

How to Use Proxies and Headers in Axios?

HTTP headers are important in conveying additional information between the client and server along with HTTP requests and responses. A few applications of headers include:

  1. Authorization: Headers serve as a means for transmitting authentication data, like a user’s credentials or an API key, to the server.
  2. Protection: Headers play a key role in establishing security measures, such as defining the origin of a request or guarding against cross-site scripting (XSS) attacks.
  3. Content Management: Through headers, clients can negotiate the format or encoding of the content that the server returns, making it possible to request specific content.

Similarly, proxies play a very important role when it comes to scraping. To learn more about proxy rotation, please go through our article How To Rotate Proxies and Change IP Addresses.

const axios = require('axios');
const res = await axios.get('', {
    proxy: {
        host: '<ip>',
        port: <port>
    },
    headers: {
        'content-type': 'text/json'
    }
});

How to Send a POST Request Using Axios?

We can make a POST request using Axios to a given endpoint and trigger events. To perform an HTTP POST request, we can use the axios.post() method, which takes two parameters: the endpoint URL and an object containing the data you want to send to the server.

The importance of POST requests is listed below:

  1. POST requests are often used to submit data to a server, such as search queries. This can be useful when you need to submit data to a server in order to retrieve specific information.
  2. POST requests are secure as they don’t expose data in the URL.
  3. Websites use POST requests to dynamically load content on a page like pagination and filters.

const axios = require('axios');
const res = await axios.post('', { hello: 'world' }, {
    headers: {
        'content-type': 'text/json'
    }
});

Features of Cheerio

  1. Familiar Syntax: Cheerio incorporates a portion of the jQuery core and eliminates any inconsistent DOM and unwanted browser elements, presenting a superior API.
  2. Extremely fast: Cheerio works with a simple, consistent DOM model, which makes parsing, manipulating, and rendering very fast.
  3. Highly adaptable: Cheerio uses the parse5 parser by default and can also use htmlparser2, so it can parse almost any HTML or XML document.


In conclusion, web scraping with Axios and Cheerio is a powerful technique for extracting data from websites. By leveraging these tools and following best practices, developers can automate data collection and gain valuable insights into markets, competitors, and customer behavior.
