Cheerio Web Scraping: A Beginner’s Guide

Web scraping has become essential for extracting valuable data from websites in today’s data-driven world. Cheerio, a fast and lightweight HTML parsing library for Node.js, provides a jQuery-like API for parsing HTML and manipulating the resulting document, making it a popular choice for web scraping tasks.

In this article, we will delve into the fundamentals of Cheerio web scraping and learn how to extract data from websites easily.

Let’s build our first Cheerio Scraper

Here, we will scrape data from https://scrapeme.live/shop, iterating through each product and extracting the necessary data from its page.

How to Set Up Cheerio for Web Scraping

For this, we will use the local installation:

Step 1: Create a directory for this Cheerio project

Step 2: Open the terminal inside the directory and type the following command:

npm init

This creates a package.json file in the directory.

Step 3: Type the following command in the same terminal:

npm install axios
npm install cheerio
npm install objects-to-csv

Scraper Workflow

Let’s see the scraper workflow:

  1. Navigate to the listing page
  2. Extract the product page URLs of each product on the listing page
  3. Navigate to product page
  4. Extract the required data field from the product page
  5. Repeat steps 1 – 4 for each listing page URL
  6. Save data into a CSV file
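
The steps above can be sketched as a small skeleton before any real scraping code is written. This is only an outline under assumptions: fetchPage, extractProductUrls, and extractFields are hypothetical callbacks supplied by the caller, not functions from Axios or Cheerio, and saving to CSV is left out.

```javascript
// Workflow skeleton. fetchPage, extractProductUrls and extractFields are
// hypothetical callbacks supplied by the caller; they are not library functions.
async function runWorkflow(listingUrls, { fetchPage, extractProductUrls, extractFields }) {
    const rows = [];
    for (const listingUrl of listingUrls) {
        // Step 1: navigate to the listing page
        const listingHtml = await fetchPage(listingUrl);
        // Step 2: collect the product page URLs
        const productUrls = extractProductUrls(listingHtml);
        for (const productUrl of productUrls) {
            // Step 3: navigate to the product page
            const productHtml = await fetchPage(productUrl);
            // Step 4: extract the required fields
            rows.push(extractFields(productHtml));
        }
        // Step 5: the outer loop repeats this for every listing page URL
    }
    // Step 6 (saving to CSV) is left to the caller
    return rows;
}
```

Passing the fetching and parsing logic in as callbacks keeps the control flow easy to test with stub data before it is wired to real requests.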

Now, let’s see how to build a web scraper using Axios and Cheerio.

First, import the required libraries:

const axios = require('axios');
const cheerio = require('cheerio');
const ObjectsToCsv = require("objects-to-csv");

Now, let’s navigate to the listing page. We can use the below code line to perform the navigation:

const { data } = await axios.get(listingUrl);

On success, the call above resolves to a response object whose data property holds the HTML content of the listing page and whose status is 200. The axios.get() method takes the URL string as its first parameter and returns a promise for the response. It also accepts a second config parameter for URL parameters, headers, proxies, etc.

The await keyword waits for a promise to settle. It is only valid inside an async function (or at the top level of an ES module), so the execution context must be asynchronous. Just put async before the declaration of the function where the asynchronous operation runs, as shown below:

async function main() {
    // await is legal here because main is declared async
    const res = await axios.post('https://httpbin.org/post', { hello: 'world' }, {
        headers: {
            'content-type': 'application/json'
        }
    });
}
main();
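
To see what await does with a successful and a failed promise without touching the network, here is a minimal sketch. fakeGet is a hypothetical stand-in for a call such as axios.get(); it is not part of Axios.

```javascript
// fakeGet is a hypothetical stand-in for a network call such as axios.get().
function fakeGet(url) {
    return new Promise((resolve, reject) => {
        if (url.startsWith('https://')) {
            resolve({ status: 200, data: '<html></html>' });
        } else {
            reject(new Error('unsupported URL: ' + url));
        }
    });
}

async function getHtml(url) {
    try {
        // Execution of getHtml pauses here until the promise settles.
        const res = await fakeGet(url);
        return res.data;
    } catch (err) {
        // A rejected promise surfaces as a thrown error at the await.
        return null;
    }
}
```

Note that getHtml itself returns a promise, so its caller also has to await it or chain .then() on it.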

Now we need to extract the data from the HTML content. For that, we can use the cheerio.load() method.

const parser = cheerio.load(data);

The load method is the easiest way to parse HTML or XML documents with Cheerio. It takes the HTML content as an argument and returns a function bound to the parsed document, which we can then call with CSS selectors to query it.

Now, let’s select all listed products. First, to get data for each product, find the HTML element that contains the required data. If we inspect the listed products, we can see that every product is listed inside a <li> tag, with a common class name product.

We can select all such products by looking for all <li> tags with a class name product, which can be represented as the CSS selector li.product.

const products = parser("li.product");

From each product listing, let’s extract the below data points:

  1. Product Name
  2. Product URL
  3. Price
  4. Image URL

If you inspect the HTML elements, you can see that the product URL is in the href attribute of the <a> tag.

const productPageUrl = parser(product).find("a").attr("href");

The method find("a") finds all a (anchor) tags that are descendants of the current Cheerio object. In this case, it finds all anchor elements within the li.product element.

The method attr("href") retrieves the value of the href attribute of the first element in the Cheerio object. In this case, it retrieves the value of the href attribute of the first anchor (a) tag within the li.product element.

Now, let’s navigate to the product page. As with the listing page, send the request to the product URL using Axios and use the Cheerio library to parse the data:

const { data } = await axios.get(productPageUrl);
const parser = cheerio.load(data);

We need to extract the following data points from each product:

  • Description
  • Title
  • Price
  • Stock
  • SKU
  • Image URL

By inspecting the data, we can see that the selectors are:

  • Description: div.woocommerce-product-details__short-description
  • Title: h1.product_title
  • Price: p.price>span.woocommerce-Price-amount.amount
  • Stock: p.stock.in-stock
  • SKU: span.sku
  • Image URL: figure>div>a (read its href attribute)

const description = parser("div.woocommerce-product-details__short-description").text();
const title = parser("h1.product_title").text();
const price = parser("p.price>span.woocommerce-Price-amount.amount").text();
const stock = parser("p.stock.in-stock").text();
const sku = parser("span.sku").text();
const imageUrl = parser("figure>div>a").attr("href");

The method text() returns the combined text content of every element in the Cheerio object, including their descendants. Now save these data fields into an object on each iteration and push it to productDataFields[]:

productDataFields.push({ title, price, stock, sku, imageUrl, description })
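
The strings returned by text() often carry extra whitespace, and prices come with a currency symbol, so it can help to normalize them before pushing. The helpers below are a hypothetical sketch, not part of Cheerio. parsePrice assumes "." is the decimal separator and "," a thousands separator.

```javascript
// Collapse runs of whitespace and trim the ends.
function cleanText(raw) {
    return raw.replace(/\s+/g, ' ').trim();
}

// Pull a number out of a price string such as "£63.00".
// Assumes "." is the decimal separator and "," is a thousands separator.
function parsePrice(raw) {
    const match = raw.replace(/,/g, '').match(/\d+(\.\d+)?/);
    return match ? Number(match[0]) : null;
}
```

For example, cleanText(stock) and parsePrice(price) could be applied to the values extracted above before they are stored.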

After completing the iterations, we need to save the data into a CSV file. We just pass productDataFields[] to the ObjectsToCsv() constructor and pass the file path as a string to the toDisk() method, as shown below:

const csv = new ObjectsToCsv(productDataFields);
// toDisk() returns a promise, so await it before the script exits
await csv.toDisk("<PATH>");
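
To make the shape of the resulting file concrete, here is a hand-rolled sketch of turning an array of objects into CSV text. It is a simplified stand-in for illustration, not how objects-to-csv is implemented, and it assumes every object has the same keys as the first one.

```javascript
// Quote a value when it contains a comma, quote, or line break (RFC 4180 style).
function escapeCsv(value) {
    const text = String(value ?? '');
    return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Build CSV text: one header row from the first object's keys, then one row per object.
function toCsv(rows) {
    if (rows.length === 0) return '';
    const columns = Object.keys(rows[0]);
    const lines = [columns.join(',')];
    for (const row of rows) {
        lines.push(columns.map((key) => escapeCsv(row[key])).join(','));
    }
    return lines.join('\n');
}
```

Each object becomes one row, and the object keys (title, price, and so on) become the column headers.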

Complete code:

https://github.com/scrapehero-code/cheerio-web-scraping/blob/main/cheerio-web-scraper.js

How to use proxies and headers in Axios?

HTTP headers carry additional information between the client and server alongside HTTP requests and responses. A few applications of headers include:

  1. Authorization: Headers serve as a means for transmitting authentication data, like a user’s credentials or an API key, to the server.
  2. Protection: Headers play a key role in establishing security measures, such as defining the origin of a request or guarding against cross-site scripting (XSS) attacks.
  3. Content Management: Through headers, clients can negotiate the format or encoding of the content that the server returns, making it possible to request specific content.

Similarly, proxies play a very important role in scraping. To learn more about proxy rotation, please go through our article How To Rotate Proxies and change IP Addresses. The example below sends a request through a proxy with custom headers:

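const axios = require('axios');

async function main() {
    // Replace <ip> and <port> with your proxy's address; the port must be a number
    const res = await axios.get('http://httpbin.org/get?answer=42', {
        proxy: {
            host: '<ip>',
            port: <port>
        },
        // The key must be lowercase "headers"; Axios ignores "Headers"
        headers: {
            'content-type': 'application/json'
        }
    });
}
main();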
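
Proxy providers often hand out addresses as a single URL string. The sketch below converts such a string into the object shape that the Axios proxy option accepts (protocol, host, port, and an optional auth object with username and password). proxyFromUrl is a hypothetical helper, and the address in the usage note is a placeholder rather than a working proxy.

```javascript
// Convert a proxy URL string into the shape of the Axios `proxy` request option.
// proxyFromUrl is a hypothetical helper, not part of Axios.
function proxyFromUrl(proxyUrl) {
    const url = new URL(proxyUrl);
    const defaultPort = url.protocol === 'https:' ? 443 : 80;
    const proxy = {
        protocol: url.protocol.replace(':', ''),
        host: url.hostname,
        // URL.port is an empty string when the URL uses the protocol's default port.
        port: url.port ? Number(url.port) : defaultPort,
    };
    if (url.username) {
        // Credentials in a URL are percent-encoded, so decode them first.
        proxy.auth = {
            username: decodeURIComponent(url.username),
            password: decodeURIComponent(url.password),
        };
    }
    return proxy;
}
```

The result can then be passed along with the request, for example axios.get(url, { proxy: proxyFromUrl('http://user:secret@203.0.113.10:8080') }).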

How to Send a POST Request Using Axios?

We can make a POST request using Axios to a given endpoint and trigger events. To perform an HTTP POST request, we can use the axios.post() method, which takes the endpoint URL, an object containing the data you want to send to the server, and an optional config object (for headers, proxies, etc.).

POST requests matter for web scraping for a few reasons:

  1. POST requests are often used to submit data to a server, such as search queries. This is useful when you need to send data to a server in order to retrieve specific information.
  2. POST requests carry their data in the request body rather than in the URL, so it does not appear in browser history or server access logs. The body is only encrypted when the request goes over HTTPS.
  3. Websites use POST requests to load content dynamically, for example for pagination and filters.

const axios = require('axios');

async function main() {
    // Arguments: endpoint URL, request body, optional config
    const res = await axios.post('https://httpbin.org/post', { hello: 'world' }, {
        headers: {
            'content-type': 'application/json'
        }
    });
}
main();
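
For pagination or filter requests, the body is often form-encoded rather than JSON. The sketch below builds such a body with the built-in URLSearchParams class. The field names (q, page, and the filter keys) are hypothetical; the real ones have to be read from the target site's network traffic.

```javascript
// Build a form-encoded body for a paginated search request.
// The field names ("q", "page", and the filter keys) are hypothetical;
// inspect the target site's network traffic to find the real ones.
function buildSearchBody(query, page, filters = {}) {
    const body = new URLSearchParams();
    body.set('q', query);
    body.set('page', String(page));
    for (const [name, value] of Object.entries(filters)) {
        body.set(name, String(value));
    }
    return body;
}
```

The returned object can be passed as the second argument to axios.post(), which should then send it as application/x-www-form-urlencoded data.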

Features of Cheerio

  1. Familiar syntax: Cheerio implements a subset of core jQuery and removes the DOM inconsistencies and browser-specific cruft, leaving a clean API.
  2. Extremely fast: Cheerio works with a simple, consistent DOM model, which makes parsing, manipulating, and rendering quick.
  3. Highly adaptable: Cheerio wraps the parse5 parser and can optionally use htmlparser2, so it can parse almost any HTML or XML document.

Conclusion

In conclusion, web scraping with Axios and Cheerio is a powerful technique for extracting data from websites. By leveraging these tools and following best practices, developers can automate data collection and gain valuable insights into markets, competitors, and customer behavior.
