Web Scraping with Puppeteer and NodeJS

Puppeteer is a Node.js library that provides a powerful yet simple API for controlling Google’s Chrome or Chromium browser. It can also run Chromium in headless mode, which is useful on servers: the browser sends and receives requests without a user interface, working in the background and performing the actions the API instructs.

What can you do with Puppeteer?

Puppeteer can do almost everything Google Chrome or Chromium can do.

  • Click elements such as buttons, links, and images
  • Type like a user in input boxes and automate form submissions
  • Navigate pages, click on links and follow them, go back and forward etc.
  • Take a timeline trace to find out where the issues are in a website
  • Carry out automated testing for user interfaces and various front-end apps, directly in a browser
  • Take screenshots and convert web pages to PDFs
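As a rough illustration of the list above, the sketch below strings several of these actions together. It assumes a `page` object obtained from `browser.newPage()`. The function name `demoActions`, the selectors `#search` and `button[type="submit"]`, and the output file names are placeholders we made up, not taken from any real site.

```javascript
// Sketch only: demoActions, the selectors and the file names are hypothetical.
// `page` is expected to be a Puppeteer Page (from browser.newPage()).
async function demoActions(page, url) {
  await page.tracing.start({ path: 'trace.json' }); // begin a timeline trace
  await page.goto(url);                              // navigate to the page
  await page.type('#search', 'Singapore');           // type into an input box
  await Promise.all([
    page.waitForNavigation(),                        // wait for the page load the click triggers
    page.click('button[type="submit"]'),             // click a button
  ]);
  await page.screenshot({ path: 'page.png' });       // save a screenshot
  await page.pdf({ path: 'page.pdf' });              // export the page as a PDF
  await page.goBack();                               // go back one step in history
  await page.tracing.stop();                         // finish and write the trace file
}
```

Because the function only needs an object with these methods, it can be tried against a real Puppeteer page or against a stub in a unit test.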

The developer community for Puppeteer is very active, and new updates are rolled out regularly. Its full-fledged API covers most of the actions that can be performed in a Chrome browser, which makes it one of the best options for scraping JavaScript-heavy websites.

In this tutorial, we’ll show you how to create a scraper for Booking.com to scrape the details of hotel listings in a particular city from the first page of results. We will scrape the hotel name, rating, number of reviews and price for each hotel listing.
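To make the target data concrete, here is a small sketch of how the four fields could be turned into a tidy record once their text has been read from the page. The helper names `toNumber` and `normalizeListing`, the sample values, and the assumed text formats (such as "1,234 reviews") are illustrative assumptions on our part, not real Booking.com data.

```javascript
// Hypothetical helpers; the input formats below are assumptions for illustration.

// Pulls the first number out of a string, ignoring thousands separators.
function toNumber(text) {
  if (typeof text !== 'string') return null;
  const match = text.replace(/,/g, '').match(/\d+(\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Converts the raw strings of one listing into typed fields.
function normalizeListing(raw) {
  return {
    name: raw.name ? raw.name.trim() : null,
    rating: toNumber(raw.rating),
    reviews: toNumber(raw.reviews),
    price: toNumber(raw.price),
  };
}

// Made-up sample input, not scraped data.
console.log(normalizeListing({
  name: ' Example Hotel ',
  rating: '8.4',
  reviews: '1,234 reviews',
  price: 'S$ 180',
}));
// prints { name: 'Example Hotel', rating: 8.4, reviews: 1234, price: 180 }
```

This simple parser treats commas as thousands separators, so prices written with a decimal comma would need different handling.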

Required Tools

To use Puppeteer, you first need to install Node.js; the code that controls the browser (the scraper) is written in JavaScript. Node.js runs the script and lets you control the Chrome browser through the Puppeteer library. Puppeteer requires Node v7.6.0 or greater; for this tutorial, we will go with Node v9.0.0.
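Node v7.6.0 is the first release with async/await enabled by default, which Puppeteer scripts use heavily. If you are unsure which version is installed, a short check like the following can help. It is a minimal sketch, and `meetsMinimum` is a helper name we made up.

```javascript
// Returns true when `version` (e.g. "9.0.0") is at least `minimum` (e.g. "7.6.0").
function meetsMinimum(version, minimum) {
  const a = version.split('.').map(Number);
  const b = minimum.split('.').map(Number);
  for (let i = 0; i < 3; i++) {
    const x = a[i] || 0;
    const y = b[i] || 0;
    if (x > y) return true;
    if (x < y) return false;
  }
  return true; // all three parts are equal
}

// process.versions.node holds the running version without the leading "v".
if (!meetsMinimum(process.versions.node, '7.6.0')) {
  console.error('Node ' + process.versions.node + ' is too old; Puppeteer needs 7.6.0 or newer.');
  process.exit(1);
}
```

Running `node -v` in a terminal gives the same information without any code.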

Installing Node.js

Linux

You can head over to NodeSource and choose the distribution you want. Here are the steps to install Node.js on Ubuntu 16.04:

1. Open a terminal and run – sudo apt install curl in case curl is not installed.

2. Then run – curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -

3. Once that’s done, install Node.js by running sudo apt install nodejs. This will automatically install npm.

Note that the setup_8.x script selects the Node 8.x release line, which meets Puppeteer’s minimum version.

Windows and Mac

To install Node.js on Windows or Mac, download the package for your OS from the Node.js website: https://nodejs.org/en/download/

Obtaining the URL

Let’s start by obtaining the booking URL. Go to booking.com, search for a city, and fill in the check-in and check-out dates. Click the search button and copy the URL that is generated. This will be your booking URL.

The gif below shows how to obtain the booking URL for hotels available in Singapore.

[GIF: obtaining the booking URL]

After you have completed the installation of Node.js, install the project requirements, which will also download the Puppeteer library used by the scraper. Download both files, app.js and package.json, from below and place them inside a folder. We have named our folder booking_scraper.

The script below is the scraper. We have named it app.js. This script will scrape the results for a single listing page:
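For orientation, here is a minimal sketch of how a single-page scraper along these lines might be structured. It is not the authors' app.js. The CSS selectors in `SELECTORS` are placeholders we invented and will not match the real results page, so they must be replaced after inspecting the page markup. The helper names `extractCards` and `scrapePage` are also our own.

```javascript
// Hypothetical sketch of a single-page scraper. It is NOT the original app.js.
const bookingUrl = 'insert booking URL';

// ASSUMPTION: placeholder CSS selectors. Inspect the live results page and
// replace them with selectors that actually match its markup.
const SELECTORS = {
  card: '.hotel-card',
  name: '.hotel-name',
  rating: '.hotel-rating',
  reviews: '.hotel-reviews',
  price: '.hotel-price',
};

// Runs inside the browser (via page.$$eval): reads each field's text from every card.
function extractCards(cards, sel) {
  const text = (card, selector) => {
    const el = card.querySelector(selector);
    return el ? el.textContent.trim() : null; // null when a field is missing
  };
  return cards.map((card) => ({
    name: text(card, sel.name),
    rating: text(card, sel.rating),
    reviews: text(card, sel.reviews),
    price: text(card, sel.price),
  }));
}

// Loads the results page and returns one plain object per hotel card.
async function scrapePage(page, url) {
  await page.goto(url, { waitUntil: 'networkidle2' });
  return page.$$eval(SELECTORS.card, extractCards, SELECTORS);
}

async function main() {
  const puppeteer = require('puppeteer'); // loaded here so the helpers above work without it
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    console.dir(await scrapePage(page, bookingUrl));
  } finally {
    await browser.close(); // always release the browser, even after an error
  }
}

// Start the browser only once a real URL has replaced the placeholder.
if (bookingUrl.startsWith('http')) {
  main().catch((err) => {
    console.error(err);
    process.exit(1);
  });
}
```

Keeping the extraction logic in its own function that only touches the DOM makes it easy to adjust the selectors without changing the browser-handling code.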

The file below is package.json, which lists the libraries needed to run the scraper:

Next, install the project dependencies, which will also install Puppeteer.

  • Go to the project directory in a terminal and make sure it has the package.json file inside it.
  • Run npm install to install the dependencies. This also installs Puppeteer and downloads the Chromium browser that runs the Puppeteer code. By default, Puppeteer works with the Chromium browser, but you can also use Chrome.
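If you would rather use an installed Chrome than the bundled Chromium, Puppeteer's launch options accept an `executablePath`. The sketch below builds such an options object; `buildLaunchOptions` is a hypothetical helper and the path shown is only an example that will differ between systems.

```javascript
// Builds the options object for puppeteer.launch().
// chromePath is a hypothetical argument: the location of a Chrome binary on your machine.
function buildLaunchOptions(chromePath) {
  const options = { headless: true };
  if (chromePath) {
    options.executablePath = chromePath; // use this browser instead of the bundled Chromium
  }
  return options;
}

console.log(buildLaunchOptions());                         // bundled Chromium
console.log(buildLaunchOptions('/usr/bin/google-chrome')); // example path, adjust for your system
```

The result would be passed straight to the launcher, for example `puppeteer.launch(buildLaunchOptions('/usr/bin/google-chrome'))`.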

Now copy the URL that was generated from booking.com and paste it into the bookingUrl variable in the provided space (line 3 in app.js). Make sure the URL is enclosed in quotes; otherwise, the script will not work.
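A quick sanity check on the pasted value can catch a missing or malformed URL before the browser starts. The following is a sketch under our own assumptions: `isBookingUrl` is a made-up helper, and the sample URL is illustrative rather than a real search link.

```javascript
const { URL } = require('url'); // built into Node.js

// Hypothetical check: true when `value` is a string holding an http(s) URL on booking.com.
function isBookingUrl(value) {
  if (typeof value !== 'string') return false;
  try {
    const parsed = new URL(value);
    const isHttp = parsed.protocol === 'http:' || parsed.protocol === 'https:';
    return isHttp && /(^|\.)booking\.com$/.test(parsed.hostname);
  } catch (err) {
    return false; // the string could not be parsed as a URL
  }
}

// Illustrative values only.
console.log(isBookingUrl('https://www.booking.com/searchresults.html?ss=Singapore')); // true
console.log(isBookingUrl('insert booking URL'));                                      // false
```

A value that is not wrapped in quotes never reaches this check, because JavaScript rejects the file with a syntax error first.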

Running the Scraper

To run a Node.js program, type node followed by the name of the script file:

node filename.js

For this script, it will be:

node app.js

Turning off Headless Mode

The script above runs the browser in headless mode. To turn headless mode off, change this line:

const browser = await puppeteer.launch({ headless: true });

to:

const browser = await puppeteer.launch({ headless: false });

You should then be able to see what is going on.
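When watching the browser, actions can go by too quickly to follow. Puppeteer's `slowMo` launch option delays each operation by a number of milliseconds. The sketch below switches between a visible, slowed-down browser and a headless one; `launchOptionsFor` and the 250 ms delay are our own illustrative choices.

```javascript
// Hypothetical helper: pick launch options depending on whether we want to watch the run.
function launchOptionsFor(debug) {
  return debug
    ? { headless: false, slowMo: 250 } // visible window, each operation delayed by 250 ms
    : { headless: true };              // no window, full speed
}

console.log(launchOptionsFor(true));
console.log(launchOptionsFor(false));
```

In the scraper, this would be used as `puppeteer.launch(launchOptionsFor(true))` while debugging.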

The program will run, fetch all the hotel details, and display them in the terminal. If you want to scrape another page, you can change the URL in the bookingUrl variable and run the program again.

Here is what the output for hotels in Singapore looks like:
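To avoid editing the file for every new search, one option is to read the URL from the command line and fall back to the value in the file. This is a possible tweak rather than part of the original script, and `pickUrl` is a name we made up.

```javascript
// Hypothetical tweak: take the URL from the command line, else use the fallback.
function pickUrl(argv, fallback) {
  // argv[0] is the node binary, argv[1] the script path, argv[2] the first user argument
  return argv.length > 2 ? argv[2] : fallback;
}

const bookingUrl = pickUrl(process.argv, 'insert booking URL');
console.log('Scraping: ' + bookingUrl);
```

The script could then be started as node app.js "<your booking URL>", with the URL quoted so the shell does not interpret the & characters in it.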

Known Limitations

When using Puppeteer, you should keep some things in mind. Since Puppeteer opens a full browser, it takes a lot of memory and CPU to run compared with approaches that fetch and parse pages without a browser.


If you want to scrape a simple website that does not rely on a JavaScript-heavy front end, use a simple Python scraper.

You will find Puppeteer to be a bit slow, as the scraper here opens only one page at a time and starts scraping a page only once it has fully loaded. Puppeteer scripts can only be written in JavaScript; it does not support any other languages.
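One way to soften the waiting is to open a few pages in the same browser and load them in parallel. The sketch below processes URLs in small batches. `scrapeInBatches` is our own name, `scrapeOne(page, url)` is a hypothetical function that scrapes one page, and `browser` is assumed to come from `puppeteer.launch()`.

```javascript
// Processes URLs in batches of `size`, using one page (tab) per URL within a batch.
// scrapeOne(page, url) is a hypothetical function that returns the data for one page.
async function scrapeInBatches(browser, urls, scrapeOne, size) {
  const results = [];
  for (let i = 0; i < urls.length; i += size) {
    const batch = urls.slice(i, i + size);
    const batchResults = await Promise.all(batch.map(async (url) => {
      const page = await browser.newPage();
      try {
        return await scrapeOne(page, url);
      } finally {
        await page.close(); // free the tab whether or not scraping succeeded
      }
    }));
    results.push(...batchResults); // Promise.all keeps the input order
  }
  return results;
}
```

A small batch size keeps memory use in check, since every open page adds to the browser's footprint. Note that if any page in a batch fails, `Promise.all` rejects and the whole call fails, so real use may want per-URL error handling.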


Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

 

 
