Web Scraping using Playwright in Python and JavaScript

Browser-based web scraping is often the quickest and easiest way to scrape JavaScript-heavy, client-side rendered web pages. There are multiple frameworks available to build and run browser-based web scrapers. The most common among these are Selenium, Puppeteer, and Playwright. We have already covered Selenium and Puppeteer in our previous articles. Now, let’s take a look at Playwright, the browser automation framework from Microsoft.

What is Playwright?

Playwright is a browser automation framework with APIs available in JavaScript, Python, .NET, and Java. Its simplicity and powerful automation capabilities make it a good fit for web scraping. It also supports running browsers in headless mode.

Features of Playwright:

  1. Cross-browser: Playwright drives all modern browser engines: Chromium (Google Chrome, Microsoft Edge), WebKit (the engine behind Apple Safari), and Mozilla Firefox. A custom browser executable can be supplied with the executable_path argument. Scrapers can run on multiple browsers simultaneously, and varying the browser and operating system can help reduce bot detection. Running the same scraper on each browser also shows which one is fastest for a given site.
  2. Cross-platform: With Playwright, you can test how your applications perform in different browser builds for Windows, Linux, and macOS.
  3. Cross-language: Playwright supports multiple programming languages, including JavaScript, TypeScript, Python, Java, and .NET, with documentation and community support.
  4. Auto-wait: Playwright performs a series of checks on elements before performing actions, to ensure that those actions work as expected. It waits until all relevant checks have passed before performing the requested action. If the required checks do not pass within the specified timeout, the action fails with a TimeoutError.
  5. Web-first assertions: Playwright assertions are designed for the dynamic web. An assertion re-fetches the node and re-checks the condition until it is met or the assertion times out. In older releases no assertion timeout was set by default, so a failing assertion waited until the whole test timed out; recent releases use a default of 5 seconds.
  6. Proxies: Playwright supports the use of proxies. The proxy can be set either globally for the entire browser or for each browser context individually.
  7. Browser contexts: We can create individual browser contexts for each test within a single browser instance. A browser context is equivalent to a brand new browser profile, which gives full isolation with very little overhead. This is useful for multi-user scenarios and for web scraping with complete isolation. We can also set cookies, user agent, viewport, and proxy, and enable or disable JavaScript, for individual contexts.

Installation

Python:

Install the python package:

pip install playwright

Install the required browsers:

playwright install

Javascript:

Install using npm

npm init -y
npm install playwright@latest

Install csv writer

npm i objects-to-csv

Building a scraper

Let’s create a scraper using Playwright to scrape data from the first three listing pages of https://scrapeme.live/shop. We will collect the following data points:

  • Name
  • Price
  • Image URL

You can view the complete code here:
Python: https://github.com/scrapehero-code/playwright-webscraping/blob/main/intro/scraper.py

Javascript: https://github.com/scrapehero-code/playwright-webscraping/blob/main/intro/scraper.js

Import the required libraries:

In Python, Playwright supports both synchronous and asynchronous operations. Node.js is asynchronous by nature, so Playwright only offers an asynchronous API there.

In this article, we use the asynchronous API in both languages.

# Python
from playwright.async_api import async_playwright
import asyncio
// Javascript
const { chromium } = require('playwright');

Launch the Browser instance:

Here, we can define the browser (Chromium, Firefox, or WebKit) and pass the required arguments.

Async/await lets a function pause while it waits for a result, such as a page load, without blocking the rest of the program. This can improve performance because other tasks run on the same thread during the wait, instead of every operation finishing before the next one starts. The await keyword releases the flow of control back to the event loop.
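To make the scheduling behaviour concrete, here is a minimal standard-library sketch with no Playwright involved. The fake_request coroutine and its delays are made up for illustration, and asyncio.sleep stands in for a slow network call. All three coroutines share one thread, yet the total time should be close to the longest single wait rather than the sum of the waits.

```python
import asyncio
import threading
import time


async def fake_request(label: str, delay: float) -> tuple:
    # asyncio.sleep stands in for a slow network call; while this coroutine
    # waits, the event loop is free to run the other coroutines.
    await asyncio.sleep(delay)
    return label, threading.get_ident()


async def run_all() -> list:
    # gather() schedules the three coroutines concurrently on one event loop
    # and returns their results in the order they were passed in.
    return await asyncio.gather(
        fake_request("page-1", 0.3),
        fake_request("page-2", 0.3),
        fake_request("page-3", 0.3),
    )


start = time.perf_counter()
results = asyncio.run(run_all())
elapsed = time.perf_counter() - start

labels = [label for label, _ in results]
thread_ids = {thread_id for _, thread_id in results}

print(labels)                             # labels, in call order
print(len(thread_ids))                    # number of distinct threads used
print(f"finished in {elapsed:.2f} seconds")
```

Running the three waits one after another would take about 0.9 seconds; the concurrent version should finish in roughly a third of that while reporting a single thread.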

# Python
# Launch the headed browser instance
browser = await playwright.chromium.launch(headless=False)
# Python
# Launch the headless browser instance
browser = await playwright.chromium.launch(headless=True)
// Javascript
// Launch headless browser instance
const browser = await chromium.launch({
  headless: true,
});
// Javascript
// Launch headed browser instance
const browser = await chromium.launch({
  headless: false,
});

Create a new browser context:

Playwright allows us to create a new context from an existing browser instance that won’t share cookies/cache with other browser contexts.

# Python
# Creates a new browser context
context = await browser.new_context()
// Javascript
// Creates a new browser context
const context = await browser.newContext();

Create a page from the browser context:

# Python
# opens new page
page = await context.new_page()
// Javascript
// Open new page
const page = await context.newPage();

This will open a Chromium browser. Now, let’s navigate to the listing page. We can use the below code lines to perform the navigation:

# Python
# Go to https://scrapeme.live/shop
await page.goto('https://scrapeme.live/shop')
// Javascript
// Go to https://scrapeme.live/shop
await page.goto('https://scrapeme.live/shop');

Find and select all product listings:

The products (Pokémon) are listed on this page. To get the data for each product, we first need to find the element that contains it and then extract the data from that element.

If we inspect one of the product listings, we can see that every product is inside a <li> tag, with a common class name “product”.

We can select all such products by looking for all <li> tags with the class name “product”, which can be represented as the CSS selector li.product.

[Image: Playwright selecting elements]

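The query_selector_all method returns all the elements that match the selector. If no elements match the selector, it returns an empty list ( [] ). The JavaScript snippet uses $$eval instead, which finds the same elements and passes them to a callback that runs inside the page.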

# Python
all_items = await page.query_selector_all('li.product')
// Javascript
const product = await page.$$eval('li.product', all_items => {})

Select data from each listing:

From each product listing, we need to extract the following data points:

  • Name
  • Image URL
  • Price

In order to get these details, we need to find the CSS Selectors for the data points. You can do that by inspecting the element, and finding the class name and tag name.

[Image: Playwright selects data from elements using CSS selector]

We can now see that the selectors are:

  • Name: h2
  • Price: span.woocommerce-Price-amount
  • Image URL: a.woocommerce-LoopProduct-link.woocommerce-loop-product__link > img

We can use the query_selector method (querySelector in JavaScript) to select individual elements. It returns the first matching element, or None in Python (null in JavaScript) if no element matches the selector. You can see the implementation below:

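# Python
# Looping through listing pages
for i in range(2):
    all_items = await page.query_selector_all('li.product')
    # Looping through the products on the current page
    for item in all_items:
        name_el = await item.query_selector('h2')
// Javascript
// Looping through listing pages
for (let i = 2; i < 4; i++) {
  await page.$$eval('li.product', all_items => {
    // This callback runs in the browser, so each product is a DOM element
    all_items.forEach(product => {
      const name_el = product.querySelector('h2');
    });
  });
}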

Extracting text from the elements:

Now, we need to extract the text from the elements. We can use the inner_text method in Python and the innerText property in JavaScript for this.

# Python
name = await name_el.inner_text()
// Javascript
const name = name_el.innerText;
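Putting the selection and text-extraction steps together, a minimal Python sketch might look like the following. The function names and dictionary keys are illustrative and are not taken from the original scraper. The sketch assumes page is a Playwright async page that has already navigated to a listing page, and that the image URL is stored in the src attribute of the img element.

```python
async def text_or_none(parent, selector: str):
    # query_selector returns None when nothing matches, so guard before
    # calling inner_text() on the result.
    element = await parent.query_selector(selector)
    return await element.inner_text() if element else None


async def extract_products(page) -> list:
    """Collect name, price and image URL for every product on the page."""
    image_selector = (
        "a.woocommerce-LoopProduct-link.woocommerce-loop-product__link > img"
    )
    products = []
    for item in await page.query_selector_all("li.product"):
        image_el = await item.query_selector(image_selector)
        products.append(
            {
                "name": await text_or_none(item, "h2"),
                "price": await text_or_none(item, "span.woocommerce-Price-amount"),
                # The image URL is an attribute value, not element text.
                "image_url": await image_el.get_attribute("src") if image_el else None,
            }
        )
    return products
```

Returning None for a missing element keeps one incomplete listing from stopping the whole run; whether that is the right choice depends on how strict the downstream data needs to be.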

Navigate to the next page:

Now, we need to extract the data from the next page. To perform this action, we need to find the element locator of the next button. For this, we can use the locator method in Playwright.

[Image: Playwright select element for pagination]

The locator method returns an element locator that can be used for various operations, such as click, fill, tap, etc. It supports text matching (including regular expressions), XPath, and CSS selectors.

# Python
next = page.locator("text=→").nth(1)
// Javascript
const next = page.locator("text=→").nth(1);

Now, we need to click on the next button. To perform this, we can use the click method. You may need to wait for the required elements to load on the page. To ensure this, we can use the wait_for_selector method (waitForSelector in JavaScript).

# Python
await next.click()
# wait for the selector to load
await page.wait_for_selector('li.product')
// Javascript
await next.click();
// wait for selector to load
await page.waitForSelector('li.product');
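The click-and-wait steps can be wrapped in a loop that stops either after a fixed number of pages or when no next button is left. The sketch below is one way to do that in Python. The function name, the extract callback, and the max_pages parameter are assumptions made for illustration; extract is any coroutine that takes the page and returns a list of rows.

```python
async def scrape_listing_pages(page, extract, max_pages: int = 3) -> list:
    """Run extract(page) on up to max_pages consecutive listing pages."""
    rows = []
    for page_number in range(max_pages):
        rows.extend(await extract(page))
        if page_number == max_pages - 1:
            break  # page budget used up, no need to click "next" again
        next_button = page.locator("text=→").nth(1)
        # count() reports how many elements the locator currently matches;
        # zero means there is no next page to move to.
        if await next_button.count() == 0:
            break
        await next_button.click()
        # Wait until the product list of the new page has rendered.
        await page.wait_for_selector("li.product")
    return rows
```

Checking count() before clicking avoids waiting for a timeout on the last page, where the arrow link is no longer present.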

Close the browser and context:

After completing the task, we need to close the context and the browser instance.

# python
await context.close()
await browser.close()
// Javascript
await context.close();
await browser.close();
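In a longer script, an exception halfway through would skip these close calls and leave a browser process running. One common way to guard against that in Python is a try/finally block combined with the async_playwright context manager. The sketch below is illustrative only: the visit, main, and run names are made up, and it returns the page title simply to have something to show.

```python
import asyncio


async def visit(playwright, url: str, headless: bool = True) -> str:
    """Open url in a fresh context and return the page title."""
    browser = await playwright.chromium.launch(headless=headless)
    try:
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(url)
        return await page.title()
    finally:
        # Runs even if navigation fails; closing the browser also closes
        # the contexts and pages that belong to it.
        await browser.close()


async def main() -> None:
    # Imported here so that visit() can be loaded (and exercised with a
    # stand-in object) on machines where Playwright is not installed.
    from playwright.async_api import async_playwright

    # The context manager stops the Playwright driver when the block exits.
    async with async_playwright() as playwright:
        print(await visit(playwright, "https://scrapeme.live/shop"))


def run() -> None:
    # Synchronous entry point: starts an event loop and runs main().
    asyncio.run(main())
```

Calling run() launches a real browser and loads the live site, so it needs Playwright, its browsers, and network access.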

After closing both the context and the browser, we need to save the data into a CSV file. In JavaScript, this requires an external package. The installation command is given below:

npm i objects-to-csv
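On the Python side no extra package is needed, because the standard library ships a csv module. A minimal sketch, assuming each scraped product is a dictionary with the keys name, price, and image_url (those key names and the save_to_csv helper are illustrative, not part of the original code):

```python
import csv


def save_to_csv(rows: list, path: str) -> None:
    """Write scraped product dictionaries to a CSV file with a header row."""
    fieldnames = ["name", "price", "image_url"]
    # newline="" lets the csv module control line endings on every platform;
    # utf-8 keeps currency symbols in the price column intact.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

A call such as save_to_csv(products, "products.csv") writes one line per product below the header.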

Setting up headless mode and proxies in the browser:

Why do you need proxies for web scraping?

A proxy sits between your scraper and the target website and hides your IP address. The website you request no longer sees your original IP address, but instead sees the proxy’s IP address, which makes it harder to block or rate-limit your scraper based on where its requests come from.

You can check out this article to learn more: How To Rotate Proxies and change IP Addresses using Python 3

Why do you need a headless browser?

A browser without a user interface (UI) is called a headless browser. It renders websites like any standard browser, but because nothing is drawn on screen it has less overhead and usually runs faster, which makes it well suited for tasks like web scraping and automation.

Both of these can be achieved while defining and launching the browser:

// Javascript
const browser = await chromium.launch({
  headless: true,
  proxy: {
    server: '<proxy>',
    username: '<username>',
    password: '<password>'
  }
});
# Python
browser = await playwright.chromium.launch(headless=True, proxy={
    "server": "<proxy>",
    "username": "<username>",
    "password": "<password>"
})
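The same proxy dictionary can also be passed when creating a browser context, which is how a proxy is set per context rather than for the whole browser. The sketch below assumes proxies are supplied as URLs of the form scheme://user:password@host:port; that format and both helper names are assumptions made for illustration.

```python
from urllib.parse import unquote, urlsplit


def proxy_settings(proxy_url: str) -> dict:
    """Turn 'scheme://user:password@host:port' into a Playwright proxy dict."""
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a usable proxy URL: {proxy_url!r}")
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    settings = {"server": server}
    # Credentials are optional; percent-encoded characters are decoded.
    if parts.username:
        settings["username"] = unquote(parts.username)
    if parts.password:
        settings["password"] = unquote(parts.password)
    return settings


async def new_proxied_context(browser, proxy_url: str):
    # new_context() accepts a proxy dict of the same shape as launch(), so
    # each context can send its traffic through a different proxy.
    return await browser.new_context(proxy=proxy_settings(proxy_url))
```

Creating one context per proxy URL gives every context its own exit IP address along with its own cookies and cache.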

