How to build and run a web scraper by generating code from your interactions in a browser, using Playwright Codegen.
Browser-based web scraping provides the quickest and easiest solution for scraping JavaScript-heavy, client-side rendered web pages. There are multiple frameworks available to build and run browser-based web scrapers. The most common among these are Selenium, Puppeteer, and Playwright. We have already covered Selenium and Puppeteer in our previous articles. Now, let’s take a look at Playwright, the browser automation framework from Microsoft.
What is Playwright?
Playwright is a browser automation framework with APIs available in JavaScript, Python, .NET, and Java. Its simplicity and powerful automation capabilities make it an ideal tool for web scraping. It also comes with headless browser support.
Features of Playwright:
- Cross-browser: Playwright supports all modern rendering engines: Chromium (Google Chrome, Microsoft Edge), WebKit (Apple Safari), and Mozilla Firefox. It also lets you point to a custom browser build using the executable_path argument. This allows us to scrape with multiple browsers simultaneously; cross-browser scraping helps bypass bot detection by varying the browser and operating system, and also helps identify the fastest browser for a given site.
- Cross-platform: With Playwright, you can test how your applications perform in different browser builds for Windows, Linux, and macOS.
- Cross-language: Playwright supports multiple programming languages, including JavaScript, TypeScript, Python, Java, and .NET, with documentation and community support for each.
- Auto-wait: Playwright performs a series of actionability checks on elements before performing actions, to ensure those actions work as expected. It waits until all relevant checks have passed before performing the requested action. If the required checks do not pass within the specified timeout, the action fails with a TimeoutError.
- Web-first assertions: Playwright assertions are created specifically for the dynamic web. An assertion checks whether its condition has been met; if not, it re-queries the node and checks again until the condition is met or the assertion times out. If no assertion timeout is configured, it keeps retrying until the whole test times out.
- Proxies: Playwright supports the use of proxies. The proxy can be set either globally for the entire browser or individually for each browser context.
- Browser contexts: We can create an individual browser context for each test within a single browser instance. A browser context is equivalent to a brand-new browser profile, which is useful for multi-user scenarios and for web scraping with complete isolation at almost no overhead. We can also set cookies, the user agent, the viewport, a proxy, and enable/disable JavaScript for each context individually (a short sketch follows this list).
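To illustrate the last point, here is a minimal sketch using the Python async API that launches a browser and creates an isolated context with its own settings; the user agent string and viewport values are just placeholders:
# Python
# Minimal sketch: an isolated browser context with custom settings.
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True)
        # Each context behaves like a fresh browser profile.
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (compatible; example-scraper/1.0)",  # placeholder
            viewport={"width": 1280, "height": 720},
            java_script_enabled=True,
        )
        page = await context.new_page()
        await page.goto("https://scrapeme.live/shop")
        await context.close()
        await browser.close()

asyncio.run(main())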
You can also read: How to Scrape Google Maps: Code and No-Code Approach
Installation
Python:
Install the Python package:
pip install playwright
Install the required browsers:
playwright install
Javascript:
Install using npm
npm init -y
npm install playwright@latest
Install the CSV writer:
npm i objects-to-csv
You can also use Playwright Codegen to record your actions in the browser and turn them into code.
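For example, the following command opens a browser, records your interactions, and generates the equivalent Playwright code (the URL here is just the demo site used later in this article):
playwright codegen https://scrapeme.live/shop
If you installed Playwright through npm, the same command is available as npx playwright codegen.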
Building a scraper
Let’s create a scraper using Playwright to scrape data from the first three listing pages of https://scrapeme.live/shop. We will collect the following data points:
- Name
- Price
- Image URL
Source Code on Github
You can view the complete code here:
Python: https://github.com/scrapehero-code/playwright-webscraping/blob/main/intro/scraper.py
JavaScript: https://github.com/scrapehero-code/playwright-webscraping/blob/main/intro/scraper.js
Import the required libraries:
In Python, Playwright supports both synchronous and asynchronous operation. Node.js is asynchronous by nature, so Playwright only supports asynchronous operation there. In this article, we use asynchronous Playwright in both languages.
# Python
from playwright.async_api import async_playwright
import asyncio
// Javascript
const { chromium } = require('playwright');
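All of the Python snippets in the following sections are meant to run inside an asynchronous entry point. As a rough sketch (the function name is just a placeholder), the overall structure looks like this:
# Python
# Skeleton that the following snippets fit into.
async def main():
    async with async_playwright() as playwright:
        # Launch the browser, create a context and page,
        # then navigate, extract, and close (see the steps below).
        pass

asyncio.run(main())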
Launch the Browser instance:
Here, we can define the browser (Chrome, Firefox, WebKit) and pass the required arguments.
Async/await is a feature that lets you write asynchronous code in a sequential style. Instead of blocking while an operation (such as a page load) completes, the await keyword releases the flow of control back to the event loop so other work can proceed, and execution resumes once the result is ready.
# Python
# Launch the headed browser instance
browser = await playwright.chromium.launch(headless=False)
# Python
# Launch the headless browser instance
browser = await playwright.chromium.launch(headless=True)
// Javascript
// Launch headless browser instance
const browser = await chromium.launch({ headless: true });
// Javascript
// Launch headed browser instance
const browser = await chromium.launch({ headless: false });
Create a new browser context:
Playwright allows us to create a new context from an existing browser instance that won’t share cookies/cache with other browser contexts.
# Python
# Create a new browser context
context = await browser.new_context()
// Javascript
// Create a new browser context
const context = await browser.newContext();
Create a page from the browser context:
# Python
# Open a new page
page = await context.new_page()
// Javascript
// Open a new page
const page = await context.newPage();
This will open a Chromium browser. Now, let’s navigate to the listing page. We can use the below code lines to perform the navigation:
# Python
# Go to https://scrapeme.live/shop
await page.goto('https://scrapeme.live/shop')
// Javascript
// Go to https://scrapeme.live/shop
await page.goto('https://scrapeme.live/shop');
Find and select all product listings:
The products (Pokémon) are listed on this page. To get the data for each product, we first need to find the element that contains it and then extract the data from that element.
If we inspect one of the product listings, we can see that every product is inside a <li> tag, with a common class name “product”.
We can select all such products by looking for all <li> tags with the class name “product”, which can be represented as the CSS selector li.product.
The query_selector_all method lets you get all the elements that match the selector. If no elements match, it returns an empty list ([]).
# Python
all_items = await page.query_selector_all('li.product')
// Javascript
const product = await page.$$eval('li.product', all_items => {})
Select data from each listing:
From each product listing, we need to extract the following data points:
- Name
- Image URL
- Price
To get these details, we need to find the CSS selectors for the data points. You can do that by inspecting the element and noting its tag name and class name.
We can now see that the selectors are:
- Name- h2
- Price- span.woocommerce-Price-amount
- Image URL- a.woocommerce-LoopProduct-link.woocommerce-loop-product__link > img
We can use the query_selector function to select the individual elements. It returns the first matching element; if no element matches the selector, the return value resolves to None (null in JavaScript). You can see the implementation below:
# Python
# Looping through listing pages
for i in range(2):
    name_el = await item.query_selector('h2')
// Javascript
// Looping through listing pages
for (let i = 2; i < 4; i++) {
    const name_el = await product.querySelector('h2')
}
Extracting text from the elements:
Now, we need to extract the text from the elements. We can use the inner_text function in Python (the innerText property in JavaScript) for this.
# Python
name = await name_el.inner_text()
// Javascript
const name = name_el.innerText;
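The price and image URL can be extracted the same way. Here is a rough Python sketch (the variable names are illustrative) that combines query_selector, inner_text, and get_attribute for every listing on the page:
# Python
# Illustrative sketch: extract name, price, and image URL from each listing.
for item in all_items:
    name_el = await item.query_selector('h2')
    price_el = await item.query_selector('span.woocommerce-Price-amount')
    image_el = await item.query_selector(
        'a.woocommerce-LoopProduct-link.woocommerce-loop-product__link > img')
    name = await name_el.inner_text()
    price = await price_el.inner_text()
    image_url = await image_el.get_attribute('src')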
Navigate to the next page:
Now, we need to extract the data from the next page. To perform this action, we need a locator for the next button, which we can get with Playwright’s locator method.
The locator method returns an element locator that can be used for various operations, such as click, fill, and tap. It supports text matching (including regular expressions), XPath, and CSS selectors.
# Python
next = page.locator("text=→").nth(1)
// Javascript
const next = page.locator("text=→").nth(1);
Now, we need to click the next button using the click function. Before extracting data from the new page, you may need to wait for the required elements to load; we can use the wait_for_selector function (waitForSelector in JavaScript) to ensure this.
# Python
await next.click()
# Wait for the selector to load
await page.wait_for_selector('li.product')
// Javascript
await next.click();
// Wait for the selector to load
await page.waitForSelector('li.product');
Close the browser and context:
After completing the task, we need to close the browser context and the browser instance.
# Python
await context.close()
await browser.close()
// Javascript
await context.close();
await browser.close();
After closing the context and the browser, we need to save the data to a CSV file. For saving to CSV in JavaScript, we need to install an external package. The installation command is given below:
npm i objects-to-csv
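In Python, the standard library’s csv module is enough. Here is a minimal sketch (assuming the scraped rows have been collected as a list of dictionaries) of how the data could be written out:
# Python
# Minimal sketch: write the collected rows (a list of dicts) to a CSV file.
import csv

def save_to_csv(rows, path='products.csv'):
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'price', 'image_url'])
        writer.writeheader()
        writer.writerows(rows)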
Setting up headless mode and proxies in the browser:
Why do you need proxies for web scraping?
A proxy acts as an intermediary that hides your IP address and lets you access data without being blocked. With a proxy, the website you request no longer sees your original IP address; it sees the proxy’s IP address instead, allowing you to browse the website without being detected.
You can check out this article to learn more: How To Rotate Proxies and change IP Addresses using Python 3
Why do you need a headless browser?
A browser without a user interface (UI) is called a headless browser. It can render a website like any other standard browser, but because it does not have to draw a UI it has minimal overhead and runs faster, which makes it well suited for tasks like web scraping and automation.
Both of these can be achieved while defining and launching the browser:
// Javascript
const browser = await chromium.launch({
    headless: true,
    proxy: {
        server: '<proxy>',
        username: '<username>',
        password: '<password>'
    }
});
# Python
browser = await playwright.chromium.launch(
    headless=True,
    proxy={
        "server": "<proxy>",
        "username": "<username>",
        "password": "<password>"
    }
)
Source Code on Github
You can view the complete code here:
Python: https://github.com/scrapehero-code/playwright-webscraping/blob/main/intro/scraper.py
JavaScript: https://github.com/scrapehero-code/playwright-webscraping/blob/main/intro/scraper.js
If you would like to learn how to speed up your browser-based web scrapers, please read the article below.
How to make web scraping with browsers faster and cost effective
Next, let’s see how we can use Playwright Codegen to build web scrapers faster.