How to build web scrapers quickly using Playwright Codegen

The Playwright library offers many features, but the one that stands out from the rest is Codegen. Playwright Codegen is a test generation tool that comes bundled with Playwright. You can also use it to build browser-based web scrapers with ease.

If you are new to Playwright, please visit our previous article to learn more – Playwright installation and Basics.

This tutorial covers Playwright Codegen using Python and JavaScript. We will use Node.js as the JavaScript runtime environment.
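If Playwright is not installed yet, a typical setup looks like the commands below (see the installation article linked above for details); the exact steps may vary with your environment and package manager:

```shell
# Python
$ pip install playwright
$ playwright install

# JavaScript (Node.js)
$ npm install playwright
$ npx playwright install
```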

A simple example

Python

$ playwright codegen example.com

JavaScript

$ npx playwright codegen example.com

We now have two windows open, the browser and the Playwright Inspector. You can use the browser window to interact with the website, and the Playwright Inspector will record all interactions.

Playwright Codegen Open Browser

Playwright Inspector

The Playwright Inspector records all the interactions performed in the browser and converts them into code in a programming language of your choice.

Features of Playwright Inspector

Codegen Playwright Inspector features

  1. Record button – In the top left-hand corner, you can see the Record button, which lets you start/stop recording the activities you perform in the browser.
  2. Target – You can see the Target selector in the top right-hand corner. It lists all the languages supported by Playwright Inspector. Choosing a different language converts all your recorded interactions into that language.
  3. Code panel – The Code panel contains the code for the recorded interactions. An initial template will be present, which launches the browser and navigates to the specified website.
  4. Explore – Explore allows you to get the CSS selector of any element on the page. Instead of writing custom selectors, you can use this feature to get selectors quickly. The button is not visible while you are recording.

Pros of Codegen

  1. Code generation: Codegen generates most of the code for navigations and interactions, which lets you get started with a project quickly. Even so, the auto-generated code needs to be fine-tuned for particular use cases.
  2. Commented code: Codegen also adds a comment above each generated line describing the interaction.
  3. Code conversion: The Target selector can easily convert your code to multiple languages.
  4. Initial template: By default, a basic template is provided. The template contains the code to start the browser and go to the targeted page.

Cons of Codegen

  • Optimization is needed: The generated code usually needs to be optimized manually depending on the use case.
  • Choosing dynamic selectors: Codegen may choose selectors that are not stable across web pages. These need to be modified manually.

Wikipedia Scraper using Playwright Codegen

Let’s create a simple scraper for Wikipedia that searches for celebrity names and collects information about them. We want the scraper to perform the following tasks:

  1. Go to https://www.wikipedia.org/
  2. Enter the celebrity’s name in the search bar and click on the first suggestion. This will lead you to the celebrity’s page.
  3. Save the entire page. You can extract only the necessary information instead of saving the entire page.

To start Codegen, run the command given below:

$ playwright codegen https://www.wikipedia.org/

Playwright Codegen navigate to website

We now have two windows open: the browser and the Playwright Inspector.

An initial template will be present in the code panel, which launches the browser and navigates to Wikipedia. Now, let’s search for “Tom Cruise”.

As we type the celebrity name, we can see the equivalent code being generated in the Playwright Inspector:

Codegen Playwright inspector generates python code

Currently, the code is in Python. If you need to convert it to another programming language, click the Target selector and choose the required language. Here, we have converted the same code to JavaScript.

Codegen playwright inspector convert python to javascript

We can now copy the above Python code to a file, scraper.py, and modify it as per our use case.

Add an additional function to write raw HTML data to a file:

Python

def write_to_file(filename: str, data: str):
    # Saves raw HTML to name.html files.
    with open(filename, 'w') as f:
        f.write(data)

JavaScript

const fs = require("fs").promises;

// Saves raw HTML to name.html files.
async function writeToFile(filename, data) {
  await fs.writeFile(filename, data);
}

Add a new variable at the beginning to store the celebrity names.
Python

celebrity_names = ["Tom Cruise", "Johnny Depp", "Tom Holland", "Scarlett Johansson"]

JavaScript

let celebrityNames = [
  "Tom Cruise",
  "Johnny Depp",
  "Tom Holland",
  "Scarlett Johansson",
];

Add the logic to loop through celebrity names. Then, add the code to write the HTML content and close the page after use.
Python

celebrity_names = ["Tom Cruise", "Johnny Depp", "Tom Holland", "Scarlett Johansson"]

# looping through all celebrities
for celebrity in celebrity_names:
    # Open new page
    page = context.new_page()
    # Go to https://www.wikipedia.org/
    page.goto("https://www.wikipedia.org/")
    # Click input[name="search"]
    page.click('input[name="search"]')
    # Fill input[name="search"]
    page.fill('input[name="search"]', celebrity)
    # Click #typeahead-suggestions a >> :nth-match(div, 2)
    page.click("#typeahead-suggestions a >> :nth-match(div, 2)")
    # file names should be like tom_cruise.html
    filename = "_".join(celebrity.lower().split()) + ".html"
    # write the html to a file
    write_to_file(filename, page.content())
    # close the page
    page.close()

JavaScript

// looping through all celebrities
for (const celebrity of celebrityNames) {
 // Open new page
 const page = await context.newPage();

 // Go to https://www.wikipedia.org/
 await page.goto("https://www.wikipedia.org/");

 // Click input[name="search"]
 await page.click('input[name="search"]');

 // Fill input[name="search"]
 await page.fill('input[name="search"]', celebrity);

 // Click #typeahead-suggestions a >> :nth-match(div, 2)
 await page.click("#typeahead-suggestions a >> :nth-match(div, 2)");

 // file names should be like tom_cruise.html
 let filename = celebrity.toLowerCase().split(" ").join("_") + ".html";

 // write the html to a file
 await writeToFile(filename, await page.content());

 await page.close();
}
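The filename rule used in both loops (“Tom Cruise” becomes tom_cruise.html) can be factored out and checked in isolation before running the full scraper. A minimal sketch; the helper name filename_for is ours, not part of the generated code:

```python
def filename_for(name: str) -> str:
    # Lowercase the name, join the words with underscores, append ".html"
    return "_".join(name.lower().split()) + ".html"

print(filename_for("Tom Cruise"))  # tom_cruise.html
```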

The complete code has been provided below:

Python implementation – https://github.com/scrapehero-code/playwright-webscraping/blob/main/Playwright-Codegen/Wikipedia/scraper.py

JavaScript implementation – https://github.com/scrapehero-code/playwright-webscraping/blob/main/Playwright-Codegen/Wikipedia/scraper.js

Use the following command to run the scraper:

Python

$ python3 scraper.py

JavaScript

$ node scraper.js

Amazon Scraper using Playwright Codegen

Now, let’s build an Amazon scraper that collects the price of a MacBook Pro for a specific zip code in New York City. The scraper performs the following tasks:

  1. Go to the product link.
  2. Set the zip code (10013).
  3. Collect the current price and the date on which the price is collected.

Example: Product link – https://www.amazon.com/Apple-MacBook-16-inch-10%E2%80%91core-32%E2%80%91core/dp/B09R34VZP6/

Start Codegen:

$ playwright codegen https://www.amazon.com/Apple-MacBook-16-inch-10%E2%80%91core-32%E2%80%91core/dp/B09R34VZP6/

Codegen Playwright navigate to website

Set the zip code in the browser

The Playwright Inspector will generate the code for setting the zip code. Let’s copy it to a file.

Playwright Codegen python code

You can see that the run function does not have the code to click the Done button. This is because Codegen failed to detect it while recording the interactions. Now, we have to add this step manually.

Playwright Codegen element locator missing

Use the Explore feature in the Playwright Inspector to get the selector for the button, and add the following lines after the code that clicks the Apply button.

Python

# wait for the zip code to change
page.wait_for_selector("#GLUXZipConfirmationValue")
# Click the Done button
page.click('button:has-text("Done")')
# reload the page
page.reload()

JavaScript

// wait for zipcode to change
await page.waitForSelector("#GLUXZipConfirmationValue");
// Click the Done button
await page.click('button:has-text("Done")');
// reload the page
await page.reload();

Add the following code to extract the price:

Python

price = page.inner_text('#corePrice_feature_div .a-price .a-offscreen')

JavaScript

let price = await page.innerText(
  "#corePrice_feature_div .a-price .a-offscreen"
);

The selectors may have slight differences depending on your region.
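The extracted price is a raw string such as “$2,399.00”. Since the task also calls for recording the date of collection, it helps to normalize the price into a number and store both together. A minimal sketch, assuming a US-style price format; the helper name and sample value are ours:

```python
from datetime import date

def parse_price(raw: str) -> float:
    # Strip the currency symbol and thousands separators, e.g. "$2,399.00" -> 2399.0
    return float(raw.replace("$", "").replace(",", "").strip())

# Store the price together with the date on which it was collected
record = {"price": parse_price("$2,399.00"), "date": date.today().isoformat()}
print(record["price"])  # 2399.0
```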

You can refer to the complete code below:

Python Implementation – https://github.com/scrapehero-code/playwright-webscraping/blob/main/Playwright-Codegen/Amazon/scraper.py

JavaScript implementation – https://github.com/scrapehero-code/playwright-webscraping/blob/main/Playwright-Codegen/Amazon/scraper.js

Posted in:   Web Scraping Tutorials
