The Playwright library offers many features, but the one that stands out from the rest is Codegen. Playwright Codegen is a test generation tool that comes bundled with Playwright. You can also use it to build browser-based web scrapers with ease.
If you are new to Playwright, please visit our previous article to learn more – Playwright installation and Basics.
This tutorial covers Playwright Codegen using Python and JavaScript, with Node.js as the JavaScript runtime environment.
If you already know how to build web scrapers with Playwright, jump right into our article about optimizing Playwright web scrapers using code profiling.
A simple example
Python
$ playwright codegen example.com
Javascript
$ npx playwright codegen example.com
We now have two windows open, the browser and the Playwright Inspector. You can use the browser window to interact with the website, and the Playwright Inspector will record all interactions.
Playwright Inspector
The Playwright Inspector records all the interactions performed in the browser and converts them into code in a programming language of your choice.
Features of Playwright Inspector
- Record button – In the top left-hand corner, you can see the Record button that allows you to start/stop recording activities you perform on the browser.
- Target – The Target selector is in the top right-hand corner. It lists all the languages supported by Playwright Inspector; choosing a different language converts all your recorded interactions into that language.
- Code panel – The Code panel contains the code for the recorded interactions. An initial template will be present, which launches the browser and navigates to the specified website (a sketch of this template is shown after this list).
- Explore – Explore allows you to get the CSS selector of any element on the page. Instead of writing custom selectors, you can use this feature to get the selectors quickly. The button will not be visible when you are recording.
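For reference, the initial template that Codegen produces when using Python's sync API looks roughly like the sketch below; the exact output can vary between Playwright versions.

from playwright.sync_api import Playwright, sync_playwright

def run(playwright: Playwright) -> None:
    # Launch a visible Chromium browser and open a new page
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    # Go to the URL passed to the codegen command
    page.goto("https://example.com/")
    # Recorded interactions are appended here as you use the browser
    context.close()
    browser.close()

with sync_playwright() as playwright:
    run(playwright)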
Pros of Codegen
- Code generation: Codegen generates most of the code for navigations and interactions, which lets you get started on a project quickly. The auto-generated code still needs to be fine-tuned for your particular use case.
- Commented code: Codegen also adds a comment for each recorded action.
- Code conversion: Using the Target selector, you can easily convert the generated code to any of the supported languages.
- Initial Template: By default, a basic template will be provided. The template will have the code to start the browser and go to the targeted page.
Cons of Codegen
- Optimization is needed: The generated code usually needs to be optimized manually depending on the use case.
- Choosing dynamic selectors: Codegen may pick selectors that are not stable across page loads, so we need to replace them manually (see the sketch after this list).
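As a hypothetical illustration (Python shown, selectors made up for demonstration), a recorded selector tied to an auto-generated class name can be swapped for one based on a stable attribute:

# Selector recorded by Codegen, tied to an auto-generated class name
# that may change between page loads (hypothetical example)
page.click("div.css-1x2y3z > button")

# A more stable, hand-written alternative based on a semantic attribute
page.click('button[name="search"]')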
Wikipedia Scraper using Playwright Codegen
Let’s create a simple scraper for Wikipedia that searches for celebrity names and collects information about them. We want the scraper to perform the following tasks:
- Go to https://www.wikipedia.org/
- Enter the celebrity’s name in the search bar and click on the first suggestion. This will lead you to the celebrity’s page.
- Save the entire page. (In a real project, you could extract only the necessary information instead of saving the whole page.)
To start Codegen, run the code given below:
$ playwright codegen https://www.wikipedia.org/
We now have two windows open: the browser and the Playwright Inspector.
An initial template will be present in the code panel, which launches the browser and navigates to Wikipedia. Now, let’s search for “Tom Cruise”.
As we type the celebrity name, we can see the equivalent code being generated in the Playwright Inspector:
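The recorded steps look roughly like this (Python shown; the exact selectors depend on the current Wikipedia markup):

# Click input[name="search"]
page.click('input[name="search"]')
# Fill input[name="search"] with the celebrity name
page.fill('input[name="search"]', "Tom Cruise")
# Click the first suggestion in the typeahead dropdown
page.click("#typeahead-suggestions a >> :nth-match(div, 2)")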
Currently, the code is in Python, and if you need to convert it to another programming language, you can click on the Target button and select the required language. Here, we have converted the same code to Javascript.
We can now copy the above Python code to a file named scraper.py and modify it for our use case.
Add an additional function to write raw HTML data to a file:
Python
def write_to_file(filename: str, data: str):
    # Saves raw HTML to <name>.html files.
    with open(filename, 'w') as f:
        f.write(data)
Javascript
const fs = require("fs");

// Saves raw HTML to <name>.html files.
async function writeToFile(filename, data) {
  fs.writeFile(filename, data, (err) => {
    if (err) throw err;
  });
}
Add a new variable at the beginning to store the celebrity names.
Python
celebrity_names = ["Tom Cruise", "Johnny Depp", "Tom Holland", "Scarlett Johansson"]
Javascript
let celebrityNames = [
  "Tom Cruise",
  "Johnny Depp",
  "Tom Holland",
  "Scarlett Johansson",
];
Add the logic to loop through celebrity names. Then, add the code to write the HTML content and close the page after use.
Python
celebrity_names = ["Tom Cruise", "Johnny Depp", "Tom Holland", "Scarlett Johansson"]

# looping through all celebrities
for celebrity in celebrity_names:
    # Open new page
    page = context.new_page()
    # Go to https://www.wikipedia.org/
    page.goto("https://www.wikipedia.org/")
    # Click input[name="search"]
    page.click('input[name="search"]')
    # Fill input[name="search"]
    page.fill('input[name="search"]', celebrity)
    # Click #typeahead-suggestions a >> :nth-match(div, 2)
    page.click("#typeahead-suggestions a >> :nth-match(div, 2)")
    # file names should be like tom_cruise.html
    filename = "_".join(celebrity.lower().split()) + ".html"
    # write the html to a file
    write_to_file(filename, page.content())
    # close the page
    page.close()
Javascript
// looping through all celebrities
for (const celebrity of celebrityNames) {
  // Open new page
  const page = await context.newPage();
  // Go to https://www.wikipedia.org/
  await page.goto("https://www.wikipedia.org/");
  // Click input[name="search"]
  await page.click('input[name="search"]');
  // Fill input[name="search"]
  await page.fill('input[name="search"]', celebrity);
  // Click #typeahead-suggestions a >> :nth-match(div, 2)
  await page.click("#typeahead-suggestions a >> :nth-match(div, 2)");
  // file names should be like tom_cruise.html
  let filename = celebrity.toLowerCase().split(" ").join("_") + ".html";
  // write the html to a file
  await writeToFile(filename, await page.content());
  // close the page
  await page.close();
}
The complete code has been provided below:
Python implementation – https://github.com/scrapehero-code/playwright-webscraping/blob/main/Playwright-Codegen/Wikipedia/scraper.py
Javascript implementation – https://github.com/scrapehero-code/playwright-webscraping/blob/main/Playwright-Codegen/Wikipedia/scraper.js
Use the following command to run the scraper:
Python
$ python3 scraper.py
Javascript
$ node scraper.js
Amazon Scraper using Playwright Codegen
Now, let’s build an Amazon scraper that collects the price of a MacBook Pro for a specific zip code in New York City. The scraper performs the following tasks:
- Go to the product link.
- Set the zip code (10013).
- Collect the current price and the date on which the price is collected.
Example: Product link – https://www.amazon.com/Apple-MacBook-16-inch-10%E2%80%91core-32%E2%80%91core/dp/B09R34VZP6/
Start Codegen
$ playwright codegen https://www.amazon.com/Apple-MacBook-16-inch-10%E2%80%91core-32%E2%80%91core/dp/B09R34VZP6/
Set the zip code in the browser
The Playwright Inspector will generate the code for setting the zip code. Let’s copy it to a file.
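The recorded steps will look something like the sketch below (Python shown). The selectors here are placeholders for illustration only; Amazon's markup changes by region and over time, so use the values that Codegen records on your machine.

# NOTE: placeholder selectors for illustration; adjust to what Codegen records.
# Open the delivery-location dialog from the navigation bar
page.click("#nav-global-location-popover-link")
# Enter the zip code
page.fill("#GLUXZipUpdateInput", "10013")
# Click the Apply button
page.click('span:has-text("Apply")')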
You can see that the run function does not contain the code to click the Done button. This is because Codegen failed to detect that interaction while recording, so we have to add this step manually.
Use the Explore feature in the Playwright Inspector to get the selector for the button, and add the following lines after the line that clicks the Apply button.
Python
# wait for the zip code to change
page.wait_for_selector("#GLUXZipConfirmationValue")
# Click the Done button
page.click('button:has-text("Done")')
# reload the page
page.reload()
Javascript
// wait for the zip code to change
await page.waitForSelector("#GLUXZipConfirmationValue");
// Click the Done button
await page.click('button:has-text("Done")');
// reload the page
await page.reload();
Add the following code to extract the price:
Python
price = page.inner_text('#corePrice_feature_div .a-price .a-offscreen')
Javascript
let price = await page.innerText(
  "#corePrice_feature_div .a-price .a-offscreen"
);
The selectors may have slight differences depending on your region.
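The task list also calls for recording the date on which the price was collected. One way to do this in Python, assuming you simply want to keep the price and today's date together (how you store them is up to you):

# Record the date alongside the price (illustrative)
from datetime import date

price = page.inner_text("#corePrice_feature_div .a-price .a-offscreen")
collected_on = date.today().isoformat()
print(price, collected_on)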
You can refer to the complete code below:
Python Implementation – https://github.com/scrapehero-code/playwright-webscraping/blob/main/Playwright-Codegen/Amazon/scraper.py
Javascript Implementation – https://github.com/scrapehero-code/playwright-webscraping/blob/main/Playwright-Codegen/Amazon/scraper.js