How to block specific resources in Playwright

One of the biggest disadvantages of web scraping with browsers is that they are expensive to run at scale due to the amount of compute and network bandwidth required. Most of the extra bandwidth comes from the additional requests that are fetched to render a web page. Unlike a traditional web scraper that downloads the HTML file of a web page and parses it, a web scraper built with a browser “renders” the page by requesting every resource referenced in the HTML file that is required for “painting” the web page. These could be images, CSS stylesheets, web fonts, JavaScript libraries, JavaScript files, etc.

To give you an example, let’s take a look at the breakdown of requests that are downloaded to render this page: https://scrapeme.live/shop/

Bandwidth consumption pie chart

Loading the page transfers 1.9 MB across 51 requests sent in the background. 69% of that is images, and 18% is stylesheets. We could save a large share of the network bandwidth used by blocking these requests, which are not necessary for getting the data we are trying to scrape.

response rendering time without interception

Rendering web pages also takes time, increasing the run time of our web scrapers. 2614 ms is required for rendering the images. In most cases, we do not need to render images and CSS while scraping a website. We could reduce the time spent waiting for the page to render if we block these requests.

If you are building a browser-based scraper to take screenshots of a web page or to check how the page is rendered, blocking resources wouldn’t make sense.

The technique we will use to block these requests is called Request Interception.

To learn more about other techniques to improve the performance of your web scrapers that use browsers, take a look at the article below.

How to make web scraping with browsers faster and cost effective

What types of resources can be intercepted in Playwright?

You could technically intercept any type of request. The resource types supported by Playwright are:

  • document – An HTML / XML file.
  • stylesheet – CSS stylesheet for the web page
  • image – Image files like PNG, JPG, etc.
  • media – Resources loaded via a <video> or <audio> element
  • font – Web Fonts
  • script – JavaScript code loaded via <script> tags
  • texttrack – Text Tracks for Audio or Video
  • xhr – Short for XMLHttpRequest, primarily used for AJAX requests for data. Most websites that render data using JavaScript use this type, so these are better left unblocked.
  • fetch – Similar to XHR, but fetched using the newer fetch() API.
  • eventsource – Opens a persistent connection to an HTTP server, which sends events in text/event-stream format.
  • websocket – Web Socket connections that are opened between browser and a web socket server
  • manifest – JSON specification for Progressive Web Apps
  • other – Resources that aren’t covered by any other available type.

Blocking font, media, image, and stylesheet requests is generally safe for web scraping, as most pages will not lose any data. As we saw above, blocking images and stylesheets will give you the most savings.
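Before deciding what to block on your own target site, it can help to see which resource types a page actually requests. Here is a minimal sketch, assuming you already have a Playwright page object. The helper names make_request_counter and profile_page are made up for this example; page.on("request", ...) and request.resource_type are the Playwright pieces it relies on.

```python
from collections import Counter


def make_request_counter():
    """Return (counts, handler); the handler tallies each request's resource type."""
    counts = Counter()

    def on_request(request):
        # Playwright's Request object exposes its type as `resource_type`.
        counts[request.resource_type] += 1

    return counts, on_request


async def profile_page(page, url):
    """Load `url` on an existing Playwright page and return per-type request counts."""
    counts, on_request = make_request_counter()
    page.on("request", on_request)  # fires once for every request the page issues
    await page.goto(url)
    return counts
```

Calling counts = await profile_page(page, "https://scrapeme.live/shop/") and printing counts.most_common() should show which types dominate, which is a reasonable starting point for a block list.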

How to intercept requests in Playwright

If you are just here for the code, here is some code to block images, media, CSS stylesheets and fonts in Playwright. You can put any of the types listed above into the set of resource types below and it will be blocked.

The function intercept below checks each request against a set of unwanted resource types. If the request’s type is in our block list, the request is aborted by calling await route.abort(); otherwise it is allowed through by calling await route.continue_().

import asyncio

from playwright.async_api import Playwright, async_playwright


async def intercept(route, request):
    if request.resource_type in {'image', 'media', 'stylesheet', 'font'}:
        await route.abort()
    else:
        await route.continue_()

        
async def run(playwright: Playwright) -> None:
    browser = await playwright.chromium.launch(headless=False)
    context = await browser.new_context()

    # Open new page
    page = await context.new_page()

    # adding interception
    await page.route('**/*', intercept)

    # Go to https://scrapeme.live/shop/
    await page.goto("https://scrapeme.live/shop/")

    # ---------------------
    await context.close()
    await browser.close()


async def main() -> None:
    async with async_playwright() as playwright:
        await run(playwright)


asyncio.run(main())

With the simple code out of the way, let’s get into some more detail on each resource type.

Intercept and block Images

You can block images in Playwright by blocking the resource type image. Blocking images will give you the best bandwidth savings, as images are the largest resources requested by most web pages.

async def intercept(route, request):
    if request.resource_type in {'image'}:
        await route.abort()
    else:
        await route.continue_()

await page.route("**/*", intercept)

You shouldn’t block images if the purpose of your scraper is to actually download the images. 🙂

Intercept and block CSS Stylesheets

In most use cases, CSS stylesheets are not required for web scraping, so they are safe to block. You can block CSS stylesheets by intercepting the resource type stylesheet.

If your web scraper’s logic clicks based on the position of a button, blocking CSS would break that logic. You might want to target the click with a CSS selector instead.

async def intercept(route, request):
    if request.resource_type in {'stylesheet'}:
        await route.abort()
    else:
        await route.continue_()

await page.route("**/*", intercept)
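If you block stylesheets, clicks that target an element by selector are generally more robust than clicks at fixed coordinates, because the element is looked up in the DOM rather than by where the CSS would have placed it. A minimal sketch, assuming an existing Playwright page; the helper name click_by_selector and the selector "a.next" in the usage note are hypothetical and not taken from any particular site.

```python
async def click_by_selector(page, selector):
    """Click the element matching a CSS selector instead of a fixed screen position."""
    # page.locator() resolves the element from the DOM when the click runs,
    # so the action does not depend on the layout the stylesheet would produce.
    await page.locator(selector).click()
```

For example, await click_by_selector(page, "a.next") would replace a coordinate-based page.mouse.click(x, y) call. Note that the locator click expects the selector to match a single element.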

Intercept and block Javascript files

JavaScript files, or scripts loaded via the script tag, are generally not safe to block. Most of the logic of a web page rendered via JavaScript lives in these scripts, and blocking them would break the data. But with careful trial and error, you could block JavaScript resources that are not crucial to rendering the web page with the data you need. For example, you could block Google Analytics, website trackers, etc. Please see the section “How to intercept requests using Chrome browser?” below to learn how.

Here is the code to block all JavaScript resources in Playwright:

async def intercept(route, request):
    if request.resource_type in {'script'}:
        await route.abort()
    else:
        await route.continue_()

await page.route("**/*", intercept)

Here is how you can block specific JavaScript requests:

async def intercept(route, request):
    # URL fragments of scripts that are safe to drop (analytics and trackers)
    urls_to_block = [
        'https://www.google-analytics.com/analytics.js',
        'https://www.googletagmanager.com/gtm.js',
        'https://static.hotjar.com',
    ]
    if request.resource_type == 'script' and any(u in request.url for u in urls_to_block):
        await route.abort()
    else:
        await route.continue_()

await page.route("**/*", intercept)
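As the block list grows, it may be easier to build the handler from two sets: resource types and hostnames. The sketch below is one possible way to do that, not the only one. The names should_block and make_interceptor are made up for this example, and unlike the snippet above, this variant blocks every request to a listed host (and its subdomains), not only scripts.

```python
from urllib.parse import urlparse


def should_block(resource_type, url, blocked_types, blocked_hosts):
    """Return True if the request's type or hostname is on a block list."""
    host = urlparse(url).hostname or ""
    # Match the listed host itself or any of its subdomains.
    host_blocked = any(host == h or host.endswith("." + h) for h in blocked_hosts)
    return resource_type in blocked_types or host_blocked


def make_interceptor(blocked_types=(), blocked_hosts=()):
    """Build a route handler suitable for page.route('**/*', handler)."""
    blocked_types = frozenset(blocked_types)
    blocked_hosts = frozenset(blocked_hosts)

    async def intercept(route, request):
        if should_block(request.resource_type, request.url, blocked_types, blocked_hosts):
            await route.abort()
        else:
            await route.continue_()

    return intercept
```

You would register it with something like await page.route("**/*", make_interceptor({"image", "font"}, {"google-analytics.com", "googletagmanager.com", "hotjar.com"})). Matching on the parsed hostname avoids accidentally blocking an unrelated URL that merely contains one of the strings.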

A request interceptor in action

Let’s look at a request interceptor in action, and see the savings.

This is the same code from the beginning of the article.

import asyncio

from playwright.async_api import Playwright, async_playwright


async def intercept(route, request):
    if request.resource_type in {'image', 'media', 'stylesheet', 'font'}:
        await route.abort()
    else:
        await route.continue_()

        
async def run(playwright: Playwright) -> None:
    browser = await playwright.chromium.launch(headless=False)
    context = await browser.new_context()

    # Open new page
    page = await context.new_page()

    # adding interception
    await page.route('**/*', intercept)

    # Go to https://scrapeme.live/shop/
    await page.goto("https://scrapeme.live/shop/")

    # ---------------------
    await context.close()
    await browser.close()


async def main() -> None:
    async with async_playwright() as playwright:
        await run(playwright)


asyncio.run(main())


Let’s run the scraper.

Transferred bandwidth on browser network tab

You can see that, after intercepting unwanted requests, the bandwidth transferred has been reduced to 8.7 kB and the time taken to load the web page has decreased to 2.05 seconds. When it comes to large-scale web scraping, request interception can make a significant difference in time and resource costs. You can see the difference below:

Without request interception:
Playwright scraper without request interception

With request interception:

Playwright scraper using request interception
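If you want to measure the effect from inside your own scraper rather than from the Network tab, a rough sketch like the one below could work. The names make_tallying_interceptor and timed_goto are made up for this example. The timing only covers how long page.goto() takes to return, so treat it as an approximation and not as a replacement for the browser’s own numbers.

```python
import time
from collections import Counter


def make_tallying_interceptor(blocked_types):
    """Return (stats, handler); the handler blocks the given types and counts outcomes."""
    blocked_types = frozenset(blocked_types)
    stats = Counter()

    async def intercept(route, request):
        if request.resource_type in blocked_types:
            stats["aborted"] += 1
            await route.abort()
        else:
            stats["continued"] += 1
            await route.continue_()

    return stats, intercept


async def timed_goto(page, url):
    """Navigate to `url` and return the wall-clock seconds page.goto() took."""
    start = time.perf_counter()
    await page.goto(url)
    return time.perf_counter() - start
```

Register the handler with await page.route("**/*", intercept), call elapsed = await timed_goto(page, "https://scrapeme.live/shop/"), and then compare stats and elapsed between a run with an empty block list and a run with {'image', 'media', 'stylesheet', 'font'}.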

 

How to intercept requests using Chrome browser?

You can use the browser’s Network tab to test whether blocking a certain request breaks the page. Open the Network tab and load the web page.

how to block unwanted request on browser

Select the request that you want to block and right-click on it. There you can see the option Block request URL (annotated in the above screenshot). Reload the page and make sure the page is still functioning.

Apart from blocking individual requests, we can block all requests that follow the same pattern. In the above example, we can see that there are a lot of images. Blocking each of them individually would take a lot of time. In such cases, we can block the requests using pattern matching. The image below shows how to do this:

 

Intercept more than one request using web browser
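Once you know a pattern is safe to block, the same idea carries over to Playwright, since page.route() accepts a glob string or a compiled regular expression in place of "**/*". Below is a minimal sketch under that assumption. The regex IMAGE_URL and the helper block_matching are made up for this example, and the extension list is only illustrative.

```python
import re

# Illustrative pattern: URLs ending in a common image extension,
# optionally followed by a query string.
IMAGE_URL = re.compile(r"\.(?:png|jpe?g|gif|webp|svg)(?:\?.*)?$", re.IGNORECASE)


async def block_matching(page, pattern):
    """Abort every request whose URL matches `pattern` (glob string or compiled regex)."""

    async def abort(route):
        await route.abort()

    # Only requests matching the pattern reach the handler; all others pass through.
    await page.route(pattern, abort)
```

You could call await block_matching(page, IMAGE_URL) or pass a glob such as "**/*.{png,jpg,jpeg}". Keep in mind that matching on the URL misses images served without a file extension, which the resource_type check used earlier would still catch.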

Complete Code

Python

import asyncio

from playwright.async_api import Playwright, async_playwright


async def intercept(route, request):
    # This set includes 'script'; remove it if the page needs JavaScript to render its data
    if request.resource_type in {'image', 'script', 'stylesheet', 'font'}:
        await route.abort()
    else:
        await route.continue_()

        
async def run(playwright: Playwright) -> None:
    browser = await playwright.chromium.launch(headless=False)
    context = await browser.new_context()

    # Open new page
    page = await context.new_page()

    # adding interception
    await page.route('**/*', intercept)

    # Go to https://scrapeme.live/shop/
    await page.goto("https://scrapeme.live/shop/")

    # ---------------------
    await context.close()
    await browser.close()


async def main() -> None:
    async with async_playwright() as playwright:
        await run(playwright)


asyncio.run(main())

JavaScript

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({
    headless: false,
  });
  const context = await browser.newContext();

  // Open new page
  const page = await context.newPage();

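  await page.route("**/*", async (route, request) => {
    // This list includes "script"; remove it if the page needs JavaScript to render its data
    const unwantedResources = ["image", "script", "stylesheet", "font"];
    if (unwantedResources.includes(request.resourceType())) {
      // abort() and continue() return promises, so await them
      await route.abort();
    } else {
      await route.continue();
    }
  });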

  // Go to https://scrapeme.live/shop/
  await page.goto("https://scrapeme.live/shop/");

  // ---------------------
  await context.close();
  await browser.close();
})();
