Tired of Broken Scrapers? Discover Adaptive Web Scraping



Websites frequently change their HTML structure, and some structures are dynamically generated with different attributes each time—meaning you need to constantly update your selectors. Adaptive web scraping aims to reduce the frequency with which you have to do that.

For instance, an e-commerce site might change a product card’s class from “product-item” to “item-card” after a JavaScript re-render. Traditional scrapers, which rely on fixed selectors (e.g., CSS or XPath), fail when these changes occur, resulting in “No Element Found” errors. However, using adaptive web scraping techniques, you can build a scraper that accounts for these changes.
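To see why fixed selectors are brittle, here's a minimal sketch showing the same scraper before and after such a re-render; the class names and markup are hypothetical:

from bs4 import BeautifulSoup

# Hypothetical markup before and after a re-render renames the product card class
old_html = '<div class="product-item">Laptop - $999</div>'
new_html = '<div class="item-card">Laptop - $999</div>'

selector = '.product-item'  # fixed CSS selector baked into the scraper

for label, html_doc in [('before', old_html), ('after', new_html)]:
    soup = BeautifulSoup(html_doc, 'html.parser')
    element = soup.select_one(selector)
    print(label, element.get_text() if element else 'No element found')

# Output:
# before Laptop - $999
# after No element found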

This article discusses three ways you can implement adaptive web scraping:

  1. Using Flexible Selectors
  2. Using Fallback Selectors
  3. Using LLMs

Adaptive web scraping methods

Adaptive Web Scraping Using Flexible Selectors

Flexible selectors don’t match an attribute’s exact value; they match only part of it. For instance, an element holding a price may have a class like “current-price,” “buy-price,” or “price-100.” All of these contain the string “price,” so a flexible selector can target that substring.

How you use a flexible selector depends on your parser.

In lxml, when you use XPath, you can use the contains() function like this:

from lxml import html

parser = html.fromstring(html_string)  # html_string holds the page's HTML
price_elements = parser.xpath("//div[contains(@class, 'price')]")

However, you’ll need to use RegEx if you are using Python requests and BeautifulSoup.

pattern = re.compile('price')
soup.find('div', {'class': pattern})

Here’s a sample script for adaptive scraping using RegEx:

import requests
from bs4 import BeautifulSoup
import re

def scrape_adaptive(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    class_pattern = re.compile(r'product')

    content = None
    element = soup.find('div', {'class': class_pattern})
    if element:
        content = element.get_text()

    print(content or 'No element found')

Explanation:

  • HTTP Request: requests.get(url) fetches the page’s HTML.
  • BeautifulSoup Parsing: BeautifulSoup(response.text, 'html.parser') creates a parser object.
  • Class Pattern: The RegEx pattern targets any string with the word ‘product’ in it.
  • Element Selection: Uses the find() method of BeautifulSoup to find the required element.
  • Output: The extracted text or a failure message is printed.

Need to know more about web scraping using the requests library? Read this ultimate guide to scrape websites with Python requests.

Go the hassle-free route with ScrapeHero

Why worry about expensive infrastructure, resource allocation and complex websites when ScrapeHero can scrape for you at a fraction of the cost?

Adaptive Web Scraping Using Fallback Selectors

Sometimes, attribute changes are so different that the flexible selector may fail or grab an element you don’t need. Such situations call for fallback selectors.

Consider this example: An element holding a writer’s name can have an ID like “author,” “byline,” “writer,” etc. You can’t use a single flexible selector to target all these IDs effectively. You can, however, test them one by one, which is what this method does.

The method to implement fallback selectors is the same for any parser: you iterate through a list of selectors and, in each iteration, try to extract the data points using the current selector.
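For example, here's a minimal sketch of the idea using BeautifulSoup; the selector list and id values are hypothetical, following the author/byline/writer example above:

from bs4 import BeautifulSoup

def extract_with_fallbacks(html_doc):
    soup = BeautifulSoup(html_doc, 'html.parser')

    # Hypothetical selectors, ordered from most to least preferred
    selectors = ['#author', '#byline', '#writer']

    for selector in selectors:
        element = soup.select_one(selector)
        if element:
            return element.get_text()
    return None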

Here’s how to implement fallback selectors using Playwright, which is typically used for scraping dynamic websites:

from playwright.sync_api import sync_playwright

def scrape_adaptive(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)

        selectors = {
            'primary': '.product-item',
            'fallback1': '.item-card',
            'fallback2': '[data-test="product"]'
        }

        content = None
        for key, selector in selectors.items():
            try:
                # A short timeout keeps failed selectors from stalling the loop
                content = page.locator(selector).first.text_content(timeout=3000)
                print(f'Used {key} selector')
                break
            except Exception as e:
                # Log the failure and move on to the next selector
                print(f'{key} selector failed: {e}')

        print(content or 'No element found')
        browser.close()

Explanation:

  • Browser Setup: sync_playwright() initializes Playwright, p.chromium.launch() launches a Chromium browser, new_page() creates a tab, and page.goto(url) loads the page.
  • Selector Dictionary: The selectors dictionary maps keys to CSS selectors for clarity.
  • Fallback Loop: The for loop iterates through selectors.items(), using page.locator(selector).text_content() to extract text. If a selector fails, the exception is caught and the loop moves to the next selector.
  • Output: The extracted text or a failure message is printed.
  • Cleanup: browser.close() is called within the with block to ensure proper resource cleanup.

Read this article to learn more about web scraping with Playwright.

Adaptive Web Scraping Using Large Language Models (LLMs)

You can use LLMs to analyze HTML code and extract the required details without using any selectors. Either install an open-source LLM like Meta’s Llama locally and run it on powerful graphics cards, or use a proprietary LLM API such as Google’s Gemini or OpenAI’s GPT models.

Here’s how you might use the Gemini API to extract product details:

import google.generativeai as genai
import os
import json

def extract_data(html):
    genai.configure(api_key=os.environ["GEMINI_API_KEY"].strip())

    # Create the model
    generation_config = {
        "temperature": 0,
        "max_output_tokens": 65536,
        "response_mime_type": "application/json",
    }

    model = genai.GenerativeModel(
        model_name="gemini-2.5-pro-preview",
        generation_config=generation_config,
    )

    chat_session = model.start_chat(history=[])

    prompt = f"extract details of all the products from the following HTML: {html}"
    response = chat_session.send_message(prompt)

    return json.loads(response.text)

Explanation:

  • API Setup: genai.configure() adds your API key for authentication with the model.
  • Configuration Dictionary: Creates a dictionary that specifies the model’s behavior.
  • Model Instance: genai.GenerativeModel() starts an instance of the model using the specified model name and the configuration dictionary.
  • Chat Session: model.start_chat() initializes a chat session with Gemini.
  • Prompt Creation: Sets a prompt that instructs the model to extract product details from the HTML code accepted by the function.
  • Response Parsing: json.loads() parses the response, which the function then returns.
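For context, here’s a brief usage sketch that fetches a page with requests and passes its HTML to extract_data(); the URL is a placeholder:

import requests

# Hypothetical usage: fetch a page and hand its HTML to the LLM-based extractor
response = requests.get('https://example.com/products')
products = extract_data(response.text)
print(products)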

Limitations

  • Anti-Scraping Measures: CAPTCHAs, IP bans, or bot detection (e.g., Cloudflare) can block scrapers. Bypassing these requires proxies, headless browser tweaks, or specialized services, which are beyond this article’s scope.
  • Unaccounted Changes: Unexpected HTML changes (e.g., site redesigns or new frameworks) can still break adaptive scrapers, requiring continuous monitoring.
  • Unreliability of LLMs: LLMs may fail to extract data because of unusual formatting, or they may generate incorrect data that doesn’t exist on the website (hallucination).

Worried about anti-scraping measures? Read this article on how to ethically avoid anti-scraping measures.

Why Use a Web Scraping Service

Adaptive web scraping aims to avoid issues due to HTML structure changes without manually changing the code. It usually involves using flexible and fallback selectors or LLMs to extract data. However, these methods have limitations.

The webpage may generate tags and attributes unaccounted for by the flexible or fallback selectors. Or, LLMs may hallucinate and generate incorrect data.

You can either deal with these limitations yourself or use a web scraping service.

A web scraping service like ScrapeHero can handle all these problems. We can take care of all the technicalities, including changing HTML structures and anti-scraping measures.

ScrapeHero is an enterprise-grade web scraping service provider. Contact ScrapeHero to get high-quality data for analysis without worrying about extraction.
