Websites frequently change their HTML structure, and some pages are dynamically generated with different attributes on each render, which means you have to keep updating your selectors. Adaptive web scraping aims to reduce how often you have to do that.
For instance, an e-commerce site might change a product card’s class from “product-item” to “item-card” after a JavaScript re-render. Traditional scrapers, which rely on fixed selectors (e.g., CSS or XPath), fail when these changes occur, resulting in “No Element Found” errors. However, using adaptive web scraping techniques, you can build a scraper that accounts for these changes.
This article discusses three ways you can implement adaptive web scraping:
- Using Flexible Selectors
- Using Fallback Selectors
- Using LLMs
Adaptive Web Scraping Using Flexible Selectors
Flexible selectors don’t target the exact attributes but rather a part of them. For instance, an element holding a price may have a class like “current-price,” “buy-price,” “price-100,” etc. All of them contain the string ‘price,’ so a flexible selector can target that.
How you use a flexible selector depends on your parser.
In lxml, when you use XPath, you can use the contains() function like this:
from lxml import html

parser = html.fromstring(html_string)  # html_string holds the page's HTML
parser.xpath("//div[contains(@class, 'price')]")
If you are using Python Requests with BeautifulSoup, you can instead pass a compiled RegEx pattern as the attribute value:
pattern = re.compile('price')
soup.find('div',{'class':pattern})
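Alternatively, BeautifulSoup's CSS selector support lets you express the same idea with an attribute substring match. This is just a sketch, assuming soup is an already-parsed BeautifulSoup object:

# Matches any <div> whose class attribute contains the substring 'price'
soup.select('div[class*="price"]')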
Here’s a sample script for adaptive scraping using RegEx:
import requests
from bs4 import BeautifulSoup
import re

def scrape_adaptive(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    class_pattern = re.compile(r'product')

    content = None
    element = soup.find('div', {'class': class_pattern})
    if element:
        content = element.get_text()

    print(content or 'No element found')
Explanation:
- HTTP Request: requests.get(url) fetches the page’s HTML.
- BeautifulSoup Parsing: BeautifulSoup(response.text, 'html.parser') creates a parser object.
- Class Pattern: The RegEx pattern targets any string with the word ‘product’ in it.
- Element Selection: Uses the find() method of BeautifulSoup to find the required element.
- Output: The extracted text or a failure message is printed.
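Calling the function on a product listing page runs the whole flow. The URL below is only a placeholder:

# Hypothetical URL; replace it with the page you want to scrape
scrape_adaptive('https://example.com/products')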
Adaptive Web Scraping Using Fallback Selectors
Sometimes, attribute changes are so drastic that a flexible selector fails or grabs an element you don't need. Such situations call for fallback selectors.
Consider this example: An element holding a writer’s name can have an ID like “author,” “byline,” “writer,” etc. You can’t use a single flexible selector to target all these IDs effectively. You can, however, test them one by one, which is what this method does.
The method to implement fallback selectors is the same for any parser. You iterate through a list of selectors and in each iteration try to extract the data points using the current selector.
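For example, with BeautifulSoup you could try each candidate ID in turn. This is only a minimal sketch; the IDs and the soup object are assumptions for illustration:

# Minimal fallback-selector sketch with BeautifulSoup
# The candidate IDs below are hypothetical examples
def find_author(soup):
    for element_id in ['author', 'byline', 'writer']:
        element = soup.find(id=element_id)
        if element:
            return element.get_text(strip=True)
    return None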
Here’s how to implement fallback selectors using Playwright, which is typically used for scraping dynamic websites:
from playwright.sync_api import sync_playwright

def scrape_adaptive(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)

        selectors = {
            'primary': '.product-item',
            'fallback1': '.item-card',
            'fallback2': '[data-test="product"]'
        }

        content = None
        for key, selector in selectors.items():
            try:
                content = page.locator(selector).text_content()
                print(f'Used {key} selector')
                break
            except Exception as e:
                # The selector didn't match; log the error and try the next one
                print(f'{key} selector failed: {e}')

        print(content or 'No element found')
        browser.close()
Explanation:
- Browser Setup: sync_playwright() initializes Playwright, p.chromium.launch() launches a Chromium browser, new_page() opens a tab, and page.goto(url) loads the page.
- Selector Dictionary: The selectors dictionary maps keys to CSS selectors for clarity.
- Fallback Loop: The for loop iterates through selectors.items(), using page.locator(selector).text_content() to extract text. If a selector fails, the exception is caught and the loop moves to the next selector.
- Output: The extracted text or a failure message is printed.
- Cleanup: browser.close() is called within the with block to ensure proper resource cleanup.
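One practical note: by default, each failed locator call waits for Playwright's full 30-second timeout before raising, so trying several fallback selectors can be slow. Passing a shorter per-call timeout speeds up the loop; the 3-second value below is just an example:

# A shorter timeout so a non-matching selector fails quickly
content = page.locator(selector).text_content(timeout=3000)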
Adaptive Web Scraping Using Large Language Models (LLMs)
You can use LLMs to analyze HTML and extract the required details without any selectors. You can either run an open-source LLM such as Meta's Llama locally on capable GPUs or call a proprietary model such as Google's Gemini or OpenAI's GPT models through an API.
Here's how you might use the Gemini API to extract product details:
import google.generativeai as genai
import os
import json

def extract_data(html):
    genai.configure(api_key=os.environ["GEMINI_API_KEY"].strip())

    # Create the model
    generation_config = {
        "temperature": 0,
        "max_output_tokens": 65536,
        "response_mime_type": "application/json",
    }

    model = genai.GenerativeModel(
        model_name="gemini-2.5-pro-preview",
        generation_config=generation_config,
    )

    chat_session = model.start_chat(history=[])

    prompt = f"extract details of all the products from the following HTML: {html}"
    response = chat_session.send_message(prompt)

    return json.loads(response.text)
Explanation:
- API Setup: genai.configure() adds your API key for authentication with the model.
- Configuration Dictionary: Creates a dictionary that specifies the model’s behavior.
- Model Instance: genai.GenerativeModel() starts an instance of the model using the specified model name and the configuration dictionary.
- Chat Session: model.start_chat() initializes a chat session with Gemini.
- Prompt Creation: Sets a prompt that instructs the model to extract product details from the HTML code accepted by the function.
- Response Parsing: json.loads() parses the response, which the function then returns.
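In practice, you pass the function HTML you have already fetched. This is only a sketch; the URL is a placeholder, and it assumes GEMINI_API_KEY is set in your environment:

import requests

# Hypothetical product page; replace with a real URL
html = requests.get('https://example.com/products').text
products = extract_data(html)
print(products)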
Limitations
- Anti-Scraping Measures: CAPTCHAs, IP bans, or bot detection (e.g., Cloudflare) can block scrapers. Bypassing these requires proxies, headless browser tweaks, or specialized services, which are beyond this article’s scope.
- Unaccounted Changes: Unexpected HTML changes (e.g., site redesigns or new frameworks) can still break adaptive scrapers, requiring continuous monitoring.
- Unreliability of LLMs: LLMs may fail to extract data because of unusual formatting, or they may generate incorrect data that doesn’t exist on the website (hallucination).
Why Use a Web Scraping Service
Adaptive web scraping aims to handle HTML structure changes without manual code updates. It usually involves using flexible selectors, fallback selectors, or LLMs to extract data. However, these methods have limitations.
The webpage may generate tags and attributes unaccounted for by the flexible or fallback selectors. Or, LLMs may hallucinate and generate incorrect data.
You can either deal with these limitations yourself or use a web scraping service.
A web scraping service like ScrapeHero can handle all these problems. We can take care of all the technicalities, including changing HTML structures and anti-scraping measures.
ScrapeHero is an enterprise-grade web scraping service provider. Contact ScrapeHero to get high-quality data for analysis without worrying about extraction.