How to Scrape AI Overviews for Multiple Queries: A Technical Guide



Google’s AI Overview feature now puts AI-generated summaries straight into search results, so users get answers without ever leaving Google or visiting your site. That’s a problem if you publish content. To stay competitive, you need AI overview scraping to understand how your content appears in these summaries. This guide shows you how to build an AI Overview scraper that extracts the summary text and citations across multiple queries.

Go the hassle-free route with ScrapeHero

Why worry about expensive infrastructure, resource allocation and complex websites when ScrapeHero can scrape for you at a fraction of the cost?

AI Overview Scraping: Setting Up the Environment

Before running the scraper, install the necessary dependencies. This code relies on Playwright, a browser automation library, for web scraping.

Playwright is necessary because Google generates AI Overviews dynamically, so the scraper has to run JavaScript. You can install it using:

pip install playwright
playwright install chromium
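
To confirm the install worked before writing any scraping code, a quick smoke test like the one below (a minimal sketch, separate from the scraper itself) should open Chromium, print a page title, and close without errors:

from playwright.sync_api import sync_playwright

# Launch Chromium, load a page, and print its title to verify the install
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # should print "Example Domain"
    browser.close()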

AI Overview Scraping: The Data Scraped

The scraper pulls two key things from the Google results page:

  • The Summary Text: It extracts the full AI Overview text using the inner_text() method. This just gives you clean text with no messy HTML.
  • The Sources: It finds all the hyperlinks (source citations) referenced in the overview by looking for all the anchor tags (a[href]) inside the main content.

AI Overview Scraping: The Code

Here’s the complete code to scrape AI Overviews.

"""
Google AI Overview Scraper using Playwright
Scrapes AI Overview sections from Google Search results
"""

from playwright.sync_api import sync_playwright
import json
import time
import os
import random
from typing import Dict, List

class GoogleAIOverviewScraper:

    def __init__(self, headless: bool = False):

        """
        Initialize the scraper
        
        Args:
            headless: Whether to run browser in headless mode
        """
        self.headless = headless
        self.playwright = None
        self.context = None
        self.page = None
    
    def start(self):
        
        """Start the browser"""
        user_data_dir = os.path.join(os.getcwd(), "user_data")
        if not os.path.exists(user_data_dir):
            os.makedirs(user_data_dir)

        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'

        self.playwright = sync_playwright().start()

        self.context = self.playwright.chromium.launch_persistent_context(
            headless=self.headless,
            user_agent=user_agent,
            user_data_dir=user_data_dir,
            viewport={'width': 1920, 'height': 1080},
            java_script_enabled=True,
            locale='en-US',
            timezone_id='America/New_York',
            permissions=['geolocation'],
            # Mask automation
            bypass_csp=True,
            ignore_https_errors=True,
            channel="chromium",
            args=[
                '--disable-blink-features=AutomationControlled',
                '--disable-dev-shm-usage',
                '--no-sandbox',
                '--disable-gpu',
                '--disable-setuid-sandbox'
            ]
        )
        self.page = self.context.new_page()
    
    def close(self):
        """Close the browser"""
        if self.page:
            self.page.close()
        if self.context:
            self.context.close()
        if self.playwright:
            self.playwright.stop()
    
    def search_and_extract_ai_overview(self, query: str) -> Dict:
        """
        Search Google and extract AI Overview if present

        """
        result = {
            'query': query,
            'ai_overview_present': False,
            'ai_overview_text': None,
            'ai_overview_sources': [],
            'timestamp': time.strftime('%Y-%m-%d %H:%M:%S')
        }
        
        try:
            # Navigate to Google
            search_url = f"https://www.google.com/search?q={query.replace(' ', '+')}"
            self.page.goto(search_url, wait_until='networkidle')
            
            # Wait a bit for AI Overview to load
            time.sleep(2)
            
            # Try to find AI Overview section
            # Google's AI Overview typically appears in specific containers
            # Click "Show more" if it exists to expand the AI overview
            try:
                show_more_button = self.page.get_by_role('button', name='Show more')
                # Locators are always truthy, so check whether any match exists
                if show_more_button.count() > 0:
                    self.page.wait_for_timeout(random.randrange(1000, 3000))
                    show_more_button.click()
                    # Wait a moment for the content to expand
                    time.sleep(1)
                    
            except Exception as e:
                print(f"couldn't find show more button: {e}")
                # If "Show more" button is not found, just continue
                pass
            ai_overview_selector = '.kCrYT'
            ai_overview_element = None

            try:
                elements = self.page.query_selector_all(ai_overview_selector)

                # The element right after the "AI Overview" heading holds the summary
                for index, element in enumerate(elements):
                    if "AI Overview" in element.text_content() and index + 1 < len(elements):
                        ai_overview_element = elements[index + 1]
                        break

            except Exception as e:
                print(e)
        
            if ai_overview_element:
                result['ai_overview_present'] = True
                
                # Extract the main AI Overview text
                text_content = ai_overview_element.inner_text()
                result['ai_overview_text'] = text_content.strip()
                
                # Try to extract sources/citations
                source_links = ai_overview_element.query_selector_all('a[href]')
                sources = []

                for link in source_links:
                    href = link.get_attribute('href')
                    text = link.inner_text().strip()

                    if href and text:
                        sources.append({
                            'text': text,
                            'url': href
                        })
                
                result['ai_overview_sources'] = sources
            else:
                print(f"No AI Overview found for query: {query}")
        
        except Exception as e:
            result['error'] = str(e)
            print(f"Error scraping AI Overview: {e}")
        
        return result
    
    def scrape_multiple_queries(self, queries: List[str]) -> List[Dict]:
        
        """
        Scrape AI Overviews for multiple queries
        """
        results = []
        
        for i, query in enumerate(queries):
            print(f"Processing query {i+1}/{len(queries)}: {query}")
            result = self.search_and_extract_ai_overview(query)
            results.append(result)
            
            # Add delay between requests to avoid rate limiting
            if i < len(queries) - 1:
                time.sleep(3)
        
        return results

def main():
    """Main function to demonstrate usage"""
    # Example queries that might trigger AI Overview
    queries = [
        "what is artificial intelligence",
        "how does machine learning work",
        "what is deep learning",
        "what is ChatGPT",
        "what are neural networks",
        "how does AI image generation work",
        "what is natural language processing",
        "what is computer vision",
        "what is reinforcement learning",
        "what are language models"
    ]
    
    scraper = GoogleAIOverviewScraper(headless=False)
    
    try:
        # Start the browser
        scraper.start()
        
        # Scrape AI Overviews
        results = scraper.scrape_multiple_queries(queries)
        
        # Save results to JSON file
        output_file = 'ai_overview_results.json'
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=4, ensure_ascii=False)
        
        print(f"\nResults saved to {output_file}")
        
        # Print summary
        print("\n=== Summary ===")
        for result in results:
            status = "Found" if result['ai_overview_present'] else "Not Found"
            print(f"Query: {result['query']}")
            print(f"AI Overview: {status}")
            if result['ai_overview_present']:
                print(f"Text length: {len(result['ai_overview_text'])} characters")
                print(f"Sources: {len(result['ai_overview_sources'])}")
            print("-" * 50)
    
    finally:
        # Close the browser
        scraper.close()

if __name__ == "__main__":
    main()

For a quick, high-level view of everything this script does, just look at this flowchart.

High-level flowchart of AI overview scraping

The script starts by importing the necessary Python libraries:

from playwright.sync_api import sync_playwright
import json
import time
import os
import random
from typing import Dict, List
  • The playwright.sync_api helps with browser automation—that’s the main engine. 
  • The json module handles saving our scraped results.
  • The time and random modules are used to create those small, unpredictable delays between actions, which helps us avoid being detected.
  • The os module helps us manage files and folders.
  • The typing module makes the code cleaner by letting us specify the data types, like Dict (for dictionaries) and List (for lists). 

Next, initialize the scraper class, which configures the browser and defines exactly how the data will be pulled from the page.

Initializing the Scraper Class

The scraper uses a class named GoogleAIOverviewScraper to keep the code neat and organized.

class GoogleAIOverviewScraper:
    def __init__(self, headless: bool = False):
        """
        Initialize the scraper
        
        Args:
            headless: Whether to run browser in headless mode
        """
        self.headless = headless
        self.playwright = None
        self.context = None
        self.page = None

The constructor of the class initializes four variables:

  1. self.headless: stores a boolean value that determines whether the Playwright browser starts in headless mode
  2. self.playwright: stores the Playwright instance that manages the entire browser automation framework
  3. self.context: stores the browser context that isolates cookies, cache, etc. for the scraping session
  4. self.page: stores the page object, which represents an individual browser tab

Starting the Browser

To launch the browser, the script creates a “persistent context,” which stores the browser profile on disk between runs. The launch options also set several anti-detection flags and configurations that help the browser mimic a real user and avoid being flagged as automation.

def start(self):
    """Start the browser"""
    user_data_dir = os.path.join(os.getcwd(), "user_data")
    if not os.path.exists(user_data_dir):
        os.makedirs(user_data_dir)
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
    self.playwright = sync_playwright().start()
    self.context = self.playwright.chromium.launch_persistent_context(
        headless=self.headless,
        user_agent=user_agent,
        user_data_dir=user_data_dir,
        viewport={'width': 1920, 'height': 1080},
        java_script_enabled=True,
        locale='en-US',
        timezone_id='America/New_York',
        permissions=['geolocation'],
        # Mask automation
        bypass_csp=True,
        ignore_https_errors=True,
        channel="chromium",
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-dev-shm-usage',
            '--no-sandbox',
            '--disable-gpu',
            '--disable-setuid-sandbox'
        ]
    )
    self.page = self.context.new_page()

In this code,

  • The launch_persistent_context() method saves the browser’s state (like cookies and settings) in a special folder. This means the browser remembers things between each time you run the script.
  • The user agent is set to make the browser look like Chrome running on Windows. This is a very common setup, which makes it less suspicious.
  • The viewport is set to 1920×1080 to mimic a standard desktop monitor.
  • The bypass_csp flag temporarily disables security rules (Content Security Policy) that some websites use to specifically block automated tools.
  • The disable-blink-features=AutomationControlled argument hides the navigator.webdriver flag that most automation tools expose, making it much harder for websites to know they’re talking to Playwright (a quick way to verify this is sketched below).
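
To confirm the masking works in your setup, you can evaluate the flag from inside the automated page. The following is a minimal sketch that assumes the GoogleAIOverviewScraper class from this guide has already been defined and started:

scraper = GoogleAIOverviewScraper(headless=False)
scraper.start()
scraper.page.goto("https://www.google.com")
# With AutomationControlled disabled, navigator.webdriver should not be True
webdriver_flag = scraper.page.evaluate("() => navigator.webdriver")
print(f"navigator.webdriver: {webdriver_flag}")
scraper.close()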

Extracting the AI Overview

To extract AI overviews, the script

  1. Goes to Google’s search interface
  2. Waits for content to load
  3. Attempts to locate and expand the AI Overview section

It uses wait_until='networkidle' so navigation only completes once network activity has settled, giving Google’s JavaScript time to finish rendering.

def search_and_extract_ai_overview(self, query: str) -> Dict:
    """Search Google and extract AI Overview if present"""
    
    result = {
        'query': query,
        'ai_overview_present': False,
        'ai_overview_text': None,
        'ai_overview_sources': [],
        'timestamp': time.strftime('%Y-%m-%d %H:%M:%S')
    }
    
    try:
        # Navigate to Google
        search_url = f"https://www.google.com/search?q={query.replace(' ', '+')}"
        self.page.goto(search_url, wait_until='networkidle')
        
        # Wait a bit for AI Overview to load
        time.sleep(2)

Here,

  1. The result dictionary starts with empty values, establishing the final structure for the extracted data.
  2. The search URL format replaces spaces with plus signs, following Google’s URL conventions (a more robust encoding option is sketched after this list).
  3. The two-second delay after navigation ensures AI Overview content has loaded before the code begins extraction.
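
The simple replace() call works for plain keyword queries, but it breaks on characters such as &, +, or ?. If your queries might contain those, URL-encoding the whole query is safer. Here is a small sketch using Python’s standard urllib.parse module (the example query is illustrative):

from urllib.parse import quote_plus

query = "what is C++ & how is it used?"
# quote_plus percent-encodes special characters and turns spaces into '+'
search_url = f"https://www.google.com/search?q={quote_plus(query)}"
print(search_url)
# e.g. https://www.google.com/search?q=what+is+C%2B%2B+%26+how+is+it+used%3F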

Expanding Collapsed Overviews

Google often hides the full AI Overview behind a “Show more” button. Inspecting it shows that it is actually a div element with the role “button.”

Inspecting the “Show more” button

This part of the code simply tries to find and click that button to show the full text. It also includes checks (error handling) for two scenarios: if the button doesn’t exist (because the overview is already fully open) or if the button can’t be clicked.

        # Click "Show more" if it exists to expand the AI overview
        try:
            show_more_button = self.page.get_by_role('button', name='Show more')
            # Locators are always truthy, so check whether any match exists
            if show_more_button.count() > 0:
                self.page.wait_for_timeout(random.randrange(1000, 3000))
                show_more_button.click()
                # Wait a moment for the content to expand
                time.sleep(1)
        except Exception as e:
            pass

This code

  • Uses the get_by_role() method to find the “Show more” button. This is a much more stable way to locate it because it relies on accessibility names, which don’t change as often as complicated website code (CSS selectors).
  • Adds a random delay of 1 to 3 seconds before clicking. This pause is crucial because it simulates the variable time a real person might take, making the automation much harder to detect.
  • Continues execution if the “Show more” button isn’t found (usually because the AI Overview is already fully expanded or simply missing).

Locating and Extracting Text Content

After the “Show more” button is handled, the scraper looks for the AI Overview content using the .kCrYT selector. This class is what Google currently uses to mark the main container for its AI-generated content.

Inspecting the AI overviews

However, the page has multiple elements with the class kCrYT, including one that holds the ‘AI Overview’ heading. The code grabs all of them and picks the one right after the heading.

            ai_overview_selector = '.kCrYT'
            ai_overview_element = None

            try:
                elements = self.page.query_selector_all(ai_overview_selector)

                # The element right after the "AI Overview" heading holds the summary
                for index, element in enumerate(elements):
                    if "AI Overview" in element.text_content() and index + 1 < len(elements):
                        ai_overview_element = elements[index + 1]
                        break

            except Exception as e:
                print(e)

            if ai_overview_element:
                result['ai_overview_present'] = True

                # Extract the main AI Overview text
                text_content = ai_overview_element.inner_text()
                result['ai_overview_text'] = text_content.strip()

Once the code finds the correct content, the inner_text() method is used to grab all the visible text. This gives you the full summary Google generated, but without any messy HTML tags.

The .strip() method then quickly cleans things up by removing any extra spaces at the beginning or end of the text. This is necessary because web pages often include this kind of unwanted whitespace.

Capturing Source Citations

The AI Overview usually has citations that link back to its sources. The scraper grabs these by searching for all the anchor tags (the hyperlinks) inside the main overview box.

It collects two things for each source: the URL destination (the actual web address) and the display text (the title you see on the screen). This is how you get all the source documents.

                # Try to extract sources/citations
                source_links = ai_overview_element.query_selector_all('a[href]')
                sources = []

                for link in source_links:
                    href = link.get_attribute('href')
                    text = link.inner_text().strip()

                    if href and text:
                        sources.append({
                            'text': text,
                            'url': href
                        })

                result['ai_overview_sources'] = sources
            else:
                print(f"No AI Overview found for query: {query}")

The code includes a check, if href and text, to make sure only valid sources are saved. This avoids storing any links that are empty or broken.

Each valid source is then saved as a small dictionary. This keeps the clickable text and the destination URL together. This format makes it easy later on to analyze where the AI Overview is getting its information or to compare the source titles to the summary text.
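
As an example of that kind of analysis, here is a small sketch that counts which domains the AI Overviews cite most often. It assumes the scraper has already written ai_overview_results.json in the format shown above; note that some citation hrefs can be relative or Google redirect URLs, so treat the counts as approximate.

import json
from collections import Counter
from urllib.parse import urlparse

# Load the results file produced by the scraper
with open('ai_overview_results.json', encoding='utf-8') as f:
    results = json.load(f)

domains = Counter()
for result in results:
    for source in result.get('ai_overview_sources', []):
        domain = urlparse(source['url']).netloc
        if domain:
            domains[domain] += 1

# Show the ten most frequently cited domains
for domain, count in domains.most_common(10):
    print(f"{domain}: {count}")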

Performing Multiple Query Web Scraping

To run the scraper for lots of different searches, use the scrape_multiple_queries() method. This part of the code handles the whole process, making sure there’s a delay between each search.

This delay is really important because it prevents your scraper from running too fast, which helps you avoid getting rate-limited or looking too suspicious to Google.

def scrape_multiple_queries(self, queries: List[str]) -> List[Dict]:
    """
    Scrape AI Overviews for multiple queries
    
    Args:
        queries: List of search queries
    
    Returns:
        List of dictionaries containing results for each query
    """
    results = []
    
    for i, query in enumerate(queries):
        print(f"Processing query {i+1}/{len(queries)}: {query}")
        result = self.search_and_extract_ai_overview(query)
        results.append(result)
        
        # Add delay between requests to avoid rate limiting
        if i < len(queries) - 1:
            time.sleep(3)
    
    return results

There’s a three-second pause between each search query. This delay is there to mimic natural browsing behavior and significantly lowers the chance of Google slowing down (throttling) your connection.
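
If you want the gaps between queries to look less mechanical, you could randomize the pause instead of using a fixed three seconds. A minimal sketch (the 2 to 6 second range is an arbitrary choice, not something the original code requires):

import random
import time

# Sleep for a random interval between queries instead of a fixed 3 seconds
time.sleep(random.uniform(2, 6))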

The script also provides progress output as it runs. This is helpful for monitoring long scraping jobs, so you always know the current processing status.

Cleanup and Resource Management

Proper cleanup is important to make sure memory is released and all processes stop correctly.

The close method takes care of this by systematically shutting down all Playwright resources in the exact reverse order of how they were started. This ensures everything closes cleanly.

def close(self):
    """Close the browser"""
    if self.page:
        self.page.close()
    if self.context:
        self.context.close()
    if self.playwright:
        self.playwright.stop()

Main Execution and Result Export

The main function shows you exactly how to run the scraper, from starting it up to exporting the final results. This structured format gives you a repeatable template you can use for your full-scale, production scraping jobs.

def main():
    """Main function to demonstrate usage"""
    # Example queries that might trigger AI Overview
    queries = [
        "what is artificial intelligence",
        "how does machine learning work",
        "what is deep learning",
        "what is ChatGPT",
        "what are neural networks",
        "how does AI image generation work",
        "what is natural language processing",
        "what is computer vision",
        "what is reinforcement learning",
        "what are language models"
    ]
    
    scraper = GoogleAIOverviewScraper(headless=False)
    
    try:
        # Start the browser
        scraper.start()
        
        # Scrape AI Overviews
        results = scraper.scrape_multiple_queries(queries)
        
        # Save results to JSON file
        output_file = 'ai_overview_results.json'
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=4, ensure_ascii=False)
        
        print(f"\nResults saved to {output_file}")
        
        # Print summary
        print("\n=== Summary ===")
        for result in results:
            status = "Found" if result['ai_overview_present'] else "Not Found"
            print(f"Query: {result['query']}")
            print(f"AI Overview: {status}")
            if result['ai_overview_present']:
                print(f"Text length: {len(result['ai_overview_text'])} characters")
                print(f"Sources: {len(result['ai_overview_sources'])}")
            print("-" * 50)
    
    finally:
        # Close the browser
        scraper.close()

if __name__ == "__main__":
    main()

The results are saved in JSON format using UTF-8 encoding. This is important because it saves all international characters correctly and makes sure the data works with all your analysis tools later.
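
If you prefer working in a spreadsheet, the JSON output flattens easily into CSV. This is a sketch using only the standard library; the file names follow this guide, and the column selection is illustrative:

import csv
import json

with open('ai_overview_results.json', encoding='utf-8') as f:
    results = json.load(f)

with open('ai_overview_results.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['query', 'ai_overview_present', 'num_sources', 'text_length'])
    for r in results:
        writer.writerow([
            r['query'],
            r['ai_overview_present'],
            len(r.get('ai_overview_sources') or []),
            len(r.get('ai_overview_text') or ''),
        ])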

Code Limitations

This code is a good start to scrape AI search results, but there are a few limits you need to consider:

  • Maintenance Headache: Google’s layout and code are constantly changing, and class names and features get restructured regularly. This means your code will require ongoing maintenance and monitoring to keep working.
  • Rate Limits & Blocking: Google uses advanced bot detection and rate limiting. Even with our anti-detection steps, too many requests from one IP will often trigger CAPTCHAs or temporary blocks. This makes scraping at a large scale very difficult.
  • Performance Costs: Browser automation uses a lot of your computer’s memory and CPU. Trying to run thousands of queries isn’t practical unless you use powerful distributed infrastructure and a smart queue management system.

Why Choose a Web Scraping Service

Building a custom AI Overview scraper for your enterprise is a hidden tax on engineering talent, requiring constant maintenance, debugging, and resource allocation away from innovation.

ScrapeHero, a fully managed, enterprise-grade web scraping service, eliminates this overhead entirely. Our self-healing infrastructure, massive-scale proxy network, and AI-powered quality checks turn the complex challenge of data extraction into a simple, reliable API call.

Contact ScrapeHero today to start your solution and free your engineering team from the burden of scraping infrastructure.

Frequently Asked Questions

Why does the scraper fail to find the AI Overview even when Google displays one in my browser?

Google sometimes hides AI Overviews when it detects automated browsers like Playwright. Anti-bot systems can still spot these frameworks even when you use anti-detection flags. To troubleshoot this, you need to manually check if the overview is appearing consistently when you browse normally. Also, check the browser console for any JavaScript errors, and make sure the .kCrYT selector your code uses is still the correct one.
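
One way to watch the browser console without opening DevTools is to attach a console listener to the Playwright page. A quick sketch, assuming the scraper object from this guide; attach the listener before navigating:

# Print every console message the page emits
scraper.page.on("console", lambda msg: print(f"[console:{msg.type}] {msg.text}"))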

How can I increase extraction success rate for large query volumes?

You can use a few techniques to increase success rate:
1. Delay on Blocks: If blocked, pause requests and gradually increase the delay (a simple backoff sketch follows this list).
2. Rotate Proxies: Use multiple residential proxies to spread traffic geographically.
3. Change Identity: Randomize user agents and browser settings to evade detection.
4. Go Off-Peak: Schedule scraping during low-traffic hours when limits are softer.
5. Monitor Frequencies: Watch response patterns to find the fastest safe request speed.
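
For the first point, a simple exponential backoff loop is usually enough. This is a sketch, not part of the original scraper; the retry counts, delays, and the block-detection check are placeholder assumptions you would adapt to what Google actually returns:

import random
import time

def scrape_with_backoff(scraper, query, max_retries=4, base_delay=5):
    """Retry a query with exponentially increasing delays when a block is suspected."""
    for attempt in range(max_retries):
        result = scraper.search_and_extract_ai_overview(query)
        # Placeholder check: treat an error mentioning Google's "sorry" page as a block
        blocked = 'error' in result and 'sorry' in result['error'].lower()
        if not blocked:
            return result
        delay = base_delay * (2 ** attempt) + random.uniform(0, 3)
        print(f"Possible block on '{query}', waiting {delay:.0f}s before retry {attempt + 1}")
        time.sleep(delay)
    return result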

Can this scraper extract AI Overviews for languages other than English?

Yes, you can change the target language:
1. Edit the Code: In the launch_persistent_context() method, change the locale (e.g., set locale='fr-FR' for French) and update the timezone_id for the right region, as shown in the sketch after this list.
2. Keep in mind: The AI Overview language always follows the language of your search query.
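
As a concrete illustration of the first point, a French-language setup might look like the sketch below. Only the locale, timezone_id, and the optional hl URL parameter differ from the guide’s code; the example query is illustrative:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Same persistent-context approach as the guide's start() method,
    # with locale and timezone swapped for France
    context = p.chromium.launch_persistent_context(
        user_data_dir="user_data",
        headless=False,
        locale='fr-FR',
        timezone_id='Europe/Paris',
        viewport={'width': 1920, 'height': 1080},
    )
    page = context.new_page()
    # The optional hl parameter hints the interface language to Google
    page.goto("https://www.google.com/search?q=comment+fonctionne+le+machine+learning&hl=fr")
    context.close()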
