Concurrency and Parallelism: How to make web scraping faster

Want to make web scraping faster? Read on to learn about two methods: concurrency and parallelism. Both reduce execution time, but they work in different ways.

This article covers both methods with examples in Python.

Concurrency vs. Parallelism

Concurrency and parallelism improve performance differently:

  • Concurrency is when a single processor manages multiple tasks by switching between them.
  • Parallelism is when multiple processors run tasks at the same time, each handling one task.

The following table shows how concurrency and parallelism differ.

| Concurrency | Parallelism |
| --- | --- |
| Requires only a single processor | Requires multiple processors |
| Preferred for less CPU-intensive tasks where there are idle periods | Preferred for CPU-intensive tasks that are independent |
| Example: browsing the web while a task runs in the background. The processor rapidly switches between web browsing and performing the background task. | Example: GPU processing. A GPU has many cores, each independently processing a part of an image or a video. |
| Concurrency uses threads that share memory. | Parallelism uses processes that have separate memory. |

Task execution for a program not using concurrency or parallelism

Task execution for a program using concurrency

Task execution for a program using parallelism
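
To make the distinction concrete, here is a minimal sketch that simulates an I/O-bound task with time.sleep() (a stand-in for waiting on a server, not real scraping) and runs it first sequentially and then on a thread pool. The names fake_request, run_sequential, and run_threaded are illustrative and are not part of the tutorial's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(task_id: int) -> int:
    """Simulate waiting on a server; the CPU is idle while sleeping."""
    time.sleep(0.1)
    return task_id * 2


def run_sequential(task_ids):
    """Run the tasks one after another and time the whole batch."""
    start = time.perf_counter()
    results = [fake_request(t) for t in task_ids]
    return results, time.perf_counter() - start


def run_threaded(task_ids, workers=5):
    """Run the tasks on a thread pool so that the waits overlap."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map returns results in the same order as the inputs
        results = list(executor.map(fake_request, task_ids))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    ids = list(range(5))
    seq_results, seq_time = run_sequential(ids)
    thr_results, thr_time = run_threaded(ids)
    print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

Because each simulated task spends its time waiting rather than computing, the threaded run should finish in roughly the time of one task instead of the sum of all of them.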

How to Make Web Scraping Faster Using Concurrency and Parallelism

You can use concurrency and parallelism in any programming language. Here, you will read how to make web scraping faster in Python.

Python has several modules for implementing concurrency and parallelism to make web scraping faster. This tutorial uses two of them:

  • ThreadPoolExecutor: This class from the concurrent.futures module helps you with concurrency. It uses threads called workers to manage multiple tasks.
  • multiprocessing: This module helps you achieve parallelism.

The Example

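Here is code that scrapes the details of 100 companies. It takes about 1 minute to execute on Colab. The listing shows the overall structure only: the bodies of get_response() and write_csv() are omitted, and extract_company_details() is not shown.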

Screenshot showing code execution time without concurrency or parallelism

import csv
import requests
from lxml import html


def get_response(company_name: str) -> requests.Response:
    """
    Retrieves html response by sending requests

    Args:
        company_name(str): Company name. This is used as a search term in the Google search URL

    Returns:
        requests.Response
    """


def scrape_data(input_company_names: list, output_file: str):
    """
    Reads list of company names from the input file, extracts company data, and writes to CSV file

    Args:
        input_company_names(list): List of company names
        output_file(str): Output file name
    """
    company_details_list = []

    for company in input_company_names:
        response = get_response(company)
        # If the response fails, even after retries, get_response won't return any response
        if not response:
            print(f'Invalid response for company name {company}')
            continue
        company_details = extract_company_details(response)
        company_details['input_company_name'] = company
        company_details_list.append(company_details)
    write_csv(output_file, company_details_list)


def write_csv(file_name: str, company_details_list: list):
    """
    Writes scraped data to CSV file

    Args:
        file_name: output file name
        company_details_list: list of scraped company details
    """


if __name__ == "__main__":
    company_names = [
        "Facebook", "Twitter", "Microsoft", "Amazon", "Apple", "Netflix", "Spotify", "Adobe", "Salesforce", "Oracle",
        "IBM", "Intel", "NVIDIA", "Samsung Electronics", "LG Electronics", "Sony", "Panasonic", "Dell Technologies",
        "HP Inc.", "Cisco Systems", "Zoom Video Communications", "Dropbox", "Slack Technologies", "Square", "Shopify",
        "Airbnb", "Uber", "Lyft", "Pinterest", "Snap Inc.", "Reddit", "TikTok", "ByteDance", "Alibaba Group", "Tencent",
        "Baidu", "Xiaomi", "Huawei Technologies", "Lenovo Group", "ZTE Corporation", "Asus", "Acer", "Rakuten", "SoftBank Group",
        "NTT Data", "SAP", "Siemens AG", "Philips", "Logitech", "Fitbit", "Garmin", "GoPro", "AMD", "Micron Technology",
        "Western Digital", "Seagate Technology", "Kingston Technology", "Corsair Gaming", "Razer Inc.", "AsusTek Computer",
        "MSI", "Gigabyte Technology", "Palit Microsystems", "Zotac", "EVGA Corporation", "Intel Corporation", "Qualcomm",
        "Broadcom Inc.", "Texas Instruments", "ARM Holdings", "TSMC", "SK Hynix", "MediaTek", "Infineon Technologies",
        "STMicroelectronics", "NXP Semiconductors", "Renesas Electronics", "Analog Devices", "Microchip Technology",
        "ON Semiconductor", "Kyocera", "Sharp Corporation", "Toshiba", "Fujitsu", "NEC Corporation", "Hitachi",
        "Kaspersky Lab", "Symantec", "McAfee", "Trend Micro", "Avast", "Fortinet", "Check Point Software Technologies",
        "Palo Alto Networks", "SolarWinds", "Atlassian", "Twilio", "Akamai Technologies", "Cloudflare", "Fastly"
    ]

    output_file_name = 'company_details.csv'
    scrape_data(company_names, output_file_name)
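
The comment in scrape_data() says that get_response() returns nothing when a request fails even after retries. The body of get_response() is not shown, so the following is only a sketch of how such retry behavior could look. The helper name call_with_retries and its parameters are hypothetical, and the fetch callable stands in for whatever request the real function sends.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    fetch: Callable[[], T],
    max_retries: int = 3,
    delay_seconds: float = 0.5,
) -> Optional[T]:
    """Call fetch() until it succeeds or the retries run out.

    Returns the fetched value, or None if every attempt raised an exception.
    This is a hypothetical helper, not the tutorial's get_response().
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception as error:  # a real scraper would catch narrower errors
            print(f"Attempt {attempt} failed: {error}")
            if attempt < max_retries:
                time.sleep(delay_seconds)
    return None
```

With the requests library, fetch could be something like lambda: requests.get(url, timeout=10), possibly combined with a status-code check. That wiring is an assumption, since the original function body is not available.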

With ThreadPoolExecutor, the same job takes only 14 seconds. Here is how you import ThreadPoolExecutor.

from concurrent.futures import ThreadPoolExecutor, as_completed

Screenshot showing code execution time with concurrency.

You only have to modify the scrape_data() function to implement concurrency. It now contains a nested helper function, process_company(), which ThreadPoolExecutor runs in its worker threads.

Each worker executes a task, calling the process_company() function with the company name as the argument. You can set the maximum number of workers, which sets the maximum number of concurrent operations.

Here, you can see that the maximum number of workers is 10; the code can scrape the details of 10 companies concurrently.

def scrape_data(input_company_names: list, output_file: str):
    """
    Reads list of company names, extracts company data concurrently and writes to CSV file

    Args:
        input_company_names(list): List of company names
        output_file(str): Output file name
    """
    company_details_list = []

    # Function to process each company name
    def process_company(company):
        response = get_response(company)
        if response:
            details = extract_company_details(response)
            details['input_company_name'] = company
            return details
        else:
            print(f'Invalid response for company name {company}')
            return None


    # Using ThreadPoolExecutor for concurrent requests (at most 10 at a time)
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_company = {executor.submit(process_company, company): company for company in input_company_names}
        for future in as_completed(future_to_company):
            company_details = future.result()
            if company_details:
                company_details_list.append(company_details)
    # Writing to CSV
    write_csv(output_file, company_details_list)
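
The submit() and as_completed() pattern used in scrape_data() can be tried in isolation. The sketch below replaces the network call with a sleep, so fake_lookup and lookup_all are illustrative stand-ins rather than the tutorial's functions. It also uses the future-to-name mapping to report which input failed, something the listing above builds but does not use.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_lookup(name: str, delay: float) -> dict:
    """Stand-in for process_company(): wait, then return a result dict."""
    time.sleep(delay)
    return {"input_company_name": name, "name_length": len(name)}


def lookup_all(delays: dict, max_workers: int = 3) -> list:
    """Submit one task per name and collect results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the name it was submitted for
        future_to_name = {
            executor.submit(fake_lookup, name, delay): name
            for name, delay in delays.items()
        }
        for future in as_completed(future_to_name):
            name = future_to_name[future]
            try:
                results.append(future.result())
            except Exception as error:
                # future.result() re-raises any exception from the worker
                print(f"{name} failed: {error}")
    return results


if __name__ == "__main__":
    delays = {"slow": 0.3, "fast": 0.05, "medium": 0.15}
    for row in lookup_all(delays):
        print(row)
```

Results arrive in completion order rather than submission order, so the output list is generally not in the same order as the input names.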

With parallelism using multiprocessing, the same task takes around 29 seconds, which is still faster than the original code. You need to use the Pool object from the multiprocessing module to implement parallelism.

import os  # needed for os.cpu_count() in scrape_data()
from multiprocessing import Pool

Screenshot showing code execution time with Parallelism

Again, you must modify the original scrape_data() to achieve parallelism. You also must add another function, process_company(), outside scrape_data(). It has to be defined at module level because multiprocessing pickles the function to send it to the worker processes, and nested functions cannot be pickled.

The process_company() function takes a company name and returns its details. The multiprocessing pool runs this function in several worker processes, so multiple calls of process_company() execute simultaneously.

def process_company(company):
    """
    Process a single company, fetching and extracting data.

    Args:
        company (str): Company name.

    Returns:
        dict: Extracted company data or None if failed.
    """
    response = get_response(company)
    if response:
        details = extract_company_details(response)
        details['input_company_name'] = company
        return details
    else:
        print(f'Invalid response for company name {company}')
        return None


def scrape_data(input_company_names: list, output_file: str):
    """
    Reads list of company names, extracts company data in parallel using multiprocessing and writes to CSV file

    Args:
        input_company_names(list): List of company names
        output_file(str): Output file name
    """
    company_details_list = []

    # Setting up multiprocessing Pool
    with Pool(processes=os.cpu_count()) as pool:
        company_details_list = pool.map(process_company, input_company_names)

    # Removing None entries
    company_details_list = [details for details in company_details_list if details]

    # Writing to CSV
    write_csv(output_file, company_details_list)
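
One reason process_company() sits outside scrape_data() in the multiprocessing version is that Pool sends the function to its worker processes by pickling it, and pickle can only reference functions by their module-level name. The following sketch checks this without starting any processes. The function names are illustrative and not part of the tutorial's code.

```python
import pickle


def top_level_task(company: str) -> str:
    """Defined at module level, so pickle can refer to it by name."""
    return company.upper()


def make_nested_task():
    """Return a function defined inside another function."""
    def nested_task(company: str) -> str:
        return company.upper()
    return nested_task


def can_pickle(obj) -> bool:
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("top-level function picklable:", can_pickle(top_level_task))
    print("nested function picklable:", can_pickle(make_nested_task()))
```

Threads share memory with the main program, which is why the ThreadPoolExecutor version can keep process_company() nested inside scrape_data().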

Which Option is Better for Making Web Scraping Faster?

The fastest way to scrape a website depends on the use case and available resources.

Here, the code using concurrency was faster than the code using parallelism. The reason is that web scraping involves a lot of waiting.

You must wait a little before the server responds with the required data. During this time, the processor is available for another task. Concurrency leverages this window.

However, the code that uses parallelism requires a separate processor for each simultaneous task. The code shown in the tutorial ran on Colab with only 2 processors, so only two tasks could run at a time. Therefore, the execution is substantially less efficient than using a ThreadPoolExecutor with 10 workers.

That means:

  • Use concurrency when you have limited resources for web scraping.
  • Implement parallelism when you have enough processors and the tasks are CPU intensive.
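
The timings reported in this tutorial can be sanity-checked with a rough model: if every company takes about the same time and tasks run in full batches, the wall time is the number of batches multiplied by the time per task. The sketch below applies that idealized model, which ignores scheduling overhead, uneven response times, and rate limiting. The function name ideal_wall_time is illustrative.

```python
import math


def ideal_wall_time(n_tasks: int, workers: int, seconds_per_task: float) -> float:
    """Estimate wall time if tasks run in full batches of `workers` at a time."""
    batches = math.ceil(n_tasks / workers)
    return batches * seconds_per_task


if __name__ == "__main__":
    # About 1 minute for 100 companies implies roughly 0.6 s per company
    per_task = 60 / 100
    for label, workers in [("sequential", 1), ("2 processes", 2), ("10 threads", 10)]:
        estimate = ideal_wall_time(100, workers, per_task)
        print(f"{label}: about {estimate:.0f} s")
```

Under these assumptions, two processes come out at about 30 seconds, close to the 29 seconds observed. Ten threads come out at about 6 seconds, below the 14 seconds observed, which suggests that the threaded run is held back by costs the model leaves out.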

Challenges in Making Web Scraping Faster

The three main challenges you will face are

  • Code complexity: The code will become quite complex, increasing the probability of errors and difficulty in debugging.
  • Rate limiting: Websites use rate-limiting measures to prevent anyone from accessing their data above a specific rate.
  • Lack of Resources: Programming alone can’t speed up web scraping. The lack of resources will limit it, especially in the case of parallelism.
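
One way to stay under a site's rate limit while still using several threads is to make every worker pass through a shared throttle before sending a request. The class below is a minimal sketch of that idea. The name Throttle and the min_interval parameter are assumptions for illustration, and a suitable interval depends on the target site.

```python
import threading
import time


class Throttle:
    """Space out calls so that at most one starts per `min_interval` seconds."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        """Block until this caller's reserved time slot arrives."""
        with self._lock:
            now = time.monotonic()
            # Reserve the next free slot, then push the following slot back
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        # Sleep outside the lock so other threads can reserve later slots
        delay = slot - now
        if delay > 0:
            time.sleep(delay)
```

A single Throttle instance could be created before the thread pool starts, with throttle.wait() called at the top of process_company(), so that all workers share the same request budget.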

Are There Other Tips for Fast Web Scraping?

We only discussed concurrency and parallelism here. However, consider the following tips to improve performance while web scraping.

  • Use headless browsers for web scraping to reduce resource consumption.
  • Don’t launch a new browser context for every start page.
  • Avoid downloading all the resources to save bandwidth.
  • Use browsers to scrape websites only if HTTP request methods fail.

Read “How to make web scraping faster” to learn more.

Conclusion

Using concurrency and parallelism can speed up web scraping. However, you must be well-versed in coding and have enough resources to support it.

If you would rather not write the code yourself or invest in resources for faster web scraping, try ScrapeHero Services.

ScrapeHero is a full-service web scraping service provider. Our infrastructure can scrape data at 3000 pages per second. We have the resources and skills to provide enterprise-grade scraping services tailored to your needs.
