Ultimate Guide to Scrape Websites With the Python Request Library


Did you know that Requests is one of the most popular libraries for web scraping with Python? That’s because it provides intuitive methods for handling HTTP requests, the most common way to access websites. Want to know more? Here’s an in-depth tutorial showing how to scrape websites with the Python requests library.

Scrape Websites with the Python Requests Library: An Overview

While you can’t scrape websites with the Python requests library alone, you can use it to get the HTML source code of web pages. Web scraping with Python requests involves two core steps:

  • Sending GET requests
  • Parsing the response text

After extracting the response text—which contains the HTML source code—you’ll need to use parsing libraries, like lxml, to pull the data points you need.
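
Here’s a minimal sketch of those two steps against the scrapeme.live shop page used later in this tutorial; the XPath assumes the standard WooCommerce markup for product titles on listing pages, so treat it as an illustration rather than a drop-in snippet.

import requests
from lxml import html

# Step 1: send a GET request and grab the HTML source
response = requests.get("https://scrapeme.live/shop/")

# Step 2: parse the response text and pull data points out with XPath
parser = html.fromstring(response.text)
# assumes the default WooCommerce class for product titles on listing pages
product_names = parser.xpath('//h2[contains(@class, "woocommerce-loop-product__title")]/text()')
print(product_names)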

Want to start scraping with Python requests right away? Read this tutorial on web scraping YouTube with Python requests.

Setting Up The Environment

Since the requests library is external, you need to install it using Python’s package manager pip. You also need a parsing library; here, the code uses lxml, which is also an external library and is installable with pip.

pip install requests lxml

Want to know about other Python libraries for web scraping? Read this article on the best Python libraries for data extraction.

Writing the Code

This tutorial scrapes the website scrapeme.live, which is a mock website used for scraping experiments. Here’s the workflow to extract data with Python:

  1. Navigate to the website
  2. Locate the URLs of all the products by navigating through the listing pages
  3. Extract these details from each product page:
    • Name
    • Description
    • Price
    • Stock
    • Image URL
    • Product URL
  4. Save the collected data to a CSV file

Flowchart showing the scraping workflow

Now, let’s implement this workflow.  


Import the Required Packages

Start by importing the required packages. Besides requests and lxml mentioned above, the code also imports the csv module, which is used to save the extracted data to a CSV file.

import requests
from lxml import html
import csv

Send a Request to the Website

Send an HTTP request to https://scrapeme.live/shop using the get() method. Pass appropriate headers with the request so that it appears to come from a regular browser.

url = "https://scrapeme.live/shop/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5"
}
response = requests.get(url, headers=headers)

Check the Response Status Code

Even with headers, the target server may detect your scraper and send an error message. You can check whether or not the request was successful by checking if the status code is 200.

def verify_response(response):
    return response.status_code == 200

If the status code is not 200, you may retry; wrap the request in a small function that retries a few times in a loop.

def send_request(url):
    max_retry = 3
    while max_retry >= 1:
        response = requests.get(url, headers=headers)
        if verify_response(response):
            return response
        max_retry -= 1

Sometimes, the website may block your IP; you need to use proxies in that case. Read this article on using proxies and rotating IP addresses.
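
As a quick illustration, here’s how a proxy can be passed to requests.get(); the proxy address and credentials below are placeholders, not a working proxy:

# placeholder proxy address and credentials; substitute your own
proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}
response = requests.get(url, headers=headers, proxies=proxies)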

Get the Product URLs

If the response is valid, move on to extracting the data points. 

First, get URLs from the target page; you need to parse the response to do so.

The code uses lxml to parse the response text and uses XPaths to locate the data points. You can figure out which XPaths to use by looking at the HTML code of the page.

 Inspect tool showing the HTML code of the product link

From the screenshot, it is clear that the 'a' node, which has the class "woocommerce-LoopProduct-link woocommerce-loop-product__link", contains the URL to the product page.

Since the 'a' node sits under an 'li' node, its XPath is //li/a[contains(@class, "product__link")]. You can then get the URL from the href attribute of that node:

from lxml import html
parser = html.fromstring(response.text)
product_urls = parser.xpath('//li/a[contains(@class, "product__link")]/@href')

Get the Next-Page URL

The scraper extracts the product URLs from multiple pages, so the workflow becomes:

  1. Get the URL of the next-page link
  2. Make a request to the URL
  3. Extract all the product links and store them in a list
  4. Repeat the process

To get the next-page URL, analyze the HTML code and figure out the XPath of the next-page link.

Inspect tool showing the HTML code of the next-page link

from lxml import html
parser = html.fromstring(response.text)
next_page_url = parser.xpath('(//a[@class="next page-numbers"])[1]/@href')[0]

You can now make a request to this next page URL, get the product-page links, and repeat the process with other pages.
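
Put together, the listing-page loop looks roughly like this sketch, which assumes the send_request(), get_next_page_url(), and get_product_urls() helpers defined in the full script further down, and walks the first five listing pages:

listing_page_url = 'https://scrapeme.live/shop/'
product_urls = []
for _ in range(5):
    response = send_request(listing_page_url)
    product_urls.extend(get_product_urls(response))
    listing_page_url = get_next_page_url(response)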

After getting all the product URLs, you can start pulling the name, description, price, stock, and image URLs from the product pages. 

Name

Inspect tool showing the HTML code of the product name

From the image, it is clear that the h1 node contains the name of the product, and the product page has no other h1 node, so an XPath as simple as //h1 would select it; the code below also matches the product_title class to be explicit.

Since the name is the text inside that node, extract it with text():

title = parser.xpath('//h1[contains(@class, "product_title")]/text()')

Description

Inspect tool showing the HTML code of the product description

Here, the product description is inside the node p. You can also see that it is inside the div with the class name substring ‘product-details__short-description’. Collect the text inside it as follows:

description = parser.xpath('//div[contains(@class,"product-details__short-description")]//text()')

Stock

Inspect tool showing the HTML code of the stock count

From the image, it is evident that stock is directly present inside the node p, whose class contains the string ‘in-stock.’ Use this code to collect data from it:

stock = parser.xpath('//p[contains(@class, "in-stock")]/text()')

Price

Inspect tool showing the HTML code of the product price

Here, the price is directly inside the p node that has the class price. Use this code to get the price text of the product (the clean_string() helper is defined in the ‘Clean the Data’ section below):

price = parser.xpath('//p[@class="price"]//text()')
price = clean_string(price)

Image URL

Inspect tool showing the HTML code of the image URL

In the screenshot, the href attribute of the 'a' node is highlighted; this href attribute holds the image URL.

image_url = parser.xpath('//div[contains(@class, "woocommerce-product-gallery__image")]/a/@href')
image_url = clean_string(list_or_txt=image_url, connector=' | ')

Clean the Data

In all the above snippets, parser.xpath() returns a list of strings. These strings may contain extra whitespace, so create a function that joins the list and removes the extra spaces.

def clean_string(list_or_txt, connector=' '):
    if not list_or_txt:
        return None
    return ' '.join(connector.join(list_or_txt).split())

Additionally, create a function to remove the text ‘ in stock’ that you’ll get when extracting the stock number.

def clean_stock(stock):
    stock = clean_string(stock)
    if stock:
        stock = stock.replace(' in stock', '')
        return stock
    else:
        return None
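
As a quick check, here’s what the helpers do with a hypothetical raw XPath result:

raw_stock = ['384 in stock']      # hypothetical value returned by parser.xpath()
print(clean_string(raw_stock))    # prints: 384 in stock
print(clean_stock(raw_stock))     # prints: 384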

Here’s the complete code that uses everything you’ve read so far.

import csv
from lxml import html
import requests

def verify_response(response):
    """
    Verify if we received a valid response or not
    """
    return response.status_code == 200

def send_request(url):
    """
    Send request and handle retries.
    :param url:
    :return: Response we received after sending a request to the URL.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/113.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.5"
    }
    max_retry = 3
    while max_retry >= 1:
        response = requests.get(url, headers=headers)
        if verify_response(response):
            return response
        else:
            max_retry -= 1
    print("Invalid response received even after retrying. URL with the issue is:", url)
    raise Exception("Stopping the code execution as invalid response received.")

def get_next_page_url(response):
    """
    Collect pagination URL.
    :param response:
    :return: next listing page URL
    """
    parser = html.fromstring(response.text)
    next_page_url = parser.xpath('(//a[@class="next page-numbers"])[1]/@href')[0]
    return next_page_url

def get_product_urls(response):
    """
    Collects all product URLs from a listing page response.
    :param response:
    :return: list of URLs. List of product page URLs returned.
    """
    parser = html.fromstring(response.text)
    product_urls = parser.xpath('//li/a[contains(@class, "product__link")]/@href')
    return product_urls

def clean_stock(stock):
    """
    Clean the data stock by removing unwanted text present in it.
    :param stock:
    :return: Stock data. The stock number will be returned by removing the extra string.
    """
    stock = clean_string(stock)
    if stock:
        stock = stock.replace(' in stock', '')
        return stock
    else:
        return None

def clean_string(list_or_txt, connector=' '):
    """
    Clean and fix the list of objects received. We are also removing unwanted white spaces.
    :param list_or_txt:
    :param connector:
    :return: Cleaned string.
    """
    if not list_or_txt:
        return None
    return ' '.join(connector.join(list_or_txt).split())

def get_product_data(url):
    """
    Collect all details of a product.
    :param url:
    :return: All data of a product.
    """
    response = send_request(url)
    parser = html.fromstring(response.text)
    title = parser.xpath('//h1[contains(@class, "product_title")]/text()')
    price = parser.xpath('//p[@class="price"]//text()')
    stock = parser.xpath('//p[contains(@class, "in-stock")]/text()')
    description = parser.xpath('//div[contains(@class,"product-details__short-description")]//text()')
    image_url = parser.xpath('//div[contains(@class, "woocommerce-product-gallery__image")]/a/@href')
    product_data = {
        'Title': clean_string(title), 'Price': clean_string(price), 'Stock': clean_stock(stock),
        'Description': clean_string(description), 'Image_URL': clean_string(list_or_txt=image_url, connector=' | '),
        'Product_URL': url}
    return product_data

def save_data_to_csv(data, filename):
    """
    Save a list of dicts to CSV.
    :param data: Data to be saved to CSV
    :param filename: Filename of csv
    """
    keys = data[0].keys()
    with open(filename, 'w', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=keys)
        writer.writeheader()
        writer.writerows(data)

def start_scraping():
    """
    Starting function.
    """
    listing_page_url = 'https://scrapeme.live/shop/'
    product_urls = list()
    for listing_page_number in range(1, 6):
        response = send_request(listing_page_url)
        listing_page_url = get_next_page_url(response)
        products_from_current_page = get_product_urls(response)
        product_urls.extend(products_from_current_page)
    results = []
    for url in product_urls:
        results.append(get_product_data(url))
    save_data_to_csv(data=results, filename='scrapeme_live_Python_data.csv')
    print('Data saved as CSV')

if __name__ == "__main__":
    start_scraping()

Sending GET Requests Using Cookies

Sometimes, you might need to send cookies to avoid being blocked by the server. Define the cookies in a dict and pass it while making the request.

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5"
}
#define cookies
cookies = {'location': 'New York'}

url = "https://scrapeme.live/shop/"
response = requests.get(url, headers=headers, cookies=cookies)
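
If the site sets cookies itself, you don’t have to copy them by hand; a requests.Session stores any cookies the server sends and resends them on later requests. A minimal sketch:

session = requests.Session()
session.headers.update(headers)

# cookies set by the server during this request are stored on the session
response = session.get("https://scrapeme.live/shop/")
print(session.cookies.get_dict())

# any further session.get() or session.post() calls resend those cookies automatically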

Sending POST Requests

Sometimes, you need to make a POST request to a specific URL to get the data. You can do that using the post() method.

# define the payload as a dict
payload = {"key1": "value1", "key2": "value2"}
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5"
}
url = "https://scrapeme.live/shop/"
response = requests.post(url, headers=headers, json=payload)
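
Note that json=payload sends the payload as a JSON body. If the endpoint expects a regular form submission instead, pass the same dict through the data parameter:

# form-encoded body (Content-Type: application/x-www-form-urlencoded)
response = requests.post(url, headers=headers, data=payload)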

Why Use a Web Scraping Service

While you can use the techniques shown in the tutorial to scrape websites with the Python requests library at a small scale, for larger projects, it’s better to use a web scraping service. 

Otherwise, you have to take care of all the technicalities, including monitoring the website for changes and bypassing anti-scraping measures. Additionally, requests can’t handle dynamic websites, requiring you to use other libraries like Selenium.

If you use a web scraping service like ScrapeHero, you don’t need to worry about all this. 

ScrapeHero is a fully managed web scraping service. We can build enterprise-grade web scrapers and crawlers customized to your data needs. Not only that! We can take care of your entire data pipeline, from extraction to building custom AI solutions.
