Web Scraping With Python: Using the Requests Library

Web scraping is the extraction of publicly available data, such as product names, prices, or other valuable information, from websites using a bot or a crawler.
Even though many programming languages like Ruby, C++, JavaScript, and Java are used for web scraping, the most popular one is Python.

Web scraping in Python is widely adopted due to the language’s versatility and ability to handle complex data extraction processes.
This article gives you a solid foundation for web scraping with Python and Requests. By the end of it, you will have the knowledge to build your own web scrapers, collect website data, and even automate repetitive tasks.

Why Web Scraping With Python?

Python is one of the best choices for web scraping, as it has many mature libraries dedicated to the task. Python's syntax is also easy to learn and understand, since it reads much like plain English.
Scraping web pages with Python is a common trend due to several reasons:

  • Ease of use

Python is a simple and readable programming language that is accessible to both beginners and programming experts. Due to its straightforward syntax, developers can quickly understand the concepts of web scraping.

  • Large and active community

The vast and active developer community of Python continuously contributes to open-source libraries and frameworks. Because of this, there are plenty of resources, tutorials, and code snippets to learn web scraping. You can even solve problems by using this collective knowledge.

  • Abundance of libraries

Python libraries such as BeautifulSoup and LXML are specifically designed for web scraping. These libraries help to parse and navigate HTML and XML documents with their powerful tools.
The libraries also assist us in extracting data from web pages, manipulating HTML structures, and handling various data formats, making web scraping in Python an important topic of discussion.

  • Requests library

Requests is a Python library that enables you to make HTTP requests and handle the responses. It provides a high-level interface for sending HTTP requests such as GET and POST, setting headers, handling cookies, and managing sessions.

  • Data manipulation and analysis

Python libraries like Pandas and NumPy are some of the most powerful and prominent data manipulation and analysis libraries that can be used for processing, cleaning, and analyzing data efficiently.
You can rely on these libraries to filter, sort, aggregate, and visualize the data for data-driven decision-making.

  • Integration with other tools and technologies

Another reason why web scraping with Python is such a popular topic is Python's ability to integrate seamlessly with other web scraping tools and technologies.
Python can be combined with database systems such as MySQL and MongoDB for storing and managing the scraped data. Moreover, it also goes well with the Django or Flask frameworks for building web applications.

The Requests library is one of the most important libraries used for web scraping in Python; in fact, it is consistently among the most downloaded Python packages every week. Requests is a simple HTTP library that lets you send HTTP requests and handle the responses with very little code.
But why Requests for web scraping with Python? Because it provides a high-level interface for making HTTP requests, as the short example below shows.
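
Here is a minimal sketch of fetching a page with Requests (it uses the demo shop that this tutorial scrapes later):

import requests

# Fetch a page and inspect the response.
response = requests.get("https://scrapeme.live/shop/")
print(response.status_code)   # 200 indicates success
print(response.text[:200])    # first 200 characters of the returned HTML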

Note: There are various Python frameworks and libraries that are used for web scraping aside from Requests.

[Image: The process of web scraping with Python Requests]

Scraping Web Pages With Python Requests

In one of our previous articles, you learned about web scraping with Pandas. This article deals with scraping web pages with Python Requests. In this web scraping with Python tutorial, you will learn how the Requests library can be used to send GET and POST requests, set headers, handle cookies, and manage sessions.
You will also see how HTTP requests are made, how responses are handled, and how the required data is extracted from the HTML using Requests.
In addition, we will cover various techniques and strategies for parsing the HTML using the LXML library.
LXML is a popular Python library for traversing HTML structures and extracting data from them, which makes the scraping process, and obtaining the required information, much easier.

Be considerate and respectful of the terms of service of the website when you are scraping web pages with Python. Also, be aware of the legal and ethical considerations and scrape responsibly. Let us begin to learn web scraping with Python and Requests.

Step by Step Installation Process

Make sure to follow the steps listed below in the installation process.

  1. On your Linux device, open a terminal.
  2. Type python --version or python3 --version in the terminal to check whether Python is already installed. You will see a version number if it is. If so, skip the installation steps and use the existing Python installation.
  3. If Python is found to be missing, you need to install it. For this, you should use the package manager that is specific to your Linux distribution. The installation and other dependencies required will be handled by this package manager.
  4. You can use the command mentioned below for Ubuntu, Debian, or related distributions.
    sudo apt-get update
    sudo apt-get install python3
  5. Install the required libraries, in this case, Requests and LXML, using the following commands. A quick way to verify the installation is shown after this list.
    pip install requests
    pip install lxml 
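
To confirm that both libraries are available (an optional check), you can try importing them from the command line:

    python3 -c "import requests, lxml; print('ok')"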

How to Create Your First Python Scraper

Using Python Requests for web scraping makes data extraction simpler and faster. Even a non-programmer can create their own web scraper in Python by following the steps below.
The workflow of the scraper can be listed as follows:

  1. Open the website https://scrapeme.live/shop
  2. Collect all product URLs by navigating through the first few listing pages
  3. Collect details such as
    • Name
    • Description
    • Price
    • Stock
    • Image URL
    • Product URL
  4. Now you can save all the data you collected to a CSV file

Importing the Required Libraries

You can begin scraping web pages with Python by importing the required data libraries.

import requests
from lxml import html
import csv

Sending a Request to the Website

The Requests module is used here to collect data from the website; remember that it is the Requests library that allows Python to send HTTP requests.
Let’s send a request to https://scrapeme.live/shop

url = "https://scrapeme.live/shop/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5"
}
response = requests.get(url, headers=headers)

Before proceeding further, the response received from the website must be validated, which is done using the response status code. Note that validation criteria can differ from website to website.

def verify_response(response):
    return response.status_code == 200

Based on the status code, you determine whether the response is valid: a status code of 200 means the response is valid, and anything else means it is not. For invalid responses, you can add retries by wrapping the request in a small retry loop.

def send_request(url):
    max_retry = 3
    while max_retry >= 1:
        response = requests.get(url, headers=headers)
        if verify_response(response):
            return response
        else:
            max_retry -= 1

The next step after receiving a valid response is to parse the HTML response.

You have the response from the listing page. Now you can collect the product URLs.

[Image: Finding the product URL in the listing page HTML]

From the above screenshot, it is clear that the 'a' node with the class name "woocommerce-LoopProduct-link woocommerce-loop-product__link" contains the URL of the product page. Since the 'a' node comes under an 'li' node, its XPath is written as //li/a[contains(@class, "product__link")].
The product URL is in the "href" attribute of that node, so using the lxml module you can access the attribute value as shown below:

from lxml import html
parser = html.fromstring(response.text)
product_urls = parser.xpath('//li/a[contains(@class, "product__link")]/@href')

Similarly, the next page URL can be obtained from the next button in HTML.

[Image: Obtaining the next page URL from the next button]

The XPath //a[@class="next page-numbers"] matches two nodes, so to get the next page URL from the 'a' node, select the first result. Wrap the XPath in parentheses and index it: //a[@class="next page-numbers"] becomes (//a[@class="next page-numbers"])[1]/@href.

from lxml import html
parser = html.fromstring(response.text)
next_page_url = parser.xpath('(//a[@class="next page-numbers"])[1]/@href')[0]

Collect all the product URLs and save them into a list. You then paginate through the listing pages, adding the product URLs from each page to that list; a sketch of this loop follows the clean_string function below. Once all the pages have been visited, you send a request to each product URL.
You might have noticed that parser.xpath() returns a list of string elements. The same XPaths are used on every product page, but some values, such as the price, may be missing when a product is out of stock.
In that case, parser.xpath() returns an empty list, and indexing it with [0] raises an error that stops the rest of the code from running. So a function, clean_string, is created to handle this situation.

def clean_string(list_or_txt, connector=' '):
    if not list_or_txt:
        return None
    return ' '.join(connector.join(list_or_txt).split())
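
With clean_string in place, here is a rough sketch of the pagination loop mentioned above; it assumes the send_request, get_product_urls, and get_next_page_url helper functions that are defined in the complete code further below.

listing_page_url = 'https://scrapeme.live/shop/'
product_urls = []
for listing_page_number in range(1, 6):                  # first five listing pages
    response = send_request(listing_page_url)            # fetch the current listing page
    product_urls.extend(get_product_urls(response))      # collect its product URLs
    listing_page_url = get_next_page_url(response)       # move on to the next page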

Let us now learn about collecting the name, description, price, stock, and image URL of the data points.

Collecting the Name

[Image: Collecting the name of the product]

From the image, it is clear that the h1 node contains the name of the product. The product page does not have any other h1 node, so you can simply select it with the XPath //h1[contains(@class, "product_title")].
Since the text is inside the node, use the following code:

title = parser.xpath('//h1[contains(@class, "product_title")]/text()')
title = clean_string(title)

Collecting the Description

[Image: Collecting the description of the product]

Here the product description is inside the node p. You can also see that it is inside the div with the class name substring ‘product-details__short-description’. Collect the text inside it as follows:

description = parser.xpath('//div[contains(@class,"product-details__short-description")]//text()')
description = clean_string(description)

Collecting the Stock

[Image: Collecting the stock of the product]

From the image, it is evident that stock is directly present inside the node p, whose class contains the string ‘in-stock’. So use the following code to collect data from it:

stock = parser.xpath('//p[contains(@class, "in-stock")]/text()')
stock = clean_string(stock)
if stock:
    stock = stock.replace(' in stock', '')

Collecting the Price

[Image: Collecting the price of the product]

Here the price can be directly seen in the node p having class price. So use the following code to get the actual price value of the product:

price = parser.xpath('//p[@class="price"]//text()')
price = clean_string(price)

Collecting the Image URL

[Image: Collecting the image URL of the product]

In the above screenshot, the attribute href of the node ‘a’ is highlighted. It is from this href attribute that you will get the image URL.

image_url = parser.xpath('//div[contains(@class, "woocommerce-product-gallery__image")]/a/@href')
image_url = clean_string(list_or_txt=image_url, connector=' | ')

Complete Code

import csv
from lxml import html
import requests

def verify_response(response):
    """
    Verify whether we received a valid response.
    """
    return response.status_code == 200

def send_request(url):
    """
    Send request and handle retries.
    :param url:
    :return: Response we received after sending request to the URL.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Sa"
                      "fari/537.36",
        "Accept-Language": "en-US,en;q=0.5"
    }
    max_retry = 3
    while max_retry >= 1:
        response = requests.get(url, headers=headers)
        if verify_response(response):
            return response
        else:
            max_retry -= 1
    print("Invalid response received even after retrying. URL with the issue is:", url)
    raise Exception("Stopping the code execution as invalid response received.")

def get_next_page_url(response):
    """
    Collect pagination URL.
    :param response:
    :return: next listing page url
    """
    parser = html.fromstring(response.text)
    next_page_url = parser.xpath('(//a[@class="next page-numbers"])[1]/@href')[0]
    return next_page_url

def get_product_urls(response):
    """
    Collect all product URLs from a listing page response.
    :param response:
    :return: List of product page URLs.
    """
    parser = html.fromstring(response.text)
    product_urls = parser.xpath('//li/a[contains(@class, "product__link")]/@href')
    return product_urls

def clean_stock(stock):
    """
    Clean the stock data by removing unwanted text.
    :param stock:
    :return: Stock value with the extra ' in stock' text removed.
    """
    stock = clean_string(stock)
    if stock:
        stock = stock.replace(' in stock', '')
        return stock
    else:
        return None

def clean_string(list_or_txt, connector=' '):
    """
    Join a list of strings into one cleaned string, removing extra white space.
    :param list_or_txt:
    :param connector:
    :return: Cleaned string.
    """
    if not list_or_txt:
        return None
    return ' '.join(connector.join(list_or_txt).split())

def get_product_data(url):
    """
    Collect all details of a product.
    :param url:
    :return: All data of a product.
    """
    response = send_request(url)
    parser = html.fromstring(response.text)
    title = parser.xpath('//h1[contains(@class, "product_title")]/text()')
    price = parser.xpath('//p[@class="price"]//text()')
    stock = parser.xpath('//p[contains(@class, "in-stock")]/text()')
    description = parser.xpath('//div[contains(@class,"product-details__short-description")]//text()')
    image_url = parser.xpath('//div[contains(@class, "woocommerce-product-gallery__image")]/a/@href')
    product_data = {
        'Title': clean_string(title), 'Price': clean_string(price), 'Stock': clean_stock(stock),
        'Description': clean_string(description), 'Image_URL': clean_string(list_or_txt=image_url, connector=' | '),
        'Product_URL': url}
    return product_data

def save_data_to_csv(data, filename):
    """
    Save a list of dicts to a CSV file.
    :param data: Data to be saved to the CSV
    :param filename: Filename of the CSV
    """
    keys = data[0].keys()
    with open(filename, 'w', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=keys)
        writer.writeheader()
        writer.writerows(data)

def start_scraping():
    """
    Starting function.
    """
    listing_page_url = 'https://scrapeme.live/shop/'
    product_urls = list()
    for listing_page_number in range(1, 6):
        response = send_request(listing_page_url)
        listing_page_url = get_next_page_url(response)
        products_from_current_page = get_product_urls(response)
        product_urls.extend(products_from_current_page)
    results = []
    for url in product_urls:
        results.append(get_product_data(url))
    save_data_to_csv(data=results, filename='scrapeme_live_Python_data.csv')
    print('Data saved as csv')

if __name__ == "__main__":
    start_scraping()

Sending GET Requests Using Cookies and Headers

Now let's learn how to send requests using headers and cookies.

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5"
}
# cookies is a plain dict of cookie names and values (placeholder values shown here)
cookies = {"cookie_name": "cookie_value"}

url = "https://scrapeme.live/shop/"
response = requests.get(url, headers=headers, cookies=cookies)
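
Alternatively, a requests.Session object keeps cookies between requests automatically, which is often simpler than passing a cookies dict by hand. A minimal sketch:

session = requests.Session()
session.headers.update(headers)                               # reuse the headers defined above
first_response = session.get("https://scrapeme.live/shop/")   # any cookies set here are stored
print(first_response.cookies.get_dict())                      # cookies received from the server
second_response = session.get("https://scrapeme.live/shop/")  # sent with the stored cookies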

Sending POST Requests

Let’s have a look at making POST requests with the Python Requests library.

payload = {"key1": "value1", "key2": "value2"}
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5"
}
url = "https://scrapeme.live/shop/"
response = requests.post(url, headers=headers, json=payload)
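
Whichever method you use, you can inspect the result through the response object. A short sketch (the .json() call assumes the server actually returns JSON):

print(response.status_code)                                   # e.g. 200 on success
if "application/json" in response.headers.get("Content-Type", ""):
    data = response.json()                                    # parse the body as JSON
else:
    data = response.text                                      # fall back to the raw HTML/text body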

Wrapping Up

The major reason why web scraping in Python has become a favorite choice among the scraping community is the simplicity of the language. Moreover, the Requests library in Python provides a high-level interface that makes web scraping easier.
This tutorial has given you a detailed explanation of web scraping with Python Requests and how you can employ Requests to collect all the necessary data. You can efficiently collect, process, and analyze the data you need with this simple method. For enterprises looking to create new data sets or study market trends, an effective approach like web scraping is essential. A leading data service provider like ScrapeHero can provide companies with access to valuable data that is otherwise difficult to obtain.
