Essential HTTP Headers for Web Scraping

Web scraping is widely used for collecting data from across the Internet, and as the demand for data grows, so does the importance of web scraping. Today, a vast amount of data is available online that can be used for business intelligence, market research, decision-making, and more.

While creating a web scraper, we must consider factors such as headers, proxies, request parameters, request types, etc. Among these, headers play a crucial role. In this article, we will discuss the essential headers for web scraping.

What are HTTP Headers?

HTTP headers are pieces of information transferred between the client and the website server while retrieving web pages. The information sent from the client to the server is called request headers; it includes details about the browser and operating system, so the server can easily identify the source of a request by looking at these headers. Similarly, the additional information the server returns along with the response is called response headers. Here is the list of essential headers for web scraping:

  • Request Headers
    • Referer
    • User-Agent
    • Accept
    • Accept-Language
    • Cookie
  • Response Headers
    • Content-Type
    • Location

Request Headers

Referer

The Referer header contains the partial or full URL of the page from which the request was initiated. Let's take the example of scraping an e-commerce website by navigating through listing pages and product pages. In this case, the referer of a product page request is the respective listing page. Take a look at the flow diagram below.

[Flow diagram: Referer header while navigating from the home page to a listing page to a product page]

Here, the navigation starts from the home page, which has no referer. From the home page, we navigate to the listing page, where the referer is the home page URL. From the listing page, we navigate to the product page, where the referer is the listing page URL.

A web server can easily identify a scraper by looking at the referer of a request, so it is important to use a proper referer to make the request appear to come from a valid source.
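For instance, here is a minimal sketch of setting the Referer header with Python's requests library. The listing-page and product-page URLs are hypothetical placeholders for illustration.

import requests

# Hypothetical URLs used for illustration only
listing_url = "https://www.example.com/category/shoes"
product_url = "https://www.example.com/product/12345"

# Send the listing page as the referer when requesting the product page,
# mimicking a user who clicked through from the listing
headers = {"Referer": listing_url}
response = requests.get(product_url, headers=headers)

print("Status code:", response.status_code)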

User-Agent

The User-Agent is an important header for web scraping. It is a string containing information about the name and version of the operating system and the browser (or tool) used to send the request. An example of a typical User-Agent string is shown below.

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36

When we send a request to a web page, the website server can identify client details, such as the browser and OS version, using the User-Agent. Using this information, websites can block scrapers. The User-Agent string of a plain Python request and that of a browser request (normal browsing) are different. You can find examples below:

Python requests module

'User-Agent': 'python-requests/2.28.1'

Linux + Firefox

'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0'

Web scrapers can avoid blocking by using the User-Agent of a real browser (Chrome, Safari, Firefox, etc.). If you inspect the browser's network tab, you can find its User-Agent string.

[Screenshot: User-Agent string shown in the browser's network tab]

We can bypass this kind of blocking by using this User-Agent in our scraper. It is always good to use the same headers a real browser sends.
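As a minimal sketch, a browser User-Agent string copied from the network tab can be attached to a request like this; httpbin.org is used here only because it echoes the request headers back.

import requests

# A real browser User-Agent string copied from the browser's network tab
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

# httpbin.org echoes the request headers, so we can verify what was sent
response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.json()["headers"]["User-Agent"])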

Accept

The Accept header indicates the types of response content, known as MIME types, that the client expects to receive from the server. An Accept header can have one or more MIME types, for example text/html, image/jpeg, or application/json.

If we use application/json in the Accept header, the server may send the response in JSON format instead of HTML, which simplifies further parsing and processing.

On the other hand, an invalid Accept header might lead to retrieving invalid data or to the server rejecting the request. An example of an invalid Accept header affecting the response is as follows.

import requests

url = 'https://api.github.com/user'

# Request a PDF from an endpoint that only serves JSON
headers = {
    'Accept': 'application/pdf',
}
response = requests.get(url, headers=headers)

print("Status code:", response.status_code)
print("Response text:", response.text)

Output:

Status code: 415
Response text: {"message":"Unsupported 'Accept' header: 'application/pdf'. Must accept 'application/json'.","documentation_url":"https://docs.github.com/v3/media"}

Here, the endpoint https://api.github.com/user responds in JSON format, and we asked for a PDF document, which the server cannot provide. The server returned a 415 status code with the message "Unsupported 'Accept' header".

Accept-Language

The Accept-Language header indicates the natural languages the client prefers for the response content. It can list one or more languages in order of priority. An example of an Accept-Language header is as follows:

accept-language: en-GB,en-US;q=0.9,en;q=0.8

This header is important while scraping language-specific websites. If we want to scrape a website whose content is in a specific language, it is essential to use a proper Accept-Language header, or else the server may return content in a different language. The Accept-Language header also helps mimic real user behavior, since most real users browse such websites in their default language.
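A minimal sketch of sending the Accept-Language header with requests is shown below; httpbin.org simply echoes the headers back, so this only verifies what was sent.

import requests

# Prefer British English, then US English, then any English variant
headers = {"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"}

response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.json()["headers"]["Accept-Language"])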

Cookies

The HTTP Cookie header carries small pieces of data separated by semicolons (;). Cookies can be generated on the client side or sent by the web server, and they are stored by the browser. Their primary purpose is to track human interactions with the website, including login information, preferences, browsing patterns, etc. The general syntax of a Cookie header is as follows.

cookie: <cookie1_name>=<cookie1_value>; <cookie2_name>=<cookie2_value>; ...

The major advantages of using cookie headers in web scraping are listed below:

  • Session management: Cookies keep track of the session state during data extraction that requires authentication. If a website demands authentication, for example with bearer tokens or CSRF tokens for validation, then sending the relevant cookies with valid tokens is necessary to scrape data accurately (see the sketch after this list).
  • Scraping dynamic websites: Dynamic websites often utilize cookies as a means of efficiently storing and retrieving data. If we fail to handle cookies properly, it can lead to complications in accessing dynamic content through scraping.
  • To bypass anti-scraping systems: Websites use various anti-scraping methods, including cookie-based security measures. Handling and sending cookies to the server while scraping can reduce cookie-based bot detection.
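Here is the sketch mentioned above: a minimal example of session management with requests.Session. The login URL and form fields are hypothetical and will differ on a real website.

import requests

# A Session object stores cookies set by the server and
# sends them back automatically on subsequent requests
session = requests.Session()

# Hypothetical login endpoint and credentials for illustration only
session.post(
    "https://www.example.com/login",
    data={"username": "user", "password": "pass"},
)

# Any session cookie received above is sent along with this request
response = session.get("https://www.example.com/account")
print(session.cookies.get_dict())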

Response Headers

Content-Type

The “Content-Type” header in a response indicates the format of the content the server has returned, such as HTML, JSON, or an image. Checking this header ensures that the content received from the server is in the desired format. Sometimes, the “Content-Type” header can also reveal potential flaws or errors: if the intended content type is missing or the response carries an unsupported content type, there is likely an issue on the server side or with the request we sent while scraping. Here are some examples of content types.

Content-Type: application/xml
Content-Type: text/html
Content-Type: image/jpeg
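For example, a scraper can check the Content-Type of a response before parsing it. A minimal sketch, reusing the GitHub endpoint from the Accept example:

import requests

response = requests.get("https://api.github.com/user")

# Inspect the Content-Type response header before parsing the body
content_type = response.headers.get("Content-Type", "")

if "application/json" in content_type:
    data = response.json()
    print("Parsed JSON:", data)
else:
    print("Unexpected content type:", content_type)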

Location Header

The “Location” header is significant since it informs the scraper of the new URL to which the request is being redirected. This enables the scraper to handle URL changes and redirections, collect the required data, and ensure reliable data extraction.

Let’s look into some examples:

Location: https://www.websitedomain.com/another-page

In this example, the requested content has moved permanently to a new URL. The scraper should follow the new URL to gather the required content.

Location: /login

Here, the request redirects to a login page on the same domain. The scraper should handle this redirection to continue scraping.

Location: /error404

Here, the requested page is not found, and the request is redirected to an error page.
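By default, the requests library follows redirects automatically, but a scraper can also inspect the Location header directly. A minimal sketch, using httpbin.org's redirect endpoint for illustration:

import requests

# Disable automatic redirects so the Location header can be inspected
response = requests.get(
    "https://httpbin.org/redirect-to?url=https://httpbin.org/get",
    allow_redirects=False,
)

print("Status code:", response.status_code)       # 302
print("Location:", response.headers["Location"])  # the redirect target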

Tips for using Headers in web scraping

User-agent rotation

Using multiple user agents and rotating them across requests is a good practice in web scraping. This helps the scraper avoid blocking to an extent. An example of user-agent rotation is given here.

import requests
import itertools

# A pool of real browser User-Agent strings to rotate through
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
]

# cycle() yields the user agents one by one, repeating indefinitely
ua_cycle = itertools.cycle(user_agents)

for _ in range(10):
    # Use a different User-Agent for each request
    headers = {"User-Agent": next(ua_cycle)}
    response = requests.get("https://httpbin.org/headers", headers=headers)

    # httpbin.org echoes the headers, so we can verify the rotation
    print(response.json()["headers"]["User-Agent"])

Adding Sec-CH-UA

The Sec-CH-UA headers (client hints) contain detailed information about the browser and device from which the request was sent. A mismatch between the versions and names in the Sec-CH-UA headers and those in the User-Agent string might lead to blocking. An example of Sec-CH-UA headers with a matching User-Agent is as follows.

sec-ch-ua: "Chromium";v="112", "Google Chrome";v="112", "Not:A-Brand";v="99"
sec-ch-ua-arch: "x86"
sec-ch-ua-bitness: "64"
sec-ch-ua-full-version: "112.0.5615.165"
sec-ch-ua-full-version-list: "Chromium";v="112.0.5615.165", "Google Chrome";v="112.0.5615.165", "Not:A-Brand";v="99.0.0.0"
sec-ch-ua-mobile: ?0
sec-ch-ua-model: ""
sec-ch-ua-platform: "Linux"
sec-ch-ua-platform-version: "5.14.0"
sec-ch-ua-wow64: ?0
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
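As a minimal sketch, the following sends a few client-hint headers kept consistent with the Chrome 112 User-Agent shown above; httpbin.org is used only to echo the headers back. Note that requests sends exactly the headers given, while a real browser chooses these values itself.

import requests

# Client-hint headers consistent with the Chrome 112 User-Agent below;
# the brand and version values must stay in sync with the UA string
headers = {
    "sec-ch-ua": '"Chromium";v="112", "Google Chrome";v="112", "Not:A-Brand";v="99"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Linux"',
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
}

response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.json()["headers"])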

Conclusion

HTTP headers are the information shared between the client and the server. There are many headers available; here we have discussed seven that are important for web scraping. Using correct headers in web scraping is important to avoid getting blocked.
