How to fake and rotate User Agents using Python 3

To rotate user agents in Python, here is what you need to do:

  1. Collect a list of User-Agent strings of some recent real browsers.
  2. Put them in a Python List.
  3. Make each request pick a random string from this list and send the request with the 'User-Agent' header set to this string.

There are different methods to do it depending on the level of blocking you encounter.

What is a User-Agent

A user agent is a string that a browser or application sends to each website you visit. A typical user agent string contains details such as the application type, operating system, software vendor, and software version of the requesting software user agent. Web servers use this data to assess the capabilities of your computer, optimizing a page's performance and display. User-Agents are sent as a request header called "User-Agent".

User-Agent: Mozilla/<version> (<system-information>) <platform> (<platform-details>) <extensions>

Below is the User-Agent string for Chrome 83 on macOS 10.15:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36

Before we look into rotating user agents, let’s see how to fake or spoof a user agent in a request.

Why should you use a User-Agent

Most websites block requests that come in without a valid browser User-Agent. For example, here are the User-Agent and other headers that Python Requests sends by default when making a request.

import requests 
from pprint import pprint

# Let's test what headers are sent by sending a request to HTTPBin
r = requests.get('http://httpbin.org/headers')
pprint(r.json())
{'headers': {'Accept': '*/*',
             'Accept-Encoding': 'gzip, deflate',
             'Host': 'httpbin.org',
             'User-Agent': 'python-requests/2.23.0',
             'X-Amzn-Trace-Id': 'Root=1-5ee7a417-97501ac8e10eb62866e09b9c'}}

Ignore the X-Amzn-Trace-Id header; it is not sent by Python Requests, but generated by the Amazon Load Balancer used by HTTPBin.

Any website could tell that this request came from Python Requests and may already have measures in place to block such user agents. User-agent spoofing is when you replace the user agent string your browser sends as an HTTP header with another string. Major browsers have extensions that allow users to change their User-Agent.

We can fake the user agent by changing the User-Agent header of the request and bypass such User-Agent-based blocking scripts used by websites.

How to change User Agent

To change the User-Agent using Python Requests, we can pass a dict with the key 'User-Agent' and the value set to the User-Agent string of a real browser:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36

See the code below

headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}

# Let's test what headers are sent by sending a request to HTTPBin
r = requests.get('http://httpbin.org/headers',headers=headers)
pprint(r.json())
{'headers': {'Accept': '*/*',
             'Accept-Encoding': 'gzip, deflate',
             'Host': 'httpbin.org',
             'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/83.0.4103.97 Safari/537.36',
             'X-Amzn-Trace-Id': 'Root=1-5ee7b614-d1d9a6e8106184eb3d66b108'}}

As before, let's ignore the headers that start with X-, as they are generated by the Amazon Load Balancer used by HTTPBin and are not part of what we sent to the server.

But, websites that use more sophisticated anti-scraping tools can tell this request did not come from Chrome.

Sending just a User-Agent is not enough; we need to send a full set of headers

Although we set a user agent, the other headers that we sent are different from what the real Chrome browser would have sent.

Here is what real Chrome would have sent

{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8", 
    "Dnt": "1", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-5ee7bae0-82260c065baf5ad7f0b3a3e3"
  }
}

Our request is missing the following headers that Chrome would send when downloading an HTML page, or has the wrong values for them:

  • Accept (had */*, instead of text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9)
  • Accept-Language
  • Dnt
  • Upgrade-Insecure-Requests

Anti-scraping tools can easily detect this request as coming from a bot, so just sending a User-Agent is not enough to get past the latest anti-scraping tools and services.

Let's add these missing headers and make the request look like it came from a real Chrome browser:

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8", 
    "Dnt": "1", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36", 
  }

r = requests.get("http://httpbin.org/headers",headers=headers)
pprint(r.json())

{'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
             'Accept-Encoding': 'gzip, deflate',
             'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
             'Dnt': '1',
             'Host': 'httpbin.org',
             'Upgrade-Insecure-Requests': '1',
             'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/83.0.4103.97 Safari/537.36',
             'X-Amzn-Trace-Id': 'Root=1-5ee7bbec-779382315873aa33227a5df6'}}

Now, this request looks more like it came from Chrome 83, and should get you past most anti-scraping tools, as long as you are not flooding the website with requests.

Why should you rotate User Agents

If you are making a large number of requests to a website while web scraping, it is a good idea to randomize them. You can make each request you send look random by changing its exit IP address using rotating proxies, and by sending a different set of HTTP headers so that it looks like the requests are coming from different computers with different browsers.
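As a rough illustration, here is a minimal sketch of rotating both with Python Requests. The proxy URLs below are hypothetical placeholders and should be replaced with rotating proxies you actually have access to.

import random
import requests

# Hypothetical placeholder proxies - substitute your own rotating proxies here
proxy_list = [
    'http://user:password@proxy1.example.com:8080',
    'http://user:password@proxy2.example.com:8080',
]
user_agent_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
]

url = 'https://httpbin.org/ip'
for i in range(1, 4):
    # Pick a random proxy and a random user agent for every request
    proxy = random.choice(proxy_list)
    headers = {'User-Agent': random.choice(user_agent_list)}
    response = requests.get(url, headers=headers,
                            proxies={'http': proxy, 'https': proxy}, timeout=30)
    print(response.json())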

How to rotate user agents

If you are just rotating user agents, the process is very simple:

  1. Collect a list of User-Agent strings of some recent real browsers from WhatIsMyBrowser.com.
  2. Put them in a Python List.
  3. Make each request pick a random string from this list.

Rotating User Agents using Python Requests

import requests
import random
user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
url = 'https://httpbin.org/headers'
for i in range(1,4):
    #Pick a random user agent
    user_agent = random.choice(user_agent_list)
    #Set the headers 
    headers = {'User-Agent': user_agent}
    #Make the request
    response = requests.get(url,headers=headers)
    
    print("Request #%d\nUser-Agent Sent:%s\n\nHeaders Recevied by HTTPBin:"%(i,user_agent))
    print(response.json())
    print("-------------------")

Request #1
User-Agent Sent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15

Headers Received by HTTPBin:
{'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15', 'X-Amzn-Trace-Id': 'Root=1-5ee7c263-b009c26abdfd7cb2c05e06ee'}}
-------------------
Request #2
User-Agent Sent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15

Headers Received by HTTPBin:
{'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15', 'X-Amzn-Trace-Id': 'Root=1-5ee7c264-975128287d5b3fa1fc9d0f8d'}}
-------------------
Request #3
User-Agent Sent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0

Headers Received by HTTPBin:
{'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0', 'X-Amzn-Trace-Id': 'Root=1-5ee7c266-8d04316e70f56abd1ec7beee'}}
-------------------

Rotating User-Agents in Scrapy

To rotate user agents in Scrapy, you need an additional middleware. There are a few Scrapy middlewares that let you rotate user agents like:

  1. Scrapy-UserAgents
  2. Scrapy-Fake-Useragents

Our example is based on Scrapy-UserAgents.

Install Scrapy-UserAgents using

pip install scrapy-useragents

Add the following lines to your Scrapy project's settings file:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}

USER_AGENTS = [
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/57.0.2987.110 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.79 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) '
     'Gecko/20100101 '
     'Firefox/55.0'),  # firefox
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.91 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/62.0.3202.89 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/63.0.3239.108 '
     'Safari/537.36'),  # chrome
]
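Scrapy-UserAgents only rotates the User-Agent header. If you would rather rotate a full set of headers inside Scrapy (see the next section for why that helps), one option is a small custom downloader middleware. Below is a minimal sketch; the HEADERS_LIST setting and the class name are our own illustration and are not part of Scrapy-UserAgents:

import random

class RotateHeadersMiddleware(object):
    """Picks a random full set of headers for every outgoing request."""

    def __init__(self, headers_list):
        self.headers_list = headers_list

    @classmethod
    def from_crawler(cls, crawler):
        # HEADERS_LIST is assumed to be a list of header dicts defined in settings.py,
        # similar to the headers_list shown later in this article
        return cls(crawler.settings.get('HEADERS_LIST', []))

    def process_request(self, request, spider):
        if not self.headers_list:
            return None
        # Copy every header from a randomly chosen set onto the request
        for name, value in random.choice(self.headers_list).items():
            request.headers[name] = value
        return None

You would enable it in DOWNLOADER_MIDDLEWARES just like the middleware above, and keep the built-in UserAgentMiddleware disabled as shown in the settings earlier.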

The right way to rotate User-Agents in any program

Most of the techniques above just rotate the User-Agent header, but we already saw that it is easier for bot detection tools to block you when you are not sending the other correct headers for the user agent you are using.

To get better results and less blocking, we should rotate a full set of headers associated with each User-Agent we use. We can prepare such a list by taking a few browsers, going to https://httpbin.org/headers, and copying the set of headers used by each User-Agent. (Remember to remove the headers that start with X- in HTTPBin.)

Browsers may behave differently on different websites, based on the features and compression methods each website supports. A better way is to:

  1. Open an incognito or a private tab in a browser, go to the Network tab of the browser's developer tools, and visit the link you are trying to scrape directly in the browser.
  2. Copy the curl command for that request –
    curl 'https://www.amazon.com/' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'DNT: 1' -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1'
    
  3. Paste it into CurlConverter (hosted at http://curl.trillworks.com), take the value of the headers variable, and put it into a list:
    {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
    }
    

Things to keep in mind while rotating User Agents and corresponding headers

In order to make your requests from web scrapers look as if they came from a real browser:

  1. Have the right headers for the browser you are using and the website you are scraping
  2. Send headers in the right order, as the real browser would send them. Although the HTTP spec says that the order of HTTP headers does not matter, some bot detection tools check that too. They have a huge database of the combinations of headers that are sent by specific versions of a browser on different operating systems and websites. There isn't a straightforward way to order the HTTP headers using Python Requests. The code below uses a workaround.
  3. Have a Referer header with the previous page you visited or Google, to make it look real (see the sketch after this list)
  4. There is no point rotating the headers if you are logging in to a website or keeping session cookies, as the site can tell it is you without even looking at headers
  5. We advise you to use proxy servers when making a large number of requests, and to use a different IP address for each set of browser headers, or vice versa
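For point 3, here is a minimal sketch, not part of the original code, of carrying the previously visited page forward as the Referer header; the URLs are just placeholders:

import requests

session = requests.Session()
pages = [
    'https://httpbin.org/headers',
    'https://httpbin.org/get',
    'https://httpbin.org/anything',
]
referer = 'https://www.google.com/'  # pretend we arrived from a Google search
for url in pages:
    response = session.get(url, headers={'Referer': referer})
    print(url, '->', response.status_code)
    # The next request will claim the page we just visited as its referrer
    referer = url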

Having said that, let’s get to the final piece of code

The Code

import requests
import random 
from collections import OrderedDict

# This data was created by using the curl method explained above
headers_list = [
    # Firefox 77 Mac
     {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    },
    # Firefox 77 Windows
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    },
    # Chrome 83 Mac
    {
        "Connection": "keep-alive",
        "DNT": "1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Dest": "document",
        "Referer": "https://www.google.com/",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
    },
    # Chrome 83 Windows 
    {
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-User": "?1",
        "Sec-Fetch-Dest": "document",
        "Referer": "https://www.google.com/",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9"
    }
]
# Create ordered dict from Headers above
ordered_headers_list = []
for headers in headers_list:
    h = OrderedDict()
    for header,value in headers.items():
        h[header]=value
    ordered_headers_list.append(h)
    
    
url = 'https://httpbin.org/headers'

for i in range(1,4):
    # Pick a random ordered set of browser headers
    headers = random.choice(ordered_headers_list)
    #Create a request session
    r = requests.Session()
    r.headers = headers
    
    response = r.get(url)
    print("Request #%d\nUser-Agent Sent:%s\n\nHeaders Recevied by HTTPBin:"%(i,headers['User-Agent']))
    print(response.json())
    print("-------------------")

Request #1
User-Agent Sent:Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0

Headers Received by HTTPBin:
{'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'en-US,en;q=0.5', 'Dnt': '1', 'Host': 'httpbin.org', 'Referer': 'https://www.google.com/', 'Upgrade-Insecure-Requests': '1', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0', 'X-Amzn-Trace-Id': 'Root=1-5ee83553-d3b3cfdd774dec24971af289'}}
-------------------
Request #2
User-Agent Sent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0

Headers Received by HTTPBin:
{'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Encoding': 'identity', 'Accept-Language': 'en-US,en;q=0.5', 'Dnt': '1', 'Host': 'httpbin.org', 'Referer': 'https://www.google.com/', 'Upgrade-Insecure-Requests': '1', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0', 'X-Amzn-Trace-Id': 'Root=1-5ee83555-2b6fa55c3f38ba905eeb3a9e'}}
-------------------
Request #3
User-Agent Sent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0

Headers Received by HTTPBin:
{'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Encoding': 'identity', 'Accept-Language': 'en-US,en;q=0.5', 'Dnt': '1', 'Host': 'httpbin.org', 'Referer': 'https://www.google.com/', 'Upgrade-Insecure-Requests': '1', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0', 'X-Amzn-Trace-Id': 'Root=1-5ee83556-58ee2480f33e0b00f02ff320'}}
-------------------

You cannot see the order in which the headers were sent in HTTPBin's response, as it lists them alphabetically.
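If you want to verify the order in which your client actually sends its headers, one rough option is to run a throwaway local server and point your script at it. This is just a quick debugging sketch, not part of the original code:

import socket

# Listen on localhost and print a single raw HTTP request exactly as received,
# so the header order the client used is visible
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('127.0.0.1', 8080))
server.listen(1)
print('Point your script at http://127.0.0.1:8080/ and run it now')

conn, addr = server.accept()
print(conn.recv(65535).decode('utf-8', errors='replace'))
conn.sendall(b'HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok')
conn.close()
server.close()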

There you go! We just made these requests look like they came from real browsers.

Before you go

Rotating user agents can help you avoid getting blocked by websites that use intermediate levels of bot detection, but advanced anti-scraping services have a large array of tools and data at their disposal and can see past your user agents and IP address.

  1. If you are using proxies that were already detected and flagged by bot detection tools, rotating headers isn’t going to help.
  2. The SSL/TLS fingerprint of Python Requests or Scrapy is going to be very different from that of the browser whose User-Agent you are faking. Not many tools check this, but it may be the reason this technique has not worked for you.
    For reference, every request we made to a server with Python Requests had the JA3 fingerprint 5fb798ffef091e3699f344d0d0895792 no matter what HTTP headers we sent, while real Chrome 83 on a Mac had the fingerprint b32309a26951912be7dba376398abc3b. You can see your browser's JA3 fingerprint on ja3er. Bypassing such blocking is too complicated to fit into the scope of this article.

You can learn more on this topic in our article How do websites detect web scrapers and other bots.

Responses

simabn March 17, 2018

There is a python lib called “fake-useragent” which helps getting a list of common UA.


    ScrapeHero March 18, 2018

    Great find. We had used fake user agent before, but at times we feel like the user agent lists are outdated.


MargaritaL May 23, 2018

I have to import “urllib.request” instead of “requests”, otherwise it does not work.


    Mikie June 18, 2018

    agreed, same for me. I think that was a typo.


      Hyyudu April 24, 2019

      requests is different package, it should be installed separately, with “pip install requests”. But urllib.request is a system library always included in your Python installation


    Javier July 2, 2019

    requests use urllib3 packages, you need install requests with pip install.


Nick July 15, 2019

Hi there, thanks for the great tutorials!

Just wondering; if I’m randomly rotating both ips and user agents is there a danger in trying to visit the same URL or website multiple times from the same ip address but with a different user agent and that looking suspicious?

Cheers,
Nick


    ScrapeHero July 19, 2019

    Nick,
    There is no definite answer to these things – they all vary from site to site and time to time.


    Richdev November 15, 2022

    Lobstrio stopped hosting the server for this. This is not a viable option anymore.


WTKBTOS May 30, 2020

There is a website front to a review database which to access with Python will require both faking a User Agent and a supplying a login session to access certain data. Won’t this mean that if I rotate user agents and IP addresses under the same login session it will essentially tell the database I am scraping? Is there any way around this?


ridwan July 27, 2020

………………..
ordered_headers_list = []
for headers in headers_list:
h = OrderedDict()
for header,value in headers.items():
h[header]=value
ordered_headers_list.append(h)

for i in range(1,4):
#Pick a random browser headers
headers = random.choice(headers_list)
#Create a request session
r = requests.Session()
r.headers = headers

# Download the page using requests
print(“Downloading %s”%url)
r = r.get(url, headers=i,headers[‘User-Agent’])
# Simple check to check if page was blocked (Usually 503)
if r.status_code > 500:
if “To discuss automated access to Amazon data please contact” in r.text:
print(“Page %s was blocked by Amazon. Please try using better proxies\n”%url)
else:
print(“Page %s must have been blocked by Amazon as the status code was %d”%(url,r.status_code))
return None
# Pass the HTML of the page and create
return e.extract(r.text)

# product_data = []
with open(“asin.txt”,’r’) as urllist, open(‘hasil-GRAB.txt’,’w’) as outfile:
for url in urllist.read().splitlines():
data = scrape(url)
if data:
json.dump(data,outfile)
outfile.write(“\n”)
# sleep(5)

can anyone help me to combine this random user agent with the amazon.py script that is in the amazon product scrapping tutorial in this tutorial —-> https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python-and-selectorlib/


melmefolti August 8, 2020

@ScrapeHero : Insightful Article

Is there any library like fakeuseragent that will give you list of headers in correct order including user agent to avoid manual spoofing like in the example code.

Hard coding is not sustainable


    ScrapeHero August 9, 2020

    @melmefolti We haven’t found anything so far. A lot of effort would be needed to check each Browser Version, Operating System combination and keep these values updated.


    no-brainer August 10, 2020

    Very useful article with that single component clearly missing. I have come across pycurl and uncurl packages for python which return the same thing as the website, but in alphabetical order. Perhaps the only option is the create a quick little scraper for the cURL website, to then feed the main scraper of whatever other website you’re looking at


    ScrapeHero September 4, 2020

    You can try curl with the -I option
    ie curl -I https://www.example.com and see if that helps


jimm October 6, 2020

why exactly do we need to open the network tab?


    jimm October 6, 2020

    sorry i should be more specific,

    the bit were you say:

    ‘Open an incognito or a private tab in a browser, go to the Network tab of each browsers developer tools, and visit the link you are trying to scrape directly in the browser.

    Copy the curl command to that request –

    curl ‘https://www.amazon.com/’ -H ‘User-Agent:…’

    does the navigator have something to do with the curl command?


      ScrapeHero October 15, 2020

      The curl command is copied from that window – so it is needed.


nik January 27, 2021

In the line “Accept-Encoding”: “gzip, deflate,br”,
the headers having Br is not working it is printing gibberish when i try to use beautiful soup with that request .


    ScrapeHero February 1, 2021

    You can safely remove the br and it will still work


Marcos Fonseca Alcure November 3, 2021

Hi,

How can I send all the headers to SELENIUM, I found only the User-Agent, but not the others.

Thanks A LOT !!!


ThumbOne September 21, 2022

A great page but alas, yes, JA3 fingerprinting has put an end to its utility and we await a Pythonic solution to JA3 spoofing (and are stuck till one evolves). I found one nascent effort here:

https://github.com/an0ndev/requests-ja3

