How To Scrape Amazon Product Details and Pricing using Python

In this tutorial, we will build an Amazon scraper for extracting product details and pricing. We will build this simple web scraper using Python and LXML and run it in a console. But before we start, let’s look at what you can use it for.

What can you use an Amazon Scraper for?

  1. Scrape Product Details that you can’t get with the Product Advertising API
    Amazon provides a Product Advertising API, but like many other APIs, it doesn’t expose all the information Amazon shows on a product page. A scraper can help you extract all the details displayed on the product page.
  2. Monitor products for changes in Price, Stock Count/Availability, Rating, etc.
    By using a web scraper, you can update your data feeds on a timely basis to monitor any product changes. These data feeds can help you form pricing strategies by looking at your competition – other sellers or brands.
  3. Analyze how a particular Brand sells on Amazon
    If you’re a retailer, you can monitor your competitors’ products, see how well they do in the market, and make adjustments to reprice and sell your own products. You could also use it to monitor your distribution channel to identify how your products are sold on Amazon by sellers, and whether that is causing you any harm.
  4. Find Customer Opinions from Amazon Product Reviews
    Reviews offer abundant amounts of information. If you’re targeting an established set of sellers who have been selling reasonable volumes, you can extract the reviews of their products to find what you should avoid and what you could quickly improve on while trying to sell similar products on Amazon.

Or anything else – the possibilities are endless and bound only by your imagination.

What data are we extracting from Amazon?

This tutorial is limited to extracting the data points below, from a product page:

  1. Product Name
  2. Category
  3. Original Price
  4. Sale Price
  5. Availability
  6. URL

We’ll build a scraper in Python that can go to any Amazon product page using an ASIN – a unique ID Amazon uses to keep track of products in its database.

First, let’s identify a product ASIN.

For example, in this product – Imploding Kittens (https://www.amazon.com/Imploding-Kittens-First-Expansion-Exploding/dp/B01HSIIFQ2/) – the ASIN is B01HSIIFQ2.

Gather the ASINs for the products you need data from.

The next step is to build a script that goes to each of those product pages, downloads its HTML, and extracts the fields you need – e.g., product title, price, description, etc.

XPaths are used to tell the script where each field we need is located in the HTML. XPath is one of the few ways to select specific content from a big blob of XML or HTML (properly structured HTML is organized much like an XML document). An XPath describes the location of an element, just like a catalog card does for a book. We’ll find XPaths for each of the fields we need and put them into our scraper.
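
To make the idea concrete, here is a tiny illustration of selecting content with an XPath using LXML – the HTML snippet and XPath are made up for demonstration:

from lxml import html

# Parse a small, made-up HTML fragment and select the heading text
doc = html.fromstring('<div id="title"><h1>Imploding Kittens</h1></div>')
print(doc.xpath('//div[@id="title"]/h1/text()'))  # prints ['Imploding Kittens']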

Once we extract this information, we’ll save it into a JSON file.

Since we already have the list of products, let’s get started.

What tools do we need?

For this tutorial, we will stick to using Python and a couple of Python packages for downloading and parsing the HTML. Below are the package requirements:

  • Python 2.7, available here (https://www.python.org/downloads/)
  • Python PIP to install the following packages (https://pip.pypa.io/en/stable/installing/)
  • Python Requests, available here (http://docs.python-requests.org/en/master/user/install/). Requests lets you send HTTP requests without manually building query strings into your URLs. It’s an easy-to-use library with features ranging from passing parameters in URLs to sending custom headers and SSL verification.
  • Python LXML (learn how to install it here – http://lxml.de/installation.html)

If you have PIP, installing Requests and LXML is as easy as running the line below in a terminal:

pip install requests lxml

If you don’t like or want to code, the ScrapeHero Cloud is just right for you!

Skip the hassle of installing software, programming and maintaining the code. Run this scraper in the ScrapeHero Cloud within seconds.


The Amazon Scraper

If the embed above doesn’t work, you can download the code directly from here.

Modify the code shown below with a list of your own ASINs.

from time import sleep
import json

def ReadAsin():
    # Change the list below to the ASINs you want to track.
    AsinList = ['B0046UR4F4',
                'B00JGTVU5A',
                'B00GJYCIVK',
                'B00EPGK7CQ',
                'B00EPGKA4G',
                'B00YW5DLB4',
                'B00KGD0628',
                'B00O9A48N2',
                'B00O9A4MEW',
                'B00UZKG8QU']
    extracted_data = []
    for asin in AsinList:
        url = "http://www.amazon.com/dp/" + asin
        extracted_data.append(AmazonParser(url))
        sleep(5)  # pause between requests so we don't hammer Amazon
    # Save the collected data into a JSON file.
    with open('data.json', 'w') as f:
        json.dump(extracted_data, f, indent=4)
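
ReadAsin relies on an AmazonParser function from the full script, which downloads a product page and extracts each field with XPaths. If the embed doesn’t load for you, here is a minimal sketch of what such a parser might look like – the XPaths and cleanup logic are illustrative assumptions and will need adjusting as Amazon changes its page layouts:

import requests
from lxml import html

def AmazonParser(url):
    # NOTE: a simplified sketch of the full parser; the XPaths below are
    # illustrative assumptions and may not match current Amazon layouts.
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    page = requests.get(url, headers=headers)
    doc = html.fromstring(page.content)

    raw_name = doc.xpath('//h1[@id="title"]//text()')
    raw_category = doc.xpath('//a[@class="a-link-normal a-color-tertiary"]//text()')
    raw_original_price = doc.xpath('//td[contains(text(), "List Price")]/following-sibling::td/text()')
    raw_sale_price = doc.xpath('//span[contains(@id, "ourprice") or contains(@id, "saleprice")]/text()')
    raw_availability = doc.xpath('//div[@id="availability"]//text()')

    # Collapse whitespace and join the text nodes for each field
    def clean(parts):
        return ' '.join(' '.join(parts).split()) or None

    return {
        'NAME': clean(raw_name),
        'CATEGORY': ' > '.join(p.strip() for p in raw_category if p.strip()) or None,
        'ORIGINAL_PRICE': clean(raw_original_price),
        'SALE_PRICE': clean(raw_sale_price),
        'AVAILABILITY': clean(raw_availability),
        'URL': url,
    }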

Assuming the script is named amazon_scraper.py, type the script name in a command prompt or terminal like this:

python amazon_scraper.py 

This will create a JSON output file called data.json with the data collected for the ASINs in AsinList.

The JSON output for a couple of ASINs will look similar to this:

[
    {
        "CATEGORY": "Electronics > Computers & Accessories > Data Storage > External Hard Drives",
        "ORIGINAL_PRICE": "$1,899.99",
        "NAME": "G-Technology G-SPEED eS PRO High-Performance Fail-Safe RAID Solution for HD/2K Production 8TB (0G01873)",
        "URL": "http://www.amazon.com/dp/B0046UR4F4",
        "SALE_PRICE": "$949.95",
        "AVAILABILITY": "Only 1 left in stock."
    },
    {
        "CATEGORY": "Electronics > Computers & Accessories > Data Storage > USB Flash Drives",
        "ORIGINAL_PRICE": "$599.95",
        "NAME": "G-Technology G-RAID USB Removable Dual Drive Storage System 8TB (0G04069)",
        "URL": "http://www.amazon.com/dp/B00UZKG8QU",
        "SALE_PRICE": "$599.95",
        "AVAILABILITY": "Only 2 left in stock."
    }
]

You can also extract reviews from product pages. Head over to this new blog post to learn how.

What to do if you run into captchas (get blocked) while scraping

We are adding this extra section to discuss some methods you can use to avoid getting blocked while scraping Amazon. Amazon is very likely to flag you as a “BOT” if you start scraping hundreds of pages using the code above. The easy answer is to NOT get flagged as a bot. Okay, how do we do that?

Mimic human behavior as much as possible.

While we cannot guarantee that you will not be blocked, we can share some tips and tricks to not get banned by Amazon.

1. Use proxies and rotate them

Let us say we are scraping hundreds of products on amazon.com from a laptop, which usually has just one IP address. Amazon would know we are a bot in no time, as no human would ever visit hundreds of product pages in a minute or even an hour. To look more like a human, make requests to amazon.com through a pool of IP addresses or proxies. The rule of thumb here is to have one proxy or IP address make no more than five requests to Amazon per minute. If you are scraping about 100 pages per minute, you need about 100/5 = 20 proxies. You can read more about rotating proxies here.
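
As a rough sketch of what this looks like with Requests – the proxy addresses below are placeholders you would replace with your own pool:

import random
import requests

# Placeholder proxy addresses – substitute your own pool here
PROXIES = [
    'http://192.0.2.10:8000',
    'http://192.0.2.11:8000',
    'http://192.0.2.12:8000',
]

def get_with_rotating_proxy(url, headers=None):
    # Pick a different proxy for each request
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy},
                        timeout=30)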

2. Specify the User-Agents of the latest browsers and rotate them

If you look at the code above, you will see a line where we set the User-Agent string for the request we are making.

 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'

Just like proxies, it always good to have a pool of User Agent Strings . Just make sure are using user-agent strings of the latest and popular browsers and rotate the strings for each request you make to Amazon. You can learn more about rotating user agent string in python here.  It is also a good idea to create a combination of  (User-Agent, Ip Address) , so that it looks more human than bot.
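
Here is a minimal sketch of rotating user agents with Requests – the strings below are examples and should be refreshed with current browser versions:

import random
import requests

# Example User-Agent strings – refresh these with current browser versions
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.4',
]

# Rotate the User-Agent header on every request
response = requests.get('http://www.amazon.com/dp/B0046UR4F4',
                        headers={'User-Agent': random.choice(USER_AGENTS)})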

3. Reduce the number of ASINs you scrape per minute

You can try slowing down the scrape a bit to give Amazon fewer chances of flagging you as a bot. You don’t have to be too slow – about five requests per IP per minute isn’t much throttling. If you need to go faster, add more proxies. You can modify the speed by increasing or decreasing the delay in the sleep() call inside ReadAsin above.

# Retry failed requests up to 20 times
for i in range(20):
    # Generate a random delay before each attempt
    sleep(randint(1, 3))
    try:
        # verify=False avoids SSL-related issues (it skips certificate checks)
        response = requests.get(url, headers=headers, verify=False)
        if response.status_code == 200:
            break
    except requests.exceptions.RequestException:
        # Try again on network errors
        continue

4. Retry, Retry, Retry

When you are blocked by Amazon, make sure you retry that request. If you look at the code block above, we have added 20 retries. Our code retries immediately after a scrape fails; you could do an even better job by creating a retry queue using a list and retrying those failures after all the other products are scraped from Amazon, as sketched below.
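
A minimal sketch of that idea, assuming (as an illustration) that AmazonParser returns None when a page could not be scraped:

# First pass: scrape everything, parking failures in a retry queue
failed_asins = []
extracted_data = []
for asin in AsinList:
    data = AmazonParser("http://www.amazon.com/dp/" + asin)
    if data is None:
        failed_asins.append(asin)  # park it for later
    else:
        extracted_data.append(data)

# Second pass: retry only the ASINs that failed the first time
for asin in failed_asins:
    data = AmazonParser("http://www.amazon.com/dp/" + asin)
    if data is not None:
        extracted_data.append(data)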

6 Things to keep in mind when scraping Amazon on a larger scale

Usually, there is a limit on large websites; Amazon lets you go through 400 pages per category. This should work for small-scale scraping and hobby projects, and get you started on your road to building bigger and better scrapers. However, if you want to scrape Amazon for thousands of pages at short intervals, there are some important things you should be aware of:

1. Use a Web Scraping Framework like PySpider or Scrapy

When you’re crawling a massive site like Amazon.com, you need to spend some time figuring out how to run your entire crawl smoothly. Choose an open-source framework for building your scraper, like Scrapy or PySpider, both written in Python. These frameworks have pretty active communities and can handle a lot of the errors that happen while scraping without disturbing the entire scraper. Most of them also let you use multiple threads to speed up scraping if you are using a single computer. Scrapy can be deployed to your own servers using ScrapyD.
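
To give a feel for what that looks like, here is a minimal, illustrative Scrapy spider – the spider name, start URL, and XPath are assumptions for demonstration:

import scrapy

class AmazonProductSpider(scrapy.Spider):
    name = 'amazon_products'
    # Example start URL; in practice, generate these from your ASIN list
    start_urls = ['https://www.amazon.com/dp/B0046UR4F4']

    def parse(self, response):
        # Scrapy handles scheduling, retries, and concurrency for us;
        # we only describe how to extract the fields
        yield {
            'NAME': response.xpath('//h1[@id="title"]//text()').get(),
            'URL': response.url,
        }

Saved as, say, amazon_spider.py, it could be run with scrapy runspider amazon_spider.py -o products.json.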

2. If you need speed, Distribute and Scale Up using a Cloud Provider

There is a limit to the number of pages you can scrape from a single computer. If you are going to scrape Amazon at a large scale (millions of product pages a day), you need a lot of servers to get the data within a reasonable time. Consider hosting your scraper in the cloud and using a scalable version of the framework, like Scrapy Redis. For broader crawls, you can use a message broker like Redis, RabbitMQ, or Kafka, so that you can run multiple spider instances to speed up the crawl.

3. Use a scheduler if you need to run the scraper periodically

If you are using a scraper to get updated prices or stock counts of products, you need to refresh your data frequently to keep track of the changes. If you are using the script in this tutorial, use cron (on UNIX) or Task Scheduler (on Windows) to schedule it. If you are using Scrapy, scrapyd plus cron can help schedule your spiders so you can refresh the data promptly.
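
For example, a crontab entry like the one below would run the script every six hours – the interpreter and script paths are placeholders for your own setup:

# Run the scraper every 6 hours (minute 0 of hours 0, 6, 12, 18)
0 */6 * * * /usr/bin/python /path/to/amazon_scraper.py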

4. Use a database to store the Scraped Data from Amazon

If you are scraping a large number of products from Amazon, writing data to a file will soon become inconvenient. Retrieving data becomes tough, and you might even end up with gibberish in the file when multiple processes write to it. Using a database is recommended even if you are scraping from a single computer. MySQL will be just fine for moderate workloads, and you can run simple analytics on the scraped data with tools like Tableau, Power BI, or Metabase by connecting them to your database. For larger write loads, you can look into NoSQL databases like MongoDB, Cassandra, etc. A sketch of the database approach follows.
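
Here is a minimal sketch using SQLite from the Python standard library (swap in MySQL or another database for heavier, multi-process workloads) – the table layout is an assumption matching the fields scraped above:

import sqlite3

# Create (or open) a local database and a table for the scraped fields
conn = sqlite3.connect('amazon_products.db')
conn.execute('''CREATE TABLE IF NOT EXISTS products (
                    asin TEXT PRIMARY KEY,
                    name TEXT,
                    category TEXT,
                    original_price TEXT,
                    sale_price TEXT,
                    availability TEXT,
                    url TEXT)''')

def save_product(asin, record):
    # Upsert a scraped record keyed on the ASIN
    conn.execute('INSERT OR REPLACE INTO products VALUES (?, ?, ?, ?, ?, ?, ?)',
                 (asin, record['NAME'], record['CATEGORY'],
                  record['ORIGINAL_PRICE'], record['SALE_PRICE'],
                  record['AVAILABILITY'], record['URL']))
    conn.commit()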

5. Use Request Headers, Proxies, and IP Rotation to prevent getting Captchas from Amazon

Amazon has a lot of anti-scraping measures. If you don’t throttle your requests, you’ll be blocked in no time, and you’ll start seeing captchas instead of product pages. To prevent that to a certain extent, change your headers while going through each Amazon product page – rotate your User-Agent value so requests look like they’re coming from a browser and not a script.
If you’re going to crawl Amazon at a very large scale, use proxies and IP rotation to reduce the number of captchas you get. You can learn more techniques to prevent getting blocked by Amazon and other sites here – How to prevent getting blacklisted while scraping. You can also use Python to solve some basic captchas using an OCR engine called Tesseract.

6. Write some simple data quality tests

Scraped data is always messy. An XPath that works for one page might not work for another variation of the same page on the same site, and Amazon has LOTS of product page layouts. If you spend an hour writing some basic sanity checks for your data – verifying that the price is a decimal, that the title is a string under, say, 250 characters, etc. – you’ll know when your scraper breaks and you’ll be able to minimize its impact. This is a must if you feed the scraped Amazon data into a price optimization program. A sketch of such checks is shown below.
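
Here is a minimal sketch of such checks over the record format produced above – the price pattern and length threshold are illustrative assumptions. Run something like this on every record before saving it, and log anything that fails:

import re

def validate_record(record):
    # Basic sanity checks; returns False if anything looks off
    checks = [
        bool(record.get('NAME')) and len(record['NAME']) < 250,
        record.get('SALE_PRICE') is None or
            bool(re.match(r'^\$[\d,]+(\.\d{2})?$', record['SALE_PRICE'])),
        record.get('URL', '').startswith('http'),
    ]
    return all(checks)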

We hope this tutorial gave you a better idea of how to scrape Amazon or similar e-commerce websites. As a company, we understand e-commerce data, having worked with it before. If you are interested in professional help with scraping complex websites, let us know, and we will be glad to help.


Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.
