How To Scrape Amazon Product Details and Pricing using Python


Amazon provides a Product Advertising API, but like most APIs, it doesn’t provide all the information that Amazon shows on a product page.

The only way to get the exact data that you see on a product page is by using a web scraper, which retrieves the same page you would see by visiting the site in a web browser.

Scraping Amazon for data is useful for a lot of things, such as:

  1. Scrape product details that you can’t get with the Product Advertising API
  2. Monitor an item for change in Price, Stock Count/Availability, Rating etc.
  3. Analyze how a particular Brand is being sold on Amazon
  4. Analyze Amazon marketplace Sellers
  5. Analyze Amazon Product Reviews
  6. Or anything else – the possibilities are endless and only bound by your imagination

An easy way to get started with scraping Amazon is to build a crawler in Python that can go to any Amazon product’s page using its ASIN (Amazon Standard Identification Number, a unique identifier Amazon uses to keep track of products in its database).

If you are looking for a service to collect this data for your business needs, we can help.


If not, let’s continue with the tutorial.

First, let’s collect a list of products identified by their ASINs. An ASIN looks like this:

B00JGTVU5A or B00GJYCIVK
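
As a rough sketch of this first step, the snippet below filters a list of strings down to the ones shaped like an ASIN and builds the product URL for each. The helper names `is_probable_asin` and `product_url` are hypothetical, and the pattern (ten uppercase letters or digits) is an assumption based on the two examples above, not an official rule.

```python
import re

# Assumption: an ASIN is 10 uppercase letters/digits, as in the examples above.
ASIN_PATTERN = re.compile(r"^[A-Z0-9]{10}$")


def is_probable_asin(value):
    """Return True if value has the shape of an ASIN (hypothetical helper)."""
    return bool(ASIN_PATTERN.match(value.strip()))


def product_url(asin):
    """Build the product page URL in the same form the tutorial code uses."""
    return "http://www.amazon.com/dp/" + asin.strip()


candidates = ["B00JGTVU5A", "B00GJYCIVK", "not-an-asin"]
asins = [c for c in candidates if is_probable_asin(c)]
urls = [product_url(a) for a in asins]
print(urls)
```

Filtering up front means a typo in the list is dropped before any request is made, rather than showing up later as an empty record.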

Then we will download the HTML of each product’s page and start identifying the XPaths for the data elements that you need – e.g. Product Title, Price, Description etc. Read more about XPaths here.
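
To give a feel for what "identifying an XPath" means, here is a minimal sketch that pulls values out of a tiny, hand-written page. It is not the tutorial's scraper: it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, so that it runs without installing anything. With lxml the equivalent would go through `html.fromstring(...)` and `.xpath(...)`. The sample markup, the element ids, and the helper `first_text` are all illustrative assumptions, not Amazon's actual page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a downloaded product page.
# The element ids are illustrative assumptions, not Amazon's guaranteed markup.
SAMPLE_PAGE = """
<html>
  <body>
    <span id="productTitle">  Example External Drive 8TB  </span>
    <span id="priceblock_ourprice">$949.95</span>
  </body>
</html>
"""


def first_text(root, xpath):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = root.find(xpath)
    if node is None or node.text is None:
        return None
    return node.text.strip()


root = ET.fromstring(SAMPLE_PAGE)
record = {
    "NAME": first_text(root, ".//span[@id='productTitle']"),
    "SALE_PRICE": first_text(root, ".//span[@id='priceblock_ourprice']"),
    # No such element in the sample page, so this field ends up as None.
    "AVAILABILITY": first_text(root, ".//div[@id='availability']"),
}
print(record)
```

When an expression matches nothing, the field is `None`, which `json.dump` writes as `null`. That is worth remembering when a scraper runs without errors but produces empty records.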

The Code

Prerequisites:

For this tutorial, we will stick to using basic Python and a couple of Python packages – requests and lxml. We will not use more complicated packages like Scrapy for something this simple.

You will need to install the following:

  • Python 2.7, available here ( https://www.python.org/downloads/ )
  • Python Requests, available here ( http://docs.python-requests.org/en/master/user/install/ ). You might need Python pip to install it, available here ( https://pip.pypa.io/en/stable/installing/ )
  • Python LXML ( learn how to install it here – http://lxml.de/installation.html )

We make this process a bit easier for you by providing the actual Python code. The code will help scrape a few important data elements such as Product Name, Price, Availability, Description etc.

Feel free to copy and modify it to your needs – that is the best way to learn! You can download the code directly from here.

 

 

Modify the code shown below with a list of your own ASINs.

import json
from time import sleep

# AmzonParser(url) is defined elsewhere in the script (not shown here).

def ReadAsin():
    # Change the list below with the ASINs you want to track.
    AsinList = [
        'B0046UR4F4',
        'B00JGTVU5A',
        'B00GJYCIVK',
        'B00EPGK7CQ',
        'B00EPGKA4G',
        'B00YW5DLB4',
        'B00KGD0628',
        'B00O9A48N2',
        'B00O9A4MEW',
        'B00UZKG8QU',
    ]
    extracted_data = []
    for i in AsinList:
        url = "http://www.amazon.com/dp/" + i
        extracted_data.append(AmzonParser(url))
        sleep(5)  # Pause for 5 seconds between requests.
    # Save the collected data into a JSON file; "with" closes the file for us.
    with open('data.json', 'w') as f:
        json.dump(extracted_data, f, indent=4)

and run it from a terminal or command prompt like this (if you name the file amazon_scraper.py):

python amazon_scraper.py 

You’ll get a file called data.json with the data collected for the ASINs you had in AsinList in the code.

Here is what the JSON output for a couple of ASINs looks like:

[
    {
        "CATEGORY": "Electronics > Computers & Accessories > Data Storage > External Hard Drives",
        "ORIGINAL_PRICE": "$1,899.99",
        "NAME": "G-Technology G-SPEED eS PRO High-Performance Fail-Safe RAID Solution for HD/2K Production 8TB (0G01873)",
        "URL": "http://www.amazon.com/dp/B0046UR4F4",
        "SALE_PRICE": "$949.95",
        "AVAILABILITY": "Only 1 left in stock."
    },
    {
        "CATEGORY": "Electronics > Computers & Accessories > Data Storage > USB Flash Drives",
        "ORIGINAL_PRICE": "$599.95",
        "NAME": "G-Technology G-RAID USB Removable Dual Drive Storage System 8TB (0G04069)",
        "URL": "http://www.amazon.com/dp/B00UZKG8QU",
        "SALE_PRICE": "$599.95",
        "AVAILABILITY": "Only 2 left in stock."
    }
]
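
Once the data is in this shape, the price strings need converting to numbers before you can compare them. The sketch below shows one possible way to do that. The helpers `parse_price` and `discount` are hypothetical names, the first two sample records reuse prices and URLs from the output above, and the third record (with null prices) is an assumed example of a page where nothing could be extracted.

```python
import json
from decimal import Decimal

# Records in the same shape as data.json, trimmed to the fields used here.
# The third record is an assumed example of a page that yielded no prices.
SAMPLE_JSON = """
[
    {"URL": "http://www.amazon.com/dp/B0046UR4F4",
     "ORIGINAL_PRICE": "$1,899.99", "SALE_PRICE": "$949.95"},
    {"URL": "http://www.amazon.com/dp/B00UZKG8QU",
     "ORIGINAL_PRICE": "$599.95", "SALE_PRICE": "$599.95"},
    {"URL": "http://www.amazon.com/dp/B00JGTVU5A",
     "ORIGINAL_PRICE": null, "SALE_PRICE": null}
]
"""


def parse_price(text):
    """Convert a string such as "$1,899.99" to a Decimal; None stays None."""
    if text is None:
        return None
    return Decimal(text.replace("$", "").replace(",", "").strip())


def discount(record):
    """Return original price minus sale price, or None if either is missing."""
    original = parse_price(record.get("ORIGINAL_PRICE"))
    sale = parse_price(record.get("SALE_PRICE"))
    if original is None or sale is None:
        return None
    return original - sale


records = json.loads(SAMPLE_JSON)
for record in records:
    print("%s %s" % (record["URL"], discount(record)))
```

`Decimal` is used instead of `float` so that currency arithmetic stays exact. To work on real output, replace `json.loads(SAMPLE_JSON)` with `json.load` on the opened data.json file.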

This should work for small scale scraping and hobby projects and get you started on your road to building bigger and better scrapers.

However, if you want to scrape thousands of pages, there are some important things you should be aware of. You can read about them at Scalable do-it-yourself scraping – How to build and run scrapers on a large scale.

Web scraping is very useful for automating both simple and complex tasks that computers can easily do.

Thanks for reading. If you need help with your complex scraping projects, let us know and we will be glad to help.

EDIT: Nov 25 2016 – If you want to also scrape Amazon reviews for a product, head over to this new blog post.



Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

24 thoughts on “How To Scrape Amazon Product Details and Pricing using Python”

  1. I don’t get any output, and no error either. In the JSON file all the values are ‘null’, e.g.:
    [
    {
    “CATEGORY”: null,
    “ORIGINAL_PRICE”: null,
    “NAME”: null,
    “URL”: “http://www.amazon.com/dp/B0046UR4F4”,
    “SALE_PRICE”: null,
    “AVAILABILITY”: null
    },
    ]

    1. Hi, the user agent trick didn’t work for me. ScrapeHero, has something changed in Amazon’s code? I get these results:

      [
      {
      “CATEGORY”: null,
      “ORIGINAL_PRICE”: null,
      “NAME”: null,
      “URL”: “http://www.amazon.com/dp/B0046UR4F4”,
      “SALE_PRICE”: null,
      “AVAILABILITY”: null
      },
      {
      “CATEGORY”: null,
      “ORIGINAL_PRICE”: null,
      “NAME”: null,
      “URL”: “http://www.amazon.com/dp/B00JGTVU5A”,
      “SALE_PRICE”: null,
      “AVAILABILITY”: null
      },

  2. Are there any cheap web hosting solutions that have Python installed? I’m hoping I could set up my required Amazon products, update prices daily, then point a website/app to the .json file on my new shared hosting.

    Maybe even AWS, Azure etc or a Cloud IDE. Just looking for a simple solution to start off with.

  3. Nice implementation! Very well done! Just a question… What is the purpose of the sleep() calls? How come Amazon does not return a typical robot/spider message telling you to use their API?

      1. In the beginning I did not use headers in the requests.get() so in the HTML (html.fromstring()) content there was the following message “To discuss automated access to Amazon data please contact mail. For information about migrating to our APIs refer to our Marketplace APIs at link, or our Product Advertising API at link for advertising use cases.” from Amazon.

    1. Sure but it would need modification to this code.
      The tutorial provides the basis for it but you will need to identify the xpaths for the review and grab the content that way.

    1. Hi Saul,
      The code should work but at those numbers (1500 products) the code is not the problem.
      Everything else related to web scraping that we have written about on our site starts to matter.
      Please try the code by modifying it and let us know.

      Thanks

      1. I was trying to read a CSV file as:

        AsinList = csv.DictReader(open(os.path.join(os.path.dirname(__file__), "asinnumbers.csv")))

        But I am getting the error below:

        Traceback (most recent call last):
          File "amazon_scraper.py", line 66, in <module>
            ReadAsin()
          File "amazon_scraper.py", line 57, in ReadAsin
            url = "http://www.amazon.com/dp/"+i
        TypeError: cannot concatenate 'str' and 'dict' objects

        Any recommendations? I already googled it, but could not find anything.

        1. Hi Saul,

          You are trying to concatenate a dictionary object with “http://www.amazon.com/dp/”.

          Can you try replacing

          url = "http://www.amazon.com/dp/"+i

          with

          url = "http://www.amazon.com/dp/"+i['asin']

          This is assuming that your CSV has a single column with the header asin and looks like this:

          asin
          B00JGTVU5A
          B00GJYCIVK
          B00EPGK7CQ
          B00EPGKA4G
          B00YW5DLB4
          B00KGD0628
          B00O9A48N2
          B00O9A4MEW
          B00UZKG8QU

          1. Thanks a lot for this amazing tutorial, but after using the script for a few days, it is no longer working well. I am getting a lot of results like the one below:

            “CATEGORY”: null,
            “ORIGINAL_PRICE”: null,
            “NAME”: null,
            “URL”: “http://www.amazon.com/dp/B00FF01SSS”,
            “SALE_PRICE”: null,
            “AVAILABILITY”: null

            And as I told you, everything was working amazingly well. I even added the code below to switch headers every time…

            from random import randint

            # Pick one of three browser User-Agent strings at random.
            navegador = randint(0, 2)
            if navegador == 0:
                headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
                print('Using Chrome')
            elif navegador == 1:
                headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
                print('Using Firefox')
            else:
                headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240'}
                print('Using Edge')

            And everything was perfect until today. Any ideas why?
            Thanks!

  4. Does anyone know of a commercial version of this process? I am looking to scrape Amazon data for an inventory system. We have the ASINs on incoming Excel sheets, but need to pull product data and images to populate the inventory. We’d be happy to pay for a pre-existing version of this process rather than build it ourselves or hire a developer.

  5. The main issue I see with this is that it only gets the offer from the Buy Box, but not every offer available from Amazon. I’m trying to do this now to see if I can get it to work; just not overly familiar with python. But I know the URLs stay pretty much the same: http://www.amazon.com/gp/offer-listing/{ASIN}/ref=olp_f_freeShipping?ie=UTF8&f_freeShipping=true&f_new=true&f_primeEligible=true
