How To Scrape Amazon Product Details and Pricing using Python

In this tutorial, we will walk you through how to create a simple scraper for extracting Amazon’s product details. You can use it for:

  1. Scraping product details that you can’t get with the Product Advertising API
  2. Monitoring products for change in Price, Stock Count/Availability, Rating, etc.
  3. Analyzing how a particular Brand sells on Amazon
  4. Analyzing Amazon product reviews
  5. Or anything else – the possibilities are endless and only bound by your imagination

 

Here is the list of data points we will be extracting:

  1. Product Name
  2. Category
  3. Original Price
  4. Sale Price
  5. Availability
  6. URL

We’ll be building a crawler in Python that can go to any Amazon product page using an ASIN (Amazon Standard Identification Number), a unique identifier Amazon uses to keep track of products in its database.

In this tutorial, we will show you how to extract details of products by first identifying their ASINs. Start by collecting a list of the products you want to track. When you visit an Amazon product page, the ASIN appears in the URL, right after /dp/:

https://www.amazon.com/OnePlus-A5000-Unlocked-International-Warranty/dp/B0734X8GW5/
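
If you have a list of product URLs rather than ASINs, the ASIN can be pulled out of each URL with a regular expression. The snippet below is a minimal sketch: the helper name extract_asin is our own, and it assumes the ASIN is a 10-character string of uppercase letters and digits that follows /dp/, as in the URL above. Other Amazon URL layouts are not handled.

```python
import re

# Assumption: an ASIN is 10 uppercase letters/digits and directly follows "/dp/".
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})(?:[/?]|$)")


def extract_asin(url):
    """Return the ASIN found in an Amazon product URL, or None if there is none."""
    match = ASIN_PATTERN.search(url)
    return match.group(1) if match else None


url = "https://www.amazon.com/OnePlus-A5000-Unlocked-International-Warranty/dp/B0734X8GW5/"
print(extract_asin(url))
```

URLs that do not match the assumed pattern return None, so it is easy to skip them when building the ASIN list.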

Then we will download the HTML of each product’s page and identify the XPaths for the data elements we need, such as Product Title, Price, and Description. Read more about XPaths here.
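
To give a feel for how XPaths pick data out of HTML, here is a small sketch that runs lxml against a hard-coded page instead of a live download. The markup in SAMPLE_PAGE is hypothetical and far simpler than a real Amazon page, and the clean helper is our own; only the element ids are borrowed from the kind of XPaths a product-page scraper typically uses.

```python
from lxml import html

# Hypothetical, heavily simplified markup standing in for a downloaded product page.
SAMPLE_PAGE = """
<html><body>
  <h1 id="title"><span> Example  Product Name </span></h1>
  <span id="priceblock_ourprice">$949.95</span>
  <div id="availability"><span> Only 1 left in stock. </span></div>
</body></html>
"""


def clean(parts):
    """Join the text nodes and collapse runs of whitespace; None if nothing is left."""
    return " ".join("".join(parts).split()) or None


doc = html.fromstring(SAMPLE_PAGE)

# Each XPath returns a list of text nodes, which we then tidy up.
name = clean(doc.xpath('//h1[@id="title"]//text()'))
sale_price = clean(doc.xpath('//span[contains(@id,"ourprice")]/text()'))
availability = clean(doc.xpath('//div[@id="availability"]//text()'))

print(name, sale_price, availability)
```

An XPath that matches nothing returns an empty list, so clean gives back None rather than raising an error. That is worth checking for, because page layouts change over time.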

Required Tools

For this tutorial, we will stick to using Python and a couple of Python packages for downloading and parsing the HTML. Below are the package requirements:

  • Python 2.7, available here (https://www.python.org/downloads/)
  • Python PIP, to install the following packages in Python (https://pip.pypa.io/en/stable/installing/)
  • Python Requests, available here (http://docs.python-requests.org/en/master/user/install/)
  • Python LXML (learn how to install it here: http://lxml.de/installation.html)

The Code

Modify the code shown below with a list of your own ASINs.

import json
from time import sleep

def ReadAsin():
    # Replace the list below with the ASINs you want to track.
    AsinList = ['B0046UR4F4',
                'B00JGTVU5A',
                'B00GJYCIVK',
                'B00EPGK7CQ',
                'B00EPGKA4G',
                'B00YW5DLB4',
                'B00KGD0628',
                'B00O9A48N2',
                'B00O9A4MEW',
                'B00UZKG8QU']
    extracted_data = []
    for i in AsinList:
        url = "http://www.amazon.com/dp/" + i
        # AmzonParser is the page-parsing function defined elsewhere in the script.
        extracted_data.append(AmzonParser(url))
        sleep(5)  # Pause between requests.
    # Save the collected data into a JSON file; "with" closes the file for us.
    with open('data.json', 'w') as f:
        json.dump(extracted_data, f, indent=4)

Assuming the script is named amazon_scraper.py, run it from the command prompt or terminal like this:

python amazon_scraper.py 

This will create a JSON output file called data.json with the data collected for the list of ASINs present in the AsinList.

The JSON output for a couple of ASINs will look similar to this:

[
    {
        "CATEGORY": "Electronics > Computers & Accessories > Data Storage > External Hard Drives",
        "ORIGINAL_PRICE": "$1,899.99",
        "NAME": "G-Technology G-SPEED eS PRO High-Performance Fail-Safe RAID Solution for HD/2K Production 8TB (0G01873)",
        "URL": "http://www.amazon.com/dp/B0046UR4F4",
        "SALE_PRICE": "$949.95",
        "AVAILABILITY": "Only 1 left in stock."
    },
    {
        "CATEGORY": "Electronics > Computers & Accessories > Data Storage > USB Flash Drives",
        "ORIGINAL_PRICE": "$599.95",
        "NAME": "G-Technology G-RAID USB Removable Dual Drive Storage System 8TB (0G04069)",
        "URL": "http://www.amazon.com/dp/B00UZKG8QU",
        "SALE_PRICE": "$599.95",
        "AVAILABILITY": "Only 2 left in stock."
    }
]
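
The prices in this output are plain strings, so they need converting before you can compare them. The sketch below shows one way to do that. The helper names parse_price and discount_percent are our own, and the code assumes US-style prices with a leading "$" and "," as the thousands separator, as in the sample above.

```python
from decimal import Decimal


def parse_price(text):
    """Convert a price string such as "$1,899.99" to a Decimal (US format assumed)."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def discount_percent(record):
    """Percentage saved relative to ORIGINAL_PRICE, rounded to one decimal place."""
    original = parse_price(record["ORIGINAL_PRICE"])
    sale = parse_price(record["SALE_PRICE"])
    return round((original - sale) / original * 100, 1)


# A record shaped like the first entry of the sample output.
record = {"ORIGINAL_PRICE": "$1,899.99", "SALE_PRICE": "$949.95"}
print(discount_percent(record))
```

A real run may contain records where a price could not be found and is stored as null, so check for None before calling these helpers.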

Large websites usually impose limits on how much you can browse. Amazon, for example, lets you go through 400 pages per category. If you would like to extract more than a few thousand products per category, this scraper is probably not going to work for you. It should work for small-scale scraping and hobby projects, and get you started on your road to building bigger and better scrapers.

However, if you do want to scrape thousands of pages at short intervals, there are some important things you should be aware of. You can read about them in “Scalable do-it-yourself scraping – How to build and run scrapers on a large scale” and “How to prevent getting blacklisted while scraping”.

If you also want to scrape Amazon reviews for a product, head over to this new blog post.

We hope this tutorial gave you a better idea of how to scrape Amazon or similar e-commerce websites.

As a company, we understand e-commerce data, having worked with it before. If you are interested in professional help with scraping complex websites, let us know and we will be glad to help.


Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code; however, if you add your questions in the comments section, we may periodically address them.

41 thoughts on “How To Scrape Amazon Product Details and Pricing using Python”

  1. I tried to get the image URL using this XPath:

    XPATH_IMG = '//div[@class="imgTagWrapper"]/img/@src//text()'

    but the result is null. Can you point me to how to achieve this?

  2. Hejsan from Sweden,

    I am a total “dummy” when it comes to Python. I tried to use this code with Python 3 instead, which, as I understand it, comes with pip and requests included. Anyway, I do not get a data.json file: the provided code does not run, and when I check it in Python I get an error about missing parentheses. I just wonder whether the code should work with Python 3 as well and, if not, why? Is it a different language?

    best regards,

    Chris

    1. Hi Chris,
      Yes, it is almost a new language. Python 2 code will not run under Python 3 in most cases, especially with the libraries used.
      Try downloading Python 2 and running the code with that.

      Thanks

    2. Hi Chris,

      I am running the following version of python:
      Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
      [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
      Type "help", "copyright", "credits" or "license" for more information.

      I changed the code only a little to fit Python 3. Pasted the code below. Let me know if you need any help.

      from lxml import html
      import csv, os, json
      import requests
      from time import sleep

      MAX_RETRIES = 5  # Arbitrary cap so a blocked request cannot loop forever.

      def AmzonParser(url):
          headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
          for attempt in range(MAX_RETRIES):
              sleep(3)
              try:
                  # Download inside the loop so that each retry fetches the page again.
                  page = requests.get(url, headers=headers)
                  if page.status_code != 200:
                      raise ValueError('captcha')

                  doc = html.fromstring(page.content)
                  XPATH_NAME = '//h1[@id="title"]//text()'
                  XPATH_SALE_PRICE = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
                  XPATH_ORIGINAL_PRICE = '//td[contains(text(),"List Price") or contains(text(),"M.R.P") or contains(text(),"Price")]/following-sibling::td/text()'
                  XPATH_CATEGORY = '//a[@class="a-link-normal a-color-tertiary"]//text()'
                  XPATH_AVAILABILITY = '//div[@id="availability"]//text()'

                  RAW_NAME = doc.xpath(XPATH_NAME)
                  RAW_SALE_PRICE = doc.xpath(XPATH_SALE_PRICE)
                  RAW_CATEGORY = doc.xpath(XPATH_CATEGORY)
                  RAW_ORIGINAL_PRICE = doc.xpath(XPATH_ORIGINAL_PRICE)
                  RAW_AVAILABILITY = doc.xpath(XPATH_AVAILABILITY)

                  # Join the text nodes and collapse whitespace; None when nothing matched.
                  NAME = ' '.join(''.join(RAW_NAME).split()) if RAW_NAME else None
                  SALE_PRICE = ' '.join(''.join(RAW_SALE_PRICE).split()).strip() if RAW_SALE_PRICE else None
                  CATEGORY = ' > '.join([i.strip() for i in RAW_CATEGORY]) if RAW_CATEGORY else None
                  ORIGINAL_PRICE = ''.join(RAW_ORIGINAL_PRICE).strip() if RAW_ORIGINAL_PRICE else None
                  AVAILABILITY = ''.join(RAW_AVAILABILITY).strip() if RAW_AVAILABILITY else None

                  # Fall back to the sale price when no list price is shown.
                  if not ORIGINAL_PRICE:
                      ORIGINAL_PRICE = SALE_PRICE

                  data = {
                      'NAME': NAME,
                      'SALE_PRICE': SALE_PRICE,
                      'CATEGORY': CATEGORY,
                      'ORIGINAL_PRICE': ORIGINAL_PRICE,
                      'AVAILABILITY': AVAILABILITY,
                      'URL': url,
                  }
                  return data
              except Exception as e:
                  print(e)
          return None  # Every attempt failed.

      def ReadAsin():
          # AsinList = csv.DictReader(open(os.path.join(os.path.dirname(__file__), "Asinfeed.csv")))
          AsinList = ['B0046UR4F4',
                      'B00JGTVU5A',
                      'B00GJYCIVK',
                      'B00EPGK7CQ',
                      'B00EPGKA4G',
                      'B00YW5DLB4',
                      'B00KGD0628',
                      'B00O9A48N2',
                      'B00O9A4MEW',
                      'B00UZKG8QU']
          extracted_data = []
          for i in AsinList:
              url = "http://www.amazon.com/dp/" + i
              print("Processing: " + url)
              extracted_data.append(AmzonParser(url))
              sleep(5)
          # Save the collected data; "with" closes the file for us.
          with open('data.json', 'w') as f:
              json.dump(extracted_data, f, indent=4)

      if __name__ == "__main__":
          ReadAsin()

  3. Thanks a lot for this very useful script. I’m going on to the next step: Scalable do-it-yourself scraping – How to build and run scrapers on a large scale

  4. I keep on getting this error: SSLError: HTTPSConnectionPool(host='www.amazon.com', port=443): Max retries exceeded with url: /dp/B00YG0JV96 (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",),))

    What am I missing?

  5. I am looking to modify this script to also scrape Walmart, Gamestop, Target, etc. What resources can you point me to for modifying this script to include those?
