How To Scrape Amazon Product Data and Prices using Python 3

In this tutorial, we will build an Amazon scraper for extracting product details and pricing. We will build this simple web scraper using Python and SelectorLib and run it in a console. But before we scrape Amazon product data, let’s look at what you can use it for.

This Amazon scraper is limited to extracting the data points below, from a product page:

  1. Product Name
  2. Price
  3. Short Description
  4. Full Product Description
  5. Image URLs
  6. Rating
  7. Number of Reviews
  8. Variant ASINs
  9. Sales Rank
  10. Link to all Reviews Page

How to use Amazon Product Data

  1. Monitor Amazon products for change in Price, Stock Count/Availability, Rating, etc.
    By using a web scraper, you can update your data feeds on a timely basis to monitor any product changes. These data feeds can help you form pricing strategies by looking at your competition – other sellers or brands.
  2. Scrape Amazon Product Details that you can’t get with the Product Advertising API
    Amazon provides a Product Advertising API, but like most other “API”s, this API doesn’t provide all the information that Amazon has on a product page. A scraper can help you extract all the details displayed on the product page.
  3. Analyze how a particular Brand sells on Amazon
    If you’re a retailer, you can monitor your competitor’s products and see how well they do in the market and make adjustments to reprice and sell your products. You could also use it to monitor your distribution channel to identify how your products are sold on Amazon by sellers, and if it is causing you any harm.
  4. Find Customer Opinions from Amazon Product Reviews
    Reviews offer abundant amounts of information. If you’re targeting an established set of sellers who have been selling reasonable volumes, you can extract the reviews of their products to find what you should avoid and what you could quickly improve on while trying to sell similar products on Amazon.

Setting up your computer for web scraper development

We will use Python 3 for this Amazon scraper. The code will not run if you are using Python 2.7. To start, you need a computer with Python 3 and PIP installed in it.

Follow this guide to set up your computer and install packages:

How To Install Python Packages for Web Scraping in Windows 10

Install Packages

  • Python Requests, to make requests and download the HTML content of the Amazon product pages
  • SelectorLib Python package, to extract data from the downloaded web pages using the YAML file we create below

Using pip3,

pip3 install requests selectorlib

Creating a Selectorlib Template for Amazon Product Page

Selectorlib is a combination of tools for developers that makes marking up and extracting data from web pages easy. The Selectorlib Chrome Extension lets you mark the data that you need to extract, creates the CSS Selectors or XPaths needed to extract that data, and then previews what the data would look like.

You can learn more about Selectorlib and how to use it here

Let’s start by marking all data we need from Amazon Product Page using Selectorlib Chrome Extension.

Selectorlib Template for Amazon.com

Once you have created the template, click on ‘Highlight’ to highlight and preview all of your selectors. Finally, click on ‘Export’ and download the YAML file.

Export Template from Selectorlib as YML

Here is what our template looks like:

name:
    css: '#productTitle'
    type: Text
price:
    css: '#price_inside_buybox'
    type: Text
short_description:
    css: '#featurebullets_feature_div'
    type: Text
images:
    css: '.imgTagWrapper img'
    type: Attribute
    attribute: data-a-dynamic-image
rating:
    css: span.arp-rating-out-of-text
    type: Text
number_of_reviews:
    css: 'a.a-link-normal h2'
    type: Text
variants:
    css: 'form.a-section li'
    multiple: true
    type: Text
    children:
        name:
            css: ""
            type: Attribute
            attribute: title
        asin:
            css: ""
            type: Attribute
            attribute: data-defaultasin
product_description:
    css: '#productDescription'
    type: Text
sales_rank:
    css: 'li#SalesRank'
    type: Text
link_to_all_reviews:
    css: 'div.card-padding a.a-link-emphasis'
    type: Link

Let’s save this as a file called selectors.yml in the same directory as our code.

The Code

Create a folder called amazon-scraper and paste your selectorlib yaml template file as selectors.yml.

Let’s create a file called amazon.py and paste the code below into it. All it does is

    1. Read a list of Amazon Product URLs from a file called urls.txt
    2. Scrape the data
    3. Save the data as a JSON Lines file

from selectorlib import Extractor
import requests
import json
from time import sleep


# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('selectors.yml')

def scrape(url):    
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s"%url)
    r = requests.get(url, headers=headers)
    # Simple check to see whether the page was blocked (usually a 503)
    if r.status_code >= 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n"%url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d"%(url,r.status_code))
        return None
    # Pass the HTML of the page to the extractor and return the extracted data
    return e.extract(r.text)

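# Read one URL per line and write one JSON object per line (JSON Lines)
with open("urls.txt",'r') as urllist, open('output.jsonl','w') as outfile:
    for line in urllist:
        url = line.strip()  # drop the trailing newline read from the file
        if not url:
            continue  # skip blank lines
        data = scrape(url)
        if data:
            json.dump(data,outfile)
            outfile.write("\n")
            # sleep(5)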

Running the Amazon Scraper

You can get the full code from Github – https://github.com/scrapehero-code/amazon-scraper

You can start your scraper by typing the command:

python3 amazon.py

Once the scrape is complete you should see a file called output.jsonl with your data. Here is an example for the URL

https://www.amazon.com/HP-Computer-Quard-Core-Bluetooth-Accessories/dp/B085383P7M/

{
  "name": "2020 HP 15.6\" Laptop Computer, 10th Gen Intel Quard-Core i7 1065G7 up to 3.9GHz, 16GB DDR4 RAM, 512GB PCIe SSD, 802.11ac WiFi, Bluetooth 4.2, Silver, Windows 10, YZAKKA USB External DVD + Accessories",
  "price": "$959.00",
  "short_description": "Powered by latest 10th Gen Intel Core i7-1065G7 Processor @ 1.30GHz (4 Cores, 8M Cache, up to 3.90 GHz); Ultra-low-voltage platform. Quad-core, eight-way processing provides maximum high-efficiency power to go.\n15.6\" diagonal HD SVA BrightView micro-edge WLED-backlit, 220 nits, 45% NTSC (1366 x 768) Display; Intel Iris Plus Graphics\n16GB 2666MHz DDR4 Memory for full-power multitasking; 512GB Solid State Drive (PCI-e), Save files fast and store more data. With massive amounts of storage and advanced communication power, PCI-e SSDs are great for major gaming applications, multiple servers, daily backups, and more.\nRealtek RTL8821CE 802.11b/g/n/ac (1x1) Wi-Fi and Bluetooth 4.2 Combo; 1 USB 3.1 Gen 1 Type-C (Data Transfer Only, 5 Gb/s signaling rate); 2 USB 3.1 Gen 1 Type-A (Data Transfer Only); 1 AC smart pin; 1 HDMI 1.4b; 1 headphone/microphone combo\nWindows 10 Home, 64-bit, English; Natural silver; YZAKKA USB External DVD drive + USB extension cord 6ft, HDMI cable 6ft and Mouse Pad\n› See more product details",
  "images": "{\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX425_.jpg\":[425,425],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX466_.jpg\":[466,466],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY355_.jpg\":[355,355],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX569_.jpg\":[569,569],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY450_.jpg\":[450,450],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX679_.jpg\":[679,679],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX522_.jpg\":[522,522]}",
  "variants": [
    {
      "name": "Click to select 4GB DDR4 RAM, 128GB PCIe SSD",
      "asin": "B01MCZ4LH1"
    },
    {
      "name": "Click to select 8GB DDR4 RAM, 256GB PCIe SSD",
      "asin": "B08537NR9D"
    },
    {
      "name": "Click to select 12GB DDR4 RAM, 512GB PCIe SSD",
      "asin": "B08537ZDYH"
    },
    {
      "name": "Click to select 16GB DDR4 RAM, 512GB PCIe SSD",
      "asin": "B085383P7M"
    },
    {
      "name": "Click to select 20GB DDR4 RAM, 1TB PCIe SSD",
      "asin": "B08537NDVZ"
    }
  ],
  "product_description": "Capacity:16GB DDR4 RAM, 512GB PCIe SSD\n\nProcessor\n\n  Intel Core i7-1065G7 (1.3 GHz base frequency, up to 3.9 GHz with Intel Turbo Boost Technology, 8 MB cache, 4 cores)\n\nChipset\n\n  Intel Integrated SoC\n\nMemory\n\n  16GB DDR4-2666 SDRAM\n\nVideo graphics\n\n  Intel Iris Plus Graphics\n\nHard drive\n\n  512GB PCIe NVMe M.2 SSD\n\nDisplay\n\n  15.6\" diagonal HD SVA BrightView micro-edge WLED-backlit, 220 nits, 45% NTSC (1366 x 768)\n\nWireless connectivity\n\n  Realtek RTL8821CE 802.11b/g/n/ac (1x1) Wi-Fi and Bluetooth 4.2 Combo\n\nExpansion slots\n\n  1 multi-format SD media card reader\n\nExternal ports\n\n  1 USB 3.1 Gen 1 Type-C (Data Transfer Only, 5 Gb/s signaling rate); 2 USB 3.1 Gen 1 Type-A (Data Transfer Only); 1 AC smart pin; 1 HDMI 1.4b; 1 headphone/microphone combo\n\nMinimum dimensions (W x D x H)\n\n  9.53 x 14.11 x 0.70 in\n\nWeight\n\n  3.75 lbs\n\nPower supply type\n\n  45 W Smart AC power adapter\n\nBattery type\n\n  3-cell, 41 Wh Li-ion\n\nBattery life mixed usage\n\n  Up to 11 hours and 30 minutes\n\n  Video Playback Battery life\n\n  Up to 10 hours\n\nWebcam\n\n  HP TrueVision HD Camera with integrated dual array digital microphone\n\nAudio features\n\n  Dual speakers\n\nOperating system\n\n  Windows 10 Home 64\n\nAccessories\n\n  YZAKKA USB External DVD drive + USB extension cord 6ft, HDMI cable 6ft and Mouse Pad",
  "link_to_all_reviews": "https://www.amazon.com/HP-Computer-Quard-Core-Bluetooth-Accessories/product-reviews/B085383P7M/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
}
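
Note that the `images` value in the output above is itself a JSON-encoded string (the raw `data-a-dynamic-image` attribute), mapping each image URL to its [width, height]. Below is a minimal sketch of decoding that field, assuming records shaped like the example output. The helper name `image_urls` and the example.com URLs are our own placeholders, not part of Selectorlib or Amazon.

```python
import json


def image_urls(record):
    """Return the image URLs from a scraped record, largest image first."""
    raw = record.get("images")
    if not raw:
        return []
    # The attribute is a JSON object of the form {url: [width, height], ...}
    sizes = json.loads(raw)
    return sorted(sizes, key=lambda url: sizes[url][0] * sizes[url][1], reverse=True)


# A shortened record in the same shape as the example output (placeholder URLs)
record = {
    "images": '{"https://example.com/a_SX425_.jpg":[425,425],'
              '"https://example.com/a_SX679_.jpg":[679,679]}'
}
print(image_urls(record)[0])
```

Sorting by pixel area is just one choice; you could equally keep the whole mapping and pick a size that suits your use.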

What to do if you get blocked while scraping Amazon

We are adding this extra section to talk about some methods you can use to avoid getting blocked while scraping Amazon. Amazon is very likely to flag you as a bot if you start scraping hundreds of pages using the code above. The idea is to avoid getting flagged as a bot while scraping. How do we do that?

Mimic human behavior as much as possible.

While we cannot guarantee that you will not be blocked, here are some tips and tricks on how to avoid getting blocked by Amazon:

1. Use proxies and rotate them

Let us say we are scraping hundreds of products on Amazon.com from a laptop, which usually has just one IP address. Amazon would know that we are a bot in no time, as no human would ever visit hundreds of product pages in a minute. To look more like a human, make requests to Amazon.com through a pool of IP addresses or proxies. The rule of thumb here is to have one proxy or IP address make no more than 5 requests to Amazon per minute. If you are scraping about 100 pages per minute, you need about 100/5 = 20 proxies. You can read more about rotating proxies here
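
The arithmetic and the rotation described above can be sketched as follows. This is an illustration under stated assumptions, not a complete proxy manager: the function names are hypothetical, and the proxy addresses are placeholders from a documentation-only IP range that you would replace with your own.

```python
import math
from itertools import cycle

MAX_REQUESTS_PER_PROXY_PER_MINUTE = 5  # the rule of thumb from the text


def proxies_needed(pages_per_minute):
    """Number of proxies required to stay within the per-proxy limit."""
    return math.ceil(pages_per_minute / MAX_REQUESTS_PER_PROXY_PER_MINUTE)


def proxy_rotation(proxy_urls):
    """Yield `proxies` dicts for requests.get(), cycling through the pool."""
    for proxy in cycle(proxy_urls):
        yield {"http": proxy, "https": proxy}


# Placeholder addresses; substitute your own proxies
pool = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]
rotation = proxy_rotation(pool)

print(proxies_needed(100))      # 20
print(next(rotation)["https"])  # http://192.0.2.10:8080
```

Inside `scrape()`, you could then pass `proxies=next(rotation)` to `requests.get(url, headers=headers, ...)` so that each request goes out through the next proxy in the pool.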

2. Specify the User Agents of latest browsers and rotate them

If you look at the code above, you will see a line where we set the User-Agent string for the request we are making.

 'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',

Just like proxies, it is always good to have a pool of User-Agent strings. Just make sure you’re using user-agent strings of the latest and most popular browsers, and rotate the strings for each request you make to Amazon. You can learn more about rotating user agent strings in Python here. It is also a good idea to create fixed combinations of (User-Agent, IP Address) so that the traffic looks more like a human than a bot.
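
One way to build such (User-Agent, IP Address) combinations is sketched below. The helper names are hypothetical, the proxy addresses are placeholders, and the two user-agent strings are simply the ones that appear in this tutorial; in practice you would replace them with strings from current browsers.

```python
import random

# Strings taken from this tutorial; replace them with current browser user agents
USER_AGENTS = [
    "Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36",
]


def build_identities(proxy_urls, user_agents, seed=None):
    """Pair each proxy with one user agent so an IP always presents the same browser."""
    rng = random.Random(seed)
    return [(proxy, rng.choice(user_agents)) for proxy in proxy_urls]


def request_kwargs(identity, base_headers):
    """Build keyword arguments for requests.get() from a (proxy, user agent) pair."""
    proxy, user_agent = identity
    headers = {**base_headers, "user-agent": user_agent}
    return {"headers": headers, "proxies": {"http": proxy, "https": proxy}}


identities = build_identities(["http://192.0.2.10:8080", "http://192.0.2.11:8080"], USER_AGENTS)
kwargs = request_kwargs(random.choice(identities), {"accept-language": "en-GB,en-US;q=0.9,en;q=0.8"})
print(kwargs["proxies"]["https"], kwargs["headers"]["user-agent"][:11])
```

The result can be passed straight through as `requests.get(url, **kwargs)`. Pairing is done once up front so that a given IP address does not appear to switch browsers between requests.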

3. Reduce the number of ASINs scraped per minute

You can try slowing down the scrape a bit, to give Amazon less chance of flagging you as a bot. Keeping to about 5 requests per IP per minute is a reasonable limit. If you need to go faster, add more proxies. You can modify the speed by increasing or decreasing the delay in the sleep function (the sleep(5) call is commented out in the code above, so uncomment it to enable the delay).
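
As a rough sketch, the delay can be derived from the target request rate instead of being hard-coded. The function names here are hypothetical, and the random jitter is our own addition (an assumption, not something the original script does) so that requests are not spaced at perfectly regular intervals.

```python
import random


def delay_between_requests(requests_per_minute, proxies=1):
    """Seconds to wait between requests so each proxy stays at its per-minute rate."""
    return 60.0 / (requests_per_minute * proxies)


def jittered(delay, spread=0.5, rng=random):
    """Randomise a delay by +/- `spread` (as a fraction of the delay)."""
    return delay * rng.uniform(1 - spread, 1 + spread)


base = delay_between_requests(5)       # one IP at 5 requests per minute
print(base)                            # 12.0
print(delay_between_requests(5, 20))   # 0.6
```

In the main loop you would then call `sleep(jittered(base))` in place of the commented-out `sleep(5)`.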

4. Retry, Retry, Retry

When you are blocked by Amazon, make sure you retry that request. The code above does not retry: when a page is blocked, scrape() returns None and the URL is simply skipped. A simple improvement is to retry a failed request a fixed number of times (say, up to 20). You could do an even better job by creating a retry queue using a list, and retrying the failed URLs after all the other products have been scraped from Amazon.
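
A retry queue built on a plain list might look like the sketch below. It assumes a `scrape` function that returns None when a page is blocked, as the one in this tutorial does; `scrape_with_retry_queue`, `max_rounds`, and the `flaky_scrape` stand-in used for the demonstration are hypothetical names.

```python
def scrape_with_retry_queue(urls, scrape, max_rounds=3):
    """Scrape every URL once, then retry the failures in later rounds.

    `scrape` is any function that returns extracted data, or None when blocked.
    Returns the collected results and the URLs that still failed at the end.
    """
    results = {}
    pending = list(urls)
    for _ in range(max_rounds):
        if not pending:
            break
        failed = []
        for url in pending:
            data = scrape(url)
            if data is None:
                failed.append(url)  # retry only after the other URLs are done
            else:
                results[url] = data
        pending = failed
    return results, pending


# Stand-in for the real scrape(): blocks "url-b" on its first attempt only
attempts = {}


def flaky_scrape(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url == "url-b" and attempts[url] == 1:
        return None
    return {"url": url}


results, still_failed = scrape_with_retry_queue(["url-a", "url-b"], flaky_scrape)
print(sorted(results), still_failed)  # ['url-a', 'url-b'] []
```

Deferring retries to a later round gives a blocked proxy or IP address some time to cool off before the same URL is requested again.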

How to scrape Amazon product data on a large scale

This Amazon scraper should work for small-scale scraping and hobby projects. It can get you started on your road to building bigger and better scrapers. However, if you do want to scrape Amazon for thousands of pages at short intervals, here are some important things to keep in mind:

1. Use a Web Scraping Framework like PySpider or Scrapy

When you’re crawling a massive site like Amazon.com, you need to spend some time figuring out how to run your entire crawl smoothly. Choose an open-source framework for building your scraper, like Scrapy or PySpider, which are both written in Python. These frameworks have pretty active communities and can handle a lot of the errors that happen while scraping without disturbing the entire scraper. Most of them also let you run many requests concurrently to speed up scraping, even if you are using a single computer. You can deploy Scrapy to your own servers using ScrapyD.

2. If you need speed, Distribute and Scale-Up using a Cloud Provider

There is a limit to the number of pages you can scrape from Amazon when using a single computer. If you’re scraping Amazon on a large scale, you need a lot of servers to get data within a reasonable time. You could consider hosting your scraper in the cloud and using a scalable version of the framework, like Scrapy Redis. For broader crawls, use message brokers like Redis, RabbitMQ, or Kafka to run multiple spider instances and speed up crawls.

3. Use a scheduler if you need to run the scraper periodically

If you are using a scraper to get updated prices of products, you need to refresh your data frequently to keep track of the changes. Use CRON (in UNIX) or Task Scheduler in Windows to schedule the crawler, if you are using the script in this tutorial. If you are using Scrapy, scrapyd+cron can help schedule your spiders so you can refresh the data on a regular interval.

4. Use a database to store the Scraped Data from Amazon

If you are scraping a large number of products from Amazon, writing data to a file soon becomes inconvenient. Retrieving data becomes tough, and you might even end up with gibberish inside the file when multiple processes write to a single file. Use a database even if you are scraping from a single computer. MySQL will be just fine for moderate workloads, and you can run simple analytics on the scraped data with tools like Tableau, Power BI, or Metabase by connecting them to your database. For larger write loads you can look into some of the NoSQL databases like MongoDB, Cassandra, etc.
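
As a small illustration of storing scraped records in a database, the sketch below uses SQLite through Python's built-in sqlite3 module. This is a swap-in for the MySQL suggested above, chosen only so the example runs without a database server. The table layout, column names, and the `save_products` helper are hypothetical, and the product name is shortened from the example output.

```python
import json
import sqlite3


def save_products(conn, scraped):
    """Store (url, data) pairs, keeping one row per product URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        " url TEXT PRIMARY KEY,"
        " name TEXT,"
        " price TEXT,"
        " raw_json TEXT)"
    )
    rows = [
        (url, data.get("name"), data.get("price"), json.dumps(data))
        for url, data in scraped
    ]
    # INSERT OR REPLACE means a re-scrape of the same URL updates its row
    conn.executemany("INSERT OR REPLACE INTO products VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path to persist the data
url = "https://www.amazon.com/HP-Computer-Quard-Core-Bluetooth-Accessories/dp/B085383P7M/"
save_products(conn, [(url, {"name": "2020 HP 15.6\" Laptop Computer", "price": "$959.00"})])
print(conn.execute("SELECT price FROM products").fetchall())  # [('$959.00',)]
```

Keeping the full record in `raw_json` alongside a few indexed columns lets you add columns later without re-scraping.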

5. Use Request Headers, Proxies, and IP Rotation to prevent getting Captchas from Amazon

Amazon has a lot of anti-scraping measures. If you send Amazon too many requests too quickly, they will block you in no time and you’ll start seeing captchas instead of product pages. To prevent that, while going through each Amazon product page, it’s better to change headers by replacing your User-Agent value. This makes requests look like they’re coming from a browser and not a script.
To crawl Amazon on a very large scale, use proxies and IP rotation to reduce the number of captchas you get. You can learn more techniques to prevent getting blocked by Amazon and other sites here: How to prevent getting blacklisted while scraping. You can also use Python to solve some basic captchas using an OCR engine called Tesseract.

6. Write some simple data quality tests

Scraped data is always messy. An XPath that works for one page might not work for another variation of the same page on the same site. Amazon has LOTS of product page layouts. If you spend an hour writing basic sanity checks for your data, like verifying that the price is a decimal, you’ll know when your scraper breaks and you’ll also be able to minimize its impact. Incorporating data quality checks into your code is especially helpful if you are scraping Amazon data for price monitoring, seller monitoring, stock monitoring, etc.
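
A sanity check of the kind described above could be as simple as the following sketch. It assumes US-style prices such as "$959.00" and records shaped like this tutorial's output; `parse_price` and `quality_problems` are hypothetical helper names, and other marketplaces or currencies would need a different pattern.

```python
import re
from decimal import Decimal

# Optional "$", digits with optional thousands separators, optional cents
PRICE_PATTERN = re.compile(r"\s*\$?\s*(\d[\d,]*(?:\.\d{1,2})?)\s*")


def parse_price(text):
    """Return the price as a Decimal, or None if the text is not a price."""
    if not text:
        return None
    match = PRICE_PATTERN.fullmatch(text)
    if not match:
        return None
    return Decimal(match.group(1).replace(",", ""))


def quality_problems(record):
    """List the sanity checks that a scraped record fails."""
    problems = []
    if not record.get("name"):
        problems.append("missing name")
    if parse_price(record.get("price")) is None:
        problems.append("price is not a decimal")
    return problems


print(quality_problems({"name": "HP Laptop", "price": "$959.00"}))  # []
print(quality_problems({"name": None, "price": "N/A"}))
```

Running such checks on every record, and logging the failures, gives an early signal when a selector stops matching a page layout.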

We hope this tutorial gave you a better idea on how to scrape Amazon or similar e-commerce websites. As a company, we understand e-commerce data having worked with it before. If you are interested in professional help with scraping complex websites, let us know, and we will be glad to help.

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

Responses

shree May 14, 2019

How to scrape the feedback from consumer?
Thanks in advance

Bharat Bhushan June 25, 2019

@ ScrapeHero
Can you please give some idea like how to crawl data from amazon for a specific city ?

Tiana August 15, 2019

I am getting this errors:

Amazon_Scraper.py”, line 72, in
ReadAsin()
Amazon_Scraper.py”, line 67, in ReadAsin
f=open(‘data.json’,’w’)
PermissionError: [Errno 13] Permission denied: ‘data.json’

    ScrapeHero August 16, 2019

    Looks like the output file cannot be written due to lack of permissions.
    Please google for such generic python errors.

    Kashif March 8, 2020

    Hello.

    I want to be able to do the following with python.

    Initiate a search for any category of products using following parameters:

    No. Reviews
    Average review rating
    Average monthly sales
    Average monthly revenues

    Based on the above parameters, I want python to give me products who fall on the above criteria.

    Please tell me if it’s possible?

    If it’s possible, my next question would be how would we use python to access monthly sales and monthly revenue for a particular product?

    Please looking forward to your reply.

jan August 19, 2019

Is there any way to scrape the Asin automatically? I mean, I want to scrapy over 1000+ products and I don’t want to make a list with that much Asin numbers.

Terry April 2, 2020

So how would one scrape an ecommerce site of their sale/clearance items automatically on a weekly basis and compare to Amazon’s prices?
