How To Scrape Amazon Product Data and Prices using Python 3

Web scraping automates data extraction from websites. In this tutorial, we will build an Amazon scraper that extracts product details and pricing. We will write this simple web scraper using Python and SelectorLib and run it from a console.

Here is how you can scrape product details from an Amazon product page:

  1. Mark up the data fields to be scraped using Selectorlib
  2. Copy and run the code provided

Below, we also cover how to scrape product details from the Amazon search results page, how to avoid getting blocked by Amazon, and how to scrape Amazon on a large scale.

Alternatively, you can use the Amazon Product Detail Crawler on ScrapeHero Cloud to scrape Amazon easily without having to code.

Setting up your computer for web scraper development

We will use Python 3 for this Amazon scraper; the code will not run if you are using Python 2.7. To start, you need a computer with Python 3 and pip installed.

If you are on Windows, follow this guide to set up your computer and install packages:

How To Install Python Packages for Web Scraping in Windows 10

Install Packages

  • Python Requests, to make requests and download the HTML content of the Amazon product pages
  • SelectorLib, a Python package to extract data from the downloaded web pages using the YAML template we create

Install both using pip3:

pip3 install requests selectorlib

Scrape product details from the Amazon Product Page

The Amazon product page scraper will scrape the following details from a product page:

  1. Product Name
  2. Price
  3. Short Description
  4. Full Product Description
  5. Image URLs
  6. Rating
  7. Number of Reviews
  8. Variant ASINs
  9. Sales Rank
  10. Link to all Reviews Page

Mark up the data fields using Selectorlib

We have already marked up the data, so you can just skip this step if you want to get right to the data.

Here is what our template looks like. See the file here.

Let’s save this as a file called selectors.yml in the same directory as our code.

name:
    css: '#productTitle'
    type: Text
price:
    css: '#price_inside_buybox'
    type: Text
short_description:
    css: '#featurebullets_feature_div'
    type: Text
images:
    css: '.imgTagWrapper img'
    type: Attribute
    attribute: data-a-dynamic-image
rating:
    css: span.arp-rating-out-of-text
    type: Text
number_of_reviews:
    css: 'a.a-link-normal h2'
    type: Text
variants:
    css: 'form.a-section li'
    multiple: true
    type: Text
    children:
        name:
            css: ""
            type: Attribute
            attribute: title
        asin:
            css: ""
            type: Attribute
            attribute: data-defaultasin
product_description:
    css: '#productDescription'
    type: Text
sales_rank:
    css: 'li#SalesRank'
    type: Text
link_to_all_reviews:
    css: 'div.card-padding a.a-link-emphasis'
    type: Link

 

Here is a preview of the markup:

[Image: Selectorlib Template for Amazon.com]

Selectorlib is a combination of tools for developers that makes marking up and extracting data from web pages easy. The Selectorlib Chrome Extension lets you mark the data you need to extract, creates the CSS selectors or XPaths needed to extract that data, and then previews how the extracted data will look.

You can learn more about Selectorlib and how to use it to mark up data here.
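To see how a template drives extraction, here is a minimal, hypothetical sketch that applies a one-field version of the template above to a made-up HTML fragment (this assumes selectorlib's from_yaml_string constructor):

from selectorlib import Extractor

# Build an extractor from a YAML template string
# (a one-field version of the selectors.yml template above)
template = """
name:
    css: '#productTitle'
    type: Text
"""
e = Extractor.from_yaml_string(template)

# A made-up HTML fragment standing in for a downloaded page
html = '<span id="productTitle">Example Product</span>'
print(e.extract(html))  # {'name': 'Example Product'}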


The Code

Create a folder called amazon-scraper and save your Selectorlib YAML template in it as selectors.yml.

Let’s create a file called amazon.py and paste the code below into it. Here is what it does:

  1. Read a list of Amazon Product URLs from a file called urls.txt
  2. Scrape the data
  3. Save the data as a JSON Lines file

from selectorlib import Extractor
import requests 
import json 
from time import sleep


# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('selectors.yml')

def scrape(url):    
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s"%url)
    r = requests.get(url, headers=headers)
    # Simple check to see if the page was blocked (usually returns a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n"%url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d"%(url,r.status_code))
        return None
    # Pass the HTML of the page to the extractor and return the extracted data
    return e.extract(r.text)

with open("urls.txt",'r') as urllist, open('output.jsonl','w') as outfile:
    for url in urllist.readlines():
        data = scrape(url) 
        if data:
            json.dump(data,outfile)
            outfile.write("\n")
            # sleep(5)
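The script reads product URLs from a file called urls.txt in the same folder, one URL per line. For example, using the product page scraped below:

https://www.amazon.com/HP-Computer-Quard-Core-Bluetooth-Accessories/dp/B085383P7M/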

Running the Amazon Product Page Scraper

You can get the full code from GitHub: https://github.com/scrapehero-code/amazon-scraper

You can start your scraper by typing the command:

python3 amazon.py

Once the scrape is complete, you should see a file called output.jsonl with your data. Here is an example output for the URL

https://www.amazon.com/HP-Computer-Quard-Core-Bluetooth-Accessories/dp/B085383P7M/

{
  "name": "2020 HP 15.6\" Laptop Computer, 10th Gen Intel Quard-Core i7 1065G7 up to 3.9GHz, 16GB DDR4 RAM, 512GB PCIe SSD, 802.11ac WiFi, Bluetooth 4.2, Silver, Windows 10, YZAKKA USB External DVD + Accessories",
  "price": "$959.00",
  "short_description": "Powered by latest 10th Gen Intel Core i7-1065G7 Processor @ 1.30GHz (4 Cores, 8M Cache, up to 3.90 GHz); Ultra-low-voltage platform. Quad-core, eight-way processing provides maximum high-efficiency power to go.\n15.6\" diagonal HD SVA BrightView micro-edge WLED-backlit, 220 nits, 45% NTSC (1366 x 768) Display; Intel Iris Plus Graphics\n16GB 2666MHz DDR4 Memory for full-power multitasking; 512GB Solid State Drive (PCI-e), Save files fast and store more data. With massive amounts of storage and advanced communication power, PCI-e SSDs are great for major gaming applications, multiple servers, daily backups, and more.\nRealtek RTL8821CE 802.11b/g/n/ac (1x1) Wi-Fi and Bluetooth 4.2 Combo; 1 USB 3.1 Gen 1 Type-C (Data Transfer Only, 5 Gb/s signaling rate); 2 USB 3.1 Gen 1 Type-A (Data Transfer Only); 1 AC smart pin; 1 HDMI 1.4b; 1 headphone/microphone combo\nWindows 10 Home, 64-bit, English; Natural silver; YZAKKA USB External DVD drive + USB extension cord 6ft, HDMI cable 6ft and Mouse Pad\n› See more product details",
  "images": "{\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX425_.jpg\":[425,425],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX466_.jpg\":[466,466],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY355_.jpg\":[355,355],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX569_.jpg\":[569,569],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY450_.jpg\":[450,450],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX679_.jpg\":[679,679],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX522_.jpg\":[522,522]}",
  "variants": [
    {
      "name": "Click to select 4GB DDR4 RAM, 128GB PCIe SSD",
      "asin": "B01MCZ4LH1"
    },
    {
      "name": "Click to select 8GB DDR4 RAM, 256GB PCIe SSD",
      "asin": "B08537NR9D"
    },
    {
      "name": "Click to select 12GB DDR4 RAM, 512GB PCIe SSD",
      "asin": "B08537ZDYH"
    },
    {
      "name": "Click to select 16GB DDR4 RAM, 512GB PCIe SSD",
      "asin": "B085383P7M"
    },
    {
      "name": "Click to select 20GB DDR4 RAM, 1TB PCIe SSD",
      "asin": "B08537NDVZ"
    }
  ],
  "product_description": "Capacity:16GB DDR4 RAM, 512GB PCIe SSD\n\nProcessor\n\n  Intel Core i7-1065G7 (1.3 GHz base frequency, up to 3.9 GHz with Intel Turbo Boost Technology, 8 MB cache, 4 cores)\n\nChipset\n\n  Intel Integrated SoC\n\nMemory\n\n  16GB DDR4-2666 SDRAM\n\nVideo graphics\n\n  Intel Iris Plus Graphics\n\nHard drive\n\n  512GB PCIe NVMe M.2 SSD\n\nDisplay\n\n  15.6\" diagonal HD SVA BrightView micro-edge WLED-backlit, 220 nits, 45% NTSC (1366 x 768)\n\nWireless connectivity\n\n  Realtek RTL8821CE 802.11b/g/n/ac (1x1) Wi-Fi and Bluetooth 4.2 Combo\n\nExpansion slots\n\n  1 multi-format SD media card reader\n\nExternal ports\n\n  1 USB 3.1 Gen 1 Type-C (Data Transfer Only, 5 Gb/s signaling rate); 2 USB 3.1 Gen 1 Type-A (Data Transfer Only); 1 AC smart pin; 1 HDMI 1.4b; 1 headphone/microphone combo\n\nMinimum dimensions (W x D x H)\n\n  9.53 x 14.11 x 0.70 in\n\nWeight\n\n  3.75 lbs\n\nPower supply type\n\n  45 W Smart AC power adapter\n\nBattery type\n\n  3-cell, 41 Wh Li-ion\n\nBattery life mixed usage\n\n  Up to 11 hours and 30 minutes\n\n  Video Playback Battery life\n\n  Up to 10 hours\n\nWebcam\n\n  HP TrueVision HD Camera with integrated dual array digital microphone\n\nAudio features\n\n  Dual speakers\n\nOperating system\n\n  Windows 10 Home 64\n\nAccessories\n\n  YZAKKA USB External DVD drive + USB extension cord 6ft, HDMI cable 6ft and Mouse Pad",
  "link_to_all_reviews": "https://www.amazon.com/HP-Computer-Quard-Core-Bluetooth-Accessories/product-reviews/B085383P7M/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
}

Scrape products from the Amazon Search Results Page

The Amazon search results page scraper will scrape the following details from a search results page:

  1. Product Name
  2. Price
  3. URL
  4. Rating
  5. Number of Reviews

The steps and code for scraping search results are very similar to those for the product page scraper.

Mark up the data fields using Selectorlib

Here is our Selectorlib YAML file. Let’s call it search_results.yml.

products:
    css: 'div[data-component-type="s-search-result"]'
    xpath: null
    multiple: true
    type: Text
    children:
        title:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Text
        url:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Link
        rating:
            css: 'div.a-row.a-size-small span:nth-of-type(1)'
            xpath: null
            type: Attribute
            attribute: aria-label
        reviews:
            css: 'div.a-row.a-size-small span:nth-of-type(2)'
            xpath: null
            type: Attribute
            attribute: aria-label
        price:
            css: 'span.a-price:nth-of-type(1) span.a-offscreen'
            xpath: null
            type: Text


The Code

The code is almost identical to the previous scraper, except that we iterate through each product and save it as a separate line.

Let’s create a file called searchresults.py and paste the code below into it. Here is what the code does:

  1. Open a file called search_results_urls.txt and read search result page URLs
  2. Scrape the data
  3. Save to a JSON Lines file called search_results_output.jsonl

 

from selectorlib import Extractor
import requests 
import json 
from time import sleep


# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('search_results.yml')

def scrape(url):  

    headers = {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s"%url)
    r = requests.get(url, headers=headers)
    # Simple check to see if the page was blocked (usually returns a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n"%url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d"%(url,r.status_code))
        return None
    # Pass the HTML of the page to the extractor and return the extracted data
    return e.extract(r.text)

with open("search_results_urls.txt",'r') as urllist, open('search_results_output.jsonl','w') as outfile:
    for url in urllist.read().splitlines():
        data = scrape(url) 
        if data:
            for product in data['products']:
                product['search_url'] = url
                print("Saving Product: %s"%product['title'])
                json.dump(product,outfile)
                outfile.write("\n")
                # sleep(5)
    

 

Running the Search Result Scraper

You can start your scraper by typing the command:

python3 searchresults.py

Once the scrape is complete, you should see a file called search_results_output.jsonl with your data.

Here is an example for the URL https://www.amazon.com/s?k=laptops. You can see sample output here:

https://github.com/scrapehero-code/amazon-scraper/blob/master/search_results_output.jsonl

What to do if you get blocked while scraping Amazon

We are adding this extra section to cover some methods you can use to avoid getting blocked while scraping Amazon. Amazon is very likely to flag you as a bot if you start scraping hundreds of pages using the code above. The idea is to avoid being flagged as a bot. How do we do that?

Mimic human behavior as much as possible.

While we cannot guarantee that you will never be blocked, here are some tips and tricks to avoid getting blocked by Amazon.

Use proxies and rotate them

Let us say we are scraping hundreds of products on Amazon.com from a laptop, which usually has just one IP address. Amazon would know we are a bot in no time, as no human would ever visit hundreds of product pages in a minute. To look more like a human, make requests to Amazon.com through a pool of IP addresses or proxies. The rule of thumb here is to have one proxy or IP address make no more than 5 requests to Amazon per minute. If you are scraping about 100 pages per minute, you need about 100/5 = 20 proxies. You can read more about rotating proxies here.
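Here is a minimal sketch of what proxy rotation could look like with requests. The proxy URLs below are placeholders; substitute your own pool:

from itertools import cycle
import requests

# Hypothetical proxy pool -- replace with your own working proxies
PROXIES = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
    'http://proxy3.example.com:8000',
]
proxy_pool = cycle(PROXIES)

def get_with_proxy(url, headers=None):
    # Round-robin through the pool so no single IP
    # makes too many requests per minute
    proxy = next(proxy_pool)
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy})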

Specify the User Agents of latest browsers and rotate them

If you look at the code above, you will see a line where we set the User-Agent string for the request we are making.

 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'

Just like proxies, it is always good to have a pool of user agent strings. Make sure you use user agent strings of the latest and most popular browsers, and rotate the strings for each request you make to Amazon. You can learn more about rotating user agent strings in Python here. It is also a good idea to pair each user agent with a particular IP address, so that the combination looks more like a human than a bot.
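As a rough sketch, rotating user agents could look like this; the strings below are illustrative and should be refreshed with current browser versions:

import random
import requests

# Illustrative pool of user agent strings -- keep these up to date
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]

def get_with_random_ua(url):
    # Pick a different user agent for each request
    headers = {'user-agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)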

Reduce the number of ASINs scraped per minute

You can try slowing down the scrape a bit to give Amazon less chance of flagging you as a bot. But roughly 5 requests per IP per minute isn’t much throttling; if you need to go faster, add more proxies. You can adjust the speed by increasing or decreasing the delay passed to the sleep call (commented out in the code above).
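For example, uncommenting the sleep call and randomizing the delay makes the request timing look less mechanical (the bounds here are arbitrary; tune them to your proxy pool):

import random
from time import sleep

# Wait a random 3-8 seconds between requests instead of a fixed delay
sleep(random.uniform(3, 8))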

Retry, Retry, Retry

When you are blocked by Amazon, make sure you retry that request. The code above simply skips a page when it is blocked; you could retry the request a few times, or do an even better job by creating a retry queue using a list and retrying those URLs after all the other products have been scraped.
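A simple retry wrapper around the scrape() function defined earlier might look like this (the retry count and backoff are arbitrary starting points):

from time import sleep

def scrape_with_retries(url, max_retries=3):
    # Retry a blocked page a few times, backing off a little
    # longer after each failed attempt
    for attempt in range(max_retries):
        data = scrape(url)
        if data:
            return data
        sleep(5 * (attempt + 1))
    return None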


How to scrape Amazon product data on a large scale

This Amazon scraper should work for small-scale scraping and hobby projects, and can get you started on your road to building bigger and better scrapers. However, if you want to scrape Amazon for thousands of pages at short intervals, here are some important things to keep in mind:

Use a Web Scraping Framework like PySpider or Scrapy

When you’re crawling a massive site like Amazon.com, you need to spend some time figuring out how to run your entire crawl smoothly. Choose an open-source framework for building your scraper, such as Scrapy or PySpider, both written in Python. These frameworks have fairly active communities and can handle a lot of the errors that happen while scraping without disturbing the entire scraper. Most of them also let you use multiple threads to speed up scraping if you are using a single computer. You can deploy Scrapy to your own servers using ScrapyD.
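To give a feel for the framework, here is a skeletal Scrapy spider for the search results page, reusing the CSS selectors from our Selectorlib template. It is a sketch only; a real spider would still need the proxy, header, and retry handling discussed above:

import scrapy

class AmazonSearchSpider(scrapy.Spider):
    # Scrapy takes care of the scheduling, concurrency and retries
    # that our plain-requests script handled by hand
    name = 'amazon_search'
    start_urls = ['https://www.amazon.com/s?k=laptops']

    def parse(self, response):
        for product in response.css('div[data-component-type="s-search-result"]'):
            yield {
                'title': product.css('h2 a.a-link-normal.a-text-normal::text').get(),
                'url': product.css('h2 a.a-link-normal.a-text-normal::attr(href)').get(),
            }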

If you need speed, Distribute and Scale-Up using a Cloud Provider

There is a limit to the number of pages you can scrape from Amazon using a single computer. If you’re scraping Amazon on a large scale, you need a lot of servers to get the data within a reasonable time. Consider hosting your scraper in the cloud and using a scalable version of the framework, such as Scrapy Redis. For broader crawls, use message brokers like Redis, RabbitMQ, or Kafka to run multiple spider instances and speed up your crawls.

Use a scheduler if you need to run the scraper periodically

If you are using a scraper to get updated prices of products, you need to refresh your data frequently to keep track of the changes. If you are using the script in this tutorial, use cron (on UNIX) or Task Scheduler (on Windows) to schedule it. If you are using Scrapy, scrapyd plus cron can schedule your spiders so you can refresh the data on a regular interval.
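For example, a crontab entry like the following (the path is a placeholder) would run the product page scraper every six hours:

0 */6 * * * cd /path/to/amazon-scraper && python3 amazon.py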

Use a database to store the Scraped Data from Amazon

If you are scraping a large number of products from Amazon, writing the data to a file soon becomes inconvenient: retrieving the data becomes tough, and you might even end up with gibberish in the file when multiple processes write to it at once. Use a database even if you are scraping from a single computer. MySQL is fine for moderate workloads, and you can run simple analytics on the scraped data with tools like Tableau, Power BI, or Metabase by connecting them to your database. For larger write loads you can look at NoSQL databases like MongoDB, Cassandra, etc.
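As a minimal sketch, here is how the product scraper's output could go into SQLite from the Python standard library instead of a JSON Lines file (the table layout is illustrative; swap in MySQL or a NoSQL store for heavier workloads):

import sqlite3

conn = sqlite3.connect('amazon_products.db')
conn.execute("""CREATE TABLE IF NOT EXISTS products
                (name TEXT, price TEXT, url TEXT)""")

def save_product(data, url):
    # Store the fields extracted by the scrape() function
    conn.execute('INSERT INTO products VALUES (?, ?, ?)',
                 (data.get('name'), data.get('price'), url))
    conn.commit()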

Use Request Headers, Proxies, and IP Rotation to prevent getting Captchas from Amazon

Amazon has a lot of anti-scraping measures. If you hit Amazon too hard, it will block you in no time and you’ll start seeing captchas instead of product pages. To prevent that, while going through each Amazon product page, change your headers by rotating the User-Agent value. This makes requests look like they’re coming from a browser and not a script.
To crawl Amazon at a very large scale, use proxies and IP rotation to reduce the number of captchas you get. You can learn more techniques to prevent getting blocked by Amazon and other sites here: How to prevent getting blacklisted while scraping. You can also use Python to solve some basic captchas using an OCR engine called Tesseract.

Write some simple data quality tests

Scraped data is always messy. An XPath that works on one page might not work for a variation of the same page on the same site, and Amazon has lots of product page layouts. If you spend an hour writing basic sanity checks for your data, like verifying that the price is a decimal, you’ll know when your scraper breaks and be able to minimize its impact. Incorporating data quality checks into your code is especially helpful if you are scraping Amazon data for price monitoring, seller monitoring, stock monitoring, etc.
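For instance, a basic sanity check on the price field could be as small as this (the regular expression assumes US-style prices like the $959.00 in the sample output above):

import re

def looks_like_price(value):
    # Accept strings like "$959.00"; flag anything else for review
    return bool(value) and re.fullmatch(r'\$\d[\d,]*\.\d{2}', value.strip()) is not None

assert looks_like_price('$959.00')
assert not looks_like_price('Currently unavailable')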

We hope this tutorial gave you a better idea of how to scrape Amazon or similar e-commerce websites. As a company, we understand e-commerce data, having worked with it extensively. If you are interested in professional help with scraping complex websites, let us know, and we will be glad to help.

How to use Amazon Product Data

  1. Monitor Amazon products for changes in Price, Stock Count/Availability, Rating, etc.
    By using a web scraper, you can update your data feeds on a timely basis to monitor any product changes. These data feeds can help you form pricing strategies by looking at your competition – other sellers or brands.
  2. Scrape Amazon Product Details that you can’t get with the Product Advertising API
    Amazon provides a Product Advertising API, but like most other APIs, it doesn’t provide all the information that Amazon shows on a product page. A web scraper can help you extract all the details displayed on the product page.
  3. Analyze how a particular Brand sells on Amazon
    If you’re a retailer, you can monitor your competitor’s products and see how well they do in the market and make adjustments to reprice and sell your products. You could also use it to monitor your distribution channel to identify how your products are sold on Amazon by sellers, and if it is causing you any harm.
  4. Find Customer Opinions from Amazon Product Reviews
    Reviews offer abundant amounts of information. If you’re targeting an established set of sellers who have been selling reasonable volumes, you can extract the reviews of their products to find what you should avoid and what you could quickly improve on while trying to sell similar products on Amazon.




