In this tutorial, we will build an Amazon scraper for extracting product details and pricing. We will build this simple web scraper using Python and LXML and run it from a console. But before we start, let’s look at what you can use it for.
What can you use an Amazon Scraper for?
- Scrape Product Details that you can’t get with the Product Advertising API
Amazon provides a Product Advertising API, but like most APIs, it doesn’t expose all the information that Amazon shows on a product page. A scraper can help you extract every detail displayed on the product page.
- Monitor products for change in Price, Stock Count/Availability, Rating, etc.
By using a web scraper, you can update your data feeds on a timely basis to monitor any product changes. These data feeds can help you form pricing strategies by looking at your competition: other sellers or brands.
- Analyze how a particular Brand sells on Amazon
If you’re a retailer, you can monitor your competitors’ products, see how well they do in the market, and make adjustments to reprice and sell your own products. You could also use it to monitor your distribution channel: to identify how sellers offer your products on Amazon, and whether that is causing you any harm.
- Find Customer Opinions from Amazon Product Reviews
Reviews offer a wealth of information. If you’re targeting an established set of sellers who have been selling reasonable volumes, you can extract the reviews of their products to find out what you should avoid and what you could quickly improve on while trying to sell similar products on Amazon.
Or anything else: the possibilities are endless and bound only by your imagination.
What data are we extracting from Amazon?
This tutorial is limited to extracting the data points below from a product page:
- Product Name
- Original Price
- Sale Price
We’ll build a scraper in Python that can go to any Amazon product page using an ASIN – a unique ID Amazon uses to keep track of products in its database.
First, let’s identify a product ASIN.
For example, in this product – Imploding Kittens ( https://www.amazon.com/Imploding-Kittens-First-Expansion-Exploding/dp/B01HSIIFQ2/ ), the ASIN is B01HSIIFQ2.
Gather the ASINs for the products you need data from.
The next step is to build a script that goes to each of those product pages, downloads the HTML, and extracts the fields you need, e.g., Product Title, Price, Description, etc.
XPaths tell the script where each field we need is located in the HTML. XPath is one of the few languages that lets you select specific content from a big blob of XML or HTML (properly structured HTML is organized much like an XML document). An XPath tells you the location of an element, just like a catalog card does for a book. We’ll find XPaths for each of the fields we need and put them into our scraper; the short sketch below shows the idea.
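To make that concrete, here is a minimal sketch of evaluating XPaths with LXML against a toy HTML snippet. The markup and the XPaths here are invented for the example; the XPaths used against real Amazon pages appear in the full script further down.

from lxml import html

# A toy product snippet; real Amazon markup is far more complex.
snippet = """
<div id="product">
  <h1 id="title">Imploding Kittens</h1>
  <span id="priceblock_ourprice">$14.99</span>
</div>
"""
doc = html.fromstring(snippet)
# Each XPath addresses an element by its id, like a catalog card for a book.
title = doc.xpath('//h1[@id="title"]/text()')[0].strip()
price = doc.xpath('//span[@id="priceblock_ourprice"]/text()')[0].strip()
print(title + " costs " + price)  # Imploding Kittens costs $14.99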
Once we extract this information, we’ll save it into a JSON file.
Since we already have the list of products, let’s get started.
What tools do we need?
For this tutorial, we will stick to using Python and a couple of Python packages for downloading and parsing the HTML. Below are the package requirements:
- Python 2.7 available here (https://www.python.org/downloads/ )
- Python PIP to install the following packages in Python (https://pip.pypa.io/en/stable/installing/)
- Python Requests available here ( http://docs.python-requests.org/en/master/user/install/). Requests allows you to send HTTP requests without having to add query strings to your URLs manually. It’s an easy-to-use library with a lot of features, ranging from passing parameters in URLs to sending custom headers and SSL verification.
- Python LXML (Learn how to install that here – http://lxml.de/installation.html)
If you have PIP, installing Requests and LXML is as easy as running the line below in a terminal:
pip install requests lxml
The Amazon Scraper
You can download the full code directly from here.
Modify the code shown below with a list of your own ASINs.
import json
import requests
from lxml import html

# Change the list below with the ASINs you want to track.
AsinList = ['B0046UR4F4']
extracted_data = []
for i in AsinList:
    url = "http://www.amazon.com/dp/" + i
    # A browser-like User-Agent makes Amazon less likely to block the request.
    page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    doc = html.fromstring(page.content)
    # These XPaths match common Amazon page layouts and may need updating over time.
    extracted_data.append({
        'NAME': ' '.join(''.join(doc.xpath('//h1[@id="title"]//text()')).split()),
        'SALE_PRICE': ''.join(doc.xpath('//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()')).strip(),
        'ORIGINAL_PRICE': ''.join(doc.xpath('//td[contains(text(),"List Price")]/following-sibling::td/text()')).strip(),
        'CATEGORY': ' > '.join(x.strip() for x in doc.xpath('//a[contains(@class,"a-link-normal a-color-tertiary")]/text()')),
        'AVAILABILITY': ' '.join(''.join(doc.xpath('//div[@id="availability"]//text()')).split()),
        'URL': url,
    })

# Save the collected data into a json file.
with open('data.json', 'w') as f:
    json.dump(extracted_data, f, indent=4)
Assuming the script is named amazon_scraper.py, run it from a command prompt or terminal like this:
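python amazon_scraper.py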
This will create a JSON output file called data.json with the data collected for the list of ASINs present in the AsinList.
The JSON output for a couple of ASINs will look similar to this:
"CATEGORY": "Electronics > Computers & Accessories > Data Storage > External Hard Drives",
"NAME": "G-Technology G-SPEED eS PRO High-Performance Fail-Safe RAID Solution for HD/2K Production 8TB (0G01873)",
"AVAILABILITY": "Only 1 left in stock."
"CATEGORY": "Electronics > Computers & Accessories > Data Storage > USB Flash Drives",
"NAME": "G-Technology G-RAID USB Removable Dual Drive Storage System 8TB (0G04069)",
"AVAILABILITY": "Only 2 left in stock."
You can also extract reviews from product pages. Head over to this new blog post to learn how.
6 things to keep in mind when scraping Amazon on a larger scale
Large websites usually place limits on how much you can browse; Amazon, for instance, only lets you go through 400 pages per category. The approach above works for small-scale scraping and hobby projects and will get you started on your road to building bigger and better scrapers. However, if you do want to scrape Amazon for thousands of pages at short intervals, there are some important things you should be aware of:
1. Use a Web Scraping Framework like PySpider or Scrapy
When you’re crawling a massive site like Amazon.com, you need to spend some time figuring out how to run your entire crawl smoothly. Choose an open-source framework for building your scraper, like Scrapy or PySpider, both of which are written in Python. These frameworks have active communities and can handle many of the errors that happen while scraping without disturbing the entire scraper. Most of them also let you use multiple threads to speed up scraping if you are using a single computer. Scrapy can be deployed to your own servers using ScrapyD. A minimal spider sketch follows below.
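For illustration, here is a minimal sketch of what the same ASIN crawl might look like as a Scrapy spider. The spider name, the ASIN URL, and the XPaths are placeholders for this example, not a drop-in replacement for the standalone script above.

import scrapy

class AmazonProductSpider(scrapy.Spider):
    # The name and start URL below are placeholders for this sketch.
    name = "amazon_products"
    start_urls = ["http://www.amazon.com/dp/B0046UR4F4"]

    def parse(self, response):
        # Scrapy exposes XPath selectors directly on the response object.
        yield {
            "NAME": " ".join("".join(response.xpath('//h1[@id="title"]//text()').getall()).split()),
            "SALE_PRICE": "".join(response.xpath('//span[contains(@id,"ourprice")]/text()').getall()).strip(),
            "URL": response.url,
        }

Saved as, say, amazon_spider.py, it can be run with scrapy runspider amazon_spider.py -o data.json, and Scrapy’s built-in retry middleware handles transient errors for you.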
2. If you need speed, Distribute and Scale Up using a Cloud Provider
There is a limit to the number of pages you can scrape with a single computer. If you are going to scrape Amazon on a large scale (millions of product pages a day), you need a lot of servers to get the data within a reasonable time. You could consider hosting your scraper in the cloud and using a scalable variant of the framework, like Scrapy-Redis. For broader crawls, you can use a message broker like Redis, RabbitMQ, or Kafka, so that you can run multiple spider instances to speed up the crawl, as sketched below.
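As a rough sketch of that setup, switching a Scrapy project to scrapy-redis is mostly a settings change; the Redis URL below is a placeholder for your own server, and this assumes the scrapy-redis package is installed.

# settings.py (sketch): Redis holds a shared request queue, so multiple
# spider processes on different machines pull from the same crawl frontier.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue alive between runs
REDIS_URL = "redis://localhost:6379"  # placeholder; point this at your Redis server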
3. Use a scheduler if you need to run the scraper periodically
If you are using a scraper to get updated prices or stock counts of products, you need to refresh your data frequently to keep track of the changes. To schedule the script in this tutorial, use cron (on UNIX) or Task Scheduler (on Windows); a sample cron entry is shown below. If you are using Scrapy, scrapyd plus cron can schedule your spiders so you can refresh the data promptly.
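For example, a crontab entry like this one (the project path is hypothetical) runs the tutorial script every six hours on a UNIX system:

# At minute 0 of every 6th hour, run the scraper; adjust the path to your setup.
0 */6 * * * cd /path/to/project && python amazon_scraper.py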
4. Use a database to store the Scraped Data from Amazon
If you are scraping a large number of products from Amazon, writing data to a file soon becomes inconvenient. Retrieving data becomes difficult, and you might even end up with gibberish in the file when multiple processes write to it at once. Using a database is recommended even if you are scraping from a single computer. MySQL will be just fine for moderate workloads, and you can run simple analytics on the scraped data with tools like Tableau, PowerBI, or Metabase by connecting them to your database. For larger write loads you can look into NoSQL databases like MongoDB, Cassandra, etc. A minimal sketch follows below.
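As a minimal, self-contained sketch, the snippet below stores scraped records in a database. SQLite (bundled with Python) stands in for MySQL here, and the table layout is an assumption based on the fields this tutorial extracts.

import sqlite3

# One dict per product, shaped like the scraper's extracted_data list.
extracted_data = [{'NAME': 'Example Product', 'SALE_PRICE': '$14.99',
                   'URL': 'http://www.amazon.com/dp/B0046UR4F4'}]

conn = sqlite3.connect('amazon.db')  # SQLite as a stand-in for MySQL etc.
conn.execute("""CREATE TABLE IF NOT EXISTS products (
                    url TEXT PRIMARY KEY, name TEXT, sale_price TEXT,
                    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")
for item in extracted_data:
    # Upsert so re-running the scraper refreshes prices instead of duplicating rows.
    conn.execute("INSERT OR REPLACE INTO products (url, name, sale_price) VALUES (?, ?, ?)",
                 (item['URL'], item['NAME'], item['SALE_PRICE']))
conn.commit()
conn.close()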
5. Use Request Headers, Proxies, and IP Rotation to prevent getting Captchas from Amazon
Amazon has a lot of anti-scraping measures. If you flood Amazon with requests, you’ll be blocked in no time and you’ll start seeing captchas instead of product pages. To prevent that to a certain extent, change your headers while going through each Amazon product page, replacing the User-Agent value so that requests look like they’re coming from a browser and not a script.
If you’re going to crawl Amazon at a very large scale, use proxies and IP rotation to reduce the number of captchas you get, as sketched below. You can learn more techniques to prevent getting blocked by Amazon and other sites here: How to prevent getting blacklisted while scraping. You can also use Python to solve some basic captchas with the Tesseract OCR engine.
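Here is a minimal sketch of rotating User-Agent headers and proxies with Requests. The User-Agent strings are ordinary browser identifiers, but the proxy addresses are placeholders you would replace with your own pool.

import random
import requests

# Rotate through a pool of browser User-Agent strings, one per request.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
]
# Placeholder proxies; substitute the addresses of your own proxy pool.
PROXIES = ['http://proxy1.example.com:8080', 'http://proxy2.example.com:8080']

def fetch(url):
    proxy = random.choice(PROXIES)
    return requests.get(url,
                        headers={'User-Agent': random.choice(USER_AGENTS)},
                        proxies={'http': proxy, 'https': proxy},
                        timeout=30)

response = fetch("http://www.amazon.com/dp/B0046UR4F4")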
6. Write some simple data quality tests
Scraped data is always messy. An XPath that works on one page might not work for another variation of the same page on the same site, and Amazon has LOTS of product page layouts. If you spend an hour writing some basic sanity checks for your data, like verifying that the price is a decimal and the title is a string of fewer than, say, 250 characters, you’ll know when your scraper breaks and you’ll be able to minimize its impact, as in the sketch below. This is a must if you feed the scraped Amazon data into a price optimisation program.
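A sketch of such checks, written against the record layout used earlier in this tutorial; the 250-character limit is the threshold suggested above and can be tuned.

def is_valid_record(item):
    # Title should be a non-empty string under 250 characters.
    name = item.get('NAME')
    if not name or len(name) >= 250:
        return False
    # Price should parse as a decimal once currency symbols are stripped.
    try:
        float(item.get('SALE_PRICE', '').replace('$', '').replace(',', ''))
    except ValueError:
        return False
    return True

extracted_data = [{'NAME': 'Example Product', 'SALE_PRICE': '$14.99'},
                  {'NAME': '', 'SALE_PRICE': 'N/A'}]  # the second record should be dropped
clean = [item for item in extracted_data if is_valid_record(item)]
print("Dropped %d suspicious record(s)" % (len(extracted_data) - len(clean)))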
We hope this tutorial gave you a better idea of how to scrape Amazon or similar e-commerce websites. As a company, we understand e-commerce data, having worked with it before. If you are interested in professional help with scraping complex websites, let us know, and we will be glad to help.
Need some help with scraping eCommerce data?
Turn websites into meaningful and structured data through our web data extraction service
Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.