This tutorial is a follow-up to Tutorial: How To Scrape Amazon Product Details and Pricing using Python, and extends the Amazon price data to also cover product reviews. Its scope is limited to scraping an Amazon product page to retrieve the review summary and the first page of customer reviews for any product on Amazon.
Scraping customer reviews from Amazon can be useful for:
- Getting complete review details that you can’t get with the Amazon Product Advertising API.
- Monitoring customer opinion on products that you sell or manufacture, using data analysis.
- Creating Amazon review datasets for educational purposes and research.
A few years back, Amazon provided developers and sellers with access to product reviews through its Product Advertising API. It discontinued that access on November 8, 2010, which prevented customers from embedding Amazon reviews of their products in their own websites. As of now, Amazon only returns a link to the review.
Take a look at the screenshot below, from a StackOverflow thread on the same topic.
We were able to find only a few tutorials on doing this, using Perl ( http://archive.oreilly.com/pub/h/977 ). Being the Python enthusiasts we are (check out the other web scraping tutorials we have published before), we thought of making one using plain Python and the simple Python library LXML.
We’ll follow this post up with a tutorial on how to turn this code into a web API that you can use or integrate with your projects.
For this web scraping tutorial using Python 3, we will need some packages for downloading and parsing the HTML. Below are the package requirements.
Install Python 3 and Pip
Here is a guide to install Python 3 in Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/
Mac Users can follow this guide – http://docs.python-guide.org/en/latest/starting/install3/osx/
Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/
- PIP to install the following packages in Python (https://pip.pypa.io/en/stable/installing/)
- Python Requests, to make requests and download the HTML content of the pages ( http://docs.python-requests.org/en/master/user/install/).
- Python LXML, for parsing the HTML Tree Structure using Xpaths (Learn how to install that here – http://lxml.de/installation.html)
- Python Dateutil, for parsing review dates ( https://github.com/dateutil/dateutil/ )
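Before moving on, it can help to confirm that the packages listed above are importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is ours for illustration and is not part of the original script. Note that the Dateutil package is imported under the name `dateutil`.

```python
from importlib.util import find_spec

# Import names of the packages listed above
# (python-dateutil is imported as "dateutil").
REQUIRED_MODULES = ["requests", "lxml", "dateutil"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported as missing, install it with PIP using the guides linked above.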
Let us get our hands dirty now.
Here is the GIST link for the code above https://gist.github.com/scrapehero/900419a768c5fac9ebdef4cb246b25cb
If you would like the code in Python 2.7, you can check this link – https://gist.github.com/scrapehero/3d53ae193766bc51408ec6497fbd1016.
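The full script lives in the gists linked above. As a rough illustration of the general approach (LXML XPath queries to pull out fields, Dateutil to parse review dates), here is a minimal sketch that runs against a made-up HTML fragment instead of a live page. The markup, class names, XPaths, and the `parse_reviews` helper are all assumptions for illustration; they are not Amazon's actual page structure and not the code from the gist. In a real run, the page source would come from Requests instead of a hard-coded string.

```python
from dateutil import parser as dateparser
from lxml import html

# Made-up markup for illustration only; Amazon's real pages differ and change often.
SAMPLE_HTML = """
<div id="reviews">
  <div class="review">
    <span class="rating">5.0 out of 5 stars</span>
    <span class="title">Works great</span>
    <span class="date">on June 3, 2016</span>
    <p class="body">Does exactly what it says.</p>
  </div>
  <div class="review">
    <span class="rating">2.0 out of 5 stars</span>
    <span class="title">Stopped working</span>
    <span class="date">on January 15, 2017</span>
    <p class="body">Died after a month.</p>
  </div>
</div>
"""


def parse_reviews(page_source):
    """Extract a list of review dicts from an HTML string (hypothetical layout)."""
    tree = html.fromstring(page_source)
    reviews = []
    for node in tree.xpath('//div[@class="review"]'):
        # xpath() returns a list of matching text nodes; join them into one string.
        rating_text = "".join(node.xpath('.//span[@class="rating"]/text()'))
        date_text = "".join(node.xpath('.//span[@class="date"]/text()'))
        reviews.append({
            # "5.0 out of 5 stars" -> 5.0
            "rating": float(rating_text.split()[0]),
            "title": "".join(node.xpath('.//span[@class="title"]/text()')).strip(),
            # fuzzy=True tolerates extra words around the date, such as "on".
            "date": dateparser.parse(date_text, fuzzy=True).date().isoformat(),
            "text": "".join(node.xpath('.//p[@class="body"]/text()')).strip(),
        })
    return reviews


if __name__ == "__main__":
    for review in parse_reviews(SAMPLE_HTML):
        print(review)
```

When adapting this to a real page, the XPaths are the part that needs the most attention, since they have to match whatever markup the site currently serves.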
Modify the code below by adding your own ASINs to the line `AsinList = ['B01ETPUQ6E','B017HW9DEW']`.
If you are getting banned by Amazon, try increasing the delay from 5 seconds by editing the line `sleep(5)`. Increase it to, say, 10 seconds.
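    import json
    from time import sleep

    def ReadAsin():
        # Add your own ASINs here
        AsinList = ['B01ETPUQ6E', 'B017HW9DEW']
        extracted_data = []
        for asin in AsinList:
            print("Downloading and processing page http://www.amazon.com/dp/" + asin)
            # ParseReviews is defined in the full script linked above
            extracted_data.append(ParseReviews(asin))
            # Pause between requests; increase this if you are getting banned
            sleep(5)
        # Write the collected reviews to data.json
        with open('data.json', 'w') as f:
            json.dump(extracted_data, f, indent=4)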
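Instead of a fixed pause, one common variation is to add a small random extra to each delay so that requests are not evenly spaced. The helper below is a sketch of that idea; `polite_delay` and its parameters are hypothetical names, not part of the original script, and a randomized delay is no guarantee against being blocked.

```python
import random
from time import sleep


def polite_delay(base_seconds=5, jitter_seconds=3, sleep_fn=sleep):
    """Pause for base_seconds plus a random extra of up to jitter_seconds.

    Returns the number of seconds waited so callers can log it.
    `sleep_fn` can be swapped out when you do not want to actually wait.
    """
    wait = base_seconds + random.uniform(0, jitter_seconds)
    sleep_fn(wait)
    return wait
```

To try it, replace the `sleep(5)` call in the loop with `polite_delay(10)`, which waits between 10 and 13 seconds per ASIN.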
Once you are done modifying the script, run this script using Python 3 in a Terminal or Command Prompt. We named our file `amazon_review_scraper.py`.
Once the script completes running, you can see a file called data.json, with the reviews data in a JSON format.
Below is the formatted output we received for the ASINs we supplied
Here is the full output attached in a GIST.
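To take a quick look at the result from Python, you can load `data.json` back in. The sketch below assumes only what the script guarantees: the file holds a JSON list with one entry per ASIN. The `summarize_output` helper is a hypothetical name, and the field names inside each entry depend on what `ParseReviews` returns, so the code reports them instead of assuming them.

```python
import json


def summarize_output(path="data.json"):
    """Load the scraper output and report how many entries it holds."""
    with open(path) as f:
        data = json.load(f)
    # The script dumps a list with one entry per ASIN.
    summary = {"entries": len(data), "keys": []}
    if data and isinstance(data[0], dict):
        # Field names come from ParseReviews; list them rather than guess.
        summary["keys"] = sorted(data[0].keys())
    return summary


if __name__ == "__main__":
    print(summarize_output())
```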
This code should work for a relatively small number of ASINs in your personal projects. If you want to scrape thousands of pages, learn about the challenges in Scalable do-it-yourself scraping – How to build and run scrapers on a large scale.
Thanks for reading. If you need help with your complex scraping projects, let us know and we will be glad to help.