How to scrape Amazon Reviews using Python

This tutorial is a follow-up to our earlier tutorial, How To Scrape Amazon Product Details and Pricing using Python. It extends that price scraper to also cover product reviews. The scope is limited to scraping an Amazon product page to retrieve the review summary and the first page of customer reviews for any product on Amazon.

Scraping customer reviews from Amazon can be useful for:

  1. Getting complete review details that you can’t get with the Amazon Product Advertising API.
  2. Monitoring customer opinion on products that you sell or manufacture, using data analysis.
  3. Creating Amazon review datasets for educational purposes and research.

A few years back, Amazon gave developers and sellers access to product reviews through its Product Advertising API. It discontinued that access on November 8, 2010, which stopped API users from embedding Amazon reviews of their products in their own websites. As of now, Amazon only returns a link to the review.

Amazon Product Advertising API Review

Take a look at the screenshot below, from a StackOverflow thread on the same topic.

[Screenshot: StackOverflow thread on the discontinued Amazon customer review API]

We found a few tutorials on doing this using Perl ( http://archive.oreilly.com/pub/h/977 ). Being the Python enthusiasts we are (check out the other web scraping tutorials we have published before), we thought of making one using plain Python and a simple Python library, LXML.

We’ll follow this post up with a tutorial on how to turn this code into a web API that you can use or integrate with your projects.

Requirements

For this web scraping tutorial using Python 3, we will need some packages for downloading and parsing the HTML. Below are the package requirements.

Install Python 3 and Pip

Here is a guide to install Python 3 in Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/

Mac Users can follow this guide – http://docs.python-guide.org/en/latest/starting/install3/osx/

Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/

Install Packages
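
The package list is not shown in this section. As a minimal sketch, assuming the script relies on lxml (named earlier in this tutorial) and on requests for downloading pages (an assumption; confirm it against the imports in the Gist), the following checks which of those packages still need to be installed with pip. The helper name missing_packages is hypothetical.

```python
from importlib.util import find_spec

# Assumed requirements: lxml is named in this tutorial; requests is a guess.
ASSUMED_PACKAGES = ["requests", "lxml"]


def missing_packages(names):
    """Return the names from `names` that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(ASSUMED_PACKAGES)
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All assumed packages are installed.")
```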

 

Let us get our hands dirty now.

The Code

The complete code is available in this Gist: https://gist.github.com/scrapehero/900419a768c5fac9ebdef4cb246b25cb

If you would like the code in Python 2.7, you can check this link – https://gist.github.com/scrapehero/3d53ae193766bc51408ec6497fbd1016.

Modify the ReadAsin function shown below by adding your own ASINs to the line AsinList = ['B01ETPUQ6E','B017HW9DEW']. If you are getting banned by Amazon, try increasing the delay between requests by editing the line sleep(5), for example to 10 seconds with sleep(10).

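import json
from time import sleep

def ReadAsin():
	# Add your own ASINs here
	AsinList = ['B01ETPUQ6E','B017HW9DEW']
	extracted_data = []
	for asin in AsinList:
		print("Downloading and processing page http://www.amazon.com/dp/" + asin)
		# ParseReviews is defined in the full script (see the Gist link above)
		extracted_data.append(ParseReviews(asin))
		# Pause between requests to reduce the chance of getting blocked
		sleep(5)
	# The with block closes the file once the JSON has been written
	with open('data.json', 'w') as f:
		json.dump(extracted_data, f, indent=4)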
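
The fixed sleep(5) delay can also be varied a little on each request. The sketch below is one way to do that and is not part of the original script: polite_sleep and its parameters are hypothetical names, and adding random jitter is a suggestion of ours that does not guarantee Amazon will not block you.

```python
import random
from time import sleep


def polite_sleep(base_seconds=5, jitter_seconds=3, sleep_fn=sleep):
    """Pause for base_seconds plus a random extra of up to jitter_seconds.

    sleep_fn is injectable so the function can be exercised without waiting.
    Returns the delay that was used, in seconds.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    sleep_fn(delay)
    return delay
```

To try it, replace sleep(5) inside the loop with polite_sleep(10), which waits between 10 and 13 seconds before the next product page is requested.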

Once you are done modifying the script, run it using Python 3 in a Terminal or Command Prompt. We named our file `amazon_review_scraper.py`.

 python amazon_review_scraper.py

Once the script finishes running, you will find a file called data.json containing the review data in JSON format.

Below is the formatted output we received for the ASINs we supplied.

[Screenshot: formatted output of the Amazon review scraper]

Here is the full output attached in a GIST.
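
Once data.json exists, a short script can give a quick overview of what was scraped. The sketch below assumes the record shapes that appear in the comments on this page: a successful record has name, ratings and reviews keys, while a failed one has error and asin keys. The function names summarize and summarize_file are hypothetical.

```python
import json


def summarize(records):
    """Return one summary line per scraped product record."""
    lines = []
    for record in records:
        if "error" in record:
            # Failed pages carry only the ASIN and an error message
            lines.append("{}: {}".format(
                record.get("asin", "unknown ASIN"), record["error"]))
        else:
            lines.append("{}: {} reviews, {} five-star".format(
                record.get("name", "unknown product"),
                len(record.get("reviews", [])),
                record.get("ratings", {}).get("5 star", "n/a")))
    return lines


def summarize_file(path="data.json"):
    """Load the scraper's JSON output and summarize every record in it."""
    with open(path) as f:
        return summarize(json.load(f))
```

Calling summarize_file() in the folder where the scraper ran returns one line per ASIN, which makes failed pages easy to spot before you dig into the review text.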

This code should work for a relatively small number of ASINs in your personal projects. If you want to scrape thousands of pages, learn about the challenges here: Scalable do-it-yourself scraping – How to build and run scrapers on a large scale.

Thanks for reading and if you need help with your complex scraping projects let us know and we will be glad to help.


Turn the Internet into meaningful, structured and usable data

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

21 comments on “How to scrape Amazon Reviews using Python”

R

This script does not seem to work. The JSON written does not have any reviews in it.

    ScrapeHero

    Please copy the detailed error or how you ran this so we can check.
    Thanks

ARJUN S

how to increase the number of reviews obtained ??

    ScrapeHero

    Hi Arjun – that’s what’s called “an exercise left to the reader”. You will have to look at the pagination – click that and then get the next page and so on. Most likely you will get blocked pretty soon.

E

The ratings dictionary is very helpful for getting the percentage distributions of the reviews based on the number of stars, however is there an easy way to see the total number of reviews? For example, are those percentages based on 11 reviews or 3,000? Thanks!

    E

    I’m not very familiar with lxml so I think that’s where I’m getting stuck

Amy Smith

Hi,
I don’t think it’s working. Can you help me fix it? This is the output of the json file:
[
    {
        "error": "failed to process the page",
        "asin": "B01ETPUQ6E"
    },
    {
        "error": "failed to process the page",
        "asin": "B017HW9DEW"
    }
]

Thank you!

SkyChaos

Not showing all reviews. Any ideas? My products have a lot of reviews and the total result after I used the script isn’t even close to that.

    ScrapeHero

    This script doesn’t get you all reviews. It was written specifically to demonstrate scraping reviews using Python, and was never intended as a fully functional scraper for thousands of pages.

gargi

I ran the code on Jupyter. The code ran without any error but I am not getting any output file.

    ScrapeHero

    When using in Jupyter Notebook, you should call the function ParseReviews with your ASIN.

    For example,

    ParseReviews('B01ETPUQ6E') would return a dict similar to

    {'name': 'Samsung Galaxy J7 - No Contract Phone - White - (Boost Mobile)(Carrier locked phone )',
     'price': '$293.96',
     'ratings': {'1 star': '10%',
      '2 star': '4%',
      '3 star': '5%',
      '4 star': '17%',
      '5 star': '64%'},
     'reviews': [{'review_author': 'JLO',
       'review_comment_count': '',
       'review_header': 'Best phone I have owned with Boostmobile!',
       'review_posted_date': '19 Jul 2016',
       'review_rating': '5.0 ',
       'review_text': 'I love this Samsung J7, since it is bigger than the s6 and s7, and as good as my old S5 (as compared in the images). I had no issues charging it or switching it over from old phone to this new phone - via a 4 minute phone call to Boostmobile. Yes, the S6 and S7 have -much- faster processors, but I do not need that for what I do... and so far, after 1 month of use, I absolutely love this phone!. The phone feels great, responds quickly, and looks freaking awesome. Pros: -Great price -AFFORDABLE! -Bigger than most other phones available -Great quality screen Cons: -2 bottom buttons- on side of the main home button- DO NOT light up -Camera is not comparable to that of the S5, S6, and S7 -Overall quality does not feel as sturdy as the other models mentioned (Shell plastic is thinner). After one month of use, I rated this phone with 4 stars. I will be updating this review in about 6 months. REVIEW UPDATE: February 19, 2017 After eight (8) months of owning this phone, and 7 months since this original review, I am back today to continue my review as promised. I am doing 2 things for those of you who are reading this review for the first time: 1) I have added a star to make this a 5 star review! 2) I will explain why I have decided to come back to add this star to my review. Let me clarify that since reviewing this, I have purchased a second one for my wife. I will also list what items I purchased it along with: Mr Shield Tempered Glass Screen Protector for Samsung Galaxy J7 [Will Not Fit For 2016 Version] - 2-Pack Samsung EVO 64GB Micro SDXC Memory Card with Adapter up to 48/MB/s (MB-MP64DA/AM) Phonelicious SAMSUNG Galaxy J7 Case(Boost,Virgin,TMobile,Metro PCS)Slim Fit Heavy Duty Ultimate Drop Protection Rugged Cover with Screen Protector & Stylus (Navy Blue Matte) Of course, purchasing a case, the Screen protector, and a good quality memory card may also influence how this product has performed. 
This phone continues to work as expected, and has delivered so far. I feel my investment on this phone has paid off, and my money has been worth. Being a somewhat frugal shopper, I thought I would give this phone a try to since it\'s price was reasonable for what I was getting. My 1 year old (who is now 18 months) has used and abused this phone. He has thrown it on the floor dozens of times, and has scratched, as well chewed on it. What has happened to the phone so far? - It has gained a large crack, on the screen protector. The phone is still working as it did 8 months ago, and the cheap items I purchased to protected have taken quite the toll. I had originally mentioned that the phone felt "cheap", and the thin plastic that it is made out of, is certainly noticeable compared to the quality of the S5, and other S series for that matter. Given that this phone has performed well, continues to deliver, and has outlasted quite some abuse, I made the judgement to give this phone the extra star, since this is a great product to my standards. I plan to come back in June, and review this item again after a full year of use!'},
      {'review_author': 'Frankie',
       'review_comment_count': '',
       'review_header': 'Very good phone for the money',
       'review_posted_date': '13 May 2016',
       'review_rating': '4.0 ',
       'review_text': "I bought this phone from boost Mobile website for 200 bucks. Of course its not as good as the s7, but if you don't wanna spend 700$ for a phone, then you can't go wrong with j7. Very good phone for the money. I already bought a couple spare batteries. Pros- Good price, good call quality, fast internet, good camera, good selfie camera, great display, perfect size, great for texting, Great battery life (leave it on power save mode), good speaker, 6.0 marshmallow is great, Cons- No LED notification light, incoming ring tone couple be louder."},
      {'review_author': 'Colleen Marie',
       'review_comment_count': '',
       'review_header': "GORGEOUS phone .... I'm in love with it!",
       'review_posted_date': '06 May 2016',
       'review_rating': '5.0 ',
       'review_text': "I love this Samsung J7. I couldn't afford the sticker price for the S6 or S7 (sticker shock!). Therefore; I opted for this which is scaled down in terms of processing/memory BUT it's much better quality than the phone I was using for my Boost Mobile account. I had no issues charging it or switching it over from old phone to this new phone - via my online account (I didn't have to talk to anybody to do this and it worked out just great). My hubby has the Samsung S6 which he saved/paid for upfront (no contract at Boost Mobile). Comparing my new J7 to his S6 - well; apples to oranges. His has 2 processors (a quad-core and an octa-core). This J7 has just the octa-core BUT for me is proving to be plenty of processing power along with the 2GB ROM. This was pretty comparable to the ZTE Warp Elite I've been using since January 2016 but I'd have to say that the Samsung is much faster - also that the screen response is so much better in this J7 model (even with the glass protection I placed on the screen). My hubby bought me this for my upcoming 60th birthday! GREAT present for me.....all mine! Thank you 'shopcelldeals' for selling this at a reasonable price that was affordable. Recommended."},
      {'review_author': 'Leesaa',
       'review_comment_count': '',
       'review_header': 'Love the phone',
       'review_posted_date': '05 Jul 2016',
       'review_rating': '5.0 ',
       'review_text': 'Awesome phone for the price. Face it, these are not the S7 phones but a cheaper version. I have no problem with them at all. Plenty of memory and large screen. I love the phone. You are getting a Marshmallow system which should keep me going for a couple of years. Boost mobile is like every other cell company. Hard to deal with on the phone but once the service is set up, no problems. Fast 4gLTE.'},
      {'review_author': 'TG',
       'review_comment_count': '',
       'review_header': 'Worth it.',
       'review_posted_date': '19 Nov 2016',
       'review_rating': '5.0 ',
       'review_text': 'I love this phone! Coming from the galaxy S3 for Boost this is a big jump in the right direction. $20 cheaper on amazon than the boost mobile site another plus.The battery last all day. That is while talking/texting using apps. It is two times bigger than the S3 but def worth it. The operating system is much faster and more responsive as well.'},
      {'review_author': 'David Erickson',
       'review_comment_count': '',
       'review_header': 'This is a great phone for Boost Mobile',
       'review_posted_date': '06 Oct 2016',
       'review_rating': '5.0 ',
       'review_text': "This is a great phone for Boost Mobile. I have had it a few months and it's been great. Much nicer than my old one and the big screen has spoiled me. The iPhone 6 I use for work is tiny by comparison and I much prefer this phone over the iPhone."}],
     'url': 'https://www.amazon.com/dp/B01ETPUQ6E'}
    
Connor

I am quite new to Python so apologies for any ignorance. I am getting a urllib3 InsecureRequestWarning, even after following the instructions here:https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning. Any thoughts as to why? I am using Jupyter, Python version 2.7.

Connor

Any idea why I would be getting this warning: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning) ? I followed the instructions on the urllib3 page but am still getting the same warning. I am in Jupyter (python 2). Thank you!

pewds

Love what you guys are doing, big fan of yours. I am currently collecting emails of Amazon reviewers and it’s a very time consuming process. If you could help me with a code for doing this it would be awesome and thank you for reading all of this.

    ScrapeHero

    Sorry we can’t write code on demand but you can hire someone on upwork to do all this.

Katie

I keep getting the error “unable to find reviews in page”, what could be the problem? [ I promise the product has reviews ]

    Nithu

    The HTML parser seemed to have a depth limit. It won’t traverse further to parse the text if the depth exceeds 254. We have updated our code to handle this.

    rijesh

    We found Amazon sending null bytes along with the response in some cases which caused the Lxml parser failure. Our code base is now updated.
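
The null-byte problem described in the reply above can be worked around by cleaning the response text before it reaches the parser. This is only a sketch of the idea and may differ from the fix applied in the updated code; strip_null_bytes is a hypothetical helper name.

```python
def strip_null_bytes(html_text):
    """Remove NUL characters that can make an HTML parser fail."""
    return html_text.replace("\x00", "")
```

You would call it on the downloaded page text before handing that text to the HTML parser.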

Sarah

how would we get like 100 reviews off the site?

    ScrapeHero

    You would need to find the link to next page of reviews and parse it similarly as in this tutorial
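
The pagination approach described in the replies above can be sketched as a loop that keeps following the next-page link. The sketch below leaves the Amazon-specific part out on purpose: fetch_page is a hypothetical function you would write yourself to download one review page, parse it the same way as in this tutorial, and return the reviews together with the URL of the next page (or None on the last page). The name collect_reviews and the max_pages cap are also assumptions.

```python
def collect_reviews(first_url, fetch_page, max_pages=3):
    """Follow next-page links and gather reviews until the page limit.

    fetch_page is a caller-supplied function that takes a URL and returns
    a tuple: (reviews_on_that_page, next_page_url_or_None).
    """
    reviews = []
    url = first_url
    pages_seen = 0
    # Stop when there is no next page or the page cap has been reached
    while url and pages_seen < max_pages:
        page_reviews, url = fetch_page(url)
        reviews.extend(page_reviews)
        pages_seen += 1
    return reviews
```

Keep max_pages small and pause between requests inside fetch_page, since fetching many pages in a row makes it more likely that you will get blocked.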

