Learn to scrape Amazon using Python. Extract Amazon product details like Name, Price, ASIN, and more by scraping Amazon.
In this web scraping tutorial, we will build an Amazon Review Scraper using Python in 3 steps. It can extract review data from Amazon products, such as Review Title, Review Content, Product Name, Rating, Date, Author, and more, into an Excel spreadsheet. You can also check out our tutorial on how to build a Python scraper to scrape Amazon product details and pricing. We will build this simple Amazon review scraper using Python and Selectorlib and run it in a console.
Here are the steps to scrape Amazon reviews using Python:
- Mark up the data fields to be scraped using Selectorlib
- Copy and run the code provided
- Download the data in Excel (CSV) format.
Below, we have also covered how to scrape product details from the Amazon search results page, how to avoid getting blocked by Amazon, and how to scrape Amazon on a large scale.
If you do not want to code, we have made it simple to do all this for FREE and in a few clicks. ScrapeHero Cloud can scrape reviews of Amazon products within seconds!
Here are some of the data fields that the Amazon product review scraper will extract into a spreadsheet from Amazon:
- Product Name
- Review Title
- Review Content/Review Text
- Rating
- Date of publishing review
- Verified Purchase
- Author Name
- URL
We will save the data as an Excel Spreadsheet (CSV).
Installing the required packages for running the Amazon Reviews Web Scraper
In this web scraping tutorial, we will scrape Amazon product reviews using Python 3 and its libraries. We will not be using Scrapy for this tutorial. This code can run easily and quickly on any computer (including a Raspberry Pi).
If you do not have Python 3 installed, you can follow this guide to install Python on Windows here – How To Install Python Packages.
We will use these libraries:
- Python Requests, to make requests and download the HTML content of the pages ( http://docs.python-requests.org/en/master/user/install/).
- LXML, for parsing the HTML Tree Structure using Xpaths (Learn how to install that here – http://lxml.de/installation.html)
- Python Dateutil, for parsing review dates (https://github.com/dateutil/dateutil/)
- Selectorlib, to extract data using the YAML file we created from the webpages we download
Install them using pip3:
pip3 install python-dateutil lxml requests selectorlib
Read More – Analyzing top shoe brands in Amazon
The Code
You can get all the code used in this tutorial from Github – https://github.com/scrapehero-code/amazon-review-scraper
Let’s create a file called reviews.py and paste the following Python code into it.
Here is what the Amazon product review scraper does:
- Reads a list of Product Review Pages URLs from a file called urls.txt (This file will contain the URLs for the Amazon product pages you care about)
- Uses a selectorlib YAML file that identifies the data on an Amazon page and is saved in a file called selectors.yml (more on how to generate this file later in this tutorial)
- Scrapes the Data
- Saves the data as a CSV spreadsheet called data.csv
```python
from selectorlib import Extractor
import requests
import csv
from time import sleep
from dateutil import parser as dateparser

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('selectors.yml')

def scrape(url):
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)
    # Simple check to see if the page was blocked (usually HTTP 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None
    # Pass the HTML of the page to the Extractor and return the data
    return e.extract(r.text)

with open("urls.txt", 'r') as urllist, open('data.csv', 'w') as outfile:
    writer = csv.DictWriter(outfile,
                            fieldnames=["title", "content", "date", "variant", "images",
                                        "verified", "author", "rating", "product", "url"],
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for url in urllist.readlines():
        data = scrape(url)
        if data:
            for r in data['reviews']:
                r["product"] = data["product_title"]
                r['url'] = url
                if 'verified' in r:
                    if 'Verified Purchase' in r['verified']:
                        r['verified'] = 'Yes'
                    else:
                        r['verified'] = 'No'
                r['rating'] = r['rating'].split(' out of')[0]
                date_posted = r['date'].split('on ')[-1]
                if r['images']:
                    r['images'] = "\n".join(r['images'])
                r['date'] = dateparser.parse(date_posted).strftime('%d %b %Y')
                writer.writerow(r)
            # sleep(5)
```
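The cleanup steps inside the loop above (trimming the rating string and reformatting the review date) can be hard to read inline. Here is a minimal standalone sketch of that post-processing using only the standard library. Note that clean_rating and clean_date are illustrative helper names, and strptime assumes English month names; the scraper itself uses the more forgiving dateutil parser.

```python
from datetime import datetime

def clean_rating(rating_text):
    """Turn a rating string like '4.0 out of 5 stars' into just the number."""
    return rating_text.split(' out of')[0]

def clean_date(date_text):
    """Reformat the date from a string like
    'Reviewed in the United States on September 14, 2019'."""
    date_posted = date_text.split('on ')[-1]
    # '%B %d, %Y' assumes English month names, unlike dateutil
    return datetime.strptime(date_posted, '%B %d, %Y').strftime('%d %b %Y')

print(clean_rating('4.0 out of 5 stars'))  # 4.0
print(clean_date('Reviewed in the United States on September 14, 2019'))  # 14 Sep 2019
```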
If you don't want to code, ScrapeHero Cloud is just right for you!
Skip the hassle of installing software, programming and maintaining the code. Download this data using ScrapeHero cloud within seconds.
Get Started for Free
Creating the YAML file – selectors.yml
You will notice in the code above that we used a file called selectors.yml. This file is what makes this tutorial so easy to create and follow. The magic behind this file is a Web Scraper tool called Selectorlib.
Selectorlib is a tool that makes selecting, marking up, and extracting data from web pages visual and easy. The Selectorlib Web Scraper Chrome Extension lets you mark the data that you need to extract and creates the CSS selectors or XPaths needed to extract that data. It then previews how the data will look. You can learn more about Selectorlib and how to use it here.
If you just need the data we have shown above, you do not need to use Selectorlib, since we have already done that for you and generated a simple “template” that you can just use. However, if you want to add a new field, you can use Selectorlib to add that field to the template.
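For example, to also capture the “X people found this helpful” line, you could add one more child selector under the reviews block of the template. The field name helpful and the CSS selector here are illustrative assumptions; use the Selectorlib extension to confirm the right selector on the live page:

```yaml
# Illustrative only: add under reviews -> children in selectors.yml
helpful:
    css: 'span[data-hook="helpful-vote-statement"]'
    type: Text
```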
Here is how we marked up the fields for the data we need to scrape Amazon reviews from the Product Reviews Page using Selectorlib Chrome Extension.
Once you have created the template, click on ‘Highlight’ to highlight and preview all of your selectors. Finally, click on ‘Export’ and download the YAML file and that file is the selectors.yml file.
Here is what our template (selectors.yml) file looks like:
```yaml
product_title:
    css: 'h1 a[data-hook="product-link"]'
    type: Text
reviews:
    css: 'div.review div.a-section.celwidget'
    multiple: true
    type: Text
    children:
        title:
            css: a.review-title
            type: Text
        content:
            css: 'div.a-row.review-data span.review-text'
            type: Text
        date:
            css: span.a-size-base.a-color-secondary
            type: Text
        variant:
            css: 'a.a-size-mini'
            type: Text
        images:
            css: img.review-image-tile
            multiple: true
            type: Attribute
            attribute: src
        verified:
            css: 'span[data-hook="avp-badge"]'
            type: Text
        author:
            css: span.a-profile-name
            type: Text
        rating:
            css: 'div.a-row:nth-of-type(2) > a.a-link-normal:nth-of-type(1)'
            type: Attribute
            attribute: title
next_page:
    css: 'li.a-last a'
    type: Link
```
Previous Versions of the Scraper
If you need a script that runs on older versions of Python, you can view the previous versions of this code to scrape Amazon reviews.
Python 3 (built in 2018) – https://gist.github.com/scrapehero/900419a768c5fac9ebdef4cb246b25cb
Python 2.7 (built in 2016) – https://gist.github.com/scrapehero/3d53ae193766bc51408ec6497fbd1016.
Running the Amazon Review Scraper
You can get all the code used in this tutorial from Github – https://github.com/scrapehero-code/amazon-review-scraper
All you need to do is add the URLs you need to scrape into a text file called urls.txt in the same folder and run the scraper using the command:
python3 reviews.py
Here is an example URL – https://www.amazon.com/HP-Business-Dual-core-Bluetooth-Legendary/product-reviews/B07VMDCLXV/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews
You can get this URL by clicking on “See all reviews” near the bottom of the product page.
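If you already have a product’s ASIN, you can also construct the reviews URL directly instead of clicking through. Here is a small sketch; the URL pattern is an assumption based on the example above and may change on Amazon’s side:

```python
def reviews_url(asin):
    """Build an Amazon 'all reviews' URL from a product ASIN
    (pattern assumed from the example URL above)."""
    return ("https://www.amazon.com/product-reviews/%s/"
            "ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews" % asin)

print(reviews_url("B07VMDCLXV"))
```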
Here is what the scraped Amazon reviews look like:
This code can be used to scrape Amazon reviews for a relatively small number of ASINs for your personal projects. But if you want to scrape websites at the scale of thousands of pages, learn about the challenges here – How to build and run scrapers on a large scale.
Read More – Learn to analyze Amazon Reviews
What can you do by scraping Amazon reviews?
The data that you gather from this tutorial can help you with:
- Getting review details unavailable through the official Amazon Product Advertising API
- Monitoring customer opinions on products that you sell or manufacture using data analysis
- Creating Amazon review datasets for educational purposes and research
- Monitoring the quality of products sold by third-party sellers
A few years back, Amazon provided developers and sellers access to product reviews through its Product Advertising API. Amazon discontinued that on November 8th, 2010, preventing customers from embedding Amazon reviews about their products in their own websites. As of now, Amazon only returns a link to the review.
Building a Free Amazon Reviews API using Python, Flask & Selectorlib
If you are looking to get reviews through an API, like an Amazon Product Advertising API, you may find the tutorial below interesting.
Thanks for reading and if you need help with your complex scraping projects let us know and we will be glad to help.
Do you need some professional help to scrape Amazon Data? Let us know
Turn the Internet into meaningful, structured and usable data
Responses
This script does not seem to work. The JSON written does not have any reviews in it.
Please copy the detailed error or how you ran this so we can check.
Thanks
How do I increase the number of reviews obtained?
Hi Arjun – that’s what’s called “an exercise left to the reader”. You will have to look at the pagination – click that and then get the next page and so on. Most likely you will get blocked pretty soon.
The ratings dictionary is very helpful for getting the percentage distributions of the reviews based on the number of stars, however is there an easy way to see the total number of reviews? For example, are those percentages based on 11 reviews or 3,000? Thanks!
I’m not very familiar with lxml so I think that’s where the I’m getting stuck
Hi,
I don’t think it’s working. Can you help me fix it? This is the output of the json file:
[
{
“error”: “failed to process the page”,
“asin”: “B01ETPUQ6E”
},
{
“error”: “failed to process the page”,
“asin”: “B017HW9DEW”
}
]
Thank you!
Could be an IP block?
Not showing all reviews. Any ideas? My products have a lot of reviews and the total result after I used the script isn't even close to that.
This script doesn’t get you all reviews. It was written specifically to demonstrate scraping reviews using Python, and was never intended as a fully functional scraper for thousands of pages.
I ran the code on Jupyter. The code ran without any error but I am not getting any output file.
When using a Jupyter Notebook, you should call the function ParseReviews with your ASIN. For example, ParseReviews('B01ETPUQ6E') would return a dict.
I am quite new to Python so apologies for any ignorance. I am getting a urllib3 InsecureRequestWarning, even after following the instructions here: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings. Any thoughts as to why? I am using Jupyter, Python version 2.7.
Any idea why I would be getting this warning: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings. I followed the instructions on the urllib3 page but am still getting the same warning. I am in Jupyter (Python 2). Thank you!
Love what you guys are doing, big fan of yours. I am currently collecting emails of Amazon reviewers and it’s a very time consuming process. If you could help me with a code for doing this it would be awesome and thank you for reading all of this.
Sorry we can’t write code on demand but you can hire someone on upwork to do all this.
I keep getting the error “unable to find reviews in page”, what could be the problem? [ I promise the product has reviews ]
The HTML parser seemed to have a depth limit. It won't traverse further to parse the text if the depth exceeds 254. We have updated our code to handle this.
We found Amazon sending null bytes along with the response in some cases which caused the Lxml parser failure. Our code base is now updated.
how would we get like 100 reviews off the site?
You would need to find the link to the next page of reviews and parse it similarly, as in this tutorial.
Is there any way to get the product rank as well?
https://github.com/DavidRoldan523/amazon_reviews_allpages
This code is a script to scrape all reviews on all Amazon pages.
How would you scrape 12000 products by search query only?
Amazon restricts the number it shows and it is far below 12000
Hello, this is amazing. Can you please guide how to do similar process in BestBuy ? It would be really great for me and many others.