Build and host your own FREE Amazon Reviews API using Python and a free Web scraper tool called Selectorlib
In this web scraping tutorial, we will build an Amazon Review Scraper using Python in 3 steps, which can extract review data from Amazon products such as – Review Title, Review Content, Product Name, Rating, Date, Author and more, into an Excel spreadsheet. You can also check out our tutorial on how to build a Python scraper to scrape Amazon product details and pricing. We will build this simple Amazon review scraper using Python and SelectorLib and run it in a console.
Here are the steps on how you can scrape Amazon reviews using Python
- Markup the data fields to be scraped using Selectorlib
- Copy and run the code provided
- Download the data in Excel (CSV) format.
We have also provided how you can scrape product details from Amazon search result page, how to avoid getting blocked by Amazon and how to scrape Amazon on a large scale below.
If you do not want to code, we have made it simple to do all this for FREE and in a few clicks. ScrapeHero Cloud can scrape reviews of Amazon products within seconds!
Here are some of the data fields that the Amazon product review scraper will extract into a spreadsheet from Amazon:
- Product Name
- Review Title
- Review Content/Review Text
- Rating
- Date of publishing review
- Verified Purchase
- Author Name
- URL
We will save the data as an Excel Spreadsheet (CSV).
Installing the required packages for running Amazon Reviews Web Scraper
For this web scraping tutorial to scrape Amazon product reviews using Python 3 and its libraries. We will not be using Scrapy for this tutorial. This code can run easily and quickly on any computer (including a Raspberry Pi)
If you do not have Python 3 installed, you can follow this guide to install Python in Windows here – How To Install Python Packages.
We will use these libraries:
- Python Requests, to make requests and download the HTML content of the pages ( http://docs.python-requests.org/en/master/user/install/).
- LXML, for parsing the HTML Tree Structure using Xpaths (Learn how to install that here – http://lxml.de/installation.html)
- Python Dateutil, for parsing review dates (https://github.com/dateutil/dateutil/)
- Selectorlib, to extract data using the YAML file we created from the webpages we download
Install them using pip3
pip3 install python-dateutil lxml requests selectorlib
Read More – Analyzing top shoe brands in Amazon
The Code
You can get all the code used in this tutorial from Github – https://github.com/scrapehero-code/amazon-review-scraper
Let’s create a file called reviews.py
and paste the following Python code into it.
Here is what the Amazon product review scraper does:
- Reads a list of Product Review Pages URLs from a file called urls.txt (This file will contain the URLs for the Amazon product pages you care about)
- Uses a selectorlib YAML file that identifies the data on an Amazon page and is saved in a file called selectors.yml (more on how to generate this file later in this tutorial)
- Scrapes the Data
- Saves the data as CSV Spreadsheet called data.csv
from selectorlib import Extractor import requests import json from time import sleep import csv from dateutil import parser as dateparser # Create an Extractor by reading from the YAML file e = Extractor.from_yaml_file('selectors.yml') def scrape(url): headers = { 'authority': 'www.amazon.com', 'pragma': 'no-cache', 'cache-control': 'no-cache', 'dnt': '1', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 'sec-fetch-site': 'none', 'sec-fetch-mode': 'navigate', 'sec-fetch-dest': 'document', 'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8', } # Download the page using requests print("Downloading %s"%url) r = requests.get(url, headers=headers) # Simple check to check if page was blocked (Usually 503) if r.status_code > 500: if "To discuss automated access to Amazon data please contact" in r.text: print("Page %s was blocked by Amazon. Please try using better proxies\n"%url) else: print("Page %s must have been blocked by Amazon as the status code was %d"%(url,r.status_code)) return None # Pass the HTML of the page and create return e.extract(r.text) with open("urls.txt",'r') as urllist, open('data.csv','w') as outfile: writer = csv.DictWriter(outfile, fieldnames=["title","content","date","variant","images","verified","author","rating","product","url"],quoting=csv.QUOTE_ALL) writer.writeheader() for url in urllist.readlines(): data = scrape(url) if data: for r in data['reviews']: r["product"] = data["product_title"] r['url'] = url if 'verified' in r: if 'Verified Purchase' in r['verified']: r['verified'] = 'Yes' else: r['verified'] = 'Yes' r['rating'] = r['rating'].split(' out of')[0] date_posted = r['date'].split('on ')[-1] if r['images']: r['images'] = "\n".join(r['images']) r['date'] = dateparser.parse(date_posted).strftime('%d %b %Y') writer.writerow(r) # sleep(5)
If you don't like or want to code, ScrapeHero Cloud is just right for you!
Skip the hassle of installing software, programming and maintaining the code. Download this data using ScrapeHero cloud within seconds.
Get Started for Free
Creating the YAML file – selectors.yml
You will notice in the code above that we used a file called selectors.yml
. This file is what makes this tutorial so easy to create and follow. The magic behind this file is a Web Scraper tool called Selectorlib.
Selectorlib is a tool that makes selecting, marking up, and extracting data from web pages visual and easy. The Selectorlib Web Scraper Chrome Extension lets you mark data that you need to extract, and creates the CSS Selectors or XPaths needed to extract that data. Then previews how the data would look like. You can learn more about Selectorlib and how to use it here
If you just need the data we have shown above, you do not need to use Selectorlib. Since we have done that for you already and generated a simple “template” that you can just use. However, if you want to add a new field, you can use Selectorlib to add that field to the template.
Here is how we marked up the fields for the data we need to scrape Amazon reviews from the Product Reviews Page using Selectorlib Chrome Extension.
Once you have created the template, click on ‘Highlight’ to highlight and preview all of your selectors. Finally, click on ‘Export’ and download the YAML file and that file is the selectors.yml file.
Here is how our template (selectors.yml) file looks like:
product_title: css: 'h1 a[data-hook="product-link"]' type: Text reviews: css: 'div.review div.a-section.celwidget' multiple: true type: Text children: title: css: a.review-title type: Text content: css: 'div.a-row.review-data span.review-text' type: Text date: css: span.a-size-base.a-color-secondary type: Text variant: css: 'a.a-size-mini' type: Text images: css: img.review-image-tile multiple: true type: Attribute attribute: src verified: css: 'span[data-hook="avp-badge"]' type: Text author: css: span.a-profile-name type: Text rating: css: 'div.a-row:nth-of-type(2) > a.a-link-normal:nth-of-type(1)' type: Attribute attribute: title next_page: css: 'li.a-last a' type: Link
Previous Versions of the Scraper
If you need a script that runs on older versions of Python, you can view the previous versions of this code to scrape Amazon reviews.
Python 3 (built in 2018) – https://gist.github.com/scrapehero/900419a768c5fac9ebdef4cb246b25cb
Python 2.7 (built in 2016) – https://gist.github.com/scrapehero/3d53ae193766bc51408ec6497fbd1016.
Running the Amazon Review Scraper
You can get all the code used in this tutorial from Github – https://github.com/scrapehero-code/amazon-review-scraper
All you need to do is add the URLs you need to scrape into a text file called urls.txt in the same folder and run the scraper using the command:
python3 reviews.py
Here is an example URL – https://www.amazon.com/HP-Business-Dual-core-Bluetooth-Legendary/product-reviews/B07VMDCLXV/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews
You can get this URL by clicking on “See all reviews” near the bottom of the product page.
Here is how the Amazon scraped reviews look like:
This code can be used to scrape Amazon reviews of a relatively small number of ASINs for your personal projects. But if you want to scrape websites for thousands of pages, learn about the challenges here How to build and run scrapers on a large scale.
Read More – Learn to analyze Amazon Reviews
What can you do with Scraping Amazon Reviews?
The data that you gather from this tutorial can help you with:
- You can get review details unavailable using the official Amazon Product Advertising API.
- Monitoring customer opinions on products that you sell or manufacture using Data Analysis
- Create Amazon Review Datasets for Educational Purposes and Research
- Monitor product quality sold by third-party sellers
Amazon used to provide access to product reviews through their Product Advertising API to developers and sellers, a few years back. They discontinued that on November 8th, 2010, preventing customers from displaying Amazon reviews about their products, embedded in their websites. As of now, Amazon only returns a link to the review.
If you don't like or want to code, ScrapeHero Cloud is just right for you!
Skip the hassle of installing software, programming and maintaining the code. Download this data using ScrapeHero cloud within seconds.
Get Started for Free
Building a Free Amazon Reviews API using Python, Flask & Selectorlib
If you are looking for getting reviews as an API, like an Amazon Product Advertising API – you may find this tutorial below interesting.
Thanks for reading and if you need help with your complex scraping projects let us know and we will be glad to help.
Do you need some professional help to scrape Amazon Data? Let us know
Turn the Internet into meaningful, structured and usable data
Responses
how would we get like 100 reviews off the site?
You would need to find the link to next page of reviews and parse it similarly as in this tutorial
Is there any way to get the product rank as well?
https://github.com/DavidRoldan523/amazon_reviews_allpages
This code is a Script to scrape all reviews on all Amazon pages
Tihs code is Script to scrape all reviews on all Amazon pages
https://github.com/DavidRoldan523/amazon_reviews_allpages
How would you scrape 12000 products by search query only?
Amazon restricts the number it shows and it is far below 12000