How to scrape Amazon Reviews using Python

In this web scraping tutorial, we will build an Amazon review scraper in Python that extracts review data from Amazon products, such as the review title, review content, product name, rating, date, and author, into a CSV file that you can open in Excel. You can also check out our tutorial on how to build a Python scraper to scrape Amazon product details and pricing. We will build this simple Amazon review scraper using Python and Selectorlib and run it in a console.

Here are the steps to scrape Amazon reviews using Python:

  1. Markup the data fields to be scraped using Selectorlib
  2. Copy and run the code provided
  3. Download the data in Excel (CSV) format.

Below, we also cover how to scrape product details from Amazon search result pages, how to avoid getting blocked by Amazon, and how to scrape Amazon on a large scale.

If you do not want to code, we have made it simple to do all this for FREE and in a few clicks. ScrapeHero Cloud can scrape reviews of Amazon products within seconds!

Use Amazon Review Scraper from ScrapeHero Cloud

Here are some of the data fields that the Amazon product review scraper will extract into a spreadsheet from Amazon:

  1. Product Name
  2. Review Title
  3. Review Content/Review Text
  4. Rating
  5. Date of publishing review
  6. Verified Purchase
  7. Author Name
  8. URL

We will save the data as a CSV file, which you can open as a spreadsheet in Excel.

Amazon Review Scraper Data Sample

Installing the required packages for running Amazon Reviews Web Scraper

In this web scraping tutorial, we will scrape Amazon product reviews using Python 3 and a few of its libraries. We will not be using Scrapy. The code runs easily and quickly on any computer (including a Raspberry Pi).
If you do not have Python 3 installed, you can follow this guide to install Python on Windows – How To Install Python Packages.

We will use these libraries:

  1. python-dateutil
  2. lxml
  3. requests
  4. selectorlib

Install them using pip3:

pip3 install python-dateutil lxml requests selectorlib
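
If you want to confirm that the install worked before running the scraper, a small check like the one below can help. It is only a sketch: the `missing_packages` helper and the `REQUIRED` mapping are names made up for this example, and the check only tests whether each module can be found, not whether its version is compatible.

```python
from importlib.util import find_spec

# pip package name -> module name you import in Python
REQUIRED = {
    "python-dateutil": "dateutil",
    "lxml": "lxml",
    "requests": "requests",
    "selectorlib": "selectorlib",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, run: pip3 install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```

If the script reports missing packages, rerun the pip3 command above with the listed names.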

The Code

You can get all the code used in this tutorial from Github – https://github.com/scrapehero-code/amazon-review-scraper

Let’s create a file called reviews.py and paste the following Python code into it.

Here is what the Amazon product review scraper does:

  1. Reads a list of product review page URLs from a file called urls.txt (this file contains the URLs of the Amazon product review pages you care about)
  2. Uses a Selectorlib YAML file, saved as selectors.yml, that identifies the data on an Amazon page (more on how to generate this file later in this tutorial)
  3. Scrapes the data
  4. Saves the data to a CSV spreadsheet called data.csv

from selectorlib import Extractor
import requests
from time import sleep
import csv
from dateutil import parser as dateparser

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('selectors.yml')

def scrape(url):    
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)
    # Simple check to see if the page was blocked (usually a 503)
    if r.status_code >= 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None
    # Pass the HTML of the page to the extractor and return the extracted data
    return e.extract(r.text)

with open("urls.txt", 'r') as urllist, open('data.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=["title","content","date","variant","images","verified","author","rating","product","url"], quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for line in urllist:
        url = line.strip()  # drop the trailing newline
        if not url:
            continue  # skip blank lines
        data = scrape(url)
        if data and data.get('reviews'):
            for r in data['reviews']:
                r["product"] = data["product_title"]
                r['url'] = url
                # Fields that Selectorlib cannot find come back as None
                if r.get('verified') and 'Verified Purchase' in r['verified']:
                    r['verified'] = 'Yes'
                else:
                    r['verified'] = 'No'
                if r.get('rating'):
                    r['rating'] = r['rating'].split(' out of')[0]
                if r.get('images'):
                    r['images'] = "\n".join(r['images'])
                if r.get('date'):
                    date_posted = r['date'].split('on ')[-1]
                    r['date'] = dateparser.parse(date_posted).strftime('%d %b %Y')
                writer.writerow(r)
            # sleep(5)
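
The per-review cleanup done inside the loop can also be written as a standalone function, which makes it easy to try out on sample data without requesting anything from Amazon. The sketch below is one possible version, not part of the published script: `clean_review` is a hypothetical name, it uses `datetime.strptime` from the standard library in place of dateutil, and it assumes the date text looks like "Reviewed in the United States on June 5, 2020" and the rating text looks like "5.0 out of 5 stars". The sample values in the comments are made up for illustration.

```python
from datetime import datetime


def clean_review(review, product_title, url):
    """Return a cleaned copy of one review dict extracted by Selectorlib.

    Fields that were not found on the page are expected to be None,
    so every step guards against that.
    """
    row = dict(review)
    row["product"] = product_title
    row["url"] = url

    # "Verified Purchase" badge text -> "Yes" / "No"
    row["verified"] = "Yes" if "Verified Purchase" in (row.get("verified") or "") else "No"

    # "5.0 out of 5 stars" -> "5.0"
    if row.get("rating"):
        row["rating"] = row["rating"].split(" out of")[0]

    # "Reviewed in the United States on June 5, 2020" -> "05 Jun 2020"
    if row.get("date"):
        date_text = row["date"].split("on ")[-1]
        row["date"] = datetime.strptime(date_text, "%B %d, %Y").strftime("%d %b %Y")

    # List of image URLs -> one newline-separated string
    if row.get("images"):
        row["images"] = "\n".join(row["images"])
    return row
```

The strict date format is the main trade-off here: `strptime` raises a ValueError on any other layout, whereas dateutil, used in the script above, accepts many date formats.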
    

If you don't like coding or don't want to code, ScrapeHero Cloud is just right for you!

Skip the hassle of installing software, programming and maintaining the code. Download this data using ScrapeHero Cloud within seconds.

Get Started for Free
Deploy to ScrapeHero Cloud

Creating the YAML file – selectors.yml

You will notice in the code above that we used a file called selectors.yml. This file is what makes this tutorial so easy to create and follow. The magic behind this file is a Web Scraper tool called Selectorlib.

Selectorlib is a tool that makes selecting, marking up, and extracting data from web pages visual and easy. The Selectorlib Web Scraper Chrome Extension lets you mark the data that you need to extract, creates the CSS Selectors or XPaths needed to extract that data, and then previews what the extracted data would look like. You can learn more about Selectorlib and how to use it here

If you just need the data we have shown above, you do not need to use Selectorlib, since we have already done that for you and generated a simple “template” that you can just use. However, if you want to add a new field, you can use Selectorlib to add that field to the template.

Here is how we marked up the fields we need from the Product Reviews Page using the Selectorlib Chrome Extension.

Selectorlib Amazon Reviews

Once you have created the template, click on ‘Highlight’ to highlight and preview all of your selectors. Finally, click on ‘Export’ to download the YAML file and save it as selectors.yml.

Here is what our template file (selectors.yml) looks like:

product_title:
    css: 'h1 a[data-hook="product-link"]'
    type: Text
reviews:
    css: 'div.review div.a-section.celwidget'
    multiple: true
    type: Text
    children:
        title:
            css: a.review-title
            type: Text
        content:
            css: 'div.a-row.review-data span.review-text'
            type: Text
        date:
            css: span.a-size-base.a-color-secondary
            type: Text
        variant:
            css: 'a.a-size-mini'
            type: Text
        images:
            css: img.review-image-tile
            multiple: true
            type: Attribute
            attribute: src
        verified:
            css: 'span[data-hook="avp-badge"]'
            type: Text
        author:
            css: span.a-profile-name
            type: Text
        rating:
            css: 'div.a-row:nth-of-type(2) > a.a-link-normal:nth-of-type(1)'
            type: Attribute
            attribute: title
next_page:
    css: 'li.a-last a'
    type: Link
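
The template also extracts a next_page link, which the reviews.py script does not use. One way to collect more than a single page of reviews is to keep following that link. The function below is a rough sketch under stated assumptions: `scrape_all_pages` and `max_pages` are names invented here, and `scrape_page` stands for any function that takes a URL and returns the extracted dictionary (for example, the `scrape` function from reviews.py).

```python
from urllib.parse import urljoin


def scrape_all_pages(start_url, scrape_page, max_pages=10):
    """Collect reviews from start_url and from the pages linked by next_page.

    scrape_page is any callable that takes a URL and returns the extracted
    dict, or None when the page could not be fetched.
    """
    reviews = []
    seen = set()
    url = start_url
    for _ in range(max_pages):
        if not url or url in seen:
            break  # no further page, or a link that loops back
        seen.add(url)
        data = scrape_page(url)
        if not data:
            break
        reviews.extend(data.get("reviews") or [])
        next_link = data.get("next_page")
        # The link may be relative, so resolve it against the current page
        url = urljoin(url, next_link) if next_link else None
    return reviews
```

The `max_pages` cap and the `seen` set are there so that the loop always ends, even if a page keeps linking to itself.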

Previous Versions of the Scraper

If you need a script that runs on older versions of Python, you can view the previous versions of this code to scrape Amazon reviews.

Python 3 (built in 2018) – https://gist.github.com/scrapehero/900419a768c5fac9ebdef4cb246b25cb
Python 2.7 (built in 2016) – https://gist.github.com/scrapehero/3d53ae193766bc51408ec6497fbd1016

Running the Amazon Review Scraper

All you need to do is add the URLs you need to scrape into a text file called urls.txt in the same folder and run the scraper using the command:

python3 reviews.py

Here is an example URL – https://www.amazon.com/HP-Business-Dual-core-Bluetooth-Legendary/product-reviews/B07VMDCLXV/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

You can get this URL by clicking on “See all reviews” near the bottom of the product page.
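
Before running the scraper, it can be useful to check that every line in urls.txt really is a product review page URL. The sketch below does this with a regular expression. The helper names are hypothetical, and the pattern assumes that the ASIN is the 10-character code of uppercase letters and digits that follows /product-reviews/, as in the example URL above.

```python
import re

# Assumption: an ASIN is 10 uppercase letters or digits after /product-reviews/
ASIN_PATTERN = re.compile(r"/product-reviews/([A-Z0-9]{10})(?:[/?]|$)")


def extract_asin(review_url):
    """Return the ASIN from a product review page URL, or None if absent."""
    match = ASIN_PATTERN.search(review_url)
    return match.group(1) if match else None


def load_review_urls(lines):
    """Strip whitespace, skip blank lines, and keep only review page URLs."""
    urls = []
    for line in lines:
        url = line.strip()
        if url and extract_asin(url):
            urls.append(url)
    return urls
```

For example, `load_review_urls(open("urls.txt"))` would return only the lines that look like review pages, with the trailing newlines removed.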

 

Here is what the scraped Amazon reviews look like:

Amazon Review Scraper Data Sample

 

This code can be used to scrape Amazon reviews for a relatively small number of ASINs for your personal projects. If you want to scrape thousands of pages, learn about the challenges in How to build and run scrapers on a large scale.
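
Even for a small number of pages, a request may occasionally come back blocked, in which case the scrape function in reviews.py returns None. A simple way to handle this is to wait and try again, with a longer pause each time. This is only a sketch: `fetch_with_retries` is a hypothetical helper, the retry count and delays are arbitrary, and waiting does not guarantee that the next request will succeed.

```python
import time


def fetch_with_retries(url, fetch, retries=3, base_delay=5, sleep=time.sleep):
    """Call fetch(url) until it returns a result or the retries run out.

    Waits base_delay seconds after the first failure and doubles the wait
    after each further failure. Returns None if every attempt fails.
    """
    for attempt in range(retries):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

In reviews.py, this could be used as `data = fetch_with_retries(url, scrape)` in place of calling scrape directly.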

What can you do with scraped Amazon reviews?

The data that you gather from this tutorial can help you with:

  1. Getting review details that are unavailable through the official Amazon Product Advertising API
  2. Monitoring customer opinions on products that you sell or manufacture using data analysis
  3. Creating Amazon review datasets for educational purposes and research
  4. Monitoring the quality of products sold by third-party sellers
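
As a small illustration of the data analysis use case, the sketch below reads the rating column of data.csv and reports the number of reviews, the average rating, and how often each rating occurs. The function name `summarize_ratings` is made up for this example, and it assumes the rating column holds values such as "5.0", as written by the scraper above.

```python
import csv
from collections import Counter


def summarize_ratings(csv_file):
    """Return (review count, average rating, Counter of ratings) for data.csv rows."""
    ratings = []
    for row in csv.DictReader(csv_file):
        try:
            ratings.append(float(row["rating"]))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric rating
    if not ratings:
        return 0, None, Counter()
    return len(ratings), sum(ratings) / len(ratings), Counter(ratings)


if __name__ == "__main__":
    with open("data.csv", newline="", encoding="utf-8") as f:
        count, average, distribution = summarize_ratings(f)
    print(count, average, dict(distribution))
```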

A few years back, Amazon provided developers and sellers with access to product reviews through its Product Advertising API. It discontinued that access on November 8th, 2010, which prevented customers from embedding Amazon reviews of their products in their websites. As of now, Amazon only returns a link to the review.

Building a Free Amazon Reviews API using Python, Flask & Selectorlib

If you are looking to get reviews through an API, similar to the Amazon Product Advertising API, you may find the tutorial below interesting.

Thanks for reading. If you need help with your complex scraping projects, let us know and we will be glad to help.

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

Responses

Sarah November 25, 2018

how would we get like 100 reviews off the site?

    ScrapeHero November 25, 2018

    You would need to find the link to next page of reviews and parse it similarly as in this tutorial

clarosantiago January 18, 2019

Is there any way to get the product rank as well?

doomhouse May 23, 2019

How would you scrape 12000 products by search query only?

    ScrapeHero May 24, 2019

    Amazon restricts the number it shows and it is far below 12000
