Analysing Amazon Product Reviews Using LDA Topic Modelling

With the boom in the number of online buyers and the simultaneous influx of reviews, understanding user experience is becoming an increasingly challenging task. Reviews speak volumes about a product, the seller and delivery partners. However, making sense of such a myriad of customer feedback can be tricky. This tutorial helps you understand better ways of retrieving and structuring product reviews to draw powerful insights.

For our use case here, we will be using reviews of Amazon Echo.

Let’s get started

Before we get started with the program, let’s make sure we have all the necessary tools and libraries.

The program below is written for Python 3. Install Python 3 and pip by following this guide.

Install the required libraries using the command below:

pip3 install pandas nltk gensim pyLDAvis

Load the libraries

We are going to need quite a few libraries. Let’s load them.

import re # We clean text using regex
import csv # To read the csv
from collections import defaultdict # For accumulating values
from nltk.corpus import stopwords # To remove stopwords
from gensim import corpora # To create corpus and dictionary for the LDA model
from gensim.models import LdaModel # To use the LDA model
import pyLDAvis.gensim # To visualise LDA model effectively
import pandas as pd

If you are using the NLTK stopwords for the first time, you might have to download them first.

import nltk
nltk.download('stopwords')

Loading Amazon Echo Review Data

Here is a sample dataset for Amazon Echo reviews.

Download Sample Dataset here

Load the review data from the CSV into a dictionary of columns:

fileContents = defaultdict(list)
with open('reviews_sample.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader: # read a row as {column1: value1, column2: value2,...}
        for (k,v) in row.items(): # go over each column name and value 
            fileContents[k].append(v) # append the value into the appropriate list

Extract just the reviews into a list using

reviews = fileContents['review_body']
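Since we already import pandas, the same extraction can be written more compactly with pandas.read_csv. Here is a minimal sketch using an inline stand-in for the sample CSV (the review_body column name matches the dataset above; the rows are made up):

```python
import io

import pandas as pd

# A tiny stand-in for the sample CSV; the real file has the same header.
csv_data = io.StringIO(
    "review_id,review_body\n"
    "1,Love my Echo!\n"
    "2,Great speaker for the price\n"
)

df = pd.read_csv(csv_data)
reviews = df['review_body'].astype(str).tolist()
print(reviews)  # ['Love my Echo!', 'Great speaker for the price']
```

Either approach gives you the same list of raw review strings to clean up next.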

Cleaning Up The Data


Let’s remove all punctuation.

reviews = [re.sub(r'[^\w\s]','',str(item)) for item in reviews]
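To see what that substitution does, here is a quick check on a made-up review: everything that is neither a word character nor whitespace is stripped, which also removes text smileys.

```python
import re

sample = "Love it!!! Works great."
cleaned = re.sub(r'[^\w\s]', '', sample)
print(cleaned)  # Love it Works great
```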


The reviews we have contain a lot of words that aren’t really necessary for our study. These are called stopwords. We will remove them from our text while converting our reviews to tokens.

We use the NLTK stopwords.

stop_words = set(stopwords.words('english'))

Naming the set stop_words avoids shadowing the stopwords corpus we just imported.

Let’s remove those stopwords while converting the reviews list to a list of reviews which are split into the words that matter. The list would look like this: [[word1 of review1, word2 of review1, …], [word1 of review2, word2 of review2, …], …].

texts = [[word for word in document.lower().split() if word not in stop_words] for document in reviews]
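On a couple of made-up reviews, the lowercasing, splitting and filtering look like this (the tiny stopword set is an assumption standing in for NLTK's English list):

```python
# A tiny stand-in for NLTK's English stopword list (assumption for this demo).
demo_stop_words = {'i', 'my', 'the', 'and', 'are'}

demo_reviews = ["I love my Echo", "The sound and the speaker are great"]
demo_texts = [[word for word in doc.lower().split() if word not in demo_stop_words]
              for doc in demo_reviews]
print(demo_texts)  # [['love', 'echo'], ['sound', 'speaker', 'great']]
```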

Taking out the less frequent words

One of the easiest markers of how important a certain word is in a text (stopwords excepted) is how many times it occurs. If it occurs just once, it is likely irrelevant in the context of topic modeling. Let’s filter those words out.

frequency = defaultdict(int)
for text in texts:
    for token in text:
         frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]
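The effect of that filter is easy to see on a toy corpus: any token that appears only once anywhere in the corpus is dropped (the example data here is made up).

```python
from collections import defaultdict

demo_texts = [['echo', 'great', 'speaker'], ['echo', 'great'], ['terrible']]

frequency = defaultdict(int)
for text in demo_texts:
    for token in text:
        frequency[token] += 1

# Keep only tokens that occur more than once across the whole corpus.
demo_texts = [[token for token in text if frequency[token] > 1] for text in demo_texts]
print(demo_texts)  # [['echo', 'great'], ['echo', 'great'], []]
```

Note that a review can end up empty if all of its words were singletons.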

Begin processing

Turning our text into a dictionary

A dictionary in this context is a mapping between words and their integer ids. We know that a machine can’t understand words and documents as they are, so we split and vectorize them. As well written here,

In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:

“How many times does the word system appear in the document? Once.”

It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary. You can refer to this link to learn more.

dictionary = corpora.Dictionary(texts)

If you print the dictionary, you can see the number of unique tokens in it.

Dictionary(13320 unique tokens: ['warmth', 'exclusively', 'orchestral', 'techy', 'formal']...)

This also means that each document will now be represented by a 13320-dimensional vector, one dimension per unique token. To actually convert tokenized documents to vectors,

corpus = [dictionary.doc2bow(text) for text in texts]

doc2bow counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector: a list of tuples of the form [(word id, count), …].

So if a document’s vector reads [(0,1),(1,4)], the word with id 0 occurred once and the word with id 1 occurred 4 times in that document. Now that we have our reviews in a language the machine can understand, let’s get to finding topics in them.

What is an LDA Model?

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one such model, used to classify the text in a document under particular topics. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.


For the sake of this tutorial, we will be using the gensim implementation of the LDA model. The kind of model best suited for topic modeling largely depends on your type of data; LDA is used here for illustration. For more accurate results on short documents, use a topic model designed for short texts.

Let’s go with nine topics for now. The number of topics you give is largely a guess; the model assumes the document collection contains that many topics. You may use gensim’s CoherenceModel to find an optimal number of topics.

NUM_TOPICS = 9 # This is an assumption.
ldamodel = LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15) # This might take some time.


There you go, you have built your model! Explaining the algorithm behind LDA is beyond the scope of this tutorial; this has a good explanation of the same.


Extracting Topics from your model

Let’s see the topics. Note that you might not get the exact results shown here. The objective function for LDA is non-convex, making it a multimodal problem. In layman’s terms, LDA topic modeling won’t give you one single best solution; it’s an optimization problem that yields locally optimal solutions, so you cannot expect any given run to outperform a run from a different starting point. To know more, check out this discussion on Stack Exchange.

topics = ldamodel.show_topics()
for topic in topics:
    print(topic)
(0, '0.229*"love" + 0.049*"alexa" + 0.040*"gift" + 0.027*"loves" + 0.026*"christmas" + 0.023*"family" + 0.023*"bought" + 0.023*"awesome" + 0.017*"best" + 0.016*"absolutely"')
(1, '0.017*"amazon" + 0.017*"app" + 0.012*"device" + 0.010*"alexa" + 0.010*"work" + 0.010*"wifi" + 0.009*"phone" + 0.008*"time" + 0.008*"set" + 0.008*"get"')
(2, '0.037*"home" + 0.034*"smart" + 0.029*"lights" + 0.028*"music" + 0.025*"turn" + 0.021*"control" + 0.021*"amazon" + 0.020*"alexa" + 0.016*"devices" + 0.014*"use"')
(3, '0.042*"music" + 0.033*"alexa" + 0.022*"weather" + 0.020*"play" + 0.020*"questions" + 0.020*"use" + 0.018*"ask" + 0.015*"things" + 0.012*"list" + 0.012*"news"')
(4, '0.035*"like" + 0.031*"alexa" + 0.029*"dont" + 0.022*"doesnt" + 0.020*"would" + 0.019*"know" + 0.017*"say" + 0.015*"get" + 0.015*"cant" + 0.014*"really"')
(5, '0.091*"one" + 0.020*"bought" + 0.017*"got" + 0.016*"im" + 0.015*"another" + 0.014*"buy" + 0.013*"get" + 0.013*"day" + 0.012*"first" + 0.012*"would"')
(6, '0.155*"echo" + 0.145*"dot" + 0.039*"love" + 0.024*"room" + 0.019*"bedroom" + 0.018*"dots" + 0.016*"house" + 0.015*"use" + 0.013*"music" + 0.012*"great"')
(7, '0.061*"speaker" + 0.039*"sound" + 0.031*"good" + 0.027*"bluetooth" + 0.021*"better" + 0.020*"speakers" + 0.020*"quality" + 0.018*"echo" + 0.016*"great" + 0.015*"small"')
(8, '0.145*"great" + 0.068*"works" + 0.062*"product" + 0.054*"easy" + 0.052*"fun" + 0.041*"use" + 0.026*"set" + 0.023*"device" + 0.022*"little" + 0.020*"well"')
This looks too messy to understand easily. Let’s print the topics in a better view using a pandas DataFrame.

word_dict = {}
for i in range(NUM_TOPICS):
    words = ldamodel.show_topic(i, topn=20)
    word_dict['Topic # ' + '{:02d}'.format(i + 1)] = [word for word, prob in words]
pd.DataFrame(word_dict)

You get a DataFrame listing the top 20 words for each topic.

Observing the words, we can make the initial insights below. Tweaking the number of passes and topics might yield better topics and results.

Topic 1 – Alexa makes a great Christmas gift among families. Kids and adults like it alike.

Topic 2 – There is some noise around wifi connectivity, likely negative since ‘problem’ is one of the top 8 contributors to the topic.

Topic 3 – Users are talking about how the Amazon Echo interacts with elements at home like music, radio and lights, turning their houses into smart homes.

Topic 4 – The Amazon Echo reviewed on everyday tasks like playing music or reporting the weather, news, etc. Some discussion on Bluetooth connectivity too.

Topic 5 – Amazon Echo when compared with Google Home cannot answer a lot of questions.

Topic 6 – Amazon Echo is recommended, positive reviews.

Topic 7 – Users are talking about their experience with the Echo in their bedrooms and living rooms, likely positive.

Topic 8  – Reviews on speaker sound quality and Bluetooth connectivity.

Topic 9 – In spite of the price, Echo is recommended. General positive reviews.


Visualization using PyLDAvis

PyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data, by showing them visually. Let’s see ours.

lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

This gives us the visualization below.



To interact with the result, click here. This will help us infer even more from our data.

Some insights from topic modeling are:

  1. We observe that Topics 4 and 5 have some reviews in common. Reviews that talk about how the Amazon Echo helps with everyday tasks seem to frequently compare the Echo with Google Home.
  2. A small interaction between Topics 2 and 5 indicates the Echo was compared with Google Home on Wi-Fi connectivity issues too.
  3. The interaction between Topics 2 and 3 suggests that a few of the top problems customers complain about and compare on are Wi-Fi connectivity, answering simple questions and helping users with everyday tasks.
  4. ‘Good’ is the third biggest contributor to Topic 8. This shows that speaker sound quality is a strong point and could be emphasized in advertising.
  5. Topic 1 suggests that the Echo makes a great gift, especially during the Christmas season. Increased attention to advertising from this angle is suggested during that season.

You can get better keywords if you perform stemming or lemmatization on your text. Tweak the number of topics and passes to see what gives you the best results to study from. You can also try using the CoherenceModel to find an optimal model, though this might give you very generic topics.
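For example, NLTK’s PorterStemmer (no extra download needed) collapses inflected forms so that ‘loves’, ‘loved’ and ‘love’ count as one token; lemmatizing with WordNetLemmatizer is an alternative, but it requires downloading the wordnet data first.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ['loves', 'loved', 'playing', 'speakers']
print([stemmer.stem(t) for t in tokens])  # ['love', 'love', 'play', 'speaker']
```

You would apply the stemmer to every token in texts before building the dictionary.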





Heikki January 30, 2019

This code does not work when there are special characters like smileys and hand gestures inside the reviews.

How do you get rid of them?

I ran the code as written but keep getting this error:

…in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: ‘charmap’ codec can’t decode byte 0x9d in position 6272: character maps to

    Enfa February 1, 2019

    Hi Heikki,

    I am going to assume that you got the error when loading your dataset since the line

    reviews=[re.sub(r'[^\w\s]','',str(item)) for item in reviews]

    will remove all punctuations as well as emojis.

    The error is thrown because Python 3 opens your file with a default encoding that doesn’t match its contents. To fix this, find the encoding of your input file, let’s say it’s utf-8, and specify it using the encoding parameter of the open function as shown below.

    with open('reviews_sample_customer.csv','r', encoding='utf-8') as f:

    If the error persists, please be sure to post the entire traceback and we’ll fix it together. Hope this helps. Do let us know!

