Sentiment Analysis on Amazon Reviews with Python
A study project covering data cleaning, word clouds, visualizations in seaborn, and logistic regression on text data
In this blog, I will guide you through how to perform sentiment analysis on text data.
We will be using Razer_mouse_reviews.csv, which I scraped from Amazon using Scrapy in Python, to perform our analysis. This is only a portion of the reviews. The data contains the customer reviews for the product "Razer DeathAdder Essential Gaming Mouse" on Amazon along with their star ratings. If you want to follow along, you can download my dataset here: Razer_mouse_reviews.csv. You can also scrape your own dataset for any product you want on Amazon by following this tutorial: scraping amazon reviews using python scrapy. In addition, this study project is inspired by this wonderful tutorial: A Beginner’s Guide to Sentiment Analysis with Python
Before we start our analysis, we need to clean up our data.
import numpy as np
import pandas as pd
import seaborn as sns
color = sns.color_palette()
import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report
#Optional: ignore the SettingWithCopyWarning messages in the output
import warnings
#In recent pandas versions this warning lives in pandas.errors
#(older versions exposed it in pandas.core.common)
from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)
df1 = pd.read_csv("Razer_mouse_reviews.csv")
df1.head()
From the table below, we can see the stars are strings instead of integer ratings, and there are many blank lines in front of each comment.
Next, we want to replace "1.0 out of 5 stars" with 1, "2.0 out of 5 stars" with 2, "3.0 out of 5 stars" with 3, "4.0 out of 5 stars" with 4, and "5.0 out of 5 stars" with 5 in the stars column. Then we use str.strip() to remove all the whitespace at the beginning of each comment. The code looks like the following:
df1 = df1.replace({"1.0 out of 5 stars": 1, "2.0 out of 5 stars": 2, "3.0 out of 5 stars": 3, "4.0 out of 5 stars": 4, "5.0 out of 5 stars": 5})
#Remove the newlines at the beginning of each comment
df1['comment'] = df1['comment'].str.strip()
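Before moving on, an optional sanity check (my own addition, not part of the original tutorial) is to confirm the replacement worked and the stars column now holds integers:
#The unique values should now be the integers 1 through 5
print(df1['stars'].unique())
print(df1['stars'].dtype)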
Good job! Your data should look like this:
We will assign comments with a rating of four or five as 1, which means positive sentiment, and comments with a rating of one or two as -1, which means negative sentiment. Since we are not interested in rating three, the neutral sentiment, in this analysis, we will drop the comments with rating three. The code below will do the job:
df = df1[df1['stars'] != 3]
#Add a column called sentiment containing the values 1 and -1
df['sentiment'] = df['stars'].apply(lambda rating: 1 if rating > 3 else -1)
#Add another column called sentimentt containing the labels 'negative' and 'positive'
df['sentimentt'] = df['sentiment'].replace({-1: 'negative', 1: 'positive'})
Your new dataframe will look like this:
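As a quick sanity check (optional, and just a suggestion of mine), you can count how many comments ended up in each sentiment class:
#Count the comments per sentiment label
print(df['sentimentt'].value_counts())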
We want to explore our dataset to see if there are any interesting discoveries or findings. Although our dataset only has two columns to start with, we can still make a lot of fancy graphs and visualizations.
Next, we can visualize the number of comments in each star rating using seaborn. Make sure you use df1 for this plot, because we deleted rating 3 in df.
#set plot theme
sns.set_theme(style="darkgrid")
#Specify the figure size
plt.figure(figsize=(15,8))
ax = sns.countplot(x="stars", data=df1, palette="Blues")
ax.set_title("Number of Comments in Each Rating", fontsize=20)
ax.set_xlabel("Star Rating",fontsize=15)
ax.set_ylabel("Number of Comments",fontsize=15)
plt.show()
The resulting plot looks like this:
From the plot above, we can see that the majority of the customer ratings are positive. On the other hand, almost 100 comments are rated one or two stars. Therefore, we can build a model to predict customers' ratings based on their comments. We will talk about modeling in a later section.
Now, we can take a closer look. We want to see the number of positive comments and the number of negative comments:
sns.set_theme(style="darkgrid")
plt.figure(figsize=(15,8))
ax = sns.countplot(x="sentimentt", data = df, palette="coolwarm")
ax.set_title("Product Sentiment", fontsize=20)
ax.set_xlabel("Sentiment",fontsize=15)
ax.set_ylabel("Count",fontsize=15)
plt.show()
Next, we can use WordCloud to find the most frequent words that appear in the comments.
# Create stopword list
stopwords = set(STOPWORDS)
stopwords.update(["mouse", "Razer", "gaming"])
#Join all the comments into one string, separated by spaces
#dropna() skips missing comments, which would otherwise break the join
textt = " ".join(review for review in df.comment.dropna())
wordcloud = WordCloud(stopwords=stopwords, background_color = "white").generate(textt)
# Plot the word cloud to show the most frequent words in the image
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
The result will look like this:
From the image above, we can see that the words "great", "feel", "use", "button", "one", and "software" are the most frequent. Next, we can also find the most frequent words in the positive and the negative comments by splitting the comments by sentiment.
positive = df[df['sentiment'] == 1]
stopwords = set(STOPWORDS)
stopwords.update(["mouse", "Razer", "gaming", "use", "button"])
## good and great removed because they were included in negative sentiment
pos = " ".join(review for review in positive.comment)
wordcloud2 = WordCloud(stopwords=stopwords, background_color = "white").generate(pos)
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis("off")
plt.show()
negative = df[df['sentiment'] == -1]
negative = negative.dropna()
neg = " ".join(review for review in negative.comment)
wordcloud3 = WordCloud(stopwords=stopwords, background_color = "white").generate(neg)
plt.imshow(wordcloud3, interpolation='bilinear')
plt.axis("off")
plt.show()
Finally, we come to the most exciting part: data modeling. We want to use logistic regression to predict whether a comment is positive or negative. Before doing so, we have to do some preprocessing.
You can use the following function to remove punctuation from the comments:
def remove_punctuation(text):
    final = "".join(u for u in text if u not in ("?", ".", ";", ":", "!", '"'))
    return final
#Drop rows with missing comments first, then strip the punctuation
df = df.dropna(subset=['comment'])
df['comment'] = df['comment'].apply(remove_punctuation)
Your new dataframe will look like the following table:
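Note that remove_punctuation only strips the six characters listed in the tuple. If you would rather remove all punctuation, one possible alternative is a sketch using Python's built-in string module and str.translate; the function name remove_all_punctuation is my own, not from the original tutorial:
import string

def remove_all_punctuation(text):
    #Map every punctuation character to None via a translation table
    return text.translate(str.maketrans('', '', string.punctuation))

#Equivalent usage, if you prefer this version:
#df['comment'] = df['comment'].apply(remove_all_punctuation)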
In this example, our model only uses two columns: comment as the feature and sentiment as the target (1 for positive, -1 for negative).
dfNew = df[['comment','sentiment']]
dfNew.head()
Your selected features are the following columns:
We can randomly split our data using the train_test_split function from the sklearn package, which we already imported at the very beginning. 80% of our data will be used for training, and 20% will be used for testing. There are many other ways to split your data; feel free to use your own.
#Split dfNew (the two selected columns) into 80% training and 20% test data
train, test = train_test_split(dfNew, test_size=0.2)
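If you want a reproducible split, you can also pass a random_state, and stratify on the sentiment column so both splits keep the same positive/negative ratio. This is just an optional variation, not something the original analysis requires:
#Optional: reproducible, stratified split (the seed 42 is arbitrary)
train, test = train_test_split(dfNew, test_size=0.2, random_state=42, stratify=dfNew['sentiment'])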
In this step, we want to convert the collection of text comments into a matrix of token counts, because the logistic regression algorithm cannot understand raw text.
We will use a count vectorizer from the Scikit-learn library to transform the text comments into a bag-of-words model, which finds the unique words in the comments and counts the occurrences of each word in each comment.
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
#vectorize both train data and test data
train_matrix = vectorizer.fit_transform(train['comment'])
test_matrix = vectorizer.transform(test['comment'])
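If you are curious what the bag of words looks like, you can peek at the matrix shape and a few learned tokens. This assumes a reasonably recent scikit-learn version that provides get_feature_names_out:
#Rows are comments, columns are unique words in the training vocabulary
print(train_matrix.shape)
#Show the first 20 tokens the vectorizer learned
print(vectorizer.get_feature_names_out()[:20])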
Finally, we can use LogisticRegression from the Scikit-learn library to fit our training data and make predictions on our test data.
lr = LogisticRegression()
#Separate the independent variables (X) from the target (y)
X_train = train_matrix
X_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
#Fit model on data
lr.fit(X_train,y_train)
#Make predictions
predictions = lr.predict(X_test)
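You can also spot-check the model on a made-up comment. The sentence below is just an invented example; any string works as long as it goes through the same fitted vectorizer:
#Vectorize a new comment with the SAME vectorizer, then predict its sentiment
sample = vectorizer.transform(["This mouse feels great and the buttons are responsive"])
print(lr.predict(sample))
#Output: 1 means positive, -1 means negative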
We can evaluate our model's accuracy using the confusion matrix, which we imported from the sklearn package at the very beginning:
#Note: confusion_matrix expects the true labels first, then the predictions
cf_matrix = confusion_matrix(y_test, predictions)
#Display our confusion matrix in a heatmap
group_names = ["True Negative", "False Positive", "False Negative", "True Positive"]
group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
plt.show()
We can also generate a classification report to validate our model accuracy:
print(classification_report(y_test, predictions))
The overall accuracy of the model on the test data is around 90%, which is pretty good since our dataset is not very large.
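If you want the accuracy as a single number instead of reading it off the report, a one-line sketch with scikit-learn's accuracy_score (or the model's built-in score method) will do:
from sklearn.metrics import accuracy_score

#Fraction of test comments whose sentiment was predicted correctly
print(accuracy_score(y_test, predictions))
#Equivalently: print(lr.score(X_test, y_test))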
Thank you for reading! I hope this tutorial helps you understand the basics of sentiment analysis, WordCloud, and logistic regression. If you have any questions, feel free to comment below. Good luck everyone, you are on the right track.