Introduction

In this blog post, I will walk you through how to perform sentiment analysis on text data.

We will be using Razer_mouse_reviews.csv, which I scraped from Amazon using Scrapy in Python, to perform our analysis. This is only a portion of the reviews. The data contains the customer reviews for the product "Razer DeathAdder Essential Gaming Mouse" on Amazon, along with their star ratings. If you want to follow along, you can download my dataset here: Razer_mouse_reviews.csv. You can also scrape your own dataset for any product on Amazon by following this tutorial: scraping Amazon reviews using Python Scrapy. In addition, this study project is inspired by this wonderful tutorial: A Beginner’s Guide to Sentiment Analysis with Python.

Data cleaning

Before we start our analysis, we need to clean up our data.

Prerequisites

First, we need to import the necessary packages. Make sure you have installed all the packages below:

import numpy as np
import pandas as pd
import seaborn as sns
color = sns.color_palette()
import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report

#Optional: silence the SettingWithCopyWarning in the output
import warnings
from pandas.errors import SettingWithCopyWarning  #on older pandas versions: from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

Read the Dataframe

df1 = pd.read_csv("Razer_mouse_reviews.csv")
df1.head()

From the table below, we can see that the stars are strings instead of integer ratings, and that there are many blank lines in front of each comment.

stars comment
0 4.0 out of 5 stars \n\n\n\n\n\n\n\n\n\n \n \n \n For an "el...
1 2.0 out of 5 stars \n\n\n\n\n\n\n\n\n\n \n \n \n Every time...
2 2.0 out of 5 stars \n\n\n\n\n\n\n\n\n\n \n \n \n The unit i...
3 1.0 out of 5 stars \n\n\n\n\n\n\n\n\n\n \n \n \n O.K., Im g...
4 1.0 out of 5 stars \n\n\n\n\n\n\n\n\n\n \n \n \n The mouse ...
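
You can verify this quickly before cleaning (a minimal check, assuming the column names shown above):

print(df1.dtypes)             #both columns are 'object' (string) dtype
print(df1['stars'].unique())  #e.g. ['4.0 out of 5 stars', '2.0 out of 5 stars', ...]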

Clean the Dataframe

Next, we want to replace "1.0 out of 5 stars" with 1, "2.0 out of 5 stars" with 2, "3.0 out of 5 stars" with 3, "4.0 out of 5 stars" with 4, and "5.0 out of 5 stars" with 5 in the stars column. Then we use str.strip() to remove all the whitespace at the beginning of each comment. The code looks like the following:

df1 = df1.replace({"1.0 out of 5 stars": 1, "2.0 out of 5 stars": 2, "3.0 out of 5 stars": 3, "4.0 out of 5 stars": 4, "5.0 out of 5 stars": 5})
#Remove the newlines at the beginning of each comment
df1['comment'] = df1['comment'].str.strip()

Good job! Your data should now look like this:

stars comment
0 4 For an "elite" gaming mouse with impressive fe...
1 2 Every time my computer starts or restarts Syna...
2 2 The unit is just built cheap. Not the quality ...
3 1 O.K., Im going to "throw" this to the air. Nob...
4 1 The mouse left click started to break within t...
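
As a quick sanity check, the stars column should now be numeric (assuming the cleaning above ran):

print(df1['stars'].dtype)           #now int64 instead of object
print(df1['stars'].value_counts())  #number of comments per rating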

Classify Positive and Negative Comments

We will assign comments with a rating of four or five the value 1, meaning positive sentiment, and comments with a rating of one or two the value -1, meaning negative sentiment. Since we are not interested in rating three, the neutral sentiment, we will drop those comments from this analysis. The code below will do the job:

df = df1[df1['stars'] != 3].copy()  #.copy() avoids the SettingWithCopyWarning when we add columns below
#Add one column called sentiment containing the values 1 and -1
df['sentiment'] = df['stars'].apply(lambda rating: +1 if rating > 3 else -1)
#Add another column, sentimentt, containing the labels 'negative' and 'positive'
df['sentimentt'] = df['sentiment'].replace({-1: 'negative', 1: 'positive'})

Your new dataframe will look like this:

stars comment sentiment sentimentt
0 4 For an "elite" gaming mouse with impressive fe... 1 positive
1 2 Every time my computer starts or restarts Syna... -1 negative
2 2 The unit is just built cheap. Not the quality ... -1 negative
3 1 O.K., Im going to "throw" this to the air. Nob... -1 negative
4 1 The mouse left click started to break within t... -1 negative

Exploratory Data Analysis

We want to explore our dataset to see if there are any interesting discoveries and findings. Although our dataset only has two columns, we can still produce plenty of informative graphs and visualizations.

Stars Counts

Next, we can visualize the number of comments in each star rating using seaborn. Make sure you use df1 for this plot, because we dropped rating 3 when creating df.

#set plot theme
sns.set_theme(style="darkgrid")
#Specify the figure size
plt.figure(figsize=(15,8))

ax = sns.countplot(x="stars", data = df1, palette="Blues")
ax.set_title("Number of Comments in Each Rating ", fontsize=20)
ax.set_xlabel("Star Rating",fontsize=15)
ax.set_ylabel("Number of Comments",fontsize=15)
plt.show()

The result plot looks like this:

From the plot above, we can see that the majority of the customer ratings are positive. On the other hand, almost 100 comments are rated one or two stars. Therefore, we can build a model to predict a customer's sentiment based on their comment. We will talk about modeling in a later section.

Sentiment Counts

Now, we can take a closer look. We want to see the number of positive comments and the number of negative comments:

sns.set_theme(style="darkgrid")
plt.figure(figsize=(15,8))
ax = sns.countplot(x="sentimentt", data = df, palette="coolwarm")
ax.set_title("Product Sentiment", fontsize=20)
ax.set_xlabel("Sentiment",fontsize=15)
ax.set_ylabel("Count",fontsize=15)
plt.show()

Most Frequent Words

Next, we can use WordCloud to find the most frequent words that appear in the comments.

# Create stopword list 
stopwords = set(STOPWORDS)
stopwords.update(["mouse", "Razer", "gaming"])
#Join the words of every comment into one string, separated by spaces
textt = " ".join(review for review in df.comment)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(textt)
#Plot the word cloud to show the most frequent words in the image
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
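
Optionally, you can save the generated cloud as an image file; WordCloud provides a to_file method (the filename here is just an example):

wordcloud.to_file("wordcloud_all.png")  #writes the rendered image straight to disk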

The result will look like this:

From the image above, we can see that the words "great", "feel", "use", "button", "one", and "software" are the most frequent. Next, we can also find the most frequent words in the positive and in the negative comments by splitting the comments by sentiment.

Most Frequent Words in the Positive Comments

positive = df[df['sentiment'] == 1]

stopwords = set(STOPWORDS)
stopwords.update(["mouse", "Razer", "gaming", "use", "button"]) 
## good and great removed because they were included in negative sentiment
pos = " ".join(review for review in positive.comment)
wordcloud2 = WordCloud(stopwords=stopwords, background_color = "white").generate(pos)
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis("off")
plt.show()

Most Frequent Words in the Negative Comments

negative = df[df['sentiment'] == -1]
negative = negative.dropna()  #drop rows with missing comments so the join below works

neg = " ".join(review for review in negative.comment)
wordcloud3 = WordCloud(stopwords=stopwords, background_color = "white").generate(neg)
plt.imshow(wordcloud3, interpolation='bilinear')
plt.axis("off")
plt.show()

Data Modeling

Finally, we come to the most exciting part: data modeling. We want to use logistic regression to predict whether a comment is positive or negative. Before doing so, we have to do some preprocessing.

Preprocessing

Step one: Remove punctuations

You can use the following function to remove punctuation from the comments:

def remove_punctuation(text):
    if not isinstance(text, str):  #guard against missing (NaN) comments
        return ""
    final = "".join(u for u in text if u not in ("?", ".", ";", ":", "!", '"'))
    return final
df['comment'] = df['comment'].apply(remove_punctuation)
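
A quick check on a sample string shows what the function does (the text here is just a made-up example):

print(remove_punctuation('Great mouse; works well!'))  #prints: Great mouse works well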

Your new dataframe will look like the following table:

stars comment sentiment sentimentt
0 4 For an elite gaming mouse with impressive feat... 1 positive
1 2 Every time my computer starts or restarts Syna... -1 negative
2 2 The unit is just built cheap Not the quality p... -1 negative
3 1 OK, Im going to throw this to the air Nobody a... -1 negative
4 1 The mouse left click started to break within t... -1 negative

Step two: Select the Feature for Modeling

In this example, we only need two columns: comment as the feature and sentiment as the target, where 1 means positive and -1 means negative.

dfNew = df[['comment','sentiment']]
dfNew.head()

The selected columns look like this:

comment sentiment
0 For an elite gaming mouse with impressive feat... 1
1 Every time my computer starts or restarts Syna... -1
2 The unit is just built cheap Not the quality p... -1
3 OK, Im going to throw this to the air Nobody a... -1
4 The mouse left click started to break within t... -1

Step three: Split Train and Test Data

We can randomly split our data using the train_test_split function from scikit-learn, which we already imported at the very beginning. 80% of our data will be used for training, and 20% for testing. There are many other ways to split your data, so feel free to use your own method.

train, test = train_test_split(df, test_size=0.2)
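
If you want a reproducible split, or one that preserves the ratio of positive to negative comments in both sets, train_test_split also accepts random_state and stratify (an optional variant, not required for the rest of the tutorial):

train, test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['sentiment'])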

Step four: Vectorize Comments

In this step, we want to convert the collection of text comments into a matrix of token counts, because the logistic regression algorithm cannot work with raw text.

We will use a count vectorizer from the scikit-learn library to transform the text comments into a bag-of-words model, which finds the unique words in the comments and counts the occurrences of each word in each comment.

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')  

#vectorize both train data and test data
train_matrix = vectorizer.fit_transform(train['comment'])
test_matrix = vectorizer.transform(test['comment'])
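
It can be helpful to peek at what the vectorizer learned (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names):

print(train_matrix.shape)                       #(number of training comments, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])  #the first few tokens in the vocabulary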

Logistic Regression

Finally, we can use LogisticRegression from the scikit-learn library to fit our training data and make predictions on our test data.

lr = LogisticRegression()

#Split target and independent variables
X_train = train_matrix
X_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']

#Fit model on data
lr.fit(X_train,y_train)

#Make predictions
predictions = lr.predict(X_test)
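
As a quick first look at performance, LogisticRegression also has a built-in score method that returns the mean accuracy on the given data:

print(lr.score(X_test, y_test))  #mean accuracy on the test set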

Data Validation

We can evaluate our model's accuracy using the confusion matrix, which we imported from sklearn at the very beginning:

#scikit-learn's convention is confusion_matrix(y_true, y_pred)
cf_matrix = confusion_matrix(y_test, predictions)

#Display our confusion matrix in a heatmap:
group_names = ["True Negative","False Positive","False Negative","True Positive"]
group_counts = ["{0:0.0f}".format(value) for value in
                cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
plt.show()

We can also generate a classification report to validate our model (your exact numbers will vary, because the train/test split is random):

print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

          -1       0.67      0.86      0.75        14
           1       0.97      0.91      0.94        70

    accuracy                           0.90        84
   macro avg       0.82      0.89      0.85        84
weighted avg       0.92      0.90      0.91        84

The overall accuracy of the model on the test data is around 90%, which is pretty good given that our dataset is not very large. Keep in mind that the classes are imbalanced (there are far more positive comments than negative ones), so it is worth checking the per-class precision and recall above rather than relying on accuracy alone.

Thank you for reading! I hope this tutorial helps you understand the basics of sentiment analysis, WordCloud, and logistic regression. If you have any questions, feel free to comment below. Good luck everyone, you are on the right track.