1. Introduction

Do you like to drink wine? Can you guess the grape type after drinking it? If not, you are in the right place. This project uses some simple supervised machine learning techniques to predict the grape variety a bottle of wine was made from. We will use the wine descriptions to find the similarities and differences between wines, then predict the grape variety from the description.

In fact, all the descriptions were written by wine experts called sommeliers. It takes them years of practice to master blind tasting. Blind tasting a wine means tasting it with no idea of its grape variety, origin, vintage, or any evidence really other than the liquid in front of you.

Actually, this can also be done with machine learning and data mining. However, the computer can't fully replace the sommeliers: we need to provide the machine with some basic characterizations of the wine, like the description, in order to predict the grape variety.

This is the first part of the project, in which I will focus on exploring the similarities between wine descriptions using some simple Natural Language Processing (NLP) techniques. In addition, this whole project is reimplemented based on this Wine Project, using spaCy instead of the NLTK package.

2. Data Cleaning

In this project, I scraped my own data set from Majestic using selenium. If you want to follow along, you can download my pre-cleaned data from here: my_wine_data_clean.csv. You can also scrape your own data from Majestic, Wine Enthusiast, Bibendum, etc. Another option is to download the big wine dataset from Kaggle published by zackthoutt, which is a good dataset to start with. We will only use the wine descriptions and grape varieties in this project.

First, we need to import all the necessary packages. Make sure you have installed all the libraries on your computer if you want to follow along (spaCy's small English model also needs to be downloaded once, e.g. with python -m spacy download en_core_web_sm).

import pandas as pd
import numpy as np
from functools import reduce

# function to split data into train and test samples
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# packages for visualization
import seaborn as sns
import matplotlib.pyplot as plt

from collections import Counter
import spacy
nlp = spacy.load("en_core_web_sm")
plt.style.use('seaborn-darkgrid')
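
Note that matplotlib 3.6 renamed the bundled seaborn styles, so on newer releases the last line above may raise an error. If it does, the v0_8-prefixed name should work instead:

# on matplotlib >= 3.6 the old seaborn style names were renamed
plt.style.use('seaborn-v0_8-darkgrid')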

Then we will load our wine data. My data has 503 rows and 3 columns: the wine name, grape variety, and description (plus a saved index column that pandas reads in as Unnamed: 0). Below you can see the first five rows of my dataframe. In this project we are mainly interested in the grape_variety and description columns.

a0 = pd.read_csv("my_wine_data_clean.csv")
a0.head()
Unnamed: 0 name grape_variety description
0 0 Edouard Delaunay 'Septembre' Chardonnay 2019, ... Chardonnay One half of the Abbotts & Delaunay powerhouse,...
1 1 The Ned Waihopai River Sauvignon Blanc 2020 Ma... Sauvignon Blanc Our bestselling white wine. Winemaker Brent Ma...
2 2 The King's Favour Sauvignon Blanc 2019/20 Marl... Sauvignon Blanc Brent Marris is winemaking royalty. And it tur...
3 3 Château Livran 2014, Médoc Cabernet Sauvignon Château Livran once belonged to both Edward I ...
4 4 Cave de Lugny 'Reserve' Mâcon-Chardonnay 2019 Chardonnay The small village of Mâcon-Chardonnay isn’t co...

Next, we want to find out which grape varieties are in our dataframe, and how many rows there are for each type.

df_val_counts = pd.DataFrame(a0['grape_variety'].value_counts())
df_val_counts.head(10)
grape_variety
Chardonnay 219
Pinot Noir 101
Cabernet Sauvignon 92
Sauvignon Blanc 91

From the table above, we can see there are four grape types in our dataframe, and each has an acceptable number of samples. Therefore, we are going to predict whether a wine is made of Chardonnay, Pinot Noir, Cabernet Sauvignon, or Sauvignon Blanc based on its description.

First, we need to split our data into test and train set:

combined_features = ['description', 'grape_variety']
target = 'grape_variety'

X_train, X_test, y_train, y_test = train_test_split(a0[combined_features], a0[target], 
                                                    test_size=0.30, random_state=42)
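
Since the classes are imbalanced (219 Chardonnay rows versus roughly 90–100 for each of the others), you could optionally stratify the split so that both sets keep the same class proportions. A minimal variant, not used in the rest of this post:

# optional: a stratified split keeps the class proportions equal in train and test
X_train, X_test, y_train, y_test = train_test_split(a0[combined_features], a0[target],
                                                    test_size=0.30, random_state=42,
                                                    stratify=a0[target])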

The following bar chart shows the number of descriptions in each grape variety:

# set plot theme and font size
sns.set(style="darkgrid", font_scale=1.2)
# specify the figure size
plt.figure(figsize=(10,6))

ax = sns.countplot(x="grape_variety", data=X_train, palette="pastel", 
                   order=X_train['grape_variety'].value_counts().index)
ax.set_title("Number of Descriptions in Grape Variety", fontsize=20)
ax.set_xlabel('Grape Variety')  # the varieties are on the x-axis
ax.set_ylabel('Count')
plt.show()

Then, let's combine all the descriptions for each grape variety into one row, so we can find some patterns to distinguish them.

grouped = X_train[['grape_variety', 'description']].groupby(['grape_variety']).agg(
    {'description': lambda z: reduce(lambda x, y: x + y, z)}  # concatenate all descriptions into one string
)
grouped["description"] = grouped["description"].str.lower()
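
If the reduce call looks opaque: it simply folds string concatenation over the group's descriptions. A minimal illustration:

from functools import reduce
print(reduce(lambda x, y: x + y, ['crisp apple. ', 'citrus zest. ', 'buttery oak.']))
# crisp apple. citrus zest. buttery oak.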

You will get the following dataframe:

description
grape_variety
Cabernet Sauvignon château livran once belonged to both edward i ...
Chardonnay produced exclusively from top quality chardonn...
Pinot Noir you might think california is too hot for the ...
Sauvignon Blanc this dessert wine from marisco vineyards' king...

2.1 Word Tokenization

We will need to define a function to tokenize our descriptions, so we are able to compare the words used for each grape variety. In the function word_count_df below, the parameter df is the input dataframe and src_col is the column in df that you want to tokenize. The function returns a pandas dataframe with three columns: grape, token, and count. You can also change the column names via the out_col parameter.

Basically, we use a spaCy model to tokenize the descriptions for each grape variety, then count the number of appearances of each unique word in each description. Finally, the word counts are returned as a pandas dataframe.
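
Counter does the heavy lifting here; a one-liner illustration of how it tallies tokens:

from collections import Counter
print(Counter("ripe plum and ripe cherry".split()))
# Counter({'ripe': 2, 'plum': 1, 'and': 1, 'cherry': 1})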

def word_count_df(df, src_col, out_col=('grape', 'token', 'count')):
    dfp = pd.DataFrame()
    for i in df.index:
        # run the spaCy pipeline on this variety's combined description
        doc = nlp(df.loc[i, src_col])
        counts = Counter()
        for token in doc:
            counts[token.text] += 1
        # one row per unique token: (grape, token, count)
        dftmp = pd.DataFrame(dict(zip(out_col, ([i] * len(counts), list(counts.keys()), list(counts.values())))))
        dfp = pd.concat([dfp, dftmp], ignore_index=True)
    return dfp

Let's call our word_count_df function:

tkn_count = word_count_df(grouped, 'description', out_col=('grape', 'token', 'count'))

Your tkn_count will look like the following table:

grape token count
0 Cabernet Sauvignon château 14
1 Cabernet Sauvignon livran 2
2 Cabernet Sauvignon once 3
3 Cabernet Sauvignon belonged 1
4 Cabernet Sauvignon to 42
... ... ... ...
6274 Sauvignon Blanc enchanting 1
6275 Sauvignon Blanc soaked 1
6276 Sauvignon Blanc sun 1
6277 Sauvignon Blanc gooseberry.awatere 1
6278 Sauvignon Blanc cheeses 1

6279 rows × 3 columns

Now, let's define a function to plot a barplot matrix to show the most frequent tokens in each grape variety:

def barplot_wordcounts(df, limit=10):
    # select the `limit` most frequent tokens for one grape type,
    # sorted lowest count first so barh draws the biggest bar on top
    def top_tokens(grape):
        return (df[df['grape'] == grape]
                .sort_values(by=['count'], ascending=False)
                .head(limit)
                .sort_values(by=['count']))

    chardonnay = top_tokens("Chardonnay")
    sauvignon_blanc = top_tokens("Sauvignon Blanc")
    pinot_noir = top_tokens("Pinot Noir")
    cabernet_sauvignon = top_tokens("Cabernet Sauvignon")

    # build the 2 x 2 plot
    fig, ax = plt.subplots(2, 2, figsize=(15, 8))

    ax[0, 0].barh(chardonnay['token'], chardonnay['count'], color="gold")
    ax[0, 0].set_ylabel("Tokens")
    ax[0, 0].set_title("Chardonnay")

    ax[0, 1].barh(sauvignon_blanc['token'], sauvignon_blanc['count'], color="deepskyblue")
    ax[0, 1].set_title("Sauvignon Blanc")

    ax[1, 0].barh(pinot_noir['token'], pinot_noir['count'], color="violet")
    ax[1, 0].set_ylabel("Tokens")
    ax[1, 0].set_xlabel("Count")
    ax[1, 0].set_title("Pinot Noir")

    ax[1, 1].barh(cabernet_sauvignon['token'], cabernet_sauvignon['count'], color="limegreen")
    ax[1, 1].set_xlabel("Count")
    ax[1, 1].set_title("Cabernet Sauvignon")

    plt.tight_layout()  # get rid of overlaps
    plt.show()

Below are the 15 most frequent words used in the Chardonnay, Sauvignon Blanc, Pinot Noir, and Cabernet Sauvignon descriptions. From the plot, we can see a lot of punctuation and common words like the, and, of, a, in, is, it, and to. Therefore, we can't really tell the difference between these four grape types yet.

barplot_wordcounts(tkn_count, limit = 15)

2.2 Filtering Noise

Next, we need to filter the noise out of our token list, so we can see the differences between the grape types clearly. spaCy has a list of built-in stop words, and we can also add our own customized stop words, as I do below:

customize_stop_words = ["wine","wines","fruit","fruits", "flavour",'flavours', 'aromas', 'palate', 'chardonnay','notes',
                        'note','sauvignon', 'blanc', 'de', 'pinot', 'noir', 'cabernet', 'best']
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True
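
A quick sanity check that the flags took effect; since the flag is set on the shared vocabulary, newly processed tokens pick it up:

print(nlp.vocab["wine"].is_stop)               # True
print([t.is_stop for t in nlp("crisp wine")])  # [False, True]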

Then, we can simply modify the word_count_df function that we defined previously to remove stop words and punctuation. token.is_alpha checks whether the token consists of alphabetic characters only, which filters out punctuation and numbers. token.is_stop checks whether the token is a stop word, so we keep only the tokens that are not stop words.

def word_count_df1(df, src_col, out_col=('grape', 'token', 'count')):
    dfp = pd.DataFrame()
    for i in df.index:
        doc = nlp(df.loc[i, src_col])
        counts = Counter()
        for token in doc:
            # keep only alphabetic tokens that are not stop words
            if token.is_alpha and not token.is_stop:
                counts[token.text] += 1
        dftmp = pd.DataFrame(dict(zip(out_col, ([i] * len(counts), list(counts.keys()), list(counts.values())))))
        dfp = pd.concat([dfp, dftmp], ignore_index=True)
    return dfp

After filtering out the noise, our dataframe will look like the following table. Compared to the previous table, we have removed 1347 rows.

tkn_count1 = word_count_df1(grouped, 'description', out_col=('grape', 'token', 'count'))
tkn_count1
grape token count
0 Cabernet Sauvignon château 14
1 Cabernet Sauvignon livran 2
2 Cabernet Sauvignon belonged 1
3 Cabernet Sauvignon edward 1
4 Cabernet Sauvignon pope 1
... ... ... ...
4927 Sauvignon Blanc sounds 1
4928 Sauvignon Blanc enchanting 1
4929 Sauvignon Blanc soaked 1
4930 Sauvignon Blanc sun 1
4931 Sauvignon Blanc cheeses 1

4932 rows × 3 columns

After filtering, some of the characteristics start to appear in the most frequent words. From the plots below, we can actually begin to differentiate the varieties by their most frequent words.

barplot_wordcounts(tkn_count1, limit = 10)

2.3 Lemmatization

In the previous step, we removed some basic noise from the token counts, but that is not enough. We can still see grapes and grape, or cherry and cherries, counted separately. They are really the same word; one is just the singular form and the other the plural.

The good news is that we can simply lemmatize the tokens using spaCy: token.lemma_ gives the lemmatized form of the word (token.lemma is its hash value). For example, grapes becomes grape. We can then count token.lemma_ in the word_count_df1 function to lemmatize the tokens.
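
For instance, on a toy snippet (the exact lemmas depend on the spaCy model version):

doc = nlp("grapes cherries belonged")
print([(t.text, t.lemma_) for t in doc])
# [('grapes', 'grape'), ('cherries', 'cherry'), ('belonged', 'belong')]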

def word_count_df2(df, src_col, out_col=('grape', 'token', 'count')):
    dfp = pd.DataFrame()
    for i in df.index:
        doc = nlp(df.loc[i, src_col])
        counts = Counter()
        for token in doc:
            if token.is_alpha and not token.is_stop:  # remove non-alphabetic tokens and stop words
                counts[token.lemma_] += 1  # count the lemma instead of the surface form
        dftmp = pd.DataFrame(dict(zip(out_col, ([i] * len(counts), list(counts.keys()), list(counts.values())))))
        dfp = pd.concat([dfp, dftmp], ignore_index=True)
    return dfp

Below is the token count table after lemmatization; compared to the previous table, we have removed another 503 rows.

tkn_count2 = word_count_df2(grouped, 'description', out_col=('grape', 'token', 'count'))  
tkn_count2
grape token count
0 Cabernet Sauvignon château 14
1 Cabernet Sauvignon livran 2
2 Cabernet Sauvignon belong 1
3 Cabernet Sauvignon edward 1
4 Cabernet Sauvignon pope 1
... ... ... ...
4424 Sauvignon Blanc dusky 1
4425 Sauvignon Blanc sound 1
4426 Sauvignon Blanc enchant 1
4427 Sauvignon Blanc soak 1
4428 Sauvignon Blanc sun 1

4429 rows × 3 columns

After lemmatization, we can see the word 'grapes' has become 'grape' and its count has increased, because the counts for grape and grapes are now added together.

barplot_wordcounts(tkn_count2, limit = 10)

3. Comparison

3.1 Common Words

In the previous bar charts, we can see some words that overlap between the grapes. Next, we want to find the common words in the descriptions of all grape varieties. Common words do not give much information about any individual grape variety. After finding them, we can check the correlation of their frequencies between the grape varieties.

varieties = tkn_count2['grape'].unique()  # the four grape variety names

dfs = []
# build a list of four dataframes, one per grape variety
for gr in varieties:
    tmp = tkn_count2[tkn_count2.grape == gr]
    tmp = tmp.add_suffix('_'+gr)
    tmp.columns = tmp.columns.str.replace('token_'+gr, 'token')
    dfs.append(tmp)
    
# merge each grape variety with each other on tokens
df_final = reduce(lambda left,right: pd.merge(left,right,on='token', how='outer'), dfs)
df_common = df_final.dropna()
#Select the columns we need (the count column and token column)
cols = df_common.columns.str.contains('count') | df_common.columns.str.contains('token')
df_common = df_common[df_common.columns[cols]]
df_common
token count_Chardonnay count_Pinot Noir count_Cabernet Sauvignon count_Sauvignon Blanc
0 produce 21.0 9.0 5.0 4.0
2 quality 22.0 10.0 3.0 5.0
3 grape 49.0 25.0 10.0 18.0
4 grow 9.0 5.0 1.0 3.0
8 vineyard 45.0 26.0 4.0 12.0
... ... ... ... ... ...
1367 boast 1.0 2.0 2.0 1.0
1387 heart 2.0 1.0 3.0 1.0
1611 roasted 1.0 4.0 1.0 1.0
1644 standard 1.0 2.0 1.0 1.0
1658 complement 1.0 1.0 1.0 1.0

194 rows × 5 columns

Below is a correlation heatmap for the frequency of the common words between each grape variety.

# correlation of the common-word counts between varieties (count columns only)
corr_common = df_common.drop(columns=['token']).corr()
sns.heatmap(corr_common,
            xticklabels=corr_common.columns,
            yticklabels=corr_common.columns,
            annot=True, cmap='Blues', vmin=0, vmax=1)
plt.show()

Actually, we cannot simply compare raw word frequencies between grape varieties, because the sample sizes of the grape descriptions are not the same. Therefore, we have to normalize the word counts within each grape variety using MinMaxScaler from the sklearn package. MinMaxScaler rescales each column to the range between 0 and 1.
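
As a quick illustration, MinMaxScaler rescales each column independently via x' = (x - min) / (max - min); a toy example:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

toy = np.array([[1.0], [3.0], [5.0]])  # one column with min 1 and max 5
print(MinMaxScaler().fit_transform(toy).ravel())  # [0.  0.5 1. ]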

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_norm = df_common.copy()
sort_col = df_norm.columns.str.contains('count')
df_norm[df_norm.columns[sort_col]] = scaler.fit_transform(df_norm[df_norm.columns[sort_col]])
df_norm['freq'] = df_norm[df_norm.columns[sort_col]].sum(axis=1)

# visualize the 15 most frequent common words within the grape varieties

dfplot = df_norm.sort_values(by = ['freq'], ascending = False).head(15)
dfplot
token count_Chardonnay count_Pinot Noir count_Cabernet Sauvignon count_Sauvignon Blanc freq
3 grape 0.923077 0.500000 0.473684 0.68 2.576761
171 year 0.711538 0.375000 0.684211 0.52 2.290749
50 oak 0.923077 0.458333 0.736842 0.16 2.278252
205 fresh 0.711538 0.291667 0.263158 1.00 2.266363
164 red 0.115385 1.000000 1.000000 0.00 2.115385
183 fine 0.596154 0.229167 0.947368 0.24 2.012689
8 vineyard 0.846154 0.520833 0.157895 0.44 1.964882
37 rich 0.769231 0.083333 0.684211 0.36 1.896775
40 white 1.000000 0.041667 0.052632 0.76 1.854298
131 world 0.500000 0.145833 0.631579 0.36 1.637412
228 expect 0.423077 0.270833 0.473684 0.36 1.527594
197 ripe 0.403846 0.104167 0.894737 0.12 1.522750
9 great 0.538462 0.291667 0.631579 0.04 1.501707
161 vintage 0.557692 0.208333 0.526316 0.12 1.412341
65 finish 0.230769 0.145833 0.578947 0.44 1.395550

Then I will plot the proportion of each grape variety for the top 15 most frequent common words:

import matplotlib

# matplotlib >= 3.5: the colormap registry replaces the deprecated cm.get_cmap
cmap = matplotlib.colormaps['cool']
fsize = 12

dfplot.index = dfplot['token']
sort_col = dfplot.columns[dfplot.columns.str.contains('count')]

# creating the proportion columns
llabel = []
for col in sort_col:
    llabel.append(col.replace('count_',''))
    dfplot[col] = dfplot[col] / dfplot['freq']

# plotting
ax = dfplot.loc[list(reversed(dfplot.index[:15])), sort_col].plot(kind='barh', stacked=True, cmap=cmap, 
                                                                  figsize=(10, 6), fontsize=fsize)
ax.legend(llabel,loc='best', bbox_to_anchor=(1., 1.), fontsize=fsize)
ax.set_facecolor('w')
ax.set_frame_on(False)
ax.set_ylabel(ax.get_ylabel(), fontsize=fsize)
plt.show()

3.2 Disjoint Words

Now, let's see the disjoint words, i.e. the words that are unique to a single grape variety.

df_unique = pd.DataFrame()
for gr in varieties:
    cond = ~(df_final.columns.str.contains(gr) | df_final.columns.str.contains('token'))
    ind = df_final[df_final.columns[cond]].isna().all(axis=1)  # unique words: NaN count in every other variety
    tmp = df_final.loc[ind, ~cond]
    tmp.columns = tmp.columns.str.replace('_'+gr, '')
    df_unique = pd.concat([df_unique, tmp], ignore_index=True, sort = True)
df_unique
count grape token
0 1.0 Chardonnay comte
1 1.0 Chardonnay patient
2 5.0 Chardonnay establish
3 2.0 Chardonnay inland
4 3.0 Chardonnay sumptuous
... ... ... ...
1705 1.0 Sauvignon Blanc flow
1706 1.0 Sauvignon Blanc dusky
1707 1.0 Sauvignon Blanc sound
1708 1.0 Sauvignon Blanc enchant
1709 1.0 Sauvignon Blanc soak

1710 rows × 3 columns
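
The trick in the loop above is the isna().all(axis=1) mask: after the outer merge, a token is unique to one variety exactly when its counts for all the other varieties are NaN. A toy illustration:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'count_A': [1.0, 2.0], 'count_B': [3.0, np.nan], 'count_C': [4.0, np.nan]})
# a token is unique to A when its count is NaN for every other variety
print(toy[['count_B', 'count_C']].isna().all(axis=1))
# 0    False
# 1     True
# dtype: bool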

From the plot below, we can see some of the grape characteristics, but these words cannot be the best descriptors, because so far we have only looked at fully common and fully disjoint words. There may be useful features among the partially disjoint words as well.

barplot_wordcounts(df_unique, limit = 10)

4. Term Frequency And The Count Vectorizer

Actually, there are packages that do the word counting for us. We can use CountVectorizer from Scikit-learn to count the term frequencies of a document. However, before we call it we need to preprocess the descriptions. We will define a remove_stop_words_and_lemmatization() function that removes stop words and punctuation and lemmatizes the words in the descriptions.

def remove_stop_words_and_lemmatization(text):

    my_doc = nlp(text)

    # create a list of lemmatized word tokens
    token_list = []
    for token in my_doc:
        token_list.append(token.lemma_)

    # keep only alphabetic lemmas that are not stop words
    filtered_sentence = []
    for word in token_list:
        lexeme = nlp.vocab[word]
        if not lexeme.is_stop and lexeme.is_alpha:
            filtered_sentence.append(word)

    # join the surviving lemmas back into a single string
    return ' '.join(filtered_sentence)
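
A quick spot check on a made-up snippet (the exact output depends on the model version and our custom stop list):

print(remove_stop_words_and_lemmatization("ripe cherries and toasted oak flavours"))
# roughly: 'ripe cherry toast oak'  ('and' is a stop word, 'flavours' is in our custom list)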

We will apply the remove_stop_words_and_lemmatization() function to the description column in the dataframe called grouped.

grouped['description'] = grouped['description'].apply(remove_stop_words_and_lemmatization)

Then, we can create the CountVectorizer and fit it on the wine descriptions to build the vocabulary. We can then use the transform method to obtain a bag-of-words 4 × K sparse matrix, where K is the vocabulary size.

count_vec = CountVectorizer(analyzer='word', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)
count_train = count_vec.fit(grouped.loc[:,'description'])
# based on the fitted vocabulary transform the document into vector representation with counts
bag_of_words = count_vec.transform(grouped.loc[:,'description'])

Now, let's take a look at what the bag of words looks like. We need to convert the sparse matrix to an ndarray so we can inspect it. In the following array, each row corresponds to one grape type, and each number in a row is the term frequency of one word for that grape type.

bag_of_words.toarray()
array([[0, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 3, 4, 1],
       [1, 2, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 5, 3, 1]])
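
Each column of this matrix corresponds to one term of the fitted vocabulary, which CountVectorizer stores in alphabetical order. On scikit-learn 1.0 or newer, you can recover the terms with get_feature_names_out():

feature_names = count_vec.get_feature_names_out()  # scikit-learn >= 1.0
print(feature_names[:5])   # ['aarde' 'abbott' 'ability' 'abound' 'absolute']
print(bag_of_words.shape)  # (4, 2691): four grape varieties x vocabulary size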

Next, we will convert the bag of words to a pandas dataframe:

a=pd.DataFrame(bag_of_words.toarray(), columns=sorted(count_vec.vocabulary_))
# take its transpose
a=a.T
# rename column names
a.columns=grouped.index
a
grape_variety Cabernet Sauvignon Chardonnay Pinot Noir Sauvignon Blanc
aarde 0 1 1 0
abbott 0 0 2 0
ability 0 1 0 0
abound 0 0 1 0
absolute 0 1 0 0
... ... ... ... ...
zealand 1 4 7 6
zesti 0 0 1 0
zesty 0 3 0 5
zingy 0 4 0 3
zippy 0 1 0 1

2691 rows × 4 columns

Next, we calculate the correlation of the term frequencies between the grape varieties and plot it as a heatmap:

a_count = a.corr()
sns.heatmap(a_count,
            xticklabels=a_count.columns,
            yticklabels=a_count.columns,
            annot=True, cmap='Blues', vmin=0, vmax=1)
plt.show()

5. Inverse Document Frequency And The TF-IDF Vectorizer

Another way to vectorize the words is term frequency combined with inverse document frequency (TF-IDF). If you are not familiar with TF-IDF, you can reference this article: TF-IDF. Basically, we can follow the same steps as we did for the CountVectorizer.

tfid_vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)

tfidf_train = tfid_vec.fit(grouped.loc[:,'description'])
bag_of_words = tfid_vec.transform(grouped.loc[:,'description'])

To get a better understanding of what the IDF factor is actually doing, we convert our vocabulary with the corresponding IDF factors into the following data frame.

word_idf = pd.DataFrame(
    {
        'token': list(sorted(tfid_vec.vocabulary_)), 
        'idf': tfid_vec.idf_}
)
word_idf.head(10)
token idf
0 aarde 1.510826
1 abbott 1.916291
2 ability 1.916291
3 abound 1.916291
4 absolute 1.916291
5 absolutely 1.510826
6 abundance 1.916291
7 abundant 1.916291
8 acacia 1.916291
9 access 1.916291

As you can see, there are only four unique IDF values, because we only have four rows of data, i.e. four grape varieties. A higher IDF means a word appears in fewer of the four documents, so let's collect the unique values sorted from rarest to most common:

idfs = np.sort(pd.unique(word_idf['idf'].values))[::-1]  # sort descending: rarest words first
idfs
array([1.91629073, 1.51082562, 1.22314355, 1.        ])
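
To see where these values come from, we can recompute scikit-learn's smoothed IDF by hand. With the default smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents (4 here) and df(t) is the number of documents containing term t:

import numpy as np

n_docs = 4
for df_t in range(1, n_docs + 1):
    print(df_t, np.log((1 + n_docs) / (1 + df_t)) + 1)
# 1 1.916290731874155    <- words appearing in a single description
# 2 1.5108256237659907
# 3 1.2231435513142097
# 4 1.0                  <- words common to all four varieties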

Then, let's print out the proportion of words contained in exactly 1, 2, 3, and 4 documents:

for n_doc, idf in enumerate(idfs, start=1):
    # words with this IDF value appear in exactly n_doc of the four documents
    share = word_idf[word_idf['idf'] == idf].shape[0] / word_idf.shape[0] * 100
    print('Ratio of words contained in %d document(s): %.3f pct' % (n_doc, share))
Ratio of words contained in 1 document(s): 62.430 pct
Ratio of words contained in 2 document(s): 20.810 pct
Ratio of words contained in 3 document(s): 9.699 pct
Ratio of words contained in 4 document(s): 7.061 pct

Here, we obtain the bag-of-words data frame using the TF-IDF method:

a=pd.DataFrame(bag_of_words.toarray(), columns=sorted(tfid_vec.vocabulary_))
a=a.T
a.columns=grouped.index
atfidf = a.corr()
a
grape_variety Cabernet Sauvignon Chardonnay Pinot Noir Sauvignon Blanc
aarde 0.000000 0.004938 0.010125 0.000000
abbott 0.000000 0.000000 0.025683 0.000000
ability 0.000000 0.006263 0.000000 0.000000
abound 0.000000 0.000000 0.012842 0.000000
absolute 0.000000 0.006263 0.000000 0.000000
... ... ... ... ...
zealand 0.008606 0.013074 0.046909 0.045126
zesti 0.000000 0.000000 0.012842 0.000000
zesty 0.000000 0.014814 0.000000 0.056814
zingy 0.000000 0.019752 0.000000 0.034089
zippy 0.000000 0.004938 0.000000 0.011363

2691 rows × 4 columns

Then, let us again compare the correlation between the different grape varieties using these TF-IDF results.

sns.heatmap(atfidf,
            xticklabels=atfidf.columns,
            yticklabels=atfidf.columns,
            annot=True, cmap='Blues', vmin=0, vmax=1)
plt.show()

Finally, let us compare the correlation between the common words of the different grape varieties using TF-IDF. The common words are the ones that appear in all four documents, i.e. those with the lowest IDF:

# idfs[3] is the lowest IDF, i.e. words present in all four documents
acommon = a[a.index.isin(list(word_idf[word_idf['idf'] == idfs[3]]['token']))].corr()
sns.heatmap(acommon,
            xticklabels=acommon.columns,
            yticklabels=acommon.columns,
            annot=True, cmap='Blues', vmin=0, vmax=1)
plt.show()

6. Conclusion

Overall, we performed some text analysis using spaCy to find the characteristic features of grape varieties in the wine descriptions. We also obtained bag-of-words representations using the CountVectorizer and TfidfVectorizer classes from the Scikit-learn package.

We can clearly see that TfidfVectorizer performs better, because it gives us lower correlations between the grape varieties compared to CountVectorizer. Therefore, we will use TfidfVectorizer in the next part to predict grape varieties with a Random Forest Classifier.