1. Introduction

Do you like to drink wine? Can you guess the grape type after drinking it? If not, you are in the right place. This project uses some simple supervised machine learning techniques to predict the grape variety a bottle of wine was made from. We will use the wine descriptions to find the similarities and differences between wines, then predict the grape variety from the description.

In fact, all the descriptions were written by wine experts called sommeliers. It takes them years of practice to master blind tasting. Blind tasting a wine means tasting it with no idea of its grape variety, origin, vintage, or any evidence really other than the liquid in front of you.

Actually, this can also be done with machine learning and data mining. However, the computer can't fully replace the sommeliers: we need to provide the machine with some basic characterizations of the wine, like the description, in order to predict the grape variety.

This is the first part of the project, in which I will focus on exploring the similarities between wine descriptions using some simple Natural Language Processing (NLP) techniques. In addition, this whole project is reimplemented based on this Wine Project, using spaCy instead of the NLTK package.

2. Data Cleaning

In this project, I scraped my own data set from Majestic using selenium. If you want to follow along, you can download my pre-cleaned data from here: my_wine_data_clean.csv. You can also scrape your own data from Majestic, Wine Enthusiast, Bibendum, etc. Another option is to download the big wine dataset from Kaggle published by zackthoutt, which is a good dataset to start with. We will only use the wine descriptions and grape varieties in this project.

First, we need to import all the necessary packages. Make sure you have installed all the libraries on your computer if you want to follow along (spaCy's small English model also needs to be downloaded once, e.g. with python -m spacy download en_core_web_sm).

import pandas as pd
import numpy as np
from functools import reduce

# function to split data into train and test samples
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# packages for visualization
import seaborn as sns
import matplotlib.pyplot as plt

from collections import Counter
import spacy
nlp = spacy.load("en_core_web_sm")
plt.style.use('seaborn-darkgrid')
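
Note that matplotlib 3.6 renamed the bundled seaborn styles, so on newer releases the last line above may raise an error. If it does, the v0_8-prefixed name should work instead:

# on matplotlib >= 3.6 the old seaborn style names were renamed
plt.style.use('seaborn-v0_8-darkgrid')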

Then we will load our wine data. My data has 503 rows and 3 columns: the wine name, grape variety, and description (plus a saved index column that pandas reads in as Unnamed: 0). Below you can see the first five rows of my dataframe. In this project we are mainly interested in the grape_variety and description columns.

a0 = pd.read_csv("my_wine_data_clean.csv")
a0.head()
Unnamed: 0 name grape_variety description
0 0 Edouard Delaunay 'Septembre' Chardonnay 2019, ... Chardonnay One half of the Abbotts & Delaunay powerhouse,...
1 1 The Ned Waihopai River Sauvignon Blanc 2020 Ma... Sauvignon Blanc Our bestselling white wine. Winemaker Brent Ma...
2 2 The King's Favour Sauvignon Blanc 2019/20 Marl... Sauvignon Blanc Brent Marris is winemaking royalty. And it tur...
3 3 Château Livran 2014, Médoc Cabernet Sauvignon Château Livran once belonged to both Edward I ...
4 4 Cave de Lugny 'Reserve' Mâcon-Chardonnay 2019 Chardonnay The small village of Mâcon-Chardonnay isn’t co...

Next, we want to find out which grape varieties are in our dataframe, and how many rows there are for each type.

df_val_counts = pd.DataFrame(a0['grape_variety'].value_counts())
df_val_counts.head(10)
grape_variety
Chardonnay 219
Pinot Noir 101
Cabernet Sauvignon 92
Sauvignon Blanc 91

From the table above, we can see there are four grape types in our dataframe, and each has an acceptable number of samples. Therefore, we are going to predict whether a wine is made of Chardonnay, Pinot Noir, Cabernet Sauvignon, or Sauvignon Blanc based on its description.

First, we need to split our data into test and train set:

combined_features = ['description', 'grape_variety']
target = 'grape_variety'

X_train, X_test, y_train, y_test = train_test_split(a0[combined_features], a0[target], 
                                                    test_size=0.30, random_state=42)
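
Since the classes are imbalanced (219 Chardonnay rows versus roughly 90–100 for each of the others), you could optionally stratify the split so that both sets keep the same class proportions. A minimal variant, not used in the rest of this post:

# optional: a stratified split keeps the class proportions equal in train and test
X_train, X_test, y_train, y_test = train_test_split(a0[combined_features], a0[target],
                                                    test_size=0.30, random_state=42,
                                                    stratify=a0[target])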

The following bar chart shows the number of descriptions in each grape variety:

# set plot theme and font size
sns.set(style="darkgrid", font_scale=1.2)
# specify the figure size
plt.figure(figsize=(10,6))

ax = sns.countplot(x="grape_variety", data=X_train, palette="pastel", 
                   order=X_train['grape_variety'].value_counts().index)
ax.set_title("Number of Descriptions in Grape Variety", fontsize=20)
ax.set_xlabel('Grape Variety')  # the varieties are on the x-axis
ax.set_ylabel('Count')
plt.show()

Then, let's combine all the descriptions for each grape variety into one row, so we can find some patterns to distinguish them.

grouped = X_train[['grape_variety', 'description']].groupby(['grape_variety']).agg(
    {'description': lambda z: reduce(lambda x, y: x + y, z)}  # concatenate all descriptions into one string
)
grouped["description"] = grouped["description"].str.lower()
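
If the reduce call looks opaque: it simply folds string concatenation over the group's descriptions. A minimal illustration:

from functools import reduce
print(reduce(lambda x, y: x + y, ['crisp apple. ', 'citrus zest. ', 'buttery oak.']))
# crisp apple. citrus zest. buttery oak.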

You will get the following dataframe:

description
grape_variety
Cabernet Sauvignon château livran once belonged to both edward i ...
Chardonnay produced exclusively from top quality chardonn...
Pinot Noir you might think california is too hot for the ...
Sauvignon Blanc this dessert wine from marisco vineyards' king...

2.1 Word Tokenization

We will need to define a function to tokenize our descriptions, so we are able to compare the words used for each grape variety. In the function word_count_df below, the parameter df is the input dataframe and src_col is the column in df that you want to tokenize. The function returns a pandas dataframe with three columns: grape, token, and count. You can also change the column names via the out_col parameter.

Basically, we use a spaCy model to tokenize the descriptions for each grape variety, then count the number of appearances of each unique word in each description. Finally, the word counts are returned as a pandas dataframe.
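
Counter does the heavy lifting here; a one-liner illustration of how it tallies tokens:

from collections import Counter
print(Counter("ripe plum and ripe cherry".split()))
# Counter({'ripe': 2, 'plum': 1, 'and': 1, 'cherry': 1})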

def word_count_df(df, src_col, out_col=('grape', 'token', 'count')):
    dfp = pd.DataFrame()
    for i in df.index:
        # run the spaCy pipeline on this variety's combined description
        doc = nlp(df.loc[i, src_col])
        counts = Counter()
        for token in doc:
            counts[token.text] += 1
        # one row per unique token: (grape, token, count)
        dftmp = pd.DataFrame(dict(zip(out_col, ([i] * len(counts), list(counts.keys()), list(counts.values())))))
        dfp = pd.concat([dfp, dftmp], ignore_index=True)
    return dfp

Let's call our word_count_df function:

tkn_count = word_count_df(grouped, 'description', out_col=('grape', 'token', 'count'))

Your tkn_count will look like the following table:

grape token count
0 Cabernet Sauvignon château 14
1 Cabernet Sauvignon livran 2
2 Cabernet Sauvignon once 3
3 Cabernet Sauvignon belonged 1
4 Cabernet Sauvignon to 42
... ... ... ...
6274 Sauvignon Blanc enchanting 1
6275 Sauvignon Blanc soaked 1
6276 Sauvignon Blanc sun 1
6277 Sauvignon Blanc gooseberry.awatere 1
6278 Sauvignon Blanc cheeses 1

6279 rows × 3 columns

Now, let's define a function to plot a barplot matrix to show the most frequent tokens in each grape variety:

def barplot_wordcounts(df, limit=10):
    # select the `limit` most frequent tokens for one grape type,
    # sorted lowest count first so barh draws the biggest bar on top
    def top_tokens(grape):
        return (df[df['grape'] == grape]
                .sort_values(by=['count'], ascending=False)
                .head(limit)
                .sort_values(by=['count']))

    chardonnay = top_tokens("Chardonnay")
    sauvignon_blanc = top_tokens("Sauvignon Blanc")
    pinot_noir = top_tokens("Pinot Noir")
    cabernet_sauvignon = top_tokens("Cabernet Sauvignon")

    # build the 2 x 2 plot
    fig, ax = plt.subplots(2, 2, figsize=(15, 8))

    ax[0, 0].barh(chardonnay['token'], chardonnay['count'], color="gold")
    ax[0, 0].set_ylabel("Tokens")
    ax[0, 0].set_title("Chardonnay")

    ax[0, 1].barh(sauvignon_blanc['token'], sauvignon_blanc['count'], color="deepskyblue")
    ax[0, 1].set_title("Sauvignon Blanc")

    ax[1, 0].barh(pinot_noir['token'], pinot_noir['count'], color="violet")
    ax[1, 0].set_ylabel("Tokens")
    ax[1, 0].set_xlabel("Count")
    ax[1, 0].set_title("Pinot Noir")

    ax[1, 1].barh(cabernet_sauvignon['token'], cabernet_sauvignon['count'], color="limegreen")
    ax[1, 1].set_xlabel("Count")
    ax[1, 1].set_title("Cabernet Sauvignon")

    plt.tight_layout()  # get rid of overlaps
    plt.show()

Below are the 15 most frequent words used in the Chardonnay, Sauvignon Blanc, Pinot Noir, and Cabernet Sauvignon descriptions. From the plot, we can see a lot of punctuation and common words like the, and, of, a, in, is, it, and to. Therefore, we can't really tell the difference between these four grape types yet.

barplot_wordcounts(tkn_count, limit = 15)

2.2 Filtering Noise

Next, we need to filter the noise out of our token list, so we can see the differences between the grape types clearly. spaCy has a list of built-in stop words, and we can also add our own customized stop words, as I do below:

customize_stop_words = ["wine","wines","fruit","fruits", "flavour",'flavours', 'aromas', 'palate', 'chardonnay','notes',
                        'note','sauvignon', 'blanc', 'de', 'pinot', 'noir', 'cabernet', 'best']
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True
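
A quick sanity check that the flags took effect; since the flag is set on the shared vocabulary, newly processed tokens pick it up:

print(nlp.vocab["wine"].is_stop)               # True
print([t.is_stop for t in nlp("crisp wine")])  # [False, True]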

Then, we can simply modify the word_count_df function that we defined previously to remove stop words and punctuation. token.is_alpha checks whether the token consists of alphabetic characters only, which filters out punctuation and numbers. token.is_stop checks whether the token is a stop word, so we keep only the tokens that are not stop words.

def word_count_df1(df, src_col, out_col=('grape', 'token', 'count')):
    dfp = pd.DataFrame()
    for i in df.index:
        doc = nlp(df.loc[i, src_col])
        counts = Counter()
        for token in doc:
            # keep only alphabetic tokens that are not stop words
            if token.is_alpha and not token.is_stop:
                counts[token.text] += 1
        dftmp = pd.DataFrame(dict(zip(out_col, ([i] * len(counts), list(counts.keys()), list(counts.values())))))
        dfp = pd.concat([dfp, dftmp], ignore_index=True)
    return dfp

After filtering out the noise, our dataframe will look like the following table. Compared to the previous table, we have removed 1347 rows.

tkn_count1 = word_count_df1(grouped, 'description', out_col=('grape', 'token', 'count'))
tkn_count1
grape token count
0 Cabernet Sauvignon château 14
1 Cabernet Sauvignon livran 2
2 Cabernet Sauvignon belonged 1
3 Cabernet Sauvignon edward 1
4 Cabernet Sauvignon pope 1
... ... ... ...
4927 Sauvignon Blanc sounds 1
4928 Sauvignon Blanc enchanting 1
4929 Sauvignon Blanc soaked 1
4930 Sauvignon Blanc sun 1
4931 Sauvignon Blanc cheeses 1

4932 rows × 3 columns

After filtering, some of the characteristics start to appear in the most frequent words. From the plots below, we can actually begin to differentiate the varieties by their most frequent words.

barplot_wordcounts(tkn_count1, limit = 10)

2.3 Lemmatization

In the previous step, we removed some basic noise from the token counts, but that is not enough. We can still see grapes and grape, or cherry and cherries, counted separately. They are really the same word; one is just the singular form and the other the plural.

The good news is that we can simply lemmatize the tokens using spaCy: token.lemma_ gives the lemmatized form of the word (token.lemma is its hash value). For example, grapes becomes grape. We can then count token.lemma_ in the word_count_df1 function to lemmatize the tokens.
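
For instance, on a toy snippet (the exact lemmas depend on the spaCy model version):

doc = nlp("grapes cherries belonged")
print([(t.text, t.lemma_) for t in doc])
# [('grapes', 'grape'), ('cherries', 'cherry'), ('belonged', 'belong')]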

def word_count_df2(df, src_col, out_col=('grape', 'token', 'count')):
    dfp = pd.DataFrame()
    for i in df.index:
        doc = nlp(df.loc[i, src_col])
        counts = Counter()
        for token in doc:
            if token.is_alpha and not token.is_stop:  # remove non-alphabetic tokens and stop words
                counts[token.lemma_] += 1  # count the lemma instead of the surface form
        dftmp = pd.DataFrame(dict(zip(out_col, ([i] * len(counts), list(counts.keys()), list(counts.values())))))
        dfp = pd.concat([dfp, dftmp], ignore_index=True)
    return dfp

Below is the token count table after lemmatization; compared to the previous table, we have removed another 503 rows.

tkn_count2 = word_count_df2(grouped, 'description', out_col=('grape', 'token', 'count'))  
tkn_count2
grape token count
0 Cabernet Sauvignon château 14
1 Cabernet Sauvignon livran 2
2 Cabernet Sauvignon belong 1
3 Cabernet Sauvignon edward 1
4 Cabernet Sauvignon pope 1
... ... ... ...
4424 Sauvignon Blanc dusky 1
4425 Sauvignon Blanc sound 1
4426 Sauvignon Blanc enchant 1
4427 Sauvignon Blanc soak 1
4428 Sauvignon Blanc sun 1

4429 rows × 3 columns

After lemmatization, we can see the word 'grapes' has become 'grape' and its count has increased, because the counts for grape and grapes are now added together.

barplot_wordcounts(tkn_count2, limit = 10)

3. Comparison

3.1 Common Words

In the previous bar charts, we can see some words that overlap between the grapes. Next, we want to find the common words in the descriptions of all grape varieties. Common words do not give much information about any individual grape variety. After finding them, we can check the correlation of their frequencies between the grape varieties.

varieties = tkn_count2['grape'].unique()  # the four grape variety names

dfs = []
# build a list of four dataframes, one per grape variety
for gr in varieties:
    tmp = tkn_count2[tkn_count2.grape == gr]
    tmp = tmp.add_suffix('_'+gr)
    tmp.columns = tmp.columns.str.replace('token_'+gr, 'token')
    dfs.append(tmp)
    
# merge each grape variety with each other on tokens
df_final = reduce(lambda left,right: pd.merge(left,right,on='token', how='outer'), dfs)
df_common = df_final.dropna()
#Select the columns we need (the count column and token column)
cols = df_common.columns.str.contains('count') | df_common.columns.str.contains('token')
df_common = df_common[df_common.columns[cols]]
df_common
token count_Chardonnay count_Pinot Noir count_Cabernet Sauvignon count_Sauvignon Blanc
0 produce 21.0 9.0 5.0 4.0
2 quality 22.0 10.0 3.0 5.0
3 grape 49.0 25.0 10.0 18.0
4 grow 9.0 5.0 1.0 3.0
8 vineyard 45.0 26.0 4.0 12.0
... ... ... ... ... ...
1367 boast 1.0 2.0 2.0 1.0
1387 heart 2.0 1.0 3.0 1.0
1611 roasted 1.0 4.0 1.0 1.0
1644 standard 1.0 2.0 1.0 1.0
1658 complement 1.0 1.0 1.0 1.0

194 rows × 5 columns

Below is a correlation heatmap for the frequency of the common words between each grape variety.

# correlation of the common-word counts between varieties (count columns only)
corr_common = df_common.drop(columns=['token']).corr()
sns.heatmap(corr_common,
            xticklabels=corr_common.columns,
            yticklabels=corr_common.columns,
            annot=True, cmap='Blues', vmin=0, vmax=1)
plt.show()

Actually, we cannot simply compare raw word frequencies between grape varieties, because the sample sizes of the grape descriptions are not the same. Therefore, we have to normalize the word counts within each grape variety using MinMaxScaler from the sklearn package. MinMaxScaler rescales each column to the range between 0 and 1.
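
As a quick illustration, MinMaxScaler rescales each column independently via x' = (x - min) / (max - min); a toy example:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

toy = np.array([[1.0], [3.0], [5.0]])  # one column with min 1 and max 5
print(MinMaxScaler().fit_transform(toy).ravel())  # [0.  0.5 1. ]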

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_norm = df_common.copy()
sort_col = df_norm.columns.str.contains('count')
df_norm[df_norm.columns[sort_col]] = scaler.fit_transform(df_norm[df_norm.columns[sort_col]])
df_norm['freq'] = df_norm[df_norm.columns[sort_col]].sum(axis=1)

# visualize the 15 most frequent common words within the grape varieties

dfplot = df_norm.sort_values(by = ['freq'], ascending = False).head(15)
dfplot
token count_Chardonnay count_Pinot Noir count_Cabernet Sauvignon count_Sauvignon Blanc freq
3 grape 0.923077 0.500000 0.473684 0.68 2.576761
171 year 0.711538 0.375000 0.684211 0.52 2.290749
50 oak 0.923077 0.458333 0.736842 0.16 2.278252
205 fresh 0.711538 0.291667 0.263158 1.00 2.266363
164 red 0.115385 1.000000 1.000000 0.00 2.115385
183 fine 0.596154 0.229167 0.947368 0.24 2.012689
8 vineyard 0.846154 0.520833 0.157895 0.44 1.964882
37 rich 0.769231 0.083333 0.684211 0.36 1.896775
40 white 1.000000 0.041667 0.052632 0.76 1.854298
131 world 0.500000 0.145833 0.631579 0.36 1.637412
228 expect 0.423077 0.270833 0.473684 0.36 1.527594
197 ripe 0.403846 0.104167 0.894737 0.12 1.522750
9 great 0.538462 0.291667 0.631579 0.04 1.501707
161 vintage 0.557692 0.208333 0.526316 0.12 1.412341
65 finish 0.230769 0.145833 0.578947 0.44 1.395550

Then I will plot the proportion of each grape variety for the top 15 most frequent common words:

import matplotlib

# matplotlib >= 3.5: the colormap registry replaces the deprecated cm.get_cmap
cmap = matplotlib.colormaps['cool']
fsize = 12

dfplot.index = dfplot['token']
sort_col = dfplot.columns[dfplot.columns.str.contains('count')]

# creating the proportion columns
llabel = []
for col in sort_col:
    llabel.append(col.replace('count_',''))
    dfplot[col] = dfplot[col] / dfplot['freq']

# plotting
ax = dfplot.loc[list(reversed(dfplot.index[:15])), sort_col].plot(kind='barh', stacked=True, cmap=cmap, 
                                                                  figsize=(10, 6), fontsize=fsize)
ax.legend(llabel,loc='best', bbox_to_anchor=(1., 1.), fontsize=fsize)
ax.set_facecolor('w')
ax.set_frame_on(False)
ax.set_ylabel(ax.get_ylabel(), fontsize=fsize)
plt.show()

3.2 Disjoint Words

Now, let's see the disjoint words, i.e. the words that are unique to a single grape variety.

df_unique = pd.DataFrame()
for gr in varieties:
    cond = ~(df_final.columns.str.contains(gr) | df_final.columns.str.contains('token'))
    ind = df_final[df_final.columns[cond]].isna().all(axis=1)  # unique words: NaN count in every other variety
    tmp = df_final.loc[ind, ~cond]
    tmp.columns = tmp.columns.str.replace('_'+gr, '')
    df_unique = pd.concat([df_unique, tmp], ignore_index=True, sort = True)
df_unique
count grape token
0 1.0 Chardonnay comte
1 1.0 Chardonnay patient
2 5.0 Chardonnay establish
3 2.0 Chardonnay inland
4 3.0 Chardonnay sumptuous
... ... ... ...
1705 1.0 Sauvignon Blanc flow
1706 1.0 Sauvignon Blanc dusky
1707 1.0 Sauvignon Blanc sound
1708 1.0 Sauvignon Blanc enchant
1709 1.0 Sauvignon Blanc soak

1710 rows × 3 columns
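
The trick in the loop above is the isna().all(axis=1) mask: after the outer merge, a token is unique to one variety exactly when its counts for all the other varieties are NaN. A toy illustration:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'count_A': [1.0, 2.0], 'count_B': [3.0, np.nan], 'count_C': [4.0, np.nan]})
# a token is unique to A when its count is NaN for every other variety
print(toy[['count_B', 'count_C']].isna().all(axis=1))
# 0    False
# 1     True
# dtype: bool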

From the plot below, we can see some of the grape characteristics, but these words cannot be the best descriptors, because so far we have only looked at fully common and fully disjoint words. There may be useful features among the partially disjoint words as well.

barplot_wordcounts(df_unique, limit = 10)

4. Term Frequency And The Count Vectorizer

Actually, there are packages that do the word counting for us. We can use CountVectorizer from Scikit-learn to count the term frequencies of a document. However, before we call it we need to preprocess the descriptions. We will define a remove_stop_words_and_lemmatization() function that removes stop words and punctuation and lemmatizes the words in the descriptions.

def remove_stop_words_and_lemmatization(text):

    my_doc = nlp(text)

    # create a list of lemmatized word tokens
    token_list = []
    for token in my_doc:
        token_list.append(token.lemma_)

    # keep only alphabetic lemmas that are not stop words
    filtered_sentence = []
    for word in token_list:
        lexeme = nlp.vocab[word]
        if not lexeme.is_stop and lexeme.is_alpha:
            filtered_sentence.append(word)

    # join the surviving lemmas back into a single string
    return ' '.join(filtered_sentence)
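
A quick spot check on a made-up snippet (the exact output depends on the model version and our custom stop list):

print(remove_stop_words_and_lemmatization("ripe cherries and toasted oak flavours"))
# roughly: 'ripe cherry toast oak'  ('and' is a stop word, 'flavours' is in our custom list)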

We will apply the remove_stop_words_and_lemmatization() function to the description column in the dataframe called grouped.

grouped['description'] = grouped['description'].apply(remove_stop_words_and_lemmatization)

Then, we can create the CountVectorizer and fit it on the wine descriptions to build the vocabulary. We can then use the transform method to obtain a bag-of-words 4 × K sparse matrix, where K is the vocabulary size.

count_vec = CountVectorizer(analyzer='word', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)
count_train = count_vec.fit(grouped.loc[:,'description'])
# based on the fitted vocabulary transform the document into vector representation with counts
bag_of_words = count_vec.transform(grouped.loc[:,'description'])

Now, let's take a look at what the bag of words looks like. We need to convert the sparse matrix to an ndarray so we can inspect it. In the following array, each row corresponds to one grape type, and each number in a row is the term frequency of one word for that grape type.

bag_of_words.toarray()
array([[0, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 3, 4, 1],
       [1, 2, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 5, 3, 1]])
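
Each column of this matrix corresponds to one term of the fitted vocabulary, which CountVectorizer stores in alphabetical order. On scikit-learn 1.0 or newer, you can recover the terms with get_feature_names_out():

feature_names = count_vec.get_feature_names_out()  # scikit-learn >= 1.0
print(feature_names[:5])   # ['aarde' 'abbott' 'ability' 'abound' 'absolute']
print(bag_of_words.shape)  # (4, 2691): four grape varieties x vocabulary size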

Next, we will convert the bag of words to a pandas dataframe:

a=pd.DataFrame(bag_of_words.toarray(), columns=sorted(count_vec.vocabulary_))
# take its transpose
a=a.T
# rename column names
a.columns=grouped.index
a
grape_variety Cabernet Sauvignon Chardonnay Pinot Noir Sauvignon Blanc
aarde 0 1 1 0
abbott 0 0 2 0
ability 0 1 0 0
abound 0 0 1 0
absolute 0 1 0 0
... ... ... ... ...
zealand 1 4 7 6
zesti 0 0 1 0
zesty 0 3 0 5
zingy 0 4 0 3
zippy 0 1 0 1

2691 rows × 4 columns

Next, we calculate the correlation of the term frequencies between the grape varieties and plot it as a heatmap:

a_count = a.corr()
sns.heatmap(a_count,
            xticklabels=a_count.columns,
            yticklabels=a_count.columns,
            annot=True, cmap='Blues', vmin=0, vmax=1)
plt.show()

5. Inverse Document Frequency And The TF-IDF Vectorizer

Another way to vectorize the words is term frequency combined with inverse document frequency (TF-IDF). If you are not familiar with TF-IDF, you can reference this article: TF-IDF. Basically, we can follow the same steps as we did for the CountVectorizer.

tfid_vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)

tfidf_train = tfid_vec.fit(grouped.loc[:,'description'])
bag_of_words = tfid_vec.transform(grouped.loc[:,'description'])

To get a better understanding of what the IDF factor is actually doing, we convert our vocabulary with the corresponding IDF factors into the following data frame.

word_idf = pd.DataFrame(
    {
        'token': list(sorted(tfid_vec.vocabulary_)), 
        'idf': tfid_vec.idf_}
)
word_idf.head(10)
token idf
0 aarde 1.510826
1 abbott 1.916291
2 ability 1.916291
3 abound 1.916291
4 absolute 1.916291
5 absolutely 1.510826
6 abundance 1.916291
7 abundant 1.916291
8 acacia 1.916291
9 access 1.916291

As you can see, there are only four unique IDF values, because we only have four rows of data, i.e. four grape varieties. A higher IDF means a word appears in fewer of the four documents, so let's collect the unique values sorted from rarest to most common:

idfs = np.sort(pd.unique(word_idf['idf'].values))[::-1]  # sort descending: rarest words first
idfs
array([1.91629073, 1.51082562, 1.22314355, 1.        ])
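
To see where these values come from, we can recompute scikit-learn's smoothed IDF by hand. With the default smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents (4 here) and df(t) is the number of documents containing term t:

import numpy as np

n_docs = 4
for df_t in range(1, n_docs + 1):
    print(df_t, np.log((1 + n_docs) / (1 + df_t)) + 1)
# 1 1.916290731874155    <- words appearing in a single description
# 2 1.5108256237659907
# 3 1.2231435513142097
# 4 1.0                  <- words common to all four varieties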

Then, let's print out the proportion of words contained in exactly 1, 2, 3, and 4 documents:

for n_doc, idf in enumerate(idfs, start=1):
    # words with this IDF value appear in exactly n_doc of the four documents
    share = word_idf[word_idf['idf'] == idf].shape[0] / word_idf.shape[0] * 100
    print('Ratio of words contained in %d document(s): %.3f pct' % (n_doc, share))
Ratio of words contained in 1 document(s): 62.430 pct
Ratio of words contained in 2 document(s): 20.810 pct
Ratio of words contained in 3 document(s): 9.699 pct
Ratio of words contained in 4 document(s): 7.061 pct

Here, we obtain the bag-of-words data frame using the TF-IDF method:

a=pd.DataFrame(bag_of_words.toarray(), columns=sorted(tfid_vec.vocabulary_))
a=a.T
a.columns=grouped.index
atfidf = a.corr()
a
grape_variety Cabernet Sauvignon Chardonnay Pinot Noir Sauvignon Blanc
aarde 0.000000 0.004938 0.010125 0.000000
abbott 0.000000 0.000000 0.025683 0.000000
ability 0.000000 0.006263 0.000000 0.000000
abound 0.000000 0.000000 0.012842 0.000000
absolute 0.000000 0.006263 0.000000 0.000000
... ... ... ... ...
zealand 0.008606 0.013074 0.046909 0.045126
zesti 0.000000 0.000000 0.012842 0.000000
zesty 0.000000 0.014814 0.000000 0.056814
zingy 0.000000 0.019752 0.000000 0.034089
zippy 0.000000 0.004938 0.000000 0.011363

2691 rows × 4 columns

Then, let us again compare the correlation between the different grape varieties using these TF-IDF results.

sns.heatmap(atfidf,
            xticklabels=atfidf.columns,
            yticklabels=atfidf.columns,
            annot=True, cmap='Blues', vmin=0, vmax=1)
plt.show()

Finally, let us compare the correlation between the common words of the different grape varieties using TF-IDF. The common words are the ones that appear in all four documents, i.e. those with the lowest IDF:

# idfs[3] is the lowest IDF, i.e. words present in all four documents
acommon = a[a.index.isin(list(word_idf[word_idf['idf'] == idfs[3]]['token']))].corr()
sns.heatmap(acommon,
            xticklabels=acommon.columns,
            yticklabels=acommon.columns,
            annot=True, cmap='Blues', vmin=0, vmax=1)
plt.show()

6. Conclusion

Overall, we performed some text analysis using spaCy to find the characteristic features of grape varieties in the wine descriptions. We also obtained bag-of-words representations using the CountVectorizer and TfidfVectorizer classes from the Scikit-learn package.

We can clearly see that TfidfVectorizer performs better, because it gives us lower correlations between the grape varieties compared to CountVectorizer. Therefore, we will use TfidfVectorizer in the next part to predict grape varieties with a Random Forest Classifier.