Introduction

In Part I we found that the TfidfVectorizer performs the best. Therefore, in this project we will combine this NLP model with several classifiers and test the model performance in different scenarios. We will mainly focus on the Random Forest Classifier. We will also perform hyperparameter optimization and k-fold cross validation to find the best parameters for the Random Forest Classifier.

Pre-Processing

First, import all the necessary packages:

import pandas as pd
import re
import numpy as np
import seaborn as sns
from functools import reduce
import matplotlib.pyplot as plt

# sklearn packages
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.dummy import DummyClassifier
from xgboost import XGBClassifier

from collections import Counter
import spacy
nlp = spacy.load("en_core_web_sm")
plt.style.use('seaborn-darkgrid')

Loading Data

Then load our data:

a0 = pd.read_csv("my_wine_data_clean.csv")
a0.head()
Unnamed: 0 name grape_variety description
0 0 Edouard Delaunay 'Septembre' Chardonnay 2019, ... Chardonnay One half of the Abbotts & Delaunay powerhouse,...
1 1 The Ned Waihopai River Sauvignon Blanc 2020 Ma... Sauvignon Blanc Our bestselling white wine. Winemaker Brent Ma...
2 2 The King's Favour Sauvignon Blanc 2019/20 Marl... Sauvignon Blanc Brent Marris is winemaking royalty. And it tur...
3 3 Château Livran 2014, Médoc Cabernet Sauvignon Château Livran once belonged to both Edward I ...
4 4 Cave de Lugny 'Reserve' Mâcon-Chardonnay 2019 Chardonnay The small village of Mâcon-Chardonnay isn’t co...

Below is the wine description for The Ned Waihopai River Sauvignon Blanc 2020 Marlborough:

a0.loc[1, 'description']
'Our bestselling white wine. Winemaker Brent Marris’ father was the first person to plant Sauvignon vines in Marlborough. It seems extraordinary that in a single generation a whole new type of wine could have been developed. Brent has had just as big an impact on Kiwi Sauvignon as his dad. He’s got rafts of awards, and has launched multiple famous wine brands. Suffice to say, he really knows what he’s doing and that comes through in The Ned Waihopai River Sauvignon Blanc. Its magic is that it is so consistently, definitively a Marlborough Sauvignon. It’s tropical, it’s citrussy – it’s the most refreshing thing you could pour into your glass at 6pm. Full of gooseberry and grapefruit flavours, it’s grassy, tropical and seriously aromatic. Fantastic with Thai-style fish cakes.'

Split Train and Test Set

Next, we need to split our data into test and train sets.

combined_features = ['description']
target = 'grape_variety'

X_train, X_test, y_train, y_test = train_test_split(a0[combined_features], a0[target], 
                                                    test_size=0.30, random_state=42)

Check the first five descriptions and the first five grape types in the train set:

X_train.head()
description
116 Produced exclusively from top quality Chardonn...
45 Duckhorn Vineyards was one of the first forty ...
16 Duckhorn Vineyards was one of the first forty ...
465 You might think California is too hot for the ...
358 You might think that Australia's too warm for ...
y_train.head()
116    Chardonnay
45     Chardonnay
16     Chardonnay
465    Pinot Noir
358    Pinot Noir
Name: grape_variety, dtype: object
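
It is also worth checking how the grape varieties are distributed across the two sets; if the split looked badly skewed we could re-run train_test_split with stratify=a0[target]. A quick check:

# Class distribution in the train and test sets
print(y_train.value_counts())
print(y_test.value_counts())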

Defining Selectors

We will need to define a class that selects a text-based column from the input data frame:

class TextSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None, *parg, **kwarg):
        return self

    def transform(self, X):
        # return the selected column as a Series of text
        return X[self.key]
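
As a quick sanity check (a sketch, not part of the modelling pipeline), the transformer simply hands back the chosen column as a Series of raw text, which is what TfidfVectorizer expects as input:

selector = TextSelector(key='description')
selector.fit_transform(X_train).head()   # same as X_train['description'].head()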

Text Cleaning

Recall our function for removing stop words, punctuation, and lemmatization. In addition, add some custom stop words:

customize_stop_words = ["wine","fruit", "flavour", 'aromas', 'palate', 'chardonnay',
                        'note','sauvignon', 'blanc', 'de', 'pinot', 'noir', 'cabernet', 'best']
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True

def remove_stop_words_and_lemmatization(text):

    my_doc = nlp(text)

    # Create a list of lemmatized word tokens
    token_list = [t.lemma_ for t in my_doc]

    # Keep only alphabetic tokens that are not stop words
    filtered_sentence = []
    for word in token_list:
        lexeme = nlp.vocab[word]
        if not lexeme.is_stop and lexeme.is_alpha:
            filtered_sentence.append(word)

    return ' '.join(filtered_sentence)

X_train['description'] = X_train['description'].apply(remove_stop_words_and_lemmatization)
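
To see what the cleaning does, we can run the function on a made-up sentence (the exact tokens depend on the spaCy model version):

# hypothetical example sentence, not from the dataset
sample = "The wines were bursting with ripe tropical fruits and fresh acidity!"
remove_stop_words_and_lemmatization(sample)
# stop words, punctuation, and custom terms such as 'wine' and 'fruit' drop out,
# and the surviving tokens are lemmatized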

Then, we can call the TfidfVectorizer and put it into a Pipeline. In a pipeline we chain together all kinds of actions on the data into one stable flow. This is similar to the %>% pipe operator in R: the output of the first step becomes the input of the next step.

vec_tfidf = TfidfVectorizer(ngram_range=(1,1), analyzer='word', norm='l2')
text = Pipeline([
                ('selector', TextSelector(key='description')),
                ('vectorizer', vec_tfidf)
                ])
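
Conceptually, fitting this two-step pipeline is the same as running the steps by hand; a rough sketch of the equivalence:

# what the 'text' pipeline does under the hood
selected = TextSelector(key='description').fit_transform(X_train)   # pick the text column
manual_matrix = vec_tfidf.fit_transform(selected)                    # then TF-IDF vectorize it
pipeline_matrix = text.fit_transform(X_train)                        # one call does both steps
manual_matrix.shape == pipeline_matrix.shape                         # same document-term matrix shape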

Random Forest Classifier

The last step in our pipeline is to call the classifier. The first classifier that we are going to test is the Random Forest. It is an ensemble classifier of decision trees and it tends to be more accurate than a single decision tree classifier.

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree. You can find more information about the random forest from its documentation.
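
For instance, the amount of data and the number of features each tree sees can be set explicitly; the values below are purely illustrative and assume scikit-learn 0.22+ for max_samples (they are not the settings used in this project):

# illustrative only: 200 trees, each grown on a bootstrap sample of 80% of the rows,
# considering sqrt(n_features) candidate features at every split
rf_example = RandomForestClassifier(n_estimators=200,
                                    bootstrap=True,
                                    max_samples=0.8,
                                    max_features='sqrt',
                                    random_state=42)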

clf = RandomForestClassifier(random_state=42)
pipe = Pipeline([('description', text),
                 ('clf',clf)
                 ]) 

Train and Test The Model

Let's fit our model on the train set; you can use %timeit to check how long the fit takes.

%timeit pipe.fit(X_train, y_train)
/Applications/Jupter/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
36.3 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Before we check our model performance, we can define the following function to print out the important statistics: the accuracy, the classification report, and the confusion matrix.

Accuracy is the number of wines whose grape variety was correctly predicted divided by the total number of wines.

The classification report is a concise way of presenting estimator performance through the following metrics: precision, recall, f1-score, and support (the number of samples belonging to each target class). A good classifier has values close to 1 for both precision and recall, and therefore for the f1-score too.

The confusion matrix is again a simple way to present how many grape types were correctly identified (diagonal elements), while the off-diagonal elements tell us how many samples were classified as another target type. Obviously, one would like to decrease the values of the off-diagonal elements to get perfect classification. The vertical axis represents the true class of the target, while the horizontal axis shows the predicted value of the target.

def print_stats(preds, target, labels, sep='-', sep_len=40, fig_size=(10,8)):
    print('Accuracy = %.3f' % metrics.accuracy_score(target, preds))
    print(sep*sep_len)
    print('Classification report:')
    print(metrics.classification_report(target, preds))
    print(sep*sep_len)
    print('Confusion matrix')
    cm=metrics.confusion_matrix(target, preds)
    cm = cm / np.sum(cm, axis=1)[:,None]
    sns.set(rc={'figure.figsize':fig_size})
    sns.heatmap(cm, 
        xticklabels=labels,
        yticklabels=labels,
           annot=True, cmap = 'YlGnBu')
    plt.pause(0.05)
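
To make the report concrete, here is a tiny hand-made example (toy labels, unrelated to the wine data):

y_true_toy = ['Chardonnay', 'Chardonnay', 'Chardonnay', 'Pinot Noir', 'Pinot Noir', 'Pinot Noir']
y_pred_toy = ['Chardonnay', 'Chardonnay', 'Pinot Noir', 'Pinot Noir', 'Pinot Noir', 'Chardonnay']
metrics.accuracy_score(y_true_toy, y_pred_toy)   # 4 correct out of 6 ≈ 0.67
# for 'Chardonnay': precision = 2/3 (2 of the 3 predicted Chardonnay are right),
#                   recall    = 2/3 (2 of the 3 true Chardonnay were found)
print(metrics.classification_report(y_true_toy, y_pred_toy))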

Here is our model performance for the train set:

preds = pipe.predict(X_train)
print_stats(preds, y_train, pipe.classes_, fig_size=(7,4))
Accuracy = 0.994
----------------------------------------
Classification report:
                    precision    recall  f1-score   support

Cabernet Sauvignon       1.00      0.98      0.99        59
        Chardonnay       0.99      1.00      0.99       159
        Pinot Noir       1.00      1.00      1.00        71
   Sauvignon Blanc       1.00      0.98      0.99        63

          accuracy                           0.99       352
         macro avg       1.00      0.99      0.99       352
      weighted avg       0.99      0.99      0.99       352

----------------------------------------
Confusion matrix

From the statistics above, we can see the accuracy is 0.994, which is expected since the model was trained on this data. Next, we want to check the statistics for the test set.

Predictions on Test set

Below is our model performance for the test set. The accuracy is 0.868.

preds = pipe.predict(X_test)
print_stats(y_test, preds, pipe.classes_)
Accuracy = 0.868
----------------------------------------
Classification report:
                    precision    recall  f1-score   support

Cabernet Sauvignon       0.91      0.91      0.91        33
        Chardonnay       0.98      0.82      0.89        72
        Pinot Noir       0.67      0.91      0.77        22
   Sauvignon Blanc       0.79      0.92      0.85        24

          accuracy                           0.87       151
         macro avg       0.84      0.89      0.85       151
      weighted avg       0.89      0.87      0.87       151

----------------------------------------
Confusion matrix

Hyperparameter Tuning Using GridSearchCV

Next, let's see if we can make the model more accurate by finding its best parameters using GridSearchCV. There are a certain number of parameters that can be adjusted to improve the performance of a classifier; this is hyperparameter tuning. We will improve the Random Forest classifier by using a grid search technique over predefined parameter values and applying cross validation. All of this can be done with the GridSearchCV class.

# classifier and pipeline definition
clf = RandomForestClassifier(random_state=42)
pipe = Pipeline([('description', text),
                 ('clf',clf)
                 ])

# definition of parameter grid to scan through
param_grid = {
    #'clf__max_depth': [60, 100, 140],
    'clf__max_features': ['log2', 'auto', None],
    #'clf__min_samples_leaf': [5,10,50,100,200],
    'clf__n_estimators': [100, 500, 1000]
}

# grid search cross validation instantiation
grid_search = GridSearchCV(estimator = pipe, param_grid = param_grid, 
                          cv = 3, n_jobs = 1, verbose = 0)# try cv = 5 or 10

#hyperparameter fitting
grid_search.fit(X_train, y_train)

Let's see the mean validation accuracy from the cross validation for each parameter combination:

grid_search.cv_results_['mean_test_score']
array([0.78977273, 0.80113636, 0.79261364, 0.86647727, 0.84659091,
       0.85227273, 0.85795455, 0.86079545, 0.86079545])

There were 9 combinations of the input parameters (3 values of max_features × 3 values of n_estimators), so there are 9 accuracies.
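
A convenient way to line the scores up with the parameter values is to put cv_results_ into a DataFrame, for example:

cv_df = pd.DataFrame(grid_search.cv_results_)
cv_df[['param_clf__max_features', 'param_clf__n_estimators', 'mean_test_score']].sort_values(
    'mean_test_score', ascending=False)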

Let us check the best parameter combination:

grid_search.best_params_
{'clf__max_features': 'auto', 'clf__n_estimators': 100}

Now, let us create a classifier with these inputs:

clf_opt=grid_search.best_estimator_

We can verify the parameters this classifier uses, just to make sure everything is as we intended:

clf_opt.named_steps['clf'].get_params()
{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

Next, let's train it and test it on our test set. The accuracy increased from 0.868 to 0.874, a small improvement of about 0.6 percentage points.

clf_opt.fit(X_train, y_train)
preds = clf_opt.predict(X_test)
print_stats(y_test, preds, clf_opt.classes_)
Accuracy = 0.874
----------------------------------------
Classification report:
                    precision    recall  f1-score   support

Cabernet Sauvignon       0.94      1.00      0.97        31
        Chardonnay       0.95      0.78      0.86        73
        Pinot Noir       0.70      0.88      0.78        24
   Sauvignon Blanc       0.82      1.00      0.90        23

          accuracy                           0.87       151
         macro avg       0.85      0.91      0.88       151
      weighted avg       0.89      0.87      0.87       151

----------------------------------------
Confusion matrix
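
As an aside, one way to see which words the tuned forest leans on most is to pair the TF-IDF vocabulary with the forest's feature importances (a sketch; older scikit-learn versions expose get_feature_names() instead of get_feature_names_out()):

vectorizer = clf_opt.named_steps['description'].named_steps['vectorizer']
forest = clf_opt.named_steps['clf']
try:
    terms = vectorizer.get_feature_names_out()   # scikit-learn >= 1.0
except AttributeError:
    terms = vectorizer.get_feature_names()       # older versions
importances = pd.Series(forest.feature_importances_, index=terms)
importances.sort_values(ascending=False).head(10)   # ten most informative words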

Other Classifiers

Next, we can fit our data with different classifiers to find the best model for predicting the grape variety.

Dummy Classifier

DummyClassifier is a classifier that makes predictions using simple rules. After fitting the DummyClassifier on our train set, our model accuracy on the test set is only 0.325, significantly lower than the random forest.

clf = DummyClassifier(strategy='stratified',random_state=42)
pipe = Pipeline([('description', text),
                 ('clf',clf)
                 ])
%timeit pipe.fit(X_train, y_train)
# test stats
preds = pipe.predict(X_test)
print_stats(y_test, preds, pipe.classes_)
20.5 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Accuracy = 0.325
----------------------------------------
Classification report:
                    precision    recall  f1-score   support

Cabernet Sauvignon       0.21      0.26      0.23        27
        Chardonnay       0.50      0.45      0.48        66
        Pinot Noir       0.17      0.16      0.16        32
   Sauvignon Blanc       0.25      0.27      0.26        26

          accuracy                           0.32       151
         macro avg       0.28      0.28      0.28       151
      weighted avg       0.33      0.32      0.33       151

----------------------------------------
Confusion matrix
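
The 'stratified' strategy guesses labels at random following the training class frequencies; an even simpler baseline is to always predict the majority class (an illustrative sketch, not fitted above):

pipe_mf = Pipeline([('description', text),
                    ('clf', DummyClassifier(strategy='most_frequent'))])
pipe_mf.fit(X_train, y_train)    # the features are effectively ignored
pipe_mf.score(X_test, y_test)    # accuracy of always predicting the majority class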

XGBClassifier

Next, let's look at the performance of one of the most widely used boosting ensemble classifiers, the XGBClassifier from the xgboost package. Gradient boosting adds predictors sequentially, each one correcting its predecessors: every new tree is fitted to the residuals of the previous prediction so that adding it reduces the overall loss.

The accuracy is 0.887, which is the highest among all the classifiers.

clf = XGBClassifier(random_state=42, n_jobs=1)
pipe = Pipeline([('description', text),
                 ('clf',clf)
                 ])
%timeit pipe.fit(X_train, y_train)
# test stats
preds = pipe.predict(X_test)
print_stats(y_test, preds, pipe.classes_)
4.34 s ± 382 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Accuracy = 0.887
----------------------------------------
Classification report:
                    precision    recall  f1-score   support

Cabernet Sauvignon       1.00      0.89      0.94        37
        Chardonnay       0.95      0.85      0.90        67
        Pinot Noir       0.77      0.96      0.85        24
   Sauvignon Blanc       0.75      0.91      0.82        23

          accuracy                           0.89       151
         macro avg       0.87      0.90      0.88       151
      weighted avg       0.90      0.89      0.89       151

----------------------------------------
Confusion matrix
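
The boosting behaviour described above is governed mainly by the number of boosting rounds, the learning rate, and the depth of each tree; the values below are illustrative only (the run above used the XGBClassifier defaults):

clf_xgb = XGBClassifier(n_estimators=300,    # boosting rounds: trees added one after another
                        learning_rate=0.1,   # how strongly each new tree corrects the previous ones
                        max_depth=4,         # depth of each individual tree
                        random_state=42, n_jobs=1)
pipe_xgb = Pipeline([('description', text), ('clf', clf_xgb)])
# pipe_xgb.fit(X_train, y_train); pipe_xgb.score(X_test, y_test)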

Conclusion

In summary, we fitted several models to predict the grape variety. The XGBClassifier is the most accurate, followed by the random forest classifier, and the Dummy Classifier has the worst performance. In fact, the accuracy difference between the XGBClassifier and the random forest classifier is quite small. For this project we will stick with the XGBClassifier because it has the highest accuracy.
