Training a Computer to Blind Taste Wine (Part II)
An NLP Application with spaCy and Random Forest Classifier, inspired by this Wine Blog
In Part I we found that the TfidfVectorizer performs the best. Therefore, in this project we will combine this NLP model with several classifiers and test the model performance in different scenarios. We will mainly focus on the Random Forest classifier. We will also perform hyperparameter optimization and k-fold cross-validation to find the best parameters for the Random Forest classifier.
First, import all the necessary packages:
import pandas as pd
import re
import numpy as np
import seaborn as sns
from functools import reduce
import matplotlib.pyplot as plt
# sklearn packages
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.dummy import DummyClassifier
from xgboost import XGBClassifier
from collections import Counter
import spacy
nlp = spacy.load("en_core_web_sm")
plt.style.use('seaborn-v0_8-darkgrid')  # named 'seaborn-darkgrid' in matplotlib < 3.6
Then load our data:
a0 = pd.read_csv("my_wine_data_clean.csv")
a0.head()
Below is the wine description for The Ned Waihopai River Sauvignon Blanc 2020 Marlborough:
a0.loc[1, 'description']
Next, we need to split our data into test and train sets.
combined_features = ['description']
target = 'grape_variety'
X_train, X_test, y_train, y_test = train_test_split(a0[combined_features], a0[target],
test_size=0.30, random_state=42)
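If the grape varieties are imbalanced, it can also help to stratify the split so that the train and test sets keep the same class proportions. Here is a minimal sketch of that optional variant (not the split we use below):
# optional: a stratified split keeps the class proportions identical in both sets
X_train, X_test, y_train, y_test = train_test_split(
    a0[combined_features], a0[target],
    test_size=0.30, random_state=42,
    stratify=a0[target])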
Check the first five descriptions and the first five grape types in the train set:
X_train.head()
y_train.head()
We will need to define a class that selects a text column from the input DataFrame:
class TextSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform
    additional transformations on. Use on text columns in the data.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None, *args, **kwargs):
        # stateless: there is nothing to learn from the data
        return self

    def transform(self, X):
        # return the selected column as a pandas Series of strings
        return X[self.key]
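To see what the transformer does on its own, here is a quick hypothetical check; the output should simply be the description column as a Series:
selector = TextSelector(key='description')
selector.fit_transform(X_train).head()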
Recall our functions for removing stop words and punctuation and for lemmatization. In addition, add some custom stop words:
customize_stop_words = ["wine","fruit", "flavour", 'aromas', 'palate', 'chardonnay',
'note','sauvignon', 'blanc', 'de', 'pinot', 'noir', 'cabernet', 'best']
for w in customize_stop_words:
nlp.vocab[w].is_stop = True
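A quick sanity check, purely for illustration, that the custom stop words were registered:
# both custom and built-in stop words should now be flagged
print(nlp.vocab["wine"].is_stop)  # True (custom)
print(nlp.vocab["the"].is_stop)   # True (built-in)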
def remove_stop_words_and_lemmatization(text):
    my_doc = nlp(text)
    # create a list of lemmatized word tokens
    token_list = [t.lemma_ for t in my_doc]
    # keep only tokens that are alphabetic and not stop words
    filtered_sentence = []
    for word in token_list:
        lexeme = nlp.vocab[word]
        if not lexeme.is_stop and lexeme.is_alpha:
            filtered_sentence.append(word)
    return ' '.join(filtered_sentence)
X_train['description'] = X_train['description'].apply(remove_stop_words_and_lemmatization)
# apply the same preprocessing to the test set, so train and test descriptions are transformed consistently
X_test['description'] = X_test['description'].apply(remove_stop_words_and_lemmatization)
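As a rough illustration of what the preprocessing does, here is a made-up example sentence (the exact output depends on the spaCy model version):
sample = "The aromas of ripe cherries lingered on the palate."
print(remove_stop_words_and_lemmatization(sample))
# likely prints something like 'aroma ripe cherry linger': stop words and
# punctuation are dropped and the remaining tokens are lemmatized; note
# that 'aromas' was registered as a stop word only in its plural form,
# so its lemma 'aroma' can slip through the filter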
Then, we can call the TfidfVectorizer and put it into a Pipeline. In a pipeline we chain together all kinds of actions on the data into one stable flow. This is similar to the %>% pipe operator in R: each step runs on the result of the previous one.
vec_tfidf = TfidfVectorizer(ngram_range=(1,1), analyzer='word', norm='l2')
text = Pipeline([
    ('selector', TextSelector(key='description')),
    ('vectorizer', vec_tfidf)
])
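As a quick check of the assembled text pipeline, fit_transform should return a sparse TF-IDF matrix with one row per description (a hypothetical inspection, for illustration):
tfidf_matrix = text.fit_transform(X_train)
print(tfidf_matrix.shape)  # (number of train descriptions, vocabulary size)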
The last step in our pipeline is to call the classifier. The first classifier that we are going to test is the Random Forest. It is an ensemble classifier of decision trees and it tends to be more accurate than a single decision tree classifier.
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree. You can find more information about the random forest from its documentation.
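For illustration, here is a minimal sketch of the bootstrap behaviour described above (max_samples=0.8 is a hypothetical choice, not the classifier we use below):
rf_sketch = RandomForestClassifier(
    n_estimators=200,  # number of trees in the ensemble
    bootstrap=True,    # each tree is trained on a bootstrap sample of the rows...
    max_samples=0.8,   # ...drawn as 80% of the training-set size
    random_state=42)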
clf = RandomForestClassifier(random_state=42)
pipe = Pipeline([('description', text),
('clf',clf)
])
Let's fit our model on the train set. You can use %timeit to check how long it takes to fit the model (keep in mind that %timeit runs the statement several times to average the timing; %time is a cheaper alternative for a slow fit).
%timeit pipe.fit(X_train, y_train)
Before we check our model performance, we can use the following function to print out the important statistics: the accuracy, the classification report, and the confusion matrix.
Accuracy is the number of correctly predicted grape types divided by the total number of samples.
The classification report is a concise way of presenting estimator performance through the following metrics: precision, recall, f1-score, and the number of samples belonging to each target class. A good classifier has values close to 1 for both precision and recall, and therefore for the f1-score too.
The confusion matrix is a simple way to present how many grape types were correctly identified (the diagonal elements), while the off-diagonal elements tell us how many samples were classified as another target type. Obviously, one would like to decrease the values of the off-diagonal elements to get perfect classification. The vertical axis represents the true class of the target, while the horizontal axis shows the predicted class.
def print_stats(preds, target, labels, sep='-', sep_len=40, fig_size=(10,8)):
    print('Accuracy = %.3f' % metrics.accuracy_score(target, preds))
    print(sep*sep_len)
    print('Classification report:')
    print(metrics.classification_report(target, preds))
    print(sep*sep_len)
    print('Confusion matrix')
    cm = metrics.confusion_matrix(target, preds)
    # normalize each row so the entries are fractions of the true class
    cm = cm / np.sum(cm, axis=1)[:, None]
    sns.set(rc={'figure.figsize': fig_size})
    sns.heatmap(cm,
                xticklabels=labels,
                yticklabels=labels,
                annot=True, cmap='YlGnBu')
    plt.show()
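To see how the confusion matrix reads, consider a tiny made-up two-class example:
true = ['red', 'red', 'white', 'white']
pred = ['red', 'white', 'white', 'white']
print(metrics.confusion_matrix(true, pred, labels=['red', 'white']))
# [[1 1]    one 'red' correctly identified, one misclassified as 'white'
#  [0 2]]   both 'white' samples correctly identified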
Here is our model performance for the train set:
preds = pipe.predict(X_train)
print_stats(preds, y_train, pipe.classes_, fig_size=(7,4))
From the statistics above, we can see the accuracy is 0.994, which is expected, since the model is being evaluated on the same data it was trained on. Next we want to check the statistics for the test set.
Below is our model performance for the test set. The accuracy is 0.841.
preds = pipe.predict(X_test)
print_stats(preds, y_test, pipe.classes_)
Next, let's see if we can make the model more accurate by finding its best parameters using GridSearchCV. There are a number of parameters that can be adjusted to improve the performance of a classifier; this is hyperparameter tuning. We will improve the Random Forest classifier by using a grid search technique over predefined parameter values and apply cross-validation. All this can be done with the GridSearchCV class.
# classifier and pipeline definition
clf = RandomForestClassifier(random_state=42)
pipe = Pipeline([('description', text),
('clf',clf)
])
# definition of parameter grid to scan through
param_grid = {
    #'clf__max_depth': [60, 100, 140],
    'clf__max_features': ['log2', 'sqrt', None],  # 'sqrt' replaces the 'auto' alias removed in recent scikit-learn
    #'clf__min_samples_leaf': [5, 10, 50, 100, 200],
    'clf__n_estimators': [100, 500, 1000]
}
# grid search cross validation instantiation
grid_search = GridSearchCV(estimator = pipe, param_grid = param_grid,
cv = 3, n_jobs = 1, verbose = 0)# try cv = 5 or 10
#hyperparameter fitting
grid_search.fit(X_train, y_train)
Let's look at the mean accuracy on the held-out folds of the cross-validation:
grid_search.cv_results_['mean_test_score']
There were 9 combinations of the input parameters, so there are 9 accuracies.
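To pair each parameter combination with its score, one can load cv_results_ into a DataFrame (a hypothetical inspection, for illustration):
cv_df = pd.DataFrame(grid_search.cv_results_)
print(cv_df[['params', 'mean_test_score', 'rank_test_score']])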
Let us check the best parameter combination:
grid_search.best_params_
Now, let us create a classifier with these inputs:
clf_opt=grid_search.best_estimator_
We can verify the parameters this classifier uses, just to make sure everything is as we desired:
clf_opt.named_steps['clf'].get_params()
Next, let's train it and test it on our test set. The accuracy increased from 0.868 to 0.874, an improvement of less than one percentage point.
clf_opt.fit(X_train, y_train)
preds = clf_opt.predict(X_test)
print_stats(preds, y_test, clf_opt.classes_)
Then, we can fit our data using different classifiers to find out the best model to predict the grape variety.
DummyClassifier is a classifier that makes predictions using simple rules, which makes it a useful baseline. After we fit the DummyClassifier on our train set, the model accuracy on the test set is only 0.325, significantly lower than the random forest.
clf = DummyClassifier(strategy='stratified',random_state=42)
pipe = Pipeline([('description', text),
('clf',clf)
])
%timeit pipe.fit(X_train, y_train)
# test stats
preds = pipe.predict(X_test)
print_stats(preds, y_test, pipe.classes_)
Next, let's see the performance of the most widely used boosting ensemble classifier, the XGBClassifier from the xgboost package. Gradient boosting adds predictors sequentially, each one correcting its predecessors: every new model is fit to the residuals of the current prediction, so that adding it reduces the loss. (Note that recent versions of xgboost expect numeric class labels, so you may need to encode the grape varieties, e.g. with sklearn's LabelEncoder, before fitting.)
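To make the residual-fitting idea concrete, here is a minimal toy sketch of two boosting rounds on a regression target (illustrative only, not part of our pipeline):
from sklearn.tree import DecisionTreeRegressor
# toy data, purely illustrative
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.sin(X_toy).ravel()
# round 1: fit a weak learner to the raw target
tree1 = DecisionTreeRegressor(max_depth=2).fit(X_toy, y_toy)
pred = tree1.predict(X_toy)
# round 2: fit the next learner to the residuals left by round 1
tree2 = DecisionTreeRegressor(max_depth=2).fit(X_toy, y_toy - pred)
pred = pred + 0.1 * tree2.predict(X_toy)  # a learning rate shrinks each correction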
The accuracy is 0.887, which is the highest among all the classifiers.
clf = XGBClassifier(random_state=42, n_jobs=1)
pipe = Pipeline([('description', text),
('clf',clf)
])
%timeit pipe.fit(X_train, y_train)
# test stats
preds = pipe.predict(X_test)
print_stats(preds, y_test, pipe.classes_)
In summary, we fitted several models to predict the grape varieties. The XGBClassifier is the most accurate, followed by the random forest classifier, while the DummyClassifier has the worst performance. In fact, the accuracy difference between the XGBClassifier and the random forest classifier is not that big. In this project we will stick with the XGBClassifier because it has the highest accuracy.