Did you ever had the chance to look at wine reviews? If not, here's a glimpse:
Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish.
Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.
Samples taken from here.
It should be permitted to ask whether such reviews are just purple gibberish, or the esteemed spot-on judgment of a connoisseur. I wouldn't know! But maybe ML can tell us whether there is something to it.
The dataset consists of ~130k wine reviews that were scraped from the WineEnthusiast (winemag.com) website in 2017. It is released under CC BY-NC_SA 4.0 licenses and can be downloaded on kaggle.
Each review contains additional information about the wine (designation, country, province, region, winery, variety and price), the reviewer (identifier) and the review (text, score [1-100])
The goal of this project is to determine if wine reviews can be used to determine the grape variety of the reviewed wine. If we can train a machine learning model to predict the wine variety with some accuracy, then wine reviews must, at least occasionally, contain some information that is specific to a grape variety of the reviewed wine.
Hence we will define and train a machine learning model to predict the wine variety based on the written review of a wine. We will use existing wine review data and a combination of traditional and deep learning ML methods.
All of the 5 models we defined perform much better than educated guessing base on marginal label probability (baseline).
Depicted below are cross-validated accuracy and averaged f-score.
Interestingly, among the 5 models, the one that does not rely on deep learning for the feature engineering, the Common Words model, performs best.
The project follows largely the IBM cloud garage methodology for data science.
The order of the notebook list here-below follows the project development structure and contains all relevant information. Additional content that is present in the project is used or created by these notebooks.
-
reviewed_grapes.data_exp.pandas.v1.ipynb
- Data quality assessment
- Exploration of the original dataset
- Identification of required preprocessing
-
reviewed_grapes.etl.pandas.v1.ipynb
- Implementation of required preprocessing
- Conversion to data structure suitable for feature creation
-
reviewed_grapes.model_baseline.pyspark.v1.ipynb
- Implementation, training, evaluation and deployment of the baseline model
-
reviewed_grapes.feature_eng.pyspark.v1.ipynb
- Feature engineering steps including NLP and deep learning
- Definition of custom pyspark Transformer and Estimator sub-classes that implement the required feature engineering steps
-
reviewed_grapes.feature_eng_SI.pyspark_keras_elephas.v1.ipynb (supplemental information, unused downstream)
- word-2-vector encoding of review words and wine varieties based on keras implementation of an auto-encoder
- deployment of this word-2-vector encoding routine on spark worker with the elephas package.
-
reviewed_grapes.model_def_train.pyspark.v1.ipynb
- Definition of the various models as spark ML-Pipelines
- Training of the models
-
reviewed_grapes.model_evaluation.pyspark.v1.ipynb
- Evaluation of our trained models
- Exploration of several alternative models
- Hyperparameter tuning for some of the models
-
reviewed_grapes.model_deployment.pyspark.v1.ipynb
- Adapting the fitted models to make them deployable
- Packing of fitted models and necessary wrapper class to utilize python's setuptools for distribution
reviewed_grapes/
: Python package that holds fitted models and all necessary scripts to deploy them (see Model Deployment for details)data/
: Contains the original dataset, as well as any intermediary datautils/
: python package containing definitions of custom pyspark estimators and transformers used to construct our models.
The project can be cloned form GitHub:
git clone git@github.com:j-i-l/ReviewedGrapes.git
To set up the project, create a virtual environment and install all dependencies with:
pip install -r requirements.txt
You should also fetch the wine review dataset from kaggle and place the *.csv files under data/.
After these steps you should be able to run all the notebooks provided with this project. For further details on the project structure refer to the Project Structure section.
Several trained models can be found under reviewed_grapes/fitted_models/ and are readily deployable as spark MP-Pipelines.
For pyspark users trained models are made available in the reviewed_grapes
package that can be installed with the provided setup.py.
In short:
-
Fire-up your virtual environment.
-
Install the package with the trained models:
pip install git+https://github.com/j-i-l/ReviewedGrapes.git
Alternatively, you can also clone and then install it:
git clone git@github.com:j-i-l/ReviewedGrapes.git
cd ReviewedGrapes
pip install ./
Now you can simply import and use the fitted models:
>>> from reviewed_grapes import CommonWordsModel
>>> sentence_df = spark.createDataFrame(
[("A superbe red wine with blackberry and stuff.",),
("Acid dark too strong for me.",),
("Tart and snappy, supple plum aroma.",)],
["review"])
>>> cmw = CommonWordsModel(inputCol='review', outputCol='predicted variety')
>>> cmw.transform(sentenceDataFrame).select('review', 'predicted variety').show()
+--------------------+------------------+
| review| predicted variety|
+--------------------+------------------+
|A superbe red win...|cabernet sauvignon|
|Acid dark too str...| pinot noir|
|Tart and snappy, ...| pinot noir|
+--------------------+------------------+
The general idea behind the models we define is to track the presence/absence of a set of specific words in the review text.
The general pattern is as follows:
- Convert the wine variety column to a categorical label.
- Render each word in the review text to a canonical form.
- Use a predefined set of words, a target word set, to create a feature vector of binary features, each indicating the presence or absence of one of the words in the target word set.
- Define a ML model that can predict the label based on this features.
- Train the model on a training set and assess its performance.
- Deploy the model.
What will define our different models is how we define the target word set in step 3. It is also the point where we will implement and use some deep learning methods.
The aim of this model is to establish a baseline in terms of performance. It is a non-predictive model in the sense that it does not use the review text at all. Instead it simply relies on the label frequencies of the training data and uses their marginal probabilities to make predictions.
Here we count the occurrence of all different words in the reviews of the training dataset, rank them by count and take the x-most common words to be our target word set.
This approach is based on a more sophisticated feature engineering to create the target word set. We first train an autoencoder on word pairs, i.e. we create a word-to-vector map. For each word in the set of canonical words in a review (from general step 2) we combine it with the wine variety and compile a dataset of word-variety mapping pairs. We train a word2vec deep learning network and use the hidden layer as vector representations of all words and varieties.
With this embedding of review words and varieties into a vector space we can now construct target word sets based on the relation between vectors. In particular we can measure similarity between all members in this vector space and assemble target word sets based on similarity considerations.
A first approach is to create the target word set by picking for all varieties present in the training dataset the most similar words.
Another option is the exact opposite and pick the least similar words, i.e. the vectors that point into an opposite direction.
As a third variation we pick the most and the least similar words for each variety.
Finally, we note that the presence any word from the target word set should give us some information about what wine variety was concerned in that review. Words that are equally likely to be present in any review are thus bad candidates for the target words set. We want words that are found only in reviews from some varieties and not in others. In terms of the vector space we created, we are looking for word-vectors that are not equally similar to all variety-vectors. Said differently, we want to use words for which the distribution of similarities with all variety-vectors is as far away from uniform as possible. One way to measure the non-uniformity of a distribution is the information entropy. It is maximal if the similarity is the same for all variety-vectors. Thus, in this model, our target word set will consist of words with the lowest entropy in their similarity distribution.