Welcome to the OAO team project! This is a text classification project carried out in the context of our CS-433 Machine Learning course at EPFL.
The goal of this project is to classify tweets according to whether they originally contained a smiling face or a sad face.
More concretely, we aim to predict the sentiment of tweets by classifying them as either positive or negative.
To do so, we have trained four different models:

- `BERT`: a transformer-based model
- `RoBERTa`: a transformer-based model
- `XLNet`: a transformer-based model
- `XGBoost`: a gradient boosting model
We have also implemented a text preprocessing pipeline that can be used to clean the data before training the models, since social media language is often noisy and full of typos, abbreviations, etc.
The project is structured as follows:

- `Data/`: contains the data used for training and testing the models
- `Models/`: contains the saved models used for inference
- `Notebooks/`: contains notebooks used for data exploration and model training
- `Classifiers/`: contains the code for the different classifier classes used for training and inference
- `Results/`: contains the predictions made by the models on the test set
- `TextProc/`: contains the code for our text preprocessing methods
- `run.py`: the main script used for training the models and running inference on the test set
We have implemented four different classifiers that can be used for inference. If you just want to quickly check the results of the models, please follow the instructions below.
- Clone the repository.
- Download the `Models/` folder from the following link and place it at the root of the project: here.
  Please make sure that the `Models/` folder is at the root of the project, and do not change the names of the folders inside, as the scripts depend on that specific structure.
- If you don't have the requirements installed, the cells in the `QuickStart.ipynb` notebook will install them for you.
- Then simply use the following command to make predictions on the test set:

```bash
python run.py --config {test_config_name} --save_result {predictions_name}
```
where `test_config_name` is one of the following:

- `bert_config_full_test`
- `xlnet_config_part_test`
- `xgboost_config_full_test`
- `roberta_config_test`

and `predictions_name` is the name of the file in which to save the predictions on the test set. You will find the predictions under `Results/{predictions_name}.csv`.
You can run the code by executing the `run.py` script. The script takes the following arguments (a sketch of how they could be parsed follows the list):

- `--config`: the name of the configuration to use for training or predicting
- `--train`: whether to train the model or not (default: `False`)
- `--train_data_path`: the path to the training data (default: `Data/twitter-datasets/`)
- `--save_model_path`: if training a new model, the path where to save the model (default: `Models/{model_type}`)
- `--save_result`: the path where to save the predictions on the test set (default: `predictions`)
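As an illustration, here is a minimal sketch of how these arguments could be parsed with Python's `argparse`; it is a guess based on the flags documented above, not `run.py`'s actual code:

```python
# Illustrative only: assumed argument parsing for run.py, reconstructed
# from the documented flags and their defaults.
import argparse

parser = argparse.ArgumentParser(description='Train a model or run inference on the test set.')
parser.add_argument('--config', required=True,
                    help='name of a configuration defined in configs.py')
parser.add_argument('--train', action='store_true', default=False,
                    help='train the model instead of only predicting')
parser.add_argument('--train_data_path', default='Data/twitter-datasets/',
                    help='path to the training data')
parser.add_argument('--save_model_path', default=None,  # resolved to Models/{model_type} if omitted
                    help='where to save a newly trained model')
parser.add_argument('--save_result', default='predictions',
                    help='name of the predictions file, saved under Results/')
args = parser.parse_args()
```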
If you want to train one of the models, use the following command:

```bash
python run.py --train --config {config_name} --save_model_path {path_to_save_model}
```

The `config_name` is the name of the training configuration and can be found in the `configs.py` file. The `path_to_save_model` is the path where to save the model; the saved model will be found under `Models/{model_type}/{path_to_save_model}`.
If you want to make predictions on the test set, use the following command:

```bash
python run.py --config {test_config_name} --save_result {predictions_name}
```

The `test_config_name` is the name of the configuration for predicting and can be found in the `configs.py` file. Note that if you have trained and saved your own model, you need to create a `test_model_config` in the `model_configs.py` file, as well as a general configuration in the `configs.py` file. More details below.
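As an illustration, the two entries could look like this, mirroring the structures shown in the configuration section below (the names `my_bert_test` and `my_bert_test_config`, and the model path, are hypothetical):

```python
# Hypothetical entry in model_configs.py (name and path are illustrative):
my_bert_test = {
    'model_type': 'bert',
    'model_path': 'Models/BERT/my_trained_model/',
}

# Hypothetical matching entry in configs.py:
my_bert_test_config = {
    'model_config': model_configs.my_bert_test,
    'preproc_config': preproc_configs.preproc_config_context,
}
```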
Here are some example commands:

```bash
python run.py --config bert_base_part_train --save_result bert_base_part_predictions
python run.py --config xlnet_base_part_train --save_result xlnet_base_part_predictions
python run.py --config xgboost_base_part_train --save_result xgboost_base_part_predictions
```
This project is built around a configuration system that dictates, for instance, the parameters used for training the model, but also the type of model to use, the type of preprocessing to apply to the data, etc.
There are three configuration files:

- `configs.py`: contains the general configurations for the project
- `model_configs.py`: contains the configurations for the models
- `preproc_configs.py`: contains the configurations for the preprocessing
Let's take a look at the structure of the `model_configs.py` file. Since we use different models, the configuration for each model type looks slightly different. For instance, the configuration for training a BERT model looks like this:
```python
bert_train = {
    'model_type': 'bert',
    'model_encoder': 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'model_preproc': 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'max_seq_length': 200,
    'batch_size': 128,
    'epochs': 5,
    'warmup': 0.1,
    'lr': 6e-10,
    'validation_split': 0.15,
}
```
As you can see, the configuration contains the following parameters:

- `model_type`: the type of model to use
- `model_encoder`: the encoder to use for the model
- `model_preproc`: the preprocessing to apply to the data
- `max_seq_length`: the maximum length of the input sequence
- `batch_size`: the batch size to use for training
- `epochs`: the number of epochs to train the model
- `warmup`: the warmup proportion for the learning rate scheduler
- `lr`: the learning rate to use for training
- `validation_split`: the proportion of the training data to use for validation
The configurations for training XLNet and XGBoost look like this:

```python
xlnet_train = {
    'model_type': 'xlnet',
    'max_seq_length': 200,
    'batch_size': 128,
    'epochs': 5,
    'warmup': 0.1,
    'lr': 6e-10,
    'validation_split': 0.15,
}

xgboost_base_train = {
    'model_type': 'xgboost',
    'xgb_config': {
        'learning_rate': 0.189,
        'max_depth': 10,
        'n_estimators': 742,
        'subsample': 0.881,
    },
    'vectorizer_config': {
        'vectorizer': "word2vec",
        'vector_size': 350,
        'window': 5,
        'min_count': 1,
        'workers': 4,
        'sg': 1,
    },
}
```
In the case of XGBoost, if you specify the `vectorizer_config` parameter, the model will be trained using the Word2Vec vectorizer, which itself requires some specific parameters, as in the example above.
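As a rough sketch, here is how a `vectorizer_config` like the one above maps onto gensim's `Word2Vec` (dropping the `'vectorizer'` selector key), and how tweet-level features could then be built for XGBoost by averaging word vectors. This is an assumption about the mechanism, not the project's actual code; `tweets` and `tweet_to_vector` are hypothetical names:

```python
# Sketch only: train Word2Vec on tokenized tweets and average word
# vectors per tweet to obtain fixed-size features for XGBoost.
import numpy as np
from gensim.models import Word2Vec

w2v_params = {
    'vector_size': 350,  # embedding dimensionality
    'window': 5,         # context window size
    'min_count': 1,      # keep every word, even ones seen once
    'workers': 4,        # training threads
    'sg': 1,             # 1 = skip-gram, 0 = CBOW
}

tweets = [['great', 'day'], ['so', 'sad']]  # toy tokenized corpus
w2v = Word2Vec(sentences=tweets, **w2v_params)

def tweet_to_vector(tokens, model):
    """Average the word vectors of a tweet; zeros if no token is known."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

X = np.vstack([tweet_to_vector(t, w2v) for t in tweets])  # XGBoost features
```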
The inference configuration looks the same across all models; we only need to specify the model type and the path to the saved model. For instance, the inference configuration for BERT looks like this:

```python
bert_base_part_test = {
    'model_type': 'bert',
    'model_path': 'Models/BERT/AlisBERTPart/',
}
```
Now let's take a look at the structure of the `preproc_configs.py` file. A preprocessing configuration is simply a list of preprocessing functions to apply to the data. We have curated two configurations for our specific use cases, but you can simply create your own configuration by adding the preprocessing functions you want to apply. All functions are available in the `TextProc/helper.py` file.
Here is an example configuration:

```python
preproc_config = [
    remove_punctuation,
    remove_stopwords,
    to_lowercase,
    remove_numbers,
    remove_whitespace,
]
```
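For illustration, the helpers might look something like the following minimal sketch; these are assumed implementations, not the actual contents of `TextProc/helper.py`:

```python
# Assumed implementations of a few preprocessing helpers: each maps a
# string to a string, so they can be chained in the order listed above.
import re
import string

def remove_punctuation(text):
    """Strip ASCII punctuation characters."""
    return text.translate(str.maketrans('', '', string.punctuation))

def to_lowercase(text):
    return text.lower()

def remove_numbers(text):
    return re.sub(r'\d+', '', text)

def remove_whitespace(text):
    """Collapse runs of whitespace into single spaces."""
    return ' '.join(text.split())
```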
Finally, let's take a look at the structure of the `configs.py` file. A general configuration is simply a mix of a model configuration and a preprocessing configuration. For instance, the configuration for training BERT looks like this:

```python
bert_base_part_train = {
    'model_config': model_configs.bert_base_train,
    'preproc_config': preproc_configs.preproc_config_context,
}
```
It is this configuration name, `bert_base_part_train`, that you need to specify when running the `run.py` script, and the same holds for the inference configurations.
Now let's dive into the structure of the `Classifiers` folder. Since our project aims at comparing different models (BERT, XLNet, XGBoost), we have created a class for each model type. All of these classes inherit from the `Classifier` abstract class, which defines the basic methods for training and inference. Hence, all of the classes provide the same set of methods, but the implementation differs for each model type.
Each classifier provides the following methods (a sketch of the shared interface follows the list):

- `train`: trains the model
- `predict`: makes predictions on the test set
- `save`: saves the model
- `load_model`: loads the model
- `validate`: validates the model on the validation set
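A minimal sketch of what this shared interface might look like (assumed, not the project's exact code; method signatures are illustrative):

```python
# Assumed shape of the shared interface: an abstract base class that
# each concrete classifier (BERT, XLNet, XGBoost, ...) implements.
from abc import ABC, abstractmethod

class Classifier(ABC):
    """Common interface for all model types."""

    @abstractmethod
    def train(self, X, y):
        """Train the model on the given data."""

    @abstractmethod
    def predict(self, X):
        """Make predictions on the test set."""

    @abstractmethod
    def save(self, path):
        """Save the trained model to disk."""

    @abstractmethod
    def load_model(self, path):
        """Load a previously saved model."""

    @abstractmethod
    def validate(self, X_val, y_val):
        """Evaluate the model on a validation set."""
```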
The text preprocessing is done using the `TextProc` package. This package offers a `Preprocessor` class that takes a list of preprocessing functions as input and applies them to the data. The `Preprocessor` can be set to verbose mode, which will print the data before and after each preprocessing step (useful for debugging).
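To give a concrete picture, a class with that behavior could look like this minimal sketch (assumed, not the package's exact API; the `apply` method name is hypothetical):

```python
# Sketch of a Preprocessor that chains text -> text functions, with an
# optional verbose mode that prints the text around every step.
class Preprocessor:
    def __init__(self, funcs, verbose=False):
        self.funcs = funcs      # ordered list of preprocessing functions
        self.verbose = verbose

    def apply(self, texts):
        """Apply every preprocessing function, in order, to each text."""
        results = []
        for text in texts:
            for func in self.funcs:
                if self.verbose:
                    print(f'before {func.__name__}: {text!r}')
                text = func(text)
                if self.verbose:
                    print(f'after  {func.__name__}: {text!r}')
            results.append(text)
        return results

# Example usage with the helpers sketched earlier:
# proc = Preprocessor([to_lowercase, remove_whitespace], verbose=True)
# clean = proc.apply(['SO   HAPPY!!!'])
```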
Details about the preprocessing functions can be found in our report.
| Model   | Epochs | Time per epoch |
|---------|--------|----------------|
| BERT    | 6      | 30 min         |
| XLNet   | 6      | 30 min         |
| XGBoost | --     | 10 min         |