Welcome to the OAO team project! This is a text classification project carried out in the context of our CS-433 Machine Learning course at EPFL.
The goal of this project is to classify tweets according to whether they originally contained a smiling face or a sad face.
More concretely, we aim to predict the sentiment of tweets by classifying them as either positive or negative.
To do so, we have trained four different models:

- `BERT`: a transformer-based model
- `RoBERTa`: a transformer-based model
- `XLNet`: a transformer-based model
- `XGBoost`: a gradient boosting model
We have also implemented a text preprocessing pipeline that can be used to clean the data before training the models, since social media language is often noisy and full of typos, abbreviations, etc.
The project is structured as follows:

- `Data/`: contains the data used for training and testing the models
- `Models/`: contains the saved models used for inference
- `Notebooks/`: contains notebooks used for data exploration and model training
- `Classifiers/`: contains the code for the different classifier classes used for training and inference
- `Results/`: contains the predictions made by the models on the test set
- `TextProc/`: contains the code for our text preprocessing methods
- `run.py`: the main script used for training the models and running inference on the test set
We have implemented four different classifiers that can be used for inference. If you just want to quickly check the results of the models, please follow the instructions below.
- Clone the repository.
- Download the `Models/` folder from the following link and place it at the root of the project: here.
  Please make sure that the `Models/` folder is at the root of the project, and do not change the names of the folders inside, as the scripts depend on that specific structure.
- If you don't have the requirements installed, the cells in the `QuickStart.ipynb` notebook will install them for you.
- Then simply use the following command to make predictions on the test set:

```bash
python run.py --config {test_config_name} --save_result {predictions_name}
```
where `test_config_name` is one of the following:

- `bert_config_full_test`
- `xlnet_config_part_test`
- `xgboost_config_full_test`
- `roberta_config_test`

and `predictions_name` is the name of the file in which to save the predictions on the test set. You will find the predictions under `Results/{predictions_name}.csv`.
You can run the code by executing the `run.py` script. The script takes the following arguments (a sketch of how they could be parsed follows the list):

- `--config`: the name of the configuration to use for training or predicting
- `--train`: whether to train the model or not (default: `False`)
- `--train_data_path`: the path to the training data (default: `Data/twitter-datasets/`)
- `--save_model_path`: if training a new model, the path where to save the model (default: `Models/{model_type}`)
- `--save_result`: the path where to save the predictions on the test set (default: `predictions`)
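As an illustration, here is a minimal sketch of how these arguments could be parsed with Python's `argparse`; it is a guess based on the flags documented above, not `run.py`'s actual code:

```python
# Illustrative only: assumed argument parsing for run.py, reconstructed
# from the documented flags and their defaults.
import argparse

parser = argparse.ArgumentParser(description='Train a model or run inference on the test set.')
parser.add_argument('--config', required=True,
                    help='name of a configuration defined in configs.py')
parser.add_argument('--train', action='store_true', default=False,
                    help='train the model instead of only predicting')
parser.add_argument('--train_data_path', default='Data/twitter-datasets/',
                    help='path to the training data')
parser.add_argument('--save_model_path', default=None,  # resolved to Models/{model_type} if omitted
                    help='where to save a newly trained model')
parser.add_argument('--save_result', default='predictions',
                    help='name of the predictions file, saved under Results/')
args = parser.parse_args()
```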
If you want to train one of the models, use the following command:

```bash
python run.py --train --config {config_name} --save_model_path {path_to_save_model}
```

The `config_name` is the name of the training configuration and can be found in the `configs.py` file. The `path_to_save_model` is the path where to save the model; the saved model will be found under `Models/{model_type}/{path_to_save_model}`.
If you want to make predictions on the test set, use the following command:

```bash
python run.py --config {test_config_name} --save_result {predictions_name}
```

The `test_config_name` is the name of the configuration for predicting and can be found in the `configs.py` file. Note that if you have trained and saved your own model, you need to create a `test_model_config` in the `model_configs.py` file, as well as a general configuration in the `configs.py` file. More details below.
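As an illustration, the two entries could look like this, mirroring the structures shown in the configuration section below (the names `my_bert_test` and `my_bert_test_config`, and the model path, are hypothetical):

```python
# Hypothetical entry in model_configs.py (name and path are illustrative):
my_bert_test = {
    'model_type': 'bert',
    'model_path': 'Models/BERT/my_trained_model/',
}

# Hypothetical matching entry in configs.py:
my_bert_test_config = {
    'model_config': model_configs.my_bert_test,
    'preproc_config': preproc_configs.preproc_config_context,
}
```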
Here are some example commands:

```bash
python run.py --config bert_base_part_train --save_result bert_base_part_predictions
python run.py --config xlnet_base_part_train --save_result xlnet_base_part_predictions
python run.py --config xgboost_base_part_train --save_result xgboost_base_part_predictions
```
This project is built around a configuration system that dictates, for instance, the parameters used for training the model, but also the type of model to use, the type of preprocessing to apply to the data, etc.
There are three configuration files:

- `configs.py`: contains the general configurations for the project
- `model_configs.py`: contains the configurations for the models
- `preproc_configs.py`: contains the configurations for the preprocessing
Let's take a look at the structure of the `model_configs.py` file. Since we use different models, the configuration for each model type looks slightly different. For instance, the configuration for training a BERT model looks like this:
```python
bert_train = {
    'model_type': 'bert',
    'model_encoder': 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'model_preproc': 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'max_seq_length': 200,
    'batch_size': 128,
    'epochs': 5,
    'warmup': 0.1,
    'lr': 6e-10,
    'validation_split': 0.15,
}
```
As you can see, the configuration contains the following parameters:

- `model_type`: the type of model to use
- `model_encoder`: the encoder to use for the model
- `model_preproc`: the preprocessing to apply to the data
- `max_seq_length`: the maximum length of the input sequence
- `batch_size`: the batch size to use for training
- `epochs`: the number of epochs to train the model
- `warmup`: the warmup proportion for the learning rate scheduler
- `lr`: the learning rate to use for training
- `validation_split`: the proportion of the training data to use for validation
The configurations for training XLNet and XGBoost look like this:

```python
xlnet_train = {
    'model_type': 'xlnet',
    'max_seq_length': 200,
    'batch_size': 128,
    'epochs': 5,
    'warmup': 0.1,
    'lr': 6e-10,
    'validation_split': 0.15,
}

xgboost_base_train = {
    'model_type': 'xgboost',
    'xgb_config': {
        'learning_rate': 0.189,
        'max_depth': 10,
        'n_estimators': 742,
        'subsample': 0.881,
    },
    'vectorizer_config': {
        'vectorizer': "word2vec",
        'vector_size': 350,
        'window': 5,
        'min_count': 1,
        'workers': 4,
        'sg': 1,
    },
}
```
In the case of XGBoost, if you specify the `vectorizer_config` parameter, the model will be trained using the Word2Vec vectorizer, which itself requires some specific parameters, as in the example above.
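As a rough sketch, here is how a `vectorizer_config` like the one above maps onto gensim's `Word2Vec` (dropping the `'vectorizer'` selector key), and how tweet-level features could then be built for XGBoost by averaging word vectors. This is an assumption about the mechanism, not the project's actual code; `tweets` and `tweet_to_vector` are hypothetical names:

```python
# Sketch only: train Word2Vec on tokenized tweets and average word
# vectors per tweet to obtain fixed-size features for XGBoost.
import numpy as np
from gensim.models import Word2Vec

w2v_params = {
    'vector_size': 350,  # embedding dimensionality
    'window': 5,         # context window size
    'min_count': 1,      # keep every word, even ones seen once
    'workers': 4,        # training threads
    'sg': 1,             # 1 = skip-gram, 0 = CBOW
}

tweets = [['great', 'day'], ['so', 'sad']]  # toy tokenized corpus
w2v = Word2Vec(sentences=tweets, **w2v_params)

def tweet_to_vector(tokens, model):
    """Average the word vectors of a tweet; zeros if no token is known."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

X = np.vstack([tweet_to_vector(t, w2v) for t in tweets])  # XGBoost features
```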
The inference configuration looks the same across all models; we only need to specify the model type and the path to the saved model. For instance, the inference configuration for BERT looks like this:

```python
bert_base_part_test = {
    'model_type': 'bert',
    'model_path': 'Models/BERT/AlisBERTPart/',
}
```
Now let's take a look at the structure of the `preproc_configs.py` file. A preprocessing configuration is simply a list of preprocessing functions to apply to the data. We have curated two configurations for our specific use cases, but you can simply create your own configuration by adding the preprocessing functions you want to apply. All functions are available in the `TextProc/helper.py` file.
Here is an example configuration:

```python
preproc_config = [
    remove_punctuation,
    remove_stopwords,
    to_lowercase,
    remove_numbers,
    remove_whitespace,
]
```
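For illustration, the helpers might look something like the following minimal sketch; these are assumed implementations, not the actual contents of `TextProc/helper.py`:

```python
# Assumed implementations of a few preprocessing helpers: each maps a
# string to a string, so they can be chained in the order listed above.
import re
import string

def remove_punctuation(text):
    """Strip ASCII punctuation characters."""
    return text.translate(str.maketrans('', '', string.punctuation))

def to_lowercase(text):
    return text.lower()

def remove_numbers(text):
    return re.sub(r'\d+', '', text)

def remove_whitespace(text):
    """Collapse runs of whitespace into single spaces."""
    return ' '.join(text.split())
```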
Finally, let's take a look at the structure of the `configs.py` file. A general configuration is simply a mix of a model configuration and a preprocessing configuration. For instance, the configuration for training BERT looks like this:

```python
bert_base_part_train = {
    'model_config': model_configs.bert_base_train,
    'preproc_config': preproc_configs.preproc_config_context,
}
```
It is this configuration name, `bert_base_part_train`, that you need to specify when running the `run.py` script, and the same holds for the inference configurations.
Now let's dive into the structure of the `Classifiers` folder. Since our project aims at comparing different models (BERT, XLNet, XGBoost), we have created a class for each model type. All of these classes inherit from the `Classifier` abstract class, which defines the basic methods for training and inference. Hence, all of the classes provide the same set of methods, but the implementation differs for each model type.
Each classifier provides the following methods (a sketch of the shared interface follows the list):

- `train`: trains the model
- `predict`: makes predictions on the test set
- `save`: saves the model
- `load_model`: loads the model
- `validate`: validates the model on the validation set
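A minimal sketch of what this shared interface might look like (assumed, not the project's exact code; method signatures are illustrative):

```python
# Assumed shape of the shared interface: an abstract base class that
# each concrete classifier (BERT, XLNet, XGBoost, ...) implements.
from abc import ABC, abstractmethod

class Classifier(ABC):
    """Common interface for all model types."""

    @abstractmethod
    def train(self, X, y):
        """Train the model on the given data."""

    @abstractmethod
    def predict(self, X):
        """Make predictions on the test set."""

    @abstractmethod
    def save(self, path):
        """Save the trained model to disk."""

    @abstractmethod
    def load_model(self, path):
        """Load a previously saved model."""

    @abstractmethod
    def validate(self, X_val, y_val):
        """Evaluate the model on a validation set."""
```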
The text preprocessing is done using the `TextProc` package. This package offers a `Preprocessor` class that takes a list of preprocessing functions as input and applies them to the data. The `Preprocessor` can be set to verbose mode, which will print the data before and after each preprocessing step (useful for debugging).
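To give a concrete picture, a class with that behavior could look like this minimal sketch (assumed, not the package's exact API; the `apply` method name is hypothetical):

```python
# Sketch of a Preprocessor that chains text -> text functions, with an
# optional verbose mode that prints the text around every step.
class Preprocessor:
    def __init__(self, funcs, verbose=False):
        self.funcs = funcs      # ordered list of preprocessing functions
        self.verbose = verbose

    def apply(self, texts):
        """Apply every preprocessing function, in order, to each text."""
        results = []
        for text in texts:
            for func in self.funcs:
                if self.verbose:
                    print(f'before {func.__name__}: {text!r}')
                text = func(text)
                if self.verbose:
                    print(f'after  {func.__name__}: {text!r}')
            results.append(text)
        return results

# Example usage with the helpers sketched earlier:
# proc = Preprocessor([to_lowercase, remove_whitespace], verbose=True)
# clean = proc.apply(['SO   HAPPY!!!'])
```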
Details about the preprocessing functions can be found in our report.
| Model   | Epochs | Time per epoch |
|---------|--------|----------------|
| BERT    | 6      | 30 min         |
| XLNet   | 6      | 30 min         |
| XGBoost | --     | 10 min         |