This project was developed as part of the EPFL Machine Learning course (2020).
- Marie Biolková
- Sena Necla Cetin
- Robert Pieniuta
This repository contains the code used to build a classifier for text sentiment analysis. The task was performed on a large corpus of tweets, where the goal was to determine, from the remaining text, whether each tweet originally contained a positive or negative smiley (the smiley itself having been removed). More information about the challenge and the data can be found here.
```
.
├── README.md
├── __init__.py
├── data
│   ├── preprocessed_tweets.txt
│   ├── preprocessed_tweets_full.txt
│   ├── preprocessed_tweets_test.txt
│   ├── test_data.txt
│   ├── train_neg.txt
│   ├── train_neg_full.txt
│   ├── train_pos.txt
│   ├── train_pos_full.txt
│   ├── weights_gru.pt
│   └── weights_lstm.pt
├── notebooks
│   ├── bow-tfidf-baselines.ipynb
│   ├── eda.ipynb
│   ├── fasttext.ipynb
│   ├── glove_base.ipynb
│   └── test-preprocessing.ipynb
└── src
    ├── __init__.py
    ├── consts.py
    ├── ft_helpers.py
    ├── get_embeddings.py
    ├── glove
    │   ├── build_vocab.sh
    │   ├── consts_glove.py
    │   ├── cooc.py
    │   ├── cut_vocab.sh
    │   ├── embeddings.txt
    │   ├── glove_solution.py
    │   ├── pickle_vocab.py
    │   └── tmp
    │       ├── cooc.pkl
    │       ├── vocab.pkl
    │       ├── vocab_cut.txt
    │       └── vocab_full.txt
    ├── load.py
    ├── predict_helpers.py
    ├── preprocessing.py
    ├── representations.py
    ├── rnn.py
    └── run.py
```
- `preprocessed_tweets.txt`, `preprocessed_tweets_full.txt`, `preprocessed_tweets_test.txt`: tweets from the development set, full dataset and test set respectively, which have been pre-processed
- `test_data.txt`: unlabelled tweets to be predicted
- `train_neg.txt`, `train_neg_full.txt`: development and full set of negative tweets
- `train_pos.txt`, `train_pos_full.txt`: development and full set of positive tweets
- `weights_gru.pt`, `weights_lstm.pt`: weights of the best GRU and LSTM models
- `bow-tfidf-baselines.ipynb`: code for exploration and tuning of baselines with TF-IDF and Bag-of-Words
- `eda.ipynb`: exploratory data analysis
- `fasttext.ipynb`: exploration and tuning of fastText
- `glove_base.ipynb`: code for exploration and tuning of baselines using GloVe embeddings
- `test-preprocessing.ipynb`: test file to check whether preprocessing was done correctly
- `consts.py`, `consts_glove.py`: contain paths to the files used
- `ft_helpers.py`: helper functions for fastText training
- `get_embeddings.py`: executing this script from the command line trains GloVe embeddings on the preprocessed dataset
- `build_vocab.sh`, `cooc.py`, `cut_vocab.sh`, `pickle_vocab.py`, `glove_solution.py`: scripts for training GloVe embeddings; produce `embeddings.txt` once executed
- `cooc.pkl`, `vocab.pkl`, `vocab_cut.txt`, `vocab_full.txt`: intermediate files for training GloVe embeddings
- `load.py`: helper functions for loading datasets and outputting predictions
- `predict_helpers.py`: helper functions for making predictions with the best model
- `preprocessing.py`: methods for preprocessing
- `representations.py`: methods for generating and mapping GloVe embeddings
- `rnn.py`: methods for training RNNs and predicting their outputs
- `rnn_classifier.py`: defines the recurrent neural network class
- `run.py`: script to produce our best submission
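Since `get_embeddings.py` produces `embeddings.txt`, a minimal loader sketch may be useful; it assumes a word-per-line text layout (`word v1 v2 ... vd`), which may differ from the actual file format:

```python
import numpy as np

def load_embeddings(path):
    """Parse a text file with one `word v1 v2 ... vd` entry per line
    into a {word: vector} dictionary (layout is an assumption)."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:  # skip blank or malformed lines
                continue
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings
```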
- Python 3
- Libraries: `numpy`, `pandas`, `nltk`, `wordcloud`, `fasttext`, `sklearn`, `pytorch`, `matplotlib` and `seaborn`
Place the data in the `data` folder. The data, as well as the embeddings we trained, can be downloaded here.
In order to generate our final submission file, run:

```
cd src
python run.py
```

This will generate the `src/submission.csv` file.
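For reference, a submission file in the usual `Id,Prediction` format can be written with a few lines of Python; the helper below is a hypothetical sketch (not the code in `run.py`) and assumes labels in {-1, 1}:

```python
import csv

def write_submission(ids, predictions, path="submission.csv"):
    """Write (Id, Prediction) rows; predictions are assumed to be in {-1, 1}."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for tweet_id, pred in zip(ids, predictions):
            writer.writerow([tweet_id, pred])

# Toy example with three hypothetical predictions:
write_submission([1, 2, 3], [1, -1, 1])
```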
Our best model is an ensemble of fastText, LSTM and GRU classifiers. It yielded a classification accuracy of 88.6% on AIcrowd (and an F1-score of 88.8%).
Please note that since it is not possible to set a seed in fastText, the outputs may vary slightly.
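The hard-voting step of such an ensemble can be sketched as follows; this is a minimal illustration with hypothetical per-model outputs, and the actual combination in `run.py` may differ:

```python
import numpy as np

def majority_vote(*prediction_sets):
    """Combine per-model label predictions (each in {-1, 1}) by hard voting.

    With an odd number of models (here: fastText, LSTM, GRU) the vote
    can never tie, so np.sign always returns -1 or 1.
    """
    stacked = np.stack(prediction_sets)  # shape: (n_models, n_tweets)
    return np.sign(stacked.sum(axis=0)).astype(int)

# Hypothetical per-model outputs for three tweets:
fasttext_preds = np.array([1, -1,  1])
lstm_preds     = np.array([1,  1, -1])
gru_preds      = np.array([1, -1, -1])

ensemble = majority_vote(fasttext_preds, lstm_preds, gru_preds)
# ensemble -> [1, -1, -1]
```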