Natural Language Processing with Representation Learning (DS-GA 1011)
For the full write-up of the results, please see report.pdf.
$ #(HPC) module load anaconda3/4.3.1
$ conda create -n bow python=3.6
$ source activate bow
$ conda install -c conda-forge spacy
$ python -m spacy download en
$ conda install -c conda-forge matplotlib
$ #(loc) conda install pytorch torchvision -c pytorch
$ #(HPC) pip install torch torchvision
Download "Large Movie Review Dataset v1.0" from http://ai.stanford.edu/~amaas/data/sentiment/
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.
$ mkdir data
$ cd data/
$ wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
$ tar -xvzf aclImdb_v1.tar.gz
$ #rm aclImdb_v1.tar.gz
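The archive unpacks into data/aclImdb/ with train/ and test/ splits, each containing pos/ and neg/ review files. As a rough sketch of how one split can be read into texts and labels (the project's actual preprocessing lives in utils.py; the helper name and the 0/1 label convention here are just for illustration):

```python
import os

def load_imdb_split(split_dir):
    """Read one aclImdb split (e.g. data/aclImdb/train) into parallel text/label lists."""
    texts, labels = [], []
    for label, subdir in enumerate(["neg", "pos"]):  # assumed convention: 0 = negative, 1 = positive
        folder = os.path.join(split_dir, subdir)
        for fname in sorted(os.listdir(folder)):
            if fname.endswith(".txt"):
                with open(os.path.join(folder, fname), encoding="utf-8") as f:
                    texts.append(f.read())
                labels.append(label)
    return texts, labels

train_texts, train_labels = load_imdb_split("data/aclImdb/train")
test_texts, test_labels = load_imdb_split("data/aclImdb/test")
```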
Run the following command in the root directory of the project to reproduce the final trial:
$ #[set up conda env and install requirements]
$ python main.py
See main.py for the main script to run. Change default parameter settings in settings.py. Slightly modified versions of the lab code can be found in bow_model.py and torch_data_loader.py. The utils.py script contains all of the preprocessing code. Finally, the plots and result tables are generated in plot.py (see also plots/ and results/).
- Tokenization schemes
- Number of epochs
- N-gram size
- Vocabulary size
- Embedding size
- Optimizer (adam vs. sgd)
- Learning rate
- Baseline: string.split()
- Tokenization using spaCy
- Tokenization using spaCy, filtering of stop words and punctuation
- Tokenization using spaCy, filtering of stop words and punctuation, lemmatization
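A minimal sketch of the four schemes with spaCy, assuming they are numbered [0]–[3] in the order listed above (which is how the notes below refer to [1] and [2]); the actual implementation lives in utils.py and may differ in details such as lowercasing:

```python
import spacy

# relies on the `en` shortcut model installed above via `python -m spacy download en`
nlp = spacy.load("en", disable=["parser", "ner"])

def tokenize(text, scheme=2):
    if scheme == 0:                                   # [0] baseline
        return text.split()
    doc = nlp(text)
    if scheme == 1:                                   # [1] plain spaCy tokenization
        return [tok.text.lower() for tok in doc]
    if scheme == 2:                                   # [2] drop stop words and punctuation
        return [tok.text.lower() for tok in doc
                if not tok.is_stop and not tok.is_punct]
    return [tok.lemma_.lower() for tok in doc         # [3] additionally lemmatize
            if not tok.is_stop and not tok.is_punct]
```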
So far, tokenization scheme [2] works best. Lemmatization seems to be overkill, but filtering stop words and punctuation is helpful. It looks like the model is overfitting, though, so let's adjust the learning rate next.

The default learning rate was set to 0.01, which is pretty high for the Adam optimizer. Results:
- 1e-2: clearly overfitting
- 1e-3: nice learning curve
- 1e-4: too slow
We will stick with a learning rate of 1e-3 for Adam for now. It looks like we can also reduce the number of epochs from 10 to 2 for the following experiments.
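For reference, the learning rate is just the lr argument handed to the optimizer; a tiny stand-in model is used below since the real classifier lives in bow_model.py:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(100, 2)  # stand-in for the bag-of-words classifier in bow_model.py
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # 1e-2 overfit, 1e-4 converged too slowly
```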
Testing n-gram sizes between [1-4] surprisingly did not yield very promising results. Evaluating different vocabulary sizes [10k, 50k, 100k] did not significantly affect model performance either. Reverting the tokenization scheme back to [1] and including stop words and punctuation improves the results, especially when using bigrams over unigrams. Increasing the number of epochs to 5 again was necessary for the learning curve to converge. Still, the best results are consistently achieved with the more rigorous tokenization scheme [2] and using unigrams [n=1] with a maximum vocabulary size of [50k].
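To make the n-gram and vocabulary-size settings concrete, here is a simplified sketch of n-gram extraction and a frequency-based vocabulary cutoff; the function names are illustrative and the real code is in utils.py:

```python
from collections import Counter

def extract_ngrams(tokens, n=1):
    """Return all n-grams of length 1..n; longer grams are joined with spaces."""
    grams = list(tokens)
    for size in range(2, n + 1):
        grams += [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return grams

def build_vocab(tokenized_docs, n=1, max_size=50_000):
    """Keep the max_size most frequent n-grams; ids 0 and 1 are reserved for <pad>/<unk>."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(extract_ngrams(tokens, n))
    itos = ["<pad>", "<unk>"] + [gram for gram, _ in counts.most_common(max_size)]
    return {gram: idx for idx, gram in enumerate(itos)}

# toy example: bigrams with a 50k cap
vocab = build_vocab([["a", "great", "movie"], ["not", "a", "great", "movie"]], n=2)
```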
Results: 200d > 100d > 50d.
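The embedding size is the width of the embedding layer in the classifier. A minimal bag-of-n-grams model in the spirit of bow_model.py (using nn.EmbeddingBag here for brevity; the actual module may be structured differently):

```python
import torch
import torch.nn as nn

class BagOfWords(nn.Module):
    """Average token/n-gram embeddings, then classify with one linear layer."""
    def __init__(self, vocab_size, emb_dim=200, num_classes=2):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, emb_dim, mode="mean")
        self.linear = nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids, offsets):
        return self.linear(self.embed(token_ids, offsets))

# example: two reviews packed into one flat id tensor, separated by offsets
model = BagOfWords(vocab_size=50_000, emb_dim=200)
token_ids = torch.tensor([2, 7, 15, 3, 8])
offsets = torch.tensor([0, 3])          # review 1 = ids[0:3], review 2 = ids[3:]
logits = model(token_ids, offsets)      # shape: (2, 2)
```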
We will compare adam vs. sgd, both with default parameters. sgd doesn't seem to be working well at all. We will keep using adam for now.
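Switching optimizers only changes which class wraps the model parameters, roughly as below; the name flag is hypothetical and would come from settings.py in the real script:

```python
import torch.nn as nn
import torch.optim as optim

def make_optimizer(model, name="adam", lr=1e-3):
    # `name` is a hypothetical settings.py option; everything else uses PyTorch defaults
    if name == "adam":
        return optim.Adam(model.parameters(), lr=lr)
    return optim.SGD(model.parameters(), lr=lr)

optimizer = make_optimizer(nn.Linear(100, 2), name="sgd")
```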
Not helpful.
Training for 2 epochs seems sufficient; otherwise the model starts to overfit on the training data.
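For completeness, the epoch count just bounds the outer training loop. A self-contained toy version (random data standing in for the batches built by torch_data_loader.py):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# toy stand-ins; the real model and batches come from bow_model.py and torch_data_loader.py
model = nn.Linear(100, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
features = torch.randn(64, 100)
targets = torch.randint(0, 2, (64,))

for epoch in range(2):                 # more than 2 epochs led to overfitting
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    optimizer.step()
    print("epoch {}: loss {:.4f}".format(epoch, loss.item()))
```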
Final accuracy on the test set: 86.212%.