This is an Torch implementation of fasttext based on A. Joulin's paper Bag of Tricks for Efficient Text Classification.
Author: Junwei Pan
Email: pandevirus@gmail.com
This code is written in Lua and requires Torch. If you're on Ubuntu, installing Torch in your home directory may look something like:
$ curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash
$ git clone https://github.com/torch/distro.git ~/torch --recursive
$ cd ~/torch
$ ./install.sh # and enter "yes" at the end to modify your bashrc
$ source ~/.bashrc
This code also require the nn
package:
$ luarocks install nn
First down load the text classification data mentioned in Xiang Zhang's paper: Character-level Convolutional Networks for Text Classification. We use the ag_news_csv dataset for training and evaluation.
Then run the following commands to train and evaluate the fasttext model:
$ th main.lua -corpus_train data/ag_news_csv/train.csv -corpus_test data/ag_news_csv/test.csv -dim 10 -minfreq 10 -stream 0 -epochs 5 -suffix 1 -n_classes 4 -n_gram 1 -decay 0 -lr 0.5
If the dataset is too large to fit in the memory, try to use the paratemer -stream 1
.
The trained model can get an accuracy of 90.93% on the g_news_csv dataset using the above configuration.
-corpus_train: path of the training data
-corpus_test: path of the testing data
-minfreq: only those words with frequence higher than this will be used as features, default 10
-dim: the embedding dimension, default 10
-lr: learning rate, default 0.5
-min_lr: the minimal learning rate, default 0.001
-decay: whether to decay learning rate, 1 for decay, 0 for no decay, default 0
-epochs: number of epochs to go through the training data, default 5
-stream: whether to stream the data: 1 for streaming, 0 for store all data in memory, default 0
-suffix: suffix of the model
-n_classes: number of classification categories
-n_gram: 1 for unigram, 2 for bigram, 3 for trigram, default 1
-title: whether use the title to generate features, default 1
-description: whether use the description to generate features, default 1
- Support hashtrick
- Efficiency improvement
This code is based on the word2vec_torch project, which extends Yoon Kim's word2vec_torch by implementing the Continuous Bag-of-words Model.