fastText_doc2vec is an extension of Facebook's fastText for document embedding.
fastText_doc2vec builds in the same environment as Facebook's fastText.
To build fastText_doc2vec, use the following:
$ git clone https://github.com/Skarface-/fastText_doc2vec.git
$ cd fastText_doc2vec
$ make
This will produce object files for all the classes as well as the main binary fasttext.
If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).
This library has two use cases: document embedding using the PV-DM model and document embedding using the PV-DBOW model.
Both models are described in [1].
To embed documents in a vector space, as described in [1], run:
$ ./fasttext pvdm -model model.bin -input data.txt -output docvecs
or
$ ./fasttext pvdbow -model model.bin -input data.txt -output docvecs
where model.bin is a model previously trained with fastText word representation learning. As such, most options are inherited from that model, with the exception of options such as epoch and thread.
data.txt is a file containing UTF-8 encoded labeled documents (__label__<label>, <text>).
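As an illustration of preparing such an input file, the following Python sketch writes (label, text) pairs as UTF-8, one document per line. It assumes the label token comes first and is separated from the text by a single space, which is the layout used by upstream fastText; the exact separator this fork expects is not stated here, so treat that as an assumption. The helper names `format_document` and `write_documents` are hypothetical.

```python
from pathlib import Path

LABEL_PREFIX = "__label__"  # default label prefix, per the tool's -label option


def format_document(label: str, text: str, prefix: str = LABEL_PREFIX) -> str:
    """Return one input line: the prefixed label, a space, then the text.

    ASSUMPTION: label and text are separated by a single space, as in
    upstream fastText. Adjust if this fork expects a different separator.
    """
    # Collapse newlines and repeated whitespace so each document stays on one line.
    flat_text = " ".join(text.split())
    return f"{prefix}{label} {flat_text}"


def write_documents(path, documents) -> None:
    """Write (label, text) pairs to `path` as UTF-8, one document per line."""
    lines = [format_document(label, text) for label, text in documents]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For example, `write_documents("data.txt", [("sports", "The match ended in a draw")])` produces a file that can be passed to `-input`.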
When document embedding finishes, the program saves a single file: docvecs.vec.
docvecs.vec is a text file containing the labeled document vectors, one per line.
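A minimal Python sketch for reading the output file back is shown here. It assumes each line holds the label as the first whitespace-separated token followed by the vector components, and that the file has no header line; neither detail is confirmed above, so check a real docvecs.vec before relying on it. The function names are hypothetical.

```python
def parse_docvec_line(line: str):
    """Split one line into (label, vector).

    ASSUMPTION: the first whitespace-separated token is the label and the
    remaining tokens are the vector components.
    """
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError(f"expected a label and at least one value, got: {line!r}")
    return tokens[0], [float(t) for t in tokens[1:]]


def load_docvecs(path):
    """Read a .vec file into a list of (label, vector) pairs, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [parse_docvec_line(line) for line in handle if line.strip()]
```

Calling `load_docvecs("docvecs.vec")` returns the pairs in file order, so they can be matched back to the input documents by position if the labels are not unique.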
Invoke a command without arguments to list available arguments and their default values:
$ ./fasttext pvdm
Empty model or input or output path.
The following arguments are mandatory:
-model (mandatory, only pvdm or pvdbow) model.bin file path for document embedding
-input training file path
-output output file path
The following arguments are optional:
-lr learning rate [0.05]
-epoch number of epochs [5]
-thread number of threads [12]
-verbose how often to print to stdout [10000]
-label labels prefix [__label__]
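The arguments listed above can also be assembled from a script. The sketch below builds the argument list for a pvdm or pvdbow run using only the flags shown in that listing; the wrapper itself (`build_command`, `run_embedding`) and the default binary path `./fasttext` are illustrative assumptions, not part of the project.

```python
import subprocess

MODES = ("pvdm", "pvdbow")


def build_command(mode, model, input_path, output, binary="./fasttext", **options):
    """Return the argument list for one embedding run.

    `options` maps optional flag names (without the dash) to values,
    e.g. epoch=10 becomes "-epoch 10".
    """
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    command = [binary, mode, "-model", model, "-input", input_path, "-output", output]
    for name, value in options.items():
        command += [f"-{name}", str(value)]
    return command


def run_embedding(*args, **kwargs):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(*args, **kwargs), check=True)
```

For example, `build_command("pvdbow", "model.bin", "data.txt", "docvecs", epoch=10)` yields the same invocation as the pvdbow command shown earlier with `-epoch 10` appended.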
[1] Q. V. Le, T. Mikolov, Distributed Representations of Sentences and Documents
@article{quoc2014distributed,
title={Distributed Representations of Sentences and Documents},
author={Le, Quoc V. and Mikolov, Tomas},
journal={arXiv preprint arXiv:1405.4053v2},
year={2014}
}
[2] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2016enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.04606},
year={2016}
}
- Facebook page: https://www.facebook.com/groups/1174547215919768
- Google group: https://groups.google.com/forum/#!forum/fasttext-library