august-yeom/fastText_doc2vec

fastText_doc2vec

fastText_doc2vec is an extension of Facebook's fastText for document embedding.

Requirements

fastText_doc2vec builds in the same kind of environment as Facebook fastText.

Building fastText_doc2vec

In order to build fastText_doc2vec, use the following:

$ git clone https://github.com/Skarface-/fastText_doc2vec.git
$ cd fastText_doc2vec
$ make

This will produce object files for all the classes as well as the main binary fasttext.
If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Example use cases

This library has two use cases: document embedding with the PV-DM model and document embedding with the PV-DBOW model.
Both models are described in [1].
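To make the PV-DBOW idea concrete, here is a minimal pure-Python sketch of how a single document vector can be fitted against word vectors that stay fixed. It is an illustration only, not this repository's implementation: the function name infer_pvdbow, the uniform negative sampling, and the toy one-hot vectors are all assumptions, and negative sampling is swapped in here for the hierarchical softmax used in [1].

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def infer_pvdbow(doc_ids, out_vecs, epochs=20, lr=0.05, negatives=5, seed=0):
    """Fit one document vector so that it predicts the words of the document.

    doc_ids  : word indices occurring in the document
    out_vecs : frozen output vectors, one per vocabulary word (assumption:
               they come from an already trained word model)
    """
    rng = random.Random(seed)
    dim = len(out_vecs[0])
    vocab = len(out_vecs)
    # Small random start, in the spirit of word2vec-style initialisation.
    doc = [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
    for _ in range(epochs):
        for target in doc_ids:
            # One positive example plus a few uniformly drawn negatives.
            samples = [(target, 1.0)]
            for _ in range(negatives):
                neg = rng.randrange(vocab)
                if neg == target:
                    continue  # skip accidental hits on the positive word
                samples.append((neg, 0.0))
            # Accumulate the gradient of the log-likelihood w.r.t. the doc vector.
            grad = [0.0] * dim
            for word, label in samples:
                g = label - sigmoid(dot(doc, out_vecs[word]))
                for i in range(dim):
                    grad[i] += g * out_vecs[word][i]
            # Only the document vector is updated; word vectors stay frozen.
            for i in range(dim):
                doc[i] += lr * grad[i]
    return doc


# Toy usage: 20 one-hot "word vectors", a document made of words 0, 1 and 2.
toy_vecs = [[1.0 if i == j else 0.0 for i in range(20)] for j in range(20)]
doc_vec = infer_pvdbow([0, 1, 2], toy_vecs, epochs=50)
print([round(v, 3) for v in doc_vec[:5]])
```

PV-DM differs in that the document vector is combined with the vectors of the surrounding context words before predicting a word, whereas PV-DBOW, as sketched here, predicts words from the document vector alone.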

Document Embedding

To embed documents in a vector space, as described in [1], run:

$ ./fasttext pvdm -model model.bin -input data.txt -output docvecs
or
$ ./fasttext pvdbow -model model.bin -input data.txt -output docvecs

where model.bin is a model previously trained with fastText word representation learning.
As such, most options are inherited from fastText word representation learning, except for epoch, thread, etc.
data.txt is a UTF-8 encoded file of labeled documents (__label__<label>, <text>).
When document embedding finishes, the program saves a single file: docvecs.vec.
docvecs.vec is a text file containing the labeled document vectors, one per line.
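As a rough illustration of the two files involved, the Python sketch below writes labeled documents and reads document vectors back. The exact layouts are assumptions, not confirmed by this README: input lines are taken to be "__label__<label> <text>", and each line of docvecs.vec is taken to be a label followed by whitespace-separated floats, with an optional "<count> <dim>" header as in fastText's word .vec files. The helper names write_labeled_docs and read_docvecs are hypothetical.

```python
LABEL_PREFIX = "__label__"  # default value of the -label option


def write_labeled_docs(path, docs):
    """Write (label, text) pairs, one document per line (assumed layout)."""
    with open(path, "w", encoding="utf-8") as fh:
        for label, text in docs:
            # Collapse whitespace so every document stays on a single line.
            one_line = " ".join(text.split())
            fh.write(f"{LABEL_PREFIX}{label} {one_line}\n")


def read_docvecs(path):
    """Parse 'label v1 v2 ... vn' lines into (label, vector) pairs (assumed layout)."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh):
            parts = line.split()
            if not parts:
                continue
            # Skip a possible "<count> <dim>" header on the first line.
            if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
            label = parts[0]
            if label.startswith(LABEL_PREFIX):
                label = label[len(LABEL_PREFIX):]
            rows.append((label, [float(v) for v in parts[1:]]))
    return rows
```

A list of pairs is returned rather than a dictionary because several documents may share the same label.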

Full documentation

Invoke a command without arguments to list available arguments and their default values:

$ ./fasttext pvdm
Empty model or input or output path.

The following arguments are mandatory:
  -model      (mandatory, only pvdm or pvdbow) model.bin file path for document embedding
  -input      training file path
  -output     output file path

The following arguments are optional:
  -lr         learning rate [0.05]
  -epoch      number of epochs [5]
  -thread     number of threads [12]
  -verbose    how often to print to stdout [10000]
  -label      labels prefix [__label__]
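When the binary is driven from a script, the argument list above can be assembled programmatically. The following Python sketch uses only the flags shown in the help text; the helper name build_command and the default binary path ./fasttext are assumptions made for illustration.

```python
import subprocess

OPTIONAL_FLAGS = {"lr", "epoch", "thread", "verbose", "label"}


def build_command(mode, model, input_path, output, binary="./fasttext", **options):
    """Return the argv list for a pvdm or pvdbow run."""
    if mode not in ("pvdm", "pvdbow"):
        raise ValueError(f"unsupported mode: {mode}")
    cmd = [binary, mode, "-model", model, "-input", input_path, "-output", output]
    for name, value in options.items():
        if name not in OPTIONAL_FLAGS:
            raise ValueError(f"unknown option: {name}")
        cmd += [f"-{name}", str(value)]
    return cmd


cmd = build_command("pvdbow", "model.bin", "data.txt", "docvecs", lr=0.05, thread=4)
print(" ".join(cmd))
# To actually run it (requires the compiled binary and the input files):
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.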

References

Please cite [1] and [2].

Distributed Representations of Sentences and Documents

[1] Quoc V. Le, T. Mikolov, Distributed Representations of Sentences and Documents

@article{quoc2014distributed,
  title={Distributed Representations of Sentences and Documents},
  author={Le, Quoc V. and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1405.4053v2},
  year={2014}
}

Enriching Word Vectors with Subword Information

[2] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}
