fastText_doc2vec

fastText_doc2vec is an extension of Facebook's fastText for document embedding.

Requirements

fastText_doc2vec builds in an environment similar to that of Facebook fastText.

Building fastText

In order to build fastText_doc2vec, use the following:

$ git clone https://github.com/Skarface-/fastText_doc2vec.git
$ cd fastText_doc2vec
$ make

This will produce object files for all the classes as well as the main binary fasttext.
If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Example use cases

This library has two use cases: document embedding with the PV-DM model and document embedding with the PV-DBOW model.
Both models are described in [1].
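
As a rough illustration of how PV-DM and PV-DBOW differ, the toy sketch below fits a single document vector by gradient descent while the word vectors stay fixed, following the general idea in the paper by Le and Mikolov cited in the references. It is not the code in this repository: the helper name `infer_doc_vector`, the full softmax, the symmetric context window, the averaging of document and context vectors in PV-DM, and the hand-made vectors are all assumptions chosen to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def infer_doc_vector(word_ids, in_vecs, out_vecs, mode="pvdbow",
                     window=1, lr=0.1, epochs=50):
    """Hypothetical helper: fit one document vector with frozen word vectors.

    word_ids: the document as a list of vocabulary indices
    in_vecs:  frozen input word vectors (used only by PV-DM)
    out_vecs: frozen output (softmax) vectors, one per vocabulary word
    Returns the document vector and the mean loss of each epoch.
    """
    if mode not in ("pvdm", "pvdbow"):
        raise ValueError("mode must be 'pvdm' or 'pvdbow'")
    dim = len(out_vecs[0])
    doc = [0.0] * dim
    losses = []
    for _ in range(epochs):
        total = 0.0
        for pos, target in enumerate(word_ids):
            if mode == "pvdm":
                # PV-DM: the document vector is averaged with nearby word vectors.
                lo = max(0, pos - window)
                nearby = enumerate(word_ids[lo:pos + window + 1], lo)
                ctx = [w for i, w in nearby if i != pos]
            else:
                # PV-DBOW: the document vector alone predicts each word.
                ctx = []
            n = 1 + len(ctx)
            hidden = [(doc[k] + sum(in_vecs[w][k] for w in ctx)) / n
                      for k in range(dim)]
            probs = softmax([sum(u[k] * hidden[k] for k in range(dim))
                             for u in out_vecs])
            total += -math.log(probs[target])
            # Gradient w.r.t. hidden is sum_v (p_v - [v == target]) * out_v;
            # the averaging scales the document vector's share by 1 / n.
            for k in range(dim):
                grad = sum((p - (v == target)) * out_vecs[v][k]
                           for v, p in enumerate(probs))
                doc[k] -= lr * grad / n
        losses.append(total / len(word_ids))
    return doc, losses


# Hand-made vocabulary of 4 words with 4-dimensional vectors (illustrative only).
out_vecs = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
in_vecs = [[0.1 * x for x in row] for row in out_vecs]
document = [0, 1, 2, 0, 1, 2]  # word 3 never occurs in this document

for mode in ("pvdbow", "pvdm"):
    vec, losses = infer_doc_vector(document, in_vecs, out_vecs, mode=mode)
    print(mode, round(losses[0], 3), "->", round(losses[-1], 3))
```

In both modes only the document vector is updated, which mirrors the setup here where the word model (model.bin) is trained beforehand.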

Document Embedding

To embed documents in a vector space, as described in [1], run:

$ ./fasttext pvdm -model model.bin -input data.txt -output docvecs
or
$ ./fasttext pvdbow -model model.bin -input data.txt -output docvecs

where model.bin is a model previously trained with fastText word representation learning.
As such, most options are inherited from the fastText word representation training, with the exception of epoch, thread, etc.
data.txt is a file containing UTF-8 encoded labeled documents (__label__<label>, <text>).
At the end of document embedding, the program saves a single file: docvecs.vec.
docvecs.vec is a text file containing the labeled document vectors, one per line.
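
A minimal sketch for reading docvecs.vec back into Python might look like the following. The exact line layout is not specified above, so this assumes each line is a label token followed by whitespace-separated floats, with no header line; the helper names `parse_docvec_line`, `load_docvecs`, and `cosine` and the sample lines are hypothetical. Adjust the parsing if the real file differs.

```python
import math


def parse_docvec_line(line, label_prefix="__label__"):
    """Split one line into (label, vector).

    Assumed layout: '<label> v1 v2 ... vd', whitespace separated.
    """
    tokens = line.split()
    label, values = tokens[0], tokens[1:]
    if label.startswith(label_prefix):
        label = label[len(label_prefix):]
    return label, [float(v) for v in values]


def load_docvecs(path):
    """Parse every non-empty line of a docvecs.vec-style file."""
    with open(path, encoding="utf-8") as handle:
        return [parse_docvec_line(line) for line in handle if line.strip()]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up sample lines in the assumed layout, not real output of the tool.
sample = [
    "__label__sports 0.5 0.1 -0.2",
    "__label__politics -0.3 0.4 0.9",
]
parsed = [parse_docvec_line(line) for line in sample]
print(parsed[0][0], parsed[1][0], cosine(parsed[0][1], parsed[1][1]))
```

With a real run, `load_docvecs("docvecs.vec")` would return the same list-of-pairs structure, provided the layout assumption holds.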

Full documentation

Invoke a command without arguments to list available arguments and their default values:

$ ./fasttext pvdm
Empty model or input or output path.

The following arguments are mandatory:
  -model      (mandatory, only pvdm or pvdbow) model.bin file path for document embedding
  -input      training file path
  -output     output file path

The following arguments are optional:
  -lr         learning rate [0.05]
  -epoch      number of epochs [5]
  -thread     number of threads [12]
  -verbose    how often to print to stdout [10000]
  -label      labels prefix [__label__]
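
When driving the tool from a script, a small wrapper that assembles the command line from the arguments listed above can help avoid typos. The sketch below uses only the flags shown in that listing; the helper names `build_command` and `run`, and the default binary path `./fasttext`, are assumptions for illustration.

```python
import subprocess

MODES = ("pvdm", "pvdbow")
OPTIONAL_FLAGS = ("lr", "epoch", "thread", "verbose", "label")


def build_command(mode, model, input_path, output,
                  binary="./fasttext", **options):
    """Return the argument list for one document-embedding run."""
    if mode not in MODES:
        raise ValueError("mode must be one of %s" % (MODES,))
    unknown = sorted(set(options) - set(OPTIONAL_FLAGS))
    if unknown:
        raise ValueError("unsupported options: %s" % unknown)
    cmd = [binary, mode,
           "-model", model, "-input", input_path, "-output", output]
    # Emit optional flags in the order of the help listing.
    for name in OPTIONAL_FLAGS:
        if name in options:
            cmd += ["-" + name, str(options[name])]
    return cmd


def run(cmd):
    """Run the command; needs the compiled binary and the input files."""
    subprocess.run(cmd, check=True)


cmd = build_command("pvdbow", "model.bin", "data.txt", "docvecs",
                    epoch=10, thread=4)
print(" ".join(cmd))
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues with file paths that contain spaces.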

References

Please cite [1] and [2].

Distributed Representations of Sentences and Documents

[1] Q. V. Le, T. Mikolov, Distributed Representations of Sentences and Documents

@article{quoc2014distributed,
  title={Distributed Representations of Sentences and Documents},
  author={Le, Quoc V. and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1405.4053v2},
  year={2014}
}

Enriching Word Vectors with Subword Information

[2] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

The fastText community

Facebook page: https://www.facebook.com/groups/1174547215919768
Google group: https://groups.google.com/forum/#!forum/fasttext-library
