Sample code for training Word2Vec and FastText on a Wikipedia corpus, together with the resulting pretrained word embeddings.
For technical details, please read my blog: Chinese version | English version
The code was tested with Python 3.9; it may work on other Python versions, but this is not guaranteed. Using Poetry to set up the environment is recommended:
pip install poetry
poetry install
Alternatively, set up a virtual environment and install the dependencies with pip:

virtualenv .venv -p python3
source .venv/bin/activate
pip install -r requirement.txt
To train word embeddings, run:

poetry run python train.py --lang en --model word2vec --size 300 --output data/en_wiki_word2vec_300.txt
--lang: en for English, zh for Chinese
--model: word2vec or fasttext
--size: dimensionality of the trained word embeddings
--output: path to save the trained word embeddings
If you are using pip, please run:
python train.py --lang en --model word2vec --size 300 --output data/en_wiki_word2vec_300.txt
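For reference, here is a minimal sketch of what such a training script might look like. It assumes gensim (its WikiCorpus reader and the Word2Vec/FastText models); the dump_path argument is hypothetical, and the actual train.py may differ in preprocessing and hyperparameters.

```python
from gensim.corpora import WikiCorpus
from gensim.models import FastText, Word2Vec

def train(dump_path, model_name="word2vec", size=300, output="embeddings.txt"):
    # Stream tokenized article texts from a local Wikipedia XML dump
    # (dump_path is hypothetical, e.g. enwiki-latest-pages-articles.xml.bz2).
    wiki = WikiCorpus(dump_path, dictionary={})
    sentences = list(wiki.get_texts())  # fits in memory only for small dumps

    model_cls = Word2Vec if model_name == "word2vec" else FastText
    model = model_cls(sentences=sentences, vector_size=size, workers=4)

    # Save in the plain-text word2vec format used by the commands above.
    model.wv.save_word2vec_format(output, binary=False)
```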
The visualization supports only Chinese and English. To run the demo:
poetry run python demo.py --lang en --output data/en_wiki_word2vec_300.txt
--lang: en for English, zh for Chinese
--output: path to the trained word embeddings
If you are using pip, please run:
python demo.py --lang en --output data/en_wiki_word2vec_300.txt
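As a rough illustration of the kind of visualization such a demo can produce, the sketch below loads the trained vectors with gensim, queries nearest neighbours, then projects a few words to 2-D with PCA and plots them. The actual demo.py may work differently; matplotlib and scikit-learn are assumptions, not stated dependencies.

```python
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

# Load embeddings saved in word2vec text format by the training step.
wv = KeyedVectors.load_word2vec_format("data/en_wiki_word2vec_300.txt", binary=False)

# Nearest neighbours in the embedding space.
print(wv.most_similar("king", topn=5))

# Project a handful of words to 2-D and scatter-plot them.
words = ["king", "queen", "man", "woman", "paris", "france"]
points = PCA(n_components=2).fit_transform([wv[w] for w in words])
for (x, y), word in zip(points, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()
```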
Pretrained word embeddings trained on the wiki corpora are available for download:

|  | Chinese | English |
|---|---|---|
| Word2Vec | Download | Download |
| FastText | Download | Download |
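If you download one of these files instead of training your own, it can presumably be loaded the same way. The sketch below assumes the files are in the word2vec text format (like the training output above) and uses a hypothetical local path.

```python
from gensim.models import KeyedVectors

# Hypothetical path; point this at wherever you saved the downloaded file.
wv = KeyedVectors.load_word2vec_format("data/en_wiki_fasttext_300.txt", binary=False)

# Cosine similarity and the classic analogy king - man + woman ≈ queen.
print(wv.similarity("cat", "dog"))
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```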