Download the corpus

  • The corpus is stored as Git LFS files because of its size.
  • A tool such as SourceTree makes it easy to git clone the LFS files.
cd ~/workspace
git clone https://gitlab.com/bage79/nlp4kor-ko.wikipedia.org.git

Create a vocabulary

./word2vec_vocab.sh
  • Creates a *.vocab file (a Word2VecVocab instance).
    • Takes 1 to 2 minutes on a MacBook Pro.
  • vocab size: 100,000
  • min count: 2
  • unknown word: '¿'
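
As a rough illustration of how the three settings above interact, here is a minimal sketch in plain Python. It is not the Word2VecVocab implementation: `build_vocab` and `encode` are hypothetical helpers, and placing the unknown symbol at index 0 and counting it toward the vocabulary size are assumptions.

```python
from collections import Counter

UNKNOWN = '¿'  # unknown-word symbol listed in this README


def build_vocab(tokens, max_size=100_000, min_count=2, unknown=UNKNOWN):
    """Return (idx2word, word2idx); the unknown symbol gets index 0 (assumption)."""
    counts = Counter(tokens)
    # Keep words seen at least min_count times, most frequent first.
    kept = [w for w, c in counts.most_common() if c >= min_count and w != unknown]
    # Reserve one slot for the unknown symbol (assumption).
    idx2word = [unknown] + kept[:max_size - 1]
    word2idx = {w: i for i, w in enumerate(idx2word)}
    return idx2word, word2idx


def encode(tokens, word2idx, unknown=UNKNOWN):
    """Map tokens to ids; out-of-vocabulary words map to the unknown id."""
    unk_id = word2idx[unknown]
    return [word2idx.get(t, unk_id) for t in tokens]


tokens = "the cat sat on the mat the cat ran".split()
idx2word, word2idx = build_vocab(tokens, max_size=100_000, min_count=2)
print(idx2word)
print(encode("the dog sat".split(), word2idx))
```

With min count 2, words that occur only once ("sat", "on", "mat", "ran") fall out of the vocabulary and are encoded as the unknown symbol.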

Create a corpus

./word2vec_corpus.sh
  • Creates a *.corpus file (a Word2VecCorpus instance).
    • Takes 5 to 10 minutes on a MacBook Pro.
  • window size: 1
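
To make the window size concrete, the sketch below generates (center, context) pairs from a sequence of word ids. It assumes a skip-gram style corpus, which this README does not state explicitly, and `skipgram_pairs` is a hypothetical helper rather than part of Word2VecCorpus.

```python
def skipgram_pairs(ids, window=1):
    """Return (center, context) id pairs at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(ids):
        lo = max(0, i - window)
        hi = min(len(ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, ids[j]))
    return pairs


# With window size 1, each word is paired only with its direct neighbours.
print(skipgram_pairs([1, 2, 3], window=1))
```

A window size of 1 keeps the number of training pairs small (at most two per token), at the cost of ignoring more distant context.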

Create a word2vec embedding

./word2vec_trainer.sh
  • Creates a *.embedding file (a Word2VecEmbedding instance).
    • Takes about 16 hours on a GPU PC (GTX 1080 Ti).
  • embedding size: 300 (larger is better)
  • batch size: 500 (larger is better)
  • negative samples: 100 (larger is better)
  • optimizer & learning rate: Adam, 1e-4
  • epochs: 20
  • subsample rate: 1e-5
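
The subsample rate controls how aggressively very frequent words are dropped before training. The sketch below assumes the formula from the original word2vec paper, where a word with relative frequency f is kept with probability min(1, sqrt(t / f)) for rate t; the trainer in this repository may use a different variant. `keep_probability` and `subsample` are hypothetical names.

```python
import math
import random


def keep_probability(word_count, total_count, rate=1e-5):
    """Probability of keeping one occurrence of a word (paper formula, assumed)."""
    freq = word_count / total_count
    return min(1.0, math.sqrt(rate / freq))


def subsample(tokens, counts, rate=1e-5, rng=random):
    """Randomly drop occurrences of frequent words from a token list."""
    total = sum(counts.values())
    return [t for t in tokens
            if rng.random() < keep_probability(counts[t], total, rate)]


# A word making up 1% of a 1,000,000-token corpus is kept only rarely,
# while a word that occurs once is always kept.
print(keep_probability(10_000, 1_000_000))
print(keep_probability(1, 1_000_000))
```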

Use the word2vec embedding

python ./word2vec_embedding_test.py
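
A common way to inspect a trained embedding is to list the nearest neighbours of a word by cosine similarity. The following is a minimal sketch of that idea in plain Python; it does not reproduce word2vec_embedding_test.py, and the `most_similar` helper and the toy vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, embeddings, top_n=3):
    """Return the top_n (word, similarity) pairs closest to `word`."""
    query = embeddings[word]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy 2-dimensional vectors, invented for this example only.
toy = {
    'king': [0.9, 0.1],
    'queen': [0.8, 0.2],
    'apple': [0.1, 0.9],
}
print(most_similar('king', toy, top_n=2))
```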

screenshot

Create TensorBoard format files for visualization (optional)

  • Creates checkpoint, *.ckpt, *.tsv, and *.pbtxt files in the TensorBoard log directory.
    • Takes 10 to 20 seconds on a MacBook Pro.
./word2vec_tensorboard.sh
  • Start TensorBoard.
tensorboard --logdir=~/tensorboard_log/ --port=6006
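
For reference, the *.tsv files used by the embedding projector are plain tab-separated text: one file with a vector per line and one metadata file with the matching label per line. The sketch below writes such a pair in plain Python. It is not what word2vec_tensorboard.sh does internally; the function name and the file names vectors.tsv and metadata.tsv are assumptions, and the checkpoint and *.pbtxt files are not covered.

```python
import os
import tempfile


def write_projector_tsv(idx2word, vectors, log_dir):
    """Write vectors and labels as two line-aligned TSV files."""
    os.makedirs(log_dir, exist_ok=True)
    vectors_path = os.path.join(log_dir, 'vectors.tsv')    # hypothetical name
    metadata_path = os.path.join(log_dir, 'metadata.tsv')  # hypothetical name
    with open(vectors_path, 'w', encoding='utf-8', newline='') as f:
        for vec in vectors:
            f.write('\t'.join(str(x) for x in vec) + '\n')
    # A single-column metadata file carries no header row.
    with open(metadata_path, 'w', encoding='utf-8', newline='') as f:
        for word in idx2word:
            f.write(word + '\n')
    return vectors_path, metadata_path


# Toy data written to a temporary directory, for illustration only.
demo_dir = tempfile.mkdtemp()
demo_paths = write_projector_tsv(['¿', 'the'], [[0.0, 0.5], [1.0, -1.0]], demo_dir)
print(demo_paths)
```

Line i of the metadata file labels line i of the vectors file, so both must be written in the same vocabulary order.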