bage79/word2vec4kor

word2vec4kor

tensorboard_log: word2vec visualization data

  • TensorBoard embedding (projector) file format
  • Window size: 1
  • Total unique words: 10,000
  • Tokenization: whitespace
  • Embedding dimension: 300
  • Skip-gram + negative sampling + subsampling
mkdir -p ~/workspace
cd ~/workspace

git clone https://github.com/bage79/word2vec4kor
tensorboard --logdir ~/workspace/word2vec4kor/tensorboard_log
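
The settings listed above (whitespace tokenization, window size 1, subsampling) can be illustrated with a minimal sketch. This is not the code used to train the embeddings in this repository. The function names `skipgram_pairs` and `discard_probability` are hypothetical, the example sentence is made up, and the subsampling rule and the threshold `1e-5` are taken from the original word2vec paper (Mikolov et al., 2013) as an assumption about how subsampling was applied here.

```python
import math


def skipgram_pairs(tokens, window=1):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def discard_probability(relative_freq, threshold=1e-5):
    """Subsampling rule from the word2vec paper: P(discard) = 1 - sqrt(t / f).

    `relative_freq` is the word's share of all tokens in the corpus.
    The threshold value is an assumption, not a setting confirmed by this repo.
    """
    return max(0.0, 1.0 - math.sqrt(threshold / relative_freq))


# Whitespace tokenization, as listed above (hypothetical example sentence).
tokens = "나는 학교에 간다".split()
pairs = list(skipgram_pairs(tokens, window=1))
for center, context in pairs:
    print(center, context)
```

With a window size of 1, each word is paired only with its immediate left and right neighbours, so a three-word sentence yields four training pairs. Frequent words get a discard probability close to 1, while words at or below the threshold frequency are always kept.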

demo

ko.wikipedia.org.sentences raw corpus

  • Source: https://ko.wikipedia.org
  • Total sentences: approximately 3,115,431
wget https://gitlab.com/bage79/nlp4kor-ko.wikipedia.org/raw/master/data/ko.wikipedia.org.sentences.gz
gzip -d ko.wikipedia.org.sentences.gz
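
Once the corpus is decompressed, a vocabulary like the 10,000-word one described above could be built along these lines. This is a rough sketch under two assumptions that the README does not confirm: the file is UTF-8 text with one sentence per line, and the vocabulary is simply the most frequent whitespace-separated tokens. The helper names `build_vocab` and `load_vocab` are hypothetical.

```python
from collections import Counter


def build_vocab(sentences, max_size=10_000):
    """Map the `max_size` most frequent whitespace-separated tokens to integer ids."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_size))}


def load_vocab(path, max_size=10_000):
    """Stream the corpus line by line (assumes one sentence per line, UTF-8)."""
    with open(path, encoding="utf-8") as f:
        return build_vocab(f, max_size)


# Small in-memory demonstration; for the real corpus one might call
# load_vocab("ko.wikipedia.org.sentences") after the gzip step above.
demo_vocab = build_vocab(["a b a", "b a c"], max_size=2)
print(demo_vocab)
```

Streaming the file keeps memory use bounded by the number of distinct tokens rather than by the size of the corpus. Whether the original vocabulary reserved a slot for unknown words is not stated, so this sketch does not add one.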

Tips

Download a Korean Wikipedia dump file

  • https://dumps.wikimedia.org/kowiki/20180220/

Parse the dump file (MediaWiki format) into a text file

  • https://pypi.python.org/pypi/mediawiki-parser/

Open-source word2vec implementation

  • https://github.com/theeluwin/pytorch-sgns
