
# Training tokenizers

Here, we explain how to generate dictionary files for tokenizers.
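Each script below trains a subword tokenizer and writes its dictionary files. As a rough sketch of what such a script does internally, here is BPE training with the Hugging Face `tokenizers` library; the vocabulary size, special tokens, file paths, and output name are illustrative assumptions, not the repository's actual settings:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train BPE on text that a morphological analyzer has already split on whitespace.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=30000,  # assumed; see the individual scripts for the real settings
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["pretokenized_data.txt"], trainer=trainer)
tokenizer.save("bpe.json")  # hypothetical output name
```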

## 1. Training tokenizers with MeCab

```bash
cd ./compare-ja-tokenizer/tokenizer

# BPE (Takes time.)
bash ./scripts/mecab/bpe.sh

# WordPiece (Takes time.)
bash ./scripts/mecab/wordpiece.sh

# Unigram (Takes time.)
bash ./scripts/mecab/unigram.sh
```

If you installed MeCab locally, you may need to specify the location of `mecabrc` as follows:

```bash
echo 'export MECABRC="[your path]/etc/mecabrc"' >> ~/.bashrc
```

`[your path]` should be `$HOME/usr` if you followed our installation instructions.
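To check that MeCab picks up your `mecabrc`, a quick test with the mecab-python3 bindings (an assumption on our part; the scripts may invoke MeCab differently) looks like this:

```python
import MeCab  # mecab-python3; libmecab honors the MECABRC environment variable

# -Owakati outputs space-separated surface forms (wakati-gaki)
tagger = MeCab.Tagger("-Owakati")
print(tagger.parse("日本語の文を分かち書きします。"))
```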

## 2. Training tokenizers with Juman++

```bash
# Pre-tokenize (Takes time.)
jumanpp ../data/tokenizer/tokenizer_data.txt --segment > ../data/tokenizer/juman++/pretokenized_data.txt
```
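For reference, the same segmentation can be produced from Python with the pyknp bindings (an assumption; the command above uses the `jumanpp` CLI directly):

```python
from pyknp import Juman  # assumes pyknp is installed and jumanpp is on PATH

juman = Juman()
result = juman.analysis("日本語の文を分かち書きします。")
print(" ".join(m.midasi for m in result.mrph_list()))  # space-separated surface forms
```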

```bash
# BPE
bash ./scripts/juman++/bpe.sh

# WordPiece
bash ./scripts/juman++/wordpiece.sh

# Unigram
bash ./scripts/juman++/unigram.sh
```

## 3. Training tokenizers with Sudachi

```bash
# Pre-tokenize (Takes time.)
python src/sudachi/pretokenize_sudachi.py \
    --input_path ../data/tokenizer/tokenizer_data.txt \
    --output_path ../data/tokenizer/sudachi/pretokenized_data.txt
```
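A minimal sketch of what `src/sudachi/pretokenize_sudachi.py` might do, assuming SudachiPy with the `sudachidict_core` dictionary (the split mode and any batching are assumptions):

```python
from sudachipy import dictionary, tokenizer  # sudachipy + sudachidict_core assumed

tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C  # assumed split mode

with open("../data/tokenizer/tokenizer_data.txt", encoding="utf-8") as fin, \
        open("../data/tokenizer/sudachi/pretokenized_data.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        morphemes = tokenizer_obj.tokenize(line.strip(), mode)
        fout.write(" ".join(m.surface() for m in morphemes) + "\n")
```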

```bash
# BPE
bash ./scripts/sudachi/bpe.sh

# WordPiece
bash ./scripts/sudachi/wordpiece.sh

# Unigram
bash ./scripts/sudachi/unigram.sh
```

## 4. Training tokenizers with Vaporetto

```bash
# Pre-tokenize (Takes time.)
cat ../data/tokenizer/tokenizer_data.txt | cargo run --release -p predict -- --model /path/to/bccwj-suw+unidic+tag.model.zst > ../data/tokenizer/vaporetto/temp_pretokenized_data.txt
sed -e 's/\\//g' ../data/tokenizer/vaporetto/temp_pretokenized_data.txt > ../data/tokenizer/vaporetto/pretokenized_data.txt
```

You must specify the path to `bccwj-suw+unidic+tag.model.zst`. By default, the model should have been downloaded to `../data/bccwj-suw+unidic+tag/`.
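The `sed` step above removes the backslashes in Vaporetto's output (they appear to escape special characters in tokens). A Python equivalent of that cleanup would be:

```python
# Equivalent of the sed command above: strip backslashes from Vaporetto's
# output (paths as in the commands above).
in_path = "../data/tokenizer/vaporetto/temp_pretokenized_data.txt"
out_path = "../data/tokenizer/vaporetto/pretokenized_data.txt"

with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(line.replace("\\", ""))
```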

```bash
# BPE
bash ./scripts/vaporetto/bpe.sh

# WordPiece
bash ./scripts/vaporetto/wordpiece.sh

# Unigram
bash ./scripts/vaporetto/unigram.sh
```

## 5. Training tokenizers without morphological analyzers

```bash
# BPE
bash ./scripts/nothing/bpe.sh

# WordPiece
bash ./scripts/nothing/wordpiece.sh

# Unigram
bash ./scripts/nothing/unigram.sh
```
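The difference from the analyzer-based setups above is that no pre-tokenization is applied, so the subword model is trained directly on raw text. A hedged sketch with the Hugging Face `tokenizers` library (all settings are assumptions, as above):

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

# No pre-tokenizer is set: the Unigram model sees raw, unsegmented text.
tokenizer = Tokenizer(Unigram())
trainer = UnigramTrainer(
    vocab_size=30000,  # assumed
    unk_token="[UNK]",
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["../data/tokenizer/tokenizer_data.txt"], trainer=trainer)
tokenizer.save("unigram.json")  # hypothetical output name
```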