This section explains how to preprocess the pretraining data for the BERT models and the training data for the tokenizers.
The following commands will generate the dataset under ./data/original/wikipedia_ja.
cd compare-ja-tokenizer/preprocessing_for_tokenizers/
python src/generate_wiki_dataset.py
cd ../data/original
wget http://data.statmt.org/cc-100/ja.txt.xz
unxz ja.txt.xz
cd ../../preprocessing_for_tokenizers/
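For reference, here is a minimal sketch of what a script like src/generate_wiki_dataset.py might do: download a Japanese Wikipedia dump with the Hugging Face datasets library and save it to disk. The dump name (wikimedia/wikipedia, snapshot 20231101.ja) and the relative output path are assumptions; the actual script may use a different source, snapshot, or output format.

# Sketch only (assumptions as noted above); not the actual src/generate_wiki_dataset.py.
from datasets import load_dataset

# Download the Japanese Wikipedia dump and save it under data/original/wikipedia_ja
# (path relative to preprocessing_for_tokenizers/).
wiki = load_dataset("wikimedia/wikipedia", "20231101.ja", split="train")
wiki.save_to_disk("../data/original/wikipedia_ja")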
The following command will generate the data at ../data/sentence/tokenizer_data_wiki.txt. Note that this process takes a long time.
python src/make_sentence_data_wiki.py
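Conceptually, this step writes the Wikipedia articles as one sentence per line, which is the format used for tokenizer training. Below is a rough sketch of that idea, assuming the dataset was saved with datasets.save_to_disk and that sentences are split naively on the Japanese full stop; the actual script may use a proper sentence splitter and additional filtering.

# Rough sketch (assumptions as noted above); not the actual src/make_sentence_data_wiki.py.
import os
from datasets import load_from_disk

os.makedirs("../data/sentence", exist_ok=True)
wiki = load_from_disk("../data/original/wikipedia_ja")
with open("../data/sentence/tokenizer_data_wiki.txt", "w", encoding="utf-8") as f:
    for article in wiki:
        for paragraph in article["text"].splitlines():
            # Naive sentence split on the Japanese full stop; keep the delimiter.
            for sent in paragraph.split("。"):
                sent = sent.strip()
                if sent:
                    f.write(sent + "。\n")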
The following command will generate the data at ../data/sentence/tokenizer_data_cc100.txt. Note that this process also takes a long time.
python src/make_sentence_data_cc100.py
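The downloaded ja.txt is plain text, so the corresponding sketch (same assumptions as above: naive splitting on the Japanese full stop, paths relative to preprocessing_for_tokenizers/) simply streams the file and writes one sentence per line; the actual script may apply additional cleaning.

# Rough sketch (assumptions as noted above); not the actual src/make_sentence_data_cc100.py.
import os

os.makedirs("../data/sentence", exist_ok=True)
with open("../data/original/ja.txt", encoding="utf-8") as fin, \
     open("../data/sentence/tokenizer_data_cc100.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        # Naive sentence split on the Japanese full stop; keep the delimiter.
        for sent in line.strip().split("。"):
            sent = sent.strip()
            if sent:
                fout.write(sent + "。\n")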
# Randomly sample 10M sentences
shuf -n 10000000 ../data/sentence/tokenizer_data_wiki.txt > ../data/tokenizer/tokenizer_data.txt