GPT-based Japanese model for 🤗 Transformers

This repository contains GPT-based Japanese models trained on the Japanese Wikipedia dataset.

The currently supported models are summarized below.

Model summary:

| 🤗 Model Hub | Data | Revision | Code | Total params | Test set PPL | vocab_size | n_ctx | n_layer | n_head | n_embd | Epochs | Training time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| colorfulscoop/gpt2-small-ja | jawiki_20210820 | 20210820.1.0 | ef927e1 | 110M | 29.13 | 32,000 | 1,024 | 12 | 12 | 768 | 30 | 15 days |
| colorfulscoop/gpt2-small-ja | jawiki_20210301 | 20210301.1.0 | - | 110M | - | 32,000 | 1,024 | 12 | 12 | 768 | 30 | - |

Data summary:

| Id | Corpus | #tokens in train set | #tokens in valid set | #tokens in test set |
|---|---|---|---|---|
| jawiki_20210820 | Japanese Wikipedia on 20210820 | 540M | 13M | 13M |

Note: the same tokenizer is used for models trained on the same data.

Sample usage:

>>> import transformers
>>> pipeline = transformers.pipeline("text-generation", "colorfulscoop/gpt2-small-ja", revision="20210820.1.0")
>>> pipeline("統計的機械学習でのニューラルネットワーク", do_sample=True)
[{'generated_text': '統計的機械学習でのニューラルネットワークの解析は、多くのアルゴリズムの完全な実装をもたらした。これらの'}]
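
For more control than the pipeline provides, the model and tokenizer can also be loaded directly. The following is a minimal sketch, assuming the Model Hub name and revision from the summary table above; the sampling parameters mirror the generation defaults configured later in this README.

import transformers

# Load tokenizer and model pinned to a released revision (names from the table above)
tokenizer = transformers.AutoTokenizer.from_pretrained("colorfulscoop/gpt2-small-ja", revision="20210820.1.0")
model = transformers.AutoModelForCausalLM.from_pretrained("colorfulscoop/gpt2-small-ja", revision="20210820.1.0")

input_ids = tokenizer("統計的機械学習でのニューラルネットワーク", return_tensors="pt").input_ids
# Sample a continuation with the same defaults set in config.json (see the upload section below)
output = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, max_length=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))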

Training details

Model training was conducted in the following environment.

  • OS: Ubuntu 18.04.5 LTS
  • GPU: RTX 2080 Ti x1

Environment preparation

$ docker container run --gpus all --ipc=host --rm -it -v $(pwd):/work -w /work nvidia/cuda:11.1-devel-ubuntu20.04 bash
(container)$ apt update && apt install -y python3 python3-pip git wget
(container)$ pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
(container)$ pip3 install -r requirements.txt

Data preparation

Check the latest dump date in the list at https://dumps.wikimedia.org/jawiki/, then run the data preparation script with that date:

(container)$ bash src/get_jawiki.sh 20210820 input

The generated data can be found under the input directory.

(container)$ ls -1 input/20210820/{train,valid,test}.txt
input/20210820/test.txt
input/20210820/train.txt
input/20210820/valid.txt

Train tokenizer

Train the SentencePiece model in the same container used for data preparation.

(container)$ python3 src/train_tokenizer.py --train_file input/20210820/train.txt --model_dir models/gpt2-small
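
For reference, the core of this step amounts to training a SentencePiece model with the vocabulary size from the summary table. A rough sketch follows; the actual options live in src/train_tokenizer.py, and the output prefix below is only illustrative.

import sentencepiece as spm

# Illustrative only: train a SentencePiece model on the Wikipedia train split;
# vocab_size matches the 32,000 entries reported in the model summary.
spm.SentencePieceTrainer.train(
    input="input/20210820/train.txt",
    model_prefix="sp",  # illustrative prefix; the script writes under models/gpt2-small
    vocab_size=32000,
)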

Train model

Run training with the config file:

(container)$ python3 src/train.py train --config input/gpt2-small.json
...
255999it [10:21:51,  7.03it/s]{'epoch': 30, 'batch': 256000, 'step': 493108, 'train_loss': 0.190585415356369, 'lr': 0.0001}
263236it [10:39:12,  6.86it/s]
6788it [10:28, 10.81it/s]
{'epoch': 30, 'valid_loss': 3.417723441833458, 'valid_ppl': 30.49990112587307, 'save_model': True}
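
As a sanity check on the config, rebuilding the architecture from the hyperparameters in the model summary table reproduces the reported parameter count. This is a sketch assuming a standard transformers GPT2Config; the actual training logic lives in src/train.py.

import transformers

# Architecture from the summary table: vocab_size=32,000, n_ctx=1,024,
# n_layer=12, n_head=12, n_embd=768
config = transformers.GPT2Config(
    vocab_size=32000,
    n_positions=1024,  # n_ctx in the table
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = transformers.GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # ~110M, matching the table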

Test

(container)$ python3 src/train.py test --config input/gpt2-small.json
6793it [09:16, 12.20it/s]
{'test_loss': 3.371613106758486, 'test_ppl': 29.125471679484484}
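
Note that the reported perplexity is simply the exponential of the test cross-entropy loss:

>>> import math
>>> math.exp(3.371613106758486)
29.125471679484484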

Export TensorFlow model

(container)$ pip install tensorflow
(container)$ python3
>>> from transformers import TFGPT2LMHeadModel
>>> model = TFGPT2LMHeadModel.from_pretrained("models/gpt2-small", from_pt=True)
>>> model.save_pretrained("models/gpt2-small")
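
As an optional sanity check (not part of the original steps), the exported TensorFlow weights can be reloaded without the from_pt flag:

>>> from transformers import TFGPT2LMHeadModel
>>> model = TFGPT2LMHeadModel.from_pretrained("models/gpt2-small")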

Upload to 🤗 Model Hub

Follow the official documentation to upload the model.

Prepare environment

Prepare Git LFS. On macOS, it can be installed as follows.

$ brew install git-lfs
$ git lfs install
Updated git hooks.
Git LFS initialized.

Then clone the repository.

$ git clone https://huggingface.co/colorfulscoop/gpt2-small-ja release/gpt2-small-ja

Copy model to release directory

$ cp models/gpt2-small/* release/gpt2-small-ja/
cp: models/gpt2-small/spm is a directory (not copied).
$ cd release/gpt2-small-ja

Then modify config.json to specify default generation parameters, as in the following diff.

   "unk_token_id": 1,
   "use_cache": true,
-  "vocab_size": 32000
+  "vocab_size": 32000,
+  "top_k": 50,
+  "top_p": 0.95,
+  "do_sample": true
 }
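
These values become the model's default generation parameters. As a quick check (an assumed step, not part of the original instructions), they can be read back from the edited config:

>>> import transformers
>>> config = transformers.AutoConfig.from_pretrained("release/gpt2-small-ja")
>>> config.top_k, config.top_p, config.do_sample
(50, 0.95, True)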

Commit the changes to git.

$ git add .
$ git commit -m "Add model"

Release

$ git push origin
