Train a GPT-2 transformer language model on your own corpora with sentencepiece tokenization.
This repo contains a PyTorch implementation of GPT-2, which supports multi-GPU training.
It also contains a TensorFlow implementation in lm/gpt_2_tf, but it is no longer developed. The two implementations share the same data preparation scripts. The TF training command is gpt-2-tf-train and requires TensorFlow 1.13. The documentation below is for the PyTorch version.
Python 3.6+ is required, with torch nightly or 1.6.0+. Working in a virtualenv is assumed below. Install the appropriate version of PyTorch first, and then:
pip install -r requirements.txt
python setup.py develop
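For example, one possible setup (the virtualenv name and the CPU-only PyTorch install are assumptions here; adjust for your environment and CUDA version):

    python3 -m venv venv
    source venv/bin/activate
    pip install torch
    pip install -r requirements.txt
    python setup.py develop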
Instructions are below. See also test/test_shakespeare.sh for a complete pipeline demo on a small corpus (takes a minute on a CPU).
Corpus format: a directory with top-level train, valid and test folders. Each top-level folder may contain sub-folders. Inside them, there must be utf-8 encoded text files with a .txt extension.
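For example, a minimal corpus directory could look like this (the corpus and file names are only illustrative):

    data/corpora-example/
        train/
            part-1.txt
            part-2.txt
        valid/
            part-1.txt
        test/
            part-1.txt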
The commands that train the sentencepiece model and encode the corpus support multiple corpora; in the examples below we assume they can be listed as data/corpora-*.
Train the sentencepiece model (sp-text.txt can be removed after running). This can consume a large amount of memory; adjust the sentencepiece arguments as advised if needed (this is not supported in the sp-train command directly):
sp-train data/corpora-* sp-text.txt sp-model
Encode corpora, producing numpy files:
sp-encode data/corpora-* sp-model.model data/encoded
Example training command:
gpt-2 run-root data/encoded sp-model.model
run-root would contain model checkpoints and json-lines logs, which can be plotted in a jupyter notebook with json_log_plots.plot("run-root"), with the number of tokens seen on the X axis.
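For example, in a jupyter notebook cell (assuming the package providing json_log_plots is installed):

    import json_log_plots
    json_log_plots.plot("run-root")  # X axis is the number of tokens seen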
Default hyperparameters correspond to the released "small" GPT-2 model.
When multiple GPUs are available, they will be used for training via torch.distributed.
If the run-root path exists and the --clean flag is NOT passed, training will be resumed. Note that all parameters still need to be specified and the model parameters need to match.
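For example, assuming --clean discards any existing checkpoints and starts from scratch, a fresh run in an existing run-root would look like:

    gpt-2 run-root data/encoded sp-model.model --clean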
Notes on training parameters (see the example command after these notes):
- --batch-size is per-GPU, so you don't need to re-tune it when changing the number of GPUs; just use the maximum that fits into memory.
- --g-accum-gradients is the global number of gradient accumulations; it must be divisible by the number of GPUs. The effective global batch size is always batch_size * g_accum_gradients.
- --lr does not need to be changed when changing --batch-size, --g-accum-gradients, the number of GPUs or --n-ctx: the loss is already scaled appropriately.
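For example, an explicit setting of these parameters might look like this (the concrete values are illustrative, not tuned recommendations):

    gpt-2 run-root data/encoded sp-model.model --batch-size 2 --g-accum-gradients 8 --n-ctx 1024

With these values the effective global batch size is 2 * 8 = 16, whether training runs on 1, 2, 4 or 8 GPUs.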
Example generation command:
gpt-2-gen run-root "Artificial intelligence"
Here run-root is the run directory containing model checkpoints (produced by training above), and "Artificial intelligence" is the text prefix used as a starting point for generating tokens.
Notes on inference parameters (see the example command after these notes):
- --tokens-to-generate: number of tokens to generate, default is 42.
- --top-k: number of token candidates to generate for each position (beam width), default is 8.
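For example, to generate a longer continuation with more candidates per position (the values are illustrative):

    gpt-2-gen run-root "Artificial intelligence" --tokens-to-generate 100 --top-k 32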
License is MIT.
The TensorFlow GPT-2 model is taken from https://github.com/openai/gpt-2/blob/master/src/model.py and the TensorFlow GPT-2 training code is based on https://github.com/nshepperd/gpt-2/blob/finetuning/train.py
The PyTorch port is based on the original OpenAI code.
The test Shakespeare corpus under tests/shakespeare is from http://shakespeare.mit.edu and is in the public domain.