Merge pull request #91 from minimaxir/0.4.0
0.4.0
minimaxir authored Feb 23, 2021
2 parents a4f1d15 + b4dec6c commit 8dbc362
Showing 10 changed files with 799 additions and 107 deletions.
CHANGELOG.md: 24 additions & 0 deletions
@@ -4,6 +4,30 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [0.4.0] - 2021-02-21

- Increased minimum versions of dependencies (`transformers` to 4.3.0, `pytorch-lightning` to 1.2.0)
- Removed the direct dependency on `tokenizers`, since `transformers` pins its own version.
- Made Fast tokenizers the default (as they are the default in `transformers` 4.0.0)
- Made serialized tokenizers the default for custom tokenizers, and added support for loading them in both `aitextgen` and `TokenDataset`s
- Added gradient checkpointing for GPT-2, and made it the default when training the 355M and 774M models.
- Added layer freezing to freeze the first `n` layers of GPT-2 during training. This allows the 1.5B GPT-2 to be trained with a high `n` (see the sketch after this list).
- Added schema-based generation for specified `schema_tokens` (which can be encoded in the Transformers config); this can be used with an appropriately structured dataset for schema-based generation.
- Switched the TensorFlow weight download URL from GCP to Azure, as OpenAI removed the weights from GCP
- Fixed an issue where the assert for a too-long prompt checked the prompt's character length instead of its token length (#90)
- Worked around a breaking issue in Transformers 4.3.0 by moving special-token stripping into aitextgen instead of the tokenizer (#90)
- Added an `lstrip` param to generation, which strips all leading whitespace from the generated text (related to the point above)
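
For illustration, a minimal sketch of how these new options might fit together; aside from `lstrip`, the argument names shown here (`tf_gpt2`, `gradient_checkpointing`, `to_gpu`, `num_layers_freeze`), the file name, and the step counts are assumptions based on the notes above, not taken verbatim from this commit:

```python
from aitextgen import aitextgen

# Load the 355M GPT-2; gradient checkpointing (now the default for 355M/774M)
# trades extra compute for a much smaller memory footprint during training.
ai = aitextgen(tf_gpt2="355M", gradient_checkpointing=True, to_gpu=True)

# Freeze the first 16 of the model's 24 layers so only the top layers are updated.
ai.train("input.txt", batch_size=1, num_steps=2000, num_layers_freeze=16)

# lstrip=True strips leading whitespace from each generated text.
ai.generate(prompt="ROMEO:", lstrip=True)
```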

## [0.3.0] - 2020-11-30

- Increased minimum versions of dependencies (`transformers` to 4.0.0, `pytorch-lightning` to 1.0.8, Pytorch to 1.6)
- Fixed imports to account for the new Transformers file architecture
- Fixed training to account for the new transformers/pytorch-lightning minimums
- Fully removed the TorchScript code (the ONNX implementation will supersede it)
- Made prompt specification for generation more consistent with Transformers conventions
- Set the default vocabulary size for new tokenizers to `1000` (see the example after this list)
- Began work on serializing tokenizers in accordance with the new `tokenizers` approach
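
As a small illustrative example of that default (assuming the tokenizer-training helper accepts a `vocab_size` keyword; the file name and value are placeholders):

```python
from aitextgen.tokenizers import train_tokenizer

# New tokenizers default to a 1000-token vocabulary; pass a larger
# value explicitly for bigger or more varied corpora.
train_tokenizer("input.txt", vocab_size=5000)
```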

## [0.2.1] - 2020-06-28

### Added
README.md: 19 additions & 12 deletions
@@ -5,8 +5,8 @@ A robust Python tool for text-based AI training and generation using [OpenAI's](
aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Hugging Face Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:

- Finetunes on a pretrained 124M GPT-2 model from OpenAI...or create your own GPT-2 model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency! (even [from the 1.5B GPT-2 model](https://docs.aitextgen.io/tutorials/generate_1_5b/)!)
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the Hugging Face model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
- Generates text faster than gpt-2-simple and with better memory efficiency!
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the HuggingFace model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
- The input dataset is its own object, allowing you to not only easily encode megabytes of data in seconds, cache, and compress it on a local computer before transporting to a remote server, but you are able to _merge_ datasets without biasing the resulting dataset, or _cross-train_ on multiple datasets to create blended output.

@@ -54,7 +54,7 @@ aitextgen generate
aitextgen generate --prompt "I believe in unicorns because" --to_file False
```

Want to train your own mini GPT-2 model on your own computer? Download this [text file of Shakespeare's plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt), cd to that directory in a Terminal, open up a `python3` console and go:
Want to train your own mini GPT-2 model on your own computer? You can follow along [in this Jupyter Notebook](/notebooks/training_hello_world.ipynb) or, download this [text file of Shakespeare's plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt), cd to that directory in a Terminal, open up a `python3` console and go:

```python
from aitextgen.TokenDataset import TokenDataset
@@ -66,29 +66,36 @@ from aitextgen import aitextgen
file_name = "input.txt"

# Train a custom BPE Tokenizer on the downloaded text
# This will save two files: aitextgen-vocab.json and aitextgen-merges.txt,
# which are needed to rebuild the tokenizer.
# This will save one file: `aitextgen.tokenizer.json`, which contains the
# information needed to rebuild the tokenizer.
train_tokenizer(file_name)
vocab_file = "aitextgen-vocab.json"
merges_file = "aitextgen-merges.txt"
tokenizer_file = "aitextgen.tokenizer.json"

# GPT2ConfigCPU is a mini variant of GPT-2 optimized for CPU-training
# e.g. the # of input tokens here is 64 vs. 1024 for base GPT-2.
config = GPT2ConfigCPU()

# Instantiate aitextgen using the created tokenizer and config
ai = aitextgen(vocab_file=vocab_file, merges_file=merges_file, config=config)
ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

# You can build datasets for training by creating TokenDatasets,
# which automatically processes the dataset with the appropriate size.
data = TokenDataset(file_name, vocab_file=vocab_file, merges_file=merges_file, block_size=64)
data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)

# Train the model! It will save pytorch_model.bin periodically and after completion.
# On a 2016 MacBook Pro, this took ~25 minutes to run.
ai.train(data, batch_size=16, num_steps=5000)
# On a 2020 8-core iMac, this took ~25 minutes to run.
ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)

# Generate text from it!
ai.generate(10, prompt="ROMEO:")

# With your trained model, you can reload the model at any time by
# providing the pytorch_model.bin model weights, the config, and the tokenizer.
ai2 = aitextgen(model="trained_model/pytorch_model.bin",
tokenizer_file="aitextgen.tokenizer.json",
config="trained_model/config.json")

ai2.generate(10, prompt="ROMEO:")
```

Want to run aitextgen and finetune GPT-2? Use the Colab notebooks in the Demos section, or [follow the documentation](https://docs.aitextgen.io/) to get more information and learn some helpful tips!
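
For reference, a rough local sketch of that finetuning workflow (not taken from the notebooks; the file name, batch size, and step counts below are placeholders, and a CUDA GPU is assumed):

```python
from aitextgen import aitextgen

# Download and load OpenAI's pretrained 124M GPT-2 weights.
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

# Finetune on a plain-text file; aitextgen builds the TokenDataset internally.
ai.train("input.txt", batch_size=1, num_steps=2000, save_every=1000)

# Generate a few samples from the finetuned model.
ai.generate(5, prompt="I believe in unicorns because")
```
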
@@ -102,7 +109,7 @@

## Upcoming Features

The current release (v0.2.X) of aitextgen **is considered to be a beta**, targeting the most common use cases. The Notebooks and examples written so far are tested to work, but more fleshing out of the docs/use cases will be done over the next few months in addition to fixing the known issues noted above.
The current release (v0.4.X) of aitextgen **is considered to be a beta**, targeting the most common use cases. The Notebooks and examples written so far are tested to work, but more fleshing out of the docs/use cases will be done over the next few months in addition to fixing the known issues noted above.

The next versions of aitextgen (and one of the reasons I made this package in the first place) will have native support for _schema-based generation_. (See [this repo](https://github.com/minimaxir/gpt-2-keyword-generation) for a rough proof-of-concept.)

aitextgen/TokenDataset.py: 28 additions & 13 deletions
@@ -56,6 +56,8 @@ def __init__(
file_path: str = None,
vocab_file: str = os.path.join(STATIC_PATH, "gpt2_vocab.json"),
merges_file: str = os.path.join(STATIC_PATH, "gpt2_merges.txt"),
tokenizer: GPT2TokenizerFast = None,
tokenizer_file: str = None,
texts: List[str] = None,
line_by_line: bool = False,
from_cache: bool = False,
@@ -70,7 +72,7 @@ def __init__(
eos_token: str = "<|endoftext|>",
unk_token: str = "<|endoftext|>",
pad_token: str = "<|endoftext|>",
progress_bar_refresh_rate: int = 10,
progress_bar_refresh_rate: int = 20,
**kwargs,
) -> None:

@@ -85,14 +87,27 @@ def __init__(

assert any([texts, file_path]), "texts or file_path must be specified."

tokenizer = GPT2TokenizerFast(
vocab_file=vocab_file,
merges_file=merges_file,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
pad_token=pad_token,
)
if not tokenizer:
if tokenizer_file:
# load the custom GPT-2 tokenizer from a serialized tokenizer
tokenizer = GPT2TokenizerFast(
vocab_file=None,
merges_file=None,
tokenizer_file=tokenizer_file,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
pad_token=pad_token,
)
else:
tokenizer = GPT2TokenizerFast(
vocab_file=vocab_file,
merges_file=merges_file,
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
pad_token=pad_token,
)

# If a cache path is provided, load it.
if from_cache:
@@ -248,7 +263,7 @@ def encode_tokens_from_file(
tokenizer: GPT2TokenizerFast,
newline: str,
header: bool = True,
progress_bar_refresh_rate: int = 10,
progress_bar_refresh_rate: int = 20,
batch_size: int = 1024,
) -> List[int]:
"""
@@ -299,7 +314,7 @@ def encode_tokens_from_file(
if not batch:
break

encoded_texts = tokenizer.batch_encode_plus(
encoded_texts = tokenizer(
batch,
add_special_tokens=False,
return_token_type_ids=False,
@@ -340,7 +355,7 @@ def encode_tokens_from_list(
texts: List[str],
eos_token: str,
tokenizer: GPT2TokenizerFast,
progress_bar_refresh_rate: int = 10,
progress_bar_refresh_rate: int = 20,
batch_size: int = 1024,
) -> List[int]:
"""
@@ -367,7 +382,7 @@ def encode_tokens_from_list(
]
]

encoded_texts = tokenizer.batch_encode_plus(
encoded_texts = tokenizer(
batch,
add_special_tokens=False,
return_token_type_ids=False,
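
The `batch_encode_plus()` → `tokenizer(...)` change above follows the `transformers` 4.x convention of calling the tokenizer object directly. A minimal standalone sketch of the equivalent call, using the stock GPT-2 tokenizer rather than aitextgen's bundled vocab files:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
batch = ["ROMEO: He jests at scars", "JULIET: O Romeo, Romeo"]

# Calling the tokenizer directly returns the same BatchEncoding that
# batch_encode_plus() produced, with one list of input_ids per text.
encoded = tokenizer(batch, add_special_tokens=False, return_token_type_ids=False)
print(encoded["input_ids"])
```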