Merge pull request #118 from minimaxir/0.5.0-docs
Fix docs for 0.5.0 changes
minimaxir authored Apr 19, 2021
2 parents 791e0e1 + 4a1c6dc commit 8a44c4d
Showing 17 changed files with 166 additions and 122 deletions.
21 changes: 10 additions & 11 deletions README.md
@@ -1,10 +1,10 @@
# aitextgen

A robust Python tool for text-based AI training and generation using [OpenAI's](https://openai.com) [GPT-2](https://openai.com/blog/better-language-models/) architecture.
A robust Python tool for text-based AI training and generation using [OpenAI's](https://openai.com) [GPT-2](https://openai.com/blog/better-language-models/) and [EleutherAI's](https://www.eleuther.ai) [GPT Neo/GPT-3](https://github.com/EleutherAI/gpt-neo) architecture.

aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Hugging Face Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:

- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI...or create your own GPT-2 model + tokenizer and train from scratch!
- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI or a 125M/350M GPT Neo model from EleutherAI...or create your own GPT-2/GPT Neo model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency!
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the HuggingFace model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
@@ -16,7 +16,7 @@ You can read more about aitextgen [in the documentation](https://docs.aitextgen.

You can play with aitextgen _for free_ with powerful GPUs using these Colaboratory Notebooks!

- [Finetune OpenAI's 124M GPT-2 model on your own dataset (GPU)](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing)
- [Finetune OpenAI's 124M GPT-2 model (or GPT Neo) on your own dataset (GPU)](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing)
- [Train a GPT-2 model + tokenizer from scratch (GPU)](https://colab.research.google.com/drive/144MdX5aLqrQ3-YW-po81CQMrD6kpgpYh?usp=sharing)

You can also play with custom [Reddit](notebooks/reddit_demo.ipynb) and [Hacker News](notebooks/hacker_news_demo.ipynb) demo models on your own PC.
@@ -35,7 +35,7 @@ Here's how you can quickly test out aitextgen on your own computer, even if you

For generating text from a pretrained GPT-2 model:

```python
```py3
from aitextgen import aitextgen

# Without any parameters, aitextgen() will download, cache, and load the 124M GPT-2 "small" model
```

@@ -56,7 +56,7 @@ aitextgen generate --prompt "I believe in unicorns because" --to_file False
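
The collapsed hunk above elides the rest of the Python quickstart; a minimal sketch of loading and generating, assuming the default no-argument behavior described in the comment above:

```py3
from aitextgen import aitextgen

# Downloads, caches, and loads the 124M GPT-2 "small" model by default
ai = aitextgen()

# Print a single generated text to the console
ai.generate()

# Generate several texts seeded with a prompt
ai.generate(n=5, prompt="I believe in unicorns because")
```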

Want to train your own mini GPT-2 model on your own computer? You can follow along [in this Jupyter Notebook](/notebooks/training_hello_world.ipynb) or download this [text file of Shakespeare's plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt), cd to that directory in a Terminal, open up a `python3` console, and go:

```python
```py3
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU
@@ -82,18 +82,17 @@ ai = aitextgen(tokenizer_file=tokenizer_file, config=config)
# which automatically processes the dataset with the appropriate size.
data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)

# Train the model! It will save pytorch_model.bin periodically and after completion.
# Train the model! It will save pytorch_model.bin periodically and after completion to the `trained_model` folder.
# On a 2020 8-core iMac, this took ~25 minutes to run.
ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)

# Generate text from it!
ai.generate(10, prompt="ROMEO:")

# With your trained model, you can reload the model at any time by
# providing the pytorch_model.bin model weights, the config, and the tokenizer.
ai2 = aitextgen(model="trained_model/pytorch_model.bin",
tokenizer_file="aitextgen.tokenizer.json",
config="trained_model/config.json")
# providing the folder containing the pytorch_model.bin model weights + the config, and providing the tokenizer.
ai2 = aitextgen(model_folder="trained_model",
tokenizer_file="aitextgen.tokenizer.json")

ai2.generate(10, prompt="ROMEO:")
```
@@ -106,7 +105,7 @@ Want to run aitextgen and finetune GPT-2? Use the Colab notebooks in the Demos s

## Upcoming Features

The current release (v0.4.X) of aitextgen **is considered to be a beta**, targeting the most common use cases. The Notebooks and examples written so far are tested to work, but more fleshing out of the docs/use cases will be done over the next few months in addition to fixing the known issues noted above.
The current release (v0.5.X) of aitextgen **is considered to be a beta**, targeting the most common use cases. The Notebooks and examples written so far are tested to work, but more fleshing out of the docs/use cases will be done over the next few months in addition to fixing the known issues noted above.

The next versions of aitextgen (and one of the reasons I made this package in the first place) will have native support for _schema-based generation_. (See [this repo](https://github.com/minimaxir/gpt-2-keyword-generation) for a rough proof-of-concept.)

19 changes: 8 additions & 11 deletions aitextgen/aitextgen.py
@@ -16,6 +16,7 @@
from transformers import (
AutoConfig,
AutoModelForCausalLM,
AutoTokenizer,
GPT2Config,
GPT2LMHeadModel,
GPT2TokenizerFast,
@@ -97,6 +98,12 @@ def __init__(
**kwargs,
) -> None:

if model:
assert not os.path.isfile(model), (
"As of aitextgen 0.5.0, you must "
+ "use `model_folder` to load an existing model."
)

if not verbose:
for module in [
"transformers.file_utils",
@@ -189,7 +196,7 @@ def __init__(
)
if model and "gpt2" not in model:
logger.info(f"Using the tokenizer for {model}.")
self.tokenizer = GPT2TokenizerFast.from_pretrained(
self.tokenizer = AutoTokenizer.from_pretrained(
model,
cache_dir=cache_dir,
)
@@ -472,7 +479,6 @@ def generate_to_file(
destination_path: str = None,
sample_delim: str = "=" * 20 + "\n",
seed: int = None,
cleanup: bool = True,
**kwargs,
) -> None:
"""
@@ -516,15 +522,6 @@ def generate_to_file(
for _ in range(n // batch_size):
gen_texts = self.generate(n=batch_size, return_as_list=True, **kwargs)

# Remove empty texts and strip out extra newlines/extra spaces
if cleanup:
texts_to_clean = gen_texts
gen_texts = []
for text in texts_to_clean:
clean_text = text.strip().strip("\n")
if clean_text and len(clean_text) >= 2:
gen_texts.append(clean_text)

for gen_text in gen_texts:
f.write("{}\n{}".format(gen_text, sample_delim))
pbar.update(batch_size)
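
The assertion added to `__init__` above enforces the 0.5.0 loading change; per the README diff earlier in this commit, reloading a trained model now looks like this:

```py3
from aitextgen import aitextgen

# aitextgen 0.5.0+: point model_folder at the folder containing
# pytorch_model.bin + config.json instead of passing the .bin path to `model`.
ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")

ai.generate(10, prompt="ROMEO:")
```
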
33 changes: 23 additions & 10 deletions docs/dataset.md
@@ -1,54 +1,67 @@
# TokenDataset

aitextgen has a special class, `TokenDataset`, used for managing tokenized datasets to be fed into model training. (This is in contrast with other GPT-2 finetuning approaches, which tokenize at training time, although you can still do that if you want.)
aitextgen has a special class, `TokenDataset`, used for managing tokenized datasets to be fed into model training. (This is in contrast with other GPT-2 finetuning approaches, which tokenize at training time, although you can still do that by passing a `file_path` and other relevant parameters to `ai.train()`.)

This has a few nice bonuses, including:

- Tokenize a dataset on a local machine ahead of time and compress it, saving time/bandwidth transporting data to a remote machine
- Supports reading a dataset either line-by-line (including single-column CSVs) or as bulk texts.
- Debug and log the loaded texts.
- Merge datasets together without using external libraries
- Cross-train on multiple datasets to "blend" them together.

## Creating a TokenDataset For GPT-2 Finetuning

The easiest way to create a TokenDataset is to provide a target file. If no `vocab_file` and `merges_file` are provided, it will use the default GPT-2 tokenizer.
The easiest way to create a TokenDataset is to provide a target file. If no `tokenizer_file` is provided, it will use the default GPT-2 tokenizer.

```python
```py3
from aitextgen.TokenDataset import TokenDataset

data = TokenDataset("shakespeare.txt")
```

If you pass a single-column CSV and specify `line_by_line=True`, the TokenDataset will parse it row-by-row; this is the recommended way to handle multiline texts.

```python
```py3
data = TokenDataset("politics.csv", line_by_line=True)
```

You can also manually pass a list of texts to `texts` instead if you've processed them elsewhere.

```python
```py3
data = TokenDataset(texts = ["Lorem", "Ipsum", "Dolor"])
```

## Block Size

`block_size` is another parameter that can be passed when creating a TokenDataset; it is mostly useful for custom models. It should match the model's context window (e.g. the `n_positions` or `max_position_embeddings` config parameters). By default, it is `1024`: the GPT-2 context window.

When implicitly loading a dataset via `ai.train()`, the `block_size` will be set to what is supported by the corresponding model `config`.
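
For example, a sketch reusing the custom-tokenizer values from the README quickstart in this commit: a small custom model with a 64-token context window pairs with a matching `block_size`.

```py3
from aitextgen.TokenDataset import TokenDataset

# block_size should match the custom model's context window (n_positions=64 here)
data = TokenDataset("shakespeare.txt",
                    tokenizer_file="aitextgen.tokenizer.json",
                    block_size=64)
```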

## Debugging a TokenDataset

When loading a dataset, a progress bar will appear showing how many texts have been loaded.

If you want to see what exactly is input to the model during training, you can access a slice via `data[0]`.
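
A quick sketch of inspecting a loaded dataset (the file name is illustrative):

```py3
from aitextgen.TokenDataset import TokenDataset

data = TokenDataset("shakespeare.txt")
print(len(data))  # number of training blocks in the dataset
print(data[0])    # the encoded tokens of the first block fed to the model
```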

## Saving/Loading a TokenDataset

When creating a TokenDataset, you can automatically save it as a compressed, gzipped NumPy array once processing is complete.

```python
```py3
data = TokenDataset("shakespeare.txt", save_cache=True)
```

Or save it after you've loaded it with the `save()` function.

```python
```py3
data = TokenDataset("shakespeare.txt")
data.save()
```

By default, it will save to `dataset_cache.tar.gz`. You can then reload that into another Python session by specifying the cache.

```python
```py3
data = TokenDataset("dataset_cache.tar.gz", from_cache=True)
```

@@ -58,7 +71,7 @@ data = TokenDataset("dataset_cache.tar.gz", from_cache=True)

## Using TokenDatasets with a Custom GPT-2 Model

The default TokenDataset has a `block_size` of `1024`, which corresponds to the _context window of the default GPT-2 model_. If you're using a custom model with a different maximum context length, set `block_size` to match. Additionally, you must explicitly provide the vocab and merges files to rebuild the tokenizer, as the tokenizer will be different from the normal GPT-2 one.
The default TokenDataset has a `block_size` of `1024`, which corresponds to the _context window of the default GPT-2 model_. If you're using a custom model with a different maximum context length, set `block_size` to match. Additionally, you must explicitly provide the tokenizer file to rebuild the tokenizer, as the tokenizer will be different from the normal GPT-2 one.

See the [Model From Scratch](tutorials/model-from-scratch.md) docs for more info.

@@ -72,7 +85,7 @@ Merging processed TokenDatasets can be done with the `merge_datasets()` function
!!! note "About Merging"
The current implementation merges by subset count, so equalization may not be perfect, but it will not significantly impact training.

```python
```py3
from aitextgen.TokenDataset import TokenDataset, merge_datasets

data1 = TokenDataset("politics1000.csv", line_by_line=True) # 1000 samples
```
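
The collapsed lines above cut off the rest of the merging example. A hedged sketch of the continuation — the second file name is hypothetical, and the `equalize` flag is an assumption based on the equalization note above:

```py3
from aitextgen.TokenDataset import TokenDataset, merge_datasets

data1 = TokenDataset("politics1000.csv", line_by_line=True)   # 1000 samples
data2 = TokenDataset("worldnews1000.csv", line_by_line=True)  # hypothetical second 1000-sample dataset

# Assumption: merge_datasets() takes a list of TokenDatasets; the equalization
# described above samples each dataset evenly so neither biases the blend.
data_merged = merge_datasets([data1, data2], equalize=True)
```
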
6 changes: 3 additions & 3 deletions docs/generate-performance.md
@@ -10,7 +10,7 @@ PyTorch has the ability to quantize models on the CPU. Currently, it will only q

To quantize a model after it's loaded, just run:

```python
```py3
ai.quantize()
```
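
In practice, quantization is applied after the model is loaded on the CPU; a short usage sketch:

```py3
from aitextgen import aitextgen

ai = aitextgen()   # load the default 124M GPT-2 model on the CPU
ai.quantize()      # quantize supported layers for faster CPU generation
ai.generate()
```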

@@ -22,13 +22,13 @@ Certain GPUs, notably the cheap T4 and the expensive V100, support the ability t

Assuming you are using a compatible GPU and already have [apex](https://github.com/NVIDIA/apex) installed, you can convert a model to the "half" FP16 mode with this:

```python
```py3
ai.to_fp16()
```

If you want to convert the model _before_ loading it into GPU memory (which may help avoid memory leaks), you can instantiate the model like this:

```python
```py3
ai = aitextgen(to_gpu=True, to_fp16=True)
```

18 changes: 11 additions & 7 deletions docs/generate.md
@@ -7,14 +7,18 @@ Thanks to the base Transformers package, aitextgen has more options for generati
See [this article](https://huggingface.co/blog/how-to-generate) by Huggingface engineer Patrick von Platen for how sampling and these parameters are used in practice.

- `n`: Number of texts generated.
- `max_length`: Maximum length of the generated text (default: 200; for GPT-2, the maximum is 1024.)
- `prompt`: Prompt that starts the generated text and is included in the generate text. (used to be `prefix` in previous tools)
- `max_length`: Maximum length of the generated text (default: 200; for GPT-2, the maximum is 1024; for GPT Neo, the maximum is 2048)
- `prompt`: Prompt that starts the generated text and is included in the generated text.
- `temperature`: Controls the "craziness" of the text (default: 0.7)
- `top_k`: If nonzero, limits the sampled tokens to the top _k_ values. (default: 0)
- `top_p`: If nonzero, limits the sampled tokens to a cumulative probability of _p_ (nucleus sampling).

Some lesser-known-but-still-useful-parameters that are unique to Transformers:

<!--prettier-ignore-->
!!! warning "Performance"
Enabling these parameters may slow down generation.

- `num_beams`: If greater than 1, executes beam search for cleaner text.
- `repetition_penalty`: If greater than 1.0, penalizes repetition in a text to avoid infinite loops.
- `length_penalty`: If greater than 1.0, penalizes text proportionally to its length. (See the combined sketch below.)
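
A sketch combining the parameters above into one call (values are illustrative, not recommendations):

```py3
ai.generate(n=3,
            prompt="I believe in unicorns because",
            max_length=100,
            temperature=0.7,
            top_k=40,
            top_p=0.9,
            repetition_penalty=1.1)
```
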
@@ -30,17 +34,17 @@ Given an `aitextgen` object with a loaded model + tokenizer named `ai`:
want to generate on the GPU, make sure you call `ai.to_gpu()` beforehand, or
load the model into the GPU using `ai = aitextgen(to_gpu=True)`

- `ai.generate()`: Generates and prints text to console. If `prompt` is used, the `prompt` is bolded. (a la [Talk to Transformer](https://talktotransformer.com))
- `ai.generate()`: Generates and prints text to console. If `prompt` is used, the `prompt` is **bolded**.
- `ai.generate_one()`: A helper function which generates a single text and returns as a string (good for APIs)
- `ai.generate_samples()`: Generates multiple samples at specified temperatures: great for debugging.
- `ai.generate_to_file()`: Generates a bulk amount of texts to file. (this accepts a `batch_size` parameter which is useful if using on a GPU)
- `ai.generate_to_file()`: Generates a bulk amount of texts to file. (this accepts a `batch_size` parameter which is useful if using on a GPU, as it can generate texts in parallel with no performance loss)
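
For example, bulk generation on a GPU might look like this (the file name and counts are illustrative):

```py3
ai.generate_to_file(n=1_000,
                    batch_size=20,
                    prompt="ROMEO:",
                    temperature=0.8,
                    destination_path="generated_texts.txt")
```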

<!-- prettier-ignore -->
!!! note "Cleanup"
By default, the `cleanup` parameter is set to True, which automatically removes texts that are blatantly malformed (e.g. only 2 characters long). Therefore, there may be fewer than `n` results returned. You can disable this behavior by setting `cleanup=False`.
!!! note "lstrip and nonempty_output"
By default, the `lstrip` and `nonempty_output` parameters to `generate` are set to `True`, which alters the generated text in a way that is most likely preferable. `lstrip` removes all whitespace at the beginning of the generated text. `nonempty_output` skips an empty output (possible with short-form content) when generating multiple texts, or tries again if it's a single text. If `min_length` is specified, the same behavior applies to texts below the minimum length after processing.

## Seed

aitextgen has a new `seed` parameter for generation. Call any generate function with a `seed` parameter (it must be an integer) while keeping all other models/parameters the same, and the generated text will be identical. This allows for reproducible generations.
aitextgen has a new `seed` parameter for generation. Call any generate function with a `seed` parameter (it must be an integer) while keeping all other models/parameters the same, and the generated text will be identical. This allows for reproducible generations in case someone accuses you of faking the AI output.
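
A minimal sketch — rerunning this call with the same model and parameters reproduces the same text:

```py3
ai.generate(prompt="ROMEO:", seed=42)
```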

For `generate_to_file()`, the 8-digit number at the end of the file name will be the seed used to generate the file, making reproducibility easy.
12 changes: 7 additions & 5 deletions docs/index.md
@@ -1,12 +1,14 @@
# aitextgen

A robust tool for advanced AI text generation via [GPT-2](https://openai.com/blog/better-language-models/).
_Last Updated: April 18th, 2021 (aitextgen v0.5.0)_

aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Huggingface Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:
A robust Python tool for text-based AI training and generation using [OpenAI's](https://openai.com) [GPT-2](https://openai.com/blog/better-language-models/) and [EleutherAI's](https://www.eleuther.ai) [GPT Neo/GPT-3](https://github.com/EleutherAI/gpt-neo) architecture.

- Finetunes on a pretrained 124M GPT-2 model from OpenAI...or create your own GPT-2 model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency! (even [from the 1.5B GPT-2 model](tutorials/generate_1_5b/)!)
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the Huggingface model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Hugging Face Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:

- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI or a 125M/350M GPT Neo model from EleutherAI...or create your own GPT-2/GPT Neo model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency!
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the HuggingFace model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
- The input dataset is its own object, allowing you not only to easily encode megabytes of data in seconds, cache, and compress it on a local computer before transporting it to a remote server, but also to _merge_ datasets without biasing the resulting dataset, or to _cross-train_ on multiple datasets to create blended output.
