Merge pull request #118 from minimaxir/0.5.0-docs
Fix docs for 0.5.0 changes
minimaxir authored Apr 19, 2021
2 parents 791e0e1 + 4a1c6dc commit 8a44c4d
Showing 17 changed files with 166 additions and 122 deletions.
21 changes: 10 additions & 11 deletions README.md
@@ -1,10 +1,10 @@
# aitextgen

A robust Python tool for text-based AI training and generation using [OpenAI's](https://openai.com) [GPT-2](https://openai.com/blog/better-language-models/) architecture.
A robust Python tool for text-based AI training and generation using [OpenAI's](https://openai.com) [GPT-2](https://openai.com/blog/better-language-models/) and [EleutherAI's](https://www.eleuther.ai) [GPT Neo/GPT-3](https://github.com/EleutherAI/gpt-neo) architecture.

aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Hugging Face Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:

- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI...or create your own GPT-2 model + tokenizer and train from scratch!
- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI or a 125M/350M GPT Neo model from EleutherAI...or create your own GPT-2/GPT Neo model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency!
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the HuggingFace model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
@@ -16,7 +16,7 @@ You can read more about aitextgen [in the documentation](https://docs.aitextgen.

You can play with aitextgen _for free_ with powerful GPUs using these Colaboratory Notebooks!

- [Finetune OpenAI's 124M GPT-2 model on your own dataset (GPU)](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing)
- [Finetune OpenAI's 124M GPT-2 model (or GPT Neo) on your own dataset (GPU)](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing)
- [Train a GPT-2 model + tokenizer from scratch (GPU)](https://colab.research.google.com/drive/144MdX5aLqrQ3-YW-po81CQMrD6kpgpYh?usp=sharing)

You can also play with custom [Reddit](notebooks/reddit_demo.ipynb) and [Hacker News](notebooks/hacker_news_demo.ipynb) demo models on your own PC.
@@ -35,7 +35,7 @@ Here's how you can quickly test out aitextgen on your own computer, even if you

For generating text from a pretrained GPT-2 model:

```python
```py3
from aitextgen import aitextgen

# Without any parameters, aitextgen() will download, cache, and load the 124M GPT-2 "small" model
```

@@ -56,7 +56,7 @@ aitextgen generate --prompt "I believe in unicorns because" --to_file False
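
The collapsed hunk above elides the rest of the Python quickstart; a minimal sketch of loading and generating, assuming the default no-argument behavior described in the comment above:

```py3
from aitextgen import aitextgen

# Downloads, caches, and loads the 124M GPT-2 "small" model by default
ai = aitextgen()

# Print a single generated text to the console
ai.generate()

# Generate several texts seeded with a prompt
ai.generate(n=5, prompt="I believe in unicorns because")
```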

Want to train your own mini GPT-2 model on your own computer? You can follow along [in this Jupyter Notebook](/notebooks/training_hello_world.ipynb) or download this [text file of Shakespeare's plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt), cd to that directory in a Terminal, open up a `python3` console, and go:

```python
```py3
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU
@@ -82,18 +82,17 @@ ai = aitextgen(tokenizer_file=tokenizer_file, config=config)
# which automatically processes the dataset with the appropriate size.
data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)

# Train the model! It will save pytorch_model.bin periodically and after completion.
# Train the model! It will save pytorch_model.bin periodically and after completion to the `trained_model` folder.
# On a 2020 8-core iMac, this took ~25 minutes to run.
ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)

# Generate text from it!
ai.generate(10, prompt="ROMEO:")

# With your trained model, you can reload the model at any time by
# providing the pytorch_model.bin model weights, the config, and the tokenizer.
ai2 = aitextgen(model="trained_model/pytorch_model.bin",
tokenizer_file="aitextgen.tokenizer.json",
config="trained_model/config.json")
# providing the folder containing the pytorch_model.bin model weights + the config, and providing the tokenizer.
ai2 = aitextgen(model_folder="trained_model",
tokenizer_file="aitextgen.tokenizer.json")

ai2.generate(10, prompt="ROMEO:")
```
@@ -106,7 +105,7 @@ Want to run aitextgen and finetune GPT-2? Use the Colab notebooks in the Demos s

## Upcoming Features

The current release (v0.4.X) of aitextgen **is considered to be a beta**, targeting the most common use cases. The Notebooks and examples written so far are tested to work, but more fleshing out of the docs/use cases will be done over the next few months in addition to fixing the known issues noted above.
The current release (v0.5.X) of aitextgen **is considered to be a beta**, targeting the most common use cases. The Notebooks and examples written so far are tested to work, but more fleshing out of the docs/use cases will be done over the next few months in addition to fixing the known issues noted above.

The next versions of aitextgen (and one of the reasons I made this package in the first place) will have native support for _schema-based generation_. (See [this repo](https://github.com/minimaxir/gpt-2-keyword-generation) for a rough proof-of-concept.)

19 changes: 8 additions & 11 deletions aitextgen/aitextgen.py
@@ -16,6 +16,7 @@
from transformers import (
AutoConfig,
AutoModelForCausalLM,
AutoTokenizer,
GPT2Config,
GPT2LMHeadModel,
GPT2TokenizerFast,
@@ -97,6 +98,12 @@ def __init__(
**kwargs,
) -> None:

if model:
assert not os.path.isfile(model), (
"As of aitextgen 0.5.0, you must "
+ "use `model_folder` to load an existing model."
)

if not verbose:
for module in [
"transformers.file_utils",
@@ -189,7 +196,7 @@ def __init__(
)
if model and "gpt2" not in model:
logger.info(f"Using the tokenizer for {model}.")
self.tokenizer = GPT2TokenizerFast.from_pretrained(
self.tokenizer = AutoTokenizer.from_pretrained(
model,
cache_dir=cache_dir,
)
@@ -472,7 +479,6 @@ def generate_to_file(
destination_path: str = None,
sample_delim: str = "=" * 20 + "\n",
seed: int = None,
cleanup: bool = True,
**kwargs,
) -> None:
"""
@@ -516,15 +522,6 @@ def generate_to_file(
for _ in range(n // batch_size):
gen_texts = self.generate(n=batch_size, return_as_list=True, **kwargs)

# Remove empty texts and strip out extra newlines/extra spaces
if cleanup:
texts_to_clean = gen_texts
gen_texts = []
for text in texts_to_clean:
clean_text = text.strip().strip("\n")
if clean_text and len(clean_text) >= 2:
gen_texts.append(clean_text)

for gen_text in gen_texts:
f.write("{}\n{}".format(gen_text, sample_delim))
pbar.update(batch_size)
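
The assertion added to `__init__` above enforces the 0.5.0 loading change; per the README diff earlier in this commit, reloading a trained model now looks like this:

```py3
from aitextgen import aitextgen

# aitextgen 0.5.0+: point model_folder at the folder containing
# pytorch_model.bin + config.json instead of passing the .bin path to `model`.
ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")

ai.generate(10, prompt="ROMEO:")
```
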
33 changes: 23 additions & 10 deletions docs/dataset.md
@@ -1,54 +1,67 @@
# TokenDataset

aitextgen has a special class, `TokenDataset`, used for managing tokenized datasets to be fed into model training. (This is in contrast with other GPT-2 finetuning approaches, which tokenize at training time, although you can still do that if you want.)
aitextgen has a special class, `TokenDataset`, used for managing tokenized datasets to be fed into model training. (This is in contrast with other GPT-2 finetuning approaches, which tokenize at training time, although you can still do that by passing a `file_path` and other relevant parameters to `ai.train()`.)

This has a few nice bonuses, including:

- Tokenize a dataset on a local machine ahead of time and compress it, saving time/bandwidth transporting data to a remote machine
- Supports reading a dataset either line-by-line (including single-column CSVs) or as bulk texts.
- Debug and log the loaded texts.
- Merge datasets together without using external libraries
- Cross-train on multiple datasets to "blend" them together.

## Creating a TokenDataset For GPT-2 Finetuning

The easiest way to create a TokenDataset is to provide a target file. If no `vocab_file` and `merges_file` are provided, it will use the default GPT-2 tokenizer.
The easiest way to create a TokenDataset is to provide a target file. If no `tokenizer_file` is provided, it will use the default GPT-2 tokenizer.

```python
```py3
from aitextgen.TokenDataset import TokenDataset

data = TokenDataset("shakespeare.txt")
```

If you pass a single-column CSV and specify `line_by_line=True`, the TokenDataset will parse it row-by-row; this is the recommended way to handle multiline texts.

```python
```py3
data = TokenDataset("politics.csv", line_by_line=True)
```

You can also manually pass a list of texts to `texts` instead if you've processed them elsewhere.

```python
```py3
data = TokenDataset(texts = ["Lorem", "Ipsum", "Dolor"])
```

## Block Size

`block_size` is another parameter that can be passed when creating a TokenDataset; it is mostly useful for custom models. It should match the model's context window (e.g. the `n_positions` or `max_position_embeddings` config parameters). By default, it is `1024`: the GPT-2 context window.

When implicitly loading a dataset via `ai.train()`, the `block_size` will be set to what is supported by the corresponding model `config`.
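
For example, a sketch reusing the custom-tokenizer values from the README quickstart in this commit: a small custom model with a 64-token context window pairs with a matching `block_size`.

```py3
from aitextgen.TokenDataset import TokenDataset

# block_size should match the custom model's context window (n_positions=64 here)
data = TokenDataset("shakespeare.txt",
                    tokenizer_file="aitextgen.tokenizer.json",
                    block_size=64)
```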

## Debugging a TokenDataset

When loading a dataset, a progress bar will appear showing how many texts have been loaded.

If you want to see what exactly is input to the model during training, you can access a slice via `data[0]`.
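
A quick sketch of inspecting a loaded dataset (the file name is illustrative):

```py3
from aitextgen.TokenDataset import TokenDataset

data = TokenDataset("shakespeare.txt")
print(len(data))  # number of training blocks in the dataset
print(data[0])    # the encoded tokens of the first block fed to the model
```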

## Saving/Loading a TokenDataset

When creating a TokenDataset, you can automatically save it as a compressed, gzipped NumPy array once processing is complete.

```python
```py3
data = TokenDataset("shakespeare.txt", save_cache=True)
```

Or save it after you've loaded it with the `save()` function.

```python
```py3
data = TokenDataset("shakespeare.txt")
data.save()
```

By default, it will save to `dataset_cache.tar.gz`. You can then reload that into another Python session by specifying the cache.

```python
```py3
data = TokenDataset("dataset_cache.tar.gz", from_cache=True)
```

@@ -58,7 +71,7 @@ data = TokenDataset("dataset_cache.tar.gz", from_cache=True)

## Using TokenDatasets with a Custom GPT-2 Model

The default TokenDataset has a `block_size` of `1024`, which corresponds to the _context window of the default GPT-2 model_. If you're using a custom model with a different maximum context length, set `block_size` to match. Additionally, you must explicitly provide the vocab and merges files to rebuild the tokenizer, as the tokenizer will be different from the normal GPT-2 one.
The default TokenDataset has a `block_size` of `1024`, which corresponds to the _context window of the default GPT-2 model_. If you're using a custom model with a different maximum context length, set `block_size` to match. Additionally, you must explicitly provide the tokenizer file to rebuild the tokenizer, as the tokenizer will be different from the normal GPT-2 one.

See the [Model From Scratch](tutorials/model-from-scratch.md) docs for more info.

@@ -72,7 +85,7 @@ Merging processed TokenDatasets can be done with the `merge_datasets()` function
!!! note "About Merging"
The current implementation merges by subset count, so equalization may not be perfect, but it will not significantly impact training.

```python
```py3
from aitextgen.TokenDataset import TokenDataset, merge_datasets

data1 = TokenDataset("politics1000.csv", line_by_line=True) # 1000 samples
```
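
The collapsed lines above cut off the rest of the merging example. A hedged sketch of the continuation — the second file name is hypothetical, and the `equalize` flag is an assumption based on the equalization note above:

```py3
from aitextgen.TokenDataset import TokenDataset, merge_datasets

data1 = TokenDataset("politics1000.csv", line_by_line=True)   # 1000 samples
data2 = TokenDataset("worldnews1000.csv", line_by_line=True)  # hypothetical second 1000-sample dataset

# Assumption: merge_datasets() takes a list of TokenDatasets; the equalization
# described above samples each dataset evenly so neither biases the blend.
data_merged = merge_datasets([data1, data2], equalize=True)
```
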
6 changes: 3 additions & 3 deletions docs/generate-performance.md
@@ -10,7 +10,7 @@ PyTorch has the ability to quantize models on the CPU. Currently, it will only q

To quantize a model after it's loaded, just run:

```python
```py3
ai.quantize()
```
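
In practice, quantization is applied after the model is loaded on the CPU; a short usage sketch:

```py3
from aitextgen import aitextgen

ai = aitextgen()   # load the default 124M GPT-2 model on the CPU
ai.quantize()      # quantize supported layers for faster CPU generation
ai.generate()
```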

@@ -22,13 +22,13 @@ Certain GPUs, notably the cheap T4 and the expensive V100, support the ability t

Assuming you are using a compatible GPU and already have [apex](https://github.com/NVIDIA/apex) installed, you can convert a model to the "half" FP16 mode with this:

```python
```py3
ai.to_fp16()
```

If you want to convert the model _before_ loading it into GPU memory (which may help avoid memory leaks), you can instantiate the model like this:

```python
```py3
ai = aitextgen(to_gpu=True, to_fp16=True)
```

18 changes: 11 additions & 7 deletions docs/generate.md
@@ -7,14 +7,18 @@ Thanks to the base Transformers package, aitextgen has more options for generati
See [this article](https://huggingface.co/blog/how-to-generate) by Huggingface engineer Patrick von Platen for how sampling and these parameters are used in practice.

- `n`: Number of texts generated.
- `max_length`: Maximum length of the generated text (default: 200; for GPT-2, the maximum is 1024.)
- `prompt`: Prompt that starts the generated text and is included in the generate text. (used to be `prefix` in previous tools)
- `max_length`: Maximum length of the generated text (default: 200; for GPT-2, the maximum is 1024; for GPT Neo, the maximum is 2048)
- `prompt`: Prompt that starts the generated text and is included in the generated text.
- `temperature`: Controls the "craziness" of the text (default: 0.7)
- `top_k`: If nonzero, limits the sampled tokens to the top _k_ values. (default: 0)
- `top_p`: If nonzero, limits the sampled tokens to a cumulative probability of _p_ (nucleus sampling).

Some lesser-known-but-still-useful-parameters that are unique to Transformers:

<!--prettier-ignore-->
!!! warning "Performance"
Enabling these parameters may slow down generation.

- `num_beams`: If greater than 1, executes beam search for cleaner text.
- `repetition_penalty`: If greater than 1.0, penalizes repetition in a text to avoid infinite loops.
- `length_penalty`: If greater than 1.0, penalizes text proportionally to its length. (See the combined sketch below.)
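
A sketch combining the parameters above into one call (values are illustrative, not recommendations):

```py3
ai.generate(n=3,
            prompt="I believe in unicorns because",
            max_length=100,
            temperature=0.7,
            top_k=40,
            top_p=0.9,
            repetition_penalty=1.1)
```
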
@@ -30,17 +34,17 @@ Given an `aitextgen` object with a loaded model + tokenizer named `ai`:
want to generate on the GPU, make sure you call `ai.to_gpu()` beforehand, or
load the model into the GPU using `ai = aitextgen(to_gpu=True)`

- `ai.generate()`: Generates and prints text to console. If `prompt` is used, the `prompt` is bolded. (a la [Talk to Transformer](https://talktotransformer.com))
- `ai.generate()`: Generates and prints text to console. If `prompt` is used, the `prompt` is **bolded**.
- `ai.generate_one()`: A helper function which generates a single text and returns as a string (good for APIs)
- `ai.generate_samples()`: Generates multiple samples at specified temperatures: great for debugging.
- `ai.generate_to_file()`: Generates a bulk amount of texts to file. (this accepts a `batch_size` parameter which is useful if using on a GPU)
- `ai.generate_to_file()`: Generates a bulk amount of texts to file. (this accepts a `batch_size` parameter which is useful if using on a GPU, as it can generate texts in parallel with no performance loss)
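
For example, bulk generation on a GPU might look like this (the file name and counts are illustrative):

```py3
ai.generate_to_file(n=1_000,
                    batch_size=20,
                    prompt="ROMEO:",
                    temperature=0.8,
                    destination_path="generated_texts.txt")
```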

<!-- prettier-ignore -->
!!! note "Cleanup"
By default, the `cleanup` parameter is set to True, which automatically removes texts that are blatantly malformed (e.g. only 2 characters long). Therefore, there may be fewer than `n` results returned. You can disable this behavior by setting `cleanup=False`.
!!! note "lstrip and nonempty_output"
By default, the `lstrip` and `nonempty_output` parameters to `generate` are set to `True`, which alters the generated text in a way that is most likely preferable. `lstrip` removes all whitespace at the beginning of the generated text. `nonempty_output` skips an empty output (possible with short-form content) when generating multiple texts, or tries again if it's a single text. If `min_length` is specified, the same behavior applies to texts below the minimum length after processing.

## Seed

aitextgen has a new `seed` parameter for generation. Call any generate function with a `seed` parameter (it must be an integer) while keeping all other models/parameters the same, and the generated text will be identical. This allows for reproducible generations.
aitextgen has a new `seed` parameter for generation. Call any generate function with a `seed` parameter (it must be an integer) while keeping all other models/parameters the same, and the generated text will be identical. This allows for reproducible generations in case someone accuses you of faking the AI output.
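
A minimal sketch — rerunning this call with the same model and parameters reproduces the same text:

```py3
ai.generate(prompt="ROMEO:", seed=42)
```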

For `generate_to_file()`, the 8-digit number at the end of the file name will be the seed used to generate the file, making reproducibility easy.
12 changes: 7 additions & 5 deletions docs/index.md
@@ -1,12 +1,14 @@
# aitextgen

A robust tool for advanced AI text generation via [GPT-2](https://openai.com/blog/better-language-models/).
_Last Updated: April 18th, 2021 (aitextgen v0.5.0)_

aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Huggingface Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:
A robust Python tool for text-based AI training and generation using [OpenAI's](https://openai.com) [GPT-2](https://openai.com/blog/better-language-models/) and [EleutherAI's](https://www.eleuther.ai) [GPT Neo/GPT-3](https://github.com/EleutherAI/gpt-neo) architecture.

- Finetunes on a pretrained 124M GPT-2 model from OpenAI...or create your own GPT-2 model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency! (even [from the 1.5B GPT-2 model](tutorials/generate_1_5b/)!)
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the Huggingface model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Hugging Face Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:

- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI or a 125M/350M GPT Neo model from EleutherAI...or create your own GPT-2/GPT Neo model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency!
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the HuggingFace model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
- The input dataset is its own object, allowing you not only to easily encode megabytes of data in seconds, cache, and compress it on a local computer before transporting it to a remote server, but also to _merge_ datasets without biasing the resulting dataset, or to _cross-train_ on multiple datasets to create blended output.
