This repository contains a from-scratch implementation of a Transformer decoder in PyTorch. The purpose of this project is to generate text in the style of Jules Verne's literary works, using the original Transformer model proposed in the "Attention Is All You Need" paper and its subsequent improvements. It is also an application of my learnings from Andrej Karpathy's latest YouTube series.
To use this code, first clone the repository:

```bash
git clone https://github.com/joaoflf/transformer_decoder_pytorch.git
cd transformer_decoder_pytorch
```

Next, install the dependencies:

```bash
pip install -r requirements.txt
```
The `train.py` script trains the model. It accepts the following command-line arguments:

- `--iters`: Total iterations to train. Default is 5000.
- `--batch-size`: Batch size. Default is 32.
- `--lr`: Learning rate. Default is 3e-4.
- `--device`: Device to use for training. Default is "cuda" if CUDA is available, otherwise "mps".
- `--checkpoint_dir`: Directory to save the model checkpoints. Default is "checkpoints".
Example usage:

```bash
python train.py --iters 10000 --batch-size 64 --lr 1e-4 --device cuda --checkpoint_dir my_checkpoints
```

This will train the model for 10,000 iterations with a batch size of 64 and a learning rate of 1e-4, using a CUDA device for training. The model checkpoints will be saved in the `my_checkpoints` directory.
The `generate.py` script generates new text from a trained model. It accepts the following command-line arguments:

- `--checkpoint_path`: Path to the model checkpoint. This argument is required.
  - You can download the latest trained weights here.
- `--num_tokens`: Number of tokens to generate. Default is 100.
Example usage:

```bash
python generate.py --checkpoint_path my_checkpoints/model_state_10000.pt --num_tokens 500
```

This will generate 500 new tokens from the model checkpoint at `my_checkpoints/model_state_10000.pt`.
- ✅ Start with a basic bigram model and a basic table lookup embedding layer. (A minimal sketch of this step follows the table below.)

  iterations: 10,000 | batch_size: 32

  | Metric     | Value |
  | ---------- | ----- |
  | Train Loss | 2.57  |
  | Val Loss   | N/A   |
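For illustration, a bigram language model of this kind can be a single lookup table that maps each token id directly to the logits of the next token. The sketch below is not the repository's exact code; the class and argument names are my own.

```python
import torch.nn as nn
from torch.nn import functional as F


class BigramLanguageModel(nn.Module):
    """Each token looks up the next-token logits directly from an embedding table."""

    def __init__(self, vocab_size: int):
        super().__init__()
        # (vocab_size, vocab_size): row = current token, columns = logits over the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss
```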
- ✅ Add a self-attention block and introduce basic positional embeddings. (See the sketch below the table.)

  iterations: 10,000 | batch_size: 32 | block_size: 8 | embed_size: 256

  | Metric     | Value  |
  | ---------- | ------ |
  | Train Loss | 2.4980 |
  | Val Loss   | 2.5421 |
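The core of this step is a single head of masked (causal) self-attention; positional information is typically added by summing a learned `nn.Embedding(block_size, embed_size)` with the token embeddings. The sketch below shows one plausible head implementation with illustrative names, not necessarily those used in the repo.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F


class SelfAttentionHead(nn.Module):
    """One head of masked (causal) self-attention."""

    def __init__(self, embed_size: int, head_size: int, block_size: int):
        super().__init__()
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, head_size, bias=False)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5       # (B, T, T) scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                            # (B, T, head_size)
```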
- ✅ Implement multi-head self-attention. (See the sketch below the table.)

  iterations: 10,000 | batch_size: 32 | block_size: 8 | embed_size: 256 | num_heads: 8

  | Metric     | Value |
  | ---------- | ----- |
  | Train Loss | 2.1   |
  | Val Loss   | 2.13  |
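Multi-head attention runs several smaller heads in parallel, concatenates their outputs, and projects back to the embedding dimension. This sketch reuses the hypothetical `SelfAttentionHead` from the previous snippet and assumes `embed_size` is divisible by `num_heads`.

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel; outputs are concatenated and projected."""

    def __init__(self, num_heads: int, embed_size: int, block_size: int):
        super().__init__()
        head_size = embed_size // num_heads  # e.g. 256 // 8 = 32
        self.heads = nn.ModuleList(
            [SelfAttentionHead(embed_size, head_size, block_size) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(embed_size, embed_size)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, embed_size)
        return self.proj(out)
```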
- ✅ Add a feed-forward network and stack multiple blocks of multi-head attention. (See the sketch below the table.)

  iterations: 10,000 | batch_size: 32 | block_size: 8 | embed_size: 256 | num_heads: 8 | num_blocks: 4

  | Metric     | Value |
  | ---------- | ----- |
  | Train Loss | 3.13  |
  | Val Loss   | 3.17  |

  *The network is now too deep, which hurts training performance.*
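Each block pairs multi-head attention with a position-wise feed-forward network, and the blocks are stacked `num_blocks` times. A minimal sketch, building on the `MultiHeadAttention` snippet above (names are illustrative); without residual connections this plain stack is hard to optimize, which matches the loss regression above.

```python
import torch.nn as nn


class FeedForward(nn.Module):
    """Position-wise MLP with the usual 4x hidden expansion."""

    def __init__(self, embed_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_size, 4 * embed_size),
            nn.ReLU(),
            nn.Linear(4 * embed_size, embed_size),
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    """Multi-head attention followed by a feed-forward network (no residuals or LayerNorm yet)."""

    def __init__(self, embed_size: int, num_heads: int, block_size: int):
        super().__init__()
        self.attn = MultiHeadAttention(num_heads, embed_size, block_size)
        self.ffwd = FeedForward(embed_size)

    def forward(self, x):
        return self.ffwd(self.attn(x))


# blocks are stacked, e.g.:
# blocks = nn.Sequential(*[Block(embed_size, num_heads, block_size) for _ in range(num_blocks)])
```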
- ✅ Implement layer normalization and residual connections. Scale up the model. (See the sketch after the generated text.)

  GPU: M1 Pro 10-core | iterations: 5,000 | batch_size: 64 | block_size: 256 | embed_size: 384 | num_heads: 6 | num_blocks: 6 | dropout: 0.2

  | Metric     | Value |
  | ---------- | ----- |
  | Train Loss | 1.02  |
  | Val Loss   | 1.19  |

  **Generated Text**

  > F the fact of this life appeared for its last ten to the Northern minutes which formed me a mountain number of our worthy and millions that we have made for land known of the Central Sea." "Well," said the Professor; "it is a depth of extraordinary track, their island wood." "But it is quite getting at Ned Land." At this moment, I saw the amed horizontal horrible at last would the hargonal man. I came to fain the extraordinary and excitement power on the other you."
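With residual connections and LayerNorm, each sub-layer refines its input instead of replacing it, which is what makes the deeper stack trainable. Below is a sketch of how the earlier `Block` might be revised, assuming a pre-norm layout and the dropout value listed above; the details may differ from the repo's actual code.

```python
import torch.nn as nn


class Block(nn.Module):
    """Transformer block with pre-LayerNorm residual connections and dropout."""

    def __init__(self, embed_size: int, num_heads: int, block_size: int, dropout: float = 0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_size)
        self.ln2 = nn.LayerNorm(embed_size)
        self.attn = MultiHeadAttention(num_heads, embed_size, block_size)
        self.ffwd = FeedForward(embed_size)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        x = x + self.drop(self.attn(self.ln1(x)))  # residual around attention
        x = x + self.drop(self.ffwd(self.ln2(x)))  # residual around the MLP
        return x
```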
- ✅ Replace the char-level tokenizer with `tiktoken` ("gpt2"). (See the sketch after the generated text.)

  GPU: M1 Pro 10-core | iterations: 5,000 | batch_size: 64 | block_size: 256 | embed_size: 384 | num_heads: 6 | num_blocks: 6 | dropout: 0.2

  | Metric     | Value |
  | ---------- | ----- |
  | Train Loss | 0.128 |
  | Val Loss   | 7.09  |

  The model now overfits, as the training data is too small. Due to the new tokenizer, the model now has a vocabulary of 50k+ tokens, which increases training time by 4x (~4 it/s -> ~1 it/s on an M1 Pro 10-core). The generated text is now much more coherent and readable.

  **Generated Text**

  > "Then," he said, "it is impossible in a contrary, your cannot be easy to the weight being about. We must put utterly at last observation to the end of this gallery." "My dear uncle," I ventured mildly to his answer. "Let the way to the old--of no means a minute or of the sentence as he did not care answer. The fartherfied forth in the high seas of the volcano. I looked around. The excellent Professor, and did not speak English with fancy a most despairing form a dull rocks. His telescope began to uncle, which his great deal of supper, appeared to be a wide thinking of steed--one that we were to discovered surrounding us on all sides point. TheHaving got over this occasion, I sought for it my head simply eating made from his making the circumstances. Our stock of my uncle partly confounded towards Hans. The Icelander gently pressed our departure, and the guide, I began to feel a powerful arms. My uncle made no longer moved myface ready. I began to think or not.