A PyTorch implementation of GPT-2 with Flash Attention support. The code aims to stay readable and easy to modify while remaining fast and memory efficient.
- Flash Attention and traditional (manual) attention implementations (see the sketch after this list)
- Configurable architecture (embedding size, heads, layers, etc.)
- Checkpoint saving and loading (see the safetensors sketch after this list)
- Training progress tracking
- Memory-efficient
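The sketch below shows one way the configurable architecture and the two attention paths might fit together. The names (`GPTConfig`, `CausalSelfAttention`, `use_flash`, and the field names) are illustrative assumptions, not the exact API of this repository; the Flash path is assumed to go through PyTorch's `torch.nn.functional.scaled_dot_product_attention`, which dispatches to fused Flash Attention kernels on supported GPUs.

```python
import math
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class GPTConfig:
    # Illustrative architecture knobs (names and defaults are assumptions).
    block_size: int = 1024   # maximum sequence length
    vocab_size: int = 50257  # GPT-2 BPE vocabulary size
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    use_flash: bool = True   # toggle Flash vs. traditional attention


class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.use_flash = config.use_flash and hasattr(F, "scaled_dot_product_attention")
        # Fused projection for queries, keys, and values.
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        if not self.use_flash:
            # Causal mask, only needed by the manual attention path.
            mask = torch.tril(torch.ones(config.block_size, config.block_size))
            self.register_buffer("mask", mask.view(1, 1, config.block_size, config.block_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape to (B, n_head, T, head_dim).
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        if self.use_flash:
            # Flash Attention via PyTorch's fused kernel (PyTorch >= 2.0).
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        else:
            # Traditional attention: explicit softmax(QK^T / sqrt(d)) @ V.
            att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
            att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
            y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```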
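Since safetensors is listed as a dependency, checkpoints are presumably stored in that format. Below is a minimal sketch of saving and loading a `state_dict` with `safetensors.torch`; the stand-in model and file name are assumptions, not this repository's actual checkpoint layout.

```python
import torch.nn as nn
from safetensors.torch import load_file, save_file

# Stand-in model; in practice this would be the GPT model defined above.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

# Save: safetensors stores a flat dict of tensors, so the state_dict can
# be written directly (tensors must be contiguous and not share memory).
state_dict = {k: v.contiguous() for k, v in model.state_dict().items()}
save_file(state_dict, "checkpoint.safetensors")

# Load: read the tensors back into a model with the same architecture.
model.load_state_dict(load_file("checkpoint.safetensors"))
```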
- PyTorch
- safetensors
- CUDA-capable GPU (for Flash Attention)
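A quick way to check that the environment can use the Flash Attention path, assuming it is exposed through PyTorch's `torch.nn.functional.scaled_dot_product_attention` (an assumption about this codebase):

```python
import torch
import torch.nn.functional as F

# scaled_dot_product_attention requires PyTorch >= 2.0; the fused
# Flash Attention kernels additionally need a CUDA-capable GPU.
print(f"scaled_dot_product_attention available: {hasattr(F, 'scaled_dot_product_attention')}")
print(f"CUDA available: {torch.cuda.is_available()}")
```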
To download the training dataset (TinyShakespeare), run:
```bash
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```
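A minimal sketch of turning the downloaded text into training data, assuming character-level tokenization (as in nanoGPT's Shakespeare example); the tokenizer and split actually used by this implementation may differ.

```python
import torch

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Build a character-level vocabulary from the raw text.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}

# Encode the full text as token ids and split into train/validation sets.
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]
print(f"vocab size: {len(chars)}, train tokens: {len(train_data)}, val tokens: {len(val_data)}")
```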
This project is inspired by Andrej Karpathy's nanoGPT, a minimal GPT-2 implementation in PyTorch.