We will be building a GPT-like decoder-only transformer from scratch using PyTorch in phases, starting with the original Transformer introduced in the paper "Attention is All You Need" by Vaswani et al. and progressively moving on to more advanced architectural improvements proposed in recent research papers.
The transformer is implemented in `model.py`, and training can be run by setting the `training` flag in `model.py` to `True`.
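A minimal sketch of that workflow, assuming training is driven from `model.py` itself (the flag name comes from the repo; everything else here is illustrative):

```python
# In model.py: flip the flag that selects between training and inference.
training = True  # True -> run the training loop, False -> skip training and generate

# Then run the script directly (illustrative; adjust if your entry point differs):
#   python model.py
```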
- Self-Attention
- Scaled Dot-Product Attention (see the sketch after this list)
- FeedForward Network
- Absolute Positional Embedding
- Residual Connection (Attention and FeedForward)
- Layer Normalization (Attention and FeedForward)
- Multi-Head Attention
- Dropout
- Rotary Positional Embedding
- Layer Normalization (Final)
- RMS Layer Normalization
- KV Cache
- Grouped-Query Attention
- SwiGLU Activation (FeedForward Network)
- Flash Attention
- Sliding Window Attention
- Mixture of Experts
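Several of the items above (self-attention, scaled dot-product attention, the causal mask, multi-head attention, dropout) boil down to a few lines of PyTorch. The following is a minimal sketch under the usual GPT conventions, not the actual code in `model.py`; all names and defaults are illustrative:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Minimal multi-head causal self-attention (illustrative, not model.py)."""
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # project to queries, keys, values
        self.proj = nn.Linear(d_model, d_model)      # output projection
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # scaled dot-product attention with a causal mask
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        attn = self.dropout(F.softmax(scores, dim=-1))
        out = attn @ v                                # (B, n_heads, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```

In the later phases, the explicit softmax/matmul block can be swapped for `torch.nn.functional.scaled_dot_product_attention`, which dispatches to Flash Attention kernels when they are available.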
Each of these improvements was introduced over the years in a research paper.
Rotary Positional Embedding was introduced in the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding".
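A minimal sketch of rotary embeddings, assuming queries and keys shaped `(batch, heads, seq, head_dim)`; this is illustrative, not the repo's implementation:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (B, n_heads, T, head_dim).

    Each pair of channels is rotated by an angle that depends on the position,
    so relative offsets show up directly in the q.k dot product.
    """
    B, H, T, D = x.shape
    assert D % 2 == 0
    # per-pair frequencies: theta_i = base^(-2i / D)
    inv_freq = 1.0 / (base ** (torch.arange(0, D, 2, device=x.device).float() / D))
    pos = torch.arange(T, device=x.device).float()
    angles = torch.outer(pos, inv_freq)              # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]              # even / odd channels
    # 2D rotation of each (x1, x2) pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Applied to `q` and `k` before the attention scores are computed, the rotation makes the dot product depend on relative positions rather than absolute ones.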
RMSNorm was introduced by Zhang et al. in 2019 in the paper "Root Mean Square Layer Normalization".
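RMSNorm drops LayerNorm's mean subtraction and bias and only rescales by the root mean square of the features, with a learned gain. A minimal sketch (names are illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square layer normalization (illustrative sketch)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rescale by the RMS of the last dimension; no mean subtraction
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```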
The SwiGLU feedforward network comes from the paper "GLU Variants Improve Transformer".
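In a SwiGLU feedforward block, one linear projection is passed through SiLU and used to gate a second projection before projecting back down to the model dimension. A minimal sketch, with the layer names and hidden size as assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feedforward block with a SwiGLU gate (illustrative sketch)."""
    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(x W_gate) elementwise-multiplied with (x W_up), then projected down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```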
Grouped-Query Attention comes from the paper "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints".
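In grouped-query attention, several query heads share a single key/value head, which shrinks the KV cache. A minimal sketch (head counts and names are illustrative, not the repo's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Grouped-query attention: fewer K/V heads than query heads (illustrative)."""
    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert d_model % n_heads == 0 and n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # expand K/V so every query head in a group attends to its shared K/V head
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```

Only the smaller key/value projections need to be cached during generation, which is where the memory savings over full multi-head attention come from.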
The Mixture of Experts approach comes from the paper "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity".
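In a Switch-style mixture of experts, a router sends each token to a single expert feedforward network (top-1 routing) and scales the expert output by the router probability. A minimal sketch that omits capacity limits and the load-balancing loss (expert count and names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchMoE(nn.Module):
    """Switch-style mixture of experts with top-1 routing (illustrative sketch)."""
    def __init__(self, d_model: int, hidden: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        flat = x.reshape(-1, C)                       # (B*T, C), one row per token
        probs = F.softmax(self.router(flat), dim=-1)  # routing probabilities
        top_p, top_idx = probs.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            sel = top_idx == i
            if sel.any():
                # scale by the router probability so the router stays differentiable
                out[sel] = top_p[sel].unsqueeze(-1) * expert(flat[sel])
        return out.reshape(B, T, C)
```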