Improving the Transformers

We build a GPT-like, decoder-only transformer from scratch in PyTorch, in phases: we start with the original Transformer introduced in the paper "Attention Is All You Need" by Vaswani et al., then progressively add more advanced architectural improvements proposed in recent research papers.

The transformer is implemented in model.py; to train it, set the training flag in model.py to True.

Original Transformer

  • Self-Attention
  • Scaled Dot-Product Attention (see the sketch after this list)
  • FeedForward Network
  • Absolute Positional Embedding
  • Residual Connection (Attention and FeedForward)
  • Layer Normalization (Attention and FeedForward)
  • Multi-Head Attention
  • Dropout
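
To make the components above concrete, here is a minimal sketch of causal scaled dot-product attention in PyTorch. The function name, tensor shapes, and the explicit causal mask are illustrative assumptions and not necessarily the ones used in model.py:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, dropout_p=0.0):
    # q, k, v: (batch, heads, seq_len, head_dim)
    head_dim = q.size(-1)
    # Scale scores by sqrt(head_dim) to keep the softmax in a well-behaved range.
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    # Causal mask: each position may only attend to itself and earlier positions.
    seq_len = q.size(-2)
    causal_mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1
    )
    scores = scores.masked_fill(causal_mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    weights = F.dropout(weights, p=dropout_p)
    return weights @ v
```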

Improvements over the years

  • Rotary Positional Embedding
  • Layer Normalization (Final)
  • RMS Layer Normalization
  • KV Cache
  • Grouped-Query Attention
  • SwiGLU Activation (FeedForward Network)
  • Flash Attention
  • Sliding Window Attention
  • Mixture of Experts

Each of these improvements was introduced over the years in a research paper; the main references are listed below.

RoPE: Rotary Positional Embedding

Rotary positional embeddings were introduced in the paper RoFormer: Enhanced Transformer with Rotary Position Embedding.
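
A minimal sketch of one common RoPE convention (rotating the two halves of each head dimension), assuming inputs of shape (batch, heads, seq_len, head_dim); the function name and the half-split layout are illustrative and may differ from model.py:

```python
import torch

def apply_rotary_embedding(x, base=10000.0):
    # x: (batch, heads, seq_len, head_dim) with an even head_dim.
    # Applied to queries and keys before attention, so dot products depend
    # only on relative positions.
    _, _, seq_len, head_dim = x.shape
    half = head_dim // 2
    # Per-dimension rotation frequencies, as in the RoFormer paper.
    freqs = 1.0 / (base ** (torch.arange(half, device=x.device).float() / half))
    angles = torch.arange(seq_len, device=x.device).float()[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()      # (seq_len, half), broadcast over batch/heads
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```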

RMSNorm: Root Mean Square Layer Normalization

RMSNorm was introduced by Zhang et al. in 2019 in the paper Root Mean Square Layer Normalization.
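
A minimal sketch of RMSNorm as described in the paper: normalize by the root mean square of the features only (no mean subtraction), then rescale with a learned gain. The class name is an illustrative assumption:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain

    def forward(self, x):
        # Divide by the RMS of the last dimension; cheaper than LayerNorm
        # since no mean or bias is involved.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```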

SwiGLU: Swish Gated Linear Units

Paper: GLU Variants Improve Transformer
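
A minimal sketch of a SwiGLU feed-forward block following the FFN_SwiGLU variant from the paper; the layer names w1/w2/w3 and the bias-free projections are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    # FFN(x) = W2( silu(W1 x) * W3 x ): a gated replacement for the
    # ReLU feed-forward block of the original Transformer.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```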

GQA: Grouped Query Attention

Paper: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
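
A minimal sketch of grouped-query attention, where a small number of key/value heads is shared across all query heads; it leans on PyTorch's F.scaled_dot_product_attention and omits RoPE and the KV cache for brevity. Names and signatures are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    # n_heads query heads share n_kv_heads key/value heads
    # (n_heads % n_kv_heads == 0), shrinking the KV projections and the KV cache.
    def __init__(self, dim, n_heads, n_kv_heads):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so every query head in a group attends to the
        # same keys and values.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```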

MoE: Mixture of Experts

Paper: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
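
A minimal sketch of a Switch-style mixture-of-experts feed-forward layer with top-1 routing; it omits the load-balancing auxiliary loss and expert capacity limits from the paper, and all names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchMoE(nn.Module):
    # A router picks one expert FFN per token (top-1 routing) and scales the
    # expert's output by the router probability, as in the Switch Transformer.
    def __init__(self, dim, hidden_dim, n_experts):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        b, t, d = x.shape
        flat = x.reshape(-1, d)                       # route every token independently
        probs = F.softmax(self.router(flat), dim=-1)  # routing probabilities per token
        top_p, top_idx = probs.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_p[mask, None] * expert(flat[mask])
        return out.reshape(b, t, d)
```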