We will be building a GPT-like decoder-only transformer from scratch using PyTorch in phases, starting with the original Transformer introduced in the paper "Attention is All You Need" by Vaswani et al. and progressively moving on to more advanced architectural improvements proposed in recent research papers.
The transformer is implemented in `model.py`, and training can be run by setting the `training` flag in `model.py` to `True`.
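A minimal sketch of that workflow, assuming training is driven from `model.py` itself (the flag name comes from the repo; everything else here is illustrative):

```python
# In model.py: flip the flag that selects between training and inference.
training = True  # True -> run the training loop, False -> skip training and generate

# Then run the script directly (illustrative; adjust if your entry point differs):
#   python model.py
```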
- Self-Attention
- Scaled Dot-Product Attention (see the sketch after this list)
- FeedForward Network
- Absolute Positional Embedding
- Residual Connection (Attention and FeedForward)
- Layer Normalization (Attention and FeedForward)
- Multi-Head Attention
- Dropout
- Rotary Positional Embedding
- Layer Normalization (Final)
- RMS Layer Normalization
- KV Cache
- Grouped-Query Attention
- SwiGLU Activation (FeedForward Network)
- Flash Attention
- Sliding Window Attention
- Mixture of Experts
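Several of the items above (self-attention, scaled dot-product attention, the causal mask, multi-head attention, dropout) boil down to a few lines of PyTorch. The following is a minimal sketch under the usual GPT conventions, not the actual code in `model.py`; all names and defaults are illustrative:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Minimal multi-head causal self-attention (illustrative, not model.py)."""
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # project to queries, keys, values
        self.proj = nn.Linear(d_model, d_model)      # output projection
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # scaled dot-product attention with a causal mask
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        attn = self.dropout(F.softmax(scores, dim=-1))
        out = attn @ v                                # (B, n_heads, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```

In the later phases, the explicit softmax/matmul block can be swapped for `torch.nn.functional.scaled_dot_product_attention`, which dispatches to Flash Attention kernels when they are available.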
Each of these improvements was introduced over the years in a research paper.
Rotary Positional Embedding was introduced in the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding".
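A minimal sketch of rotary embeddings, assuming queries and keys shaped `(batch, heads, seq, head_dim)`; this is illustrative, not the repo's implementation:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (B, n_heads, T, head_dim).

    Each pair of channels is rotated by an angle that depends on the position,
    so relative offsets show up directly in the q.k dot product.
    """
    B, H, T, D = x.shape
    assert D % 2 == 0
    # per-pair frequencies: theta_i = base^(-2i / D)
    inv_freq = 1.0 / (base ** (torch.arange(0, D, 2, device=x.device).float() / D))
    pos = torch.arange(T, device=x.device).float()
    angles = torch.outer(pos, inv_freq)              # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]              # even / odd channels
    # 2D rotation of each (x1, x2) pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Applied to `q` and `k` before the attention scores are computed, the rotation makes the dot product depend on relative positions rather than absolute ones.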
RMSNorm was introduced by Zhang et al. in 2019 in the paper "Root Mean Square Layer Normalization".
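RMSNorm drops LayerNorm's mean subtraction and bias and only rescales by the root mean square of the features, with a learned gain. A minimal sketch (names are illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square layer normalization (illustrative sketch)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rescale by the RMS of the last dimension; no mean subtraction
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```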
The SwiGLU feedforward network comes from the paper "GLU Variants Improve Transformer".
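In a SwiGLU feedforward block, one linear projection is passed through SiLU and used to gate a second projection before projecting back down to the model dimension. A minimal sketch, with the layer names and hidden size as assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feedforward block with a SwiGLU gate (illustrative sketch)."""
    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(x W_gate) elementwise-multiplied with (x W_up), then projected down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```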
Grouped-Query Attention comes from the paper "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints".
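In grouped-query attention, several query heads share a single key/value head, which shrinks the KV cache. A minimal sketch (head counts and names are illustrative, not the repo's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Grouped-query attention: fewer K/V heads than query heads (illustrative)."""
    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert d_model % n_heads == 0 and n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # expand K/V so every query head in a group attends to its shared K/V head
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```

Only the smaller key/value projections need to be cached during generation, which is where the memory savings over full multi-head attention come from.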
The Mixture of Experts approach comes from the paper "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity".
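In a Switch-style mixture of experts, a router sends each token to a single expert feedforward network (top-1 routing) and scales the expert output by the router probability. A minimal sketch that omits capacity limits and the load-balancing loss (expert count and names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchMoE(nn.Module):
    """Switch-style mixture of experts with top-1 routing (illustrative sketch)."""
    def __init__(self, d_model: int, hidden: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        flat = x.reshape(-1, C)                       # (B*T, C), one row per token
        probs = F.softmax(self.router(flat), dim=-1)  # routing probabilities
        top_p, top_idx = probs.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            sel = top_idx == i
            if sel.any():
                # scale by the router probability so the router stays differentiable
                out[sel] = top_p[sel].unsqueeze(-1) * expert(flat[sel])
        return out.reshape(B, T, C)
```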