Simplified From-Scratch PyTorch Implementation of Large Language Models (LLMs) with Detailed Steps (Refer to gpt.py and llama.py)
- Contains two models: GPT and LLAMA.
- The GPT model serves as the base: a simple decoder-only transformer that is the easier one to learn from.
- LLAMA adds more advanced concepts: Rotary Positional Embedding (RoPE), SwiGLU, RMSNorm, Mixture of Experts, etc. (see the feature list below).
- These models are scaled-down versions of their original architectures.
- Number of trainable parameters: 141k (GPT) and 423k (LLAMA). LLAMA has more parameters because of the Mixture of Experts layers, but the inference cost is similar for both models since only a small number of experts is evaluated per token (see the sketch after this list).
- Downloads the Taylor Swift song lyrics dataset by default for training.
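The parameter vs. inference-cost point above comes from sparse routing: a Mixture of Experts layer stores several expert MLPs, but the router sends each token through only a few of them, so most of the extra weights stay idle for any given token. Below is a minimal sketch of that idea, assuming a simple top-k router; the class and argument names are illustrative and not taken from llama.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse Mixture of Experts: many expert MLPs, only k of them run per token."""

    def __init__(self, dim, hidden_dim, num_experts=4, k=1):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)   # router: scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (batch, seq, dim)
        b, t, d = x.shape
        flat = x.reshape(-1, d)                    # one row per token
        topk_scores, topk_idx = self.gate(flat).topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)   # mixing weights over the chosen experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                           # this expert received no tokens
            # run the expert only on the tokens routed to it
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(flat[token_idx])
        return out.reshape(b, t, d)

# 4 experts quadruple the FFN parameters, but with k=1 each token still pays for one MLP.
moe = TopKMoE(dim=64, hidden_dim=128, num_experts=4, k=1)
y = moe(torch.randn(2, 16, 64))
```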
✅ Byte-Pair Tokenization [Here]
✅ Temperature, Top-p and Top-k Sampling [Here] (sketched after this list, along with RMSNorm and SwiGLU)
✅ RMSNorm [Here]
✅ SwiGLU [Here]
✅ Rotary Positional Embedding (RoPE) [Here]
✅ KV Cache [Here]
✅ Mixture of Experts [Here]
🔳 Grouped Query Attention
🔳 Infini Attention
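As a quick reference for a few of the checked items above, here are minimal, illustrative sketches of RMSNorm, a SwiGLU feed-forward, and temperature/top-k/top-p sampling. The class and function names are my own and are not excerpts from gpt.py or llama.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root mean square of the features (no mean subtraction)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: a SiLU-gated linear unit instead of a plain ReLU/GELU MLP."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id from a 1-D logits vector with temperature / top-k / top-p."""
    logits = logits / max(temperature, 1e-8)              # temperature: flatten or sharpen
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]        # k-th largest logit
        logits = logits.masked_fill(logits < kth, float('-inf'))
    probs = F.softmax(logits, dim=-1)
    if top_p is not None:                                  # nucleus (top-p) sampling
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cum = torch.cumsum(sorted_probs, dim=-1)
        keep = (cum - sorted_probs) < top_p                # keep tokens until mass reaches top_p
        probs = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs * keep)
        probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()
```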
Feel free to comment if you want anything integrated here.
python main.py --network_type llama
- The network type can be set to either llama or gpt.
- I have tested the model on Taylor Swift song lyrics.
- By default, the Taylor Swift song lyrics dataset is downloaded into a text file (default name: "data.txt").
- To use a custom dataset, replace that file's contents or point to a different text file with the data_file argument (see the example below).
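For example, assuming the argument is exposed on the command line as --data_file (the flag spelling here is inferred from the argument name above):

python main.py --network_type gpt --data_file my_lyrics.txt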
A sample output generated by my trained model:
You know you're not sure
I can still see you speak, go
And now I'm fallin' in love
But I'm standin' in love
[Pre-Chorus]
So you got the rain one thing that I know
What you were right here, right now
But you're the one I want to say
'Cause you got a six back in your face
I'm not happy and you say you've got a girl for me
But you can tell me now that you're mine
And all I'm just think I can solve them
And I just wanna stay in that night
Results can be improved with more training data and a bigger model.