Mooshak

Mooshak is a large language model written completely from scratch in PyTorch. The model is trained on a mixed corpus of Hindi and English text.

Architectural Modifications in Mooshak

Mooshak is mainly inspired by the Llama architecture and incorporates custom layers and architectural modifications. The overall size of the model is ~2 billion parameters.
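
As a rough sanity check, the configuration shown in the architecture diagram below (vocab_size 64128, dim 4096, 8 decoder layers) does land in the ~2 billion range. Note that the SwiGLU hidden width (11008) and the untied output head in this estimate are assumptions, not values taken from the repository:

```python
# Back-of-the-envelope parameter count for the configuration in the diagram below.
# The SwiGLU hidden width (11008) and the untied LM head are assumptions.
vocab_size, dim, n_layers, hidden = 64128, 4096, 8, 11008

embeddings = vocab_size * dim          # token embedding table
attention  = 4 * dim * dim             # q, k, v, o projections per layer
mlp        = 3 * dim * hidden          # w1, w2, w3 per layer
lm_head    = vocab_size * dim          # untied output projection

total = embeddings + n_layers * (attention + mlp) + lm_head
print(f"~{total / 1e9:.2f}B parameters")   # ≈ 2.14B
```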

Architecture

```mermaid
graph TD
    Input[Input Tokens] --> Embeddings[Embeddings Layer]
    Embeddings --> DecoderStack[Decoder Stack]
    DecoderStack --> LMHead[Language Model Head]
    LMHead --> Output[Output Logits]

    subgraph DecoderStack
        D1[Decoder Layer 1]
        D2[Decoder Layer 2]
        D3[Decoder Layer 3]
        DN[Decoder Layer 8]
        D1 --> D2
        D2 --> D3
        D3 --> |...| DN
    end

    subgraph DecoderLayer[Decoder Layer]
        Input2[Layer Input] --> AttentionBlock[Attention Block]
        AttentionBlock --> Add1[Add]
        Input2 --> Add1
        Add1 --> RMSNorm1[RMS Norm]
        RMSNorm1 --> MLPBlock[MLP Block]
        MLPBlock --> Add2[Add]
        Add1 --> Add2
        Add2 --> RMSNorm2[RMS Norm]
        RMSNorm2 --> Output2[Layer Output]
    end

    subgraph AttentionBlock
        QKV[Q, K, V Linear Layers]
        RoPE[Rotary Position Embedding]
        SDPA[Scaled Dot-Product Attention]
        QKV --> RoPE
        RoPE --> SDPA
    end

    subgraph MLPBlock
        W1[Linear w1]
        W2[Linear w2]
        W3[Linear w3]
        Swish[SiLU Activation]
        W1 --> Swish
        W3 --> Multiply
        Swish --> Multiply
        Multiply --> W2
    end

    classDef config fill:#f9f,stroke:#333,stroke-width:2px;
    class Config config;

    Config[<u>Model Configuration</u><br/>vocab_size: 64128<br/>dim: 4096<br/>num_layers: 8<br/>n_heads: 32<br/>seq_len: 2048]
```
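
Below is a minimal PyTorch sketch of a single decoder layer that mirrors the diagram: an attention block, a residual add, RMSNorm, a SwiGLU-style MLP (w1/w3 gated by SiLU, projected back through w2), another residual add, and a second RMSNorm. Class names, the hidden width, and the residual/norm ordering simply follow the diagram and are illustrative, not the repository's actual code (Llama itself applies RMSNorm before each sub-block rather than after the residual add).

```python
# Illustrative decoder-layer sketch; names and hidden width are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_HEADS, HIDDEN_DIM = 4096, 32, 11008   # HIDDEN_DIM is an assumed SwiGLU width

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUMLP(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        # SiLU(w1(x)) gates w3(x), as in the MLP block of the diagram.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class Attention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = (p(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                   for p in (self.wq, self.wk, self.wv))
        # RoPE would be applied to q and k here (see the RoPE section below).
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))

class DecoderLayer(nn.Module):
    def __init__(self, dim=DIM, n_heads=N_HEADS, hidden_dim=HIDDEN_DIM):
        super().__init__()
        self.attn = Attention(dim, n_heads)
        self.mlp = SwiGLUMLP(dim, hidden_dim)
        self.norm1 = RMSNorm(dim)
        self.norm2 = RMSNorm(dim)

    def forward(self, x):
        # Residual/normalization order follows the diagram above.
        h = self.norm1(x + self.attn(x))
        return self.norm2(h + self.mlp(h))

layer = DecoderLayer(dim=256, n_heads=8, hidden_dim=512)   # toy sizes for a quick check
y = layer(torch.randn(1, 16, 256))                         # -> (1, 16, 256)
```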

RoPE

RoPE stands for Rotary Positional Embedding.

Previous solutions for positional encoding:

  • Learned positional embeddings (learned from data)
  • Sinusoidal functions

Why was RoPE introduced?

  • Earlier positional encoding methods were limited in sequence length because they were trained on data of limited size and length.
  • Absolute positional encodings (e.g. sinusoidal encodings) did not carry information about the relative positions of tokens (a minimal sketch of RoPE's rotation is shown after this list).
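
A minimal sketch of the idea, assuming the standard RoFormer formulation: each consecutive pair of dimensions in a query/key vector is rotated by an angle proportional to the token's position, so attention scores end up depending on relative rather than absolute positions. Function names here are illustrative, not the repository's API.

```python
# Minimal RoPE sketch (standard RoFormer-style formulation; illustrative names).
import torch

def rope_frequencies(head_dim: int, seq_len: int, base: float = 10000.0):
    # One rotation angle per pair of dimensions, scaled by token position.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)           # (seq_len, head_dim/2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin):
    # x: (batch, n_heads, seq_len, head_dim); rotate each consecutive pair of
    # dimensions by an angle that depends only on the token's position.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

cos, sin = rope_frequencies(head_dim=128, seq_len=2048)  # 4096 dim / 32 heads
q = torch.randn(1, 32, 2048, 128)
q_rot = apply_rope(q, cos, sin)                          # same shape as q
```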

Checklist

  • Inference script for Llama
  • Model initialization
  • Embedding Module
  • RMSNorm Module
  • Attention Module
  • Final Output Shape
  • Correct expected shape in cross entropy
  • Small scale training
  • Add FusedAdam
  • Add support for DDP
  • Add fine tuning script
  • Add FSDP Support
  • Add automatic mixed precision
  • Add decoding strategies
  • Add preference alignment
  • Add Evaluation Strategy
  • Enable DeepSpeed Integration

About

Democratizing small language model building!
