Mooshak

Mooshak is a large language model written completely from scratch in PyTorch. The model is trained on a mixed corpus of Hindi and English text.

Architectural Modifications in Mooshak

Mooshak is mainly inspired by the Llama architecture and incorporates custom layers and architectural modifications. The overall size of the model is ~2 billion parameters.
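
As a rough sanity check, the configuration shown in the architecture diagram below (vocab_size 64128, dim 4096, 8 decoder layers) does land in the ~2 billion range. Note that the SwiGLU hidden width (11008) and the untied output head in this estimate are assumptions, not values taken from the repository:

```python
# Back-of-the-envelope parameter count for the configuration in the diagram below.
# The SwiGLU hidden width (11008) and the untied LM head are assumptions.
vocab_size, dim, n_layers, hidden = 64128, 4096, 8, 11008

embeddings = vocab_size * dim          # token embedding table
attention  = 4 * dim * dim             # q, k, v, o projections per layer
mlp        = 3 * dim * hidden          # w1, w2, w3 per layer
lm_head    = vocab_size * dim          # untied output projection

total = embeddings + n_layers * (attention + mlp) + lm_head
print(f"~{total / 1e9:.2f}B parameters")   # ≈ 2.14B
```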

Architecture

```mermaid
graph TD
    Input[Input Tokens] --> Embeddings[Embeddings Layer]
    Embeddings --> DecoderStack[Decoder Stack]
    DecoderStack --> LMHead[Language Model Head]
    LMHead --> Output[Output Logits]

    subgraph DecoderStack
        D1[Decoder Layer 1]
        D2[Decoder Layer 2]
        D3[Decoder Layer 3]
        DN[Decoder Layer 8]
        D1 --> D2
        D2 --> D3
        D3 --> |...| DN
    end

    subgraph DecoderLayer[Decoder Layer]
        Input2[Layer Input] --> AttentionBlock[Attention Block]
        AttentionBlock --> Add1[Add]
        Input2 --> Add1
        Add1 --> RMSNorm1[RMS Norm]
        RMSNorm1 --> MLPBlock[MLP Block]
        MLPBlock --> Add2[Add]
        Add1 --> Add2
        Add2 --> RMSNorm2[RMS Norm]
        RMSNorm2 --> Output2[Layer Output]
    end

    subgraph AttentionBlock
        QKV[Q, K, V Linear Layers]
        RoPE[Rotary Position Embedding]
        SDPA[Scaled Dot-Product Attention]
        QKV --> RoPE
        RoPE --> SDPA
    end

    subgraph MLPBlock
        W1[Linear w1]
        W2[Linear w2]
        W3[Linear w3]
        Swish[SiLU Activation]
        W1 --> Swish
        W3 --> Multiply
        Swish --> Multiply
        Multiply --> W2
    end

    classDef config fill:#f9f,stroke:#333,stroke-width:2px;
    class Config config;

    Config[<u>Model Configuration</u><br/>vocab_size: 64128<br/>dim: 4096<br/>num_layers: 8<br/>n_heads: 32<br/>seq_len: 2048]
```
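
Below is a minimal PyTorch sketch of a single decoder layer that mirrors the diagram: an attention block, a residual add, RMSNorm, a SwiGLU-style MLP (w1/w3 gated by SiLU, projected back through w2), another residual add, and a second RMSNorm. Class names, the hidden width, and the residual/norm ordering simply follow the diagram and are illustrative, not the repository's actual code (Llama itself applies RMSNorm before each sub-block rather than after the residual add).

```python
# Illustrative decoder-layer sketch; names and hidden width are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_HEADS, HIDDEN_DIM = 4096, 32, 11008   # HIDDEN_DIM is an assumed SwiGLU width

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUMLP(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        # SiLU(w1(x)) gates w3(x), as in the MLP block of the diagram.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class Attention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = (p(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                   for p in (self.wq, self.wk, self.wv))
        # RoPE would be applied to q and k here (see the RoPE section below).
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))

class DecoderLayer(nn.Module):
    def __init__(self, dim=DIM, n_heads=N_HEADS, hidden_dim=HIDDEN_DIM):
        super().__init__()
        self.attn = Attention(dim, n_heads)
        self.mlp = SwiGLUMLP(dim, hidden_dim)
        self.norm1 = RMSNorm(dim)
        self.norm2 = RMSNorm(dim)

    def forward(self, x):
        # Residual/normalization order follows the diagram above.
        h = self.norm1(x + self.attn(x))
        return self.norm2(h + self.mlp(h))

layer = DecoderLayer(dim=256, n_heads=8, hidden_dim=512)   # toy sizes for a quick check
y = layer(torch.randn(1, 16, 256))                         # -> (1, 16, 256)
```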

RoPE

RoPE stands for Rotary Positional Embedding.

Previous solutions for positional encoding:

  • Learned positional embeddings (learned from data)
  • Sinusoidal functions

Why was RoPE introduced?

  • Earlier positional encoding methods were limited in sequence length because they were trained on data of limited size and length.
  • Absolute positional encodings (e.g. sinusoidal encodings) did not carry information about the relative positions of tokens (a minimal sketch of RoPE's rotation is shown after this list).
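
A minimal sketch of the idea, assuming the standard RoFormer formulation: each consecutive pair of dimensions in a query/key vector is rotated by an angle proportional to the token's position, so attention scores end up depending on relative rather than absolute positions. Function names here are illustrative, not the repository's API.

```python
# Minimal RoPE sketch (standard RoFormer-style formulation; illustrative names).
import torch

def rope_frequencies(head_dim: int, seq_len: int, base: float = 10000.0):
    # One rotation angle per pair of dimensions, scaled by token position.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)           # (seq_len, head_dim/2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin):
    # x: (batch, n_heads, seq_len, head_dim); rotate each consecutive pair of
    # dimensions by an angle that depends only on the token's position.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

cos, sin = rope_frequencies(head_dim=128, seq_len=2048)  # 4096 dim / 32 heads
q = torch.randn(1, 32, 2048, 128)
q_rot = apply_rope(q, cos, sin)                          # same shape as q
```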

Checklist

  • Inference script for Llama
  • Model initialization
  • Embedding Module
  • RMSNorm Module
  • Attention Module
  • Final Output Shape
  • Correct expected shape in cross entropy
  • Small scale training
  • Add FusedAdam
  • Add support for DDP
  • Add fine tuning script
  • Add FSDP Support
  • Add automatic mixed precision
  • Add decoding strategies
  • Add preference alignment
  • Add Evaluation Strategy
  • Enable DeepSpeed Integration

About

Democratizing small language model building!
