Hi everybody, I'm currently trying to build a clone of Andrej Karpathy's Transformer lecture in Flax. However, even though everything seems to work fine and the loss is decreasing during training, the results are completely off. Does anybody see the problem in my code? And is there a way to use `nn.Sequential` while still providing the `deterministic` variable to each block in `GPTLanguageModel`?

```python
from flax.linen import MultiHeadDotProductAttention, SelfAttention, Embed, Sequential
from flax import linen as nn
import optax
from jax import numpy as jnp
from jax import random
import jax
from flax.training import train_state
# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
eval_iters = 200
n_embd = 384 # dimension to which tokens and position is embedded
n_head = 6 # number of parallel attention heads per block
n_layer = 6 # number of transformer blocks
dropout = 0.2
# ------------
key = random.PRNGKey(1337)
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
# Train and test splits
data = jnp.array(encode(text))
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
# data loading
def get_batch(key, split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    key, cur_key = random.split(key)
    ix = random.randint(cur_key, (batch_size,), 0, len(data) - block_size)
    x = jnp.stack([data[i:i+block_size] for i in ix])
    y = jnp.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y
def estimate_loss(key, params):
    out = {}
    for split in ['train', 'val']:
        losses = jnp.zeros(eval_iters)
        for k in range(eval_iters):
            key, batch_key, loss_fn_key = random.split(key, 3)
            X, Y = get_batch(batch_key, split)
            loss_fn = make_loss_fn(loss_fn_key, X, Y)
            loss = loss_fn(params)
            losses = losses.at[k].set(loss)
        out[split] = losses.mean()
    return out
class FeedForward(nn.Module):
    n_embd: int

    def setup(self):
        self.net = nn.Sequential([
            nn.Dense(4 * n_embd),
            nn.relu,
            nn.Dense(n_embd)
        ])
        self.dropout = nn.Dropout(dropout)

    def __call__(self, x, deterministic: bool):
        x = self.net(x)
        return self.dropout(x, deterministic=deterministic)
class TransformerBlock(nn.Module):
    n_embd: int
    n_head: int

    def setup(self):
        self.sa = SelfAttention(n_head, dropout_rate=dropout)  # head size is calculated internally
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm()
        self.ln2 = nn.LayerNorm()

    def __call__(self, x, deterministic):
        x = x + self.sa(self.ln1(x), deterministic=deterministic)
        x = x + self.ffwd(self.ln2(x), deterministic=deterministic)
        return x
class GPTLanguageModel(nn.Module):
    """This module embeds the tokens and their positions, feeds the result into the
    transformer blocks, normalizes the output and feeds it into the final linear layer."""

    def setup(self):
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = Embed(vocab_size, n_embd)
        self.position_embedding_table = Embed(block_size, n_embd)
        self.block1 = TransformerBlock(n_embd, n_head)
        self.block2 = TransformerBlock(n_embd, n_head)
        self.block3 = TransformerBlock(n_embd, n_head)
        self.block4 = TransformerBlock(n_embd, n_head)
        self.block5 = TransformerBlock(n_embd, n_head)
        self.block6 = TransformerBlock(n_embd, n_head)
        self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Dense(vocab_size)

    def __call__(self, xb, deterministic=True):
        B, T = xb.shape
        # xb is a (B,T) array of integer token indices
        tok_emb = self.token_embedding_table(xb)  # (B,T,C)
        pos_emb = self.position_embedding_table(jnp.arange(T))  # (T,C)
        x = tok_emb + pos_emb  # (B,T,C)
        x = self.block1(x, deterministic)  # (B,T,C)
        x = self.block2(x, deterministic)  # (B,T,C)
        x = self.block3(x, deterministic)  # (B,T,C)
        x = self.block4(x, deterministic)  # (B,T,C)
        x = self.block5(x, deterministic)  # (B,T,C)
        x = self.block6(x, deterministic)  # (B,T,C)
        x = self.ln_f(x)  # (B,T,C)
        logits = self.lm_head(x)  # (B,T,vocab_size)
        return logits
model = GPTLanguageModel()
key, batch_key, params_key, dropout_rng = random.split(key, 4)
xb, yb = get_batch(batch_key, 'train')
model_key = {'params': params_key, 'dropout': dropout_rng}
params = model.init(model_key, xb, deterministic=True)
# print the number of parameters in the model
# print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')
# create an optimizer
optimizer = optax.adamw(learning_rate=learning_rate)
#opt_state = optimizer.init(params)
trn_state = train_state.TrainState.create(apply_fn=model.apply, params=params, tx=optimizer)
key, dropout_key = random.split(key)
def make_loss_fn(key, xb, yb):
    def loss_fn(params):
        yb_inner = yb
        logits = model.apply(params, xb, deterministic=False, rngs={'dropout': key})
        B, T, C = logits.shape
        logits = logits.reshape(B*T, C)
        yb_inner = yb_inner.reshape(B*T)
        yb_inner_one_hot = jnp.eye(C)[yb_inner]  # One-hot encoding of targets
        loss = optax.softmax_cross_entropy(logits, yb_inner_one_hot).mean()
        return loss
    return loss_fn
for iter in range(max_iters):
    key, batch_key, loss_fn_key, est_loss_key = random.split(key, 4)
    # sample a batch of data
    xb, yb = get_batch(batch_key, 'train')
    loss_fn = make_loss_fn(loss_fn_key, xb, yb)
    loss, grads = jax.value_and_grad(loss_fn)(trn_state.params)
    trn_state = trn_state.apply_gradients(grads=grads)
    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss(est_loss_key, trn_state.params)
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
def generate(key, params, idx, max_new_tokens):
    loc_key = key
    # idx is (B, T) array of indices in the current context
    for _ in range(max_new_tokens):
        loc_key, sample_key, dropout_key = random.split(loc_key, 3)
        # crop idx to the last block_size tokens
        idx_cond = idx[:, -block_size:]
        # get the predictions
        logits = model.apply(params, idx_cond, deterministic=True, rngs={'dropout': dropout_key})
        # focus only on the last time step
        logits = logits[:, -1, :]  # becomes (B, C)
        # apply softmax to get probabilities
        probs = nn.softmax(logits)  # (B, C)
        # sample from the distribution
        idx_next = random.categorical(sample_key, probs)  # (B, 1)
        idx_next = jnp.reshape(idx_next, (1, -1))
        # append sampled index to the running sequence
        idx = jnp.concatenate((idx, idx_next), axis=1)  # (B, T+1)
    return idx
key, generate_key = random.split(key)
# generate from the model
context = jnp.zeros((1, 1), dtype=jnp.int32)
print(decode(generate(generate_key, trn_state.params, context, max_new_tokens=500)[0].tolist()))
#open('more.txt', 'w').write(decode(m.generate(context, max_new_tokens=10000)[0].tolist()))
```
Hey @felixmin, in case it helps, here is my port of nanoGPT which is very similar to the lecture: https://github.com/cgarciae/nanoGPT-jax/blob/master/model.py
Currently no. There is a proposal that could allow layers like `Dropout` to receive some of their arguments via flags, #2131. While it was implemented for internal usage, flags are currently not exposed to the user.
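
In the meantime, one workaround is to skip `nn.Sequential` and keep the blocks in a plain Python list created in `setup()`, then loop over them in `__call__` so that `deterministic` reaches every block explicitly. This is only a minimal sketch: it assumes the `TransformerBlock` module from the question above, and the container name `TransformerStack` is made up for illustration.

```python
from flax import linen as nn


class TransformerStack(nn.Module):
    # hypothetical container; n_embd / n_head / n_layer mirror the
    # hyperparameters used in the question
    n_embd: int
    n_head: int
    n_layer: int

    def setup(self):
        # a list of submodules assigned in setup() is registered by Flax,
        # so the parameters of every block are tracked as usual
        self.blocks = [
            TransformerBlock(self.n_embd, self.n_head)
            for _ in range(self.n_layer)
        ]

    def __call__(self, x, deterministic: bool):
        # an explicit loop instead of nn.Sequential lets the extra
        # keyword argument reach each block
        for block in self.blocks:
            x = block(x, deterministic=deterministic)
        return x
```

This also avoids hard-coding `block1` through `block6` in `GPTLanguageModel`, so `n_layer` actually controls the depth.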