Replies: 43 comments 79 replies
-
Few more pointers:
-
I wanted to try it but sadly I am gpu poor :/
-
Thank you @karpathy for your valuable teaching lessons in your GitHub repositories. I cloned llm.c to check how you do dropout. I found some random number generation functions that run on NVIDIA CUDA GPU devices. Where is the dropout being performed?
-
Hello Mr. @karpathy, I saw MPI_Allgather in https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu#L425. Why is MPI_Allgather used here if all 8 A100 80GB SXM GPUs are on the same node?
-
There are a few places where
-
Here is a model (500M) I have been training for the last few days using the Llama 2 architecture. Hoping to train it to around 200 billion tokens. This is using FineWeb 2024 and 4x 4090.
-
What a fucking legend. I'm starting on this tonight!!
-
"You can train the model with a single GPU too, it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU)" Sorry to bother, but whats the oldest/cheapest/weakest GPU that will be able to train this within 24 hours? |
Beta Was this translation helpful? Give feedback.
-
Wow. Nice work. Are you planning to make a video about the llm.c repo and this post as a summary of this work? Regards.
-
I just want to confirm how much VRAM will be needed?
On Tue, May 28, 2024 at 6:57 PM, Yuchen Jin wrote:

> I thought making a video was always in the plan :)
-
Running at 4M tokens/s on 8xH100 80GB HBM3! Insane 😱

step 1979/18882 | train loss 1.932283 | norm 0.4582 | lr 5.93e-04 | 132.83 ms | 136.0% A100 fp16 MFU | 3913480 tok/s
step 1980/18882 | train loss 1.985383 | norm 0.5368 | lr 5.93e-04 | 134.43 ms | 134.4% A100 fp16 MFU | 3912810 tok/s
step 1981/18882 | train loss 1.926125 | norm 0.4854 | lr 5.93e-04 | 135.17 ms | 133.7% A100 fp16 MFU | 3911104 tok/s
step 1982/18882 | train loss 1.870695 | norm 0.4489 | lr 5.93e-04 | 132.60 ms | 136.3% A100 fp16 MFU | 3913252 tok/s
step 1983/18882 | train loss 1.862540 | norm 0.4103 | lr 5.93e-04 | 133.60 ms | 135.2% A100 fp16 MFU | 3913810 tok/s
-
@karpathy
-
What kind of controls do we have while training GPT-2 on multi-node?
-
Thank you very much @karpathy.
-
4x A6000 - around 8 hours - GPT-2 124M
-
Tested it with a new learning rate schedule ("better" than cosine) and here are the results. Also, if you want to see the code, it's here: #508 :)
-
Will running on an RTX 3060 Ti be possible?
-
How can I use the final checkpoint with
-
It took me 19h15min on an RTX 4090. That's not counting the pre-processing of the training set.
-
If the model can fit on a single GPU, why are we using ZeRO-DP? Won't it increase communication overhead? This is for future extension, correct?
-
great work 🐂🍺
-
I trained a GPT-2 124M model with batch_size 32. However, I have unexpectedly observed that when the evaluation batch size differs from the size used during training (e.g. 8, 16, or 64), the evaluation loss also varies. This is counterintuitive; the evaluation loss should remain consistent regardless of the evaluation batch size. Does anyone know the reason why?
-
It seems that all the biases going into the matmul_forward_cublaslt() function are ignored because of beta=0, including l_fcb and l_fcprojb, which should not be zero. Can anyone tell me whether I am wrong or not?
-
@karpathy
-
this is truly amazing
-
I reproduced this training process on my personal computer, which took 44 hours. The setup includes an i7 processor, 64GB of RAM, a single RTX 4080 Super GPU, Windows 11, and Ubuntu 22.04 in WSL. I used the following training parameters (cuDNN was not used):
The total power consumption of the system was about 400W, and over 44 hours that comes to approximately 17.6 kWh.
-
I reproduced on 3 different GPU setups, here are the training times and costs:
Most of the completions are pretty nonsensical, but here are some interesting ones:
Full write-up is here 👈 Side note, did anyone run into this issue: #786 🤔?
-
Let's reproduce GPT-2 (124M) in llm.c (~4,000 lines of C/CUDA) in 90 minutes for $20. The 124M model is the smallest model in the GPT-2 series released by OpenAI in 2019, and is actually quite accessible today, even for the GPU poor. With llm.c, which is quite efficient at up to ~60% model flops utilization, reproducing this model on one 8X A100 80GB SXM node takes ~90 minutes. For example, on Lambda this node goes for ~$14/hr, so the total cost of reproducing this model today is about $20. You can train the model with a single GPU too; it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU). In addition, llm.c still has a lot of pending optimizations and people haven't tried to tune the training in the style of cramming, so I'd say we're likely to see significant improvements on this number. So here is the run, training the 12-layer, 12-headed, 768-dimension, 124M Transformer on 10 billion tokens of FineWeb:

[Figure: FineWeb validation loss (left) and HellaSwag accuracy (right) over training, compared to the GPT-2 124M checkpoint released by OpenAI.]
The left pane shows that we outperform the checkpoint released by OpenAI on the FineWeb withheld validation dataset. This is not the ideal metric because the data distribution of GPT-2 was different (it was trained on the never-released "WebText" dataset) and the statistics of the internet may have been different 5 years ago, so it's not a super fair comparison. Therefore, in addition, on the right we also plot the HellaSwag accuracy, a benchmark commonly used to assess LLM capability that is nice, smooth, and well-behaved. I'd mostly look at HellaSwag, but FineWeb val is a nice confirmation. That said, HellaSwag has no math/code, so it slightly favors our setting (common crawl-like data). One more point of reference is that the GPT-3 paper in Appendix H cites HellaSwag accuracy at 33.7 for the GPT-3 Small (124M) model. We get to 29.9 here, which surpasses GPT-2 (124M) at 29.4. Keep in mind that here we trained for 10B tokens, while the GPT-3 models were all trained for 300B tokens.
Now here is the shortest path to reproducing this result yourself. You'll need a GPU. I like and run my work on Lambda Labs (who graciously sponsor llm.c development), though the inventory can be limited at times. Many other providers exist and you can use the Discussion below for tips and tricks around this. Here is the example process for Linux x86 64-bit Ubuntu 22.04 with CUDA 12 (this is somewhere around the current, default "modern" configuration). If you're on a different system, the comments and discussion in the main README file might be helpful.
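A sketch of the end-to-end flow, assembled from the repository layout and the flags documented in the Args guide that follows; treat the data-prep and build invocations as assumptions and check the README for their current form:

```bash
# get the code
git clone https://github.com/karpathy/llm.c.git
cd llm.c

# tokenize FineWeb into .bin shards under dev/data/fineweb10B/ (script name per the Args guide; flags may differ)
pip install -r requirements.txt
python dev/data/fineweb.py --version 10B

# build the CUDA training binary (USE_CUDNN=1 is assumed here to enable the cuDNN flash-attention path)
make train_gpt2cu USE_CUDNN=1

# train GPT-2 (124M) on 8 GPUs; every flag is explained in the Args guide below
mpirun -np 8 ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 \
    -s 20000 \
    -h 1
```

On a single GPU you can drop the mpirun prefix and run ./train_gpt2cu directly; as explained under -d below, the code then uses gradient accumulation to reach the same ~0.5M-token batch per step.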
Args guide. A lot of these hyperparameters follow the GPT-3 paper instead of the GPT-2 paper, because it was a lot more detailed. Args explanation:

- `-i -j` are the training and validation split token files, written by `fineweb.py`
- `-o` is the output directory to write logs and checkpoints into
- `-e "d12"` asks to initialize a depth-12 GPT-2 model from scratch
- `-b 64` sets the micro-batch size to 64. If you are running out of memory, decrease this value, e.g. try 32, 16, 8, all the way down to 1 potentially.
- `-t 1024` sets the maximum sequence length to 1024, as GPT-2 did
- `-d 524288` requests that the total batch size per single update be ~0.5M tokens. The code will take this desired batch size and calculate the needed gradient accumulation "inner loop" steps of the optimization. For example on 8 GPUs, at -b 64 and -t 1024, every micro-batch is doing exactly 8 X 64 X 1024 = 524288 tokens, so there is no need for gradient accumulation. But if we only have 1 GPU, then the code will set it to 8, and do an inner loop of 8 iterations to add up to this "total batch size" per step. While the batch size used to train GPT-2 is unknown, this number ~0.5M comes from the GPT-3 paper table, for this model size.
- `-r 1` sets the recompute setting to 1, so we will re-compute the GeLU activations. This slightly increases the runtime, but saves quite a bit of memory, allowing us to increase the batch size and get a net increase in token throughput.
- `-z 1` turns on ZeRO-1 (i.e. optimizer state sharding) across multiple GPUs. If you're training with > 1 GPU, this setting is a no-brainer and should basically always be on. On 1 GPU this setting is a no-op.
- `-c 0.1` sets the weight decay to 0.1. Only (2D) weights are decayed, exactly as in GPT-2, and this number comes from the GPT-3 paper.
- `-l 0.0006` sets the maximum learning rate, from the GPT-3 paper.
- `-q 0.0` says that we will decay the learning rate to 0 over the course of training.
- `-u 700` says that we will ramp up the learning rate from 0 to the max learning rate over the first 700 iterations, which at total batch size 0.5M is 350M tokens, following the GPT-3 paper.
- `-n 5000` asks to save model checkpoints every 5000 steps.
- `-v 250` asks to evaluate and log the validation loss every 250 steps.
- `-s 20000` asks to sample some tokens every 20000 steps. Because the total number of steps will be less than this (see below), this basically turns generation off and we will only sample a single time at the very end.
- `-h 1` asks to evaluate the HellaSwag accuracy, something we can compare across papers.
- `-x` is not passed, so the number of steps defaults to exactly one epoch over the training data, i.e. 10B tokens. Because the total batch size is ~0.5M and the total number of tokens is 10B, there will be a total of ~10B/0.5M = 20K steps.

There's a lot of detail above but the TLDR is that we're training a 12-layer GPT-2 (124M), from scratch, on 10B tokens of FineWeb, with a max sequence length of 1024 tokens. If you are running out of memory, I would first make sure you have `-r 1` turned on, and then I would start decreasing the batch size `-b` by dividing it by 2 until it runs. Once it runs, I'd see if you can get away with turning `-r 0` back on to recover a little bit of speed.

Training. The code will print something like this over time (this is an example of a single A100 40GB PCIe GPU, $1.29/hr):
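The log lines follow the same format as the 8xH100 excerpt quoted earlier in this thread; a roughly reconstructed single-A100 line (learning rate interpolated from the 700-step warmup, other values approximate and taken from the walkthrough below):

```
step   80/18865 | train loss 7.577 | norm 1.1461 | lr 6.86e-05 | 2950.68 ms | 50.0% A100 fp16 MFU | 177683 tok/s
```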
What is going on? Well, we have 10B training tokens and our batch size is ~0.5M, so we'd expect about 10B/0.5M ~= 20K steps in total. It actually works out to exactly 18,865 because one of the data shards is reserved for validation data and the exact batch size is a nice power of 2 @ 524,288. So here we are on step 80/18865, which in total took 2950.68ms. MFU is short for "Model Flops Utilization". The A100 claims to offer 312 TFLOPS, but in practice this is very hard to achieve because the training is memory-bound and we can't feed the TensorCores that do the matrix multiplies. On this A100 40GB PCIe GPU, we see that when we count up the FLOPs we're doing and divide by time, we're roughly at half the theoretical maximum peak FLOPS, which is quite good. If you used the A100 80GB SXM with higher memory bandwidth and max thermal design power, this goes up to ~60%. (If you use a GPU that is not an A100, ignore this number because it is in units of A100 fp16 FLOPS.) We also see that the token throughput we are achieving is about 178K tok/s. Next, our current loss is 7.577. The lower this is, the better our model is at predicting the next token in the sequence on average. Step 80 is very early in the training here. Because the perplexity is exp(7.577) ~= 2K, our model is as confused about each next token on average as if it were guessing at random from 2,000 tokens. The full vocab size is 50,257. By the end of the optimization we'll get to about 3.29, so it's as if we're guessing uniformly at random from exp(3.29) ~= 27 tokens at each time step. Finally, we see the gradient norm is 1.1461. When this number spikes, the gradient is exploding and this is very bad. To mitigate gradient explosions, as is standard, llm.c uses gradient clipping at 1.0, so if the gradient norm exceeds 1.0 (like in this time step) we forcefully scale it down so that its norm is at most 1.0. Later in the optimization, the gradient norm usually "calms down" to lower values.
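If you want to sanity-check those numbers yourself, a couple of one-liners get you within rounding (the last line uses the crude 6-FLOPs-per-parameter-per-token rule of thumb, so it slightly undercounts llm.c's exact FLOP accounting):

```bash
python -c "import math; print(math.exp(7.577))"   # ~1950: early-training perplexity, i.e. guessing among ~2K tokens
python -c "import math; print(math.exp(3.29))"    # ~27: perplexity at the end of the optimization
python -c "print(6 * 124e6 * 178e3 / 312e12)"     # ~0.42: rough MFU from 178K tok/s against 312 TFLOPS
```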
Visualization. Finally, you'll want to make pretty charts like the one I posted up above. For that, our program is printing some very rudimentary logs to an improvised `log124M/main.log` file. I have attached an example Jupyter notebook that parses these files and visualizes them in the style above.

Tokenizer. When you're training up above, you'll see a warning that llm.c couldn't find the GPT-2 tokenizer .bin file. That's totally fine for training, but it means that we can't decode, i.e. we can't convert integer tokens that we sample into little string pieces to create text that we can read. Here is how we can generate it:
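A minimal sketch, assuming the reference Python script described below is what writes the tokenizer file:

```bash
# per the note below, running the reference script writes gpt2_tokenizer.bin for the C code to pick up
python train_gpt2.py
```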
The Python script is a parallel implementation to llm.c used for error checking and unit tests (but doesn't have full feature parity). In particular, if we run it like above it will write the file `gpt2_tokenizer.bin`, which the C code can read and use to output nice text during sampling.

Sampling. The code is currently not really intended for inference, but you can hack the code to do inference very inefficiently (without any kv-cache etc.) with something like this:
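A minimal sketch, assembled from the flag-by-flag explanation that follows; the checkpoint filename is an assumption (use whatever the -n checkpointing wrote into your log directory), and any token files will do for the spurious -i -j:

```bash
# very inefficient sampling hack: load the final checkpoint, run one step at lr 0, sample 256 tokens
# the -e path below is illustrative; point it at your actual final checkpoint
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_val_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -e "log124M/model_00018865.bin" \
    -b 1 -t 1024 \
    -x 1 \
    -l 0.0 \
    -s 1 \
    -g 256
```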
The `-i -j` flags are spurious. The `-e` flag is pointing at the final model checkpoint of our GPT-2 124M model, which llm.c will initialize the model from. The `-b 1` is saying to use only a single batch element (one row of length 1024 tokens in which we sample from left to right). The `-x 1` is saying we only want to run for a single step, and `-l 0.0` is setting the learning rate to zero so we don't actually train the model on this single step. Finally `-s 1` is saying "sample every step" and `-g 256` is saying sample 256 tokens.

Now, the above is just unconditional sampling. It's possible to hack the code to do conditional sampling, i.e. sequence completion. E.g. I asked our 124M model to complete the text "The GitHub project llm.c is a", and it continued: "free service to enhance the scholarly infrastructure of the academic community.". I then re-sampled with a different seed and got "The GitHub project llm.c is a collaborative effort that rocks GitHub itself". So, not bad I guess :) I had to directly hack the code by setting `gen_tokens[1:10]` to be the prompt tokens 464, 21722, 1628, 32660, 76, 13, 66, 318, 257 (from tiktokenizer, ty), then hacked the loop index that samples to start at token position 10, ... you get the idea. TLDR: conditional generation is not really supported, but in principle possible, possibly coming soon.

Code. 95% of the heavy lifting is in the train_gpt2.cu file. It started as a nice clean 1,000 LOC C code, but has grown quite a bit and now it's closer to 3,500 LOC, with 4 supporting files of file I/O utils, tokenizer, dataloader, and random number generation. Roughly speaking, the first 500 LOC are just basic setup of MPI, NCCL, cuDNN, cuBLAS, etc. The next 1,500 LOC are all the layers of the Transformer, and both their forward and backward implementations in efficient CUDA code. All the CUDA kernel development for these files happens in `dev/cuda`. So for example there is a `gelu_forward()` and then also a `gelu_backward()`, and the same way for all the other layers. The next 1,000 LOC are the `gpt2` model, which just strings together the layers and itself has one big `gpt2_forward()` and `gpt2_backward()`. The last 1,000 LOC are `int main()`, which has the main training loop and all the related bookkeeping and argument parsing, and a lot of tedious code around e.g. resuming training from a previous checkpoint, etc.

350M model. Overnight I also reproduced the 350M parameter model. Take a look at the file scripts/run_gpt2_350M.sh for the exact launch command. I found that 10B tokens was not enough for the 350M model, so you'll have to download and preprocess the FineWeb100B (or try to do multiple epochs on just the 10B above, which might work, I have not checked). I configured it to train for 30B tokens, so we have that:
FLOPS using 6ND approximation:
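The 6ND estimate (roughly 6 FLOPs per parameter per token) is easy to redo for the two runs described in this post; a rough sketch:

```bash
python -c "print(6 * 124e6 * 10e9)"   # 124M params on 10B tokens  -> ~7.4e18 FLOPs
python -c "print(6 * 350e6 * 30e9)"   # 350M params on 30B tokens  -> ~6.3e19 FLOPs
```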
On 8X A100 80GB SXM the 350M stepped at 820ms/iter. Trained for 60K steps (instead of ~20K), for a total of ~30B tokens (instead of ~10B tokens). Total training time 14 hours. Cost $14/hr => 14 X 14 ~= $200 (10X of 124M). However, looking at the plot, it's possible that we could have gotten away with slightly less:
Coming up. That's it for now! We are moving on to the 740M and then, of course, the actual "GPT-2" 1558M. If I can find the GPUs... By very rough napkin math, on my single 8X A100 80GB GPU box, the 1558M model would take ~1 week and cost ~$2.5K. This is in acceptable territory, but we'll want to take some time to make the current code better, cleaner, better tested, and add multi-node training support. And also very much still on my mind, I want to build the whole thing again, from scratch and piece by piece, coming to you soon^TM.
FAQ:

`train_gpt2.py` does not have full feature parity (e.g. doesn't do sharded data loading, etc.) and is meant to be more of a reference, but I think you can get something similar to the 124M model above stepping as follows:

`torchrun --standalone --nproc_per_node=4 train_gpt2.py --input_bin dev/data/fineweb10B/fineweb_train_000001.bin --write_tensors 0 --model d12 --batch_size 64 --sequence_length 1024 --total_batch_size 524288 --dtype bfloat16 --compile 1 --tensorcores 1 --flash 1 --num_iterations 18865 --weight_decay 0.1 --overfit_single_batch 0`

I am interested in and would accept PRs that bring the PyTorch training closer to feature parity with the llm.c training loop.

Acknowledgements
Call out to @ngc92 and @ademeure who have both made substantial contributions to llm.c across the board and especially on CUDA kernel optimization, @chinthysl and @PeterZhizhin for distributed optimization PRs, and @rosslwheeler for Windows support and tooling.
Please feel free to use the Discussions for any FAQ and related, or if you'd like something faster, #llmc on Discord, or #llmdotc on CUDA MODE Discord.