Pretraining an OLMo model on the SlimPajama dataset #1837
Hi!
I am planning to test pretraining the OLMo 1B model on the SlimPajama dataset. I was trying to follow the tutorial for TinyLlama, but one of the steps for preparing the dataset uses the litgpt/data/prepare_slimpajama.py file, which seems to be missing from the repo. Any workarounds for this?

Comments
CC: @rasbt @Andrei-Aksionov
Hello @aflah02! Could you bring it back in a PR?
Sure, I'll do that.
I tried using the code to process the dataset; however, it doesn't seem to work for the train set due to size issues. Is there a way to reduce how much is moved to/kept in the tmp dir? Error -
A simple fix that I'm using is to symlink /tmp/data to my NFS, where I have more storage, and then run it. It seems to run for now (still in progress).
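A minimal sketch of that workaround (the NFS path is just a placeholder for wherever there is enough space):

```python
import os
from pathlib import Path

# Redirect /tmp/data to a larger NFS volume so the SlimPajama preparation step
# has room for its intermediate files. The NFS path below is a placeholder.
nfs_tmp = Path("/nfs/scratch/slimpajama_tmp")
nfs_tmp.mkdir(parents=True, exist_ok=True)

tmp_data = Path("/tmp/data")
if not tmp_data.exists():
    os.symlink(nfs_tmp, tmp_data, target_is_directory=True)
```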
I was trying to set up a multi-node run via SLURM and was testing this on 2 nodes with an Ethernet-based interconnect; however, the init fails -
I also see this warning -
Here's the config -
This is my run command -
The code works when running on 1 node. Any clue what might be going wrong? I am using SLURM, btw.
I just realized the error message and this tutorial (https://lightning.ai/docs/fabric/stable/guide/multi_node/slurm.html) seem to imply I should use srun. Running with this now -
This command works -
But when I look at wandb, I only see logs for one node (even though the loss is aggregated prior to backprop, I don't see any device stats for the other node).
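One way to sanity-check that both nodes actually joined the same job is a tiny Fabric script launched with the same srun settings as the pretraining run (a sketch, assuming 8 GPUs per node):

```python
# check_ranks.py - launch with the same srun settings as the pretraining run,
# e.g. `srun python check_ranks.py` across both nodes.
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=8, num_nodes=2, strategy="ddp")
fabric.launch()

# With 2 nodes x 8 GPUs you should see world_size=16 and global ranks 0-15.
# Note that W&B logging is typically rank-zero only, so seeing metrics from
# just one node in the dashboard does not by itself mean the second node is idle.
print(f"global_rank={fabric.global_rank} local_rank={fabric.local_rank} "
      f"node_rank={fabric.node_rank} world_size={fabric.world_size}")
fabric.barrier()
```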
Hi @Andrei-Aksionov @rasbt, here is my config -
I tried on a single GPU as well as 8xA100 machines, and I get the same OOMs.
I looked at the numbers from the Pythia paper: while training the 1B model, they were able to use a batch size of 16 on a 40 GB A100, but I can't use that for OLMo 1B despite having a 2x larger GPU.
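Before comparing against Pythia, it may help to log peak memory per step to see how close the run actually is to the 80 GB limit (a generic PyTorch sketch, not part of pretrain.py; `step` is just a loop counter):

```python
import torch

def log_peak_memory(step: int) -> None:
    # Call this right after each optimizer step, e.g. temporarily inside the
    # training loop, to track how close the run is to the GPU memory limit.
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    reserved_gib = torch.cuda.memory_reserved() / 1024**3
    print(f"step {step}: peak allocated {peak_gib:.1f} GiB, reserved {reserved_gib:.1f} GiB")
    torch.cuda.reset_peak_memory_stats()
```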
Just a guess: it might be that there are only a couple of samples in the training set that have such a length. But it's only a guess :)
I do plan to, but I think even if the entire batch were this big it should still not OOM, as Pythia had the same seq length and a GPU half the size but still worked with larger batch sizes.
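A quick way to run that length check on the raw data would be to stream a sample of documents and look at their token counts (a sketch; the tokenizer checkpoint path is a placeholder and the 10k-document sample size is arbitrary):

```python
from datasets import load_dataset
from litgpt.tokenizer import Tokenizer

# Hypothetical local checkpoint dir created by `litgpt download`.
tokenizer = Tokenizer("checkpoints/allenai/OLMo-1B-hf")
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

lengths = []
for i, row in enumerate(ds):
    if i >= 10_000:
        break
    lengths.append(len(tokenizer.encode(row["text"])))

lengths.sort()
n = len(lengths)
print("p50 / p99 / max tokens:", lengths[n // 2], lengths[int(n * 0.99)], lengths[-1])
```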
To better isolate the problem, could you try to repeat Pythia 1B with a batch size of 40?
Thanks, I'll do that. Also, is there a simple way to use the profiler when pretraining, or do I need to modify pretrain.py and add the profiler in manually?
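If it comes down to editing pretrain.py manually, a minimal torch.profiler wrapper around a few iterations could look roughly like this (a sketch; the model and `train_step` below are stand-ins for the real training loop):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

model = torch.nn.Linear(512, 512).cuda()   # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters())

def train_step() -> None:
    # Stand-in for one forward/backward/optimizer step of the real loop.
    x = torch.randn(8, 512, device="cuda")
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("profiler_traces"),
    profile_memory=True,
    with_stack=True,
)

with prof:
    for step in range(10):
        train_step()
        prof.step()  # advance the profiler schedule each iteration
```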