Multi-GPU training with SLURM times out #1832
Comments
Same error!
I fixed this issue by downgrading litdata from the latest version to 0.2.17 and increasing the number of workers in the dataloader. There seems to be a compatibility issue with the newer versions of litdata (besides freezing, it also sometimes segfaults).
Thank you for sharing that. How can you increase the number of workers in the dataloader?
I set `num_workers` in the dataloader.
Thank you |
I used version 0.2.17 of the dataloader for my pretraining experiments and kept the default setting of `num_workers=8`. During execution, I encountered the following warning:

/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of workers in the current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might cause the DataLoader to run slowly or even freeze. Lower the worker number to avoid potential slowness/freeze if necessary.

The only modification I made to the pretraining code was adding `time.perf_counter()` calls to log the time before and after the final checkpoint save. Despite this warning, the DataLoader seemed to function, but the system still freezes. I use Llama3-70, but the maximum workload I can see for my transaction data is around 700M. I want to increase this amount, and I'm not sure whether litgpt imposes a limit.
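For anyone hitting the same thing, here is a minimal sketch of the workaround described above (pinning litdata to 0.2.17 and raising `num_workers`). The dataset path and batch size are placeholders, not values from this issue:

```python
# Minimal sketch of the workaround above: pin litdata and raise the
# number of DataLoader workers. Install the pinned version first, e.g.:
#
#   pip install "litdata==0.2.17"
#
from litdata import StreamingDataset, StreamingDataLoader

# Placeholder directory of streaming shards -- replace with your own data.
dataset = StreamingDataset(input_dir="data/pretrain-shards")
loader = StreamingDataLoader(dataset, batch_size=8, num_workers=8)

for batch in loader:
    ...  # training step
```

Note that the UserWarning above suggests the DataLoader only sees one usable CPU, so raising `num_workers` may also require requesting more CPUs in the SLURM allocation (e.g. via `--cpus-per-task`).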
Bug description
I was transferring some checkpoints from a cluster that didn't use SLURM to one that does. I trained the checkpoint using multiple GPUs/nodes, and I found that I'm able to load it and start training when using an interactive job. However, when I use sbatch to submit my job, the job times out after some time.
I've seen this post: https://lightning.ai/docs/fabric/2.4.0/guide/multi_node/slurm.html and added `srun` to my submission script. However, even though 4 devices seem to be initialized, the model still gets stuck before training and times out. A debug log and my submission script are linked below. My sbatch script is a bit different in that it runs another sh script, which does a bunch of setup and then calls `litgpt pretrain <...>`, but I'm not sure this would be an issue.

I also tried setting the Fabric initialization to explicitly pass the number of nodes, devices, etc., like in the example in `pretrain.py`, but it didn't make a difference.

Details:
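Explicit Fabric initialization I tried (a sketch with placeholder values; my run otherwise uses the defaults from `pretrain.py`):

```python
# Sketch of explicitly configuring Fabric (placeholder values, not my exact settings).
import lightning as L

fabric = L.Fabric(
    accelerator="cuda",
    devices=4,       # GPUs per node
    num_nodes=1,     # SLURM nodes
    strategy="ddp",  # illustrative; litgpt's pretrain.py chooses its own strategy by default
)
fabric.launch()
fabric.print(f"world size: {fabric.world_size}")
```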
My script:
Debug example error:
Node information:
What operating system are you using?
Linux
LitGPT Version