Multi-core, GPU training fails #1974
Comments
What resource manager are you using? So far we have tested multi-node only in a SLURM environment.
I am using a Google Cloud VM without any resource manager.
Doesn't GCP have a maximum of just 8 GPUs? I'm not familiar with any one-click solution for launching a 32-GPU (8 nodes, 4 GPUs per node) cluster on GCP - please point to such a config. If you want to use just 8 GPUs, your trainer config would be `trainer = pl.Trainer(gpus=8, num_nodes=1, accelerator='ddp', max_epochs=200, amp_level='O1', precision=16, ...)`.
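For reference, a minimal runnable sketch of that single-node, 8-GPU configuration, assuming the PyTorch Lightning 1.x-era `Trainer` arguments used in this thread (`gpus`, `accelerator='ddp'`, `amp_level`); the model itself is omitted:

```python
import pytorch_lightning as pl

# Single node, all 8 GPUs, one DDP process per GPU, mixed precision.
# Mirrors the config suggested above (PL 1.x-era arguments).
trainer = pl.Trainer(
    gpus=8,             # number of GPUs on this machine
    num_nodes=1,        # single node, so no multi-node rendezvous is needed
    accelerator="ddp",  # distributed data parallel across the 8 GPUs
    max_epochs=200,
    precision=16,       # mixed-precision training
    amp_level="O1",     # apex-style AMP level, as in the thread's config
)

# trainer.fit(model)   # the NeMo model construction is omitted here
```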
In GCP you can spin up a VM setup with 8 nodes and 4 GPUs. I am using this command:
I don't think this is a GCP issue but rather a PyTorch Lightning 1.7.1 issue where DDP hangs when initializing: https://github.com/PyTorchLightning/pytorch-lightning/issues/4612. Is there a way to update NeMo to work with an older or newer version of PyTorch?
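If it helps to narrow this down, one way to check whether the hang is in the DDP/NCCL rendezvous itself rather than in NeMo is to run a bare `torch.distributed` smoke test under the same launcher. A minimal sketch (hypothetical script; it assumes the launcher sets the usual `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` environment variables):

```python
import os
import torch
import torch.distributed as dist

# Ask NCCL for verbose logs so it prints which network interfaces it picks.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# The default init method is env://, i.e. RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT
# must be set by the launcher. If this call hangs, the problem is in the
# process-group setup (networking, env vars), not in NeMo.
dist.init_process_group(backend="nccl")

rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# One all-reduce across all processes; success means the rendezvous works.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {rank}: all_reduce -> {t.item()} (expected {dist.get_world_size()})")

dist.destroy_process_group()
```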
@sciai-ai just as a workaround to fix your problem:
I suppose the problem is related to Lightning-AI/pytorch-lightning#5865, Lightning-AI/pytorch-lightning#4612, and Lightning-AI/pytorch-lightning#6569.
@sciai-ai can you please try 1.0.0 to see if this issue still persists?
@ericharper I noticed you re-opened this issue. We're still seeing this behavior in identical fashion, but also on the CLI. Double-checking: shall we re-open this one?
Feel free to use this issue. Please update with the code that was run and the log files.
Great. Will do.
@ericharper We have it resolved! By submitting the batch job in Slurm, we were able to get multi-GPU training working. We confirmed that the trainer was configured in the YAML file.
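As a general illustration of that pattern (not the exact files from this run), NeMo-style training scripts typically read the trainer settings from a YAML config and pass them straight to `pl.Trainer`, so the same script can run single-GPU locally or multi-node under Slurm by editing only the config. A minimal sketch, assuming a hypothetical `config.yaml` with a `trainer` section:

```python
from omegaconf import OmegaConf
import pytorch_lightning as pl

# Hypothetical config.yaml:
#
# trainer:
#   gpus: 4
#   num_nodes: 8
#   accelerator: ddp
#   max_epochs: 200
#   precision: 16
cfg = OmegaConf.load("config.yaml")

# Build the trainer directly from the YAML section; switching between a
# single-GPU debug run and a multi-node Slurm job then only changes the config.
trainer = pl.Trainer(**cfg.trainer)
```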
I am doing NeMo ASR training on multiple GPUs and workers, and it hangs at this step:
trainer = pl.Trainer(gpus=4, num_nodes=8, accelerator='ddp', max_epochs=200, amp_level='O1', precision=16, ...)
At the console:
The same code works fine when using a single GPU and 1 core.
Is there any fix for this? My dataset is very large, so training would take very long on a single core and GPU.
Thanks