
multi core, gpu training fails #1974

Closed
sciai-ai opened this issue Mar 26, 2021 · 10 comments
Labels
bug Something isn't working

Comments

@sciai-ai

sciai-ai commented Mar 26, 2021

I am running NeMo ASR training on multiple GPUs and workers, and it hangs at this step:

trainer = pl.Trainer(gpus=4, num_nodes=8, accelerator='ddp', max_epochs=200, amp_level='O1', precision=16, ...)

At the console

[screenshot of console output]

The same code works fine when using a single GPU and 1 core.

Is there any fix for this? My dataset is very large, so training would take very long on a single GPU with one worker.

Thanks

@sciai-ai sciai-ai added the bug Something isn't working label Mar 26, 2021
@okuchaiev
Member

What resource manager are you using? So far we have only tested multi-node training in a SLURM environment.

@sciai-ai
Author

I am using a Google Cloud VM without any resource manager.

@titu1994
Collaborator

titu1994 commented Mar 27, 2021

Doesn't GCP offer at most 8 GPUs per VM? I'm not familiar with any one-click solution for launching a 32-GPU (8 nodes, 4 GPUs per node) cluster on GCP; please point to such a config.

If you want to use just 8 GPUs, your trainer config would be

trainer = pl.Trainer(gpus=8, num_nodes=1, accelerator='ddp', max_epochs=200, amp_level='O1', precision=16, ...)

@sciai-ai
Author

sciai-ai commented Mar 27, 2021

In GCP you can spin up a setup with 8 nodes and 4 GPUs per node.

I am using this command

trainer = pl.Trainer(gpus=4, num_nodes=8, accelerator='ddp', max_epochs=200, amp_level='O1', precision=16, ...)

I don't think this is a GCP issue but rather a pytorch-lightning 1.7.1 issue where DDP hangs when initializing:

https://github.com/PyTorchLightning/pytorch-lightning/issues/4612

Is there a way to update NeMo to work with an older or newer version of PyTorch?

@Oktai15
Contributor

Oktai15 commented Mar 27, 2021

@sciai-ai, just as a workaround to fix your problem:

pip uninstall -y torch pytorch-lightning torchtext torchaudio
pip install torchtext==0.8.0 torch==1.7.1 torchaudio==0.7.2 pytorch-lightning==1.2.5

I suppose the problem is related to Lightning-AI/pytorch-lightning#5865, Lightning-AI/pytorch-lightning#4612, and Lightning-AI/pytorch-lightning#6569.
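
As a quick sanity check after reinstalling (just a sketch, nothing NeMo-specific), you can confirm the pinned versions were actually picked up:

import torch
import torchaudio
import torchtext
import pytorch_lightning as pl

# Expect 1.7.1 / 0.7.2 / 0.8.0 / 1.2.5 per the pins above
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("torchtext:", torchtext.__version__)
print("pytorch-lightning:", pl.__version__)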

@okuchaiev
Member

@sciai-ai can you please try 1.0.0 to see if this issue still persists?

@ericharper ericharper reopened this Jul 8, 2021
@BenikaHall

@ericharper I noticed you re-opened this issue. We're still seeing this behavior in identical fashion, but also on the CLI. Double-checking: shall we re-open this one?

@ericharper ericharper reopened this Jul 9, 2021
@ericharper
Collaborator

Feel free to use this issue. Please update with the code that was run and the log files.

@BenikaHall

Great. Will do.

@BenikaHall

@ericharper We have it resolved! By submitting the batch job in Slurm, we were able to get multi-GPU training working. We confirmed that the trainer was configured in the YAML file.
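
For anyone else hitting this, a minimal sketch of the equivalent trainer settings (the exact keys and values in our YAML may differ); when the job is submitted as a Slurm batch job, Lightning reads the SLURM_* environment variables to set up the DDP process group across nodes:

import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=4,            # GPUs per node; should match what the sbatch script requests per node
    num_nodes=8,       # should match the number of nodes requested from Slurm
    accelerator='ddp',
    precision=16,
    max_epochs=200,
)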
