Multi-core, GPU training fails #1974
Comments
What resource manager are you using? So far we have tested multi-node only in a SLURM environment.
I am using a Google Cloud VM without any resource manager.
Doesn't GCP have a maximum of just 8 GPUs? I'm not familiar with any one-click solution for launching a 32-GPU (8 nodes, 4 GPUs per node) cluster on GCP - please point to such a config. If you want to use just 8 GPUs, your trainer config would be `trainer = pl.Trainer(gpus=8, num_nodes=1, accelerator='ddp', max_epochs=200, amp_level='O1', precision=16, ...)`.
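For reference, a minimal runnable sketch of that single-node, 8-GPU configuration, assuming the PyTorch Lightning 1.x-era `Trainer` arguments used in this thread (`gpus`, `accelerator='ddp'`, `amp_level`); the model itself is omitted:

```python
import pytorch_lightning as pl

# Single node, all 8 GPUs, one DDP process per GPU, mixed precision.
# Mirrors the config suggested above (PL 1.x-era arguments).
trainer = pl.Trainer(
    gpus=8,             # number of GPUs on this machine
    num_nodes=1,        # single node, so no multi-node rendezvous is needed
    accelerator="ddp",  # distributed data parallel across the 8 GPUs
    max_epochs=200,
    precision=16,       # mixed-precision training
    amp_level="O1",     # apex-style AMP level, as in the thread's config
)

# trainer.fit(model)   # the NeMo model construction is omitted here
```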
In GCP you can spin up a VM setup with 8 nodes and 4 GPUs. I am using this command:
I don't think this is a GCP issue but rather a PyTorch Lightning 1.7.1 issue where DDP hangs when initializing: https://github.com/PyTorchLightning/pytorch-lightning/issues/4612. Is there a way to update NeMo to work with an older or newer version of PyTorch?
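If it helps to narrow this down, one way to check whether the hang is in the DDP/NCCL rendezvous itself rather than in NeMo is to run a bare `torch.distributed` smoke test under the same launcher. A minimal sketch (hypothetical script; it assumes the launcher sets the usual `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` environment variables):

```python
import os
import torch
import torch.distributed as dist

# Ask NCCL for verbose logs so it prints which network interfaces it picks.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# The default init method is env://, i.e. RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT
# must be set by the launcher. If this call hangs, the problem is in the
# process-group setup (networking, env vars), not in NeMo.
dist.init_process_group(backend="nccl")

rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# One all-reduce across all processes; success means the rendezvous works.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {rank}: all_reduce -> {t.item()} (expected {dist.get_world_size()})")

dist.destroy_process_group()
```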
@sciai-ai just as a workaround to fix your problem:
I suppose the problem is related to Lightning-AI/pytorch-lightning#5865, Lightning-AI/pytorch-lightning#4612, and Lightning-AI/pytorch-lightning#6569.
@sciai-ai can you please try 1.0.0 to see if this issue still persists?
@ericharper I noticed you re-opened this issue. We're still seeing this behavior in identical fashion, but also on the CLI. Double-checking: shall we re-open this one?
Feel free to use this issue. Please update with the code that was run and the log files.
Great. Will do.
@ericharper We have it resolved! By submitting the batch job in Slurm, we were able to get multi-GPU training working. We confirmed that the trainer was configured in the YAML file.
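As a general illustration of that pattern (not the exact files from this run), NeMo-style training scripts typically read the trainer settings from a YAML config and pass them straight to `pl.Trainer`, so the same script can run single-GPU locally or multi-node under Slurm by editing only the config. A minimal sketch, assuming a hypothetical `config.yaml` with a `trainer` section:

```python
from omegaconf import OmegaConf
import pytorch_lightning as pl

# Hypothetical config.yaml:
#
# trainer:
#   gpus: 4
#   num_nodes: 8
#   accelerator: ddp
#   max_epochs: 200
#   precision: 16
cfg = OmegaConf.load("config.yaml")

# Build the trainer directly from the YAML section; switching between a
# single-GPU debug run and a multi-node Slurm job then only changes the config.
trainer = pl.Trainer(**cfg.trainer)
```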
I am doing NeMo ASR training on multiple GPUs and workers, and it hangs at this step:
trainer = pl.Trainer(gpus=4, num_nodes=8, accelerator='ddp', max_epochs=200, amp_level='O1', precision=16, ...)
At the console:
The same code works fine when using a single GPU and 1 core.
Is there any fix for this? My dataset is very large, so training would take very long on a single core and GPU.
Thanks