
Can't train Mask R-CNN in a distributed environment #1073

Closed
lpuglia opened this issue Jul 2, 2019 · 4 comments

Comments


lpuglia commented Jul 2, 2019

I would like to use the training script at references/detection. Single-GPU training works fine (both on master and on the v0.3.0 tag). My server has 4 GPUs and I would really like to use all of them; as far as I understand, I have to set up the torch.distributed package (which I'm not familiar with). First I tried:

RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 python train.py

which returns:

ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set
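
For reference, the env:// init method looks up MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE in the environment of every process, which is why the error above complains about the missing MASTER_ADDR. A minimal sketch of the call involved (not the actual torchvision code):

    import torch.distributed as dist

    # With init_method="env://", PyTorch reads MASTER_ADDR, MASTER_PORT,
    # RANK and WORLD_SIZE from environment variables; if MASTER_ADDR is
    # missing, init_process_group raises the ValueError shown above.
    dist.init_process_group(backend="nccl", init_method="env://")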

but reading this tutorial it is shown that:

MASTER_ADDR - required (except for rank 0); address of rank 0 node

is not required for rank 0. In fact, when I run:

 RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 MASTER_ADDR=localhost MASTER_PORT=12345 python train.py

it just hangs.
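
The hang is likely because only one process was started: with WORLD_SIZE=4, the rendezvous waits for four processes to call init_process_group, so a single rank blocks indefinitely. A hypothetical one-process reproduction of that symptom (gloo backend used here just to keep the sketch CPU-only):

    import os
    import torch.distributed as dist

    # WORLD_SIZE=4 tells the rendezvous to wait for four processes, but only
    # this one (rank 0) is ever started, so the call below blocks until it
    # times out or is killed -- the same symptom as the hanging train.py run.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"
    dist.init_process_group(backend="gloo", init_method="env://",
                            rank=0, world_size=4)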

How do I train Mask R-CNN in a multi-GPU environment? Is that possible? Is there any example/guide/tutorial showing the correct settings?

fmassa (Member) commented Jul 2, 2019

Here is the command you should use:

python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py

You can find more information about the torch.distributed.launch utility in https://pytorch.org/docs/stable/distributed.html#launch-utility
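
With --use_env, the launcher exports RANK, WORLD_SIZE and LOCAL_RANK as environment variables (instead of passing a --local_rank command-line argument), and the training script picks them up during its distributed setup. A rough sketch of that pattern, assuming the usual environment variable names (this is not the literal torchvision helper):

    import os
    import torch
    import torch.distributed as dist

    # Sketch of the init pattern used by scripts launched with --use_env:
    # the launcher sets RANK, WORLD_SIZE and LOCAL_RANK, each process pins
    # itself to its GPU and then joins the process group.
    def init_distributed():
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl", init_method="env://",
                                rank=rank, world_size=world_size)
        dist.barrier()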

But this is a good point; I should add further information on how to launch multi-GPU jobs in a new README in each folder. A PR adding some basic instructions would be awesome.

Let me know if you have further questions.

@FrancescoSaverioZuppichini

The current solution from @fmassa doesn't work for me. The Python code runs correctly, but training never starts; it just hangs there without doing anything.

fmassa (Member) commented Sep 30, 2021

@FrancescoSaverioZuppichini what PyTorch version are you using? Note that we are training models with that script right now and it doesn't hang. What is the error message you get when you stop the job? Also, did you make any modifications to the model or training scripts?

cc @datumbox for visibility

@FrancescoSaverioZuppichini

@fmassa my bad, I mistakenly commented on this issue when I meant to comment on a different issue in a different package. This is why you shouldn't keep too many Chrome tabs open at the same time. Apologies.
