
Can't train Mask R-CNN in a distributed environment #1073

Closed
lpuglia opened this issue Jul 2, 2019 · 4 comments

Comments


lpuglia commented Jul 2, 2019

I would like to use the training script at references/detection. Single-GPU training works fine (both on master and on the v0.3.0 tag). My server has 4 GPUs and I would really like to use all of them; as far as I understand, I have to set up the torch.distributed package (which I'm not familiar with). First I tried:

RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 python train.py

which returns:

ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set
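
For reference, the env:// init method looks up MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE in the environment of every process, which is why the error above complains about the missing MASTER_ADDR. A minimal sketch of the call involved (not the actual torchvision code):

    import torch.distributed as dist

    # With init_method="env://", PyTorch reads MASTER_ADDR, MASTER_PORT,
    # RANK and WORLD_SIZE from environment variables; if MASTER_ADDR is
    # missing, init_process_group raises the ValueError shown above.
    dist.init_process_group(backend="nccl", init_method="env://")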

but reading this tutorial it is shown that:

MASTER_ADDR - required (except for rank 0); address of rank 0 node

is not required for rank 0. In fact, when I run:

 RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 MASTER_ADDR=localhost MASTER_PORT=12345 python train.py

it just hangs.
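
The hang is likely because only one process was started: with WORLD_SIZE=4, the rendezvous waits for four processes to call init_process_group, so a single rank blocks indefinitely. A hypothetical one-process reproduction of that symptom (gloo backend used here just to keep the sketch CPU-only):

    import os
    import torch.distributed as dist

    # WORLD_SIZE=4 tells the rendezvous to wait for four processes, but only
    # this one (rank 0) is ever started, so the call below blocks until it
    # times out or is killed -- the same symptom as the hanging train.py run.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"
    dist.init_process_group(backend="gloo", init_method="env://",
                            rank=0, world_size=4)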

How do I train Mask R-CNN in a multi-GPU environment? Is that possible? Is there any example/guide/tutorial showing the correct settings?

fmassa (Member) commented Jul 2, 2019

Here is the command you should use:

python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py

You can find more information about the torch.distributed.launch utility in https://pytorch.org/docs/stable/distributed.html#launch-utility
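
With --use_env, the launcher exports RANK, WORLD_SIZE and LOCAL_RANK as environment variables (instead of passing a --local_rank command-line argument), and the training script picks them up during its distributed setup. A rough sketch of that pattern, assuming the usual environment variable names (this is not the literal torchvision helper):

    import os
    import torch
    import torch.distributed as dist

    # Sketch of the init pattern used by scripts launched with --use_env:
    # the launcher sets RANK, WORLD_SIZE and LOCAL_RANK, each process pins
    # itself to its GPU and then joins the process group.
    def init_distributed():
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl", init_method="env://",
                                rank=rank, world_size=world_size)
        dist.barrier()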

But this is a good point; I should add further information on how to launch multi-GPU jobs in a new README in each folder. A PR adding some basic instructions would be awesome.

Let me know if you have further questions.

@FrancescoSaverioZuppichini

The current solution from @fmassa doesn't work for me. The Python code runs correctly, but training never starts; it just hangs there without doing anything.

fmassa (Member) commented Sep 30, 2021

@FrancescoSaverioZuppichini what PyTorch version are you using? Note that we are training models with that script right now and it doesn't hang. What is the error message you get when you stop the job? Also, did you make any modifications to the model or training scripts?

cc @datumbox for visibility

@FrancescoSaverioZuppichini

@fmassa my bad, I mistakenly commented on this issue when I meant to comment on a different issue in a different package. This is why you shouldn't keep too many Chrome tabs open at the same time. Apologies.
