Can't train Mask R-CNN in a distributed environment #1073
Comments
Here is the command you should use:
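A minimal sketch, assuming a single node with 4 GPUs and the stock references/detection `train.py` (the dataset path and flag values here are illustrative, not taken from the original comment):

```sh
# One process per GPU on a single node. torch.distributed.launch exports
# MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE for every worker it spawns, and
# --use_env tells train.py to read them from the environment.
# The dataset path and hyperparameters below are placeholders.
python -m torch.distributed.launch --nproc_per_node=4 --use_env \
    train.py --dataset coco --model maskrcnn_resnet50_fpn \
    --data-path /path/to/coco --epochs 26
```

The same pattern scales to other GPU counts by adjusting `--nproc_per_node`.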
You can find more information about the launch utility in the PyTorch documentation. But this is a good point; I should add further information on how to launch multi-GPU jobs in a new README in each folder. A PR adding some basic instructions would be awesome. Let me know if you have further questions.
The current solution from @fmassa doesn't work for me. The Python code runs without errors, but training never starts; it just hangs there without doing anything.
@FrancescoSaverioZuppichini what PyTorch version are you using? Note that we are training models with that script right now and it doesn't hang. What error message do you get when you stop the job? Also, did you make any modifications to the model or training scripts? cc @datumbox for visibility
@fmassa my bad, I commented on this issue by mistake; I meant to comment on a different issue from a different package. This is why you shouldn't keep too many Chrome tabs open at the same time. Apologies!
I would like to use the training script at references/detection. Single-GPU training works well (both on `master` and on the `v0.3.0` tag). Now, on my server I have 4 GPUs and I would really like to use them; as far as I understand, I have to set up the `torch.distributed` package (which I'm not familiar with). First I tried:

which returns:
but reading this tutorial, it states that:

MASTER_ADDR - required (except for rank 0); address of rank 0 node

i.e. MASTER_ADDR is not required for rank 0. In fact, when I do:

it gets stuck.
How do I train Mask R-CNN in a multi-GPU environment? Is that possible? Is there any example/guide/tutorial showing the correct settings?
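One likely explanation for the hang described above: with `init_method="env://"`, `torch.distributed.init_process_group` blocks until all `world_size` processes have joined the group, so calling it from a single interpreter with a world size greater than 1 never returns. A minimal sketch of that initialization, with placeholder address, port, and sizes:

```python
import os
import torch.distributed as dist

# env:// initialization reads these values from the environment.
# torch.distributed.launch exports them for you; here they are set by
# hand for illustration (placeholder values).
os.environ["MASTER_ADDR"] = "127.0.0.1"  # address of the rank-0 node
os.environ["MASTER_PORT"] = "29500"      # any free TCP port
os.environ["WORLD_SIZE"] = "4"           # total number of processes
os.environ["RANK"] = "0"                 # this process's rank

# This call blocks until all 4 ranks have joined -- running it in a
# single process with WORLD_SIZE=4 therefore appears to hang.
dist.init_process_group(backend="nccl", init_method="env://")
```

This is exactly what `torch.distributed.launch` automates: it spawns `WORLD_SIZE` worker processes and exports these variables for each one, so every rank gets past this call and training can start.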