Replaces ddp .spawn with subprocess #2029
Conversation
I do not think that lowering max_epochs is a solution, it's just hiding the true issue :{
os.environ['LOCAL_RANK'] = '0'

# pull out the commands used to run the script and resolve the abs file path
command = sys.argv
this assumes that it can be called only from the command line with Trainer-like arguments...
but what about just a script with params loaded from a file...
cc @PyTorchLightning/core-contributors
can you give an example?
these tests were done with CLI flags and also Trainer args. This local rank is coming from the gpus flag in the Trainer.
This also assumes that it is called as `python path/to/some/script.py`
and not e.g. from a console entry point
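To illustrate the concern above: re-prefixing `sys.argv` with the Python interpreter only works when the program was started as `python script.py ...`. A minimal sketch (this helper is hypothetical, not code from the PR) of distinguishing the two launch styles:

```python
import sys

def resolve_launch_command():
    """Rebuild the command line used to start this program.

    If argv[0] looks like a .py file, assume `python script.py ...` and
    prefix the interpreter. Otherwise argv[0] is likely an installed
    console entry point, which must be re-run directly -- prefixing it
    with `python` would fail.
    """
    if sys.argv[0].endswith('.py'):
        return [sys.executable] + sys.argv
    return sys.argv[:]
```

This is only a sketch of the edge case the reviewer raises; the PR itself assumes the `python path/to/script.py` form.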
in fact, this part of the code is not tested...
https://codecov.io/gh/PyTorchLightning/pytorch-lightning/commit/82a20296e308a67c8d9202e4cbdf92a44b90b077
def _signal_kill_handler(*args):
    return TrainerTrainLoopMixin.run_training_teardown(self)

orig_signal_handlers = {}
cc: @justusschock
as mentioned @justusschock, I disabled this since it was a blocker.
TBH, I don't know how this was allowed on master with the exceptions being thrown every time.
I'm fine with temporarily disabling this. In our CI these exceptions did not appear.
Does this merge make DDP work in Jupyter notebooks now?
weights_summary: Optional[str] = 'full',   # before
weights_summary: Optional[str] = 'top',    # after
This change is not reflected in the docs
mind shooting a fix PR?
yes unless it was not intentional. @williamFalcon yes or no?
sorry, forgot to update the docs on this.
I propose we make it `top` going forward
Added it here #2021, hope it's fine including it there.
* replace ddp spawn with subprocess
* hot fix
Problem
A ton of the reported issues on DDP when using a single node are caused by the .spawn call used in the implementation.
But more importantly, it removes the need to pickle things!
Approach
Here we remove .spawn and instead use subprocess to spin off independent script invocations, treating the original script as the master.
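The approach above can be sketched roughly as follows. This is a hedged illustration, not the exact PR code: the function names (`build_command`, `launch_ddp_children`) and the `world_size` parameter are placeholders for illustration; only the `LOCAL_RANK` environment variable and the use of `sys.argv` appear in the actual diff.

```python
import os
import subprocess
import sys

def build_command():
    """Rebuild the command used to run this script, resolving the
    script's absolute file path so children can be spawned from any
    working directory."""
    return [sys.executable, os.path.abspath(sys.argv[0])] + sys.argv[1:]

def launch_ddp_children(world_size):
    """Spawn one child process per extra GPU rank via subprocess.

    Rank 0 is the current (master) process; each child gets its rank
    through the LOCAL_RANK environment variable instead of mp.spawn,
    so nothing has to be pickled across process boundaries.
    """
    command = build_command()
    procs = []
    for local_rank in range(1, world_size):
        env = os.environ.copy()
        env['LOCAL_RANK'] = str(local_rank)
        procs.append(subprocess.Popen(command, env=env))
    return procs
```

Because each child is a fresh interpreter running the same script, the model and dataloaders are constructed in-process rather than being pickled and sent over, which is what eliminates the pickling failures listed below.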
Fixes
(Issue authors, please verify if it does)
Fixes #2028
Fixes #1981
Fixes #1972
Fixes #1970
Fixes #1943
Fixes #1942
Fixes #1890
Fixes #1834
Fixes #1831
Fixes #1774
Fixes #1710
Fixes #1704
Fixes #1694
Fixes #1682
Fixes #1479
Fixes #1461
Fixes #1376
Fixes #981
Fixes #965
Fixes #958
Maybe fixes (authors, please check if it does)
#1761
#1714
#1550
#1542
#1846
Additional checks
I verified the following cases, but @tullie @justusschock @ethanwharris @jeremyjordan please check on your own to make sure we have no issues.
Review
Sorry to ping multiple people here lol, just want a good number of sanity checks since this is so critical.
@justusschock I had to comment out the signal handling... not sure what that was solving, but it was causing ctrl+c to not work in killing the processes. I did add a fix to the keyboard handling, which now registers ctrl+c only once and avoids leaving hanging procs.