PoC: Accelerator refactor #5743
Conversation
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
🚀
@@ -307,29 +305,6 @@ def load_spawn_weights(self, original_model):
for later: we should use fsspec here in case trainer.default_root_dir is a remote path.
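For reference, a minimal sketch of what an fsspec-based version could look like (the helper name, filename, and payload below are illustrative, not taken from this diff):

import fsspec
import torch

def save_spawn_weights(model, default_root_dir):
    # fsspec handles local paths as well as remote ones such as "s3://..." or "gs://..."
    path = f"{default_root_dir}/__temp_weights.ckpt"
    with fsspec.open(path, "wb") as f:
        torch.save(model.state_dict(), f)
    return path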
        Return:
            A tensor of shape (world_size, batch, ...)
        """
        return all_gather_ddp_if_available(tensor, group=group, sync_grads=sync_grads)
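As a usage illustration of the shape contract in the docstring (the surrounding names are assumptions, not from this PR):

# each process holds a per-rank tensor of shape (batch, features)
local = torch.randn(32, 128, device=device)
# after gathering, every process holds the stacked result of shape (world_size, 32, 128)
gathered = accelerator.all_gather(local)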
Do we need to rename this utility? DDP is a training type plugin, but this is in the base accelerator, which can be confusing.
Same question as above: why does all_gather get called via the accelerator rather than the training type plugin?
Yes, maybe replace ddp with distributed here.
It is also called here for backwards compatibility. We can remove this after a while, I think.
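If the rename happens, a hedged sketch of how the old name could be kept as a thin backwards-compatible alias (the new name and the warning text are assumptions):

import warnings

def all_gather_distributed_if_available(tensor, group=None, sync_grads=False):
    # renamed utility; the actual gathering logic would live here
    ...

def all_gather_ddp_if_available(tensor, group=None, sync_grads=False):
    # kept temporarily so existing imports keep working; can be removed later
    warnings.warn(
        "`all_gather_ddp_if_available` was renamed to `all_gather_distributed_if_available`",
        DeprecationWarning,
    )
    return all_gather_distributed_if_available(tensor, group=group, sync_grads=sync_grads)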
@@ -177,7 +180,19 @@ def set_world_ranks(self):
        self.global_rank = self.node_rank * self.num_processes + self.local_rank
        self.world_size = self.num_nodes * self.num_processes

    def pre_configure_ddp(self):
        # todo: PyTorch 1.7.0 DDP introduces ``self.reducer._rebuild_buckets()``, breaking manual_optimization
Is this an issue with Lightning's manual optimization or with the PyTorch implementation?
This is an issue with PyTorch: there is a flag that should completely disable the experimental feature, but it does not seem to!
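For context, a sketch of the kind of workaround this tends to require (the `self._ddp_kwargs` dict and the `automatic_optimization` flag are assumptions here, not quoted from the diff):

def pre_configure_ddp(self):
    # PyTorch >= 1.7 rebuilds DDP buckets after the first backward pass, which assumes every
    # parameter receives a gradient; with manual optimization that assumption can break.
    # Forcing ``find_unused_parameters=True`` is one way to keep manual optimization working.
    if not self.lightning_module.automatic_optimization and not self._ddp_kwargs.get(
        "find_unused_parameters", False
    ):
        self._ddp_kwargs["find_unused_parameters"] = True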
Hey @ananthsub. I would like to catch up on this issue.
@@ -33,6 +35,10 @@ def __init__(self) -> None:
        self._results = None
        self.global_rank = 0

    @property
Do we think this list of hooks is comprehensive? Are there more we want to add down the line?
    @property
    def accelerator(self):
        return self.accelerator_connector.accelerator
Type hints here would be helpful.
Will be added. We are aiming for complete mypy coverage in the near future :)
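For illustration, the property above with a return annotation (the import path is an assumption about where `Accelerator` will live after this refactor):

from pytorch_lightning.accelerators.accelerator import Accelerator

@property
def accelerator(self) -> Accelerator:
    return self.accelerator_connector.accelerator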
@@ -518,12 +479,15 @@ def optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_
    def on_before_zero_grad(self, optimizer):
        self.trainer.call_hook('on_before_zero_grad', optimizer)

    def optimizer_zero_grad(self, batch_idx, optimizer, opt_idx):
+1
What does this PR do?
And once again... this is a new version of #5616, moved to the main repo so that some branches from there can be merged.
Closes #5385
Fixes #4510
This PR separates the Accelerator (the hardware part) from the different training routines.
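Conceptually, the split looks roughly like the sketch below (simplified, not the exact class definitions in this PR): the accelerator owns the hardware side and delegates the distributed training logic to a training type plugin.

class TrainingTypePlugin:
    """Training routine part: single device, DDP, DDP spawn, Horovod, ..."""

    def training_step(self, *args):
        raise NotImplementedError


class Accelerator:
    """Hardware part: device placement, precision, optimizer stepping."""

    def __init__(self, precision_plugin, training_type_plugin):
        self.precision_plugin = precision_plugin
        self.training_type_plugin = training_type_plugin

    def training_step(self, args):
        # the actual training routine is delegated to the training type plugin
        return self.training_type_plugin.training_step(*args)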
Workflow actions:
Remaining TODOs:
- Port block_backward_sync and prepare_for_backward from the old DDPPlugin to the new one to avoid performance hits (Adrian/Justus); a sketch is included at the end of this description.

So far this PR has been co-authored with @awaelchli!
cc our beloved @Borda who helps us with this ❤️
Slides for motivation and high-level overview
List of PRs to look out for (when rebasing, code we need to manually copy over to new files):
#5221, #5195, #5300, #5388
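Regarding the block_backward_sync TODO above, a hedged sketch of one way to port it, assuming it wraps DDP's no_sync() context to skip the gradient all-reduce during accumulation steps (the surrounding attribute names are assumptions):

from contextlib import contextmanager

from torch.nn.parallel import DistributedDataParallel

@contextmanager
def block_backward_sync(self):
    # skip the gradient all-reduce for intermediate accumulation steps
    if isinstance(self.model, DistributedDataParallel):
        with self.model.no_sync():
            yield
    else:
        yield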