PoC: Accelerator refactor #5743

Merged (314 commits) on Feb 12, 2021

Conversation

@justusschock (Member) commented on Feb 2, 2021

What does this PR do?

And once again... This is a new version of #5616, moved to the main repo so that some branches can be merged from there.
Closes #5385

Fixes #4510

This PR separates the Accelerator (the hardware part) from the different training routines; a rough sketch of the split follows.
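For intuition, here is a minimal sketch of the intended split (class and method names are illustrative, not the exact API this PR introduces):

```python
import torch

class TrainingTypePlugin:
    """Illustrative sketch: owns the training routine / process topology (single device, DDP, ...)."""

    def setup(self, model: torch.nn.Module) -> torch.nn.Module:
        # a DDP-style plugin would wrap the model here; a single-device plugin does nothing
        return model

class Accelerator:
    """Illustrative sketch: owns the hardware (CPU/GPU/TPU) and delegates the routine to its plugin."""

    def __init__(self, training_type_plugin: TrainingTypePlugin, root_device: torch.device):
        self.training_type_plugin = training_type_plugin
        self.root_device = root_device

    def setup(self, model: torch.nn.Module) -> torch.nn.Module:
        model = model.to(self.root_device)               # hardware concern
        return self.training_type_plugin.setup(model)    # training-routine concern
```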

Workflow actions:

Remaining TODOs:

  • TPUAccelerator (Justus)
  • DDP2 Plugin (Adrian)
  • Shared Training Plugin (Sean/Justus)
  • RPC Plugin (Justus)
  • Rebase on release branch
  • Testing DDP2 and DDP Slurm
  • Remove old plugins (pl/plugins/old)
  • Make Tuner work (requires setting some attrs through trainer) (Adrian)
  • Port the left-over functions (like block_backward_sync and prepare_for_backward) from the old DDPPlugin to the new one to avoid performance hits (Adrian/Justus); see the sketch after this list
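A hedged sketch of what porting block_backward_sync could look like (hypothetical port; it assumes DDP's built-in no_sync() is the underlying mechanism):

```python
from contextlib import contextmanager

from torch.nn.parallel import DistributedDataParallel

@contextmanager
def block_backward_sync(model):
    # Hypothetical port of the old DDPPlugin helper: skip gradient synchronization
    # (the all-reduce) for this backward pass, e.g. during gradient accumulation,
    # to avoid redundant communication.
    if isinstance(model, DistributedDataParallel):
        with model.no_sync():
            yield
    else:
        yield
```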

So far this PR was co-authored with @awaelchli !

cc our beloved @Borda who helps us with this ❤️

Slides for motivation and high-level overview

List of PRs to look out for (when rebasing, code we need to manually copy over to new files):
#5221, #5195, #5300, #5388

@carmocca mentioned this pull request on Feb 12, 2021
@SeanNaren self-requested a review on February 12, 2021, 16:44
@ananthsub (Contributor) left a comment:

🚀

Resolved review threads on pytorch_lightning/accelerators/accelerator.py and pytorch_lightning/accelerators/gpu.py.
@@ -307,29 +305,6 @@ def load_spawn_weights(self, original_model):

Contributor:

for later: we should use fsspec here in case trainer.default_root_dir is a remote path.
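A hedged sketch of the fsspec suggestion (function and path names are illustrative, not the actual Lightning code):

```python
import fsspec
import torch

def save_spawn_weights(model: torch.nn.Module, path: str) -> None:
    # Illustrative helper: fsspec.open handles local paths as well as remote URLs
    # (e.g. "s3://bucket/ckpt.pt"), so a remote trainer.default_root_dir would
    # no longer break this code path.
    with fsspec.open(path, "wb") as f:
        torch.save(model.state_dict(), f)

def load_spawn_weights(model: torch.nn.Module, path: str) -> torch.nn.Module:
    # Illustrative helper: load the weights back through the same abstraction.
    with fsspec.open(path, "rb") as f:
        model.load_state_dict(torch.load(f, map_location="cpu"))
    return model
```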

Code context (the all_gather method on the base Accelerator):

    Return:
        A tensor of shape (world_size, batch, ...)
    """
    return all_gather_ddp_if_available(tensor, group=group, sync_grads=sync_grads)
Contributor:

do we need to rename this utility? ddp is a training type plugin but this is in the base accelerator which can be confusing

same question as above: why does all_gather get called via the accelerator vs the training type plugin?

Member Author (@justusschock):

Yes, maybe replace ddp with distributed here.

Also called here for backwards compatibility. We can remove this after a while I think
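For illustration, the delegation the reviewer is asking about might look roughly like this (names and the single-device behaviour are assumptions, not the code in this PR):

```python
import torch

class TrainingTypePlugin:
    def all_gather(self, tensor: torch.Tensor, group=None, sync_grads: bool = False) -> torch.Tensor:
        # illustrative: a single-device plugin only adds the world dimension;
        # a distributed plugin would perform the real collective here
        return tensor.unsqueeze(0)

class Accelerator:
    def __init__(self, training_type_plugin: TrainingTypePlugin):
        self.training_type_plugin = training_type_plugin

    def all_gather(self, tensor: torch.Tensor, group=None, sync_grads: bool = False) -> torch.Tensor:
        # delegate to the training type plugin instead of calling a
        # ddp-named utility directly from the base accelerator
        return self.training_type_plugin.all_gather(tensor, group=group, sync_grads=sync_grads)
```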

@@ -177,7 +180,19 @@ def set_world_ranks(self):
self.global_rank = self.node_rank * self.num_processes + self.local_rank
self.world_size = self.num_nodes * self.num_processes

def pre_configure_ddp(self):
# todo: PyTorch 1.7.0 DDP introduces ``self.reducer._rebuild_buckets()`` breaking manual_optimization
Contributor:

is this an issue with lightning's manual optimization or with the pytorch implementation?

Member Author (@justusschock):

This is an issue with PyTorch: there is a flag that is supposed to disable the experimental feature, but it does not seem to disable it completely!
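A hedged sketch of the kind of workaround being discussed (assuming find_unused_parameters is the relevant flag; this is not the actual plugin code):

```python
def pre_configure_ddp_kwargs(ddp_kwargs: dict, automatic_optimization: bool) -> dict:
    # Hypothetical helper: with PyTorch >= 1.7, DDP rebuilds its gradient buckets
    # internally via ``reducer._rebuild_buckets()``, which can break manual
    # optimization. Setting ``find_unused_parameters=True`` before constructing
    # DistributedDataParallel is one way to avoid the rebuild, at some performance cost.
    if not automatic_optimization:
        ddp_kwargs.setdefault("find_unused_parameters", True)
    return ddp_kwargs
```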

Contributor:

Hey @ananthsub. I would like to catch up on this issue.

@@ -33,6 +35,10 @@ def __init__(self) -> None:
self._results = None
self.global_rank = 0

@property
Contributor:

do we think this list of hooks is comprehensive? are there more we want to add down the line?

Comment on lines +68 to +71
@property
def accelerator(self):
return self.accelerator_connector.accelerator

Contributor:

type hints here would be helpful

Member Author (@justusschock):

Will be added. Aiming for complete mypy coverage in the near future :)
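For illustration, the annotation the reviewer is asking for could look like this (the import path for Accelerator is an assumption):

```python
from pytorch_lightning.accelerators import Accelerator  # import path assumed

class Trainer:
    @property
    def accelerator(self) -> Accelerator:
        # annotated so mypy can follow calls made through trainer.accelerator
        return self.accelerator_connector.accelerator
```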

@@ -518,12 +479,15 @@ def optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_
def on_before_zero_grad(self, optimizer):
self.trainer.call_hook('on_before_zero_grad', optimizer)

def optimizer_zero_grad(self, batch_idx, optimizer, opt_idx):
Contributor:

+1


Successfully merging this pull request may close these issues: Connectors Refactoring Discussion
9 participants