PoC: Accelerator refactor #5743
Conversation
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
🚀
@@ -307,29 +305,6 @@ def load_spawn_weights(self, original_model):
for later: we should use fsspec here in case trainer.default_root_dir is a remote path.
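For reference, a minimal sketch of what an fsspec-based version could look like (the helper name, filename, and payload below are illustrative, not taken from this diff):

import fsspec
import torch

def save_spawn_weights(model, default_root_dir):
    # fsspec handles local paths as well as remote ones such as "s3://..." or "gs://..."
    path = f"{default_root_dir}/__temp_weights.ckpt"
    with fsspec.open(path, "wb") as f:
        torch.save(model.state_dict(), f)
    return path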
        Return:
            A tensor of shape (world_size, batch, ...)
        """
        return all_gather_ddp_if_available(tensor, group=group, sync_grads=sync_grads)
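As a usage illustration of the shape contract in the docstring (the surrounding names are assumptions, not from this PR):

# each process holds a per-rank tensor of shape (batch, features)
local = torch.randn(32, 128, device=device)
# after gathering, every process holds the stacked result of shape (world_size, 32, 128)
gathered = accelerator.all_gather(local)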
Do we need to rename this utility? DDP is a training type plugin, but this is in the base accelerator, which can be confusing.
Same question as above: why does all_gather get called via the accelerator rather than the training type plugin?
Yes, maybe replace ddp with distributed here.
It is also called here for backwards compatibility. We can remove this after a while, I think.
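If the rename happens, a hedged sketch of how the old name could be kept as a thin backwards-compatible alias (the new name and the warning text are assumptions):

import warnings

def all_gather_distributed_if_available(tensor, group=None, sync_grads=False):
    # renamed utility; the actual gathering logic would live here
    ...

def all_gather_ddp_if_available(tensor, group=None, sync_grads=False):
    # kept temporarily so existing imports keep working; can be removed later
    warnings.warn(
        "`all_gather_ddp_if_available` was renamed to `all_gather_distributed_if_available`",
        DeprecationWarning,
    )
    return all_gather_distributed_if_available(tensor, group=group, sync_grads=sync_grads)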
@@ -177,7 +180,19 @@ def set_world_ranks(self):
        self.global_rank = self.node_rank * self.num_processes + self.local_rank
        self.world_size = self.num_nodes * self.num_processes

    def pre_configure_ddp(self):
        # todo: PyTorch 1.7.0 DDP introduces ``self.reducer._rebuild_buckets()``, breaking manual_optimization
Is this an issue with Lightning's manual optimization or with the PyTorch implementation?
This is an issue with PyTorch: there is a flag that should completely disable the experimental feature, but it does not seem to!
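For context, a sketch of the kind of workaround this tends to require (the `self._ddp_kwargs` dict and the `automatic_optimization` flag are assumptions here, not quoted from the diff):

def pre_configure_ddp(self):
    # PyTorch >= 1.7 rebuilds DDP buckets after the first backward pass, which assumes every
    # parameter receives a gradient; with manual optimization that assumption can break.
    # Forcing ``find_unused_parameters=True`` is one way to keep manual optimization working.
    if not self.lightning_module.automatic_optimization and not self._ddp_kwargs.get(
        "find_unused_parameters", False
    ):
        self._ddp_kwargs["find_unused_parameters"] = True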
Hey @ananthsub. I would like to catch up on this issue.
@@ -33,6 +35,10 @@ def __init__(self) -> None:
        self._results = None
        self.global_rank = 0

    @property
Do we think this list of hooks is comprehensive? Are there more we want to add down the line?
    @property
    def accelerator(self):
        return self.accelerator_connector.accelerator
Type hints here would be helpful.
Will be added. We are aiming for complete mypy coverage in the near future :)
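For illustration, the property above with a return annotation (the import path is an assumption about where `Accelerator` will live after this refactor):

from pytorch_lightning.accelerators.accelerator import Accelerator

@property
def accelerator(self) -> Accelerator:
    return self.accelerator_connector.accelerator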
@@ -518,12 +479,15 @@ def optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_
    def on_before_zero_grad(self, optimizer):
        self.trainer.call_hook('on_before_zero_grad', optimizer)

    def optimizer_zero_grad(self, batch_idx, optimizer, opt_idx):
+1
What does this PR do?
And once again... this is a new version of #5616, moved to the main repo so that some branches from there can be merged.
Closes #5385
Fixes #4510
This PR separates the Accelerator (the hardware part) from the different training routines.
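Conceptually, the split looks roughly like the sketch below (simplified, not the exact class definitions in this PR): the accelerator owns the hardware side and delegates the distributed training logic to a training type plugin.

class TrainingTypePlugin:
    """Training routine part: single device, DDP, DDP spawn, Horovod, ..."""

    def training_step(self, *args):
        raise NotImplementedError


class Accelerator:
    """Hardware part: device placement, precision, optimizer stepping."""

    def __init__(self, precision_plugin, training_type_plugin):
        self.precision_plugin = precision_plugin
        self.training_type_plugin = training_type_plugin

    def training_step(self, args):
        # the actual training routine is delegated to the training type plugin
        return self.training_type_plugin.training_step(*args)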
Workflow actions:
Remaining TODOs:
- Port block_backward_sync and prepare_for_backward from the old DDPPlugin to the new one to avoid performance hits (Adrian/Justus); a sketch is included at the end of this description.

So far this PR has been co-authored with @awaelchli!
cc our beloved @Borda who helps us with this ❤️
Slides for motivation and high-level overview
List of PRs to look out for (when rebasing, code we need to manually copy over to new files):
#5221, #5195, #5300, #5388
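Regarding the block_backward_sync TODO above, a hedged sketch of one way to port it, assuming it wraps DDP's no_sync() context to skip the gradient all-reduce during accumulation steps (the surrounding attribute names are assumptions):

from contextlib import contextmanager

from torch.nn.parallel import DistributedDataParallel

@contextmanager
def block_backward_sync(self):
    # skip the gradient all-reduce for intermediate accumulation steps
    if isinstance(self.model, DistributedDataParallel):
        with self.model.no_sync():
            yield
    else:
        yield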