
Hardware specific parts of Accelerator Refactoring #5719

Merged: 41 commits merged into release/1.2-dev from ref/hardware-specific on Feb 1, 2021

Conversation

@justusschock (Member) commented on Jan 30, 2021

What does this PR do?

Adds the hardware-specific parts of the refactoring (#5616) as Accelerators for CPU, GPU and TPU, the accelerator connector, and the specific plugins for single-device training, single-TPU training and multi-TPU training (TPUSpawn).

Only files with new classes are added, and some imports are changed. There is no integration into the Trainer yet.

Should be merged after #5715, #5718 and #5714.
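For orientation, here is a rough sketch of how these pieces could fit together. The class names, constructor arguments and methods below are illustrative assumptions, not the exact API added by this PR:

    # Illustrative sketch only; names and signatures are assumptions.
    class Accelerator:
        """Base class that pairs a training-type plugin with a precision plugin."""

        def __init__(self, training_type_plugin, precision_plugin):
            self.training_type_plugin = training_type_plugin
            self.precision_plugin = precision_plugin

        def setup(self, trainer, model):
            # hardware-specific preparation (device placement, etc.) happens here
            raise NotImplementedError


    class GPUAccelerator(Accelerator):
        def setup(self, trainer, model):
            model.cuda()  # move the model to the GPU before training starts


    class SingleDevicePlugin:
        """Training-type plugin for plain single-process, single-device training."""

        def __init__(self, device):
            self.device = device

The accelerator connector mentioned above would then be responsible for picking an accelerator/plugin combination from the Trainer arguments; the sketch only illustrates the division of responsibilities, not the real signatures.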

@pep8speaks commented on Jan 30, 2021

Hello @justusschock! Thanks for updating this PR.

Line 383:13: W503 line break before binary operator

Comment last updated at 2021-02-01 12:37:53 UTC
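For reference, W503 is raised when a continuation line starts with a binary operator. A minimal illustration (not the actual line 383 from this diff):

    # Flagged by W503: the line break comes before the binary operator
    total = (first_value
             + second_value)

    # One way to satisfy the check: break after the operator instead
    total = (first_value +
             second_value)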

justusschock and others added 10 commits January 31, 2021 17:06
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
codecov bot commented on Jan 31, 2021

Codecov Report

Merging #5719 (ea88661) into release/1.2-dev (3bacac7) will decrease coverage by 4%.
The diff coverage is 30%.

@@               Coverage Diff                @@
##           release/1.2-dev   #5719    +/-   ##
================================================
- Coverage               89%     86%    -4%     
================================================
  Files                  173     181     +8     
  Lines                12495   13110   +615     
================================================
+ Hits                 11175   11226    +51     
- Misses                1320    1884   +564     

@Borda added the ready (PRs ready to be merged) label on Jan 31, 2021
@Borda (Member) commented on Feb 1, 2021

Seems to be the last remaining issue:
E ImportError: cannot import name 'ParallelLoader' from 'torch_xla' (/root/miniconda3/envs/lightning/lib/python3.7/site-packages/torch_xla/__init__.py)
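Assuming a standard torch_xla install, the import is probably just pointing at the wrong module: ParallelLoader normally lives in torch_xla.distributed.parallel_loader rather than the torch_xla top level. A hedged sketch of the fix:

    # Fails with the ImportError above: ParallelLoader is not exported at the top level
    # from torch_xla import ParallelLoader

    # ParallelLoader is provided by the distributed submodule (assuming a standard torch_xla layout)
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as xla_pl

    device = xm.xla_device()
    loader = xla_pl.ParallelLoader(train_dataloader, [device])  # train_dataloader assumed defined elsewhere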

@tchaton (Contributor) left a comment:
@justusschock Really good work there! Not all comments need to be addressed in this PR.

Review threads on pytorch_lightning/accelerators/accelerator_connector.py (outdated, resolved)

    if self.is_global_zero:
        # load weights saved in ddp
        path = os.path.join(original_model.trainer.default_root_dir, "__temp_weight_distributed_end.ckpt")
Contributor:
I feel like __temp_weight_distributed_end.ckpt should be a GLOBAL property.

Member Author:

what do you mean by global property?

Contributor:
A class attribute:

    class ...:

        TEMPORARY_WEIGHT_PATH = "__temp_weight_distributed_end.ckpt"

        def __init__(self, ...):
            ...

        def ...(self, ...):
            path = os.path.join(original_model.trainer.default_root_dir, self.TEMPORARY_WEIGHT_PATH)


# load weights if not interrupted
# TODO: check for trainer reference
if self.on_colab_kaggle and not model.trainer.testing:
Contributor:
Could we have a ColabEnv for TPU and move some of this logic there? In another PR :)

Member Author:

Not sure I completely understand this... You mean a ClusterEnvironment for that? We can certainly look into that.

Contributor:

Not sure Colab is a ClusterEnvironment.

I meant something more like a TPURunningEnvironment -> Colab, Kaggle_colab, etc...

Best,
T.C
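A minimal sketch of what such an abstraction might look like. The class names and environment variables below are assumptions for illustration, not something defined in this PR:

    import os


    class TPURunningEnvironment:
        """Hypothetical base class describing where a TPU run is hosted."""

        def is_active(self) -> bool:
            raise NotImplementedError


    class ColabEnvironment(TPURunningEnvironment):
        def is_active(self) -> bool:
            # Colab runtimes typically expose COLAB_GPU in the environment
            return "COLAB_GPU" in os.environ


    class KaggleEnvironment(TPURunningEnvironment):
        def is_active(self) -> bool:
            return "KAGGLE_URL_BASE" in os.environ

The on_colab_kaggle check quoted above could then become a query against whichever environment is detected, rather than an inline environment-variable test.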

Review thread on pytorch_lightning/plugins/training_type/tpu_spawn.py (outdated, resolved)
@tchaton self-requested a review on February 1, 2021 09:58
@awaelchli enabled auto-merge (squash) on February 1, 2021 12:34
@awaelchli merged commit b3ebc18 into release/1.2-dev on Feb 1, 2021
@awaelchli deleted the ref/hardware-specific branch on February 1, 2021 13:35
Labels: ready (PRs ready to be merged), refactor
7 participants