Clean up environment access in plugins #6941

Merged: 66 commits merged from bugfix/elastic-world-size into master on Apr 13, 2021

Conversation

awaelchli (Contributor) commented Apr 10, 2021

What does this PR do?

Fixes #6853

Fixes:

  • delayed availability of rank information (global, local)
  • enables TorchElastic fault tolerance
  • potentially some reports of DDP hanging

Credit: Proposed by @ananthsub

With this PR:

trainer = Trainer(accelerator=...)
trainer.global_rank  # now returns the correct rank on clusters, not only inside trainer.fit()!
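
For context, the cluster environment interface this PR reworks looks roughly like the following. This is a simplified sketch assembled from the hunks quoted in the review below; the exact set of abstract methods and their docstrings are assumptions, not the verbatim interface.

from abc import ABC, abstractmethod


class ClusterEnvironment(ABC):
    """Simplified sketch of the environment interface touched by this PR."""

    @abstractmethod
    def global_rank(self) -> int:
        """ The rank (index) of the currently running process across all nodes and devices. """

    @abstractmethod
    def set_global_rank(self, rank: int) -> None:
        """ Store the global rank, e.g. after new processes have been spawned. """

    @abstractmethod
    def local_rank(self) -> int:
        """ The rank of the currently running process on the current node. """

    @abstractmethod
    def world_size(self) -> int:
        """ The total number of processes participating across all nodes and devices. """

    @abstractmethod
    def set_world_size(self, size: int) -> None:
        """ Store the world size once it is known. """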

TODO:

  • Update TPU plugins

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

pep8speaks commented Apr 10, 2021

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-13 17:39:06 UTC

codecov bot commented Apr 10, 2021

Codecov Report

Merging #6941 (7d39a92) into master (80c5293) will decrease coverage by 9%.
The diff coverage is 83%.

@@           Coverage Diff            @@
##           master   #6941     +/-   ##
========================================
- Coverage      92%     83%     -9%     
========================================
  Files         194     194             
  Lines       12366   13132    +766     
========================================
- Hits        11350   10897    -453     
- Misses       1016    2235   +1219     


@abstractmethod
def global_rank(self) -> int:
    """ The rank (index) of the currently running process across all nodes and devices. """

Shouldn't this be pass too?

awaelchli (Contributor, Author):
It doesn't make a difference: pass is only required when there is nothing else under the function; here the docstring is already enough. :)
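
For illustration (a minimal sketch, not code from this PR): a docstring alone is a valid function body in Python, so pass is only required when the body would otherwise be empty.

def with_docstring():
    """A docstring already counts as the function body, so no pass is needed."""


def without_docstring():
    pass  # required here, otherwise the body would be empty and a SyntaxError is raised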

carmocca (Contributor) left a comment, replying to:

    @carmocca regarding getter/setter, what's your recommendation? Doing it in this PR would touch the full interface.

Let's do it after this one.

@@ -78,11 +78,11 @@ def __init__(
        self._ddp_kwargs = kwargs
        self._has_spawned_children = False
        self.task_idx = None

Contributor:

Remove this?

@@ -34,6 +35,8 @@ class LightningEnvironment(ClusterEnvironment):

    def __init__(self):
        super().__init__()
        self._master_port = None
        self._global_rank: int = 0
        self._world_size: Optional[int] = None

Member:

How can the world size be None? Isn't it 1 then? Also, what happens if it is None? Do we check this?

awaelchli (Contributor, Author):

Yes, we test it in tests/plugins/environments/test_lightning_environments.
You are right, it could default to 1 now.
We never actually see None, because at Trainer init we immediately let the plugin overwrite it with an int.
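
A minimal sketch of that behavior (the attribute names follow the hunk above, but the class name is hypothetical and the default of 1 is the suggestion rather than the code as merged):

class LightningEnvironmentSketch:  # illustrative only, not the actual class
    def __init__(self) -> None:
        self._global_rank: int = 0
        self._world_size: int = 1  # defaulting to 1 instead of None, as suggested

    def world_size(self) -> int:
        return self._world_size

    def set_world_size(self, size: int) -> None:
        # The training type plugin calls this during Trainer init,
        # so in practice the default is never observed.
        self._world_size = size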

Comment on lines 80 to 81:

    # no-op, we are not allowed to change rank in SLURM
    pass

Member:

Maybe it is better to raise an error here to make that more explicit?

Because if I call a function like this and it doesn't error, I expect the new rank to be set.

awaelchli (Contributor, Author):

Hmm, but then we would have to make the call to the setter conditional in the training plugins.
For example, in DDPSpawnPlugin we use the setter to update the global rank once we have spawned the new processes.

Contributor:

Maybe we can add a log error here to warn users, but not raise an exception?

Contributor:

Just a warning would be fine. Might be a good time to introduce several warn/log levels depending on the risk.

awaelchli (Contributor, Author) commented Apr 13, 2021:

I went with log.debug for now. I don't want to risk users seeing error messages pop up that they don't understand. The team wants to include this PR in the patch release today, and I could not find a smarter way to avoid the setter in the limited time.
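
Roughly what that resolution looks like (a sketch; the class name, exact log message, logger setup, and the SLURM_PROCID mapping are assumptions based on the snippets in this review):

import logging
import os

log = logging.getLogger(__name__)


class SLURMEnvironmentSketch:  # illustrative only
    def global_rank(self) -> int:
        # SLURM exposes the global rank of each task via SLURM_PROCID
        return int(os.environ["SLURM_PROCID"])

    def set_global_rank(self, rank: int) -> None:
        # No-op: SLURM controls the rank, so the plugin must not change it.
        # log.debug keeps the call visible for debugging without alarming users.
        log.debug("set_global_rank was called, but SLURM manages the rank; ignoring.")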

def __init__(self):
    super().__init__()

@staticmethod
def is_using_torchelastic() -> bool:

Member:

Why don't we have something similar for SLURM?

awaelchli (Contributor, Author):

We have; it's in the accelerator connector, and I will do a follow-up. I didn't want to make this PR larger.

Comment on lines +37 to +46:

    def global_rank(self) -> int:
        return hvd.rank()

    @property
    def local_rank(self) -> int:
        return hvd.local_rank()

    @property
    def world_size(self) -> int:
        return hvd.size()

Member:

Wouldn't it be cleaner to also have a Horovod environment plugin here? For this part it is similar to TorchElastic and should be handled the same way.
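
Such a plugin could look roughly like this (a sketch of the reviewer's suggestion, not code from this PR; it assumes horovod.torch is installed and hvd.init() has been called):

import horovod.torch as hvd


class HorovodEnvironmentSketch:  # illustrative only
    """Cluster-environment-style wrapper around Horovod's process information."""

    def global_rank(self) -> int:
        return hvd.rank()

    def local_rank(self) -> int:
        return hvd.local_rank()

    def world_size(self) -> int:
        return hvd.size()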

Comment on lines +98 to +99:

    "SLURM_PROCID": "1",
    "SLURM_LOCALID": "1",

Member:

Why do you need to change these?

awaelchli (Contributor, Author):

These were artificial values, and they don't match what we expect. SLURM_NTASKS=2 but LOCAL_RANK=10 is not valid, and it would actually lead to Lightning ignoring the SLURMEnvironment and selecting the LightningEnvironment instead (there is a comment in the code; we should document that behavior better and write proper tests instead of the misleading ones that were here).

tests/plugins/test_common.py (outdated review thread, resolved)
@@ -302,8 +302,8 @@ def root_gpu(self) -> Optional[int]:

    @property
    def is_using_torchelastic(self) -> bool:
        te_flags_passed = "WORLD_SIZE" in os.environ and ("GROUP_RANK" in os.environ or "NODE_RANK" in os.environ)
        return te_flags_passed
        required_env_vars = ("RANK", "GROUP_RANK", "LOCAL_RANK")
Contributor:

What about WORLD_SIZE?
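
The hunk above is truncated, so for reference the new check presumably reduces to something along these lines (an assumption about how the diff continues, not the verbatim code):

import os


def is_using_torchelastic_sketch() -> bool:
    # every variable that torchelastic sets must be present in the environment
    required_env_vars = ("RANK", "GROUP_RANK", "LOCAL_RANK")
    return all(v in os.environ for v in required_env_vars)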

def world_size(self) -> int:
    return self._world_size

def set_world_size(self, size: int) -> None:

Contributor:

Do we need set_world_size? Can you just use a setter and make everything properties?

awaelchli (Contributor, Author):

Yes, but this interface is old and all getters have been plain methods from the beginning. It would be better to have proper getters and setters. I decided not to keep backward compatibility and to make the interface consistent instead; I propose doing a deprecation in a follow-up to keep this PR manageable.

awaelchli merged commit 33cc9fe into master on Apr 13, 2021.
awaelchli deleted the bugfix/elastic-world-size branch on April 13, 2021 at 18:07.

awaelchli (Contributor, Author):

Great, one more headache solved. A few items from reviews that I could not address fully will be tracked here: #6303.
I will make sure these concerns get sorted out. Thanks everyone!

awaelchli added a commit to awaelchli/pytorch-lightning that referenced this pull request on Apr 13, 2021
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Borda pushed a commit that referenced this pull request on Apr 14, 2021
srib mentioned this pull request on Apr 15, 2021
Labels: bug (Something isn't working), design (Includes a design discussion), distributed (Generic distributed-related topic), priority: 0 (High priority task)

Projects: None yet

Development

Successfully merging this pull request may close these issues:

  • global process count incorrect with elastic, fault tolerant training

10 participants