
[Horovod] Fix Reduce for Horovod #6585

Closed
wants to merge 12 commits

Conversation

@amogkam (Contributor) commented Mar 18, 2021

What does this PR do?

#6410 added an additional reduce_boolean_decision step to ModelCheckpoint and EarlyStopping. However, Horovod's reduce functionality is currently broken, which prevents the Horovod backend from being used with ModelCheckpoint. This PR fixes the bug in Horovod's reduce function.
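For context on what this reduce step computes: each worker holds a local boolean (e.g. "should early stopping trigger?"), and the backend must combine those into one global decision. The sketch below is a pure-Python simulation of that logic, independent of Horovod; the function name and signature are illustrative, not Lightning's actual API. A real backend would run an allreduce (e.g. `hvd.allreduce` with a SUM op) on a tensor holding the local decision instead of summing a list.

```python
# Illustrative simulation of reducing a per-worker boolean decision, as a
# ModelCheckpoint/EarlyStopping-style reduce_boolean_decision conceptually does.
# NOT Horovod's actual API: the sum() below stands in for an allreduce(SUM).

def reduce_boolean_decision(local_decisions, mode="all"):
    """Combine one boolean per worker into a single global decision.

    mode="all": every worker must agree (sum == world_size).
    mode="any": one agreeing worker is enough (sum > 0).
    """
    total = sum(int(d) for d in local_decisions)  # stands in for allreduce(SUM)
    world_size = len(local_decisions)
    return total == world_size if mode == "all" else total > 0
```

For example, with two workers where only one wants to stop, `reduce_boolean_decision([True, False], mode="all")` yields False, so training continues on all ranks consistently.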

cc @tchaton

Fixes #<issue_number>

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@Borda Borda added 3rd party Related to a 3rd-party bug Something isn't working labels Mar 19, 2021
@Borda (Member) left a comment

Can we please add a test for this case?

@pep8speaks

pep8speaks commented Mar 19, 2021

Hello @amogkam! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-19 10:28:00 UTC

@tchaton (Contributor) left a comment

Great catch! Yes, we definitely missed this.

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
@mergify mergify bot removed the has conflicts label Mar 25, 2021
@carmocca carmocca added this to the 1.2.x milestone Mar 25, 2021
@codecov

codecov bot commented Mar 25, 2021

Codecov Report

Merging #6585 (d22297d) into master (4c07ab5) will decrease coverage by 7%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #6585    +/-   ##
=======================================
- Coverage      92%     85%    -7%     
=======================================
  Files         194     196     +2     
  Lines       12403   13047   +644     
=======================================
- Hits        11433   11139   -294     
- Misses        970    1908   +938     

@awaelchli (Contributor) left a comment
great fix!
horovod ftw

@carmocca (Contributor) commented:

I changed the test to reuse the same code we already have for DDP, and also added it to special_tests.sh.
However, the test does not work:

tests/checkpointing/test_checkpoint_callback_frequency.py:139: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pytorch_lightning/trainer/connectors/env_vars_connector.py:40: in insert_env_defaults
    return fn(self, **kwargs)
pytorch_lightning/trainer/trainer.py:308: in __init__
    replace_sampler_ddp, deterministic, precision, amp_backend, amp_level, plugins
pytorch_lightning/trainer/connectors/accelerator_connector.py:127: in __init__
    self.set_distributed_mode()
pytorch_lightning/trainer/connectors/accelerator_connector.py:546: in set_distributed_mode
    self._set_horovod_backend()
pytorch_lightning/trainer/connectors/accelerator_connector.py:567: in _set_horovod_backend
    self.check_horovod()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pytorch_lightning.trainer.connectors.accelerator_connector.AcceleratorConnector object at 0x7f78aee90950>

    def check_horovod(self):
        """Raises a `MisconfigurationException` if the Trainer is not configured correctly for Horovod."""
        if not _HOROVOD_AVAILABLE:
            raise MisconfigurationException(
                'Requested `distributed_backend="horovod"`, but Horovod is not installed.'
                "Install with \n $HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch]"
            )
    
        if self.num_gpus > 1 or self.num_nodes > 1:
            raise MisconfigurationException(
>               "Horovod does not support setting num_nodes / num_gpus explicitly. Use "
                "horovodrun / mpirun to configure the number of processes."
            )
E           pytorch_lightning.utilities.exceptions.MisconfigurationException: Horovod does not support setting num_nodes / num_gpus explicitly. Use horovodrun / mpirun to configure the number of processes.

pytorch_lightning/trainer/connectors/accelerator_connector.py:601: MisconfigurationException

I guess the test needs to use:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/models/test_horovod.py#L46-L63

But this doesn't allow passing a custom ModelCheckpoint so a different way of testing this is necessary.
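One workaround suggested by the linked pattern is to launch the whole test script in a subprocess, with the command line wrapped by horovodrun/mpirun so the process count is configured externally rather than via num_gpus. A minimal, hypothetical sketch of that subprocess layer follows; it uses a bare Python interpreter so it runs anywhere, whereas the real helper would prepend something like `horovodrun -np 2` to the command.

```python
import subprocess
import sys

def run_in_subprocess(code: str) -> str:
    """Run a snippet in a fresh interpreter and return its stdout.

    Illustrative sketch only: the actual Horovod test helper would build a
    command like ["horovodrun", "-np", "2", sys.executable, script_path].
    We use a plain interpreter here so the sketch works without Horovod.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

The drawback carmocca notes still applies: anything the test needs (such as a custom ModelCheckpoint) must be serialized into, or reconstructed inside, the child script, since objects cannot be passed across the process boundary directly.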

@carmocca carmocca self-requested a review March 25, 2021 23:42
@awaelchli (Contributor) commented:

This change was recently merged in #6958.
Apologies, we didn't remember this PR was open; I just saw it.

@awaelchli awaelchli added the duplicate This issue or pull request already exists label Apr 18, 2021
@Borda Borda modified the milestones: 1.2.x, 1.3 Apr 18, 2021
 @pytest.mark.parametrize(['k', 'epochs', 'val_check_interval', 'expected'], [(1, 1, 1.0, 1), (2, 2, 0.3, 5)])
-def test_top_k_ddp(save_mock, tmpdir, k, epochs, val_check_interval, expected):
+def test_top_k_distributed(save_mock, tmpdir, accelerator, k, epochs, val_check_interval, expected):
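To illustrate what renaming the test and adding an `accelerator` parameter amounts to, here is a hedged sketch of a stacked pytest parametrization. The body and parameter values are illustrative only (a real test would build a Trainer with the given accelerator and count checkpoint saves), and the `save_mock`/`tmpdir` fixtures from the actual test are omitted for brevity.

```python
import pytest

# Sketch: one test covering both DDP and Horovod via stacked parametrize
# decorators. Illustrative only, not the actual Lightning test code.
@pytest.mark.parametrize("accelerator", ["ddp", "horovod"])
@pytest.mark.parametrize(
    ["k", "epochs", "val_check_interval", "expected"],
    [(1, 1, 1.0, 1), (2, 2, 0.3, 5)],
)
def test_top_k_distributed(accelerator, k, epochs, val_check_interval, expected):
    # A real test would create Trainer(accelerator=accelerator, ...) and assert
    # the checkpoint was saved `expected` times; here we only sanity-check args.
    assert accelerator in ("ddp", "horovod")
    assert expected >= k
```

With both decorators stacked, pytest generates the cross product of the two parameter sets, so each (k, epochs, ...) case runs once per accelerator.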
Review comment from a contributor:

I would not include horovod in the parametrization here yet; otherwise we risk getting another flaky test. I believe we need to improve our testing integration with Horovod first. #6935

Another review comment from a contributor:

Since the main fix of this PR was already merged, I suggest closing this one.

@edenlightning edenlightning removed this from the v1.3 milestone May 4, 2021
@awaelchli awaelchli closed this May 4, 2021
Labels: 3rd party (Related to a 3rd-party), bug (Something isn't working), duplicate (This issue or pull request already exists), has conflicts

7 participants