
[Horovod] Fix Reduce for Horovod #6585

Closed
wants to merge 12 commits

Conversation

@amogkam (Contributor) commented Mar 18, 2021

What does this PR do?

#6410 added an additional reduce_boolean_decision step to ModelCheckpoint and EarlyStopping. However, Horovod's reduce functionality is currently broken, which prevents the Horovod backend from being used with ModelCheckpoint. This PR fixes the bug in Horovod's reduce function.
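For context on what this reduce step computes: each worker holds a local boolean (e.g. "should early stopping trigger?"), and the backend must combine those into one global decision. The sketch below is a pure-Python simulation of that logic, independent of Horovod; the function name and signature are illustrative, not Lightning's actual API. A real backend would run an allreduce (e.g. `hvd.allreduce` with a SUM op) on a tensor holding the local decision instead of summing a list.

```python
# Illustrative simulation of reducing a per-worker boolean decision, as a
# ModelCheckpoint/EarlyStopping-style reduce_boolean_decision conceptually does.
# NOT Horovod's actual API: the sum() below stands in for an allreduce(SUM).

def reduce_boolean_decision(local_decisions, mode="all"):
    """Combine one boolean per worker into a single global decision.

    mode="all": every worker must agree (sum == world_size).
    mode="any": one agreeing worker is enough (sum > 0).
    """
    total = sum(int(d) for d in local_decisions)  # stands in for allreduce(SUM)
    world_size = len(local_decisions)
    return total == world_size if mode == "all" else total > 0
```

For example, with two workers where only one wants to stop, `reduce_boolean_decision([True, False], mode="all")` yields False, so training continues on all ranks consistently.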

cc @tchaton

Fixes #<issue_number>

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@Borda Borda added 3rd party Related to a 3rd-party bug Something isn't working labels Mar 19, 2021
@Borda (Member) left a comment

Can we please add a test for this case?

@pep8speaks

pep8speaks commented Mar 19, 2021

Hello @amogkam! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-19 10:28:00 UTC

@tchaton (Contributor) left a comment

Great catch! Yes, we definitely missed this.

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
@mergify mergify bot removed the has conflicts label Mar 25, 2021
@carmocca carmocca added this to the 1.2.x milestone Mar 25, 2021
@codecov

codecov bot commented Mar 25, 2021

Codecov Report

Merging #6585 (d22297d) into master (4c07ab5) will decrease coverage by 7%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #6585    +/-   ##
=======================================
- Coverage      92%     85%    -7%     
=======================================
  Files         194     196     +2     
  Lines       12403   13047   +644     
=======================================
- Hits        11433   11139   -294     
- Misses        970    1908   +938     

@awaelchli (Contributor) left a comment
great fix!
horovod ftw

@carmocca (Contributor) commented:

I changed the test to reuse the same code we already have for DDP, and also added it to special_tests.sh.
However, the test does not work:

tests/checkpointing/test_checkpoint_callback_frequency.py:139: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pytorch_lightning/trainer/connectors/env_vars_connector.py:40: in insert_env_defaults
    return fn(self, **kwargs)
pytorch_lightning/trainer/trainer.py:308: in __init__
    replace_sampler_ddp, deterministic, precision, amp_backend, amp_level, plugins
pytorch_lightning/trainer/connectors/accelerator_connector.py:127: in __init__
    self.set_distributed_mode()
pytorch_lightning/trainer/connectors/accelerator_connector.py:546: in set_distributed_mode
    self._set_horovod_backend()
pytorch_lightning/trainer/connectors/accelerator_connector.py:567: in _set_horovod_backend
    self.check_horovod()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pytorch_lightning.trainer.connectors.accelerator_connector.AcceleratorConnector object at 0x7f78aee90950>

    def check_horovod(self):
        """Raises a `MisconfigurationException` if the Trainer is not configured correctly for Horovod."""
        if not _HOROVOD_AVAILABLE:
            raise MisconfigurationException(
                'Requested `distributed_backend="horovod"`, but Horovod is not installed.'
                "Install with \n $HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch]"
            )
    
        if self.num_gpus > 1 or self.num_nodes > 1:
            raise MisconfigurationException(
>               "Horovod does not support setting num_nodes / num_gpus explicitly. Use "
                "horovodrun / mpirun to configure the number of processes."
            )
E           pytorch_lightning.utilities.exceptions.MisconfigurationException: Horovod does not support setting num_nodes / num_gpus explicitly. Use horovodrun / mpirun to configure the number of processes.

pytorch_lightning/trainer/connectors/accelerator_connector.py:601: MisconfigurationException

I guess the test needs to use:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/models/test_horovod.py#L46-L63

But this doesn't allow passing a custom ModelCheckpoint so a different way of testing this is necessary.
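One workaround suggested by the linked pattern is to launch the whole test script in a subprocess, with the command line wrapped by horovodrun/mpirun so the process count is configured externally rather than via num_gpus. A minimal, hypothetical sketch of that subprocess layer follows; it uses a bare Python interpreter so it runs anywhere, whereas the real helper would prepend something like `horovodrun -np 2` to the command.

```python
import subprocess
import sys

def run_in_subprocess(code: str) -> str:
    """Run a snippet in a fresh interpreter and return its stdout.

    Illustrative sketch only: the actual Horovod test helper would build a
    command like ["horovodrun", "-np", "2", sys.executable, script_path].
    We use a plain interpreter here so the sketch works without Horovod.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

The drawback carmocca notes still applies: anything the test needs (such as a custom ModelCheckpoint) must be serialized into, or reconstructed inside, the child script, since objects cannot be passed across the process boundary directly.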

@carmocca carmocca self-requested a review March 25, 2021 23:42
@awaelchli (Contributor) commented:

This change was recently merged in #6958.
Apologies, we didn't remember this PR was open; I just saw it.

@awaelchli awaelchli added the duplicate This issue or pull request already exists label Apr 18, 2021
@Borda Borda modified the milestones: 1.2.x, 1.3 Apr 18, 2021
 @pytest.mark.parametrize(['k', 'epochs', 'val_check_interval', 'expected'], [(1, 1, 1.0, 1), (2, 2, 0.3, 5)])
-def test_top_k_ddp(save_mock, tmpdir, k, epochs, val_check_interval, expected):
+def test_top_k_distributed(save_mock, tmpdir, accelerator, k, epochs, val_check_interval, expected):
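To illustrate what renaming the test and adding an `accelerator` parameter amounts to, here is a hedged sketch of a stacked pytest parametrization. The body and parameter values are illustrative only (a real test would build a Trainer with the given accelerator and count checkpoint saves), and the `save_mock`/`tmpdir` fixtures from the actual test are omitted for brevity.

```python
import pytest

# Sketch: one test covering both DDP and Horovod via stacked parametrize
# decorators. Illustrative only, not the actual Lightning test code.
@pytest.mark.parametrize("accelerator", ["ddp", "horovod"])
@pytest.mark.parametrize(
    ["k", "epochs", "val_check_interval", "expected"],
    [(1, 1, 1.0, 1), (2, 2, 0.3, 5)],
)
def test_top_k_distributed(accelerator, k, epochs, val_check_interval, expected):
    # A real test would create Trainer(accelerator=accelerator, ...) and assert
    # the checkpoint was saved `expected` times; here we only sanity-check args.
    assert accelerator in ("ddp", "horovod")
    assert expected >= k
```

With both decorators stacked, pytest generates the cross product of the two parameter sets, so each (k, epochs, ...) case runs once per accelerator.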
Review comment from a contributor:

I would not include horovod in the parametrization here yet; otherwise we risk getting another flaky test. I believe we need to improve our testing integration with Horovod first. #6935

Another review comment from a contributor:

Since the main fix of this PR was already merged, I suggest closing this one.

@edenlightning edenlightning removed this from the v1.3 milestone May 4, 2021
@awaelchli awaelchli closed this May 4, 2021
Labels: 3rd party (Related to a 3rd-party), bug (Something isn't working), duplicate (This issue or pull request already exists), has conflicts

7 participants