
Refactor: clean trainer device & distrib setters #5297

Merged: 25 commits merged into release/1.2-dev on Jan 4, 2021

Conversation

@Borda (Member) commented Dec 29, 2020

What does this PR do?

Fixes # (issue) <- this links related issue to this PR

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes [if needed]?
  • Did you write any new necessary tests [no need for typos, docs]?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Check that target branch and milestone are aligned!

Did you have fun?

Make sure you had fun coding 🙃

@Borda Borda added the refactor label Dec 29, 2020
@Borda Borda added this to the 1.2 milestone Dec 29, 2020
@pep8speaks commented Dec 29, 2020

Hello @Borda! Thanks for updating this PR.

Line 334:21: W503 line break before binary operator
Line 342:17: W503 line break before binary operator

Comment last updated at 2021-01-04 16:06:42 UTC
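
For context, W503 flags a line break placed before a binary operator. A minimal, hypothetical illustration (made-up condition and variable names, not the actual lines from this diff) of what pycodestyle complains about and the wrapped form its default configuration prefers:

```python
accelerator = "ddp_cpu"  # hypothetical values, only to make the snippet runnable
num_processes = 2

# W503: the line break comes *before* the binary operator `and`
use_ddp_cpu = (
    accelerator == "ddp_cpu"
    and num_processes is not None
)

# Layout that satisfies pycodestyle's default configuration: break *after* the operator
use_ddp_cpu = (
    accelerator == "ddp_cpu" and
    num_processes is not None
)
```

Many projects silence W503 instead, since current PEP 8 actually recommends breaking before binary operators; whether to reflow the lines or ignore the warning is a project-level choice.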

@Borda Borda self-assigned this Dec 29, 2020
@Borda Borda changed the title from "Refactor: clean trainer device & distrib setters" to "[blocked by #5303] Refactor: clean trainer device & distrib setters" Dec 30, 2020
@Borda Borda marked this pull request as ready for review December 30, 2020 19:01
@Borda Borda mentioned this pull request Dec 30, 2020
@Borda Borda changed the title from "[blocked by #5303] Refactor: clean trainer device & distrib setters" to "Refactor: clean trainer device & distrib setters" Dec 31, 2020
@Borda (Member, Author) commented Jan 1, 2021

> As a user input? No, I don't think so. When requesting ddp_cpu explicitly, we would need to know how many processes.

so it seems we had a bug there... :D
I have added a fallback: if num_processes is None, it is set to the number of available CPUs.

> Also, be careful with the changes here, it needs to match the previous behavior exactly. The order of the if-elif blocks is important.

yes, that is why this PR changes only the setters, not the readers, and no tests should need to change either...
the reader-side change is in #5300
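
For illustration, a minimal sketch of the fallback described above, with a hypothetical helper name (not the exact code from this diff): when ddp_cpu is requested without an explicit process count, default to the number of CPUs visible to the interpreter.

```python
import os


def resolve_num_processes(num_processes):
    """Hypothetical helper mirroring the behaviour discussed above:
    ddp_cpu without an explicit num_processes falls back to all CPUs."""
    if num_processes is None:
        # os.cpu_count() may return None on some platforms; keep at least 1 process.
        num_processes = os.cpu_count() or 1
    return num_processes


print(resolve_num_processes(None))  # e.g. 8 on an 8-core machine
print(resolve_num_processes(2))     # an explicit value is left untouched
```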

@Borda (Member, Author) commented Jan 1, 2021

@SeanNaren mind checking why the parity params are different for ddp_cpu?

# Assert model parameters are identical after fit
        for ddp_param, custom_param in zip(ddp_model.parameters(), custom_plugin_model.parameters()):
>           assert torch.equal(ddp_param, custom_param), 'Model parameters are different between DDP and Custom plugin'
E           AssertionError: Model parameters are different between DDP and Custom plugin
E           assert False
E            +  where False = <built-in method equal of type object at 0x11d2d95b0>(Parameter containing:\ntensor([[ 0.1203,  0.0808, -0.0999,  0.1504, -0.1179,  0.0486, -0.1525,  0.1665,\n          0.207... -0.0007,\n         -0.0356, -0.2548,  0.0780, -0.1915, -0.1204, -0.1929,  0.1851, -0.1996]],\n       requires_grad=True), Parameter containing:\ntensor([[ 0.1277,  0.1138, -0.0707,  0.1564, -0.0783,  0.0421, -0.1193,  0.1352,\n          0.181...  0.0362,\n          0.0102, -0.1289,  0.1082, -0.1586, -0.0546, -0.1568,  0.1198, -0.1302]],\n       requires_grad=True))
E            +    where <built-in method equal of type object at 0x11d2d95b0> = torch.equal
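
For reference, the failing check above boils down to comparing every parameter tensor of the two trained models for exact equality; a stripped-down sketch of that kind of parity assertion (generic nn.Module arguments, not the benchmark's actual models):

```python
import torch
from torch import nn


def assert_model_parity(model_a: nn.Module, model_b: nn.Module) -> None:
    # Every parameter tensor must match exactly between the two trained models.
    for param_a, param_b in zip(model_a.parameters(), model_b.parameters()):
        assert torch.equal(param_a, param_b), (
            "Model parameters are different between DDP and Custom plugin"
        )
```

Note that torch.equal requires bitwise-identical tensors, so a parity test like this is sensitive to seeding, data ordering, and the number of processes; a change in the effective process count for ddp_cpu could plausibly surface here.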

@codecov (bot) commented Jan 2, 2021

Codecov Report

Merging #5297 (233515c) into release/1.2-dev (73e06fd) will decrease coverage by 2%.
The diff coverage is 95%.

@@               Coverage Diff                @@
##           release/1.2-dev   #5297    +/-   ##
================================================
- Coverage               93%     91%    -2%     
================================================
  Files                  144     146     +2     
  Lines                10146   10417   +271     
================================================
+ Hits                  9425    9516    +91     
- Misses                 721     901   +180     

Review threads (outdated, resolved): benchmarks/test_sharded_parity.py (3 comments)
@Borda Borda added the ready (PRs ready to be merged), bug (Something isn't working) and feature (Is an improvement or enhancement) labels Jan 4, 2021
@SkafteNicki (Member) left a comment

LGTM

Review thread (outdated, resolved): pytorch_lightning/plugins/plugin_connector.py
@SeanNaren (Contributor) left a comment

So ddp_cpu now only works if we set the number of processes above 1, and if it is not specified, all CPU resources are allocated automatically. I think this is fine, just making sure there is no case where we'd want to keep the current behaviour!

@Borda Borda merged commit b72ed71 into release/1.2-dev Jan 4, 2021
@Borda Borda deleted the refactor/trainer-setters branch January 4, 2021 17:10
@Borda (Member, Author) commented Jan 4, 2021

> So ddp_cpu now only works if we set the number of processes above 1, and if it is not specified, all CPU resources are allocated automatically. I think this is fine, just making sure there is no case where we'd want to keep the current behaviour!

happy to verify, but I can't think of a case where you would not want to use the maximum available resources...?
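
For illustration, the user-facing behaviour being discussed, assuming the accelerator/num_processes arguments as they exist on release/1.2-dev (a sketch, not taken from the diff):

```python
import pytorch_lightning as pl

# Explicit process count: unchanged behaviour, two CPU processes.
trainer = pl.Trainer(accelerator="ddp_cpu", num_processes=2)

# Process count omitted (assuming its default resolves to None): per the discussion
# above, ddp_cpu now falls back to one process per available CPU instead of the
# previous, buggy handling.
trainer = pl.Trainer(accelerator="ddp_cpu")
```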
