Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

model checkpint on rank_zero_only & global rank state #1408

Merged
merged 15 commits into from
Apr 24, 2020

Conversation

Borda
Copy link
Member

@Borda Borda commented Apr 7, 2020

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1366. As suggested in the issue
MOreover it it defines global proc rank so no instance needs to carry it...

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@Borda Borda added the bug Something isn't working label Apr 7, 2020
@Borda Borda added this to the 0.7.2 milestone Apr 7, 2020
@mergify mergify bot requested review from a team April 7, 2020 19:27
@codecov
Copy link

codecov bot commented Apr 7, 2020

Codecov Report

Merging #1408 into master will increase coverage by 0%.
The diff coverage is 95%.

@@          Coverage Diff           @@
##           master   #1408   +/-   ##
======================================
  Coverage      89%     89%           
======================================
  Files          69      70    +1     
  Lines        4148    4138   -10     
======================================
- Hits         3677    3670    -7     
+ Misses        471     468    -3     

@awaelchli
Copy link
Contributor

using the rank_zero decorator is maybe not a bad idea. it could be moved out of logger package to a more general package. and the ModelCheckpoint could get a rank property.

@pep8speaks
Copy link

pep8speaks commented Apr 7, 2020

Hello @Borda! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-24 20:53:37 UTC

@Borda Borda changed the title try delete in async or DDP use-case model checkpint on rank zero only Apr 7, 2020
Copy link
Member

@justusschock justusschock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@justusschock justusschock requested review from justusschock and a team April 8, 2020 06:35
@justusschock justusschock requested a review from a team April 8, 2020 06:36
Copy link
Contributor

@williamFalcon williamFalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are a few changes here that can break functionality.
I'd prefer @neggert to look at this closer since he spent a long time on this a while back

@Borda Borda requested review from neggert and a team April 8, 2020 12:29
@mergify
Copy link
Contributor

mergify bot commented Apr 8, 2020

This pull request is now in conflict... :(

@Borda Borda modified the milestones: 0.7.2, 0.7.3 Apr 8, 2020
@Borda
Copy link
Member Author

Borda commented Apr 8, 2020

changelog need to be rebased on new release #1419

@Borda Borda changed the title model checkpint on rank zero only model checkpint on rank_zero_only & global rank state Apr 11, 2020
@Borda Borda added the feature Is an improvement or enhancement label Apr 11, 2020
@mergify
Copy link
Contributor

mergify bot commented Apr 13, 2020

This pull request is now in conflict... :(

@Borda Borda added the ready PRs ready to be merged label Apr 13, 2020
@mergify
Copy link
Contributor

mergify bot commented Apr 23, 2020

This pull request is now in conflict... :(

@awaelchli
Copy link
Contributor

@Borda I saw the problem with my proposal and fixed it in my branch here:
https://github.com/awaelchli/pytorch-lightning/tree/delete-checkpoint-proposal
It is forked from your branch.
The decorator you had before was better than mine. It can be applied to both methods and functions.
I added the rank as the function attribute. What do you think?

@Borda
Copy link
Member Author

Borda commented Apr 24, 2020

@williamFalcon updated with @awaelchli

@mergify
Copy link
Contributor

mergify bot commented Apr 24, 2020

This pull request is now in conflict... :(

@mergify mergify bot requested a review from a team April 24, 2020 14:53
@Borda Borda added the ready PRs ready to be merged label Apr 24, 2020
Comment on lines +15 to +19
try:
# add the attribute to the function but don't overwrite in case Trainer has already set it
getattr(rank_zero_only, 'rank')
except AttributeError:
rank_zero_only.rank = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try:
# add the attribute to the function but don't overwrite in case Trainer has already set it
getattr(rank_zero_only, 'rank')
except AttributeError:
rank_zero_only.rank = 0
# add the attribute to the function but don't overwrite in case Trainer has already set it
rank_zero_only.rank = getattr(rank_zero_only, 'rank', 0)

I added the try catch block but this is probably more elegant :)

@mergify mergify bot requested a review from a team April 24, 2020 21:13
@williamFalcon williamFalcon merged commit 58a467d into master Apr 24, 2020
@Borda Borda modified the milestones: 0.7.4, v0.7.x Apr 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working feature Is an improvement or enhancement priority: 0 High priority task ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ModelCheckpoint tries to remove already removed checkpoint in DDP mode
5 participants