Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.fit() returns last not best weights in ddp_spawn #2565

Merged
merged 13 commits into from
Jul 9, 2020
Merged

Conversation

williamFalcon
Copy link
Contributor

@williamFalcon williamFalcon commented Jul 9, 2020

Fixes #2547

@mergify mergify bot requested a review from a team July 9, 2020 10:46
@williamFalcon williamFalcon changed the title .fit() returns last not best weights .fit() returns last not best weights in ddp_spawn Jul 9, 2020
@@ -559,9 +560,13 @@ def ddp_train(self, process_idx, q, model, is_master=False, proc_offset=0):
torch.cuda.empty_cache()

if self.global_rank == 0 and q is not None:
rank_zero_warn('cleaning up ddp environment...')
q.put(self.checkpoint_callback.best_model_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a None check here for checkpoint_callback? because the user can set it to None, if they want.
See #2547

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

@mergify mergify bot requested a review from a team July 9, 2020 10:51
Copy link
Contributor

@yukw777 yukw777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain this PR a bit? Just curious since it touched the code I touched and I want to make sure I understand it. :) is this only for “testing” as in unit/integration tests?

@williamFalcon
Copy link
Contributor Author

In ddp_spawn the model is only updated in a subprocess. Thus when .fit() ends the original model is still untrained.
This PR makes sure we restore the weights to that model from the last state of the model instead of the "best"

@Borda Borda added the feature Is an improvement or enhancement label Jul 9, 2020
@mergify mergify bot requested a review from a team July 9, 2020 12:40
@yukw777
Copy link
Contributor

yukw777 commented Jul 9, 2020

@williamFalcon ok and you use the queue to send back the best weight path to the main process in case the user wants the best weights for testing. got it.

@Borda Borda self-requested a review July 9, 2020 13:24
@pep8speaks
Copy link

pep8speaks commented Jul 9, 2020

Hello @williamFalcon! Thanks for updating this PR.

Line 569:12: E713 test for membership should be 'not in'

Comment last updated at 2020-07-09 15:22:21 UTC

@codecov
Copy link

codecov bot commented Jul 9, 2020

Codecov Report

Merging #2565 into master will increase coverage by 4%.
The diff coverage is 57%.

@@           Coverage Diff           @@
##           master   #2565    +/-   ##
=======================================
+ Coverage      87%     91%    +4%     
=======================================
  Files          70      70            
  Lines        5703    5718    +15     
=======================================
+ Hits         4960    5209   +249     
+ Misses        743     509   -234     

@williamFalcon williamFalcon merged commit 4bbcfa0 into master Jul 9, 2020
@Borda Borda deleted the last branch July 9, 2020 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Is an improvement or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't use None (anymore) in checkpoint_callback
5 participants