Monitor on training_epoch_end with ModelCheckpoint #5084

Closed
wants to merge 17 commits into from

Conversation

tchaton (Contributor) commented Dec 11, 2020

What does this PR do?

Fixes #4797 (ModelCheckpoint not working when monitor is logged in training_epoch_end)

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified; bugfixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃

pep8speaks commented Dec 11, 2020

Hello @tchaton! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-12-14 07:40:07 UTC

tchaton self-assigned this Dec 11, 2020
tchaton changed the title from "wip" to "Monitor on training_epoch_end with ModelCheckpoint" Dec 11, 2020
tchaton added the checkpointing label (Related to checkpointing) Dec 11, 2020
tchaton added this to the 1.2.x milestone Dec 11, 2020
tchaton marked this pull request as ready for review December 11, 2020 12:07
codecov bot commented Dec 11, 2020

Codecov Report

Merging #5084 (985d4f0) into master (b4d926b) will increase coverage by 0%.
The diff coverage is 94%.

@@          Coverage Diff           @@
##           master   #5084   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         134     134           
  Lines        9905    9913    +8     
======================================
+ Hits         9204    9213    +9     
+ Misses        701     700    -1     
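The "increase coverage by 0%" line comes from rounding. The sketch below assumes that coverage is hits divided by tracked lines, which is consistent with the table because hits + misses = lines on both sides. Under that assumption, coverage rises by only a few hundredths of a percentage point, and both values round to 93%.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Percentage of tracked lines that were hit (assumes no partial lines)."""
    return 100.0 * hits / lines

# Counts copied from the Codecov table above.
before = coverage_pct(9204, 9905)  # master, about 92.92%
after = coverage_pct(9213, 9913)   # PR #5084, about 92.94%

print(f"{before:.2f}% -> {after:.2f}% (delta {after - before:+.2f} pp)")
```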

tchaton modified the milestones: 1.2.x, 1.1.x Dec 11, 2020
carmocca previously approved these changes Dec 12, 2020
pytorch_lightning/utilities/distributed.py (outdated, resolved)
tests/checkpointing/test_model_checkpoint.py (resolved)
tests/checkpointing/test_model_checkpoint.py (outdated, resolved)
tests/checkpointing/test_model_checkpoint.py (outdated, resolved)
pytorch_lightning/callbacks/model_checkpoint.py (outdated, resolved)
carmocca self-requested a review December 12, 2020 02:11
carmocca (Contributor)

(misclicked approve instead of comment) 🙃

mergify bot requested a review from a team December 12, 2020 14:47
rohitgr7 (Contributor) commented Dec 12, 2020

I disagree with the warning here. checkpoint_callback should monitor the metric logged in training_epoch_end and create checkpoints accordingly, whether or not validation_step runs.
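To make this point concrete, here is a minimal pure-Python sketch of top-k checkpoint selection that is driven only by a metric logged at the end of each training epoch, with no validation loop involved. `TopKCheckpointSketch`, the signature of its `on_train_epoch_end` method, and the `train_loss` key are hypothetical illustrations. They are not Lightning's `ModelCheckpoint` implementation or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TopKCheckpointSketch:
    """Hypothetical stand-in for a ModelCheckpoint-style callback (not Lightning's API)."""
    monitor: str
    save_top_k: int = 1
    mode: str = "min"
    best: List[Tuple[float, str]] = field(default_factory=list)  # (score, filename)

    def on_train_epoch_end(self, epoch: int, logged: Dict[str, float]) -> Optional[str]:
        """Return the checkpoint filename if this epoch enters the top-k, else None."""
        # Only metrics logged during the training epoch are consulted; no validation needed.
        if self.monitor not in logged:
            return None
        filename = f"epoch={epoch}.ckpt"
        self.best.append((logged[self.monitor], filename))
        # Best scores first: ascending for "min", descending for "max".
        self.best.sort(key=lambda item: item[0], reverse=(self.mode == "max"))
        del self.best[self.save_top_k:]  # drop everything outside the top-k
        return filename if any(name == filename for _, name in self.best) else None

    def kept(self) -> List[str]:
        return sorted(name for _, name in self.best)


ckpt = TopKCheckpointSketch(monitor="train_loss", save_top_k=2)
for epoch, loss in enumerate([0.9, 0.5, 0.7, 0.4, 0.6]):
    # In Lightning, this value would come from self.log(...) in training_epoch_end.
    ckpt.on_train_epoch_end(epoch, {"train_loss": loss})
print(ckpt.kept())  # ['epoch=1.ckpt', 'epoch=3.ckpt']
```

In this design, the presence of a validation loop never enters the decision. The only requirement is that the monitored key has been logged by the time the epoch-end hook runs.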

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
pytorch_lightning/callbacks/model_checkpoint.py (outdated, resolved)
pytorch_lightning/callbacks/model_checkpoint.py (outdated, resolved)
tests/checkpointing/test_model_checkpoint.py (outdated, resolved)
tests/checkpointing/test_model_checkpoint.py (outdated, resolved)
tests/checkpointing/test_model_checkpoint.py (outdated, resolved)
chks = os.listdir(tmpdir)
assert 'epoch=4.ckpt' not in chks
assert 'epoch=3.ckpt' not in chks
assert 'epoch=2.ckpt' not in chks
Contributor

This test is missing a pytest.warns check for the warning that is thrown 😄
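Below is a sketch of the kind of check being requested: `pytest.warns` used as a context manager with a `match` pattern. The PR's actual warning text is not shown in this thread, so `configure_checkpoint` and its message are hypothetical placeholders.

```python
import warnings

import pytest


def configure_checkpoint(monitor: str, has_val_loop: bool) -> None:
    """Hypothetical helper that emits a warning similar to the one discussed in this PR."""
    if not has_val_loop:
        warnings.warn(f"`{monitor}` is monitored without a validation loop", UserWarning)


def test_warns_without_validation():
    # Fails unless a UserWarning whose message matches the regex is raised inside the block.
    with pytest.warns(UserWarning, match="without a validation loop"):
        configure_checkpoint("train_loss", has_val_loop=False)
```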

Also it might be better to test

os.listdir(tmpdir) == ['epoch=0.ckpt', ...]

instead of testing that the rest don't exist
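Here is one way to assert the exact set of saved checkpoints. `os.listdir` does not guarantee any ordering, so comparing its raw output to a list literal can fail spuriously. Sorting both sides avoids that problem. The helper name `assert_exact_checkpoints` is hypothetical, and a temporary directory stands in for pytest's `tmpdir` fixture.

```python
import os
import tempfile
from pathlib import Path


def assert_exact_checkpoints(dirpath, expected):
    """Assert that `dirpath` contains exactly the `expected` files, in any order."""
    # os.listdir returns entries in arbitrary order, so compare sorted lists.
    actual = sorted(os.listdir(dirpath))
    assert actual == sorted(expected), f"expected {sorted(expected)}, got {actual}"


# Usage: create two empty checkpoint files and check the directory contents exactly.
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ["epoch=1.ckpt", "epoch=0.ckpt"]:
        Path(tmpdir, name).touch()
    assert_exact_checkpoints(tmpdir, ["epoch=0.ckpt", "epoch=1.ckpt"])
```

A single equality check like this also catches unexpected extra files. A series of `not in` assertions would miss them.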

Contributor Author

The warning was removed. Also, I am thinking it might be better not to merge this PR, as it is a wacky solution, and to work on a train_sanity check instead. @carmocca @rohitgr7 What are your thoughts?

Contributor

yes, let's close this one for now.

tchaton closed this Dec 16, 2020
tchaton deleted the bugfix/4797_model_checkpoint branch December 16, 2020 07:24
Labels
checkpointing Related to checkpointing
Development
ModelCheckpoint not working when monitor is logged in training_epoch_end
6 participants