-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bugfix] Fixed epoch level schedulers not being called when val_check_interval < 1.0 #6075
Merged
rohitgr7
merged 10 commits into
Lightning-AI:master
from
SkafteNicki:lr_scheduler_bugfix
Feb 24, 2021
Merged
Changes from all commits
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
eeb9787
fix bug
SkafteNicki c99d769
fix tests
SkafteNicki 8781a23
changelog
SkafteNicki 961f7f8
fix pep8
SkafteNicki 6830344
fix tests
SkafteNicki fc93a3e
fix and add some tests
rohitgr7 ba124a2
add test for rlop
rohitgr7 f596ab5
chlog
rohitgr7 bc19f34
Merge branch 'master' into lr_scheduler_bugfix
SkafteNicki cede66a
Update CHANGELOG.md
SkafteNicki File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these checks guaranteed to be mutually exclusive? i'm updating to 1.2.4 and see a test failure with one of my modules where it looks like there's a gap:
before:
run_evaluation
ran first, allowing the module to run the validation loop. inside validation step/validation epoch end, the module could log metrics. afterward, the checkpoint is force-run. so if the checkpoint configured a metric that was available only during validation, then somehow this still worked.by moving the run_evaluation to happen after the force checkpoint saving, we fail when looking up the monitor value
i think the correct fix is dropping the training loop force running checkpoint/early stopping callbacks here. they should be part of the callback, but that's a longer term thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe if validation is disabled or val batches=0 then only we run the force checkpoint else it will call run_evaluation and the the callbacks will be called inside and monitor will be taken care of as expected. Can you paste a small example with your case where it doesn't work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not, see #7207
this PR introduced the change where we call the evaluation loop after we call the checkpoint/early stopping callbacks from training.
As a result, this check for
should_train_only
is incomplete - it inherently depends on the evaluation loop to populatenum_val_batches
correctly. inrun_evaluation
we set the validation dataloader, but this is too late as the validation dataloader is what's used to determineshould_skip_eval
above