[Tune][Fix] Remove the `clear_checkpoint` function during Trial restoration error handling. #48532
Merged: justinvyu merged 7 commits into ray-project:master from hongpeng-guo:hpguo/v2/tune_restore_fix on Nov 6, 2024
312406b remove clear checkpoint (hongpeng-guo)
a40b92f fix faulty unit test of test_tune_restore (hongpeng-guo)
8688364 change the unit test back (hongpeng-guo)
72e83c9 remove pdb, making num_failures 2 (hongpeng-guo)
0a376ff adding documents of the unit test (hongpeng-guo)
fe661bb Merge remote-tracking branch 'origin' into hpguo/v2/tune_restore_fix (hongpeng-guo)
0dbb310 Merge remote-tracking branch 'origin' into hpguo/v2/tune_restore_fix (hongpeng-guo)
What is the difference between `self.run_metadata.num_failures` and `self.temporary_state.num_restore_failures`?
`num_restore_failures` is the number of failed restoration attempts. `num_failures` is the number of failures caused by user/application code.

Restoration is not user-defined behavior but a feature we provide, so we don't treat a restoration failure the same as a normal application failure. The behavior is: when the program fails because of application code, we increment `num_failures` and try to restore the application. If the restoration succeeds, the program just continues. If the restoration fails, we keep trying to restore, incrementing `num_restore_failures` by 1 on each failed attempt. Once the number of failed restores reaches `TUNE_RESTORE_RETRY_NUM`, we stop restoring and increment `num_failures` by another 1.
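
To make the counting rule above concrete, here is a minimal sketch of how the two counters could interact. It is a simplified model of the behavior described in this comment, not Ray Tune's actual implementation: `FailureCounters`, `on_application_failure`, and `on_restore_failure` are hypothetical names, the retry limit is passed in as a plain argument rather than read from `TUNE_RESTORE_RETRY_NUM`, and the exact boundary check (`>=` after incrementing) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FailureCounters:
    """Hypothetical stand-in for the two counters discussed above."""

    num_failures: int = 0          # failures attributed to user/application code
    num_restore_failures: int = 0  # failed restoration attempts


def on_application_failure(counters: FailureCounters) -> None:
    """Application code failed: count it; the caller then attempts a restore."""
    counters.num_failures += 1


def on_restore_failure(counters: FailureCounters, restore_retry_num: int) -> bool:
    """Record one failed restore attempt.

    Returns True if restoration should be retried, False once the retry
    budget is used up (assumed check: count >= restore_retry_num).
    """
    counters.num_restore_failures += 1
    if counters.num_restore_failures >= restore_retry_num:
        # Retry budget exhausted: stop restoring and count one more
        # regular failure, as described in the comment above.
        counters.num_failures += 1
        return False
    return True


if __name__ == "__main__":
    counters = FailureCounters()
    on_application_failure(counters)              # user code crashes
    retry = on_restore_failure(counters, 2)       # first restore fails, retry
    retry = on_restore_failure(counters, 2)       # second restore fails, give up
    print(counters, retry)
    # FailureCounters(num_failures=2, num_restore_failures=2) False
```

With a retry limit of 2, one application failure followed by two failed restores leaves `num_failures` at 2: one for the original crash and one for the exhausted restore budget.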