[Tune][Fix] Remove the clear_checkpoint function during Trial restoration error handling. #48532
Conversation
This makes sense to me. Btw, how does this test work...
if self.temporary_state.num_restore_failures >= int(
    os.environ.get("TUNE_RESTORE_RETRY_NUM", 0)
):
    # Restore was unsuccessful, try again without checkpoint.
    self.clear_checkpoint()
    self.run_metadata.num_failures += 1
What is the difference between `self.run_metadata.num_failures` and `self.temporary_state.num_restore_failures`?
`num_restore_failures` is the number of failed restorations. `num_failures` is the number of failures caused by user/application code. Because restoration is not user-defined behavior but a feature we provide, we don't treat a restoration failure the same as a normal application failure. The behavior is: when the program fails due to the application, we increment `num_failures` and try to restore the application. If the restoration succeeds, the program just goes on. If the restoration fails, we keep trying to restore but increment `num_restore_failures` by 1 each time. Once `TUNE_RESTORE_RETRY_NUM` restore attempts are reached, we stop restoring and increment `num_failures` by another 1.
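To make the counter semantics above concrete, here is a minimal, runnable sketch. It only illustrates the description above and is not Ray Tune's actual implementation; `TrialCounters` and its methods are made-up names for illustration.

```python
import os


class TrialCounters:
    """Toy model of the two failure counters discussed above (illustrative only)."""

    def __init__(self):
        self.num_failures = 0          # failures caused by user/application code
        self.num_restore_failures = 0  # failed restoration attempts

    def on_application_failure(self):
        # An application error always counts toward num_failures,
        # and a restore from the latest checkpoint is then attempted.
        self.num_failures += 1

    def on_restore_failure(self):
        # A failed restoration only bumps the restore counter at first.
        self.num_restore_failures += 1
        retry_budget = int(os.environ.get("TUNE_RESTORE_RETRY_NUM", 0))
        if self.num_restore_failures >= retry_budget:
            # Out of restore retries: this now counts as one more "real" failure.
            self.num_failures += 1


os.environ["TUNE_RESTORE_RETRY_NUM"] = "2"  # allow 2 restore attempts

counters = TrialCounters()
counters.on_application_failure()  # num_failures == 1, a restore is attempted
counters.on_restore_failure()      # 1st failed restore: only num_restore_failures goes up
counters.on_restore_failure()      # 2nd failed restore: budget exhausted -> num_failures == 2
print(counters.num_failures, counters.num_restore_failures)  # 2 2
```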
The unit test is not super clear and does not test the issue mentioned above. I slightly modified the unit test and added some documentation.
Thanks for fixing the test!
Can we also update the entry in this doc? https://docs.ray.io/en/latest/tune/api/env.html
Btw, I found the original source of this `clear_checkpoint` behavior. I think it was a catch-all fix for checkpoints that were held in-memory in the object store (which is no longer a thing). These in-memory checkpoints could be lost on node failure, which would cause the restoration to fail. Then, in this situation, it kind of made sense to restart from scratch since the in-memory checkpoint cannot be found.
TL;DR: We are safe to remove the clear checkpoint functionality, since it was a patch for SUPER-LEGACY constraints.
Sure, but I don't think this change modifies the original definition of the env var. BTW, there is a …
Not sure what happened with that build, looks like it succeeded on this one.
…ation error handling. (ray-project#48532) This PR removes the `clear_checkpoint` function, so that Tune doesn't try to "restart trials from scratch". `clear_checkpoint` solved for a legacy use case that doesn't apply anymore, and "restoration failures" are also now an edge case for function Trainables and Ray Train usage. --------- Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Why are these changes needed?
This PR fixes an issue that happens during Trial restoration. The bug doesn't happen often, but it will happen if the actor dies during `Trainable.restore`, such as in preemption scenarios. If `Trainable.restore` doesn't run successfully, the `TuneController` handles it as a special "restore error," which clears the checkpoint.

Tune treats restore errors differently because it doesn't necessarily increment `num_failures`, so trials can try "restoring" multiple times without the run erroring out, even if `max_failures=0`.

Then, on the next restoration, we have an invalid state where `latest_checkpoint_result = TrainingResult(checkpoint=None, metrics=…)`, since it was modified by `clear_checkpoint`, which results in an error in `Trainable.restore`.

This PR removes the `clear_checkpoint` function, so that the new behavior is:
- Every `TUNE_RESTORE_RETRY_NUM` restoration failures contribute to one `num_failures`.
- `clear_checkpoint` is not applied.

After this fix, we can still provide special handling of restoration errors. If restoration is interrupted by preemption, we provide extra chances to restore without directly increasing the total `num_failures`. If the restoration failure is due to some deterministic problem, e.g., a corrupted latest checkpoint, the job will fail after a total of `TUNE_RESTORE_RETRY_NUM * num_failures` retries. We reduced our internal logic to make the overall process clearer.
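As a rough sketch of the behavioral change described above (paraphrased from the quoted snippet and this description, not the actual Ray Tune source; `trial` here is a stand-in object, not the real Trial class):

```python
import os


def handle_restore_error_before(trial):
    # Old behavior (what this PR removes): once the retry budget is hit,
    # drop the checkpoint and restart "from scratch". This left
    # latest_checkpoint_result with checkpoint=None, which then broke
    # the next Trainable.restore call.
    trial.temporary_state.num_restore_failures += 1
    if trial.temporary_state.num_restore_failures >= int(
        os.environ.get("TUNE_RESTORE_RETRY_NUM", 0)
    ):
        trial.clear_checkpoint()
        trial.run_metadata.num_failures += 1


def handle_restore_error_after(trial):
    # New behavior: keep the checkpoint; just count one more failure.
    # The trial keeps restoring from its latest (valid) checkpoint.
    trial.temporary_state.num_restore_failures += 1
    if trial.temporary_state.num_restore_failures >= int(
        os.environ.get("TUNE_RESTORE_RETRY_NUM", 0)
    ):
        trial.run_metadata.num_failures += 1


if __name__ == "__main__":
    from types import SimpleNamespace

    # Minimal stand-in for a trial, just to exercise the sketch above.
    trial = SimpleNamespace(
        temporary_state=SimpleNamespace(num_restore_failures=0),
        run_metadata=SimpleNamespace(num_failures=0),
    )
    os.environ["TUNE_RESTORE_RETRY_NUM"] = "1"
    handle_restore_error_after(trial)
    print(trial.run_metadata.num_failures)  # 1, and the checkpoint is kept
```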
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.