Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tune][Fix]Remove the clear_checkpoint function during Trial restoration error handling. #48532

Merged
10 changes: 2 additions & 8 deletions python/ray/tune/experiment/trial.py
Original file line number Diff line number Diff line change
Expand Up @@ -793,11 +793,11 @@ def get_error(self) -> Optional[TuneError]:
return None

def _handle_restore_error(self, exc: Exception):
# For Restoration errors, we only increment the restore failure count
# if the number of failures exceeds the restore retry limit.
if self.temporary_state.num_restore_failures >= int(
os.environ.get("TUNE_RESTORE_RETRY_NUM", 0)
):
# Restore was unsuccessful, try again without checkpoint.
self.clear_checkpoint()
self.run_metadata.num_failures += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are diff between self.run_metadata.num_failures and self.temporary_state.num_restore_failures?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_restore_failures is the number of failed restoration.
num_failures is the number of failures that caused by user/ application code.

because restoration is not a user defined behavior, but some feature we provided. We don't treat restoration failure same as normal application failure. The behavior is, when the program failed due to application, we increment the num_failures and trying to restore the application. If the restoration is successful, the program just goes on. If the restoration fail, we will keep on trying to restore but increments the number of num_restore_failures by 1. When the TUNE_RESTORE_RETRY_NUM restore reaches, we stop restoration, and increment the num_failures by another 1.

else:
self.temporary_state.num_restore_failures += 1
Expand Down Expand Up @@ -883,12 +883,6 @@ def should_checkpoint(self):
def has_checkpoint(self) -> bool:
return self.checkpoint is not None

def clear_checkpoint(self):
if self.latest_checkpoint_result:
self.latest_checkpoint_result.checkpoint = None
self.temporary_state.restoring_from = None
self.run_metadata.invalidate_cache()

def on_checkpoint(self, checkpoint_result: _TrainingResult):
"""Hook for handling checkpoints taken by the Trainable.

Expand Down