
[Tune][Fix] Remove the `clear_checkpoint` function during Trial restoration error handling. #48532

Merged

Conversation

Contributor

@hongpeng-guo hongpeng-guo commented Nov 4, 2024

Why are these changes needed?

This PR fixes an issue that occurs during Trial restoration. The bug doesn’t happen often, but it does occur if the actor dies during `Trainable.restore`, such as in preemption scenarios.

If `Trainable.restore` doesn’t run successfully, the `TuneController` handles it as a special “restore error,” which clears the checkpoint.

Tune treats restore errors differently because it doesn’t necessarily increment `num_failures`, so trials can try “restoring” multiple times without the run erroring out, even if `max_failures=0`.
Then, on the next restoration, we are in an invalid state where `latest_checkpoint_result = TrainingResult(checkpoint=None, metrics=…)`, since it was modified by `clear_checkpoint`, which results in an error in `Trainable.restore`.

This PR removes the `clear_checkpoint` function, so that the new behavior is:

  1. Every `TUNE_RESTORE_RETRY_NUM` restoration failures contribute one increment to `num_failures`.
  2. We don't do any extra handling of existing checkpoints, i.e., `clear_checkpoint` is not applied.

After this fix, we still provide special handling of restoration errors. If restoration is interrupted by preemption, we give extra chances to restore without directly increasing the total `num_failures`. If the restoration failure is due to some deterministic problem, e.g., a corrupted latest checkpoint, the job will fail after a total of `TUNE_RESTORE_RETRY_NUM * max_failures` retries.

We also reduced the internal logic to make the overall process clearer.
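For reference, here is a simplified sketch of what the handling looks like after this change. It is based on the existing block quoted in the review below, with only the `clear_checkpoint()` call dropped; it is not the exact final diff.

```python
if self.temporary_state.num_restore_failures >= int(
    os.environ.get("TUNE_RESTORE_RETRY_NUM", 0)
):
    # Restore retry budget exhausted: count one regular trial failure,
    # but keep the latest checkpoint intact so the next restore attempt
    # still sees a valid TrainingResult (no more clear_checkpoint()).
    self.run_metadata.num_failures += 1
```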

Related issue number

Checks

  • I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Contributor

@justinvyu justinvyu left a comment


if self.temporary_state.num_restore_failures >= int(
    os.environ.get("TUNE_RESTORE_RETRY_NUM", 0)
):
    # Restore was unsuccessful, try again without checkpoint.
    self.clear_checkpoint()
    self.run_metadata.num_failures += 1
Contributor


What is the difference between `self.run_metadata.num_failures` and `self.temporary_state.num_restore_failures`?

Contributor Author


`num_restore_failures` is the number of failed restorations.
`num_failures` is the number of failures caused by user/application code.

Because restoration is not user-defined behavior but a feature we provide, we don't treat a restoration failure the same as a normal application failure. The behavior is: when the program fails due to the application, we increment `num_failures` and try to restore the trial. If the restoration is successful, the program just goes on. If the restoration fails, we keep trying to restore but increment `num_restore_failures` by 1 each time. When the number of restore attempts reaches `TUNE_RESTORE_RETRY_NUM`, we stop restoring and increment `num_failures` by another 1.
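Roughly, the interplay between the two counters can be modeled with the self-contained sketch below. This is a toy model of the accounting described above, not the actual `Trial`/`TuneController` code; the reset of `num_restore_failures` and the return values are simplifications, and `max_failures` stands for the configured trial failure limit.

```python
import os

TUNE_RESTORE_RETRY_NUM = int(os.environ.get("TUNE_RESTORE_RETRY_NUM", "0"))


class FailureAccounting:
    """Toy model of how restore failures roll up into trial failures."""

    def __init__(self, max_failures: int):
        self.max_failures = max_failures
        self.num_failures = 0          # failures caused by user/application code
        self.num_restore_failures = 0  # failed restore attempts for the current recovery

    def on_application_error(self) -> None:
        # A normal failure in user code: count it, then Tune tries to restore.
        self.num_failures += 1

    def on_restore_error(self) -> bool:
        """Returns True if Tune should keep retrying the restoration."""
        if self.num_restore_failures >= TUNE_RESTORE_RETRY_NUM:
            # A full batch of failed restores counts as one more trial failure;
            # the checkpoint itself is left untouched.
            self.num_failures += 1
            self.num_restore_failures = 0
            return self.num_failures <= self.max_failures
        self.num_restore_failures += 1
        return True
```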

@hongpeng-guo
Contributor Author

hongpeng-guo commented Nov 5, 2024

This makes sense to me. Btw, how does this test work...

https://github.com/anyscale/rayturbo/blob/7b749834329bc4cd4c87c2adf7a9c5eb084a2166/python/ray/tune/tests/test_tuner_restore.py#L539

The unit test was not super clear and was not testing the issue mentioned above. I slightly modified the unit test and added some documentation.
The unit test now captures the issue and passes after the fix.

Contributor

@justinvyu justinvyu left a comment


Thanks for fixing the test!

Can we also update the entry in this doc? https://docs.ray.io/en/latest/tune/api/env.html

Btw, I found the original source of this clear_checkpoint behavior. I think it was a catch-all fix for checkpoints that were held in-memory in the object store (which is no longer a thing). These in-memory checkpoints could be lost on node failure, which would cause the restoration to fail. Then, in this situation, it kind of made sense to restart from scratch since the in-memory checkpoint cannot be found.

TL;DR: We are safe to remove the clear checkpoint functionality, since it was a patch for SUPER-LEGACY constraints.

@hongpeng-guo
Contributor Author

> Can we also update the entry in this doc? https://docs.ray.io/en/latest/tune/api/env.html

Sure, but I don't think this change modifies the original definition of the env var `TUNE_RESTORE_RETRY_NUM` as specified in the doc. We removed the `clear_checkpoint` function, which is not part of its documented definition either.

BTW, there is a readthedocs failure on this PR. Could it be related to anything I'm missing? cc @justinvyu

@justinvyu
Contributor

Not sure what happened with that build, looks like it succeeded on this one.

@justinvyu justinvyu enabled auto-merge (squash) November 6, 2024 18:57
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 6, 2024
@justinvyu justinvyu merged commit a1bb4a4 into ray-project:master Nov 6, 2024
6 of 7 checks passed
@hongpeng-guo hongpeng-guo deleted the hpguo/v2/tune_restore_fix branch November 6, 2024 22:21
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…ation error handling. (ray-project#48532)

This PR removes the `clear_checkpoint` function,
so that Tune doesn't try to "restart trials from scratch".
`clear_checkpoint` solved for a legacy use case that doesn't
apply anymore, and "restoration failures" are also now an
edge case for function Trainables and Ray Train usage.

---------

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…ation error handling. (ray-project#48532)

This PR removes the `clear_checkpoint` function,
so that Tune doesn't try to "restart trials from scratch".
`clear_checkpoint` solved for a legacy use case that doesn't
apply anymore, and "restoration failures" are also now an
edge case for function Trainables and Ray Train usage.

---------

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>