
[no_early_kickoff] [Train] Recommend Trainer.restore on errors raised by trainer.fit() #33610

Merged
merged 22 commits into ray-project:master on Mar 28, 2023

Conversation

@justinvyu (Contributor) commented Mar 23, 2023

Why are these changes needed?

Currently, TuneErrors that recommend Tuner.restore get propagated to users even when they are only using Trainer.fit. This PR fixes that by wrapping the raised errors in a TrainingFailedError whose message explains how to restore the run and how to configure a new run to retry on training failures. It also moves the AIR "error propagation" tests into a separate file.
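A minimal sketch of the intended user-facing behavior (hypothetical script; the exact module paths, such as ray.train.base_trainer.TrainingFailedError, and the restore API can differ across Ray versions):

# Hypothetical example: a failure inside the training loop should now
# surface from trainer.fit() as a TrainingFailedError whose message
# points at Trainer.restore and at retry configuration for a new run.
from ray.air.config import ScalingConfig
from ray.train.base_trainer import TrainingFailedError
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    raise RuntimeError("simulated failure inside the training loop")

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)

try:
    trainer.fit()
except TrainingFailedError as e:
    # The wrapped error message explains how to resume the interrupted run
    # (e.g. via TorchTrainer.restore(<experiment path>)) or how to retry
    # failures on a fresh run.
    print(f"Training failed: {e}")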

Notes on the TODOs

I added a few TODOs in the code regarding "driver error propagation" in Tune. Currently, errors that happen in the Tune driver (ex: within user defined callback hooks) get surfaced to the user in different ways.

For example (see the callback sketch after this list):

  • An error within on_trial_result will surface as a TuneError(TuneError(OriginalError))
  • An error within on_checkpoint will not surface at all. It'll just log a warning.
  • An error within on_step_begin will surface as OriginalError.
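To make the hook names above concrete, here is a hedged sketch of a user-defined Tune callback whose hooks raise; the comments restate the current (inconsistent) propagation behavior described in the list. Signatures follow ray.tune.Callback and may vary slightly by Ray version.

from ray.tune import Callback

class FlakyCallback(Callback):
    # Illustration only: each hook raises so the differing propagation
    # paths listed above can be observed.

    def on_step_begin(self, iteration, trials, **info):
        # Currently surfaces to the user as the original error.
        raise RuntimeError("error in on_step_begin")

    def on_trial_result(self, iteration, trials, trial, result, **info):
        # Currently surfaces as TuneError(TuneError(OriginalError)).
        raise RuntimeError("error in on_trial_result")

    def on_checkpoint(self, iteration, trials, trial, checkpoint, **info):
        # Currently not surfaced at all; only a warning is logged.
        raise RuntimeError("error in on_checkpoint")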

This needs to be fixed in 2.5 (all hooks should handle errors in the same way), tracker issue here: TODO.

For the purposes of 2.4, wrapping errors raised by trainer.fit() in a TrainingFailedError (which already existed) is good enough to fix the issue of "Tune concepts showing up when you're not even using Tune."

Related issue number

Closes #33566

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(


assert len(result_grid) == 1
result = result_grid[0]
if result.error:
A reviewer (Member) commented on this snippet:
Can you explain the difference between result.error and the errors captured in tuner.fit()? Should we have a unified way to handle them?

@justinvyu (Contributor, author) replied:
  • result.error: an error that happens in the trainable (e.g. a runtime error in the per-worker training loop -> the trainer actor raises -> Tune records the error against the trial -> it can be accessed via result.error).
  • tuner.fit(): an error that happens in the Tune driver itself (i.e. in the execution loop running on the driver node).

These should be handled separately because a driver error should crash the entire experiment execution, whereas an individual trial error should not crash other ongoing trials, unless otherwise configured.

In the trainer.fit() case, there is only one trial, so the full experiment will crash on a trial failure. (A sketch of the distinction follows.)
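A rough sketch of the distinction (hypothetical usage; the exact exception type raised from the driver depends on the Ray version and is part of what this PR cleans up for trainer.fit()):

from ray import tune

def train_fn(config):
    # A failure inside the trainable is recorded per trial,
    # not raised from tuner.fit() itself.
    raise RuntimeError("failure inside the trainable")

tuner = tune.Tuner(train_fn)

try:
    result_grid = tuner.fit()
except Exception as driver_error:
    # Errors raised here come from the Tune driver loop itself
    # (e.g. a user-defined callback hook), not from an individual trial.
    print("Driver failed with:", driver_error)
    raise
else:
    # Trial-level failures are exposed on each Result via result.error.
    for result in result_grid:
        if result.error:
            print("Trial failed with:", result.error)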

python/ray/train/base_trainer.py: review comment (outdated, resolved)
@woshiyyya (Member) left a comment:
lgtm, just a few minor comments.

@justinvyu justinvyu requested a review from gjoliver March 24, 2023 19:28
@Yard1 (Member) left a comment:
Looks good to me, thanks!

@justinvyu justinvyu changed the title [Train] Recommend Trainer.restore on errors raised by trainer.fit() [no_early_kickoff] [Train] Recommend Trainer.restore on errors raised by trainer.fit() Mar 24, 2023
@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Mar 25, 2023
@gjoliver gjoliver merged commit 9b8d4ce into ray-project:master Mar 28, 2023
@justinvyu justinvyu deleted the train/errortype branch April 10, 2023 16:25
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Development

Successfully merging this pull request may close these issues.

[AIR] TuneError shouldn't get propagated when using Train only
5 participants