
[no_early_kickoff] [Train] Recommend Trainer.restore on errors raised by trainer.fit() #33610

Merged
merged 22 commits into ray-project:master on Mar 28, 2023

Conversation

@justinvyu (Contributor) commented Mar 23, 2023

Why are these changes needed?

Currently, TuneErrors that recommend Tuner.restore get propagated to users even when they are only using Trainer.fit. This PR fixes that by wrapping the raised errors in a TrainingFailedError whose message explains how to restore the run and how to configure a new run to retry on training failures. It also moves the AIR "error propagation" tests into a separate file.
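A minimal sketch of the intended user-facing behavior (hypothetical script; the exact module paths, such as ray.train.base_trainer.TrainingFailedError, and the restore API can differ across Ray versions):

# Hypothetical example: a failure inside the training loop should now
# surface from trainer.fit() as a TrainingFailedError whose message
# points at Trainer.restore and at retry configuration for a new run.
from ray.air.config import ScalingConfig
from ray.train.base_trainer import TrainingFailedError
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    raise RuntimeError("simulated failure inside the training loop")

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)

try:
    trainer.fit()
except TrainingFailedError as e:
    # The wrapped error message explains how to resume the interrupted run
    # (e.g. via TorchTrainer.restore(<experiment path>)) or how to retry
    # failures on a fresh run.
    print(f"Training failed: {e}")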

Notes on the TODOs

I added a few TODOs in the code regarding "driver error propagation" in Tune. Currently, errors that happen in the Tune driver (ex: within user defined callback hooks) get surfaced to the user in different ways.

For example (see the callback sketch after this list):

  • An error within on_trial_result will surface as a TuneError(TuneError(OriginalError))
  • An error within on_checkpoint will not surface at all. It'll just log a warning.
  • An error within on_step_begin will surface as OriginalError.
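To make the hook names above concrete, here is a hedged sketch of a user-defined Tune callback whose hooks raise; the comments restate the current (inconsistent) propagation behavior described in the list. Signatures follow ray.tune.Callback and may vary slightly by Ray version.

from ray.tune import Callback

class FlakyCallback(Callback):
    # Illustration only: each hook raises so the differing propagation
    # paths listed above can be observed.

    def on_step_begin(self, iteration, trials, **info):
        # Currently surfaces to the user as the original error.
        raise RuntimeError("error in on_step_begin")

    def on_trial_result(self, iteration, trials, trial, result, **info):
        # Currently surfaces as TuneError(TuneError(OriginalError)).
        raise RuntimeError("error in on_trial_result")

    def on_checkpoint(self, iteration, trials, trial, checkpoint, **info):
        # Currently not surfaced at all; only a warning is logged.
        raise RuntimeError("error in on_checkpoint")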

This needs to be fixed in 2.5 (all hooks should handle errors in the same way), tracker issue here: TODO.

For the purposes of 2.4, wrapping errors raised by trainer.fit() in a TrainingFailedError (which already existed) is good enough to fix the issue of "Tune concepts showing up when you're not even using Tune."

Related issue number

Closes #33566

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(


assert len(result_grid) == 1
result = result_grid[0]
if result.error:
A reviewer (Member) commented on this snippet:
Can you explain the difference between result.error and the errors captured in tuner.fit()? Should we have a unified way to handle them?

@justinvyu (Contributor, author) replied:
  • result.error: an error that happens in the trainable (e.g. a runtime error in the per-worker training loop -> the trainer actor raises -> Tune records the error against the trial -> it can be accessed via result.error).
  • tuner.fit(): an error that happens in the Tune driver itself (i.e. in the execution loop running on the driver node).

These should be handled separately because a driver error should crash the entire experiment execution, whereas an individual trial error should not crash other ongoing trials, unless otherwise configured.

In the trainer.fit() case, there is only one trial, so the full experiment will crash on a trial failure. (A sketch of the distinction follows.)
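A rough sketch of the distinction (hypothetical usage; the exact exception type raised from the driver depends on the Ray version and is part of what this PR cleans up for trainer.fit()):

from ray import tune

def train_fn(config):
    # A failure inside the trainable is recorded per trial,
    # not raised from tuner.fit() itself.
    raise RuntimeError("failure inside the trainable")

tuner = tune.Tuner(train_fn)

try:
    result_grid = tuner.fit()
except Exception as driver_error:
    # Errors raised here come from the Tune driver loop itself
    # (e.g. a user-defined callback hook), not from an individual trial.
    print("Driver failed with:", driver_error)
    raise
else:
    # Trial-level failures are exposed on each Result via result.error.
    for result in result_grid:
        if result.error:
            print("Trial failed with:", result.error)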

python/ray/train/base_trainer.py: review comment (outdated, resolved)
@woshiyyya (Member) left a comment:
lgtm, just a few minor comments.

@justinvyu justinvyu requested a review from gjoliver March 24, 2023 19:28
@Yard1 (Member) left a comment:
Looks good to me, thanks!

@justinvyu justinvyu changed the title [Train] Recommend Trainer.restore on errors raised by trainer.fit() [no_early_kickoff] [Train] Recommend Trainer.restore on errors raised by trainer.fit() Mar 24, 2023
@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Mar 25, 2023
@gjoliver gjoliver merged commit 9b8d4ce into ray-project:master Mar 28, 2023
@justinvyu justinvyu deleted the train/errortype branch April 10, 2023 16:25
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Development

Successfully merging this pull request may close these issues.

[AIR] TuneError shouldn't get propagated when using Train only
5 participants