[no_early_kickoff] [Train] Recommend `Trainer.restore` on errors raised by `trainer.fit()` #33610

Conversation
```python
assert len(result_grid) == 1
result = result_grid[0]
if result.error:
```
Can you explain the difference between `result.error` and the errors captured in `tuner.fit()`? Should we have a unified way to handle them?
- `result.error`: an error that happens in the trainable (ex: a runtime error in the per-worker training loop -> the trainer actor raises -> Tune records the error against the trial -> it can be accessed via `result.error`).
- `tuner.fit()`: an error that happens in the Tune driver itself (i.e., in the execution loop running on the driver node).

These should be handled separately, because a driver error should crash the entire experiment execution, whereas individual trial errors should not crash other ongoing trials, unless otherwise configured. In the `trainer.fit` case, there's only 1 trial, so the full experiment will crash on trial failure.
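A minimal sketch of the two error surfaces, assuming the Ray AIR `Tuner`/`Result` API (the trainable here is a placeholder that always fails):

```python
from ray import tune

def failing_trainable(config):
    # Placeholder trainable: fails inside the trial, not in the driver.
    raise RuntimeError("error in the training loop")

tuner = tune.Tuner(failing_trainable)

try:
    result_grid = tuner.fit()  # raises only for errors in the Tune driver itself
except Exception:
    # Driver error: the whole experiment execution has crashed.
    raise

for result in result_grid:
    if result.error:
        # Trial error: recorded per trial without stopping other trials.
        print("Trial failed with:", result.error)
```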
lgtm, just a few minor comments.
Looks good to me, thanks!
Why are these changes needed?

Currently, `TuneError`s that recommend `Tuner.restore` get propagated to users even when they're using `Trainer.fit`. This PR fixes this by wrapping raised errors with a `TrainingFailedError` that includes a message on how to restore the run, and also how to configure a new run to retry on training failures (see the sketch below). This PR also separates AIR "error propagation" tests into a new file.
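A hypothetical sketch of what the wrapped error enables; the trainer setup and experiment path are placeholders, and the exact import location of `TrainingFailedError` may differ by Ray version:

```python
from ray.train.base_trainer import TrainingFailedError  # import path may vary by version
from ray.train.torch import TorchTrainer

try:
    result = trainer.fit()  # `trainer` is a previously configured TorchTrainer
except TrainingFailedError:
    # The error message now points at Trainer.restore instead of Tune concepts.
    restored_trainer = TorchTrainer.restore("~/ray_results/my_experiment")  # placeholder path
    result = restored_trainer.fit()
```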
Notes on the TODOs

I added a few TODOs in the code regarding "driver error propagation" in Tune. Currently, errors that happen in the Tune driver (ex: within user-defined callback hooks) get surfaced to the user in different ways.
For example:

- `on_trial_result` will surface as a `TuneError(TuneError(OriginalError))`.
- `on_checkpoint` will not surface at all; it'll just log a warning.
- `on_step_begin` will surface as `OriginalError`.

This needs to be fixed in 2.5 (all hooks should handle errors in the same way); tracker issue here: TODO.
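For context, a minimal sketch of a user-defined callback that exercises these hooks (the hook signatures follow the Tune `Callback` API; the raised errors are purely illustrative):

```python
from ray.tune import Callback

class FailingCallback(Callback):
    def on_trial_result(self, iteration, trials, trial, result, **info):
        raise ValueError("surfaces as TuneError(TuneError(OriginalError))")

    def on_checkpoint(self, iteration, trials, trial, checkpoint, **info):
        raise ValueError("not surfaced at all; only logged as a warning")

    def on_step_begin(self, iteration, trials, **info):
        raise ValueError("surfaces as the original error")
```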
For the purposes of 2.4, wrapping errors raised by `trainer.fit()` with a `TrainingFailedError` (which already existed) is good enough to fix the issue of "Tune concepts showing up when you're not even using Tune."
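A sketch of the retry configuration that the new error message points users toward, assuming the Ray AIR `FailureConfig` API (`train_func` is a placeholder training function):

```python
from ray.air import RunConfig, FailureConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=train_func,  # placeholder training function
    scaling_config=ScalingConfig(num_workers=2),
    # Retry a failed run up to 3 times before raising TrainingFailedError.
    run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
)
result = trainer.fit()
```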
Related issue number

Closes #33566
Checks

- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.