[AIR/Train] Add `Trainer.restore` API for Train experiment-level fault tolerance
#31920
Conversation
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
```python
        return trainer_init_config

    @classmethod
    def restore(
```
It kinda bothers me that we now have to create a custom `restore()` for every single Trainer we have.
Nobody would understand how to contribute a new Trainer if we do this ...
What about: allow the user to re-specify everything in `__init__` except for the run config. We can inspect the arguments of `cls.__init__` to see what's allowed, then just overwrite everything that's passed in. With that, the subclasses don't need to re-implement `restore`.
One downside is that validation will be harder, and the signature of `restore` will be something like `def restore(cls, **kwargs):`, which is very opaque.
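A minimal sketch of that `__init__`-inspection idea using the stdlib `inspect` module (the class and argument names here are illustrative stand-ins, not Ray's actual implementation):

```python
import inspect


class BaseTrainer:
    def __init__(self, run_config=None, scaling_config=None, datasets=None):
        self.run_config = run_config
        self.scaling_config = scaling_config
        self.datasets = datasets

    @classmethod
    def restore(cls, path, **kwargs):
        # Only arguments that cls.__init__ accepts (minus run_config) may be
        # re-specified on restore.
        allowed = set(inspect.signature(cls.__init__).parameters) - {"self", "run_config"}
        unexpected = set(kwargs) - allowed
        if unexpected:
            raise ValueError(f"Cannot re-specify on restore: {sorted(unexpected)}")
        trainer = cls()  # toy stand-in: real code would unpickle the trainer from `path`
        for name, value in kwargs.items():
            setattr(trainer, name, value)  # overwrite everything that's passed in
        return trainer
```

Subclasses inherit this `restore` for free, at the cost of the opaque `**kwargs` signature noted above.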
Latest iteration: the user subclass doesn't need to set any attributes themselves. They just need to define which of their constructor's arguments can get re-specified, then pass these through as kwargs to `BaseTrainer.restore`:

```python
class MyTrainerSubclass(BaseTrainer):
    @classmethod
    def restore(cls, path, my_arg, ...):
        return super().restore(path, my_arg=my_arg, ...)
        # No more setting things afterwards.
```
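A runnable toy version of that pattern (these are illustrative classes, not Ray's actual code; `train_loop_per_worker` is just an example of a re-specifiable argument):

```python
class BaseTrainer:
    @classmethod
    def restore(cls, path, **kwargs):
        trainer = cls.__new__(cls)  # toy: real code would unpickle from `path`
        trainer.restored_from = path
        for name, value in kwargs.items():
            setattr(trainer, name, value)  # overwrite re-specified attributes
        return trainer


class MyTrainerSubclass(BaseTrainer):
    @classmethod
    def restore(cls, path, train_loop_per_worker=None):
        # The explicit signature keeps restore() discoverable for this subclass;
        # the body just forwards everything to the base class as kwargs.
        return super().restore(path, train_loop_per_worker=train_loop_per_worker)
```

The subclass contributes only a signature; all restore mechanics stay in the base class.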
this is nice!
Thanks! Please ping me when tests pass.
…t tolerance (ray-project#31920) This PR introduces a `Trainer.restore` API that resumes a Train experiment that crashed/got interrupted. Previously, `resume_from_checkpoint` only allowed starting a _new_ experiment, which writes to a completely new log directory and a new set of results. Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Signed-off-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Why are these changes needed?
This PR introduces a `Trainer.restore` API that resumes a Train experiment that crashed or got interrupted. Previously, `resume_from_checkpoint` only allowed starting a *new* experiment, which writes to a completely new log directory and a new set of results.
Context
The restoration API is inconsistent between Train and Tune, since Tune already provides `Tuner.restore`. It's not clear how to restore an interrupted Train experiment, and the existing `resume_from_checkpoint` API has issues as reported by users (see here [github, discourse, github]).
Added Documentation
TODO before merging
… the `trainer.fit` call.
Future Todos
Dealing with `ActorHandle`
TL;DR: Object references don't immediately throw an exception when unpickled from a new Ray cluster, but actor handles do.
If an `ActorHandle` gets captured in the scope of a training function/config and gets pickled along with the trainer, then it will be impossible to load the pickled trainer from a new Ray cluster. You will run into an error that looks like:
(error output not captured here)
We can't really do anything about this; this deserialization-in-a-new-cluster behavior needs to be fixed by Core.
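The failure mode above can be imitated with plain `pickle` (a loose analogy only: `FakeActorHandle` and its cluster check are invented stand-ins, not Ray's serialization code). The point is that pickling succeeds, and the error only surfaces when unpickling on a different "cluster":

```python
import pickle


class FakeActorHandle:
    """Toy stand-in for a Ray ActorHandle: pickles fine, but on unpickle it
    tries to re-attach to the cluster it was created on."""

    CLUSTER_ID = "cluster-A"  # simulates which cluster the handle belongs to

    def __getstate__(self):
        return {"cluster_id": FakeActorHandle.CLUSTER_ID}

    def __setstate__(self, state):
        if state["cluster_id"] != FakeActorHandle.CLUSTER_ID:
            # Mirrors the deserialization failure described above
            raise RuntimeError("actor handle belongs to a different cluster")
        self.__dict__.update(state)


# Capturing the handle in the trainer's config pickles it along with the trainer.
blob = pickle.dumps({"train_loop_config": {"helper": FakeActorHandle()}})
pickle.loads(blob)  # same "cluster": loads fine, no error at pickle time

FakeActorHandle.CLUSTER_ID = "cluster-B"  # simulate restoring on a new cluster
try:
    pickle.loads(blob)
except RuntimeError as err:
    print("restore failed:", err)
```

This is why the pickled trainer can look perfectly healthy until someone actually calls `restore` from a fresh cluster.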
Related issue number
Checks
I've signed off every commit (using `git commit -s`) in this PR.
I've run `scripts/format.sh` to lint the changes in this PR.