
[AIR/Train] Add Trainer.restore API for train experiment-level fault tolerance #31920

Merged (94 commits, Feb 17, 2023)

Conversation

justinvyu
Contributor

@justinvyu justinvyu commented Jan 25, 2023

Why are these changes needed?

This PR introduces a Trainer.restore API that resumes a Train experiment that crashed/got interrupted. Previously, resume_from_checkpoint only allowed starting a new experiment, which writes to a completely new log directory and a new set of results.
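A toy illustration of the difference in log-directory semantics (plain Python, not Ray code — the directory layout and helper names are hypothetical): starting a new experiment allocates a fresh results directory, while restoring continues in the existing one.

```python
import os
import tempfile

def new_experiment(results_root, name):
    """resume_from_checkpoint semantics: always writes to a brand-new log dir."""
    i = 0
    while os.path.exists(os.path.join(results_root, f"{name}_{i}")):
        i += 1
    path = os.path.join(results_root, f"{name}_{i}")
    os.makedirs(path)
    return path

def restore_experiment(results_root, name):
    """Trainer.restore semantics: continue in the existing experiment dir."""
    path = os.path.join(results_root, f"{name}_0")
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    return path

root = tempfile.mkdtemp()
first = new_experiment(root, "exp")        # original run
second = new_experiment(root, "exp")       # crash + resume_from_checkpoint: new dir
resumed = restore_experiment(root, "exp")  # crash + Trainer.restore: same dir
print(first == resumed, first == second)   # True False
```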

Context

The restoration API is inconsistent between Train and Tune: Tune already exposes Tuner.restore, while Train has no equivalent. It's not clear how to restore an interrupted Train experiment, and the existing resume_from_checkpoint API has issues reported by users (see here [github, discourse, github]).

Added Documentation

TODO before merging

  • Merge Jun's PR, overwrite the param space, and do the shutdown/init in the unit test to make sure that it doesn't hang/crash.
  • Add to docs
  • Fix restore for user-defined Trainer subclasses that are defined in the same scope as the trainer.fit call.

Future Todos

Dealing with ActorHandle

TL;DR: Object references don't immediately throw an exception when unpickled in a new Ray cluster, but actor handles do.

If an ActorHandle gets captured in the scope of a training function/config and pickled along with the trainer, it becomes impossible to load the pickled trainer from a new Ray cluster. You will run into an error like:

Traceback (most recent call last):
  File "train_example_nfs.py", line 85, in <module>
    trainer = DataParallelTrainer.restore(
  File "/Users/justin/Developer/justinvyu-dev/python/ray/train/data_parallel_trainer.py", line 315, in restore
    return super(DataParallelTrainer, cls).restore(
  File "/Users/justin/Developer/justinvyu-dev/python/ray/train/base_trainer.py", line 281, in restore
    original_trainer = pickle.load(fp)
  File "/Users/justin/Developer/justinvyu-dev/python/ray/_private/serialization.py", line 89, in _actor_handle_deserializer
    return ray.actor.ActorHandle._deserialization_helper(serialized_obj, outer_id)
  File "/Users/justin/Developer/justinvyu-dev/python/ray/actor.py", line 1280, in _deserialization_helper
    return worker.core_worker.deserialize_and_register_actor_handle(
  File "python/ray/_raylet.pyx", line 2310, in ray._raylet.CoreWorker.deserialize_and_register_actor_handle
  File "python/ray/_raylet.pyx", line 2279, in ray._raylet.CoreWorker.make_actor_handle
  File "/Users/justin/Developer/justinvyu-dev/python/ray/_private/function_manager.py", line 522, in load_actor_class
    actor_class = self._load_actor_class_from_gcs(
  File "/Users/justin/Developer/justinvyu-dev/python/ray/_private/function_manager.py", line 617, in _load_actor_class_from_gcs
    class_name = ensure_str(class_name)
  File "/Users/justin/Developer/justinvyu-dev/python/ray/_private/utils.py", line 293, in ensure_str
    assert isinstance(s, bytes)

We can't do anything about this within Train; deserializing actor handles in a new Ray cluster is behavior that needs to be fixed by Core.
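This class of failure can be sketched in miniature with plain pickle. Here a `threading.Lock` is a hypothetical stand-in for process-local state such as an ActorHandle; note the real ActorHandle case is subtler in that pickling succeeds and the failure only surfaces at unpickle time in the new cluster, but the root cause is the same — the captured handle is tied to a specific process/cluster.

```python
import pickle
import threading

class TrainerLike:
    """Hypothetical stand-in for a trainer whose train function/config
    captured a handle in its scope."""
    def __init__(self):
        # A lock is process-local, roughly analogous to an ActorHandle
        # being cluster-local: neither is meaningful outside its origin.
        self.handle = threading.Lock()

trainer = TrainerLike()
try:
    pickle.dumps(trainer)
except TypeError as exc:
    print(f"pickling failed: {exc}")
```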

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
return trainer_init_config

@classmethod
def restore(
Member

It kinda bothers me a lot that we now have to create a custom restore() for every single Trainer we have.
Nobody would understand how to contribute a new Trainer if we do this ...

Contributor Author

What about:
Allow the user to re-specify everything in __init__ except for the run config. We can inspect the arguments of cls.__init__ to see what's allowed, then overwrite everything that's passed in. That way, subclasses don't need to re-implement restore.

One downside is that validation will be harder, and the signature of restore will be something like def restore(cls, **kwargs):, which is very opaque.
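A minimal sketch of what that inspect-based design could look like (all names here are hypothetical placeholders, not the PR's actual implementation):

```python
import inspect

class BaseTrainer:
    # Hypothetical constructor; the real BaseTrainer takes more arguments.
    def __init__(self, train_loop=None, datasets=None, run_config=None):
        self.train_loop = train_loop
        self.datasets = datasets
        self.run_config = run_config

    @classmethod
    def restore(cls, path, **kwargs):
        # Inspect cls.__init__ to find which arguments may be re-specified.
        # The run config comes from the saved experiment, so exclude it.
        allowed = set(inspect.signature(cls.__init__).parameters) - {"self", "run_config"}
        unknown = set(kwargs) - allowed
        if unknown:
            raise ValueError(f"Cannot re-specify on restore: {sorted(unknown)}")
        trainer = cls()  # stand-in for unpickling the saved trainer from `path`
        for name, value in kwargs.items():
            setattr(trainer, name, value)
        return trainer
```

The ValueError path provides some validation, but the `**kwargs` signature indeed tells users nothing about what may be re-specified.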

Contributor Author

Latest iteration: the user subclass doesn't need to set any attributes itself. It just defines which of its constructor arguments can be re-specified, then passes them through as kwargs to BaseTrainer.restore.

class MyTrainerSubclass(BaseTrainer):
    @classmethod
    def restore(cls, path, my_arg, ...):
        # No more setting attributes afterwards.
        return super().restore(path, my_arg=my_arg, ...)
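A runnable sketch of this final pattern (simplified; `my_arg` and the unpickling step are placeholders): the base class applies every re-specified argument itself, and a subclass's restore only declares what it forwards.

```python
class BaseTrainer:
    @classmethod
    def restore(cls, path, **overrides):
        # Stand-in for loading the pickled trainer saved under `path`.
        trainer = cls.__new__(cls)
        trainer.path = path
        # The base class overwrites the re-specified attributes itself,
        # so subclasses set nothing after calling super().restore().
        for name, value in overrides.items():
            setattr(trainer, name, value)
        return trainer

class MyTrainerSubclass(BaseTrainer):
    @classmethod
    def restore(cls, path, my_arg=None):
        # Only declares which constructor args can be re-specified.
        return super().restore(path, my_arg=my_arg)

t = MyTrainerSubclass.restore("/tmp/exp", my_arg=42)
print(type(t).__name__, t.my_arg)  # MyTrainerSubclass 42
```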

Member

this is nice!

@krfricke (Contributor) left a comment:

Thanks! Please ping me when tests pass

@krfricke krfricke merged commit 08d4537 into ray-project:master Feb 17, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…t tolerance (ray-project#31920)

This PR introduces a `Trainer.restore` API that resumes a Train experiment that crashed/got interrupted. Previously, `resume_from_checkpoint` only allowed starting a _new_ experiment, which writes to a completely new log directory and a new set of results.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023