
[AIR/Train] Add Trainer.restore API for train experiment-level fault tolerance #31920

Merged Feb 17, 2023 (94 commits)

Commits
36b607c
Add API skeleton for trainer restore
justinvyu Jan 24, 2023
864f2b1
Fix some errors (imports, wrong return type)
justinvyu Jan 24, 2023
2af61de
Save trainer.pkl
justinvyu Jan 24, 2023
fd93dae
Add object ref check utility
justinvyu Jan 24, 2023
bbbf5fa
Add `test_trainer_restore`
justinvyu Jan 24, 2023
0e10972
Add preprocessor loading in restore+no new preprocessor case (and als…
justinvyu Jan 24, 2023
ded1d27
Fix typo
justinvyu Jan 24, 2023
d12e155
Change test to check exception type
justinvyu Jan 24, 2023
0a4bec3
Fix training failed error capture
justinvyu Jan 24, 2023
a9f10bb
Fix should fit preprocessor logic for new train run
justinvyu Jan 25, 2023
149863b
Improve check object refs (search for actor handles too)
justinvyu Jan 25, 2023
4ace2c9
Add more unit tests (gbdt, obj ref in train loop/config, obj ref in p…
justinvyu Jan 25, 2023
73d1fde
Remove unused imports
justinvyu Jan 25, 2023
847978e
Fix HF test function if no eval dataset is passed
justinvyu Jan 25, 2023
e481610
Add trainer w/ init tests + fix gbdt tests
justinvyu Jan 25, 2023
cfb3859
Fix lightgbm test
justinvyu Jan 25, 2023
4f3caf1
Add to bazel build file
justinvyu Jan 25, 2023
d180135
Disable mosaic trainer restore functionality
justinvyu Jan 25, 2023
d5ab478
Add config validation
justinvyu Jan 25, 2023
41b2e90
Fix case where datasets = None
justinvyu Jan 25, 2023
5de0b9a
Add test restoring from a different trainer class
justinvyu Jan 25, 2023
56c80e4
Fix gbdt test assertion
justinvyu Jan 25, 2023
8c59952
Add new args to dummy trainer used in ingest tests
justinvyu Jan 25, 2023
202ea18
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Jan 25, 2023
1cb225b
Change to specifying optional restore fields
justinvyu Jan 25, 2023
8638d75
Fix for HF special case
justinvyu Jan 25, 2023
484a873
Clean up mosaic restore args
justinvyu Jan 25, 2023
b98ae79
Add error message for invalid restore kwargs
justinvyu Jan 25, 2023
61af0ad
Fix 'should fit preprocessor' logic
justinvyu Jan 25, 2023
396fa43
Fix missing preprocessor import
justinvyu Jan 25, 2023
4bbd79c
Revert "Add to bazel build file"
justinvyu Jan 25, 2023
7721ac8
Add back to build file without formatting
justinvyu Jan 25, 2023
a486c4e
Revert "Fix for HF special case"
justinvyu Jan 26, 2023
30393d2
Revert "Change to specifying optional restore fields"
justinvyu Jan 26, 2023
2ce1c31
Remove validation logic
justinvyu Jan 27, 2023
ecbcd1a
Fix unit tests
justinvyu Jan 27, 2023
5c03050
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Jan 27, 2023
74d99cc
Simplify skipping preprocessor fit logic
justinvyu Jan 27, 2023
3ac12e2
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Jan 30, 2023
964ebe5
Add BaseTrainer.can_restore
justinvyu Jan 30, 2023
f55f5cd
Always save the trainer pkl, even on restore
justinvyu Jan 30, 2023
d885f15
Fill in restore error messages
justinvyu Jan 30, 2023
b070b01
Add tests for can restore utility and restoring from invalid dir
justinvyu Jan 30, 2023
e2ff5a9
Add docstrings for tests
justinvyu Jan 30, 2023
9adcd38
New way of doing restore where param_dict actually gets updated
justinvyu Jan 31, 2023
1f9db34
Update tests to actually catch param dict not being updated
justinvyu Jan 31, 2023
b3d8d0e
Fix loading logic for restored + re-specified preprocessor
justinvyu Jan 31, 2023
0433128
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Jan 31, 2023
ea122ad
Add restore docstrings
justinvyu Jan 31, 2023
8abe4a7
Revert "Fix training failed error capture"
justinvyu Jan 31, 2023
60cff3f
Fix tests after reverting error handling change
justinvyu Jan 31, 2023
62aed3e
Remove duplicate api ref
justinvyu Jan 31, 2023
3945c70
Fix preprocessor not found error
justinvyu Jan 31, 2023
f7c341a
Fix test failures
justinvyu Jan 31, 2023
72cf533
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Jan 31, 2023
28913f3
Update to raise value errors instead of asserts
justinvyu Jan 31, 2023
5080cb3
Update test to expect value errors
justinvyu Jan 31, 2023
543b2a8
Fix lint
justinvyu Jan 31, 2023
8fabbb7
Expand user path in can restore utility
justinvyu Feb 1, 2023
c5eeb53
Explicit ray cluster shutdown in tests
justinvyu Feb 1, 2023
1d1e9ba
Fix typo (fit_status -> fit_status())
justinvyu Feb 1, 2023
ca2a476
Add a comment about BaseTrainer._save
justinvyu Feb 1, 2023
c2e24b0
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Feb 1, 2023
5a75eaa
Apply suggestions from code review
justinvyu Feb 1, 2023
63a7dc3
Add rl trainer restore test
justinvyu Feb 1, 2023
1ed3fc2
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Feb 1, 2023
9c0d8a8
Update trainable param name in tuner restore
justinvyu Feb 1, 2023
43153db
Merge branch 'train/restore' of https://github.com/justinvyu/ray into…
justinvyu Feb 1, 2023
3f17166
Make can restore consistent with the other pr
justinvyu Feb 1, 2023
1f8132e
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Feb 1, 2023
334a69a
Add api stability for session
justinvyu Feb 1, 2023
9f6e04c
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Feb 8, 2023
0529739
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Feb 15, 2023
410a34f
Re-specify param space in trainer restore + reenable test
justinvyu Feb 15, 2023
d1c2e11
Improve comments
justinvyu Feb 15, 2023
4196ce3
Add trainer restore to API ref
justinvyu Feb 15, 2023
4e1f532
Add FAQ post
justinvyu Feb 15, 2023
1a415bc
Convert BaseTrainer example to be framework agnostic
justinvyu Feb 15, 2023
d865435
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Feb 16, 2023
e1b41e8
Improve faq + make example work
justinvyu Feb 16, 2023
859ecf3
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Feb 16, 2023
da94da8
Add typing imports
justinvyu Feb 16, 2023
5c5c009
Remove accidentally included files
justinvyu Feb 16, 2023
a5fde62
Remove ipdb (oops)
justinvyu Feb 16, 2023
4b53d64
Fix duplicate HFTrainer.restore ref
justinvyu Feb 16, 2023
726cf81
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Feb 16, 2023
ae7568e
Shouldn't check for subclass
justinvyu Feb 16, 2023
8b4a876
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Feb 16, 2023
a7e4336
Convert error to warning instead + try/catch new trainer instantiation
justinvyu Feb 17, 2023
61bb672
Add public api alpha decorators
justinvyu Feb 17, 2023
5a62a5e
Update unit test to check for warning
justinvyu Feb 17, 2023
f47970f
Remove trailing header chars
justinvyu Feb 17, 2023
94058a0
Merge branch 'master' of https://github.com/ray-project/ray into trai…
justinvyu Feb 17, 2023
4589b52
Update docstring example to define trainer subclass inline
justinvyu Feb 17, 2023
54 changes: 50 additions & 4 deletions doc/source/train/api/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -26,8 +26,8 @@ Trainer Base Classes
~train.data_parallel_trainer.DataParallelTrainer
~train.gbdt_trainer.GBDTTrainer

``BaseTrainer`` Methods
************************
``BaseTrainer`` API
*******************

.. autosummary::
:toctree: doc/
Expand All @@ -40,7 +40,7 @@ Trainer Base Classes


Train Backend Base Classes
~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _train-backend:
.. _train-backend-config:
Expand Down Expand Up @@ -170,10 +170,56 @@ Mosaic


Reinforcement Learning (RLlib)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: doc/

~train.rl.RLTrainer
~train.rl.RLCheckpoint


.. _trainer-restore:

Ray Train Experiment Restoration
--------------------------------

.. autosummary::
:toctree: doc/

train.trainer.BaseTrainer.restore

.. note::

All trainer classes have a `restore` method that takes in a path
pointing to the directory of the experiment to be restored.
`restore` also exposes a subset of constructor arguments that can be re-specified.
See :ref:`train-framework-specific-restore`
below for details on `restore` arguments for different AIR trainer integrations.

.. _train-framework-specific-restore:

Restoration API for Built-in Trainers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: doc/

train.data_parallel_trainer.DataParallelTrainer.restore

.. autosummary::

train.huggingface.HuggingFaceTrainer.restore

.. note::

`TorchTrainer.restore`, `TensorflowTrainer.restore`, and `HorovodTrainer.restore`
can take in the same parameters as their parent class's
:meth:`DataParallelTrainer.restore <ray.train.data_parallel_trainer.DataParallelTrainer.restore>`.

Unless otherwise specified, other trainers will accept the same parameters as
:meth:`BaseTrainer.restore <ray.train.trainer.BaseTrainer.restore>`.

.. seealso::

See :ref:`train-restore-faq` for more details on when and how trainer restore should be used.
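The fallback rule described in the notes above ("unless otherwise specified, other trainers accept the same parameters as ``BaseTrainer.restore``") can be sketched as a lookup table. Argument names below are taken from this PR's docstrings and are illustrative only; the generated API reference is authoritative.

```python
# Hypothetical summary of which constructor arguments each trainer's
# ``restore`` classmethod lets you re-specify. Names mirror this PR's
# docs and may drift across Ray versions; verify against the API reference.
BASE_RESTORE_ARGS = ["datasets", "preprocessor", "scaling_config"]

RESTORE_ARGS = {
    "DataParallelTrainer": BASE_RESTORE_ARGS
    + ["train_loop_per_worker", "train_loop_config"],
    "HuggingFaceTrainer": BASE_RESTORE_ARGS
    + ["trainer_init_per_worker", "trainer_init_config"],
}

# Per the note above, the DataParallelTrainer subclasses share its signature.
for name in ("TorchTrainer", "TensorflowTrainer", "HorovodTrainer"):
    RESTORE_ARGS[name] = RESTORE_ARGS["DataParallelTrainer"]


def restorable_args(trainer_cls_name: str) -> list:
    """Trainers without an override fall back to BaseTrainer.restore."""
    return RESTORE_ARGS.get(trainer_cls_name, BASE_RESTORE_ARGS)
```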
102 changes: 102 additions & 0 deletions doc/source/train/faq.rst
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,108 @@ you can initialize the ``Trainer`` with ``resources_per_worker`` specified in ``
currently assume each worker is allocated exactly 1 GPU. The partial GPU and multi GPU use-cases
can still be run with Ray Train today without these functions.

.. _train-restore-faq:

How do I restore a Ray Train experiment?
----------------------------------------

A Train experiment may be interrupted due to one of the following reasons:

- The experiment was manually interrupted (e.g., Ctrl+C, or pre-empted head node instance).
- The head node crashed (e.g., OOM or some other runtime error).
- The entire cluster went down (e.g., network error affecting all nodes).

In these cases, a Trainer :ref:`can be restored <trainer-restore>` for the experiment to resume.

Since this is applicable to all of Ray Train's built-in trainers,
we'll use `FrameworkTrainer` to refer to a generic trainer for the remainder of this answer.

To restore an experiment, first find the experiment directory that your previous
run was saved to. If you saved locally, this will look like ``{local_dir}/{name}``,
where ``local_dir`` may be ``~/ray_results``, and ``name`` is something
like ``FrameworkTrainer_2023-xxx``.

Note that these are the same parameters that you pass through :class:`~ray.air.RunConfig`.

.. code-block:: python

datasets = {"train": ray.data.from_items([{"x": i, "y": 2 * i} for i in range(10)])}

restored_trainer = FrameworkTrainer.restore(
path="~/ray_results/FrameworkTrainer_2023-02-15_00-46-58",
datasets=datasets,
)

It's also possible to restore from a remote path (e.g., an experiment directory
stored in an S3 bucket).

.. code-block:: python

datasets = {"train": ray.data.from_items([{"x": i, "y": 2 * i} for i in range(10)])}

restored_trainer = FrameworkTrainer.restore(
path="s3://results-bucket/FrameworkTrainer_2023-02-15_00-46-58",
datasets=datasets,
)

.. note::

`FrameworkTrainer.restore` may allow more parameters to be re-specified depending
on which trainer you're using. See :ref:`train-framework-specific-restore` for more details.


Single Script for Automatic Restoration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Adding the branching logic below will allow you to run the same script after an interruption,
picking up training where you left off on the previous run. Notice that we use the
:meth:`FrameworkTrainer.can_restore <ray.train.trainer.BaseTrainer.can_restore>` utility method
to determine the existence and validity of the given experiment directory.

.. code-block:: python

# run_train_experiment.py

# Load datasets, define a preprocessor, etc.
# datasets = { ... }
# preprocessor = ...

experiment_name = "train_experiment"
experiment_dir = f"~/ray_results/{experiment_name}"

if FrameworkTrainer.can_restore(experiment_dir):
trainer = FrameworkTrainer.restore(
experiment_dir,
datasets=datasets,
)
else:
trainer = FrameworkTrainer(
datasets=datasets,
preprocessor=preprocessor,
scaling_config=air.ScalingConfig(num_workers=2, use_gpu=False),
run_config=air.RunConfig(
name=experiment_name,
local_dir="~/ray_results",
failure_config=air.FailureConfig(max_failures=3),
stop={"training_iteration": 10},
),
)

.. seealso::

See the :meth:`BaseTrainer.restore <ray.train.trainer.BaseTrainer.restore>` docstring
for a full example.

.. note::

`FrameworkTrainer.restore` is different from
:class:`FrameworkTrainer(..., resume_from_checkpoint=...) <ray.train.trainer.BaseTrainer>`.
`resume_from_checkpoint` is meant to be used to start a *new* Train experiment,
which writes results to a new directory and starts over from iteration 0.

`FrameworkTrainer.restore` is used to continue an existing experiment, where
new results will continue to be appended to existing logs.
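The distinction in the note above can be illustrated with a toy simulation (not the Ray API): ``restore`` continues the same experiment and log, while ``resume_from_checkpoint`` starts a fresh experiment from iteration 0. All names here are stand-ins invented for illustration.

```python
# Conceptual sketch only: models how iteration counts and logs behave
# under restore vs. resume_from_checkpoint. Not Ray code.
class FakeExperiment:
    """Toy model of an experiment directory with per-iteration results."""

    def __init__(self):
        self.results = []  # appended log entries
        self.iteration = 0

    def run(self, num_iterations):
        for _ in range(num_iterations):
            self.iteration += 1
            self.results.append(f"iter_{self.iteration}")
        return self


def restore(previous: FakeExperiment) -> FakeExperiment:
    # Like FrameworkTrainer.restore: same experiment, same logs,
    # iteration count continues where it left off.
    return previous


def resume_from_checkpoint(previous: FakeExperiment) -> FakeExperiment:
    # Like FrameworkTrainer(..., resume_from_checkpoint=...): a *new*
    # experiment that only reuses checkpointed state; results go to a
    # fresh directory and iteration restarts at 0.
    return FakeExperiment()


first = FakeExperiment().run(3)               # interrupted after 3 iterations
continued = restore(first).run(2)             # iterations 4 and 5, same log
fresh = resume_from_checkpoint(first).run(2)  # iterations 1 and 2, new log
```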


My multi-node PyTorch GPU training is hanging or giving me obscure NCCL errors. What do I do?
---------------------------------------------------------------------------------------------
Expand Down
12 changes: 12 additions & 0 deletions python/ray/train/BUILD
Original file line number Diff line number Diff line change
Expand Up @@ -584,6 +584,18 @@ py_test(
deps = [":train_lib"]
)

py_test(
name = "test_trainer_restore",
size = "medium",
srcs = ["tests/test_trainer_restore.py"],
tags = [
"exclusive",
"ray_air",
"team:ml",
],
deps = [":train_lib"],
)

# This is a dummy test dependency that causes the above tests to be
# re-run if any of these files changes.
py_library(
Expand Down
11 changes: 8 additions & 3 deletions python/ray/train/_internal/dataset_spec.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,12 +5,12 @@
from ray.air.config import DatasetConfig

from ray.data import Dataset, DatasetPipeline
from ray.data.preprocessor import Preprocessor
from ray.data.preprocessors import Chain
from ray.air._internal.util import _estimate_avail_object_store_memory

if TYPE_CHECKING:
from ray.data import DatasetIterator
from ray.data.preprocessor import Preprocessor

RayDataset = Union["Dataset", "DatasetPipeline"]

Expand Down Expand Up @@ -113,7 +113,9 @@ def __init__(self, dataset_config: Dict[str, DatasetConfig]):
self.preprocessor: Optional["Preprocessor"] = None

def preprocess_datasets(
self, prep: "Preprocessor", datasets: Dict[str, "Dataset"]
self,
prep: "Preprocessor",
datasets: Dict[str, "Dataset"],
) -> Dict[str, "Dataset"]:
"""Preprocess the given datasets.

Expand Down Expand Up @@ -142,7 +144,10 @@ def preprocess_datasets(
continue
if conf.fit:
ds_to_fit = datasets[k]
if ds_to_fit:
if ds_to_fit and prep.fit_status() in (
Preprocessor.FitStatus.NOT_FITTED,
Preprocessor.FitStatus.PARTIALLY_FITTED,
):
Comment on lines +147 to +150

justinvyu (Contributor, Author) commented on Jan 27, 2023:
Is this a controversial change? A loaded preprocessor that's been fitted already shouldn't fit again. This way, I don't need to pass a fit_preprocessor=False flag in.

Reviewer (Member) replied:
seems reasonable. can you make this an actual comment for this logic?
prep.fit(ds_to_fit)
new_datasets = {}

Expand Down
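The guard discussed in the review thread above can be sketched in isolation: a preprocessor loaded from a restored run reports itself as fitted, so ``preprocess_datasets`` should skip re-fitting it. The classes below are minimal mocks that mirror the shape of ``ray.data.preprocessor.Preprocessor``, not the real implementation.

```python
import enum


class Preprocessor:
    """Minimal stand-in for ray.data.preprocessor.Preprocessor."""

    class FitStatus(str, enum.Enum):
        NOT_FITTABLE = "NOT_FITTABLE"
        NOT_FITTED = "NOT_FITTED"
        PARTIALLY_FITTED = "PARTIALLY_FITTED"
        FITTED = "FITTED"

    def __init__(self):
        self._fitted = False

    def fit_status(self):
        if self._fitted:
            return Preprocessor.FitStatus.FITTED
        return Preprocessor.FitStatus.NOT_FITTED

    def fit(self, ds):
        self._fitted = True


def maybe_fit(prep: Preprocessor, ds_to_fit) -> bool:
    """Mirror the new check: only fit when a dataset exists and the
    preprocessor is not already fully fitted. Returns True if fit ran."""
    if ds_to_fit and prep.fit_status() in (
        Preprocessor.FitStatus.NOT_FITTED,
        Preprocessor.FitStatus.PARTIALLY_FITTED,
    ):
        prep.fit(ds_to_fit)
        return True
    return False
```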