[doc][tune] Add tune checkpoint user guide. #33145

Merged · 14 commits · Mar 22, 2023
1 change: 1 addition & 0 deletions doc/source/_toc.yml
@@ -162,6 +162,7 @@ parts:
- file: tune/tutorials/tune-stopping
title: "How to Stop and Resume"
- file: tune/tutorials/tune-storage
- file: tune/tutorials/tune-trial-checkpoint
- file: tune/tutorials/tune-metrics
title: "Using Callbacks and Metrics"
- file: tune/tutorials/tune_get_data_in_and_out
10 changes: 5 additions & 5 deletions doc/source/tune/api/schedulers.rst
@@ -172,7 +172,7 @@ This can be enabled by setting the ``scheduler`` parameter of ``tune.TuneConfig`

When the PBT scheduler is enabled, each trial variant is treated as a member of the population.
Periodically, **top-performing trials are checkpointed**
(this requires your Trainable to support :ref:`save and restore <tune-function-checkpointing>`).
(this requires your Trainable to support :ref:`save and restore <tune-trial-checkpoint>`).
**Low-performing trials clone the hyperparameter configurations of top performers and
perturb them** slightly in the hopes of discovering even better hyperparameter settings.
**Low-performing trials also resume from the checkpoints of the top performers**, allowing
@@ -261,7 +261,7 @@ PB2 can be enabled by setting the ``scheduler`` parameter of ``tune.TuneConfig``

When the PB2 scheduler is enabled, each trial variant is treated as a member of the population.
Periodically, top-performing trials are checkpointed (this requires your Trainable to
support :ref:`save and restore <tune-function-checkpointing>`).
support :ref:`save and restore <tune-trial-checkpoint>`).
Low-performing trials clone the checkpoints of top performers and perturb the configurations
in the hope of discovering an even better variation.

@@ -308,9 +308,9 @@ It wraps around another scheduler and uses its decisions.
which will let your model know about the new resources assigned. You can also obtain the current trial resources
by calling ``Trainable.trial_resources``.

* If you are using the functional API for tuning, the current trial resources can be
obtained by calling `tune.get_trial_resources()` inside the training function.
The function should be able to :ref:`load and save checkpoints <tune-function-checkpointing>`
* If you are using the functional API for tuning, obtain the current trial resources by calling
  `tune.get_trial_resources()` inside the training function.
  The function should be able to :ref:`load and save checkpoints <tune-function-trainable-checkpointing>`
(the latter preferably every iteration).

An example of this in use can be found here: :doc:`/tune/examples/includes/xgboost_dynamic_resources_example`.
109 changes: 11 additions & 98 deletions doc/source/tune/api/trainable.rst
@@ -7,7 +7,8 @@
Training in Tune (tune.Trainable, session.report)
=================================================

Training can be done with either a **Function API** (:ref:`session.report <tune-function-docstring>`) or **Class API** (:ref:`tune.Trainable <tune-trainable-docstring>`).
Training can be done with either a **Function API** (:ref:`session.report <tune-function-docstring>`) or
**Class API** (:ref:`tune.Trainable <tune-trainable-docstring>`).

For the sake of example, let's maximize this objective function:

@@ -18,11 +19,11 @@ For the sake of example, let's maximize this objective function:

.. _tune-function-api:

Tune's Function API
-------------------
Function Trainable API
----------------------

The Function API allows you to define a custom training function that Tune will run in parallel Ray actor processes,
one for each Tune trial.
Use the Function API to define a custom training function that Tune runs in Ray actor processes.
Each trial runs in its own actor process, in parallel with the other trials.

The ``config`` argument in the function is a dictionary populated automatically by Ray Tune and corresponding to
the hyperparameters selected for the trial from the :ref:`search space <tune-key-concepts-search-spaces>`.
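
As a minimal, hedged sketch (it assumes the ``ray.air.session`` reporting API and reuses the
example objective from above; the hyperparameter names and ranges are illustrative), a function
trainable reads hyperparameters from ``config`` and reports metrics back to Tune:

.. code-block:: python

    from ray import tune
    from ray.air import session


    def objective(config):
        # ``config`` is filled in by Tune with one sampled point from the search space.
        for step in range(10):
            score = config["a"] * (step ** 2) + config["b"]
            # Report the metric back to Tune once per iteration.
            session.report({"score": score})


    tuner = tune.Tuner(
        objective,
        param_space={"a": tune.uniform(0, 1), "b": tune.uniform(0, 20)},
    )
    results = tuner.fit()
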
@@ -51,36 +52,14 @@ It's also possible to return a final set of metrics to Tune by returning them fr
:start-after: __function_api_return_final_metrics_start__
:end-before: __function_api_return_final_metrics_end__

You'll notice that Ray Tune will output extra values in addition to the user reported metrics,
such as ``iterations_since_restore``. See :ref:`tune-autofilled-metrics` for an explanation/glossary of these values.

.. _tune-function-checkpointing:

Function API Checkpointing
~~~~~~~~~~~~~~~~~~~~~~~~~~

Many Tune features rely on checkpointing, including the usage of certain Trial Schedulers and fault tolerance.
You can save and load checkpoints in Ray Tune in the following manner:

.. literalinclude:: /tune/doc_code/trainable.py
:language: python
:start-after: __function_api_checkpointing_start__
:end-before: __function_api_checkpointing_end__

.. note:: ``checkpoint_frequency`` and ``checkpoint_at_end`` will not work with Function API checkpointing.

In this example, checkpoints will be saved by training iteration to ``<local_dir>/<exp_name>/trial_name/checkpoint_<step>``.

Tune also may copy or move checkpoints during the course of tuning. For this purpose,
it is important not to depend on absolute paths in the implementation of ``save``.
Note that Ray Tune outputs extra values in addition to the user reported metrics,
such as ``iterations_since_restore``. See :ref:`tune-autofilled-metrics` for an explanation of these values.

See :ref:`here for more information on creating checkpoints <air-checkpoint-ref>`.
If using framework-specific trainers from Ray AIR, see :ref:`here <air-trainer-ref>` for
references to framework-specific checkpoints such as `TensorflowCheckpoint`.
See how to configure checkpointing for a function trainable :ref:`here <tune-function-trainable-checkpointing>`.
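
As a quick orientation before that guide, a minimal sketch of saving a trial checkpoint from a
function trainable might look like the following (it assumes the ``ray.air`` ``session`` and
``Checkpoint`` APIs; the linked guide is the authoritative reference):

.. code-block:: python

    from ray.air import session
    from ray.air.checkpoint import Checkpoint


    def train_fn(config):
        for step in range(100):
            # ... one iteration of training ...
            metrics = {"step": step}
            # Attaching a checkpoint to the reported metrics lets Tune persist it.
            session.report(metrics, checkpoint=Checkpoint.from_dict({"step": step}))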

.. _tune-class-api:

Tune's Trainable Class API
Class Trainable API
-------------------

.. caution:: Do not use ``session.report`` within a ``Trainable`` class.
@@ -111,73 +90,7 @@ You'll notice that Ray Tune will output extra values in addition to the user rep
such as ``iterations_since_restore``.
See :ref:`tune-autofilled-metrics` for an explanation/glossary of these values.

.. _tune-trainable-save-restore:

Class API Checkpointing
~~~~~~~~~~~~~~~~~~~~~~~

You can also implement checkpoint/restore using the Trainable Class API:

.. literalinclude:: /tune/doc_code/trainable.py
:language: python
:start-after: __class_api_checkpointing_start__
:end-before: __class_api_checkpointing_end__

You can checkpoint with three different mechanisms: manually, periodically, and at termination.

**Manual Checkpointing**: A custom Trainable can manually trigger checkpointing by returning ``should_checkpoint: True``
(or ``tune.result.SHOULD_CHECKPOINT: True``) in the result dictionary of `step`.
This can be especially helpful in spot instances:

.. code-block:: python

    def step(self):
        # training code
        result = {"mean_accuracy": accuracy}
        if detect_instance_preemption():
            result.update(should_checkpoint=True)
        return result


**Periodic Checkpointing**: periodic checkpointing can be used to provide fault-tolerance for experiments.
This can be enabled by setting ``checkpoint_frequency=<int>`` and ``max_failures=<int>`` to checkpoint trials
every *N* iterations and recover from up to *M* crashes per trial, e.g.:

.. code-block:: python

    tuner = tune.Tuner(
        my_trainable,
        run_config=air.RunConfig(
            checkpoint_config=air.CheckpointConfig(checkpoint_frequency=10),
            failure_config=air.FailureConfig(max_failures=5))
    )
    results = tuner.fit()

**Checkpointing at Termination**: The checkpoint_frequency may not coincide with the exact end of an experiment.
If you want a checkpoint to be created at the end of a trial, you can additionally set the ``checkpoint_at_end=True``:

.. code-block:: python
    :emphasize-lines: 5

    tuner = tune.Tuner(
        my_trainable,
        run_config=air.RunConfig(
            checkpoint_config=air.CheckpointConfig(checkpoint_frequency=10, checkpoint_at_end=True),
            failure_config=air.FailureConfig(max_failures=5))
    )
    results = tuner.fit()


Use ``validate_save_restore`` to catch ``save_checkpoint``/``load_checkpoint`` errors before execution.

.. code-block:: python

    from ray.tune.utils import validate_save_restore

    # both of these should return
    validate_save_restore(MyTrainableClass)
    validate_save_restore(MyTrainableClass, use_object_store=True)

See how to configure checkpointing for a class trainable :ref:`here <tune-class-trainable-checkpointing>`.
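
As a minimal, hedged sketch (the metric and the checkpoint contents are illustrative; the linked
guide covers the full set of options), the ``save_checkpoint``/``load_checkpoint`` pair looks like:

.. code-block:: python

    from ray import tune


    class MyTrainable(tune.Trainable):
        def setup(self, config):
            self.iteration_count = 0

        def step(self):
            self.iteration_count += 1
            return {"mean_accuracy": min(1.0, 0.01 * self.iteration_count)}

        def save_checkpoint(self, checkpoint_dir):
            # Return (or write under ``checkpoint_dir``) only what is needed to resume;
            # avoid absolute paths, since Tune may copy or move checkpoints.
            return {"iteration_count": self.iteration_count}

        def load_checkpoint(self, checkpoint):
            self.iteration_count = checkpoint["iteration_count"]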


Advanced: Reusing Actors in Tune
12 changes: 6 additions & 6 deletions doc/source/tune/key-concepts.rst
@@ -98,12 +98,12 @@ Tune Trials

You use :ref:`Tuner.fit <tune-run-ref>` to execute and manage hyperparameter tuning and generate your `trials`.
At a minimum, your ``Tuner`` call takes in a trainable as first argument, and a ``param_space`` dictionary
to define your search space.
to define the search space.

The ``Tuner.fit()`` function also provides many features such as :ref:`logging <tune-logging>`,
:ref:`checkpointing <tune-function-checkpointing>`, and :ref:`early stopping <tune-stopping-ref>`.
Continuing with the example defined earlier (minimizing ``a (x ** 2) + b``), a simple Tune run with a simplistic
search space for ``a`` and ``b`` would look like this:
:ref:`checkpointing <tune-trial-checkpoint>`, and :ref:`early stopping <tune-stopping-ref>`.
Continuing the earlier example of minimizing ``a (x ** 2) + b``, a simple Tune run with a basic
search space for ``a`` and ``b`` looks like this:

.. literalinclude:: doc_code/key_concepts.py
:language: python
@@ -306,10 +306,10 @@ and `Population Based Bandits (PB2) <https://arxiv.org/abs/2002.02518>`__.

When using schedulers, you may face compatibility issues, as shown in the below compatibility matrix.
Certain schedulers cannot be used with search algorithms,
and certain schedulers require :ref:`checkpointing to be implemented <tune-function-checkpointing>`.
and certain schedulers require that you implement :ref:`checkpointing <tune-trial-checkpoint>`.

Schedulers can dynamically change trial resource requirements during tuning.
This is currently implemented in :ref:`ResourceChangingScheduler<tune-resource-changing-scheduler>`,
This is implemented in :ref:`ResourceChangingScheduler<tune-resource-changing-scheduler>`,
which can wrap around any other scheduler.
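
As a hedged sketch (the wrapped ASHA scheduler, the allocation policy, and the toy training
function are illustrative, not prescriptive; in practice the trainable should also save and load
checkpoints as noted above), wrapping an existing scheduler looks roughly like:

.. code-block:: python

    from ray import tune
    from ray.air import session
    from ray.tune.schedulers import ASHAScheduler
    from ray.tune.schedulers.resource_changing_scheduler import (
        DistributeResources,
        ResourceChangingScheduler,
    )


    def train_fn(config):
        for step in range(10):
            session.report({"score": config["a"] * step})


    scheduler = ResourceChangingScheduler(
        base_scheduler=ASHAScheduler(),
        resources_allocation_function=DistributeResources(add_bundles=True),
    )

    tuner = tune.Tuner(
        train_fn,
        param_space={"a": tune.uniform(0, 1)},
        tune_config=tune.TuneConfig(scheduler=scheduler, metric="score", mode="max"),
    )
    results = tuner.fit()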

.. list-table:: Scheduler Compatibility Matrix
18 changes: 13 additions & 5 deletions doc/source/tune/tutorials/tune-distributed.rst
@@ -166,7 +166,9 @@ In GCP, you can use the following configuration modification:
scheduling:
- preemptible: true

Spot instances may be removed suddenly while trials are still running. Often times this may be difficult to deal with when using other distributed hyperparameter optimization frameworks. Tune allows users to mitigate the effects of this by preserving the progress of your model training through :ref:`checkpointing <tune-function-checkpointing>`.
Spot instances may be removed suddenly while trials are still running.
Tune allows you to mitigate the effects of this by preserving the progress of your model training through
:ref:`checkpointing <tune-trial-checkpoint>`.

.. literalinclude:: /../../python/ray/tune/tests/tutorial.py
:language: python
@@ -219,13 +221,19 @@ You can also specify ``sync_config=tune.SyncConfig(upload_dir=...)``, as part of
Fault Tolerance of Tune Runs
----------------------------

Tune will automatically restart trials in case of trial failures/error (if ``max_failures != 0``), both in the single node and distributed setting.
Tune automatically restarts failed trials (if ``max_failures != 0``),
both in single-node and distributed settings.

Tune will restore trials from the latest checkpoint, where available. In the distributed setting, Tune will automatically sync the trial folder with the driver. For example, if a node is lost while a trial (specifically, the corresponding Trainable actor of the trial) is still executing on that node and a checkpoint of the trial exists, Tune will wait until available resources are available to begin executing the trial again.
See :ref:`here for information on checkpointing <tune-function-checkpointing>`.
Tune restores trials from the latest available checkpoint. In the distributed setting, Tune automatically
syncs the trial folder with the driver. For example, if a node is lost while a trial (specifically,
the corresponding Trainable actor of the trial) is still executing on that node and a checkpoint of the trial exists,
Tune waits until resources are available to begin executing the trial again.
See :ref:`here for information on checkpointing <tune-trial-checkpoint>`.


If the trial/actor is placed on a different node, Tune will automatically push the previous checkpoint file to that node and restore the remote trial actor state, allowing the trial to resume from the latest checkpoint even after failure.
If the trial or actor is then placed on a different node, Tune automatically pushes the previous checkpoint file
to that node and restores the remote trial actor state, allowing the trial to resume from the latest checkpoint
even after failure.
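
A hedged sketch of enabling this behavior (``my_trainable`` is assumed to be defined elsewhere and
to save and load trial checkpoints; the retry count is illustrative):

.. code-block:: python

    from ray import air, tune

    tuner = tune.Tuner(
        my_trainable,  # assumed: a checkpoint-aware trainable defined elsewhere
        run_config=air.RunConfig(
            # Retry each failed trial up to 3 times, restoring from its latest checkpoint.
            failure_config=air.FailureConfig(max_failures=3),
        ),
    )
    results = tuner.fit()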

Recovering From Failures
~~~~~~~~~~~~~~~~~~~~~~~~
93 changes: 43 additions & 50 deletions doc/source/tune/tutorials/tune-storage.rst
@@ -3,6 +3,47 @@
How to Configure Storage Options for a Distributed Tune Experiment?
===================================================================

Before diving into storage options, let's first take a look at the types of data stored by Ray Tune.
There are three main types of data.

.. _tune-persisted-experiment-data:

Types of data stored by Tune
----------------------------

Experiment Checkpoints
~~~~~~~~~~~~~~~~~~~~~~

Experiment-level checkpoints save the experiment state. This includes the state of the searcher,
the list of trials and their statuses (e.g., PENDING, RUNNING, TERMINATED, ERROR), and
metadata pertaining to each trial (e.g., hyperparameter configuration and derived trial results
such as min, max, and last).
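
A hedged sketch of resuming a whole experiment from its experiment checkpoint (the path is a
placeholder; resuming is covered in more detail later on this page):

.. code-block:: python

    from ray import tune

    # "~/ray_results/my_experiment" is a placeholder for your experiment directory.
    tuner = tune.Tuner.restore(
        "~/ray_results/my_experiment",
        resume_errored=True,
    )
    results = tuner.fit()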

Trial Checkpoints
~~~~~~~~~~~~~~~~~

Trial-level checkpoints capture the per-trial state. They are saved by the
:ref:`trainable <tune_60_seconds_trainables>` itself. This often includes the model and optimizer states.
The following are a few uses of trial checkpoints:

- If the trial is interrupted for some reason (e.g., on spot instances), it can be resumed from the
last state. No training time is lost.
- Some searchers or schedulers pause trials to free up resources for other trials to train in
the meantime. This only makes sense if the trials can then continue training from the latest state.
- The checkpoint can be later used for other downstream tasks like batch inference.

Learn how to save and load trial checkpoints :ref:`here <tune-trial-checkpoint>`; a minimal sketch follows.
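
As a minimal, hedged sketch of the first use case (it assumes the ``ray.air`` ``session`` and
``Checkpoint`` APIs; the linked guide is the authoritative reference):

.. code-block:: python

    from ray.air import session
    from ray.air.checkpoint import Checkpoint


    def train_fn(config):
        start = 0
        # If the trial was interrupted (e.g., spot preemption) or paused by a scheduler,
        # Tune hands the latest trial checkpoint back on restart.
        checkpoint = session.get_checkpoint()
        if checkpoint:
            start = checkpoint.to_dict()["step"] + 1
        for step in range(start, 100):
            # ... one iteration of training ...
            session.report({"step": step}, checkpoint=Checkpoint.from_dict({"step": step}))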

Trial Results
~~~~~~~~~~~~~

Metrics reported by trials are saved and logged to their respective trial directories.
This is the data stored in CSV, JSON, or Tensorboard (events.out.tfevents.*) formats
that can be inspected by Tensorboard and used for post-experiment analysis.

Experiment data storage is critical in a distributed setting
--------------------------------------------------------------

When running Tune in a distributed setting, trials run on many different machines,
which means that experiment outputs such as model checkpoints will be spread all across the cluster.

@@ -15,6 +56,8 @@ Tune allows you to configure persistent storage options to enable following use
checkpoints to start from where the experiment left off.
- **Post-experiment analysis**: A consolidated location storing data from all trials is useful for post-experiment analysis
such as accessing the best checkpoints and hyperparameter configs after the cluster has already been terminated.
- **Bridge with downstream serving/batch inference tasks**: With persistent storage configured, you can easily access
  the models and artifacts generated by trials, share them with others, or use them in downstream tasks, as sketched below.
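
A hedged sketch of pointing Tune at shared cloud storage (the bucket URI is a placeholder and
``my_trainable`` is assumed to be defined elsewhere; the sections below cover the options in detail):

.. code-block:: python

    from ray import air, tune

    tuner = tune.Tuner(
        my_trainable,  # assumed: defined elsewhere
        run_config=air.RunConfig(
            name="my_experiment",  # placeholder experiment name
            sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/tune-results"),
        ),
    )
    results = tuner.fit()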


Storage Options in Tune
@@ -306,53 +349,3 @@ This experiment can be resumed from the head node:
resume_errored=True
)
tuner.fit()

.. TODO: Move this appendix to a new tune-checkpoints user guide.

.. _tune-persisted-experiment-data:

Appendix: Types of Tune Experiment Data
---------------------------------------

Experiment Checkpoints
~~~~~~~~~~~~~~~~~~~~~~

Experiment-level checkpoints save the experiment state. This includes the state of the searcher,
the list of trials and their statuses (e.g. PENDING, RUNNING, TERMINATED, ERROR), and
metadata pertaining to each trial (e.g. hyperparameter configuration, trial logdir, etc).

The experiment-level checkpoint is periodically saved by the driver on the head node.
By default, the frequency at which it is saved is automatically
adjusted so that at most 5% of the time is spent saving experiment checkpoints,
and the remaining time is used for handling training results and scheduling.
This time can also be adjusted with the
:ref:`TUNE_GLOBAL_CHECKPOINT_S environment variable <tune-env-vars>`.

The purpose of the experiment checkpoint is to maintain a global state from which the whole Ray Tune experiment
can be resumed from if it is interrupted or failed.
It is also used to load tuning results after a Ray Tune experiment has finished.

Trial Results
~~~~~~~~~~~~~

Metrics reported by trials get saved and logged to their respective trial directories.
This is the data stored in csv/json format that can be inspected by Tensorboard and
used for post-experiment analysis.

Trial Checkpoints
~~~~~~~~~~~~~~~~~

Trial-level checkpoints capture the per-trial state. They are saved by the :ref:`trainable <tune_60_seconds_trainables>` itself.
This often includes the model and optimizer states. Here are a few uses of trial checkpoints:

- If the trial is interrupted for some reason (e.g. on spot instances), it can be resumed from the
last state. No training time is lost.
- Some searchers/schedulers pause trials to free resources so that other trials can train in
the meantime. This only makes sense if the trials can then continue training from the latest state.
- The checkpoint can be later used for other downstream tasks like batch inference.

Everything that is saved by ``session.report()`` (if using the Function API) or
``Trainable.save_checkpoint`` (if using the Class API) is a **trial-level checkpoint.**
See :ref:`checkpointing with the Function API <tune-function-checkpointing>` and
:ref:`checkpointing with the Class API <tune-trainable-save-restore>`
for examples of saving and loading trial-level checkpoints.