[doc][tune] Add tune checkpoint user guide. #33145

Merged · 14 commits · Mar 22, 2023
1 change: 1 addition & 0 deletions doc/source/_toc.yml
@@ -162,6 +162,7 @@ parts:
- file: tune/tutorials/tune-stopping
title: "How to Stop and Resume"
- file: tune/tutorials/tune-storage
- file: tune/tutorials/tune-trial-checkpoint
- file: tune/tutorials/tune-metrics
title: "Using Callbacks and Metrics"
- file: tune/tutorials/tune_get_data_in_and_out
10 changes: 5 additions & 5 deletions doc/source/tune/api/schedulers.rst
@@ -172,7 +172,7 @@ This can be enabled by setting the ``scheduler`` parameter of ``tune.TuneConfig`

When the PBT scheduler is enabled, each trial variant is treated as a member of the population.
Periodically, **top-performing trials are checkpointed**
(this requires your Trainable to support :ref:`save and restore <tune-function-checkpointing>`).
(this requires your Trainable to support :ref:`save and restore <tune-trial-checkpoint>`).
**Low-performing trials clone the hyperparameter configurations of top performers and
perturb them** slightly in the hopes of discovering even better hyperparameter settings.
**Low-performing trials also resume from the checkpoints of the top performers**, allowing
@@ -261,7 +261,7 @@ PB2 can be enabled by setting the ``scheduler`` parameter of ``tune.TuneConfig``

When the PB2 scheduler is enabled, each trial variant is treated as a member of the population.
Periodically, top-performing trials are checkpointed (this requires your Trainable to
support :ref:`save and restore <tune-function-checkpointing>`).
support :ref:`save and restore <tune-trial-checkpoint>`).
Low-performing trials clone the checkpoints of top performers and perturb the configurations
in the hope of discovering an even better variation.

@@ -308,9 +308,9 @@ It wraps around another scheduler and uses its decisions.
which will let your model know about the new resources assigned. You can also obtain the current trial resources
by calling ``Trainable.trial_resources``.

* If you are using the functional API for tuning, the current trial resources can be
obtained by calling `tune.get_trial_resources()` inside the training function.
The function should be able to :ref:`load and save checkpoints <tune-function-checkpointing>`
* If you are using the functional API for tuning, obtain the current trial resources by calling
  `tune.get_trial_resources()` inside the training function.
  The function should be able to :ref:`load and save checkpoints <tune-function-trainable-checkpointing>`
(the latter preferably every iteration).

An example of this in use can be found here: :doc:`/tune/examples/includes/xgboost_dynamic_resources_example`.
109 changes: 11 additions & 98 deletions doc/source/tune/api/trainable.rst
@@ -7,7 +7,8 @@
Training in Tune (tune.Trainable, session.report)
=================================================

Training can be done with either a **Function API** (:ref:`session.report <tune-function-docstring>`) or **Class API** (:ref:`tune.Trainable <tune-trainable-docstring>`).
Training can be done with either a **Function API** (:ref:`session.report <tune-function-docstring>`) or
**Class API** (:ref:`tune.Trainable <tune-trainable-docstring>`).

For the sake of example, let's maximize this objective function:

@@ -18,11 +19,11 @@ For the sake of example, let's maximize this objective function:

.. _tune-function-api:

Tune's Function API
-------------------
Function Trainable API
----------------------

The Function API allows you to define a custom training function that Tune will run in parallel Ray actor processes,
one for each Tune trial.
Use the Function API to define a custom training function that Tune runs in Ray actor processes.
Each trial runs in its own actor process, in parallel with the other trials.

The ``config`` argument in the function is a dictionary populated automatically by Ray Tune and corresponding to
the hyperparameters selected for the trial from the :ref:`search space <tune-key-concepts-search-spaces>`.
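
As a minimal, hedged sketch (it assumes the ``ray.air.session`` reporting API and reuses the
example objective from above; the hyperparameter names and ranges are illustrative), a function
trainable reads hyperparameters from ``config`` and reports metrics back to Tune:

.. code-block:: python

    from ray import tune
    from ray.air import session


    def objective(config):
        # ``config`` is filled in by Tune with one sampled point from the search space.
        for step in range(10):
            score = config["a"] * (step ** 2) + config["b"]
            # Report the metric back to Tune once per iteration.
            session.report({"score": score})


    tuner = tune.Tuner(
        objective,
        param_space={"a": tune.uniform(0, 1), "b": tune.uniform(0, 20)},
    )
    results = tuner.fit()
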
@@ -51,36 +52,14 @@ It's also possible to return a final set of metrics to Tune by returning them fr
:start-after: __function_api_return_final_metrics_start__
:end-before: __function_api_return_final_metrics_end__

You'll notice that Ray Tune will output extra values in addition to the user reported metrics,
such as ``iterations_since_restore``. See :ref:`tune-autofilled-metrics` for an explanation/glossary of these values.

.. _tune-function-checkpointing:

Function API Checkpointing
~~~~~~~~~~~~~~~~~~~~~~~~~~

Many Tune features rely on checkpointing, including the usage of certain Trial Schedulers and fault tolerance.
You can save and load checkpoints in Ray Tune in the following manner:

.. literalinclude:: /tune/doc_code/trainable.py
:language: python
:start-after: __function_api_checkpointing_start__
:end-before: __function_api_checkpointing_end__

.. note:: ``checkpoint_frequency`` and ``checkpoint_at_end`` will not work with Function API checkpointing.

In this example, checkpoints will be saved by training iteration to ``<local_dir>/<exp_name>/trial_name/checkpoint_<step>``.

Tune also may copy or move checkpoints during the course of tuning. For this purpose,
it is important not to depend on absolute paths in the implementation of ``save``.
Note that Ray Tune outputs extra values in addition to the user reported metrics,
such as ``iterations_since_restore``. See :ref:`tune-autofilled-metrics` for an explanation of these values.

See :ref:`here for more information on creating checkpoints <air-checkpoint-ref>`.
If using framework-specific trainers from Ray AIR, see :ref:`here <air-trainer-ref>` for
references to framework-specific checkpoints such as `TensorflowCheckpoint`.
See how to configure checkpointing for a function trainable :ref:`here <tune-function-trainable-checkpointing>`.
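
As a quick orientation before that guide, a minimal sketch of saving a trial checkpoint from a
function trainable might look like the following (it assumes the ``ray.air`` ``session`` and
``Checkpoint`` APIs; the linked guide is the authoritative reference):

.. code-block:: python

    from ray.air import session
    from ray.air.checkpoint import Checkpoint


    def train_fn(config):
        for step in range(100):
            # ... one iteration of training ...
            metrics = {"step": step}
            # Attaching a checkpoint to the reported metrics lets Tune persist it.
            session.report(metrics, checkpoint=Checkpoint.from_dict({"step": step}))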

.. _tune-class-api:

Tune's Trainable Class API
Class Trainable API
-------------------

.. caution:: Do not use ``session.report`` within a ``Trainable`` class.
@@ -111,73 +90,7 @@ You'll notice that Ray Tune will output extra values in addition to the user rep
such as ``iterations_since_restore``.
See :ref:`tune-autofilled-metrics` for an explanation/glossary of these values.

.. _tune-trainable-save-restore:

Class API Checkpointing
~~~~~~~~~~~~~~~~~~~~~~~

You can also implement checkpoint/restore using the Trainable Class API:

.. literalinclude:: /tune/doc_code/trainable.py
:language: python
:start-after: __class_api_checkpointing_start__
:end-before: __class_api_checkpointing_end__

You can checkpoint with three different mechanisms: manually, periodically, and at termination.

**Manual Checkpointing**: A custom Trainable can manually trigger checkpointing by returning ``should_checkpoint: True``
(or ``tune.result.SHOULD_CHECKPOINT: True``) in the result dictionary of `step`.
This can be especially helpful in spot instances:

.. code-block:: python

    def step(self):
        # training code
        result = {"mean_accuracy": accuracy}
        if detect_instance_preemption():
            result.update(should_checkpoint=True)
        return result


**Periodic Checkpointing**: periodic checkpointing can be used to provide fault-tolerance for experiments.
This can be enabled by setting ``checkpoint_frequency=<int>`` and ``max_failures=<int>`` to checkpoint trials
every *N* iterations and recover from up to *M* crashes per trial, e.g.:

.. code-block:: python

    tuner = tune.Tuner(
        my_trainable,
        run_config=air.RunConfig(
            checkpoint_config=air.CheckpointConfig(checkpoint_frequency=10),
            failure_config=air.FailureConfig(max_failures=5))
    )
    results = tuner.fit()

**Checkpointing at Termination**: The checkpoint_frequency may not coincide with the exact end of an experiment.
If you want a checkpoint to be created at the end of a trial, you can additionally set the ``checkpoint_at_end=True``:

.. code-block:: python
    :emphasize-lines: 5

    tuner = tune.Tuner(
        my_trainable,
        run_config=air.RunConfig(
            checkpoint_config=air.CheckpointConfig(checkpoint_frequency=10, checkpoint_at_end=True),
            failure_config=air.FailureConfig(max_failures=5))
    )
    results = tuner.fit()


Use ``validate_save_restore`` to catch ``save_checkpoint``/``load_checkpoint`` errors before execution.

.. code-block:: python

    from ray.tune.utils import validate_save_restore

    # both of these should return
    validate_save_restore(MyTrainableClass)
    validate_save_restore(MyTrainableClass, use_object_store=True)

See how to configure checkpointing for a class trainable :ref:`here <tune-class-trainable-checkpointing>`.
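
As a minimal, hedged sketch (the metric and the checkpoint contents are illustrative; the linked
guide covers the full set of options), the ``save_checkpoint``/``load_checkpoint`` pair looks like:

.. code-block:: python

    from ray import tune


    class MyTrainable(tune.Trainable):
        def setup(self, config):
            self.iteration_count = 0

        def step(self):
            self.iteration_count += 1
            return {"mean_accuracy": min(1.0, 0.01 * self.iteration_count)}

        def save_checkpoint(self, checkpoint_dir):
            # Return (or write under ``checkpoint_dir``) only what is needed to resume;
            # avoid absolute paths, since Tune may copy or move checkpoints.
            return {"iteration_count": self.iteration_count}

        def load_checkpoint(self, checkpoint):
            self.iteration_count = checkpoint["iteration_count"]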


Advanced: Reusing Actors in Tune
12 changes: 6 additions & 6 deletions doc/source/tune/key-concepts.rst
@@ -98,12 +98,12 @@ Tune Trials

You use :ref:`Tuner.fit <tune-run-ref>` to execute and manage hyperparameter tuning and generate your `trials`.
At a minimum, your ``Tuner`` call takes in a trainable as first argument, and a ``param_space`` dictionary
to define your search space.
to define the search space.

The ``Tuner.fit()`` function also provides many features such as :ref:`logging <tune-logging>`,
:ref:`checkpointing <tune-function-checkpointing>`, and :ref:`early stopping <tune-stopping-ref>`.
Continuing with the example defined earlier (minimizing ``a (x ** 2) + b``), a simple Tune run with a simplistic
search space for ``a`` and ``b`` would look like this:
:ref:`checkpointing <tune-trial-checkpoint>`, and :ref:`early stopping <tune-stopping-ref>`.
Continuing the earlier example of minimizing ``a (x ** 2) + b``, a simple Tune run with a basic
search space for ``a`` and ``b`` looks like this:

.. literalinclude:: doc_code/key_concepts.py
:language: python
@@ -306,10 +306,10 @@ and `Population Based Bandits (PB2) <https://arxiv.org/abs/2002.02518>`__.

When using schedulers, you may face compatibility issues, as shown in the below compatibility matrix.
Certain schedulers cannot be used with search algorithms,
and certain schedulers require :ref:`checkpointing to be implemented <tune-function-checkpointing>`.
and certain schedulers require that you implement :ref:`checkpointing <tune-trial-checkpoint>`.

Schedulers can dynamically change trial resource requirements during tuning.
This is currently implemented in :ref:`ResourceChangingScheduler<tune-resource-changing-scheduler>`,
This is implemented in :ref:`ResourceChangingScheduler<tune-resource-changing-scheduler>`,
which can wrap around any other scheduler.
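
As a hedged sketch (the wrapped ASHA scheduler, the allocation policy, and the toy training
function are illustrative, not prescriptive; in practice the trainable should also save and load
checkpoints as noted above), wrapping an existing scheduler looks roughly like:

.. code-block:: python

    from ray import tune
    from ray.air import session
    from ray.tune.schedulers import ASHAScheduler
    from ray.tune.schedulers.resource_changing_scheduler import (
        DistributeResources,
        ResourceChangingScheduler,
    )


    def train_fn(config):
        for step in range(10):
            session.report({"score": config["a"] * step})


    scheduler = ResourceChangingScheduler(
        base_scheduler=ASHAScheduler(),
        resources_allocation_function=DistributeResources(add_bundles=True),
    )

    tuner = tune.Tuner(
        train_fn,
        param_space={"a": tune.uniform(0, 1)},
        tune_config=tune.TuneConfig(scheduler=scheduler, metric="score", mode="max"),
    )
    results = tuner.fit()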

.. list-table:: Scheduler Compatibility Matrix
18 changes: 13 additions & 5 deletions doc/source/tune/tutorials/tune-distributed.rst
@@ -166,7 +166,9 @@ In GCP, you can use the following configuration modification:
scheduling:
- preemptible: true

Spot instances may be removed suddenly while trials are still running. Often times this may be difficult to deal with when using other distributed hyperparameter optimization frameworks. Tune allows users to mitigate the effects of this by preserving the progress of your model training through :ref:`checkpointing <tune-function-checkpointing>`.
Spot instances may be removed suddenly while trials are still running.
Tune allows you to mitigate the effects of this by preserving the progress of your model training through
:ref:`checkpointing <tune-trial-checkpoint>`.

.. literalinclude:: /../../python/ray/tune/tests/tutorial.py
:language: python
@@ -219,13 +221,19 @@ You can also specify ``sync_config=tune.SyncConfig(upload_dir=...)``, as part of
Fault Tolerance of Tune Runs
----------------------------

Tune will automatically restart trials in case of trial failures/error (if ``max_failures != 0``), both in the single node and distributed setting.
Tune automatically restarts failed trials (if ``max_failures != 0``),
both in single-node and distributed settings.

Tune will restore trials from the latest checkpoint, where available. In the distributed setting, Tune will automatically sync the trial folder with the driver. For example, if a node is lost while a trial (specifically, the corresponding Trainable actor of the trial) is still executing on that node and a checkpoint of the trial exists, Tune will wait until available resources are available to begin executing the trial again.
See :ref:`here for information on checkpointing <tune-function-checkpointing>`.
Tune restores trials from the latest available checkpoint. In the distributed setting, Tune automatically
syncs the trial folder with the driver. For example, if a node is lost while a trial (specifically,
the corresponding Trainable actor of the trial) is still executing on that node and a checkpoint of the trial exists,
Tune waits until resources are available to begin executing the trial again.
See :ref:`here for information on checkpointing <tune-trial-checkpoint>`.


If the trial/actor is placed on a different node, Tune will automatically push the previous checkpoint file to that node and restore the remote trial actor state, allowing the trial to resume from the latest checkpoint even after failure.
If the trial or actor is then placed on a different node, Tune automatically pushes the previous checkpoint file
to that node and restores the remote trial actor state, allowing the trial to resume from the latest checkpoint
even after failure.
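
A hedged sketch of enabling this behavior (``my_trainable`` is assumed to be defined elsewhere and
to save and load trial checkpoints; the retry count is illustrative):

.. code-block:: python

    from ray import air, tune

    tuner = tune.Tuner(
        my_trainable,  # assumed: a checkpoint-aware trainable defined elsewhere
        run_config=air.RunConfig(
            # Retry each failed trial up to 3 times, restoring from its latest checkpoint.
            failure_config=air.FailureConfig(max_failures=3),
        ),
    )
    results = tuner.fit()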

Recovering From Failures
~~~~~~~~~~~~~~~~~~~~~~~~
93 changes: 43 additions & 50 deletions doc/source/tune/tutorials/tune-storage.rst
@@ -3,6 +3,47 @@
How to Configure Storage Options for a Distributed Tune Experiment?
===================================================================

Before diving into storage options, let's first take a look at the types of data stored by Ray Tune.
There are three main types of data.

.. _tune-persisted-experiment-data:

Types of data stored by Tune
----------------------------

Experiment Checkpoints
~~~~~~~~~~~~~~~~~~~~~~

Experiment-level checkpoints save the experiment state. This includes the state of the searcher,
the list of trials and their statuses (e.g., PENDING, RUNNING, TERMINATED, ERROR), and
metadata pertaining to each trial (e.g., hyperparameter configuration and derived trial results
such as min, max, and last).
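
A hedged sketch of resuming a whole experiment from its experiment checkpoint (the path is a
placeholder; resuming is covered in more detail later on this page):

.. code-block:: python

    from ray import tune

    # "~/ray_results/my_experiment" is a placeholder for your experiment directory.
    tuner = tune.Tuner.restore(
        "~/ray_results/my_experiment",
        resume_errored=True,
    )
    results = tuner.fit()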

Trial Checkpoints
~~~~~~~~~~~~~~~~~

Trial-level checkpoints capture the per-trial state. They are saved by the
:ref:`trainable <tune_60_seconds_trainables>` itself. This often includes the model and optimizer states.
The following are a few uses of trial checkpoints:

- If the trial is interrupted for some reason (e.g., on spot instances), it can be resumed from the
last state. No training time is lost.
- Some searchers or schedulers pause trials to free up resources for other trials to train in
the meantime. This only makes sense if the trials can then continue training from the latest state.
- The checkpoint can be later used for other downstream tasks like batch inference.

Learn how to save and load trial checkpoints :ref:`here <tune-trial-checkpoint>`; a minimal sketch follows.
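
As a minimal, hedged sketch of the first use case (it assumes the ``ray.air`` ``session`` and
``Checkpoint`` APIs; the linked guide is the authoritative reference):

.. code-block:: python

    from ray.air import session
    from ray.air.checkpoint import Checkpoint


    def train_fn(config):
        start = 0
        # If the trial was interrupted (e.g., spot preemption) or paused by a scheduler,
        # Tune hands the latest trial checkpoint back on restart.
        checkpoint = session.get_checkpoint()
        if checkpoint:
            start = checkpoint.to_dict()["step"] + 1
        for step in range(start, 100):
            # ... one iteration of training ...
            session.report({"step": step}, checkpoint=Checkpoint.from_dict({"step": step}))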

Trial Results
~~~~~~~~~~~~~

Metrics reported by trials are saved and logged to their respective trial directories.
This is the data stored in CSV, JSON, or Tensorboard (events.out.tfevents.*) formats
that can be inspected by Tensorboard and used for post-experiment analysis.

Experiment data storage is critical in a distributed setting
--------------------------------------------------------------

When running Tune in a distributed setting, trials run on many different machines,
which means that experiment outputs such as model checkpoints will be spread all across the cluster.

@@ -15,6 +56,8 @@ Tune allows you to configure persistent storage options to enable following use
checkpoints to start from where the experiment left off.
- **Post-experiment analysis**: A consolidated location storing data from all trials is useful for post-experiment analysis
such as accessing the best checkpoints and hyperparameter configs after the cluster has already been terminated.
- **Bridge with downstream serving/batch inference tasks**: With persistent storage configured, you can easily access
  the models and artifacts generated by trials, share them with others, or use them in downstream tasks, as sketched below.
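
A hedged sketch of pointing Tune at shared cloud storage (the bucket URI is a placeholder and
``my_trainable`` is assumed to be defined elsewhere; the sections below cover the options in detail):

.. code-block:: python

    from ray import air, tune

    tuner = tune.Tuner(
        my_trainable,  # assumed: defined elsewhere
        run_config=air.RunConfig(
            name="my_experiment",  # placeholder experiment name
            sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/tune-results"),
        ),
    )
    results = tuner.fit()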


Storage Options in Tune
@@ -306,53 +349,3 @@ This experiment can be resumed from the head node:
resume_errored=True
)
tuner.fit()

.. TODO: Move this appendix to a new tune-checkpoints user guide.

.. _tune-persisted-experiment-data:

Appendix: Types of Tune Experiment Data
---------------------------------------

Experiment Checkpoints
~~~~~~~~~~~~~~~~~~~~~~

Experiment-level checkpoints save the experiment state. This includes the state of the searcher,
the list of trials and their statuses (e.g. PENDING, RUNNING, TERMINATED, ERROR), and
metadata pertaining to each trial (e.g. hyperparameter configuration, trial logdir, etc).

The experiment-level checkpoint is periodically saved by the driver on the head node.
By default, the frequency at which it is saved is automatically
adjusted so that at most 5% of the time is spent saving experiment checkpoints,
and the remaining time is used for handling training results and scheduling.
This time can also be adjusted with the
:ref:`TUNE_GLOBAL_CHECKPOINT_S environment variable <tune-env-vars>`.

The purpose of the experiment checkpoint is to maintain a global state from which the whole Ray Tune experiment
can be resumed from if it is interrupted or failed.
It is also used to load tuning results after a Ray Tune experiment has finished.

Trial Results
~~~~~~~~~~~~~

Metrics reported by trials get saved and logged to their respective trial directories.
This is the data stored in csv/json format that can be inspected by Tensorboard and
used for post-experiment analysis.

Trial Checkpoints
~~~~~~~~~~~~~~~~~

Trial-level checkpoints capture the per-trial state. They are saved by the :ref:`trainable <tune_60_seconds_trainables>` itself.
This often includes the model and optimizer states. Here are a few uses of trial checkpoints:

- If the trial is interrupted for some reason (e.g. on spot instances), it can be resumed from the
last state. No training time is lost.
- Some searchers/schedulers pause trials to free resources so that other trials can train in
the meantime. This only makes sense if the trials can then continue training from the latest state.
- The checkpoint can be later used for other downstream tasks like batch inference.

Everything that is saved by ``session.report()`` (if using the Function API) or
``Trainable.save_checkpoint`` (if using the Class API) is a **trial-level checkpoint.**
See :ref:`checkpointing with the Function API <tune-function-checkpointing>` and
:ref:`checkpointing with the Class API <tune-trainable-save-restore>`
for examples of saving and loading trial-level checkpoints.