[cherry-pick][data][doc] Add DatasetConfig -> DataConfig migration guide (#37383)

* [data][doc] Add DatasetConfig -> DataConfig migration guide  (#37278)

- Add DatasetConfig -> DataConfig migration guide
- Move ray.train.data_config to ray.train._internal.data_config

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* fix

Signed-off-by: Hao Chen <chenh1024@gmail.com>

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
raulchen authored Jul 13, 2023
1 parent 0a0bac8 commit abd287f
Showing 21 changed files with 104 additions and 20 deletions.
26 changes: 24 additions & 2 deletions doc/source/ray-air/check-ingest.rst
@@ -26,7 +26,7 @@ In this basic example, the `train_ds` object is created in your Ray script befor
Splitting data across workers
-----------------------------

By default, Train will split the ``"train"`` dataset across workers using :meth:`Dataset.streaming_split <ray.data.Dataset.streaming_split>`. This means that each worker sees a disjoint subset of the data, instead of iterating over the entire dataset. To customize this, we can pass in a :class:`DataConfig <ray.train.data_config.DataConfig>` to the Trainer constructor. For example, the following splits dataset ``"a"`` but not ``"b"``.
By default, Train will split the ``"train"`` dataset across workers using :meth:`Dataset.streaming_split <ray.data.Dataset.streaming_split>`. This means that each worker sees a disjoint subset of the data, instead of iterating over the entire dataset. To customize this, we can pass in a :class:`DataConfig <ray.train.DataConfig>` to the Trainer constructor. For example, the following splits dataset ``"a"`` but not ``"b"``.

.. literalinclude:: doc_code/air_ingest_new.py
:language: python
@@ -51,7 +51,7 @@ Datasets are lazy and their execution is streamed, which means that on each epoc
Ray Data execution options
~~~~~~~~~~~~~~~~~~~~~~~~~~

Under the hood, Train configures some default Data options for ingest: limiting the data ingest memory usage to 2GB per worker, and telling it to optimize the locality of the output data for ingest. See :meth:`help(DataConfig.default_ingest_options()) <ray.train.data_config.DataConfig.default_ingest_options>` if you want to learn more and further customize these settings.
Under the hood, Train configures some default Data options for ingest: limiting the data ingest memory usage to 2GB per worker, and telling it to optimize the locality of the output data for ingest. See :meth:`help(DataConfig.default_ingest_options()) <ray.train.DataConfig.default_ingest_options>` if you want to learn more and further customize these settings.

Common options you may want to adjust:

@@ -88,3 +88,25 @@ What do you need to know about this ``DataConfig`` class?
* Its ``configure`` method is called on the main actor of the Trainer group to create the data iterators for each worker.

In general, you can use ``DataConfig`` for any shared setup that has to occur ahead of time before the workers start reading data. The setup will be run at the start of each Trainer run.

Migrating from the legacy DatasetConfig API
-------------------------------------------

Starting from Ray 2.6, the ``DatasetConfig`` API is deprecated and will be removed in a future release. If your workloads still use it, migrate to the new :class:`DataConfig <ray.train.DataConfig>` API as soon as possible.

The main difference is that preprocessing is no longer part of the Trainer. Because Dataset operations are lazy, you can apply any operations to your Datasets before passing them to the Trainer, and those operations will be re-executed before each epoch.
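
To illustrate the lazy, re-executed-per-epoch behavior in plain Python (a toy analogy, not Ray Data code):

```python
import random

def add_noise(batch, rng):
    # Toy stand-in for a randomized preprocessing step.
    return [x + rng.random() for x in batch]

def lazy_pipeline(source, rng):
    # A generator: nothing executes until it is iterated.
    return (add_noise(batch, rng) for batch in source)

data = [[0.0, 1.0], [2.0, 3.0]]
rng = random.Random(0)

# Each "epoch" re-creates and re-runs the pipeline, so the
# randomized step produces fresh values every time.
epoch_1 = list(lazy_pipeline(data, rng))
epoch_2 = list(lazy_pipeline(data, rng))
assert epoch_1 != epoch_2
```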

In the following example with the legacy ``DatasetConfig`` API, we pass two Datasets ("train" and "test") to the Trainer, apply an "add_noise" preprocessor to the "train" Dataset on each epoch, and split the "train" Dataset but not the "test" Dataset.

.. literalinclude:: doc_code/air_ingest_migration.py
:language: python
:start-after: __legacy_api__
:end-before: __legacy_api_end__

To migrate this example to the new :class:`DataConfig <ray.train.DataConfig>` API, we apply the "add_noise" preprocessor to the "train" Dataset before passing it to the Trainer, and use ``DataConfig(datasets_to_split=["train"])`` to specify which Datasets need to be split. Note that the ``datasets_to_split`` argument is optional. By default, only the "train" Dataset is split. If you don't want to split the "train" Dataset either, use ``datasets_to_split=[]``.

.. literalinclude:: doc_code/air_ingest_migration.py
:language: python
:start-after: __new_api__
:end-before: __new_api_end__

62 changes: 62 additions & 0 deletions doc/source/ray-air/doc_code/air_ingest_migration.py
@@ -0,0 +1,62 @@
# flake8: noqa
# isort: skip_file

# __legacy_api__
import random
import ray

from ray.air.config import ScalingConfig, DatasetConfig
from ray.data.preprocessors.batch_mapper import BatchMapper
from ray.train.torch import TorchTrainer

train_ds = ray.data.range_tensor(1000)
test_ds = ray.data.range_tensor(10)

# A randomized preprocessor that adds a random float to all values.
add_noise = BatchMapper(lambda df: df + random.random(), batch_format="pandas")

my_trainer = TorchTrainer(
lambda: None,
scaling_config=ScalingConfig(num_workers=1),
datasets={
"train": train_ds,
"test": test_ds,
},
dataset_config={
"train": DatasetConfig(
split=True,
# Apply the preprocessor for each epoch.
per_epoch_preprocessor=add_noise,
),
"test": DatasetConfig(
split=False,
),
},
)
my_trainer.fit()
# __legacy_api_end__

# __new_api__
from ray.train import DataConfig

train_ds = ray.data.range_tensor(1000)
test_ds = ray.data.range_tensor(10)

# Apply the preprocessor before passing the Dataset to the Trainer.
# This operation is lazy. It will be re-executed for each epoch.
train_ds = add_noise.transform(train_ds)

my_trainer = TorchTrainer(
lambda: None,
scaling_config=ScalingConfig(num_workers=1),
datasets={
"train": train_ds,
"test": test_ds,
},
# Specify which datasets to split.
dataset_config=DataConfig(
datasets_to_split=["train"],
),
)
my_trainer.fit()
# __new_api_end__
@@ -19,7 +19,7 @@
import mlflow
import pandas as pd
from ray.air.config import ScalingConfig
from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.train.torch.torch_trainer import TorchTrainer
import torch
import torch.nn as nn
2 changes: 1 addition & 1 deletion doc/source/train/api/api.rst
@@ -23,7 +23,7 @@ Trainer Base Classes

~train.trainer.BaseTrainer
~train.data_parallel_trainer.DataParallelTrainer
~train.data_config.DataConfig
~train.DataConfig
~train.gbdt_trainer.GBDTTrainer

``BaseTrainer`` API
2 changes: 1 addition & 1 deletion python/ray/air/config.py
@@ -297,7 +297,7 @@ def from_placement_group_factory(
@Deprecated(
message="Use `ray.train.DataConfig` instead of DatasetConfig to "
"configure data ingest for training. "
"See https://docs.ray.io/en/master/ray-air/check-ingest.html for more details."
"See https://docs.ray.io/en/master/ray-air/check-ingest.html#migrating-from-the-legacy-datasetconfig-api for more details." # noqa: E501
)
class DatasetConfig:
"""Configuration for ingest of a single Dataset.
2 changes: 1 addition & 1 deletion python/ray/air/tests/test_new_dataset_config.py
@@ -7,7 +7,7 @@
from ray.air import session
from ray.air.config import ScalingConfig
from ray.data import DataIterator
from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.train.data_parallel_trainer import DataParallelTrainer


2 changes: 1 addition & 1 deletion python/ray/air/util/check_ingest.py
@@ -13,7 +13,7 @@
from ray.data.preprocessors import BatchMapper, Chain
from ray.train._internal.dataset_spec import DataParallelIngestSpec
from ray.train.data_parallel_trainer import DataParallelTrainer
from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.util.annotations import DeveloperAPI


2 changes: 1 addition & 1 deletion python/ray/train/__init__.py
@@ -1,6 +1,6 @@
from ray._private.usage import usage_lib
from ray.train.backend import BackendConfig
from ray.train.data_config import DataConfig
from ray.train._internal.data_config import DataConfig
from ray.train.constants import TRAIN_DATASET_KEY
from ray.train.trainer import TrainingIterator

2 changes: 1 addition & 1 deletion python/ray/train/_internal/backend_executor.py
@@ -8,7 +8,7 @@
from ray._private.ray_constants import env_integer
from ray.air.config import CheckpointConfig
from ray.exceptions import RayActorError
from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.air.checkpoint import Checkpoint
from ray.train._internal.session import (
TrainingResult,
File renamed without changes.
2 changes: 1 addition & 1 deletion python/ray/train/data_parallel_trainer.py
@@ -16,7 +16,7 @@
from ray.train import BackendConfig, TrainingIterator
from ray.train._internal.backend_executor import BackendExecutor, TrialInfo
from ray.train._internal.checkpoint import TuneCheckpointManager
from ray.train.data_config import DataConfig, _LegacyDataConfigWrapper
from ray.train._internal.data_config import DataConfig, _LegacyDataConfigWrapper
from ray.train._internal.utils import construct_train_func
from ray.train.constants import TRAIN_DATASET_KEY, WILDCARD_KEY
from ray.train.trainer import BaseTrainer, GenDataset
2 changes: 1 addition & 1 deletion python/ray/train/horovod/horovod_trainer.py
@@ -1,7 +1,7 @@
from typing import Dict, Callable, Optional, Union, TYPE_CHECKING

from ray.air.config import ScalingConfig, RunConfig
from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.train.trainer import GenDataset
from ray.air.checkpoint import Checkpoint

@@ -7,7 +7,7 @@
from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.air.config import RunConfig, ScalingConfig
from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.train.torch import TorchConfig
from ray.train.trainer import GenDataset

@@ -15,7 +15,7 @@
EVALUATION_DATASET_KEY,
TRAIN_DATASET_KEY,
)
from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.train.data_parallel_trainer import DataParallelTrainer
from ray.train.torch import TorchConfig, TorchTrainer
from ray.train.trainer import GenDataset
2 changes: 1 addition & 1 deletion python/ray/train/lightning/lightning_trainer.py
@@ -11,7 +11,7 @@
from ray.air.constants import MODEL_KEY
from ray.air.checkpoint import Checkpoint
from ray.data.preprocessor import Preprocessor
from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.train.trainer import GenDataset
from ray.train.torch import TorchTrainer
from ray.train.torch.config import TorchConfig
2 changes: 1 addition & 1 deletion python/ray/train/mosaic/mosaic_trainer.py
@@ -7,7 +7,7 @@

from ray.air.checkpoint import Checkpoint
from ray.air.config import RunConfig, ScalingConfig
from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.train.mosaic._mosaic_utils import RayLogger
from ray.train.torch import TorchConfig, TorchTrainer
from ray.train.trainer import GenDataset
2 changes: 1 addition & 1 deletion python/ray/train/tensorflow/tensorflow_trainer.py
@@ -1,6 +1,6 @@
from typing import Callable, Optional, Dict, Union, TYPE_CHECKING

from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.train.tensorflow.config import TensorflowConfig
from ray.train.trainer import GenDataset
from ray.train.data_parallel_trainer import DataParallelTrainer
2 changes: 1 addition & 1 deletion python/ray/train/tests/test_backend.py
@@ -18,7 +18,7 @@
TrainingWorkerError,
)

from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.train._internal.worker_group import WorkerGroup, WorkerMetadata
from ray.train.backend import Backend, BackendConfig
from ray.train.constants import (
2 changes: 1 addition & 1 deletion python/ray/train/tests/test_training_iterator.py
@@ -6,7 +6,7 @@
from ray.air.config import CheckpointConfig
from ray.train._internal.worker_group import WorkerGroup
from ray.train.trainer import TrainingIterator
from ray.train.data_config import DataConfig
from ray.train import DataConfig

import ray
from ray.air import session
2 changes: 1 addition & 1 deletion python/ray/train/torch/torch_trainer.py
@@ -2,7 +2,7 @@

from ray.air.checkpoint import Checkpoint
from ray.air.config import RunConfig, ScalingConfig
from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.train.data_parallel_trainer import DataParallelTrainer
from ray.train.torch.config import TorchConfig
from ray.train.trainer import GenDataset
2 changes: 1 addition & 1 deletion python/ray/train/trainer.py
@@ -8,7 +8,7 @@
from ray.air._internal.uri_utils import URI
from ray.air._internal.util import StartTraceback
from ray.data import Dataset
from ray.train.data_config import DataConfig
from ray.train import DataConfig
from ray.train._internal.backend_executor import (
BackendExecutor,
InactiveWorkerGroupError,
