[data][doc] Add DatasetConfig -> DataConfig migration guide #37278

Merged · 10 commits · Jul 13, 2023
Changes from 4 commits
24 changes: 23 additions & 1 deletion doc/source/ray-air/check-ingest.rst
@@ -51,7 +51,7 @@ Datasets are lazy and their execution is streamed, which means that on each epoc
Ray Data execution options
~~~~~~~~~~~~~~~~~~~~~~~~~~

Under the hood, Train configures some default Data options for ingest: limiting the data ingest memory usage to 2GB per worker, and telling it to optimize the locality of the output data for ingest. See :meth:`help(DataConfig.default_ingest_options()) <ray.train.data_config.DataConfig.default_ingest_options>` if you want to learn more and further customize these settings.
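
For example, here is a minimal sketch of customizing these defaults (this assumes the ``DataConfig`` constructor accepts an ``execution_options`` argument, as in Ray 2.6):

.. code-block:: python

    from ray.train.data_config import DataConfig

    # Start from the defaults that Train would otherwise use for ingest.
    options = DataConfig.default_ingest_options()
    # Raise the per-worker object store memory limit from 2GB to 4GB.
    options.resource_limits.object_store_memory = 4e9
    # Hand the customized options to the Trainer through DataConfig.
    dataset_config = DataConfig(execution_options=options)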

Common options you may want to adjust:

@@ -88,3 +88,25 @@ What do you need to know about this ``DataConfig`` class?
* Its ``configure`` method is called on the main actor of the Trainer group to create the data iterators for each worker.

In general, you can use ``DataConfig`` for any shared setup that has to occur ahead of time before the workers start reading data. The setup will be run at the start of each Trainer run.
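
As an illustration, here is a hypothetical sketch of such shared setup; the exact ``configure`` signature is not spelled out here, so the remaining arguments are simply forwarded:

.. code-block:: python

    from ray.train.data_config import DataConfig

    class SharedSetupDataConfig(DataConfig):
        # Hypothetical subclass: ``configure`` runs on the Trainer's main actor
        # at the start of each run, before the per-worker iterators are created.
        def configure(self, datasets, *args, **kwargs):
            # Attach a lazy, no-op transformation to every Dataset as a
            # stand-in for whatever shared setup your workload needs.
            datasets = {
                name: ds.map_batches(lambda batch: batch)
                for name, ds in datasets.items()
            }
            return super().configure(datasets, *args, **kwargs)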

Migrating from the legacy DatasetConfig API
-------------------------------------------

Starting from Ray 2.6, the ``DatasetConfig`` API is deprecated, and it will be removed in a future release. If your workloads are still using it, consider migrating to the new :class:`DataConfig <ray.train.data_config.DataConfig>` API as soon as possible.

The main difference is that preprocessing is no longer part of the Trainer. Because Dataset operations are lazy, you can apply any operations to your Datasets before passing them to the Trainer. The operations will be re-executed before each epoch.

In the following example with the legacy ``DatasetConfig`` API, we pass two Datasets ("train" and "test") to the Trainer, apply an "add_noise" preprocessor to the "train" Dataset on each epoch, and split the "train" Dataset but not the "test" Dataset.

.. literalinclude:: doc_code/air_ingest_migration.py
    :language: python
    :start-after: __legacy_api__
    :end-before: __legacy_api_end__

To migrate this example to the new :class:`DataConfig <ray.train.data_config.DataConfig>` API, we apply the "add_noise" preprocessor to the "train" Dataset prior to passing it to the Trainer, and we use ``DataConfig(datasets_to_split=["train"])`` to specify which Datasets need to be split. Note that the ``datasets_to_split`` argument is optional. By default, only the "train" Dataset is split. If you don't want to split the "train" Dataset either, use ``datasets_to_split=[]``.

.. literalinclude:: doc_code/air_ingest_migration.py
    :language: python
    :start-after: __new_api__
    :end-before: __new_api_end__
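
If you don't want any Dataset to be split, here is a minimal sketch of the ``datasets_to_split=[]`` variant mentioned above:

.. code-block:: python

    from ray.train.data_config import DataConfig

    # Every worker reads the full "train" and "test" Datasets; nothing is split.
    dataset_config = DataConfig(datasets_to_split=[])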

62 changes: 62 additions & 0 deletions doc/source/ray-air/doc_code/air_ingest_migration.py
@@ -0,0 +1,62 @@
# flake8: noqa
# isort: skip_file

# __legacy_api__
import random
import ray

from ray.air.config import ScalingConfig, DatasetConfig
from ray.data.preprocessors.batch_mapper import BatchMapper
from ray.train.torch import TorchTrainer

train_ds = ray.data.range_tensor(1000)
test_ds = ray.data.range_tensor(10)

# A randomized preprocessor that adds a random float to all values.
add_noise = BatchMapper(lambda df: df + random.random(), batch_format="pandas")

my_trainer = TorchTrainer(
    lambda: None,
    scaling_config=ScalingConfig(num_workers=1),
    datasets={
        "train": train_ds,
        "test": test_ds,
    },
    dataset_config={
        "train": DatasetConfig(
            split=True,
            # Apply the preprocessor for each epoch.
            per_epoch_preprocessor=add_noise,
        ),
        "test": DatasetConfig(
            split=False,
        ),
    },
)
my_trainer.fit()
# __legacy_api_end__

# __new_api__
from ray.train.data_config import DataConfig
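# Note (see the review discussion below): the reviewers ask for DataConfig to be
# imported from ray.train directly (i.e. `from ray.train import DataConfig`),
# since ray.train.data_config is not meant to stay a public path.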
Contributor: We should be clear here that ray.train.data_config is not a public API. All the configs should be included via ray.train directly going forward. Also see the PR description of #36706 and the REP ray-project/enhancements#36 :)

Contributor: @ericl maybe we should rename data_config to _data_config so there are not two ways to do this and people get confused?

Contributor: We should put it in a _internal folder then probably, to follow convention of other libraries.

Contributor: Ok, let's replace ray.train.data_config with ray.train everywhere in this PR before merging, and then we can do the _internal refactor later as time permits (doesn't need to be part of 2.6).

Contributor Author: So I assume nothing is needed in this PR, right?

Contributor: It is -- you need to replace ray.train.data_config.DataConfig with ray.train.DataConfig everywhere :)

Contributor Author: oh, I didn't know it's already in ray.train. I think there may be other places using ray.train.data_config as well. Will update all.

Contributor: Thanks, that's awesome :)

train_ds = ray.data.range_tensor(1000)
test_ds = ray.data.range_tensor(10)

# Apply the preprocessor before passing the Dataset to the Trainer.
# This operation is lazy. It will be re-executed for each epoch.
train_ds = add_noise.transform(train_ds)

my_trainer = TorchTrainer(
    lambda: None,
    scaling_config=ScalingConfig(num_workers=1),
    datasets={
        "train": train_ds,
        "test": test_ds,
    },
    # Specify which datasets to split.
    dataset_config=DataConfig(
        datasets_to_split=["train"],
    ),
)
my_trainer.fit()
# __new_api_end__
2 changes: 1 addition & 1 deletion python/ray/air/config.py
@@ -297,7 +297,7 @@ def from_placement_group_factory(
@Deprecated(
    message="Use `ray.train.DataConfig` instead of DatasetConfig to "
    "configure data ingest for training. "
    "See https://docs.ray.io/en/master/ray-air/check-ingest.html for more details."
    "See https://docs.ray.io/en/master/ray-air/check-ingest.html#migrating-from-the-legacy-datasetconfig-api for more details."  # noqa: E501
)
class DatasetConfig:
"""Configuration for ingest of a single Dataset.