[data][doc] Add DatasetConfig -> DataConfig migration guide #37278

Merged · 10 commits · Jul 13, 2023
Changes from 4 commits
24 changes: 23 additions & 1 deletion doc/source/ray-air/check-ingest.rst
@@ -51,7 +51,7 @@ Datasets are lazy and their execution is streamed, which means that on each epoc
Ray Data execution options
~~~~~~~~~~~~~~~~~~~~~~~~~~

Under the hood, Train configures some default Data options for ingest: limiting the data ingest memory usage to 2GB per worker, and telling it to optimize the locality of the output data for ingest. See :meth:`help(DataConfig.default_ingest_options()) <ray.train.data_config.DataConfig.default_ingest_options>` if you want to learn more and further customize these settings.
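
For example, here is a minimal sketch of customizing these defaults (this assumes the ``DataConfig`` constructor accepts an ``execution_options`` argument, as in Ray 2.6):

.. code-block:: python

    from ray.train.data_config import DataConfig

    # Start from the defaults that Train would otherwise use for ingest.
    options = DataConfig.default_ingest_options()
    # Raise the per-worker object store memory limit from 2GB to 4GB.
    options.resource_limits.object_store_memory = 4e9
    # Hand the customized options to the Trainer through DataConfig.
    dataset_config = DataConfig(execution_options=options)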

Common options you may want to adjust:

@@ -88,3 +88,25 @@ What do you need to know about this ``DataConfig`` class?
* Its ``configure`` method is called on the main actor of the Trainer group to create the data iterators for each worker.

In general, you can use ``DataConfig`` for any shared setup that has to occur ahead of time before the workers start reading data. The setup will be run at the start of each Trainer run.
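
As an illustration, here is a hypothetical sketch of such shared setup; the exact ``configure`` signature is not spelled out here, so the remaining arguments are simply forwarded:

.. code-block:: python

    from ray.train.data_config import DataConfig

    class SharedSetupDataConfig(DataConfig):
        # Hypothetical subclass: ``configure`` runs on the Trainer's main actor
        # at the start of each run, before the per-worker iterators are created.
        def configure(self, datasets, *args, **kwargs):
            # Attach a lazy, no-op transformation to every Dataset as a
            # stand-in for whatever shared setup your workload needs.
            datasets = {
                name: ds.map_batches(lambda batch: batch)
                for name, ds in datasets.items()
            }
            return super().configure(datasets, *args, **kwargs)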

Migrating from the legacy DatasetConfig API
-------------------------------------------

Starting from Ray 2.6, the ``DatasetConfig`` API is deprecated, and it will be removed in a future release. If your workloads are still using it, consider migrating to the new :class:`DataConfig <ray.train.data_config.DataConfig>` API as soon as possible.

The main difference is that preprocessing is no longer part of the Trainer. Because Dataset operations are lazy, you can apply any operations to your Datasets before passing them to the Trainer. The operations will be re-executed before each epoch.

In the following example with the legacy ``DatasetConfig`` API, we pass two Datasets ("train" and "test") to the Trainer, apply an "add_noise" preprocessor to the "train" Dataset on each epoch, and split the "train" Dataset but not the "test" Dataset.

.. literalinclude:: doc_code/air_ingest_migration.py
    :language: python
    :start-after: __legacy_api__
    :end-before: __legacy_api_end__

To migrate this example to the new :class:`DataConfig <ray.train.data_config.DataConfig>` API, we apply the "add_noise" preprocessor to the "train" Dataset prior to passing it to the Trainer, and we use ``DataConfig(datasets_to_split=["train"])`` to specify which Datasets need to be split. Note that the ``datasets_to_split`` argument is optional. By default, only the "train" Dataset is split. If you don't want to split the "train" Dataset either, use ``datasets_to_split=[]``.

.. literalinclude:: doc_code/air_ingest_migration.py
    :language: python
    :start-after: __new_api__
    :end-before: __new_api_end__
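
If you don't want any Dataset to be split, here is a minimal sketch of the ``datasets_to_split=[]`` variant mentioned above:

.. code-block:: python

    from ray.train.data_config import DataConfig

    # Every worker reads the full "train" and "test" Datasets; nothing is split.
    dataset_config = DataConfig(datasets_to_split=[])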

62 changes: 62 additions & 0 deletions doc/source/ray-air/doc_code/air_ingest_migration.py
@@ -0,0 +1,62 @@
# flake8: noqa
# isort: skip_file

# __legacy_api__
import random
import ray

from ray.air.config import ScalingConfig, DatasetConfig
from ray.data.preprocessors.batch_mapper import BatchMapper
from ray.train.torch import TorchTrainer

train_ds = ray.data.range_tensor(1000)
test_ds = ray.data.range_tensor(10)

# A randomized preprocessor that adds a random float to all values.
add_noise = BatchMapper(lambda df: df + random.random(), batch_format="pandas")

my_trainer = TorchTrainer(
    lambda: None,
    scaling_config=ScalingConfig(num_workers=1),
    datasets={
        "train": train_ds,
        "test": test_ds,
    },
    dataset_config={
        "train": DatasetConfig(
            split=True,
            # Apply the preprocessor for each epoch.
            per_epoch_preprocessor=add_noise,
        ),
        "test": DatasetConfig(
            split=False,
        ),
    },
)
my_trainer.fit()
# __legacy_api_end__

# __new_api__
from ray.train.data_config import DataConfig
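# Note (see the review discussion below): the reviewers ask for DataConfig to be
# imported from ray.train directly (i.e. `from ray.train import DataConfig`),
# since ray.train.data_config is not meant to stay a public path.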
Contributor: We should be clear here that ray.train.data_config is not a public API. All the configs should be included via ray.train directly going forward. Also see the PR description of #36706 and the REP ray-project/enhancements#36 :)

Contributor: @ericl maybe we should rename data_config to _data_config so there are not two ways to do this and people get confused?

Contributor: We should put it in a _internal folder then probably, to follow convention of other libraries.

Contributor: Ok, let's replace ray.train.data_config with ray.train everywhere in this PR before merging, and then we can do the _internal refactor later as time permits (doesn't need to be part of 2.6).

Contributor Author: So I assume nothing is needed in this PR, right?

Contributor: It is -- you need to replace ray.train.data_config.DataConfig with ray.train.DataConfig everywhere :)

Contributor Author: oh, I didn't know it's already in ray.train. I think there may be other places using ray.train.data_config as well. Will update all.

Contributor: Thanks, that's awesome :)

train_ds = ray.data.range_tensor(1000)
test_ds = ray.data.range_tensor(10)

# Apply the preprocessor before passing the Dataset to the Trainer.
# This operation is lazy. It will be re-executed for each epoch.
train_ds = add_noise.transform(train_ds)

my_trainer = TorchTrainer(
    lambda: None,
    scaling_config=ScalingConfig(num_workers=1),
    datasets={
        "train": train_ds,
        "test": test_ds,
    },
    # Specify which datasets to split.
    dataset_config=DataConfig(
        datasets_to_split=["train"],
    ),
)
my_trainer.fit()
# __new_api_end__
2 changes: 1 addition & 1 deletion python/ray/air/config.py
@@ -297,7 +297,7 @@ def from_placement_group_factory(
@Deprecated(
    message="Use `ray.train.DataConfig` instead of DatasetConfig to "
    "configure data ingest for training. "
    "See https://docs.ray.io/en/master/ray-air/check-ingest.html for more details."
    "See https://docs.ray.io/en/master/ray-air/check-ingest.html#migrating-from-the-legacy-datasetconfig-api for more details."  # noqa: E501
)
class DatasetConfig:
"""Configuration for ingest of a single Dataset.