Commit
[data] [streaming] Expose streaming config APIs in toplevel (ray-project#33653)

Signed-off-by: elliottower <elliot@elliottower.com>
ericl authored and elliottower committed Apr 22, 2023
1 parent 6eb2cd8 commit 3eae1ef
Showing 8 changed files with 37 additions and 8 deletions.
1 change: 1 addition & 0 deletions doc/source/data/api/api.rst
@@ -10,6 +10,7 @@ Ray Datasets API
dataset.rst
dataset_iterator.rst
dataset_pipeline.rst
+execution_options.rst
grouped_dataset.rst
dataset_context.rst
data_representations.rst
4 changes: 2 additions & 2 deletions doc/source/data/api/dataset_context.rst
@@ -12,12 +12,12 @@ Constructor
:toctree: doc/
:template: autosummary/class_with_autosummary.rst

-context.DatasetContext
+DatasetContext

Get DatasetContext
------------------

.. autosummary::
:toctree: doc/

-context.DatasetContext.get_current
+DatasetContext.get_current
23 changes: 23 additions & 0 deletions doc/source/data/api/execution_options.rst
@@ -0,0 +1,23 @@
.. _execution-options-api:

ExecutionOptions API
====================

.. currentmodule:: ray.data

Constructor
-----------

.. autosummary::
:toctree: doc/
:template: autosummary/class_with_autosummary.rst

ExecutionOptions

Resource Options
----------------

.. autosummary::
:toctree: doc/

ExecutionResources
6 changes: 3 additions & 3 deletions doc/source/data/dataset-internals.rst
@@ -54,8 +54,8 @@ Datasets and Placement Groups
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Datasets configures its tasks and actors to use the cluster-default scheduling strategy ("DEFAULT"). You can inspect this configuration variable here:
-:class:`ray.data.context.DatasetContext.get_current().scheduling_strategy <ray.data.context.DatasetContext>`. This scheduling strategy will schedule these tasks and actors outside any present
-placement group. If you want to force Datasets to schedule tasks within the current placement group (i.e., to use current placement group resources specifically for Datasets), you can set ``ray.data.context.DatasetContext.get_current().scheduling_strategy = None``.
+:class:`ray.data.DatasetContext.get_current().scheduling_strategy <ray.data.DatasetContext>`. This scheduling strategy will schedule these tasks and actors outside any present
+placement group. If you want to force Datasets to schedule tasks within the current placement group (i.e., to use current placement group resources specifically for Datasets), you can set ``ray.data.DatasetContext.get_current().scheduling_strategy = None``.

Consider this only for advanced use cases where you need more predictable performance. We generally recommend letting Datasets run outside placement groups, as documented in the :ref:`Datasets and Other Libraries <datasets_tune>` section.

@@ -119,7 +119,7 @@ Execution Memory

During execution, a task can read multiple input blocks, and write multiple output blocks. Input and output blocks consume both worker heap memory and shared memory via Ray's object store.

-Datasets attempts to bound its heap memory usage to `num_execution_slots * max_block_size`. The number of execution slots is by default equal to the number of CPUs, unless custom resources are specified. The maximum block size is set by the configuration parameter `ray.data.context.DatasetContext.target_max_block_size` and is set to 512MiB by default. When a task's output is larger than this value, the worker will automatically split the output into multiple smaller blocks to avoid running out of heap memory.
+Datasets attempts to bound its heap memory usage to `num_execution_slots * max_block_size`. The number of execution slots is by default equal to the number of CPUs, unless custom resources are specified. The maximum block size is set by the configuration parameter `ray.data.DatasetContext.target_max_block_size` and is set to 512MiB by default. When a task's output is larger than this value, the worker will automatically split the output into multiple smaller blocks to avoid running out of heap memory.

A large block size can lead to out-of-memory situations. To avoid these issues, make sure no single item in your Dataset is too large, and always call :meth:`ds.map_batches() <ray.data.Dataset.map_batches>` with a batch size small enough that the output batch fits comfortably in memory.
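As a back-of-the-envelope check of the bound above (a standalone sketch; the CPU count is a hypothetical example, and 512 MiB is the documented default):

```python
# Heap memory bound described above: num_execution_slots * max_block_size.
num_execution_slots = 8                    # hypothetical 8-CPU node
target_max_block_size = 512 * 1024 * 1024  # 512 MiB, the documented default

heap_bound = num_execution_slots * target_max_block_size
print(heap_bound // 1024 ** 3)  # -> 4 (GiB)
```

So on an 8-CPU node with default settings, Datasets aims to keep execution heap usage near 4 GiB; raising `target_max_block_size` raises this bound proportionally.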

2 changes: 1 addition & 1 deletion doc/source/data/dataset-tensor-support.rst
@@ -236,7 +236,7 @@ below.

.. code-block::
-from ray.data.context import DatasetContext
+from ray.data import DatasetContext
ctx = DatasetContext.get_current()
ctx.enable_tensor_extension_casting = False
2 changes: 1 addition & 1 deletion doc/source/data/doc_code/key_concepts.py
@@ -54,7 +54,7 @@ def objective(*args):
# fmt: off
# __block_move_begin__
import ray
-from ray.data.context import DatasetContext
+from ray.data import DatasetContext

ctx = DatasetContext.get_current()
ctx.optimize_fuse_stages = False
2 changes: 1 addition & 1 deletion doc/source/data/performance-tips.rst
@@ -148,7 +148,7 @@ setting the ``DatasetContext.use_push_based_shuffle`` flag:
import ray.data
-ctx = ray.data.context.DatasetContext.get_current()
+ctx = ray.data.DatasetContext.get_current()
ctx.use_push_based_shuffle = True
n = 1000
5 changes: 5 additions & 0 deletions python/ray/data/__init__.py
@@ -8,7 +8,9 @@

from ray.data._internal.compute import ActorPoolStrategy
from ray.data._internal.progress_bar import set_progress_bars
+from ray.data._internal.execution.interfaces import ExecutionOptions, ExecutionResources
from ray.data.dataset import Dataset
+from ray.data.context import DatasetContext
from ray.data.dataset_iterator import DatasetIterator
from ray.data.dataset_pipeline import DatasetPipeline
from ray.data.datasource import Datasource, ReadTask
@@ -56,9 +58,12 @@
__all__ = [
"ActorPoolStrategy",
"Dataset",
"DatasetContext",
"DatasetIterator",
"DatasetPipeline",
"Datasource",
"ExecutionOptions",
"ExecutionResources",
"ReadTask",
"from_dask",
"from_items",
