Commit
[data] [streaming] Expose streaming config APIs in toplevel (ray-project#33653)

Signed-off-by: elliottower <elliot@elliottower.com>
ericl authored and elliottower committed Apr 22, 2023
1 parent 6eb2cd8 commit 3eae1ef
Showing 8 changed files with 37 additions and 8 deletions.
1 change: 1 addition & 0 deletions doc/source/data/api/api.rst
@@ -10,6 +10,7 @@ Ray Datasets API
dataset.rst
dataset_iterator.rst
dataset_pipeline.rst
+execution_options.rst
grouped_dataset.rst
dataset_context.rst
data_representations.rst
4 changes: 2 additions & 2 deletions doc/source/data/api/dataset_context.rst
@@ -12,12 +12,12 @@ Constructor
:toctree: doc/
:template: autosummary/class_with_autosummary.rst

-context.DatasetContext
+DatasetContext

Get DatasetContext
------------------

.. autosummary::
:toctree: doc/

-context.DatasetContext.get_current
+DatasetContext.get_current
23 changes: 23 additions & 0 deletions doc/source/data/api/execution_options.rst
@@ -0,0 +1,23 @@
.. _execution-options-api:

ExecutionOptions API
====================

.. currentmodule:: ray.data

Constructor
-----------

.. autosummary::
:toctree: doc/
:template: autosummary/class_with_autosummary.rst

ExecutionOptions

Resource Options
----------------

.. autosummary::
:toctree: doc/

ExecutionResources
6 changes: 3 additions & 3 deletions doc/source/data/dataset-internals.rst
@@ -54,8 +54,8 @@ Datasets and Placement Groups
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Datasets configures its tasks and actors to use the cluster-default scheduling strategy ("DEFAULT"). You can inspect this configuration variable here:
-:class:`ray.data.context.DatasetContext.get_current().scheduling_strategy <ray.data.context.DatasetContext>`. This scheduling strategy will schedule these tasks and actors outside any present
-placement group. If you want to force Datasets to schedule tasks within the current placement group (i.e., to use current placement group resources specifically for Datasets), you can set ``ray.data.context.DatasetContext.get_current().scheduling_strategy = None``.
+:class:`ray.data.DatasetContext.get_current().scheduling_strategy <ray.data.DatasetContext>`. This scheduling strategy will schedule these tasks and actors outside any present
+placement group. If you want to force Datasets to schedule tasks within the current placement group (i.e., to use current placement group resources specifically for Datasets), you can set ``ray.data.DatasetContext.get_current().scheduling_strategy = None``.

Consider this only for advanced use cases where you need more predictable performance. We generally recommend letting Datasets run outside placement groups, as documented in the :ref:`Datasets and Other Libraries <datasets_tune>` section.

@@ -119,7 +119,7 @@ Execution Memory

During execution, a task can read multiple input blocks, and write multiple output blocks. Input and output blocks consume both worker heap memory and shared memory via Ray's object store.

-Datasets attempts to bound its heap memory usage to `num_execution_slots * max_block_size`. The number of execution slots is by default equal to the number of CPUs, unless custom resources are specified. The maximum block size is set by the configuration parameter `ray.data.context.DatasetContext.target_max_block_size` and is set to 512MiB by default. When a task's output is larger than this value, the worker will automatically split the output into multiple smaller blocks to avoid running out of heap memory.
+Datasets attempts to bound its heap memory usage to `num_execution_slots * max_block_size`. The number of execution slots is by default equal to the number of CPUs, unless custom resources are specified. The maximum block size is set by the configuration parameter `ray.data.DatasetContext.target_max_block_size` and is set to 512MiB by default. When a task's output is larger than this value, the worker will automatically split the output into multiple smaller blocks to avoid running out of heap memory.

A large block size can lead to out-of-memory situations. To avoid these issues, make sure no single item in your Dataset is too large, and always call :meth:`ds.map_batches() <ray.data.Dataset.map_batches>` with a batch size small enough that the output batch fits comfortably in memory.
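As a back-of-the-envelope check of the bound above (a standalone sketch; the CPU count is a hypothetical example, and 512 MiB is the documented default):

```python
# Heap memory bound described above: num_execution_slots * max_block_size.
num_execution_slots = 8                    # hypothetical 8-CPU node
target_max_block_size = 512 * 1024 * 1024  # 512 MiB, the documented default

heap_bound = num_execution_slots * target_max_block_size
print(heap_bound // 1024 ** 3)  # -> 4 (GiB)
```

So on an 8-CPU node with default settings, Datasets aims to keep execution heap usage near 4 GiB; raising `target_max_block_size` raises this bound proportionally.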

2 changes: 1 addition & 1 deletion doc/source/data/dataset-tensor-support.rst
@@ -236,7 +236,7 @@ below.

.. code-block::
-from ray.data.context import DatasetContext
+from ray.data import DatasetContext
ctx = DatasetContext.get_current()
ctx.enable_tensor_extension_casting = False
2 changes: 1 addition & 1 deletion doc/source/data/doc_code/key_concepts.py
@@ -54,7 +54,7 @@ def objective(*args):
# fmt: off
# __block_move_begin__
import ray
-from ray.data.context import DatasetContext
+from ray.data import DatasetContext

ctx = DatasetContext.get_current()
ctx.optimize_fuse_stages = False
2 changes: 1 addition & 1 deletion doc/source/data/performance-tips.rst
@@ -148,7 +148,7 @@ setting the ``DatasetContext.use_push_based_shuffle`` flag:
import ray.data
-ctx = ray.data.context.DatasetContext.get_current()
+ctx = ray.data.DatasetContext.get_current()
ctx.use_push_based_shuffle = True
n = 1000
5 changes: 5 additions & 0 deletions python/ray/data/__init__.py
@@ -8,7 +8,9 @@

from ray.data._internal.compute import ActorPoolStrategy
from ray.data._internal.progress_bar import set_progress_bars
+from ray.data._internal.execution.interfaces import ExecutionOptions, ExecutionResources
from ray.data.dataset import Dataset
+from ray.data.context import DatasetContext
from ray.data.dataset_iterator import DatasetIterator
from ray.data.dataset_pipeline import DatasetPipeline
from ray.data.datasource import Datasource, ReadTask
@@ -56,9 +58,12 @@
__all__ = [
"ActorPoolStrategy",
"Dataset",
"DatasetContext",
"DatasetIterator",
"DatasetPipeline",
"Datasource",
"ExecutionOptions",
"ExecutionResources",
"ReadTask",
"from_dask",
"from_items",
