[data] Add iterator batch_format=None support, which will yield batches in the current batch format with zero copies #33562
Conversation
```diff
@@ -379,7 +379,7 @@ def map_batches(
         *,
         batch_size: Optional[Union[int, Literal["default"]]] = "default",
         compute: Optional[Union[str, ComputeStrategy]] = None,
-        batch_format: Literal["default", "pandas", "pyarrow", "numpy"] = "default",
+        batch_format: Optional[str] = "default",
```
Keep it as `Optional[Literal]` for full explicitness of the supported batch formats?
I feel like that is hard to maintain (given the inconsistencies already in the code), so I opted to unify on the shorter signature.
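For context, here is a minimal sketch of the two annotation styles under discussion; the alias names below are hypothetical and are not Ray's actual identifiers:

```python
from typing import Literal, Optional

# Fully explicit: every supported format is spelled out in the type,
# but each new format means updating every annotated signature.
BatchFormatLiteral = Optional[Literal["default", "pandas", "pyarrow", "numpy"]]

# Shorter signature chosen in this PR: any string (or None), with the
# set of valid values documented and checked at runtime instead.
BatchFormatStr = Optional[str]
```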
Let's also keep the documentation changes from https://github.com/ray-project/ray/pull/33536/files#diff-988f3832ac94d085daf61260175e2580920ebd1521dc760f58b426b94379d5b7L235?
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Done
```diff
@@ -540,7 +540,9 @@ def map_batches(
             (promotes tables to Pandas and tensors to NumPy), ``"pandas"`` to select
             ``pandas.DataFrame``, "pyarrow" to select ``pyarrow.Table``, or
             ``"numpy"`` to select ``numpy.ndarray`` for tensor datasets and
-            ``Dict[str, numpy.ndarray]`` for tabular datasets. Default is "default".
+            ``Dict[str, numpy.ndarray]`` for tabular datasets, or None to return
+            the underlying block exactly as is with no additional formatting.
```
Nice, I like `batch_format=None` a good bit more than adding another literal string!
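To make the documented semantics concrete, here is a hedged sketch of what each `batch_format` value hands to a `map_batches` UDF; the exact block and column types depend on the Ray version and on how the dataset was created:

```python
import ray

ds = ray.data.range(8)  # small tabular dataset

def inspect(batch):
    # batch_format="pandas" -> `batch` is a pandas.DataFrame.
    # batch_format="numpy"  -> `batch` is a Dict[str, numpy.ndarray].
    # batch_format=None     -> `batch` is the underlying block as-is
    #                          (e.g. a pyarrow.Table), with no conversion.
    print(type(batch))
    return batch

ds.map_batches(inspect, batch_format="pandas").take(2)
ds.map_batches(inspect, batch_format=None).take(2)
```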
…es in the current batch format with zero copies (ray-project#33562) This PR is a cleanup of ray-project#33536 It uses "None" instead of "zero-copy" as a batch format, since None has a similar meaning for batch_size, where it means a system-chosen batch size. Here "None" also means the system chosen optimal batch format. Signed-off-by: elliottower <elliot@elliottower.com>
…project#324… (ray-project#33601) The failure in rllib should have been fixed by ray-project#33562 Verified with `python -m pytest rllib/core/learner/torch/tests/test_torch_learner.py::TestLearner::test_end_to_end_update`. Signed-off-by: elliottower <elliot@elliottower.com>
…es in the current batch format with zero copies (ray-project#33562) This PR is a cleanup of ray-project#33536 It uses "None" instead of "zero-copy" as a batch format, since None has a similar meaning for batch_size, where it means a system-chosen batch size. Here "None" also means the system chosen optimal batch format. Signed-off-by: Jack He <jackhe2345@gmail.com>
…project#324… (ray-project#33601) The failure in rllib should have been fixed by ray-project#33562 Verified with `python -m pytest rllib/core/learner/torch/tests/test_torch_learner.py::TestLearner::test_end_to_end_update`. Signed-off-by: Jack He <jackhe2345@gmail.com>
Why are these changes needed?
This PR is a cleanup of #33536.
It uses `None` instead of `"zero-copy"` as a batch format, since `None` has a similar meaning for `batch_size`, where it denotes a system-chosen batch size. Here, `None` likewise means the system-chosen optimal batch format.