[data] Read->SplitBlocks to ensure requested read parallelism is always met #36352
Conversation
Signed-off-by: Eric Liang <ekhliang@gmail.com>
python/ray/data/_internal/execution/operators/input_data_buffer.py
Closes #31501
```diff
@@ -130,6 +131,9 @@ def _can_fuse(self, down_op: PhysicalOperator, up_op: PhysicalOperator) -> bool:
         down_logical_op = self._op_map[down_op]
         up_logical_op = self._op_map[up_op]

+        if isinstance(up_logical_op, Read) and not up_logical_op.fusable():
```
Why the extra check if it's a `Read` op?
The `fusable` method is part of the Read class only.
I think we can define `fusable` in the base LogicalOperator class. Other ops may need it as well in the future.
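For illustration, a minimal sketch of what hoisting `fusable` into the base class could look like; the class shapes and the `split_blocks_per_task` field are assumptions for the sketch, not Ray's actual implementation:

```python
# Illustrative sketch only (not Ray Data source): a default fusable()
# on the base class removes the need for an isinstance(Read) check.

class LogicalOperator:
    def fusable(self) -> bool:
        # By default, an operator may be fused with its downstream operator.
        return True


class Read(LogicalOperator):
    def __init__(self, split_blocks_per_task: int = 1):
        # Hypothetical field: a Read whose output is split into extra
        # blocks changes the stream's partitioning, so it must not fuse.
        self._split_blocks_per_task = split_blocks_per_task

    def fusable(self) -> bool:
        return self._split_blocks_per_task <= 1


# The fusion rule can then drop the isinstance check entirely:
assert LogicalOperator().fusable()
assert not Read(split_blocks_per_task=20).fusable()
```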
Signed-off-by: Eric Liang <ekhliang@gmail.com>
This is ready for review.
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Updated.
Signed-off-by: Eric Liang <ekhliang@gmail.com>
```python
if len(read_tasks) == estimated_num_blocks:
    suffix = ""
else:
    suffix = f"->SplitBlocks({int(estimated_num_blocks / len(read_tasks))})"
```
This looks like SplitBlocks is a separate op. What about `Read(split_blocks=N)`?
Yeah, +1 for `ReadXXX(split_blocks=N)`; otherwise `Dataset.__repr__` would become confusing.
I'm not sure I understand this. The original proposal is that SplitBlocks is supposed to be a logical operator, since it only applies to the output of the read. It therefore seems clearer to use the chaining syntax of `->` instead of making it part of the Read.
Sorry if I'm missing any context, but why don't we implement `SplitBlocks` as a separate logical & physical operator? The current implementation is inside `Datasource`, so it looks like part of Read & InputDataBuffer.
I think we should, but it would get fused with Read anyway. So here we only implement it as part of Read, since we have yet to decide whether it should be a general operator.
E.g., for `dynamic_repartition()` or such.
I see, +1 to make it a general operator.
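For intuition, a standalone sketch of the splitting idea behind `SplitBlocks(N)`; the list-of-rows block format and the `split_block` helper are illustrative assumptions, not Ray Data internals:

```python
from typing import Iterable, List

# Illustrative block format: a list of row dicts. Real Ray Data blocks are
# Arrow tables or pandas DataFrames; this is an assumption for the sketch.
Block = List[dict]


def split_block(block: Block, num_splits: int) -> Iterable[Block]:
    """Split one block into up to num_splits roughly equal sub-blocks."""
    base, extra = divmod(len(block), num_splits)
    start = 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more splits requested than rows; cap at one row per block
        yield block[start:start + size]
        start += size


# E.g., 100 input files at parallelism=2000: each read task's output is
# split ~20 ways, matching the "->SplitBlocks(20)" suffix in the plan name.
rows = [{"id": i} for i in range(10)]
parts = list(split_block(rows, 4))
assert [len(p) for p in parts] == [3, 3, 2, 2]
```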
```diff
@@ -480,7 +480,7 @@ def map_batches(
         >>> ds = ds.map_batches(map_fn_with_large_output)
         >>> ds
         MapBatches(map_fn_with_large_output)
-        +- Dataset(num_blocks=1, num_rows=1, schema={item: int64})
+        +- Dataset(num_blocks=..., num_rows=1, schema={item: int64})
```
For my understanding, what does this change mean?
We have a test rule where ellipsis can match any value.
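For context, the standard-library analogue of that rule is doctest's `ELLIPSIS` flag, where `...` in expected output matches any text. A minimal self-contained example (this is not Ray's test harness, just the underlying idea):

```python
import doctest


def show_repr():
    """
    >>> show_repr()  # doctest: +ELLIPSIS
    Dataset(num_blocks=..., num_rows=1, schema={item: int64})
    """
    # "7" stands in for any run-dependent value; the "..." wildcard accepts it.
    print("Dataset(num_blocks=7, num_rows=1, schema={item: int64})")


if __name__ == "__main__":
    failures, _ = doctest.testmod()
    assert failures == 0  # the ellipsis matches num_blocks=7
```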
Merging so we can test in master. Let's discuss the future of split blocks as an operator separately.
…allelism is always met (ray-project#36352)" (ray-project#36747)"

This reverts commit 96a7145.
…ys met (ray-project#36352)

Today, the number of initial blocks of a dataset is limited to the number of input files of the datasource, regardless of the requested parallelism. This is problematic: increasing the number of blocks then requires a `repartition()` call, which is not always practical in the streaming setting. This PR inserts a streaming SplitBlocks operator that is fused with read tasks in this case, allowing arbitrarily high requested parallelism (up to the number of individual records) without a blocking repartition.

Before:

```
ray.data.read_parquet([list, of, 100, parquet, files], parallelism=2000)
# -> num_blocks = 100
```

After:

```
ray.data.read_parquet([list, of, 100, parquet, files], parallelism=2000)
# -> num_blocks = 2000
```

Limitations:
- Until ray-project#36071 merges and is integrated with Ray Data, downstream operators of the read may still block until the entire file is read, even if the read would produce multiple blocks.
- The SplitBlocks operator cannot be fused with downstream Map stages, since it changes the physical partitioning of the stream. If it were fused, the parallelism increase would not be realized, as the read output could not be split across multiple processes.

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
… is always met (ray-project#36352)" (ray-project#36747)

This reverts commit 0ab00ec.

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Why are these changes needed?
Today, the number of initial blocks of a dataset is limited to the number of input files of the datasource, regardless of the requested parallelism. This is problematic: increasing the number of blocks then requires a `repartition()` call, which is not always practical in the streaming setting.

This PR inserts a streaming SplitBlocks operator that is fused with read tasks in this case, allowing arbitrarily high requested parallelism (up to the number of individual records) without a blocking repartition.
Before:
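```
ray.data.read_parquet([list, of, 100, parquet, files], parallelism=2000)
# -> num_blocks = 100
```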
After:
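```
ray.data.read_parquet([list, of, 100, parquet, files], parallelism=2000)
# -> num_blocks = 2000
```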
Limitations:
- Until ray-project#36071 merges and is integrated with Ray Data, downstream operators of the read may still block until the entire file is read, even if the read would produce multiple blocks.
- The SplitBlocks operator cannot be fused with downstream Map stages, since it changes the physical partitioning of the stream.
Related issue number
Closes #31501 ([Dataset] Split input files to launch as many read tasks as user-specified parallelism)
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.