[data] [streaming] Support a streaming_repartition() operator #36724

ericl · 2023-06-22T20:46:13Z

In several use cases, it is useful to change the block size of datasets in a streaming way. The current repartition() operator is an all-to-all operator and is incompatible with streaming.

We could implement a general purpose streaming_repartition() operator that supports repartitioning in a few streaming-compatible ways:

Splitting/coalescing blocks into a certain number of rows
Splitting/coalescing blocks into a certain in-memory byte size
Splitting/coalescing blocks into K pieces

This could be implemented as a new PhysicalOperator that performs the repartitioning online. It could also replace the current SplitBlocks mechanism from #36352.
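To make the row-count mode above concrete, here is a minimal sketch of streaming split/coalesce logic. It is not Ray Data's implementation: blocks are modeled as plain Python lists of rows, and the name `stream_repartition_by_rows` and its `target_rows` parameter are hypothetical, chosen only for illustration. The point is that each output block can be emitted as soon as enough rows have arrived, so no all-to-all barrier is needed.

```python
from typing import Iterable, Iterator, List

# Stand-in for a real block type; an assumption for this sketch only.
Block = List[dict]


def stream_repartition_by_rows(
    blocks: Iterable[Block], target_rows: int
) -> Iterator[Block]:
    """Re-chunk a stream of blocks into blocks of `target_rows` rows.

    Large input blocks are split and small ones are coalesced. Row order
    is preserved, and only the final output block may be smaller.
    """
    if target_rows <= 0:
        raise ValueError("target_rows must be positive")

    buffer: Block = []
    for block in blocks:
        buffer.extend(block)
        # Emit as many full-size blocks as the buffer currently holds.
        while len(buffer) >= target_rows:
            yield buffer[:target_rows]
            buffer = buffer[target_rows:]

    # Flush the remainder once the input stream is exhausted.
    if buffer:
        yield buffer
```

A byte-size mode could follow the same shape, with the buffer threshold measured in estimated in-memory bytes rather than rows.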


luxunxiansheng · 2023-10-05T08:19:08Z

Suppose I have 20 large files and I implement a specific datasource for them. I would like to load the dataset with read_datasource using a parallelism of, say, 200. I see that the SplitBlocks function splits each block into many smaller blocks. My question is: how does SplitBlocks work? Does it split a single large file in each row into many binary parts and then coalesce them somewhere downstream?

ericl · 2023-10-05T18:27:49Z

SplitBlocks works within the read task to split the read output into multiple smaller pieces. These will remain as smaller individual blocks for the remainder of the computation unless the dataset is explicitly repartitioned.

Ray Data will automatically insert SplitBlocks to ensure the desired/autodetected parallelism is met after a read.
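As a rough illustration of the behavior described above (not the actual SplitBlocks code), the sketch below splits one read output into K contiguous pieces of near-equal size. The helper name `split_block` and the list-of-rows block model are assumptions made for this example.

```python
from typing import List

# Stand-in for a real block type; an assumption for this sketch only.
Block = List[dict]


def split_block(block: Block, k: int) -> List[Block]:
    """Split one block into at most `k` contiguous, near-equal pieces.

    Rows are never split and empty pieces are skipped, so a block with
    fewer than `k` rows yields fewer than `k` pieces.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    base, extra = divmod(len(block), k)
    pieces: List[Block] = []
    start = 0
    for i in range(k):
        # The first `extra` pieces take one additional row each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        pieces.append(block[start:start + size])
        start += size
    return pieces
```

Under this model, 20 read tasks that each split their output into 10 pieces would produce 200 blocks, and those blocks would stay separate downstream unless the dataset is repartitioned.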

ericl added the enhancement (request for new feature and/or capability), P1 (issue that should be fixed within a few weeks), and data (Ray Data-related issues) labels on Jun 22, 2023

anyscalesam added the P2 (important issue, but not time-critical) label and removed the P1 label on Nov 8, 2023

alexeykudinkin self-assigned this on Nov 22, 2024

alexeykudinkin added the P1 label and removed the P2 label on Nov 22, 2024
