[Ray Data] Configure block size by number of rows #48804

Open
comaniac opened this issue Nov 19, 2024 · 0 comments
Assignees: alexeykudinkin
Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), triage (Needs triage: priority, bug/not-bug, and owning component)


Description

I'm implementing a Ray Data pipeline with checkpoints, and I want to configure its fault-tolerance granularity. For example, I want to configure the pipeline so that when it fails, I only lose the progress of at most N-1 rows. Intuitively this can be achieved by making the block size N rows, but today I have to use the following operators:

import math

ds = ray.data.read_parquet(...)
num_blocks = math.ceil(ds.count() / N)  # ceiling division so each block holds at most ~N rows
ds = ds.repartition(num_blocks)

One issue with this snippet is that ds.count() returns the total number of rows without accounting for the rows that have already been checkpointed, so when resuming from a failure I get the same number of blocks but each block is actually smaller than expected, which may hurt throughput.
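To make the mismatch concrete, here is a small arithmetic sketch; the dataset size, N, and checkpoint progress are purely illustrative numbers, not taken from a real run:

total_rows = 10_000            # ds.count() always reports the full dataset
N = 100                        # desired fault-tolerance granularity in rows
num_blocks = total_rows // N   # 100 blocks of ~100 rows each on the first run

# Suppose 6,000 rows were already checkpointed before the failure.
remaining_rows = 4_000
# On resume, ds.count() still returns 10,000, so we repartition into the
# same 100 blocks, but only ~40 of the remaining rows land in each block.
rows_per_block_on_resume = remaining_rows / num_blocks   # ~40 < N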

Meanwhile, I'm also wondering whether it makes more sense to just rely on block size in bytes (the current Ray Data behavior). One potential issue is that when building a pipeline with an LLM, we may have workloads with short prompts and long decoding lengths (e.g., "Write an article with a thousand words about Ray"). In this case, if the block size is based on the input data size in bytes, we will put lots of rows into one block and lose all of them on a failure. So I guess the feature I really want is setting the maximum number of rows in a block.
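As a strawman, the request could look something like the sketch below. The byte-based knob DataContext.target_max_block_size exists in Ray Data today; target_max_rows_per_block is a hypothetical name invented here only to illustrate the row-based limit being asked for.

import ray

N = 1000  # placeholder fault-tolerance granularity in rows

ctx = ray.data.DataContext.get_current()

# Existing behavior: blocks are bounded by size in bytes.
ctx.target_max_block_size = 128 * 1024 * 1024  # 128 MiB

# Requested behavior: additionally bound the number of rows per block.
# NOTE: `target_max_rows_per_block` is a hypothetical option name used only
# to illustrate this feature request; it does not exist in Ray Data today.
ctx.target_max_rows_per_block = N

With a row cap like this, a workload with short prompts and long decoded outputs could no longer pack an unbounded number of rows into a single byte-bounded block, so a failure would lose at most N rows of progress per block.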

cc @alexeykudinkin @scottjlee

Use case

No response

@comaniac added the enhancement and triage labels on Nov 19, 2024
@alexeykudinkin added the data label on Nov 20, 2024
@alexeykudinkin self-assigned this on Nov 20, 2024