[Ray Data] Configure block size by number of rows #48804

Open
comaniac opened this issue Nov 19, 2024 · 0 comments
Assignees: alexeykudinkin
Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), triage (Needs triage: priority, bug/not-bug, and owning component)


Description

I'm implementing a Ray Data pipeline with checkpoints, and I want to configure its fault-tolerance granularity. For example, I want to configure the pipeline so that when it fails, I only lose the progress of at most N-1 rows. Intuitively this can be achieved by making the block size N rows, but today I have to use the following operators:

import math

ds = ray.data.read_parquet(...)
num_blocks = math.ceil(ds.count() / N)  # ceiling division so each block holds at most ~N rows
ds = ds.repartition(num_blocks)

One issue with this snippet is that ds.count() returns the total number of rows without accounting for the rows that have already been checkpointed, so when resuming from a failure I get the same number of blocks but each block is actually smaller than expected, which may hurt throughput.
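To make the mismatch concrete, here is a small arithmetic sketch; the dataset size, N, and checkpoint progress are purely illustrative numbers, not taken from a real run:

total_rows = 10_000            # ds.count() always reports the full dataset
N = 100                        # desired fault-tolerance granularity in rows
num_blocks = total_rows // N   # 100 blocks of ~100 rows each on the first run

# Suppose 6,000 rows were already checkpointed before the failure.
remaining_rows = 4_000
# On resume, ds.count() still returns 10,000, so we repartition into the
# same 100 blocks, but only ~40 of the remaining rows land in each block.
rows_per_block_on_resume = remaining_rows / num_blocks   # ~40 < N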

Meanwhile, I'm also wondering whether it makes more sense to just rely on block size in bytes (the current Ray Data behavior). One potential issue is that when building a pipeline with an LLM, we may have workloads with short prompts and long decoding lengths (e.g., "Write an article with a thousand words about Ray"). In this case, if the block size is based on the input data size in bytes, we will put lots of rows into one block and lose all of them on a failure. So I guess the feature I really want is setting the maximum number of rows in a block.
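As a strawman, the request could look something like the sketch below. The byte-based knob DataContext.target_max_block_size exists in Ray Data today; target_max_rows_per_block is a hypothetical name invented here only to illustrate the row-based limit being asked for.

import ray

N = 1000  # placeholder fault-tolerance granularity in rows

ctx = ray.data.DataContext.get_current()

# Existing behavior: blocks are bounded by size in bytes.
ctx.target_max_block_size = 128 * 1024 * 1024  # 128 MiB

# Requested behavior: additionally bound the number of rows per block.
# NOTE: `target_max_rows_per_block` is a hypothetical option name used only
# to illustrate this feature request; it does not exist in Ray Data today.
ctx.target_max_rows_per_block = N

With a row cap like this, a workload with short prompts and long decoded outputs could no longer pack an unbounded number of rows into a single byte-bounded block, so a failure would lose at most N rows of progress per block.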

cc @alexeykudinkin @scottjlee

Use case

No response

@comaniac added the enhancement and triage labels on Nov 19, 2024
@alexeykudinkin added the data label on Nov 20, 2024
@alexeykudinkin self-assigned this on Nov 20, 2024