[Ray Data] Configure block size by number of rows #48804
Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), triage (Needs triage, e.g. priority, bug/not-bug, and owning component)
Description
I'm implementing a Ray Data pipeline with checkpoints. Intuitively, I want to configure the fault-tolerance granularity of the pipeline: when it fails, I should lose the progress of at most N-1 rows. This can be achieved by setting the block size to N rows, but today that requires the following operators:
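(The original snippet didn't survive extraction; below is a minimal sketch of the kind of workaround described, assuming the block size is forced via `Dataset.repartition()`. The input source and `N` are placeholders.)

```python
import ray

N = 100  # desired fault-tolerance granularity: at most N rows lost per failure

ds = ray.data.read_parquet("s3://my-bucket/inputs/")  # placeholder source

# Force "block size = N rows" by repartitioning into total_rows / N blocks.
# ds.count() executes the pipeline up to this point to get the row count.
num_blocks = max(1, ds.count() // N)
ds = ds.repartition(num_blocks)
```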
One issue with this code snippet is that `ds.count()` returns the total number of rows without accounting for the ones that have already been checkpointed. When resuming from a failure, I get the same number of blocks, but each block is smaller than expected, which may hurt throughput. (For example, with 10,000 total rows and N = 100, the pipeline repartitions into 100 blocks; if 4,000 rows were already checkpointed, the remaining 6,000 rows are still split into 100 blocks of roughly 60 rows each.)

Meanwhile, I'm also wondering whether it makes more sense to just rely on the block size in bytes (the current Ray Data behavior). One potential issue is that a pipeline built around an LLM may have workloads with short prompts and long decoding lengths (e.g., "Write an article with a thousand words about Ray"). In this case, if the block size is based on the input data size in bytes, we will put lots of rows in one block and lose all of them on failure. So I guess the feature I really want is a way to set the maximum number of rows in a block.

cc @alexeykudinkin @scottjlee
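For context, a rough sketch contrasting today's bytes-based target with the rows-based cap requested above; `target_max_rows_per_block` is a hypothetical name for the proposed knob, not an existing `DataContext` attribute:

```python
import ray

ctx = ray.data.DataContext.get_current()

# Current behavior: blocks are split based on a target size in bytes.
ctx.target_max_block_size = 128 * 1024 * 1024  # existing bytes-based knob

# Proposed behavior (hypothetical attribute, sketched for illustration):
# cap the number of rows per block so a failure loses at most N rows,
# regardless of how large each row grows (e.g., long LLM decodes).
# ctx.target_max_rows_per_block = 100
```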
Use case
No response