
[Data] Fix to_pandas error when multiple block types #48583

Merged 3 commits into master from fix-pandas-union on Nov 7, 2024

Conversation

bveeramani (Member)

Why are these changes needed?

Two issues:

  1. When the execution plan caches the dataset schema, it might call unify_schema to produce a single schema from all of the bundles' schemas. The problem is that this function expects every input schema to be an Arrow schema, but we only check whether at least one schema is an Arrow schema before calling it (see the sketch after this list).
  2. to_pandas iterates over blocks and adds them to DelegatingBlockBuilder. The problem is that DelegatingBlockBuilder expects all of its input blocks to be of the same type.
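
For illustration, a minimal sketch of the guard that issue 1 calls for, assuming plain pyarrow.Schema objects and pyarrow.unify_schemas. The function name try_unify_schemas is hypothetical; this is a standalone example, not the actual Ray internals:

import pyarrow as pa

def try_unify_schemas(schemas):
    # Only delegate to Arrow's unification when *every* schema is an Arrow
    # schema, instead of when merely one of them is (the bug described above).
    if schemas and all(isinstance(s, pa.Schema) for s in schemas):
        return pa.unify_schemas(schemas)
    # Mixed pandas/Arrow schemas: signal that no unified schema is available
    # rather than crashing inside the Arrow-only code path.
    return None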

Related issue number

Fixes #48575
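
For context, the linked issue involves a dataset whose blocks are a mix of pandas and Arrow. A hypothetical reproduction along those lines (not taken verbatim from the issue; the exact trigger may differ) would be:

import pandas as pd
import ray

# Hypothetical repro: combine a dataset backed by pandas blocks with one
# backed by Arrow blocks, then materialize the whole thing as a DataFrame.
ds_pandas = ray.data.from_pandas(pd.DataFrame({"id": [0, 1, 2]}))
ds_arrow = ray.data.range(3)
print(ds_pandas.union(ds_arrow).to_pandas())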

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines +4617 to +4625
builder = PandasBlockBuilder()
for batch in self.iter_batches(batch_format="pandas", batch_size=None):
builder.add_block(batch)
block = builder.build()

# `PandasBlockBuilder` creates a dataframe with internal extension types like
# 'TensorDtype'. We use the `to_pandas` method to convert these extension
# types to regular types.
return BlockAccessor.for_block(block).to_pandas()
Contributor:

If I'm understanding correctly, I think this new code now adds an extra conversion from Arrow to a pandas batch. Is there any way we can avoid it?

The old code:
Arrow internal -> Arrow block -> build to big Arrow block -> convert to pandas DataFrame

The new code:
Arrow internal -> pandas batch -> build to big Arrow block -> convert to pandas DataFrame

Contributor:

There's no step 3 ("build to big Arrow block") in the new code, since we're using PandasBlockBuilder

Contributor:

Ah, you're right. I mistakenly thought it converts to Arrow tables.
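
To spell out why mixed block types break the old path: the PR description notes that DelegatingBlockBuilder expects every block it receives to have the same type. A rough standalone sketch of that constraint (an assumption about its behavior, not the actual ray.data implementation) looks like:

import pandas as pd
import pyarrow as pa

class NaiveDelegatingBuilder:
    # Sketch only: the concrete output type is fixed by the first block added,
    # so a later block of a different type raises instead of being converted.
    def __init__(self):
        self._block_type = None
        self._blocks = []

    def add_block(self, block):
        if self._block_type is None:
            self._block_type = type(block)
        elif not isinstance(block, self._block_type):
            raise TypeError(
                f"Expected {self._block_type.__name__}, got {type(block).__name__}"
            )
        self._blocks.append(block)

    def build(self):
        if self._block_type is pd.DataFrame:
            return pd.concat(self._blocks, ignore_index=True)
        return pa.concat_tables(self._blocks)

By requesting batch_format="pandas" from iter_batches, the new code normalizes every batch to a pandas DataFrame up front, so a single PandasBlockBuilder can handle datasets with heterogeneous block types.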

@alexeykudinkin added the go label (add ONLY when ready to merge, run all tests) on Nov 7, 2024
@bveeramani enabled auto-merge (squash) on November 7, 2024, 20:09
@github-actions (bot) disabled auto-merge on November 7, 2024, 20:45
@bveeramani merged commit 1bc18a1 into master on Nov 7, 2024
5 checks passed
@bveeramani deleted the fix-pandas-union branch on November 7, 2024, 21:53
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
Labels
go (add ONLY when ready to merge, run all tests)
Development

Successfully merging this pull request may close these issues.

[Data] Unifying block metadata schemas for pandas and pyarrow errors out (#48575)
4 participants