-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Fix to_pandas
error when multiple block types
#48583
Conversation
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
builder = PandasBlockBuilder() | ||
for batch in self.iter_batches(batch_format="pandas", batch_size=None): | ||
builder.add_block(batch) | ||
block = builder.build() | ||
|
||
# `PandasBlockBuilder` creates a dataframe with internal extension types like | ||
# 'TensorDtype'. We use the `to_pandas` method to convert these extension | ||
# types to regular types. | ||
return BlockAccessor.for_block(block).to_pandas() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if i'm understanding correctly, i think this new code now adds an extra conversion from Arrow to pandas batch. is there any way we can avoid it?
the old code:
Arrow internal -> Arrow Block -> build to big Arrow block -> convert to pd DF
the new code:
Arrow internal -> Pandas batch -> build to big Arrow block -> convert to pd DF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no step 3 ("build to big Arrow block") in the new code, since we're using PandasBlockBuilder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah you're right, i mistakenly thought it converts to arrow tables.
builder = PandasBlockBuilder() | ||
for batch in self.iter_batches(batch_format="pandas", batch_size=None): | ||
builder.add_block(batch) | ||
block = builder.build() | ||
|
||
# `PandasBlockBuilder` creates a dataframe with internal extension types like | ||
# 'TensorDtype'. We use the `to_pandas` method to convert these extension | ||
# types to regular types. | ||
return BlockAccessor.for_block(block).to_pandas() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no step 3 ("build to big Arrow block") in the new code, since we're using PandasBlockBuilder
…8583) Two issues: 1. When the execution plan caches the dataset schema, it might call `unify_schema` to produce a single schema from all of the bundles' schemas. The issue is that this function expects all of the input schemas to be of Arrow type, but we only check if at least one schema is of Arrow type before calling the function. 2. `to_pandas` iterates over blocks and adds them to `DelegatingBlockBuilder`. The issue is that `DelegatingBlockBuilder` expects all input blocks to be of the same type. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…8583) Two issues: 1. When the execution plan caches the dataset schema, it might call `unify_schema` to produce a single schema from all of the bundles' schemas. The issue is that this function expects all of the input schemas to be of Arrow type, but we only check if at least one schema is of Arrow type before calling the function. 2. `to_pandas` iterates over blocks and adds them to `DelegatingBlockBuilder`. The issue is that `DelegatingBlockBuilder` expects all input blocks to be of the same type. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
Why are these changes needed?
Two issues:
unify_schema
to produce a single schema from all of the bundles' schemas. The issue is that this function expects all of the input schemas to be of Arrow type, but we only check if at least one schema is of Arrow type before calling the function.to_pandas
iterates over blocks and adds them toDelegatingBlockBuilder
. The issue is thatDelegatingBlockBuilder
expects all input blocks to be of the same type.Related issue number
Fixes #48575
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.