fix(python): add support for pyarrow 13+ #1804

Merged: 13 commits into delta-io:main, Nov 5, 2023

Conversation

@ion-elgreco (Collaborator) commented Nov 4, 2023

Description

I built on top of @wjones127's branch from #1602. In pyarrow v13+, the ParquetWriter defaults to use_compliant_nested_type=True (see the related PR: https://github.com/apache/arrow/pull/35146/files and the docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html).

In arrow-rs/parquet the schema comparison then fails, because it expects the old non-compliant nested field names. For now we can support pyarrow 13+ by disabling that option, or by updating the file options provided by a user.
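A minimal sketch of the workaround (my own illustration, not code from this PR; the example table and the out.parquet path are assumptions): on pyarrow >= 13 the new default can be switched back explicitly when constructing the writer.

import pyarrow as pa
import pyarrow.parquet as pq

# Small example table with a nested (list) column; "out.parquet" is a
# hypothetical output path used only for illustration.
table = pa.table({"values": [[1, 2], [3]]})

# pyarrow >= 13 defaults use_compliant_nested_type to True; passing False
# restores the old non-compliant nested field names that arrow-rs expects.
with pq.ParquetWriter(
    "out.parquet", table.schema, use_compliant_nested_type=False
) as writer:
    writer.write_table(table)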

Related Issue(s)

Documentation

The github-actions bot added the binding/python label (Issues for the Python package) on Nov 4, 2023.
Comment on lines 1440 to 1452
import pyarrow

def _cast_to_equal_batch(
    batch: pyarrow.RecordBatch, schema: pyarrow.Schema
) -> pyarrow.RecordBatch:
    """
    Cast a batch to the given schema if pyarrow already considers the two equal.

    This is mostly for mapping things like list field names, which arrow-rs
    checks when looking at schema equality, but pyarrow does not.
    """
    if batch.schema == schema:
        # pyarrow sees the schemas as equal, but the cast still rewrites
        # nested field names so that arrow-rs will accept the batch.
        return pyarrow.Table.from_batches([batch]).cast(schema).to_batches()[0]
    else:
        return batch
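A small illustration of the docstring's point (my own sketch, not part of the PR): pyarrow treats two list schemas that differ only in the inner field name as equal, while arrow-rs does not, which is why the cast above exists.

import pyarrow as pa

# Legacy (non-compliant) list type: inner field named "item".
legacy = pa.schema([pa.field("col", pa.list_(pa.field("item", pa.int64())))])
# Compliant list type: inner field named "element".
compliant = pa.schema([pa.field("col", pa.list_(pa.field("element", pa.int64())))])

# pyarrow ignores the inner field name when comparing schemas, so these
# compare equal here, but arrow-rs compares the names and rejects a mismatch.
print(legacy == compliant)  # True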
Collaborator:
@ion-elgreco If we take this out, do the tests still pass?

@ion-elgreco (Collaborator, Author) replied Nov 4, 2023:
Yes, everything still passes. Shall I take it out? Or would we need it later, once we enable the compliant types again?

edit: Taking it out for now.

Collaborator:

Ideally, I think we shouldn't need it. This was an earlier attempt at fixing the compliant types issue.

@ion-elgreco (Collaborator, Author):

We should keep backwards compatibility for reads and writes when we use the compliant types. I think spark-delta also doesn't write these compliant types, but I may be wrong here.

@wjones127 (Collaborator):

Thanks for finishing this up!

@wjones127 wjones127 merged commit d98a0c2 into delta-io:main Nov 5, 2023
24 checks passed
Labels: binding/python (Issues for the Python package)
Successfully merging this pull request may close these issues: Support pyarrow>=13

3 participants