Bump to PyArrow 17.0.0 #929

Fokko · 2024-07-15T10:03:53Z

No description provided.

Fokko · 2024-07-16T10:15:25Z

Vote has been passed: https://lists.apache.org/thread/mnzdpwzhctx6yrjl16zn8hl7pcxxt575

sungwy · 2024-07-16T13:21:49Z

Amazing. Given that this issue (#936) also requires 17.0.0 for the fix, maybe it's right for us to move forward with onboarding 17.0.0

Fokko · 2024-07-16T15:48:30Z

@syun64 It has been released, and I've updated the lockfile

sungwy · 2024-07-16T15:56:01Z

@syun64 It has been released, and I've updated the lockfile

Awesome :) we live in exciting times 🎉

Fokko · 2024-07-16T15:59:42Z

It looks like the 3.9 artifacts are missing:

Devlist: https://lists.apache.org/thread/xpvsr3qvc86nbzp1sxxx9vp41wbo6d02
Github tracking issue: [Python][Packaging][Release] Missing pyarrow wheels for 17.0.0 release arrow#43289

raulcd · 2024-07-17T10:42:02Z

The missing wheels and source distribution for pyarrow 17.0.0 have been uploaded to PyPI. Sorry for the inconvenience.

Fokko · 2024-07-17T12:32:37Z

@raulcd No problem, thanks for the heads up here 👍

Fokko · 2024-07-19T12:08:36Z

@syun64 @HonahX @kevinjqliu This provides a nice cleanup of the types (and probably also a speed-up), the downside is that we have to raise the lower bound to PyArrow 17. PTAL

raulcd · 2024-07-19T12:12:58Z

tests/integration/test_writes/test_writes.py

        pa.field(
            "address",
            pa.struct([
-                pa.field("street", pa.large_string()),
-                pa.field("city", pa.large_string()),
+                pa.field("street", pa.string()),


totally outsider here but curious, was there a bug on pyarrow that made those large_string instead of string?

@raulcd - It wasn't a bug, but actually an intentional change for the time being. If we update to PyArrow 17.0.0 we will be able to revert that change, and let the encoding in the parquet file dictate whether the table should be read as a large or small type for the Table API.

raulcd · 2024-07-19T12:14:29Z

btw, are dependabot PRs automatically merged? It seems it updated pyarrow (4282d2f)

Fokko · 2024-07-19T12:23:35Z

btw, are dependabot PRs automatically merged? It seems it updated pyarrow (4282d2f)

No, we still have to merge them by hand #937. The good thing is that the current codebase runs with 17 :)

sungwy · 2024-07-19T13:01:34Z

@syun64 @HonahX @kevinjqliu This provides a nice cleanup of the types (and probably also a speed-up), the downside is that we have to raise the lower bound to PyArrow 17. PTAL

Great question @Fokko ... after thinking a lot about this the past week, here's my long answer organized by different topics of consideration

Benefits of 17.0.0

Table Scan API to_arrow() will now be able to infer the type correctly based on the encoding in a parquet file (instead of always reading as large types). If there's a discrepancy in the encodings across the files, this will result in some of the tables's data being casted to larger types. For to_arrow_batches() this means that we will be reading the batches with types matching the encoding in the parquet file, and then always casting to large types **
We will be able conform with the Iceberg Spec's physical type mapping for DecimalType: [Spec][Upstream] Mapping from DecimalType to Parquet physical type not aligned with spec #936

User's ability to use PyIceberg in applications

pyarrow 8.0.0 ~ pyarrow 17.0.0 all have the same single subdependency on numpy of the same version, which means it shouldn't be all that difficult for users to switch to a higher version of pyarrow in existing applications
However, switching versions of a core dependency comes with the risk of introducing changes to the data underneath. Although the arrow representation under the hood won't change, I wonder if there'd still be subtle differences in the API args that would need to be considered carefully in a version update. I'd imagine that it would require a lot of effort for owners of existing Production applications to update the PyArrow version and QC their output artifacts as a pre-requisite for adding a PyIceberg dependency to their application (if they want to use pyarrow for scans and writes)
This would also mean that there is only one version of PyArrow available for users to use with PyIceberg - there's some element of risk in having just one version of a package available to use. For example, what if there's a really bad issue with 17.0.0 that affects a specific use case?

** -> I'm of the impression that while this change seems to make sense from the perspective of preserving type or encoding correctness, it will actually result in a performance regression due to the fact that we will be reading most batches as small types, but having to cast them to large types (infrequently for pa.Table, but always for pa.RecordBatchReader). Another option is to always choose to cast to a small type instead in to_arrow_batch_reader

Based on these points, I'm leaning towards not aggressively increasing the lower bound to 17.0.0, at least for this minor release, but I'm very excited to hear what others think as well!

kevinjqliu · 2024-07-19T17:18:43Z

@syun64 already pointed to the cost/benefits of upgrading.

I lean more towards correctness than performance. What is the correctness issue if we do not upgrade? As I understand from the above, if the parquet file is of type string, we read it as large_string but write it as string again.

As for updating the minimum dependency to pyarrow 17.0.0, I would prefer to wait for the new arrow version to be baked for a time before we require all new versions of Pyiceberg to use it.

I also think the 0.7.0 release's feature set is getting massive. We can add this upgrade as a fast-follow release.

Fokko force-pushed the fd-test-against-pyarrow-17 branch from e972de5 to a396149 Compare July 19, 2024 12:07

Fokko marked this pull request as ready for review July 19, 2024 12:07

raulcd reviewed Jul 19, 2024

View reviewed changes

Fokko changed the title ~~Test again PyArrow 17.0.0~~ Bump to PyArrow 17.0.0 Jul 19, 2024

Fokko added this to the PyIceberg 0.8.0 release milestone Jul 23, 2024

Fokko mentioned this pull request Aug 7, 2024

Pyarrow IO property for configuring large v small types on read #986

Merged

Fokko force-pushed the fd-test-against-pyarrow-17 branch 4 times, most recently from 9969926 to 921cd84 Compare August 12, 2024 09:37

Move to PyArrow 17

73b8965

Fokko force-pushed the fd-test-against-pyarrow-17 branch from 921cd84 to 73b8965 Compare August 12, 2024 09:53

This was referenced Sep 1, 2024

[feat] push down schema casting to the record batch level #1049

Open

[feat] push down filters and positional deletes to the record batch level #1050

Open

Regression in 0.7.0 due to type coercion from "string" to "large_string" #1128

Open

Fokko closed this Oct 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bump to PyArrow 17.0.0 #929

Bump to PyArrow 17.0.0 #929

Fokko commented Jul 15, 2024

Fokko commented Jul 16, 2024

sungwy commented Jul 16, 2024

Fokko commented Jul 16, 2024

sungwy commented Jul 16, 2024

Fokko commented Jul 16, 2024 •

edited

Loading

raulcd commented Jul 17, 2024

Fokko commented Jul 17, 2024

Fokko commented Jul 19, 2024

raulcd Jul 19, 2024

sungwy Jul 19, 2024

raulcd commented Jul 19, 2024

Fokko commented Jul 19, 2024

sungwy commented Jul 19, 2024

kevinjqliu commented Jul 19, 2024

Bump to PyArrow 17.0.0 #929

Bump to PyArrow 17.0.0 #929

Conversation

Fokko commented Jul 15, 2024

Fokko commented Jul 16, 2024

sungwy commented Jul 16, 2024

Fokko commented Jul 16, 2024

sungwy commented Jul 16, 2024

Fokko commented Jul 16, 2024 • edited Loading

raulcd commented Jul 17, 2024

Fokko commented Jul 17, 2024

Fokko commented Jul 19, 2024

raulcd Jul 19, 2024

Choose a reason for hiding this comment

sungwy Jul 19, 2024

Choose a reason for hiding this comment

raulcd commented Jul 19, 2024

Fokko commented Jul 19, 2024

sungwy commented Jul 19, 2024

Benefits of 17.0.0

User's ability to use PyIceberg in applications

kevinjqliu commented Jul 19, 2024

Fokko commented Jul 16, 2024 •

edited

Loading