-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++][Parquet] Default to compliant nested types in Parquet writer #29781
Comments
Joris Van den Bossche / @jorisvandenbossche:
— I am not super familiar, but so the simplest option is to just switch the default of the |
Antoine Pitrou / @pitrou: |
Joris Van den Bossche / @jorisvandenbossche: table = pa.table({'a': [[1, 2], [3, 4, 5]]})
pq.write_table(table, "test_nested_noncompliant.parquet")
pq.write_table(table, "test_nested_compliant.parquet", use_compliant_nested_type=True) In the latest pyarrow they both read fine, but so have different names (which can impact eg nested field refs): >>> pq.read_table("test_nested_noncompliant.parquet")
pyarrow.Table
a: list<item: int64>
child 0, item: int64
>>> pq.read_table("test_nested_compliant.parquet")
pyarrow.Table
a: list<element: int64>
child 0, element: int64 So eg doing Those files also read fine (and result in the same difference in list field names) with older versions of Arrow (tested down to Arrow 1.0). |
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou: |
Joris Van den Bossche / @jorisvandenbossche:
With the new datasets API, it actually raises an error if a column name is not found. With the legacy implementation (or the plain |
Truc Lam Nguyen / @trucnguyenlam: |
Micah Kornfield / @emkornfield:
A few item that don't really solve all cases but would make things better or at least adaptable to the long term: 1. Add an option that translates "compliant nest type" name back to the arrow's naming schema. 2. Make it possible to select columns by eliding the list name components.
Another question that is dataset specific, is if one file was written with compliant nested types and one was not, and both where read in the same dataset are the results sane (schema's get unified?) |
Micah Kornfield / @emkornfield: |
Micah Kornfield / @emkornfield: |
Jim Pivarski / @jpivarski: If it's changing how columns are named in the file (currently "fieldA.list.fieldB.list", etc.), then that would require adjustment on my side, but it would be an adjustment that depends on how the file was written, not the pyarrow version, right? If so, I'd have to support both naming conventions because both would exist in the wild. If there is a change that Awkward Array has to adjust to, then now may be a good time to do it because we're going to be rewriting the "from_parquet" function soon, as part of integrating with Dask https://github.com/ContinuumIO/dask-awkward adopting fsspec, and using pyarrow's Dataset API (to replace our manual implementation). If there's something that we can set to get the new behavior, we can turn that on now while developing the new version and be ready for the change. |
Joris Van den Bossche / @jorisvandenbossche:
Currently, I think the C++ API only deals with column indices? (at least for the Python bindings, the translation of column names to field indices happens in Python) For Python that should be relatively straightforward to implement. Opened ARROW-14286 for this.
@jpivarski yes, but that's already the case right now as well. Parquet files written by (py)arrow will use a different name for the list element compared to parquet files written by other tools (that's actually what we are trying to harmonize). So if you select a subfield of a list field by name, you already need to take into account potentially different names at the moment. |
…ult (#35146) ### Rationale for this change This has been a long-standing TODO. ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** * Closes: #29781 Lead-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
…y default (apache#35146) ### Rationale for this change This has been a long-standing TODO. ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** * Closes: apache#29781 Lead-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
…y default (apache#35146) ### Rationale for this change This has been a long-standing TODO. ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** * Closes: apache#29781 Lead-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
In C++ there is already an option to get the "compliant_nested_types" (to have the list columns follow the Parquet specification), and ARROW-11497 exposed this option in Python.
This is still set to False by default, but in the source it says "TODO: At some point we should flip this.", and in ARROW-11497 there was also some discussion about what it would take to change the default.
cc @emkornfield @pitrou
Reporter: Will Jones / @wjones127
Assignee: Will Jones / @wjones127
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-14196. Please see the migration documentation for further details.
The text was updated successfully, but these errors were encountered: