[C++][Parquet] Default to compliant nested types in Parquet writer #29781

asfimport · 2021-10-01T13:45:56Z

In C++ there is already an option to get the "compliant_nested_types" (to have the list columns follow the Parquet specification), and ARROW-11497 exposed this option in Python.

This is still set to False by default, but in the source it says "TODO: At some point we should flip this.", and in ARROW-11497 there was also some discussion about what it would take to change the default.

cc @emkornfield @pitrou

Reporter: Will Jones / @wjones127
Assignee: Will Jones / @wjones127

Related issues:

[Python][Parquet] Allow to select columns of a list field without requiring the list component names (relates to)
[Python] pyarrow parquet writer for list does not conform with Apache Parquet specification (is related to)

PRs and other links:

GitHub Pull Request #14651

_{Note: This issue was originally created as ARROW-14196. Please see the migration documentation for further details.}

asfimport · 2021-10-01T14:05:53Z

Joris Van den Bossche / @jorisvandenbossche:
Some relevant quotes from the ARROW-11497:

[Micah] I think the main reason it isn't enabled by default is it breaks round trips for arrow data. This could potentially be fixed on the reader side as well.

[Antoine] Perhaps we could convert the field name at the Arrow<->Parquet boundary.
[Micah] This should be possible but it potentially needs another flag. I think in the short term plumbing the additional flag through to python makes sense and we can figure out a longer term solution if this becomes a larger problem.

[Antoine] It should simply be the default (and obviously right) behaviour. Am I missing something?
[Micah] Backwards compatibility? It might be possible to make some inferences (haven't thought about it deeply). But I think if we were reading a conforming java produced parquet file then we would get different column names if we transformed on the border (maybe there can be some rules around Arrow metadata being present). I think we can make the default to be conforming behavior, but we should give users some level of control to preserve the old behavior.

—

I am not super familiar, but so the simplest option is to just switch the default of the compliant_nested_types option in ArrowWriterProperties. What would be the (possible backwards incompatible) consequences of that?
We would start writing a different Parquet file (but actually following the spec). But I suppose that also when reading such a file, you would then get a different name for the sub-lists (which can impact selecting a sublist with a nested field reference?)
To avoid having a breaking change on the read path, we could by default also convert the names at the Parquet->Arrow boundary (like the compliant_nested_types option already does on the Arrow->Parquet boundary). However, doing that can also break code for people currently already reading compliant parquet files ..

asfimport · 2021-10-01T14:18:20Z

Antoine Pitrou / @pitrou:
One possible investigation would be to write two different files, one with the option turned on, the other with the option turned off, and find out whether they are readable by 1) current version of Arrow 2) past versions of Arrow.

asfimport · 2021-10-01T14:31:39Z

Joris Van den Bossche / @jorisvandenbossche:
I wrote this with the latest pyarrow (master):

table = pa.table({'a': [[1, 2], [3, 4, 5]]})
pq.write_table(table, "test_nested_noncompliant.parquet")
pq.write_table(table, "test_nested_compliant.parquet", use_compliant_nested_type=True)

In the latest pyarrow they both read fine, but so have different names (which can impact eg nested field refs):

>>> pq.read_table("test_nested_noncompliant.parquet")
pyarrow.Table
a: list<item: int64>
  child 0, item: int64

>>> pq.read_table("test_nested_compliant.parquet")
pyarrow.Table
a: list<element: int64>
  child 0, element: int64

So eg doing pq.read_table("test_nested_noncompliant.parquet", columns=["a.list.item"], use_legacy_dataset=True) for works the noncompliant file, but doesn't select anything with the compliant file.

Those files also read fine (and result in the same difference in list field names) with older versions of Arrow (tested down to Arrow 1.0).

asfimport · 2021-10-01T15:01:07Z

Antoine Pitrou / @pitrou:
Is it really common to read only a list's child column without the list column itself? Otherwise we can live with the minor compatibility break, IMHO.

asfimport · 2021-10-01T15:01:48Z

Antoine Pitrou / @pitrou:
I'm also surprised that a non-existing column name wouldn't return an error instead of selecting nothing?

asfimport · 2021-10-01T15:09:45Z

Joris Van den Bossche / @jorisvandenbossche:

I'm also surprised that a non-existing column name wouldn't return an error instead of selecting nothing?

With the new datasets API, it actually raises an error if a column name is not found. With the legacy implementation (or the plain ParquetFile interface), it ignores those.
(and I was using use_legacy_dataset=True because with Datasets we don't yet support selecting nested fields ..)

asfimport · 2021-10-02T07:53:07Z

Truc Lam Nguyen / @trucnguyenlam:
[~judah.rand] sorry that I don't really understand your comment, could you please explain a little bit more? thanks

asfimport · 2021-10-08T03:37:31Z

Micah Kornfield / @emkornfield:
Yeah, I think it is really only the name change that I'm concerned about. https://issues.apache.org/jira/browse/ARROW-13151 has another example where people where trying to reference things by path that was broken for other reasons.

A few item that don't really solve all cases but would make things better or at least adaptable to the long term:

1. Add an option that translates "compliant nest type" name back to the arrow's naming schema.

2. Make it possible to select columns by eliding the list name components.

Another question that is dataset specific, is if one file was written with compliant nested types and one was not, and both where read in the same dataset are the results sane (schema's get unified?)

asfimport · 2021-10-08T03:47:05Z

Micah Kornfield / @emkornfield:
I think similar to other default changes we have discussed, a very public advertisement of the intended change and consequences would be good. At some point we should just bite the bullet. I think with the first feature (take a compliant nested type and rename it on read) it would give users a temporary fallback.

asfimport · 2021-10-08T03:50:18Z

Micah Kornfield / @emkornfield:
Also CC @jpivarski on thoughts on how this might impact awkward arrays and ways to mitigate.

asfimport · 2021-10-08T12:51:50Z

Jim Pivarski / @jpivarski:
If it's changing a default, I can just set the option explicitly, right?

If it's changing how columns are named in the file (currently "fieldA.list.fieldB.list", etc.), then that would require adjustment on my side, but it would be an adjustment that depends on how the file was written, not the pyarrow version, right? If so, I'd have to support both naming conventions because both would exist in the wild.

If there is a change that Awkward Array has to adjust to, then now may be a good time to do it because we're going to be rewriting the "from_parquet" function soon, as part of integrating with Dask

https://github.com/ContinuumIO/dask-awkward

adopting fsspec, and using pyarrow's Dataset API (to replace our manual implementation). If there's something that we can set to get the new behavior, we can turn that on now while developing the new version and be ready for the change.

asfimport · 2021-10-11T15:53:32Z

Joris Van den Bossche / @jorisvandenbossche:

Make it possible to select columns by eliding the list name components.

Currently, I think the C++ API only deals with column indices? (at least for the Python bindings, the translation of column names to field indices happens in Python) For Python that should be relatively straightforward to implement. Opened ARROW-14286 for this.

If so, I'd have to support both naming conventions because both would exist in the wild.

@jpivarski yes, but that's already the case right now as well. Parquet files written by (py)arrow will use a different name for the list element compared to parquet files written by other tools (that's actually what we are trying to harmonize). So if you select a subfield of a list field by name, you already need to take into account potentially different names at the moment.

…ult (#35146) ### Rationale for this change This has been a long-standing TODO. ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** * Closes: #29781 Lead-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>

…y default (apache#35146) ### Rationale for this change This has been a long-standing TODO. ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** * Closes: apache#29781 Lead-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>

asfimport assigned wjones127 Jan 10, 2023

This was referenced Jan 11, 2023

[Python] pyarrow parquet writer for list does not conform with Apache Parquet specification #27376

Closed

[Python][Parquet] Allow to select columns of a list field without requiring the list component names #29862

Open

wjones127 added the Breaking Change Includes a breaking change to the API label Apr 14, 2023

github-actions bot mentioned this issue Apr 14, 2023

GH-29781: [C++][Parquet] Switch to use compliant nested types by default #35146

Merged

pitrou closed this as completed in #35146 May 12, 2023

pitrou added this to the 13.0.0 milestone May 12, 2023

slotrans mentioned this issue May 18, 2023

Clickhouse produces non-compliant Parquet lists ClickHouse/ClickHouse#49997

Closed

jorisvandenbossche mentioned this issue Mar 26, 2024

Add GeoArrow encoding as an option to the specification opengeospatial/geoparquet#189

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Parquet] Default to compliant nested types in Parquet writer #29781

[C++][Parquet] Default to compliant nested types in Parquet writer #29781

asfimport commented Oct 1, 2021 •

edited

Loading

asfimport commented Oct 1, 2021

asfimport commented Oct 1, 2021

asfimport commented Oct 1, 2021

asfimport commented Oct 1, 2021

asfimport commented Oct 1, 2021

asfimport commented Oct 1, 2021

asfimport commented Oct 2, 2021

asfimport commented Oct 8, 2021

asfimport commented Oct 8, 2021

asfimport commented Oct 8, 2021

asfimport commented Oct 8, 2021

asfimport commented Oct 11, 2021

[C++][Parquet] Default to compliant nested types in Parquet writer #29781

[C++][Parquet] Default to compliant nested types in Parquet writer #29781

Comments

asfimport commented Oct 1, 2021 • edited Loading

Related issues:

PRs and other links:

asfimport commented Oct 1, 2021

asfimport commented Oct 1, 2021

asfimport commented Oct 1, 2021

asfimport commented Oct 1, 2021

asfimport commented Oct 1, 2021

asfimport commented Oct 1, 2021

asfimport commented Oct 2, 2021

asfimport commented Oct 8, 2021

asfimport commented Oct 8, 2021

asfimport commented Oct 8, 2021

asfimport commented Oct 8, 2021

asfimport commented Oct 11, 2021

asfimport commented Oct 1, 2021 •

edited

Loading