
GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status #36027

Closed

Conversation

@alippai (Contributor) commented Jun 11, 2023

@github-actions (bot)
Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}


@alippai alippai changed the title Detailed parquet and parquet integration support status GH-36028: [Documentation] Detailed parquet format support and parquet integration status Jun 11, 2023
@github-actions
Copy link

⚠️ GitHub issue #36028 has been automatically assigned to the PR creator.

@alippai (Contributor Author) commented Jun 11, 2023

I'm sure this is too detailed in some places, and there is a good chance it misses many useful features.

My approach was to go through the great blog post, the parquet-format changelog, the Thrift file, and the parquet-mr, arrow, and arrow-rs issue queues.

I've intentionally avoided labelling features with parquet format versions 2.4-2.10, as that would imply that a 2.9 implementation includes all 2.6 features, which might not reflect reality. Instead I've focused on the end-user public API and provided a flat list of features. I'm open to different approaches as well.

I feel particularly uncertain about the statistics and indices; I'm sure you can do that part better.

@alippai (Contributor Author) commented Jun 11, 2023

@tustvold @mapleFU @westonpace @wgtmac What do you think? Would this be useful?

@tustvold (Contributor) left a comment:

Left some comments. I would personally restrict this table to features of the actual file readers, and not query engine functionality like partitioning and concurrency - imo these are not features of a parquet implementation, but rather of a query system. IMO a parquet implementation should not be unilaterally making concurrency decisions, but rather exposing APIs that allow query engines to distribute the work however they deem fit. Similarly, partitions are a catalog detail.

I would also suggest having separate tables for supported types, encodings, compression and feature support.
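For illustration, a minimal sketch of the API shape being argued for, in pyarrow (the file name and worker count are hypothetical): the reader exposes row groups as independently readable units and leaves the scheduling entirely to the caller.

from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq

def read_group(i):
    # Each task opens its own reader; the parquet library makes no
    # concurrency decision of its own here - the caller does.
    return pq.ParquetFile("data.parquet").read_row_group(i)

num_groups = pq.ParquetFile("data.parquet").num_row_groups
with ThreadPoolExecutor(max_workers=4) as pool:
    tables = list(pool.map(read_group, range(num_groups)))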

+-------------------------------------------+-------+--------+--------+-------+-------+
| LZ4_RAW | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Hive-style partitioning | | | | | |
Contributor:

I'm not sure I'd consider this a feature of the parquet implementation; it is more a detail of the query engine, imo?

Contributor Author:

While arrow-rs needs DataFusion for this functionality, arrow handles it without Acero. I don't have a strong opinion, though.

Member:

I agree with @tustvold, partitioning is more like a high-level use case on top of file format.
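For reference, this split is visible in pyarrow today: Hive-style partitioning is handled by the dataset layer rather than by the parquet file reader. A minimal sketch (the directory path is hypothetical):

import pyarrow.dataset as ds

# Directories like year=2023/month=06/ are discovered by the dataset
# layer and surfaced as columns; the parquet file reader never sees them.
dataset = ds.dataset("warehouse/events", format="parquet", partitioning="hive")
table = dataset.to_table()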

+-------------------------------------------+-------+--------+--------+-------+-------+
| ColumnIndex statistics | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page statistics | | | | | |
Contributor:

What is this referring to?

Contributor Author:

Like I said, there is a good chance I made a mistake here. I saw this in the thrift spec: ColumnChunk -> ColumnMetaData -> Statistics
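For orientation, these per-column-chunk statistics are what pyarrow surfaces through its metadata objects; a minimal sketch with a hypothetical file name:

import pyarrow.parquet as pq

meta = pq.ParquetFile("data.parquet").metadata
# Statistics from the thrift ColumnChunk -> ColumnMetaData -> Statistics path.
stats = meta.row_group(0).column(0).statistics
if stats is not None and stats.has_min_max:
    print(stats.min, stats.max, stats.null_count)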

Member:

Could we organize these items in a layered fashion? Maybe this is a good start point: https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features

+-------------------------------------------+-------+--------+--------+-------+-------+
| Page CRC32 checksum | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Parallel partition processing | | | | | |
Contributor:

IMO this is a query engine detail, not a detail of the file format?

Contributor Author:

It's part of the Arrow API in Python.
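For example, in the Python API the concurrency decision is a reader-level flag; a minimal sketch (file name hypothetical):

import pyarrow.parquet as pq

# Here the library, not the calling engine, decides how to parallelize
# the work - the behaviour being debated above.
table = pq.read_table("data.parquet", use_threads=True)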

+-------------------------------------------+-------+--------+--------+-------+-------+
| xxHash based bloom filter | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| bloom filter length | | | | | |
Contributor:

What is this?

Contributor:

OMG, they finally added it - amazing, will get that incorporated into the rust writer/reader

Member:

> OMG, they finally added it - amazing, will get that incorporated into the rust writer/reader

I just added it recently :) Please note that the latest format is not released yet, so parquet-mr does not know about bloom_filter_length yet.

+-------------------------------------------+-------+--------+--------+-------+-------+
| BYTE_STREAM_SPLIT | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Partition pruning on the partition column | | | | | |
Contributor:

Again, this is a detail of the query engine, not the parquet implementation, imo.

Contributor Author:

Same, it's part of the current API, but I agree it's not consistent across implementations.
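As with partition discovery, the pruning itself happens in pyarrow's dataset layer; a sketch, with hypothetical paths and column names:

import pyarrow.dataset as ds

dataset = ds.dataset("warehouse/events", format="parquet", partitioning="hive")
# Files under non-matching partition directories are skipped without
# ever being opened; the parquet reader is not involved in the decision.
table = dataset.to_table(filter=ds.field("year") == 2023)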

+-------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup append / delete | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page append / delete | | | | | |
@tustvold (Contributor) Jun 11, 2023:

I don't think any implementation supports page appending; the semantics would be peculiar for things like dictionary pages. The Rust implementation does support appending column chunks, though.

Contributor Author:

Yes, likely some or most of the Page references should be ColumnChunk. I'll read more about this.

Member:

Isn't Parquet itself a write-once format that can't be appended to? I'm not sure what these are supposed to indicate. The inability to append/delete without re-writing a Parquet file is why table formats like Iceberg and Delta have proliferated.
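One reading consistent with the write-once nature of the format is that "append" can only mean adding row groups while the file is still open for writing; a minimal pyarrow sketch (names hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

batch1 = pa.table({"x": [1, 2]})
batch2 = pa.table({"x": [3, 4]})

# Each write_table call adds at least one new row group; once the
# writer closes and the footer is written, the file is immutable.
with pq.ParquetWriter("data.parquet", batch1.schema) as writer:
    writer.write_table(batch1)
    writer.write_table(batch2)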

Comment on lines +428 to +432
| Storage-aware defaults (1) | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Adaptive concurrency (2) | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Adaptive IO when pruning used (3) | | | | | |
@tustvold (Contributor) Jun 11, 2023:

I'm not sure which parquet reader these features are based on, but my 2 cents is that they indicate a problematic IO abstraction that relies on prefetching heuristics instead of pushing vectored IO down into the IO subsystem (which the Rust and the proprietary Databricks implementations do).

Contributor Author:

I wanted to capture the IO pushdown section (https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/#io-pushdown) but also added more. Likely out of scope, as none of the implementations goes into detail or provides an API.

Contributor:

Perhaps just a "Vectorized IO Pushdown". I believe there are efforts to add such an API to parquet-mr.
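For comparison, the pyarrow reader already exposes a related knob: pre_buffer coalesces the byte ranges that will actually be read into fewer, larger requests. A sketch (file and column names hypothetical):

import pyarrow.parquet as pq

# With pre_buffer, the needed column-chunk byte ranges are fetched in
# coalesced reads, which matters most on high-latency stores like S3.
table = pq.read_table("data.parquet", columns=["a", "b"], pre_buffer=True)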

+-------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup pruning using bloom filter | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using projection pushdown | | | | | |
Contributor:

Suggested change
| Page pruning using projection pushdown | | | | | |
| Column Pruning using projection pushdown | | | | | |

Member:

Isn't this also a detail of the engine choosing what columns to read or not? Or is the intent here to indicate that rows/values can be pruned based on projection directly in the parquet lib?
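Whatever the row is named, the reader-level behaviour in question looks like this in pyarrow (file and column names hypothetical):

import pyarrow.parquet as pq

# Only the column chunks for the selected columns are read and decoded.
table = pq.read_table("data.parquet", columns=["a", "b"])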

+-------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using statistics | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using bloom filter | | | | | |
Contributor:

I don't think this is supported by the format; bloom filters are per column chunk.

| Format | C++ | Python | Java | Go | Rust |
| | | | | | |
+===========================================+=======+========+========+=======+=======+
| Basic compression | | | | | |
Contributor:

I wonder if we could have separate tables for supported physical types, encodings and compression

Member:

+1 for this.

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting review" label Jun 11, 2023
@kou kou changed the title GH-36028: [Documentation] Detailed parquet format support and parquet integration status GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status Jun 11, 2023
@alippai (Contributor Author) commented Jun 12, 2023

Thanks @tustvold. I'll address the Page vs ColumnChunk issues and the other improvement ideas. It's also a good insight that the parquet vs arrow vs dataset vs query engine API separation differs across languages.



Comment on lines +367 to +373
+-------------------------------------------+-------+--------+--------+-------+-------+
| File metadata | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup metadata | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Column metadata | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
Member:

Are these intended to track the completeness of fields defined in the metadata? If yes, they are probably worth a separate table indicating the state of each field. But that sounds too complicated.
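For orientation, the fields in question are the ones pyarrow exposes on its metadata objects; a sketch with a hypothetical file name:

import pyarrow.parquet as pq

meta = pq.ParquetFile("data.parquet").metadata   # FileMetaData
print(meta.num_rows, meta.num_row_groups, meta.created_by)

rg = meta.row_group(0)                           # RowGroupMetaData
print(rg.num_rows, rg.total_byte_size)

col = rg.column(0)                               # ColumnChunkMetaData
print(col.compression, col.encodings, col.total_compressed_size)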

=================================

+-------------------------------------------+-------+--------+--------+-------+-------+
| Format | C++ | Python | Java | Go | Rust |
Member:

The Java column could be misleading here. In the arrow repo, there is a Java dataset reader that supports reading from a parquet dataset. If this is for parquet-mr, it can easily get out of sync.


@westonpace (Member):

I'll repeat what the rest said about engine/format differences and maybe offer some clarification.

In C++ the picture is pretty clear, as the APIs tend to be focused on implementation:

- There is a C++ parquet module which is purely a parquet reader.
- There is a C++ datasets library which, using Acero, offers a lot of features on top of this.

In pyarrow the picture is pretty muddled, as the APIs are more focused on user experience:

- There is a pyarrow.parquet module; however, many of its features are powered by C++ datasets. For example, the pyarrow.parquet module can read from S3 even though the C++ parquet module has no concept of S3 (it just has an abstraction for input streams).

So I agree with the others that we should probably not base the features on the python API.
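To make the S3 example concrete: the remote-store handling enters through pyarrow's filesystem abstraction rather than the parquet module itself. A sketch, assuming pyarrow's S3 support is available, with a hypothetical bucket and region:

import pyarrow.fs as fs
import pyarrow.parquet as pq

# The parquet module only consumes an input-stream abstraction; the
# S3-specific logic lives in the filesystem layer beneath it.
s3 = fs.S3FileSystem(region="us-east-1")
table = pq.read_table("my-bucket/data.parquet", filesystem=s3)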

@westonpace (Member):

Although...to play devil's advocate...it might be odd when a feature is available in the parquet reader, but not yet exposed in the query component. For example, there is some row skipping and bloom filters in the C++ parquet reader, but we haven't integrated those into the datasets layer yet.

@westonpace (Member):

Also, do we think this table might belong at https://parquet.apache.org/docs/ (and we could link to it from Arrow's docs)? For example, the parquet-mr (java) implementation and the parquet.net (C#) implementation are not involved with the arrow project but are still standalone parquet readers.

@pitrou (Member) commented Jun 15, 2023

Agreed with @westonpace.
I created https://issues.apache.org/jira/browse/PARQUET-2310 to propose adding this to the Parquet docs.

@alippai (Contributor Author) commented Jun 15, 2023

Thanks, I can do another round over the weekend, targeting the correct website and incorporating the suggestions.

@alippai (Contributor Author) commented Jun 20, 2023

Moved it to the parquet-site repo: apache/parquet-site#34

Successfully merging this pull request may close these issues.

[Docs][Parquet] Document Parquet implementation status