GH-36882: [C++][Parquet] Default RLE for bool values in the parquet version 2.x #36955

wgtmac · 2023-07-31T15:26:39Z

Rationale for this change

RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs.

What changes are included in this PR?

Slight breaking change in ColumnProperties to set default encoding to UNKNOWN (used to be PLAIN).
If UNKNOWN is given, let the column writer decide the column encoding according to the selected Parquet format version and the column type.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN.
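
The selection rule described above can be sketched as follows. This is a minimal, self-contained illustration and not the code in this PR: the enums and the `ResolveEncoding` helper are hypothetical stand-ins for the real Parquet C++ types, and other column types simply keep PLAIN here. Only the `UNKNOWN` sentinel and the "BOOLEAN with format version >= 2.0 selects RLE" rule are taken from the description.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real Parquet C++ enums (illustration only).
enum class Encoding { PLAIN, RLE, UNKNOWN };
enum class PhysicalType { BOOLEAN, INT32, DOUBLE };
enum class FormatVersion { V1_0, V2_4, V2_6 };

// Resolve the encoding for one column. An explicit user choice always wins;
// UNKNOWN means "let the column writer decide".
inline Encoding ResolveEncoding(Encoding requested, PhysicalType type,
                                FormatVersion version) {
  if (requested != Encoding::UNKNOWN) {
    return requested;
  }
  // Rule from this PR: BOOLEAN columns default to RLE for format 2.x.
  const bool is_v2 = version != FormatVersion::V1_0;
  const Encoding resolved = (type == PhysicalType::BOOLEAN && is_v2)
                                ? Encoding::RLE
                                : Encoding::PLAIN;
  // The sentinel itself must never reach the writer.
  assert(resolved != Encoding::UNKNOWN);
  return resolved;
}
```

Because the sentinel is resolved in one place, a caller that explicitly asks for PLAIN on a boolean column still gets PLAIN, whatever the format version.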

Closes: [C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x #36882

github-actions · 2023-07-31T15:27:16Z

⚠️ GitHub issue #36882 has been automatically assigned in GitHub to PR creator.

wgtmac · 2023-07-31T15:27:30Z

@pitrou @mapleFU @emkornfield Please take a look when you have time, thanks!

wgtmac · 2023-07-31T15:30:00Z

cpp/src/parquet/column_writer.cc

-  Encoding::type encoding = properties->encoding(descr->path());
+  Encoding::type default_encoding =
+      (descr->physical_type() == Type::BOOLEAN &&
+       properties->data_page_version() == ParquetDataPageVersion::V2)


I'm not sure if we need to check properties->version() != ParquetVersion::PARQUET_1_0. parquet-mr does not have a way to set the format version and always writes 1 in the footer.

WriterProperties seems to have ParquetVersion::PARQUET_1_0 support, maybe we can set default_encoding within WriterProperties? Or we can extract a default_encoding function here?

mapleFU

My concern is mentioned in #36882

The code looks good, but maybe we will need to change it when we support more types and defaults in 2.0?

mapleFU · 2023-08-01T04:59:44Z

cpp/src/parquet/column_writer.cc

-  Encoding::type encoding = properties->encoding(descr->path());
+  Encoding::type default_encoding =
+      (descr->physical_type() == Type::BOOLEAN &&
+       properties->data_page_version() == ParquetDataPageVersion::V2)


WriterProperties seems to have ParquetVersion::PARQUET_1_0 support, maybe we can set default_encoding within WriterProperties? Or we can extract a default_encoding function here?

mapleFU · 2023-08-01T05:02:00Z

cpp/src/parquet/properties.h

 static const char DEFAULT_CREATED_BY[] = CREATED_BY_VERSION;
 static constexpr Compression::type DEFAULT_COMPRESSION_TYPE = Compression::UNCOMPRESSED;
 static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = false;

 class PARQUET_EXPORT ColumnProperties {
 public:
-  ColumnProperties(Encoding::type encoding = DEFAULT_ENCODING,
+  ColumnProperties(std::optional<Encoding::type> encoding = std::nullopt,


Personally I'm ok with this, but should we make constructor compatible?

Making it compatible requires us to write a default encoding value. We have to use UNKNOWN or UNDEFINED instead of PLAIN now. This could be dirty. Usually this constructor is used internally without any parameters supplied. So I think users will not be affected.

Okay, this looks ok to me

Hmm... if we want to switch to std::optional in ColumnProperties we should probably do so more consistently, instead of breaking compatibility for this single property. Can this be deferred to another issue and PR?

> Hmm... if we want to switch to std::optional in ColumnProperties we should probably do so more consistently, instead of breaking compatibility for this single property. Can this be deferred to another issue and PR?

Do you mean we can use Encoding::UNKNOWN as the default to fix the current issue without breaking the compatibility?

For this PR, yes.

wgtmac · 2023-08-11T02:37:04Z

> My concern is mentioned in #36882
>
> The code looks good, but maybe we will need to change it when we support more types and defaults in 2.0?

I have briefly checked the Parquet implementation in arrow-rs. It differs from parquet-cpp in two ways:

- There is no DataPageVersion in arrow-rs. It relies on WriterVersion to decide the data page version: PARQUET_1_0 uses data page V1 and PARQUET_2_0 uses V2. This seems to be more aligned with parquet-mr.
- WriterVersion in arrow-rs has only two values: PARQUET_1_0 and PARQUET_2_0. parquet-cpp has more fine-grained versions, so we had better not enable an encoding introduced by 2_8 (e.g. BYTE_STREAM_SPLIT) when the writer version is PARQUET_2_6 or lower. Unfortunately, parquet-mr can enable BYTE_STREAM_SPLIT even when the writer version is V1.

Now I have changed the code to use Encoding::UNKNOWN as the default. Before proceeding, I need to do more investigation on the relationship between encoding and version.

wgtmac · 2023-08-15T09:39:33Z

@mapleFU @pitrou This is ready for review. I'd like to address default encoding of other types in a separate PR.

mapleFU

For the Parquet-only part it looks good to me; let's wait for more CI runs...

mapleFU · 2023-08-16T16:44:22Z

Would you mind running more round-trip CI? I'm afraid this could harm other users (for example, those using other language implementations). If it doesn't, I'm generally OK with this patch.

pitrou · 2023-08-30T10:08:54Z

@jorisvandenbossche Do you think this would be ok?

jorisvandenbossche · 2023-08-30T13:21:11Z

Haven't looked at the code in detail, but from reading the discussion: I think it is fine to make this change (we need to be able to change defaults at some point, if we think there is broad enough support for a better default option). In addition, in this case, I think we still write datapage V1 by default? So you would only see this change in explicitly opting in for V2?

wgtmac · 2023-08-30T14:25:32Z

> Haven't looked at the code in detail, but from reading the discussion: I think it is fine to make this change (we need to be able to change defaults at some point, if we think there is broad enough support for a better default option). In addition, in this case, I think we still write datapage V1 by default? So you would only see this change in explicitly opting in for V2?

Yes, we still write datapage V1 by default.

pitrou · 2023-08-30T16:05:56Z

Well, this PR uses RLE for all data pages if the Parquet version is >= 2.0, right?

pitrou · 2023-08-30T16:07:18Z

Note that WriterProperties::data_page_version() and WriterProperties::version() are two independent settings...

wgtmac · 2023-08-30T16:10:43Z

> Note that WriterProperties::data_page_version() and WriterProperties::version() are two independent settings...

Yes, this is something that differs in parquet-cpp compared to other implementations. It seems that if the user has enabled ParquetDataPageVersion::V2, then the ParquetVersion should not be set to PARQUET_1_0.

pitrou · 2023-08-30T16:11:47Z

Yes, but @jorisvandenbossche 's question is for the other way round: what happens if the user selects v1 data pages with Parquet version >= 2.0? Do they get RLE-encoded boolean data pages?

wgtmac · 2023-08-30T16:17:02Z

Yes. The current implementation is supposed to do this.

IMO, a parquet v2 file can be any of the following:

- applied data page v2
- applied any v2 feature: delta encoding, LZ4_RAW codec, etc.

> It seems that if the user has enabled ParquetDataPageVersion::V2, then the ParquetVersion should not be set to PARQUET_1_0.

With all the above assumptions, this patch simply checks the parquet version and ignores the data page version.

pitrou · 2023-08-30T16:28:53Z

Again, @jorisvandenbossche said:

> I think we still write datapage V1 by default? So you would only see this change in explicitly opting in for V2?

But with this PR, RLE is selected by default, right?

pitrou · 2023-08-30T16:29:24Z

Note I'm not objecting to the PR. Just pointing out that your answer to @jorisvandenbossche 's question seems incorrect.

jorisvandenbossche · 2023-08-30T18:00:35Z

Yes, now I am confused ;)

> what happens if the user selects v1 data pages with Parquet version >= 2.0? Do they get RLE-encoded boolean data pages?

> Yes. The current implementation is supposed to do this.

Note that this is the default situation for pyarrow users (without the user selecting anything in specific): you get version "2.+" features (eg unsigned integers, nanoseconds) but with data_page v1.

But looking at the code, I assume the above statement is indeed correct: it just looks at the Parquet version (and enables RLE for >= 2.0), not the DataPage version.

I think it's then mostly the mention of "DataPage v2" in the issue and the code that makes it confusing, as the current PR is not tied to the DataPage version at all?

wgtmac · 2023-08-31T02:38:52Z

Yes, the PR (and issue) title is misleading. Originally I implemented this by checking only the DataPageVersion, which is the same behavior as parquet-mr. After more investigation, I found that parquet-mr mixes DataPageVersion with ParquetVersion. So I think it is better to check the ParquetVersion in the C++ implementation.

Back to the question above: yes, the default encoding is switched to RLE when parquet version is 2.x and data page version is v1.
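
To make the source of the confusion concrete, the two candidate checks can be put side by side. This is only a sketch with hypothetical enums and function names (none of them are the real Parquet C++ API): the first function mirrors the early revision in the diff above, which looked at the data page version, and the second mirrors the merged revision, which looks at the Parquet format version.

```cpp
#include <cassert>

// Hypothetical stand-ins for the two independent writer settings.
enum class DataPageVersion { V1, V2 };
enum class FormatVersion { V1_0, V2_6 };
enum class Encoding { PLAIN, RLE };

// Early revision: only the data page version decides the boolean default.
inline Encoding DefaultBoolByDataPage(DataPageVersion page, FormatVersion) {
  return page == DataPageVersion::V2 ? Encoding::RLE : Encoding::PLAIN;
}

// Merged revision: only the Parquet format version decides it.
inline Encoding DefaultBoolByFormat(DataPageVersion, FormatVersion format) {
  return format != FormatVersion::V1_0 ? Encoding::RLE : Encoding::PLAIN;
}

// The default situation mentioned for pyarrow users: format 2.x features
// together with V1 data pages. The two checks disagree on exactly this case.
inline void CheckPyarrowDefaultCase() {
  assert(DefaultBoolByDataPage(DataPageVersion::V1, FormatVersion::V2_6) ==
         Encoding::PLAIN);
  assert(DefaultBoolByFormat(DataPageVersion::V1, FormatVersion::V2_6) ==
         Encoding::RLE);
}
```

The two checks also disagree on the opposite combination (V2 data pages with format 1.0), which is the combination the thread suggests users should not select.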

Sorry for the confusion. I didn't know the default setting on the pyarrow side before this discussion. @jorisvandenbossche @pitrou

pitrou

This looks ok to me on the principle, just one small question re tests.

pitrou · 2023-08-31T12:57:55Z

cpp/src/parquet/column_writer_test.cc

+    const auto& encodings = this->metadata_encodings();
+    auto iter = std::find(encodings.begin(), encodings.end(), encoding);
+    ASSERT_TRUE(iter != encodings.end());


Why can't we just assert the value of encodings? There should be only one encoding, right?

No, in the case of parquet version 1.0, both PLAIN and RLE exist. The reason is here: https://github.com/apache/arrow/blob/main/cpp/src/parquet/metadata.cc#L1487-L1492
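
The difference between the two assertion styles can be sketched in isolation. The `Encoding` enum and helper names below are hypothetical, and the sketch simply takes as given (from the reply above) that a Parquet 1.0 boolean column reports both PLAIN and RLE in its metadata; the reason lives in the linked metadata.cc lines and is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for parquet::Encoding (illustration only).
enum class Encoding { PLAIN, RLE };

// True if `wanted` appears anywhere in the reported encoding list.
inline bool ContainsEncoding(const std::vector<Encoding>& encodings,
                             Encoding wanted) {
  return std::find(encodings.begin(), encodings.end(), wanted) !=
         encodings.end();
}

inline void CheckMembershipVersusEquality() {
  // Assumed metadata for a PLAIN-encoded boolean column in format 1.0.
  const std::vector<Encoding> encodings = {Encoding::PLAIN, Encoding::RLE};
  const std::vector<Encoding> plain_only = {Encoding::PLAIN};
  // A membership check passes...
  assert(ContainsEncoding(encodings, Encoding::PLAIN));
  // ...while asserting the exact list contents would not.
  assert(encodings != plain_only);
}
```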

pitrou · 2023-08-31T16:55:18Z

Thanks a lot @wgtmac , will merge as-is.

wgtmac · 2023-09-01T01:09:34Z

> Thanks a lot @wgtmac , will merge as-is.

Thank you as always!

conbench-apache-arrow · 2023-09-03T16:24:44Z

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 66d948d.

There were 2 benchmark results indicating a performance regression:

Commit Run on ursa-i9-9960x at 2023-09-01 16:47:39Z
- dataframe-to-table (R) with dataset=type_floats, language=R
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-08, scale_factor=1

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

jorisvandenbossche · 2023-09-29T12:15:16Z

We discovered another Parquet implementation (through our python tests, although those were not being run lately, see #37853) did not read the combination of RLE-encoded bool with datapage V1 correctly (dask/fastparquet#884).
Although it may not be very likely to matter, it might be good to explicitly check with some other implementations (parquet-mr, parquet-rs?) that they can read files created with the new defaults of Parquet (Arrow) C++.

mapleFU · 2023-09-29T12:21:15Z

> We discovered another Parquet implementation (through our python tests, although those were not being run lately, see #37853) did not read the combination of RLE-encoded bool with datapage V1 correctly

The standard is not clear here. See https://issues.apache.org/jira/browse/PARQUET-2222 @jorisvandenbossche

I don't know whether writing RLE for Boolean in a V1 page is OK...

jorisvandenbossche · 2023-09-29T12:45:01Z

I just tested with datafusion (assuming this is using parquet-rs), and this reads both PLAIN and RLE fine for boolean values in V1 datapage. So that matches with what is stated in PARQUET-2222 (comment) by @wgtmac about all of parquet-mr, arrow-rs and parquet-cpp supporting either option on read.

…h data page and version is V2 (#38163)

### Rationale for this change

Only use RLE as BOOLEAN default encoding when data page is V2.

Previous patch (#36955) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we:

1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust
2. If DataPage V1 is used, don't use RLE as default Boolean encoding.

### What changes are included in this PR?

Only use RLE as BOOLEAN default encoding when both data page and version is V2.

### Are these changes tested?

Yes

### Are there any user-facing changes?

RLE encoding change for Boolean.

* Closes: #36882

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
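
The narrower rule described in this follow-up commit message can be sketched as below. The enums and the `DefaultBoolEncoding` name are hypothetical, and the condition is only a reading of "both data page and version is V2"; it is not the code from #38163.

```cpp
#include <cassert>

// Hypothetical stand-ins for the two independent writer settings.
enum class DataPageVersion { V1, V2 };
enum class FormatVersion { V1_0, V2_6 };
enum class Encoding { PLAIN, RLE };

// Follow-up rule: RLE is the boolean default only when the data page is V2
// and the format version is 2.x; every other combination keeps PLAIN.
inline Encoding DefaultBoolEncoding(DataPageVersion page,
                                    FormatVersion format) {
  const bool both_v2 =
      page == DataPageVersion::V2 && format != FormatVersion::V1_0;
  return both_v2 ? Encoding::RLE : Encoding::PLAIN;
}

inline void CheckFormatV2WithPageV1() {
  // Format 2.x with V1 data pages, the combination that surprised readers
  // of the earlier change, falls back to PLAIN under this rule.
  assert(DefaultBoolEncoding(DataPageVersion::V1, FormatVersion::V2_6) ==
         Encoding::PLAIN);
}
```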


github-actions bot added Component: Parquet Component: C++ labels Jul 31, 2023

github-actions bot added the awaiting review Awaiting review label Jul 31, 2023

wgtmac commented Jul 31, 2023

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 31, 2023

mapleFU reviewed Aug 1, 2023

wgtmac added 2 commits August 15, 2023 12:33

apacheGH-36882: [C++][Parquet] Default RLE for bool values in v2 pages

25546ee

Use Encoding::UNKNOWN as the default

a896c97

wgtmac force-pushed the bool_v2 branch from 343e324 to 73b8de7 Compare August 15, 2023 04:33

Use ParquetVersion instead of DataPageVersion

2b680af

wgtmac force-pushed the bool_v2 branch from 73b8de7 to 2b680af Compare August 15, 2023 07:53

github-actions bot added the Component: Python label Aug 15, 2023

wgtmac changed the title ~~GH-36882: [C++][Parquet] Default RLE for bool values in v2 pages~~ GH-36882: [C++][Parquet] Default RLE for bool values in 2.0 Aug 15, 2023

mapleFU reviewed Aug 15, 2023

mapleFU approved these changes Aug 16, 2023

mapleFU requested a review from pitrou August 27, 2023 12:38

wgtmac changed the title ~~GH-36882: [C++][Parquet] Default RLE for bool values in 2.0~~ GH-36882: [C++][Parquet] Default RLE for bool values in the parquet version 2.x Aug 31, 2023

pitrou reviewed Aug 31, 2023

pitrou merged commit 66d948d into apache:main Aug 31, 2023

pitrou removed the awaiting committer review Awaiting committer review label Aug 31, 2023

This was referenced Sep 25, 2023

[Python][CI] Tests involving fastparquet are never run #37853

Open

BUG: reading boolean column with RLE encoding gives wrong values dask/fastparquet#884

Closed

jorisvandenbossche mentioned this pull request Oct 9, 2023

GH-37312: [Python][Docs] Update Python docstrings to reflect new parquet encoding option #38070

Merged

This was referenced Oct 9, 2023

[C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x #36882

Closed

GH-36882: [C++][Parquet] Use RLE as BOOLEAN default encoding when both data page and version is V2 #38163

Merged
