GH-36882: [C++][Parquet] Default RLE for bool values in the parquet version 2.x #36955

Merged
merged 3 commits into from
Aug 31, 2023

Conversation

wgtmac
Member

@wgtmac wgtmac commented Jul 31, 2023

Rationale for this change

RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs.

What changes are included in this PR?

  • Slight breaking change in ColumnProperties to set default encoding to UNKNOWN (used to be PLAIN).
  • If UNKNOWN is given, let the column writer decide the column encoding according to the selected Parquet format version and the column type.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN.
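A minimal sketch of the rule described above, not the actual parquet-cpp code. The enums and the `ResolveEncoding` name are hypothetical stand-ins. Only the rule itself comes from the description: an explicit encoding is kept, and UNKNOWN defers to the writer, which picks RLE for boolean columns when the format version is at least 2.0 and otherwise falls back to the previous default, PLAIN.

```cpp
// Hypothetical stand-ins for the parquet-cpp enums; names are illustrative only.
enum class Encoding { PLAIN, RLE, UNKNOWN };
enum class PhysicalType { BOOLEAN, INT32, DOUBLE };
enum class FormatVersion { V1_0, V2_4, V2_6 };

// Resolve the encoding for one column.
// An explicit user choice always wins; UNKNOWN means "let the writer decide".
constexpr Encoding ResolveEncoding(Encoding requested, PhysicalType type,
                                   FormatVersion version) {
  if (requested != Encoding::UNKNOWN) return requested;
  const bool is_v2 = version != FormatVersion::V1_0;
  // Assumption: every non-boolean type keeps the old PLAIN default.
  return (type == PhysicalType::BOOLEAN && is_v2) ? Encoding::RLE
                                                  : Encoding::PLAIN;
}
```

Under these assumptions, `ResolveEncoding(Encoding::UNKNOWN, PhysicalType::BOOLEAN, FormatVersion::V2_6)` yields RLE, while the same call with `FormatVersion::V1_0` yields PLAIN.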

@github-actions

⚠️ GitHub issue #36882 has been automatically assigned in GitHub to PR creator.

@wgtmac
Member Author

wgtmac commented Jul 31, 2023

@pitrou @mapleFU @emkornfield Please take a look when you have time, thanks!

@github-actions github-actions bot added the awaiting review Awaiting review label Jul 31, 2023
Encoding::type encoding = properties->encoding(descr->path());
Encoding::type default_encoding =
(descr->physical_type() == Type::BOOLEAN &&
properties->data_page_version() == ParquetDataPageVersion::V2)
Member Author

I'm not sure whether we need to check properties->version() != ParquetVersion::PARQUET_1_0. parquet-mr has no way to set the format version and always writes 1 in the footer.

Member

WriterProperties seems to support ParquetVersion::PARQUET_1_0. Maybe we can set default_encoding within WriterProperties, or extract a default_encoding function here?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 31, 2023
Member

@mapleFU mapleFU left a comment

My concern is mentioned in #36882

The code looks good, but maybe we will need to change it when we support more types and defaults in 2.0?

static const char DEFAULT_CREATED_BY[] = CREATED_BY_VERSION;
static constexpr Compression::type DEFAULT_COMPRESSION_TYPE = Compression::UNCOMPRESSED;
static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = false;

class PARQUET_EXPORT ColumnProperties {
public:
- ColumnProperties(Encoding::type encoding = DEFAULT_ENCODING,
+ ColumnProperties(std::optional<Encoding::type> encoding = std::nullopt,
Member

Personally I'm OK with this, but should we keep the constructor compatible?

Member Author

Making it compatible requires us to write a default encoding value. We have to use UNKNOWN or UNDEFINED instead of PLAIN now. This could be dirty. Usually this constructor is used internally without any parameters supplied. So I think users will not be affected.

Member

Okay, this looks ok to me

Member

Hmm... if we want to switch to std::optional in ColumnProperties we should probably do so more consistently, instead of breaking compatibility for this single property. Can this be deferred to another issue and PR?

Member Author

> Hmm... if we want to switch to std::optional in ColumnProperties we should probably do so more consistently, instead of breaking compatibility for this single property. Can this be deferred to another issue and PR?

Do you mean we can use Encoding::UNKNOWN as the default to fix the current issue without breaking the compatibility?

Member

For this PR, yes.
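A rough sketch of the sentinel approach agreed on in this thread, under stated assumptions. `ColumnPropertiesSketch` and `has_explicit_encoding` are hypothetical names, not the real `ColumnProperties` API. The point it illustrates is that the constructor parameter keeps its plain enum type, and only the default value changes from PLAIN to UNKNOWN.

```cpp
// Hypothetical stand-in for parquet::Encoding::type.
enum class Encoding { PLAIN, RLE, UNKNOWN };

// Hypothetical, trimmed-down model of a column-properties class.
class ColumnPropertiesSketch {
 public:
  // The parameter type is still Encoding, so existing call sites that pass
  // an encoding keep compiling and behave as before. Only the default value
  // changes: UNKNOWN instead of PLAIN.
  explicit ColumnPropertiesSketch(Encoding encoding = Encoding::UNKNOWN)
      : encoding_(encoding) {}

  Encoding encoding() const { return encoding_; }

  // True when the caller pinned an encoding instead of deferring to the writer.
  bool has_explicit_encoding() const { return encoding_ != Encoding::UNKNOWN; }

 private:
  Encoding encoding_;
};
```

With this shape, a default-constructed object reports no explicit encoding, so the column writer is free to choose one, while `ColumnPropertiesSketch(Encoding::PLAIN)` still pins PLAIN.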

@wgtmac
Member Author

wgtmac commented Aug 11, 2023

> My concern is mentioned in #36882
>
> The code looks good, but maybe we will need to change it when we support more types and defaults in 2.0?

I briefly checked the Parquet implementation in arrow-rs. It differs from parquet-cpp in two ways:

  • arrow-rs has no DataPageVersion setting. The WriterVersion decides the data page version: PARQUET_1_0 uses data page V1 and PARQUET_2_0 uses V2. This seems more aligned with parquet-mr.
  • WriterVersion in arrow-rs has only two values: PARQUET_1_0 and PARQUET_2_0. parquet-cpp has more fine-grained versions, so we had better not enable an encoding introduced by 2_8 (e.g. BYTE_STREAM_SPLIT) when the writer version is PARQUET_2_6 or lower. Unfortunately, parquet-mr can enable BYTE_STREAM_SPLIT even when the writer version is V1.

Now I have changed the code to use Encoding::UNKNOWN as the default. Before proceeding, I need to do more investigation on the relationship between encoding and version.

@wgtmac wgtmac changed the title GH-36882: [C++][Parquet] Default RLE for bool values in v2 pages GH-36882: [C++][Parquet] Default RLE for bool values in 2.0 Aug 15, 2023
@wgtmac
Member Author

wgtmac commented Aug 15, 2023

@mapleFU @pitrou This is ready for review. I'd like to address default encoding of other types in a separate PR.

Member

@mapleFU mapleFU left a comment

The Parquet-only part looks good to me. Let's wait for more CI runs...

@mapleFU
Member

mapleFU commented Aug 16, 2023

Would you mind running more roundtrip CI? I'm afraid this could harm other users (for example, users of other languages). If it doesn't, I'm generally OK with this patch.

@mapleFU mapleFU requested a review from pitrou August 27, 2023 12:38
@pitrou
Member

pitrou commented Aug 30, 2023

@jorisvandenbossche Do you think this would be ok?

@jorisvandenbossche
Member

Haven't looked at the code in detail, but from reading the discussion: I think it is fine to make this change (we need to be able to change defaults at some point, if we think there is broad enough support for a better default option). In addition, in this case, I think we still write datapage V1 by default? So you would only see this change in explicitly opting in for V2?

@wgtmac
Member Author

wgtmac commented Aug 30, 2023

> Haven't looked at the code in detail, but from reading the discussion: I think it is fine to make this change (we need to be able to change defaults at some point, if we think there is broad enough support for a better default option). In addition, in this case, I think we still write datapage V1 by default? So you would only see this change in explicitly opting in for V2?

Yes, we still write datapage V1 by default.

@pitrou
Member

pitrou commented Aug 30, 2023

Well, this PR uses RLE for all data pages if the Parquet version is >= 2.0, right?

@pitrou
Member

pitrou commented Aug 30, 2023

Note that WriterProperties::data_page_version() and WriterProperties::version() are two independent settings...

@wgtmac
Member Author

wgtmac commented Aug 30, 2023

> Note that WriterProperties::data_page_version() and WriterProperties::version() are two independent settings...

Yes, this is where parquet-cpp differs from other implementations. It seems that if the user has enabled ParquetDataPageVersion::V2, then the ParquetVersion should not be set to PARQUET_1_0.

@pitrou
Member

pitrou commented Aug 30, 2023

Yes, but @jorisvandenbossche 's question is for the other way round: what happens if the user selects v1 data pages with Parquet version >= 2.0? Do they get RLE-encoded boolean data pages?

@wgtmac
Member Author

wgtmac commented Aug 30, 2023

Yes. The current implementation is supposed to do this.

IMO, a file counts as Parquet v2 if either of the following holds:

  • it uses data page v2
  • it uses any v2 feature: delta encoding, LZ4_RAW codec, etc.

> It seems that if the user has enabled ParquetDataPageVersion::V2, then the ParquetVersion should not be set to PARQUET_1_0.

With all the above assumptions, this patch simply checks the Parquet version and ignores the data page version.

@pitrou
Member

pitrou commented Aug 30, 2023

Again, @jorisvandenbossche said:

> I think we still write datapage V1 by default? So you would only see this change in explicitly opting in for V2?

But with this PR, RLE is selected by default, right?

@pitrou
Member

pitrou commented Aug 30, 2023

Note I'm not objecting to the PR. Just pointing out that your answer to @jorisvandenbossche 's question seems incorrect.

@jorisvandenbossche
Member

Yes, now I am confused ;)

> what happens if the user selects v1 data pages with Parquet version >= 2.0? Do they get RLE-encoded boolean data pages?

> Yes. The current implementation is supposed to do this.

Note that this is the default situation for pyarrow users (without the user selecting anything specific): you get version "2.+" features (e.g. unsigned integers, nanoseconds) but with data_page v1.

But looking at the code, I assume the above statement is indeed correct: it just looks at the Parquet version (enabling RLE for >= 2.0), not the DataPage version.

I think it's then mostly the mention of "DataPage v2" in the issue and the code that makes it confusing, as the current PR is not tied to the DataPage version at all?

@wgtmac
Member Author

wgtmac commented Aug 31, 2023

Yes, the PR (and issue) title is misleading. Originally I implemented this by checking only the DataPageVersion, which is the same behavior as parquet-mr. After more investigation, I found that parquet-mr conflates DataPageVersion with ParquetVersion, so I think it is better to check the ParquetVersion in the C++ implementation.

Back to the question above: yes, the default encoding is switched to RLE when parquet version is 2.x and data page version is v1.

Sorry for the confusion. I didn't know the default setting on the pyarrow side before this discussion. @jorisvandenbossche @pitrou
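To make the source of the confusion concrete, here is a hedged sketch contrasting the two checks discussed in this thread. The enums and both function names are hypothetical. The first predicate models the original iteration of the PR (data page version only), the second models the behavior as merged (format version only).

```cpp
// Hypothetical stand-ins for the two independent writer settings.
enum class DataPageVersion { V1, V2 };
enum class FormatVersion { V1_0, V2_4, V2_6 };

// First iteration of the PR: RLE for booleans only with V2 data pages.
constexpr bool RleByDataPageVersion(DataPageVersion page, FormatVersion) {
  return page == DataPageVersion::V2;
}

// Behavior as merged: RLE for booleans whenever the format version is 2.x,
// regardless of the data page version.
constexpr bool RleByFormatVersion(DataPageVersion, FormatVersion format) {
  return format != FormatVersion::V1_0;
}
```

For the default combination mentioned above (format version 2.6 with V1 data pages), the first predicate is false and the second is true, which is why the default output changed even for users who never opted in to V2 data pages.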

@wgtmac wgtmac changed the title GH-36882: [C++][Parquet] Default RLE for bool values in 2.0 GH-36882: [C++][Parquet] Default RLE for bool values in the parquet version 2.x Aug 31, 2023
Member

@pitrou pitrou left a comment

This looks ok to me on the principle, just one small question re tests.

Comment on lines +783 to +785
const auto& encodings = this->metadata_encodings();
auto iter = std::find(encodings.begin(), encodings.end(), encoding);
ASSERT_TRUE(iter != encodings.end());
Member

Why can't we just assert the value of encodings? There should be only one encoding, right?

Member Author

No, in the case of Parquet version 1.0, both PLAIN and RLE exist. The reason is here: https://github.com/apache/arrow/blob/main/cpp/src/parquet/metadata.cc#L1487-L1492
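A small sketch of why the test checks membership rather than comparing the whole list. The `Encoding` enum and `ContainsEncoding` helper are hypothetical, and the list contents used in the note below are an illustrative assumption based on the reply above, which states that both PLAIN and RLE can be recorded for a column.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for parquet::Encoding::type.
enum class Encoding { PLAIN, RLE, RLE_DICTIONARY };

// Returns true if `wanted` appears anywhere in the recorded encodings.
bool ContainsEncoding(const std::vector<Encoding>& encodings, Encoding wanted) {
  return std::find(encodings.begin(), encodings.end(), wanted) !=
         encodings.end();
}
```

If the recorded list is `{PLAIN, RLE}`, then `ContainsEncoding(list, Encoding::PLAIN)` holds, whereas asserting that the list equals `{PLAIN}` would fail.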

@pitrou
Member

pitrou commented Aug 31, 2023

Thanks a lot @wgtmac , will merge as-is.

@pitrou pitrou merged commit 66d948d into apache:main Aug 31, 2023
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Aug 31, 2023
@wgtmac
Member Author

wgtmac commented Sep 1, 2023

> Thanks a lot @wgtmac , will merge as-is.

Thank you as always!

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 66d948d.

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

@jorisvandenbossche
Member

We discovered another Parquet implementation (through our python tests, although those were not being run lately, see #37853) did not read the combination of RLE-encoded bool with datapage V1 correctly (dask/fastparquet#884).
Although a problem may not be very likely, it would be good to explicitly check with some other implementations (parquet-mr, parquet-rs?) that they can read files created with the new defaults of Parquet (Arrow) C++.

@mapleFU
Member

mapleFU commented Sep 29, 2023

> We discovered another Parquet implementation (through our python tests, although those were not being run lately, see #37853) did not read the combination of RLE-encoded bool with datapage V1 correctly

The standard is not clear here. See https://issues.apache.org/jira/browse/PARQUET-2222 @jorisvandenbossche

I don't know whether writing RLE-encoded booleans in a v1 data page is OK...

@jorisvandenbossche
Member

I just tested with datafusion (assuming this is using parquet-rs), and this reads both PLAIN and RLE fine for boolean values in V1 datapage. So that matches with what is stated in PARQUET-2222 (comment) by @wgtmac about all of parquet-mr, arrow-rs and parquet-cpp supporting either option on read.

pitrou added a commit that referenced this pull request Oct 10, 2023
…h data page and version is V2 (#38163)

### Rationale for this change

Only use RLE as BOOLEAN default encoding when data page is V2.

The previous patch (#36955) made RLE the default encoding for the Boolean type. However, parquet-cpp writes format v2 files with v1 data pages by default, so it would generate RLE-encoded Boolean columns in v1 data pages by default. As https://issues.apache.org/jira/browse/PARQUET-2222 shows, that combination still needs discussion. So we:

1. Still write RLE by default with DataPage V2. This matches the Rust Parquet implementation.
2. Don't use RLE as the default Boolean encoding when DataPage V1 is used.

### What changes are included in this PR?

Only use RLE as the default BOOLEAN encoding when both the data page version and the format version are V2.

### Are these changes tested?

Yes

### Are there any user-facing changes?

RLE encoding change for Boolean.

* Closes: #36882

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
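
The follow-up rule from this commit message can be sketched as a single predicate. This is an illustration under stated assumptions, not the code from #38163: the enums and the `UseRleForBooleanByDefault` name are hypothetical.

```cpp
// Hypothetical stand-ins for the two independent writer settings.
enum class DataPageVersion { V1, V2 };
enum class FormatVersion { V1_0, V2_4, V2_6 };

// Follow-up rule: RLE is the default boolean encoding only when the format
// version is 2.x AND the data page version is V2.
constexpr bool UseRleForBooleanByDefault(FormatVersion format,
                                         DataPageVersion page) {
  return format != FormatVersion::V1_0 && page == DataPageVersion::V2;
}
```

Under this rule, format version 2.6 with V1 data pages goes back to PLAIN by default, and RLE is only chosen once the user also opts in to V2 data pages.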
Linked issue: [C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x