-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36882: [C++][Parquet] Default RLE for bool values in the parquet version 2.x #36955
Conversation
|
@pitrou @mapleFU @emkornfield Please take a look when you have time, thanks! |
cpp/src/parquet/column_writer.cc
Outdated
Encoding::type encoding = properties->encoding(descr->path()); | ||
Encoding::type default_encoding = | ||
(descr->physical_type() == Type::BOOLEAN && | ||
properties->data_page_version() == ParquetDataPageVersion::V2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we need to check properties->version() != ParquetVersion::PARQUET_1_0
. parquet-mr does not have a way to set format version and always write 1 in the footer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WriterProperties seems to have ParquetVersion::PARQUET_1_0
support, maybe we can set default_encoding
within WriterProperties
? Or we can extract a default_encoding
function here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is mentioned in #36882
The code looks good, but maybe we would change it when we support more TYPES and default in 2.0?
cpp/src/parquet/column_writer.cc
Outdated
Encoding::type encoding = properties->encoding(descr->path()); | ||
Encoding::type default_encoding = | ||
(descr->physical_type() == Type::BOOLEAN && | ||
properties->data_page_version() == ParquetDataPageVersion::V2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WriterProperties seems to have ParquetVersion::PARQUET_1_0
support, maybe we can set default_encoding
within WriterProperties
? Or we can extract a default_encoding
function here?
cpp/src/parquet/properties.h
Outdated
static const char DEFAULT_CREATED_BY[] = CREATED_BY_VERSION; | ||
static constexpr Compression::type DEFAULT_COMPRESSION_TYPE = Compression::UNCOMPRESSED; | ||
static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = false; | ||
|
||
class PARQUET_EXPORT ColumnProperties { | ||
public: | ||
ColumnProperties(Encoding::type encoding = DEFAULT_ENCODING, | ||
ColumnProperties(std::optional<Encoding::type> encoding = std::nullopt, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I'm ok with this, but should we make constructor compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making it compatible requires us to write a default encoding value. We have to use UNKNOWN or UNDEFINED instead of PLAIN now. This could be dirty. Usually this constructor is used internally without any parameters supplied. So I think users will not be affected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this looks ok to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... if we want to switch to std::optional
in ColumnProperties
we should probably do so more consistently, instead of breaking compatibility for this single property. Can this be deferred to another issue and PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... if we want to switch to
std::optional
inColumnProperties
we should probably do so more consistently, instead of breaking compatibility for this single property. Can this be deferred to another issue and PR?
Do you mean we can use Encoding::UNKNOWN
as the default to fix the current issue without breaking the compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR, yes.
I have simply checked the parquet impl in the arrow-rs. It has two distinctions compared to parquet-cpp:
Now I have changed the code to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For parquet only part it look good to me, lets waiting for runing more ci...
Would you mind more roundtrip CI? I'm afraid this will harm other (like other language) user. If it's not I'm general ok on this patch |
@jorisvandenbossche Do you think this would be ok? |
Haven't looked at the code in detail, but from reading the discussion: I think it is fine to make this change (we need to be able to change defaults at some point, if we think there is broad enough support for a better default option). In addition, in this case, I think we still write datapage V1 by default? So you would only see this change in explicitly opting in for V2? |
Yes, we still write datapage V1 by default. |
Well, this PR uses RLE for all data pages if the Parquet version is >= 2.0, right? |
Note that |
Yes, this is something different in the parquet-cpp compared to other implementations. It seems that if user has enabled |
Yes, but @jorisvandenbossche 's question is for the other way round: what happens if the user selects v1 data pages with Parquet version >= 2.0? Do they get RLE-encoded boolean data pages? |
Yes. The current implementation is supposed to do this. IMO, a parquet v2 file can be any of the following:
With all the above assumptions, this patch simply checks the parquet version and ignores the data page version. |
Again, @jorisvandenbossche said:
But with this PR, RLE is selected by default, right? |
Note I'm not objecting to the PR. Just pointing out that your answer to @jorisvandenbossche 's question seems incorrect. |
Yes, now I am confused ;)
Note that this is the default situation for pyarrow users (without the user selecting anything in specific): you get version "2.+" features (eg unsigned integers, nanoseconds) but with data_page v1. But looking at the code, I assume that indeed the above statement is indeed correct: it just looks at the Parquet version (and enabled it for >2), not the DataPage version. I think it's then mostly the mention of "DataPage v2" in the issue and the code that makes it confusing, as the current PR is not tied to the DataPage version at all? |
Yes, the PR (and issue) title is misleading. Originally I implemented this by checking the DataPageVersion solely, which is the same behavior of parquet-mr. Then after more investigation, I found that parquet-mr has mixed DataPageVersion with ParquetVersion. So I think it is better to check the ParquetVersion in the C++ implementation. Back to the question above: yes, the default encoding is switched to RLE when parquet version is 2.x and data page version is v1. Sorry for the confusion. I didn't know the default setting on the pyarrow side before this discussion. @jorisvandenbossche @pitrou |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks ok to me on the principle, just one small question re tests.
const auto& encodings = this->metadata_encodings(); | ||
auto iter = std::find(encodings.begin(), encodings.end(), encoding); | ||
ASSERT_TRUE(iter != encodings.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we just assert the value of encodings
? There should be only one encoding, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, in the case of parquet version 1.0, both PLAIN
and RLE
exists. The reason is here: https://github.com/apache/arrow/blob/main/cpp/src/parquet/metadata.cc#L1487-L1492
Thanks a lot @wgtmac , will merge as-is. |
Thank you as always! |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 66d948d. There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
We discovered another Parquet implementation (through our python tests, although those were not being run lately, see #37853) did not read the combination of RLE-encoded bool with datapage V1 correctly (dask/fastparquet#884). |
Standard here becomes not clear. See https://issues.apache.org/jira/browse/PARQUET-2222 @jorisvandenbossche I don't know write RLE with Boolean in v1 page is ok... |
I just tested with datafusion (assuming this is using |
…h data page and version is V2 (#38163) ### Rationale for this change Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( #36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we: 1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust 2. If DataPage V1 is used, don't use RLE as default Boolean encoding. ### What changes are included in this PR? Only use RLE as BOOLEAN default encoding when both data page and version is V2. ### Are these changes tested? Yes ### Are there any user-facing changes? RLE encoding change for Boolean. * Closes: #36882 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…en both data page and version is V2 (apache#38163) ### Rationale for this change Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we: 1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust 2. If DataPage V1 is used, don't use RLE as default Boolean encoding. ### What changes are included in this PR? Only use RLE as BOOLEAN default encoding when both data page and version is V2. ### Are these changes tested? Yes ### Are there any user-facing changes? RLE encoding change for Boolean. * Closes: apache#36882 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…quet version 2.x (apache#36955) ### Rationale for this change RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs. ### What changes are included in this PR? * Slight breaking change in ColumnProperties to set default encoding to UNKNOWN (used to be PLAIN). * If UNKNOWN is given, let the column writer decide the column encoding according to the selected Parquet format version and the column type. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN. * Closes: apache#36882 Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
…en both data page and version is V2 (apache#38163) ### Rationale for this change Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we: 1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust 2. If DataPage V1 is used, don't use RLE as default Boolean encoding. ### What changes are included in this PR? Only use RLE as BOOLEAN default encoding when both data page and version is V2. ### Are these changes tested? Yes ### Are there any user-facing changes? RLE encoding change for Boolean. * Closes: apache#36882 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…quet version 2.x (apache#36955) ### Rationale for this change RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs. ### What changes are included in this PR? * Slight breaking change in ColumnProperties to set default encoding to UNKNOWN (used to be PLAIN). * If UNKNOWN is given, let the column writer decide the column encoding according to the selected Parquet format version and the column type. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN. * Closes: apache#36882 Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
…en both data page and version is V2 (apache#38163) ### Rationale for this change Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we: 1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust 2. If DataPage V1 is used, don't use RLE as default Boolean encoding. ### What changes are included in this PR? Only use RLE as BOOLEAN default encoding when both data page and version is V2. ### Are these changes tested? Yes ### Are there any user-facing changes? RLE encoding change for Boolean. * Closes: apache#36882 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs.
What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN.