-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-15239: [C++][Parquet] Parquet writer writes decimal as int32/64 #15244
Conversation
|
|
@pitrou @emkornfield @wjones127 Could you please take a look? Thanks! cc @liukun4515 |
@@ -199,7 +199,8 @@ class PARQUET_EXPORT WriterProperties { | |||
pagesize_(kDefaultDataPageSize), | |||
version_(ParquetVersion::PARQUET_2_4), | |||
data_page_version_(ParquetDataPageVersion::V1), | |||
created_by_(DEFAULT_CREATED_BY) {} | |||
created_by_(DEFAULT_CREATED_BY), | |||
integer_annotate_decimal_(false) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How widespread is support? IIUC, this allows for more compact decimal representation when available, which seems like it would be nice to have in most cases. The only reason not to, IMO is if most other implementations don't support reading this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason to keep the default false, is to not change enable code paths that weren't used before when people update there library version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the desire for stability, but would like to have a pathway for improved defaults to be released. It's unfortunate we always do a major release while still trying to respect stability, as there's no clear good time to make breaking changes like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is a hard problem, and don't have any guidance. The one thing I would say is that if Arrow is intended to be developer facing, then IMO developers looking to upgrade versions we should be following the "least surprise" principle as much as possible. Other opinions are valid here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
Some belated comments:
- Maybe it's because I am not a native English speaker, but I have a hard time understanding what "integer annotate decimal" means. Is "annotate" a word here?
- I agree that the defaults should remain false for the time being.
- While enabling this does allow for a more compact representation, compression might at least partially alleviate that (though unfortunately FIXED_LEN_BYTE_ARRAY doesn't seem to support any of the delta encodings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed https://issues.apache.org/jira/browse/PARQUET-2231 because parquet-mr allows DELTA_BYTE_ARRAY on FIXED_LEN_BYTE_ARRAY, while the spec doesn't say so...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks good, but would like to see an additional test for the null code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found Decimal256 doesn't work with this code path if store_schema
is set. Could you also address that? Here is a unit test that should reproduce:
TEST(ArrowReadWrite, Decimal256AsInt) {
using ::arrow::Decimal256;
using ::arrow::field;
auto type = ::arrow::decimal256(8, 4);
const char* json = R"(["1.0000", null, "-1.2345", "-1000.5678",
"-9999.9999", "9999.9999"])";
auto array = ::arrow::ArrayFromJSON(type, json);
auto table = ::arrow::Table::Make(::arrow::schema({field("root", type)}), {array});
parquet::WriterProperties::Builder builder;
// Enforce integer type to annotate decimal type
auto writer_properties = builder.enable_integer_annotate_decimal()->build();
auto props_store_schema = ArrowWriterProperties::Builder().store_schema()->build();
CheckConfiguredRoundtrip(table, table, writer_properties, props_store_schema);
}
const auto& decimal_type = static_cast<const ::arrow::DecimalType&>(*field->type()); | ||
precision = decimal_type.precision(); | ||
scale = decimal_type.scale(); | ||
length = DecimalType::DecimalSize(precision); | ||
if (properties.integer_annotate_decimal() && 1 <= precision && precision <= 18) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does c++ arrow support converting int32/int64 to decimal128 and decimal256?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean by that. This PR is how the data is encoded in the buffers, not what logical type is saved in the Parquet file. When read back to Arrow it will always come out as a Decimal, so there's no need for conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does c++ arrow support converting int32/int64 to decimal128 and decimal256?
decimal128/decimal256 support construction from integers. But the parquet itself does not use an integer array to represent decimal values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks
Fixed and added the test case. Please take a look again. Thanks! @wjones127 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thank you @wgtmac!
Benchmark runs are scheduled for baseline = 0da51b7 and contender = 7d17a5b. 7d17a5b is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
@@ -450,6 +451,22 @@ class PARQUET_EXPORT WriterProperties { | |||
return this->disable_statistics(path->ToDotString()); | |||
} | |||
|
|||
/// Enable integer type to annotate decimal type as below: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "Enable X to do Y" correct English? Just asking, as I'm not a native English speaker, but I would intuitively have expected "Allow X to do Y".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it would be nice if the docstring was more explanatory as to the consequences (the reader may not know that this may make the column more compact, and not be readable with previous Arrow C++ versions).
Should we DCHECK the precision when writing arrow? Seems I meet a bug that my precision is wrong, but it write success as Decimal128, and read as int32... @wgtmac |
Could you please provide the process to reproduce it? Or you can directly submit a PR if it is straight-forward. |
Sure, I will submit a tiny check tonight |
As the parquet specs states, DECIMAL can be used to annotate the following types:
The aim of this patch is to provide a writer option to use int32 to annotate decimal when 1 <= precision <= 9 and int64 when 10 <= precision <= 18.