Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-15239: [C++][Parquet] Parquet writer writes decimal as int32/64 #15244

Merged
merged 3 commits into from
Jan 11, 2023

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Jan 7, 2023

As the parquet specs states, DECIMAL can be used to annotate the following types:

  • int32: for 1 <= precision <= 9
  • int64: for 1 <= precision <= 18; precision < 10 will produce a warning
  • fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
  • binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.

The aim of this patch is to provide a writer option to use int32 to annotate decimal when 1 <= precision <= 9 and int64 when 10 <= precision <= 18.

@github-actions
Copy link

github-actions bot commented Jan 7, 2023

@github-actions
Copy link

github-actions bot commented Jan 7, 2023

⚠️ GitHub issue #15239 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

github-actions bot commented Jan 7, 2023

⚠️ GitHub issue #15239 has no components, please add labels for components.

@wgtmac
Copy link
Member Author

wgtmac commented Jan 8, 2023

@pitrou @emkornfield @wjones127 Could you please take a look? Thanks!

cc @liukun4515

@@ -199,7 +199,8 @@ class PARQUET_EXPORT WriterProperties {
pagesize_(kDefaultDataPageSize),
version_(ParquetVersion::PARQUET_2_4),
data_page_version_(ParquetDataPageVersion::V1),
created_by_(DEFAULT_CREATED_BY) {}
created_by_(DEFAULT_CREATED_BY),
integer_annotate_decimal_(false) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How widespread is support? IIUC, this allows for more compact decimal representation when available, which seems like it would be nice to have in most cases. The only reason not to, IMO is if most other implementations don't support reading this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason to keep the default false, is to not change enable code paths that weren't used before when people update there library version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the desire for stability, but would like to have a pathway for improved defaults to be released. It's unfortunate we always do a major release while still trying to respect stability, as there's no clear good time to make breaking changes like this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is a hard problem, and don't have any guidance. The one thing I would say is that if Arrow is intended to be developer facing, then IMO developers looking to upgrade versions we should be following the "least surprise" principle as much as possible. Other opinions are valid here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
Some belated comments:

  • Maybe it's because I am not a native English speaker, but I have a hard time understanding what "integer annotate decimal" means. Is "annotate" a word here?
  • I agree that the defaults should remain false for the time being.
  • While enabling this does allow for a more compact representation, compression might at least partially alleviate that (though unfortunately FIXED_LEN_BYTE_ARRAY doesn't seem to support any of the delta encodings)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed https://issues.apache.org/jira/browse/PARQUET-2231 because parquet-mr allows DELTA_BYTE_ARRAY on FIXED_LEN_BYTE_ARRAY, while the spec doesn't say so...

Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good, but would like to see an additional test for the null code paths.

cpp/src/parquet/arrow/arrow_reader_writer_test.cc Outdated Show resolved Hide resolved
Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found Decimal256 doesn't work with this code path if store_schema is set. Could you also address that? Here is a unit test that should reproduce:

TEST(ArrowReadWrite, Decimal256AsInt) {
  using ::arrow::Decimal256;
  using ::arrow::field;

  auto type = ::arrow::decimal256(8, 4);

  const char* json = R"(["1.0000", null, "-1.2345", "-1000.5678",
                         "-9999.9999", "9999.9999"])";
  auto array = ::arrow::ArrayFromJSON(type, json);
  auto table = ::arrow::Table::Make(::arrow::schema({field("root", type)}), {array});

  parquet::WriterProperties::Builder builder;
  // Enforce integer type to annotate decimal type
  auto writer_properties = builder.enable_integer_annotate_decimal()->build();
  auto props_store_schema = ArrowWriterProperties::Builder().store_schema()->build();

  CheckConfiguredRoundtrip(table, table, writer_properties, props_store_schema);
}

cpp/src/parquet/arrow/arrow_reader_writer_test.cc Outdated Show resolved Hide resolved
const auto& decimal_type = static_cast<const ::arrow::DecimalType&>(*field->type());
precision = decimal_type.precision();
scale = decimal_type.scale();
length = DecimalType::DecimalSize(precision);
if (properties.integer_annotate_decimal() && 1 <= precision && precision <= 18) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does c++ arrow support converting int32/int64 to decimal128 and decimal256?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by that. This PR is how the data is encoded in the buffers, not what logical type is saved in the Parquet file. When read back to Arrow it will always come out as a Decimal, so there's no need for conversion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does c++ arrow support converting int32/int64 to decimal128 and decimal256?

decimal128/decimal256 support construction from integers. But the parquet itself does not use an integer array to represent decimal values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks

@wgtmac
Copy link
Member Author

wgtmac commented Jan 11, 2023

I found Decimal256 doesn't work with this code path if store_schema is set. Could you also address that? Here is a unit test that should reproduce:

TEST(ArrowReadWrite, Decimal256AsInt) {
  using ::arrow::Decimal256;
  using ::arrow::field;

  auto type = ::arrow::decimal256(8, 4);

  const char* json = R"(["1.0000", null, "-1.2345", "-1000.5678",
                         "-9999.9999", "9999.9999"])";
  auto array = ::arrow::ArrayFromJSON(type, json);
  auto table = ::arrow::Table::Make(::arrow::schema({field("root", type)}), {array});

  parquet::WriterProperties::Builder builder;
  // Enforce integer type to annotate decimal type
  auto writer_properties = builder.enable_integer_annotate_decimal()->build();
  auto props_store_schema = ArrowWriterProperties::Builder().store_schema()->build();

  CheckConfiguredRoundtrip(table, table, writer_properties, props_store_schema);
}

Fixed and added the test case. Please take a look again. Thanks! @wjones127

@wgtmac wgtmac requested a review from wjones127 January 11, 2023 05:42
Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thank you @wgtmac!

@wjones127 wjones127 merged commit 7d17a5b into apache:master Jan 11, 2023
@ursabot
Copy link

ursabot commented Jan 12, 2023

Benchmark runs are scheduled for baseline = 0da51b7 and contender = 7d17a5b. 7d17a5b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.51% ⬆️0.0%] test-mac-arm
[Finished ⬇️1.02% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.59% ⬆️0.06%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7d17a5b0 ec2-t3-xlarge-us-east-2
[Finished] 7d17a5b0 test-mac-arm
[Finished] 7d17a5b0 ursa-i9-9960x
[Finished] 7d17a5b0 ursa-thinkcentre-m75q
[Finished] 0da51b72 ec2-t3-xlarge-us-east-2
[Finished] 0da51b72 test-mac-arm
[Finished] 0da51b72 ursa-i9-9960x
[Finished] 0da51b72 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Jan 12, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

@@ -450,6 +451,22 @@ class PARQUET_EXPORT WriterProperties {
return this->disable_statistics(path->ToDotString());
}

/// Enable integer type to annotate decimal type as below:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "Enable X to do Y" correct English? Just asking, as I'm not a native English speaker, but I would intuitively have expected "Allow X to do Y".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it would be nice if the docstring was more explanatory as to the consequences (the reader may not know that this may make the column more compact, and not be readable with previous Arrow C++ versions).

@mapleFU
Copy link
Member

mapleFU commented Mar 14, 2023

Should we DCHECK the precision when writing arrow? Seems I meet a bug that my precision is wrong, but it write success as Decimal128, and read as int32... @wgtmac

@wgtmac
Copy link
Member Author

wgtmac commented Mar 14, 2023

Should we DCHECK the precision when writing arrow? Seems I meet a bug that my precision is wrong, but it write success as Decimal128, and read as int32... @wgtmac

Could you please provide the process to reproduce it? Or you can directly submit a PR if it is straight-forward.

@mapleFU
Copy link
Member

mapleFU commented Mar 14, 2023

Sure, I will submit a tiny check tonight

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Parquet writer supports writing int32/int64 for decimal type
7 participants