Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43956: [C++][Format] Add initial Decimal32/Decimal64 implementations #43957

Merged
merged 56 commits into from
Sep 30, 2024

Conversation

zeroshade
Copy link
Member

@zeroshade zeroshade commented Sep 4, 2024

Rationale for this change

Widening the Decimal128/256 type to allow for bitwidths of 32 and 64 allows for more interoperability with other libraries and utilities which already support these types. This provides even more opportunities for zero-copy interactions between things such as libcudf and various databases.

What changes are included in this PR?

This PR contains the basic C++ implementations for Decimal32/Decimal64 types, arrays, builders and scalars. It also includes the minimum necessary to get everything compiling and tests passing without also extending the acero kernels and parquet handling (both of which will be handled in follow-up PRs).

Are these changes tested?

Yes, tests were extended where applicable to add decimal32/decimal64 cases.

Are there any user-facing changes?

Currently if a user is using decimal(precision, scale) rather than decimal128(precision, scale) they will get a Decimal128Type if the precision is <= 38 (max precision for Decimal128) and Decimal256Type if the precision is higher. Following the same pattern, this change means that using decimal(precision, scale) instead of the specific decimal32/decimal64/decimal128/decimal256 functions results in the following functionality:

  • for precisions [1 : 9] => Decimal32Type
  • for precisions [10 : 18] => Decimal64Type
  • for precisions [19 : 38] => Decimal128Type
  • for precisions [39 : 76] => Decimal256Type

While many of our tests currently make the assumption that decimal with a low precision would be Decimal128 and had to be updated, this may cause an initial surprise if users are making the same assumptions.

Copy link

github-actions bot commented Sep 4, 2024

⚠️ GitHub issue #43956 has been automatically assigned in GitHub to PR creator.

cpp/src/arrow/type.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/builder_dict.h Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/codegen_internal.h Outdated Show resolved Hide resolved
@@ -171,7 +171,8 @@ using PrimitiveArrowTypes =
using TemporalArrowTypes =
::testing::Types<Date32Type, Date64Type, TimestampType, Time32Type, Time64Type>;

using DecimalArrowTypes = ::testing::Types<Decimal128Type, Decimal256Type>;
using DecimalArrowTypes =
::testing::Types</*Decimal32Type, Decimal64Type,*/ Decimal128Type, Decimal256Type>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here. (Should we file issues to come back to these?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are commented out because we didn't implement casting for the new decimal types. This is mentioned in the issue as check boxes to do rather than as an entirely separate issue currently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's going to be a separate PR, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i didn't want to make this already large PR even larger. I'll implement the cast kernels and so on as a follow-up PR

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Sep 5, 2024
@pitrou
Copy link
Member

pitrou commented Sep 5, 2024

Following the same pattern, this change means that using decimal(precision, scale) instead of the specific decimal32/decimal64/decimal128/decimal256 functions results in the following functionality

I'm afraid this may massively break user code. I would suggest another approach:

  • deprecate the decimal() factory while keeping its current behavior of always returning at least decimal128
  • introduce a new smallest_decimal() factory that is documented to return the smallest possible type, and explicitly makes no guarantees about the stability of the return type

@wgtmac
Copy link
Member

wgtmac commented Sep 5, 2024

Following the same pattern, this change means that using decimal(precision, scale) instead of the specific decimal32/decimal64/decimal128/decimal256 functions results in the following functionality

I'm afraid this may massively break user code. I would suggest another approach:

  • deprecate the decimal() factory while keeping its current behavior of always returning at least decimal128
  • introduce a new smallest_decimal() factory that is documented to return the smallest possible type, and explicitly makes no guarantees about the stability of the return type

I just have the same concern. +1 on the proposed workaround.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Sep 5, 2024
@zeroshade
Copy link
Member Author

@pitrou @bkietz @wgtmac I've updated this based on the suggestion, created a smallest_decimal function and added a deprecated message to the docstring for decimal.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Sep 5, 2024
@zeroshade zeroshade merged commit d55d4c6 into apache:main Sep 30, 2024
38 of 41 checks passed
@zeroshade zeroshade deleted the cpp-decimal32-64 branch September 30, 2024 21:15
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit d55d4c6.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them.

@pitrou
Copy link
Member

pitrou commented Oct 1, 2024

Hmm, did you notice the UBSAN failure in Decimal32Test.LeftShift?
https://github.com/apache/arrow/actions/runs/11115928849/job/30885255275#step:6:6273

(you can easily run this build locally using archery docker if you don't want to wait for CI every time :-))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants