[C++][Parquet] Parquet writer supports writing int32/int64 for decimal type #15239

Closed
wgtmac opened this issue Jan 7, 2023 · 4 comments · Fixed by #15244

Comments

@wgtmac
Member

wgtmac commented Jan 7, 2023

Describe the enhancement requested

As the Parquet spec states below, decimal types with small precision can use the int32/int64 physical types.

DECIMAL can be used to annotate the following types:

- int32: for 1 <= precision <= 9
- int64: for 1 <= precision <= 18; precision < 10 will produce a warning
- fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
- binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
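To make the fixed_len_byte_array bound concrete, here is a minimal sketch of the formula above. `MaxDecimalPrecision` is a hypothetical helper name used only for illustration; it is not part of the Arrow or Parquet C++ API. It relies on the fact that 2^k is never a power of ten for k >= 1, so floor(log_10(2^k - 1)) equals floor(k * log_10(2)).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not an Arrow/Parquet API): largest decimal precision
// that fits in an n-byte two's-complement value, i.e.
// floor(log10(2^(8n - 1) - 1)).
// Computed as floor(k * log10(2)) with k = 8n - 1, which is equivalent because
// 2^k is never an exact power of ten. Uses double arithmetic, so it is only
// meant for the small widths that occur in practice.
inline int MaxDecimalPrecision(int num_bytes) {
  assert(num_bytes >= 1);
  const int value_bits = 8 * num_bytes - 1;  // one bit is reserved for the sign
  return static_cast<int>(std::floor(value_bits * std::log10(2.0)));
}
```

For 4 and 8 bytes this gives 9 and 18 digits, which matches the int32 and int64 limits in the list above.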

The aim of this issue is to provide a writer option to write decimal types using int32 when 1 <= precision <= 9 and int64 when 10 <= precision <= 18.
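The selection rule described here could be sketched roughly as follows. The enum and function names are hypothetical and do not come from the Arrow code base, and the fallback to fixed_len_byte_array for larger precisions is an assumption about what a writer would do outside the int32/int64 ranges.

```cpp
// Hypothetical names for illustration only; not the Arrow/Parquet C++ API.
enum class DecimalPhysicalType { kInt32, kInt64, kFixedLenByteArray };

// Pick the physical type for a decimal column from its precision:
//   1..9   -> int32
//   10..18 -> int64
//   other  -> fixed_len_byte_array (assumed fallback)
// Written as a single return expression so it stays valid C++11 constexpr.
constexpr DecimalPhysicalType PhysicalTypeForPrecision(int precision) {
  return (precision >= 1 && precision <= 9)
             ? DecimalPhysicalType::kInt32
             : (precision >= 10 && precision <= 18)
                   ? DecimalPhysicalType::kInt64
                   : DecimalPhysicalType::kFixedLenByteArray;
}
```

Using int64 only from precision 10 upward avoids the "precision < 10 will produce a warning" case quoted from the spec.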

Component(s)

C++, Parquet

@wgtmac
Member Author

wgtmac commented Jan 7, 2023

I will work on it shortly. cc @emkornfield @pitrou

wjones127 pushed a commit that referenced this issue Jan 11, 2023
…15244)

As the Parquet [spec](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal) states, DECIMAL can be used to annotate the following types:
- int32: for 1 <= precision <= 9
- int64: for 1 <= precision <= 18; precision < 10 will produce a warning
- fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
- binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.

The aim of this patch is to provide a writer option to use int32 to annotate decimal when 1 <= precision <= 9 and int64 when 10 <= precision <= 18.
* Closes: #15239

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 added this to the 11.0.0 milestone Jan 11, 2023
@alippai
Contributor

alippai commented Jan 11, 2023

When talking about datasets (multiple parquet files) are the mixed physical types supported? Some files written using the old way, some files with the improved physical type.

@wjones127
Member

> When talking about datasets (multiple parquet files) are the mixed physical types supported? Some files written using the old way, some files with the improved physical type.

The physical type does not change the logical type in the Parquet file, just how the data is serialized. Datasets shouldn't care about the Parquet physical type; they should only care about the logical one.
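As a rough illustration of that point, the sketch below stores the same unscaled decimal value two ways, as a big-endian two's-complement byte array (the layout the Parquet spec uses for fixed_len_byte_array decimals) and as an int32, and decodes both back to the same integer. All function names are hypothetical and this is not how Arrow implements it; it also assumes the value fits in an int64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode an unscaled decimal value as `width` bytes of
// big-endian two's complement. Assumes the value fits in `width` bytes.
inline std::vector<std::uint8_t> EncodeBigEndian(std::int64_t unscaled,
                                                 std::size_t width) {
  assert(width >= 1);
  // Pre-fill with the sign byte so widths above 8 are sign-extended.
  std::vector<std::uint8_t> out(width, unscaled < 0 ? 0xFF : 0x00);
  const std::uint64_t bits = static_cast<std::uint64_t>(unscaled);
  for (std::size_t i = 0; i < width && i < 8; ++i) {
    out[width - 1 - i] = static_cast<std::uint8_t>(bits >> (8 * i));
  }
  return out;
}

// Hypothetical helper: decode big-endian two's-complement bytes back to the
// unscaled value. Assumes the result fits in int64; the final cast relies on
// two's-complement conversion, which is guaranteed from C++20 and is what
// mainstream compilers do before that.
inline std::int64_t DecodeBigEndian(const std::vector<std::uint8_t>& bytes) {
  std::uint64_t bits =
      (!bytes.empty() && (bytes[0] & 0x80)) ? ~std::uint64_t{0} : 0;
  for (std::uint8_t b : bytes) {
    bits = (bits << 8) | b;
  }
  return static_cast<std::int64_t>(bits);
}

// Hypothetical helper: an int32-backed decimal is just the unscaled value,
// widened with sign extension.
inline std::int64_t DecodeInt32(std::int32_t stored) {
  return static_cast<std::int64_t>(stored);
}
```

For an unscaled value such as -12345 (that is, -123.45 at scale 2), both decoders yield the same integer, so a reader that maps either physical type to the same logical decimal type sees identical data.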

@alippai
Contributor

alippai commented Jan 11, 2023

🥳 thanks!
