feat: add stats to convert-to-delta operation #2491

Merged
merged 7 commits into from
May 15, 2024

Conversation

gruuya
Contributor

@gruuya gruuya commented May 8, 2024

Description

Collect file stats while converting a parquet directory to a Delta table, and add them to the actions.

Related Issue(s)

Closes #2490

Documentation

@gruuya gruuya requested review from wjones127, roeap and rtyler as code owners May 8, 2024 15:44
@github-actions github-actions bot added the binding/rust Issues for the Rust crate label May 8, 2024
github-actions bot commented May 8, 2024

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@gruuya gruuya changed the title Add stats to convert-to-delta operation feat: Add stats to convert-to-delta operation May 8, 2024
@gruuya gruuya changed the title feat: Add stats to convert-to-delta operation feat: add stats to convert-to-delta operation May 8, 2024
@gruuya
Contributor Author

gruuya commented May 8, 2024

I'm not sure why the parse decimal overflow appears in the Python tests.

Tellingly, the decimal column min/max end up as Value(Number(2560.0))/Value(Number(3584.0)), even though pyarrow seems to persist the stats to the file correctly:

% parquet meta python/tests/part-0.parquet

File path:  python/tests/part-0.parquet
Created by: parquet-cpp-arrow version 15.0.2
Properties:
  ARROW:schema: /////5ADAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAA4AAAAwAwAA5AIAALACAAB8AgAASAIAABACAADgAQAAtAEAAIgBAABMAQAAHAEAAOgAAABkAAAABAAAABj9//8AAAEMFAAAABwAAAAEAAAAAQAAABQAAAAEAAAAbGlzdAAAAAAM/f//RP3//wAAAQIQAAAAGAAAAAQAAAAAAAAABAAAAGl0ZW0AAAAAfP3//wAAAAFAAAAAdP3//wAAAQ0YAAAAIAAAAAQAAAACAAAAPAAAABQAAAAGAAAAc3RydWN0AABs/f//pP3//wAAAQUQAAAAFAAAAAQAAAAAAAAAAQAAAHkAAACQ/f//yP3//wAAAQIQAAAAFAAAAAQAAAAAAAAAAQAAAHgAAAD8/f//AAAAAUAAAAD0/f//AAABChAAAAAcAAAABAAAAAAAAAAJAAAAdGltZXN0YW1wAAAA8v7//wAAAgAk/v//AAABCBAAAAAYAAAABAAAAAAAAAAGAAAAZGF0ZTMyAAAe////AAAAAFD+//8AAAEHEAAAACAAAAAEAAAAAAAAAAcAAABkZWNpbWFsAAgADAAEAAgACAAAAAUAAAADAAAAiP7//wAAAQQQAAAAGAAAAAQAAAAAAAAABgAAAGJpbmFyeQAAeP7//7D+//8AAAEGEAAAABgAAAAEAAAAAAAAAAQAAABib29sAAAAAKD+///Y/v//AAABAxAAAAAYAAAABAAAAAAAAAAHAAAAZmxvYXQ2NADS////AAACAAT///8AAAEDEAAAACAAAAAEAAAAAAAAAAcAAABmbG9hdDMyAAAABgAIAAYABgAAAAAAAQA4////AAABAhAAAAAYAAAABAAAAAAAAAAEAAAAaW50OAAAAABw////AAAAAQgAAABo////AAABAhAAAAAYAAAABAAAAAAAAAAFAAAAaW50MTYAAACg////AAAAARAAAACY////AAABAhAAAAAYAAAABAAAAAAAAAAFAAAAaW50MzIAAADQ////AAAAASAAAADI////AAABAhAAAAAgAAAABAAAAAAAAAAFAAAAaW50NjQAAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAEAAAAdXRmOAAAAAAEAAQABAAAAAAAAAA=
Schema:
message schema {
  optional binary utf8 (STRING);
...
  optional fixed_len_byte_array(3) decimal (DECIMAL(5,3));
...


Row group 0:  count: 5  288.20 B records  start: 4  total(compressed): 1.407 kB total(uncompressed):1.408 kB
--------------------------------------------------------------------------------
                   type      encodings count     avg size   nulls   min / max
utf8               BINARY    S _ R     5         16.00 B    0       "0" / "4"
...
decimal            FIXED[3]  S _ R     5         17.00 B    0       "10.000" / "14.000"
...

EDIT: I think this was caused by a previously unidentified bug in the statistics parser, which is resolved by 5972aab.

Comment on lines -234 to +271
-    let val = if val.len() <= 4 {
-        let mut bytes = [0; 4];
-        bytes[..val.len()].copy_from_slice(val);
-        i32::from_be_bytes(bytes) as f64
-    } else if val.len() <= 8 {
-        let mut bytes = [0; 8];
-        bytes[..val.len()].copy_from_slice(val);
-        i64::from_be_bytes(bytes) as f64
-    } else if val.len() <= 16 {
-        let mut bytes = [0; 16];
-        bytes[..val.len()].copy_from_slice(val);
-        i128::from_be_bytes(bytes) as f64
+    let val = if val.len() <= 16 {
+        i128::from_be_bytes(sign_extend_be(val)) as f64
Contributor Author

@gruuya gruuya May 9, 2024

Looks like this was causing those test failures.

In particular, note that the old code wrongly extends any slice whose length is not exactly 4, 8, or 16 bytes, because it pads at the end of the array rather than at the beginning. This is what happened in the failing tests, since this:

"decimal": pa.array([Decimal("10.000") + x for x in range(nrows)]),

would result in min [0, 39, 16] and max [0, 54, 176] fixed length byte arrays, and these would in turn be coerced to [0, 39, 16, 0] and [0, 54, 176, 0], instead of [0, 0, 39, 16] and [0, 0, 54, 176] respectively.

Contrast with arrow-rs/datafusion where the extension happens at the beginning of the array:
https://github.com/apache/arrow-rs/blob/b25c441745602c9967b1e3cc4a28bc469cfb1311/parquet/src/arrow/array_reader/fixed_len_byte_array.rs#L170

Granted, I'm not sure what encodes the min value "10.000" as [0, 39, 16] in the first place (or how), other than that it happens in pyarrow.
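To make the byte arithmetic above concrete, here is a minimal, standalone sketch of the two padding strategies. It is an illustration only, not the delta-rs or arrow-rs code: `zero_pad_right_i32` and `decode_unscaled` are hypothetical names, and `sign_extend_be` is re-implemented here under the assumption that it left-pads with the sign byte. For reference, the Parquet format stores a DECIMAL in a fixed_len_byte_array as the big-endian two's-complement unscaled integer, so "10.000" at scale 3 is 10000 = 0x002710, i.e. [0, 39, 16].

```rust
/// Left-pad a big-endian two's-complement slice to N bytes, filling with
/// 0xFF for negative values and 0x00 otherwise (sketch, not the library code).
fn sign_extend_be<const N: usize>(b: &[u8]) -> [u8; N] {
    assert!(b.len() <= N, "slice longer than target width");
    let negative = b.first().map_or(false, |byte| byte & 0x80 != 0);
    let mut out = if negative { [0xFFu8; N] } else { [0x00u8; N] };
    out[N - b.len()..].copy_from_slice(b);
    out
}

/// Hypothetical reproduction of the removed branch for slices of up to
/// 4 bytes: the value is copied to the front, so padding lands at the end.
fn zero_pad_right_i32(b: &[u8]) -> i32 {
    assert!(b.len() <= 4, "slice longer than 4 bytes");
    let mut bytes = [0u8; 4];
    bytes[..b.len()].copy_from_slice(b);
    i32::from_be_bytes(bytes)
}

/// Decode the unscaled decimal value from up to 16 big-endian bytes.
fn decode_unscaled(b: &[u8]) -> i128 {
    i128::from_be_bytes(sign_extend_be::<16>(b))
}

fn main() {
    let min = [0u8, 39, 16]; // stats min from the failing test
    let max = [0u8, 54, 176]; // stats max from the failing test
    println!(
        "right-padded:  {} / {}",
        zero_pad_right_i32(&min),
        zero_pad_right_i32(&max)
    );
    println!(
        "sign-extended: {} / {}",
        decode_unscaled(&min),
        decode_unscaled(&max)
    );
}
```

Dividing by 10^3 for the DECIMAL(5,3) scale, the right-padded values become 2560.0 and 3584.0, which matches the bogus stats reported earlier in this thread, while the sign-extended values become 10.000 and 14.000. Left-padding with the sign byte also keeps negative values negative, which zero padding on either side would not.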

Contributor Author

Hi @ion-elgreco, can you approve the rest of the workflows?

@gruuya
Contributor Author

gruuya commented May 14, 2024

Hi @roeap, @ion-elgreco, I think this is ready for review. It now contains both a feature (convert stats) and a bugfix (decimal stats parsing from byte arrays).

Can you provide some feedback on it? Thanks!

let stats = stats_from_parquet_metadata(
&IndexMap::from_iter(partition_values.clone().into_iter()),
parquet_metadata.as_ref(),
-1,
Collaborator

This shouldn't be -1. You should use the same function I recently added in the writer to get the configuration values: get_num_idx_cols_and_stats_columns

Contributor Author

Good catch, revised now.
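For context on why a hardcoded -1 is a problem: in Delta's table properties, `delta.dataSkippingNumIndexedCols` defaults to 32 and -1 means "collect stats for all columns", while `delta.dataSkippingStatsColumns` names an explicit column list. The sketch below shows the general idea of resolving both from the table configuration. `resolve_stats_config` is a hypothetical helper written for this illustration; it is not the signature or implementation of `get_num_idx_cols_and_stats_columns` in delta-rs.

```rust
use std::collections::HashMap;

/// Delta's documented default for `delta.dataSkippingNumIndexedCols`.
const DEFAULT_NUM_INDEXED_COLS: i32 = 32;

/// Hypothetical helper (not the delta-rs API): read the stats-related
/// table properties from a plain key/value configuration map.
fn resolve_stats_config(
    config: &HashMap<String, String>,
) -> (i32, Option<Vec<String>>) {
    // Fall back to the default when the property is absent or unparsable.
    let num_indexed_cols = config
        .get("delta.dataSkippingNumIndexedCols")
        .and_then(|v| v.trim().parse::<i32>().ok())
        .unwrap_or(DEFAULT_NUM_INDEXED_COLS);

    // Naive comma split; assumes column names contain no commas.
    let stats_columns = config.get("delta.dataSkippingStatsColumns").map(|v| {
        v.split(',')
            .map(|c| c.trim().to_string())
            .filter(|c| !c.is_empty())
            .collect::<Vec<_>>()
    });

    (num_indexed_cols, stats_columns)
}

fn main() {
    let mut config = HashMap::new();
    config.insert(
        "delta.dataSkippingStatsColumns".to_string(),
        "id, price".to_string(),
    );
    let (num_indexed_cols, stats_columns) = resolve_stats_config(&config);
    println!("{num_indexed_cols} {stats_columns:?}");
}
```

Passing the resolved values instead of a literal -1 means a converted table honours whatever stats limits its configuration specifies, rather than always collecting stats for every column.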

@ion-elgreco
Collaborator

Thanks

@ion-elgreco
Collaborator

@gruuya can you rebase so that we can merge

@ion-elgreco ion-elgreco enabled auto-merge (squash) May 15, 2024 06:40
@gruuya
Contributor Author

gruuya commented May 15, 2024

@ion-elgreco weirdly, a couple of tests seem to be stuck and have been running for close to 2 hours now (test_concurrency_local, test_integration_local). I presume this is a CI problem of some sort (the previous runs were ok). Can you restart the job, please?

@ion-elgreco
Collaborator

@gruuya yeah it's a flaky test, re-executed it

@ion-elgreco ion-elgreco merged commit c86d29f into delta-io:main May 15, 2024
20 checks passed
Labels
binding/rust Issues for the Rust crate
Development

Successfully merging this pull request may close these issues.

Include file stats when converting a parquet directory to a Delta table
2 participants