feat: add stats to convert-to-delta operation #2491
Conversation
ACTION NEEDED
delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
I'm not sure why the It's indicative that I see the decimal column min/max ends up as
% parquet meta python/tests/part-0.parquet
File path: python/tests/part-0.parquet
Created by: parquet-cpp-arrow version 15.0.2
Properties:
ARROW:schema: /////5ADAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAA4AAAAwAwAA5AIAALACAAB8AgAASAIAABACAADgAQAAtAEAAIgBAABMAQAAHAEAAOgAAABkAAAABAAAABj9//8AAAEMFAAAABwAAAAEAAAAAQAAABQAAAAEAAAAbGlzdAAAAAAM/f//RP3//wAAAQIQAAAAGAAAAAQAAAAAAAAABAAAAGl0ZW0AAAAAfP3//wAAAAFAAAAAdP3//wAAAQ0YAAAAIAAAAAQAAAACAAAAPAAAABQAAAAGAAAAc3RydWN0AABs/f//pP3//wAAAQUQAAAAFAAAAAQAAAAAAAAAAQAAAHkAAACQ/f//yP3//wAAAQIQAAAAFAAAAAQAAAAAAAAAAQAAAHgAAAD8/f//AAAAAUAAAAD0/f//AAABChAAAAAcAAAABAAAAAAAAAAJAAAAdGltZXN0YW1wAAAA8v7//wAAAgAk/v//AAABCBAAAAAYAAAABAAAAAAAAAAGAAAAZGF0ZTMyAAAe////AAAAAFD+//8AAAEHEAAAACAAAAAEAAAAAAAAAAcAAABkZWNpbWFsAAgADAAEAAgACAAAAAUAAAADAAAAiP7//wAAAQQQAAAAGAAAAAQAAAAAAAAABgAAAGJpbmFyeQAAeP7//7D+//8AAAEGEAAAABgAAAAEAAAAAAAAAAQAAABib29sAAAAAKD+///Y/v//AAABAxAAAAAYAAAABAAAAAAAAAAHAAAAZmxvYXQ2NADS////AAACAAT///8AAAEDEAAAACAAAAAEAAAAAAAAAAcAAABmbG9hdDMyAAAABgAIAAYABgAAAAAAAQA4////AAABAhAAAAAYAAAABAAAAAAAAAAEAAAAaW50OAAAAABw////AAAAAQgAAABo////AAABAhAAAAAYAAAABAAAAAAAAAAFAAAAaW50MTYAAACg////AAAAARAAAACY////AAABAhAAAAAYAAAABAAAAAAAAAAFAAAAaW50MzIAAADQ////AAAAASAAAADI////AAABAhAAAAAgAAAABAAAAAAAAAAFAAAAaW50NjQAAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAEAAAAdXRmOAAAAAAEAAQABAAAAAAAAAA=
Schema:
message schema {
optional binary utf8 (STRING);
...
optional fixed_len_byte_array(3) decimal (DECIMAL(5,3));
...
Row group 0: count: 5 288.20 B records start: 4 total(compressed): 1.407 kB total(uncompressed):1.408 kB
--------------------------------------------------------------------------------
         type      encodings  count  avg size  nulls  min / max
utf8     BINARY    S _ R      5      16.00 B   0      "0" / "4"
...
decimal  FIXED[3]  S _ R      5      17.00 B   0      "10.000" / "14.000"
...
EDIT: I think this was caused by a previously unidentified bug in the statistics parser, which is resolved by 5972aab
- let val = if val.len() <= 4 {
-     let mut bytes = [0; 4];
-     bytes[..val.len()].copy_from_slice(val);
-     i32::from_be_bytes(bytes) as f64
- } else if val.len() <= 8 {
-     let mut bytes = [0; 8];
-     bytes[..val.len()].copy_from_slice(val);
-     i64::from_be_bytes(bytes) as f64
- } else if val.len() <= 16 {
-     let mut bytes = [0; 16];
-     bytes[..val.len()].copy_from_slice(val);
-     i128::from_be_bytes(bytes) as f64
+ let val = if val.len() <= 16 {
+     i128::from_be_bytes(sign_extend_be(val)) as f64
Looks like this was causing those test failures.
In particular, note that the above would wrongly extend slices that are not a power of 2 long. This is the case that occurred for the failing tests, since this:
delta-rs/python/tests/conftest.py
Line 218 in 81593e9
"decimal": pa.array([Decimal("10.000") + x for x in range(nrows)]),
would result in min [0, 39, 16] and max [0, 54, 176] fixed-length byte arrays, and these would in turn be coerced to [0, 39, 16, 0] and [0, 54, 176, 0], instead of [0, 0, 39, 16] and [0, 0, 54, 176] respectively.
Contrast with arrow-rs/datafusion, where the extension happens at the beginning of the array:
https://github.com/apache/arrow-rs/blob/b25c441745602c9967b1e3cc4a28bc469cfb1311/parquet/src/arrow/array_reader/fixed_len_byte_array.rs#L170
Granted, I'm not sure what encodes the min value "10.000" as [0, 39, 16] in the first place, or how, aside from the fact that it occurs in pyarrow.
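For illustration, here is a minimal sketch of the two padding behaviours discussed in this comment. It assumes the Parquet convention that a DECIMAL stored as a fixed-length byte array holds the unscaled integer in big-endian two's complement, so 10.000 at scale 3 is the integer 10000 (0x002710, i.e. [0, 39, 16]). The `sign_extend_be` below is a hypothetical stand-in written for this example, not the arrow-rs or delta-rs implementation, and `pad_trailing` is an invented name for what the removed lines effectively did.

```rust
/// Hypothetical helper (written for this example only): sign-extend a
/// big-endian two's-complement slice to exactly N bytes.
fn sign_extend_be<const N: usize>(b: &[u8]) -> [u8; N] {
    assert!(b.len() <= N, "a {}-byte slice does not fit in {} bytes", b.len(), N);
    // Negative values have the top bit of the first byte set and are padded
    // with 0xFF; everything else is padded with 0x00.
    let is_negative = b.first().map(|v| v & 0x80 != 0).unwrap_or(false);
    let mut out = if is_negative { [0xFF; N] } else { [0x00; N] };
    // Right-align the payload so the padding lands in the most significant bytes.
    out[N - b.len()..].copy_from_slice(b);
    out
}

/// Invented name for the behaviour of the removed code on short slices:
/// the payload is left-aligned and zeros are appended at the end.
fn pad_trailing(b: &[u8]) -> [u8; 4] {
    let mut out = [0u8; 4];
    out[..b.len()].copy_from_slice(b);
    out
}

fn main() {
    let min = [0u8, 39, 16]; // unscaled 10000, i.e. 10.000 at scale 3
    let max = [0u8, 54, 176]; // unscaled 14000, i.e. 14.000 at scale 3

    // Leading extension recovers the unscaled integers.
    let unscaled_min = i128::from_be_bytes(sign_extend_be::<16>(&min));
    let unscaled_max = i128::from_be_bytes(sign_extend_be::<16>(&max));
    assert_eq!(unscaled_min, 10_000);
    assert_eq!(unscaled_max, 14_000);

    // Trailing padding shifts the value left by one byte (a factor of 256).
    assert_eq!(pad_trailing(&min), [0, 39, 16, 0]);
    assert_eq!(i32::from_be_bytes(pad_trailing(&min)), 10_000 * 256);

    // Negative values need 0xFF padding, which zero-filling would also miss.
    let neg = [0xFFu8, 0xD8, 0xF0]; // -10000 in 24-bit two's complement
    assert_eq!(i128::from_be_bytes(sign_extend_be::<16>(&neg)), -10_000);

    println!("min = {}, max = {}", unscaled_min as f64 / 1e3, unscaled_max as f64 / 1e3);
}
```

Dividing the recovered integer by 10^scale gives the decimal value, which is why only the leading extension matches the "10.000" / "14.000" statistics shown by `parquet meta`.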
Hi @ion-elgreco, can you approve the rest of the workflows?
Hi @roeap, @ion-elgreco, I think this is ready for review. There's both a feat added (convert stats) as well as a bugfix (decimal stats parsing from byte arrays) now. Can you provide some feedback on it? Thanks!
let stats = stats_from_parquet_metadata(
    &IndexMap::from_iter(partition_values.clone().into_iter()),
    parquet_metadata.as_ref(),
    -1,
This shouldn't be -1. You should use the same function I recently added for the write operation to get the configuration values: get_num_idx_cols_and_stats_columns
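As a rough sketch of the idea behind this suggestion: rather than hard-coding -1 (collect statistics for every column), the column limit and the optional explicit column list can be read from the table configuration. The function below is a hypothetical stand-in and does not reproduce the signature of `get_num_idx_cols_and_stats_columns` in delta-rs. It assumes the Delta table properties `delta.dataSkippingNumIndexedCols` (taken here to default to 32, with -1 meaning all columns) and `delta.dataSkippingStatsColumns` (a comma-separated list of column names).

```rust
use std::collections::HashMap;

/// Assumed default for `delta.dataSkippingNumIndexedCols` when the property is unset.
const DEFAULT_NUM_INDEX_COLS: i32 = 32;

/// Hypothetical stand-in for the helper named in the review; the real
/// delta-rs function may take different arguments.
fn num_idx_cols_and_stats_columns(
    configuration: &HashMap<String, Option<String>>,
) -> (i32, Option<Vec<String>>) {
    // Number of leading columns to collect statistics for; fall back to the
    // default when the property is missing or not a valid integer.
    let num_indexed_cols = configuration
        .get("delta.dataSkippingNumIndexedCols")
        .and_then(|v| v.as_deref())
        .and_then(|v| v.trim().parse::<i32>().ok())
        .unwrap_or(DEFAULT_NUM_INDEX_COLS);

    // Optional explicit list of columns to collect statistics for.
    let stats_columns = configuration
        .get("delta.dataSkippingStatsColumns")
        .and_then(|v| v.as_deref())
        .map(|v| v.split(',').map(|c| c.trim().to_string()).collect());

    (num_indexed_cols, stats_columns)
}

fn main() {
    let mut configuration: HashMap<String, Option<String>> = HashMap::new();
    configuration.insert(
        "delta.dataSkippingNumIndexedCols".to_string(),
        Some("5".to_string()),
    );
    configuration.insert(
        "delta.dataSkippingStatsColumns".to_string(),
        Some("utf8, decimal".to_string()),
    );

    let (num_indexed_cols, stats_columns) = num_idx_cols_and_stats_columns(&configuration);
    assert_eq!(num_indexed_cols, 5);
    assert_eq!(
        stats_columns,
        Some(vec!["utf8".to_string(), "decimal".to_string()])
    );
}
```

The resulting pair would then be passed on in place of the literal -1, so that a converted table honours the same settings as a regular write.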
Good catch, revised now.
Thanks
@gruuya can you rebase so that we can merge?
@ion-elgreco weirdly, a couple of tests seem to be stuck and have been running for close to 2 hours now
@gruuya yeah it's a flaky test, re-executed it |
Description
Collect stats during conversion of a parquet dir to a Delta table and add to the actions.
Related Issue(s)
Closes #2490
Documentation