fix: parquet hybrid RLE encoding did not always align to bit width #13883

Merged: 3 commits, Jan 21, 2024

Conversation

@nameexhaustion (Collaborator) commented Jan 21, 2024

Fixes #13818.

If there were remainder values after dividing by the block size, they were not written with alignment to the bit width: the remainder was not padded out to a whole group of 8 values.

This crashes Spark because Spark reads bitWidth bytes (one group of 8 bit-packed values) at a time:

            // values are bit packed 8 at a time, so reading bitWidth will always work
            ByteBuffer buffer = in.slice(bitWidth);

(source https://github.com/apache/spark/blob/ce5ddad990373636e94071e7cef2f31021add07b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java#L934-L935)
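
To make the size mismatch concrete, here is a minimal sketch of the arithmetic. It assumes the encoder packs full blocks of 32 values (`BLOCK_LEN` is an assumption, not taken from the diff), and the function names are hypothetical. The aligned formula is inferred from the description above and from the spec's rule that bit-packed runs consist of groups of 8 values; it is not quoted from the patch.

```python
BLOCK_LEN = 32  # assumption: number of values the encoder packs per full block


def ceil8(value: int) -> int:
    """Ceiling division by 8 (bits to bytes, or values to groups of 8)."""
    return (value + 7) // 8


def unaligned_remainder_bytes(remainder: int, bit_width: int) -> int:
    """Bytes written when only the used bits are rounded up to whole bytes."""
    return ceil8(remainder * bit_width)


def aligned_remainder_bytes(remainder: int, bit_width: int) -> int:
    """Bytes a reader consumes when it slices bit_width bytes per group of 8."""
    return ceil8(remainder) * bit_width


# 60 dictionary indices over 3 distinct values need a bit width of 2.
num_values = 60
bit_width = 2
remainder = num_values % BLOCK_LEN

written = unaligned_remainder_bytes(remainder, bit_width)
needed = aligned_remainder_bytes(remainder, bit_width)
print(remainder, written, needed)  # 28 7 8
```

Under these assumptions the last group of 8 is one byte short, so a reader that slices bit_width bytes for it runs past the end of the buffer.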

Note: to reproduce the issue on main, use an Int32 column instead of a String column, and check that the output of pq.read_metadata().row_group(0).column(0) lists RLE_DICTIONARY under encodings. This DataFrame should reproduce the issue:

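import polars as pl

# 60 Int32 values with only 3 distinct values, so the column should be dictionary-encoded
df = pl.Series(
    "test_data",
    20 * [1, 2, 3],
    dtype=pl.Int32,
).to_frame()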

Relevant spec page https://parquet.apache.org/docs/file-format/data-pages/encodings/
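
As a rough illustration of the layout that spec page describes, the sketch below bit-packs values from the least significant bit upward and pads the output to whole groups of 8 values, so its length is always a multiple of the bit width. `pack_padded` is a hypothetical helper written for this example, not the Polars implementation.

```python
def pack_padded(values, bit_width):
    """Bit-pack values LSB-first, zero-padding to whole groups of 8 values."""
    acc = 0
    for i, value in enumerate(values):
        if value >> bit_width:
            raise ValueError(f"{value} does not fit in {bit_width} bits")
        # Value i occupies bits [i * bit_width, (i + 1) * bit_width).
        acc |= value << (i * bit_width)
    groups = (len(values) + 7) // 8
    # Each group of 8 values occupies exactly bit_width bytes.
    return acc.to_bytes(groups * bit_width, "little")


# The spec's worked example: 0..7 at bit width 3 packs to 0x88 0xC6 0xFA.
spec_example = pack_padded(list(range(8)), 3)

# 28 remainder values at bit width 2 pad out to 4 groups, i.e. 8 bytes.
remainder_values = ([0, 1, 2] * 10)[:28]
packed = pack_padded(remainder_values, 2)
```

Padding with zero bits is safe here because the run header, not the byte count, tells the reader how many groups follow.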

@github-actions bot added the fix, python, and rust labels Jan 21, 2024
@nameexhaustion nameexhaustion marked this pull request as ready for review January 21, 2024 11:23
@@ -55,7 +55,7 @@ fn bitpacked_encode_u32<W: Write, I: Iterator<Item = u32>>(
    }

    if remainder != 0 {
        let compressed_remainder_size = ceil8(remainder * num_bits);

Can you add a code comment here that links to this PR?

@ritchie46 (Member)

Thanks a lot for the fix! I left one minor comment.

Successfully merging this pull request may close these issues.

Spark can't read parquet files written by polars