fix: parquet hybrid RLE encoding did not always align to bit width #13883

nameexhaustion · 2024-01-21T11:05:09Z

If values remained after dividing the input length by the block size, those remainder values were written without padding the output to a multiple of the bit width.
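The size mismatch can be illustrated with a little arithmetic. The sketch below is not the Polars code: it assumes a block length of 32 values, assumes that `ceil8` rounds up to whole bytes, and uses hypothetical helper names (`unaligned_remainder_size`, `aligned_remainder_size`, `expected_run_size`). The input of 60 values at bit width 2 is only an illustrative choice.

```python
BLOCK_LEN = 32  # assumed number of values packed per full block


def ceil8(n: int) -> int:
    """Return n / 8, rounded up (assumed definition)."""
    return (n + 7) // 8


def unaligned_remainder_size(remainder: int, num_bits: int) -> int:
    # Bytes for the remainder when rounding only to whole bytes.
    return ceil8(remainder * num_bits)


def aligned_remainder_size(remainder: int, num_bits: int) -> int:
    # Bytes when the remainder is padded to whole groups of 8 values;
    # each group of 8 values occupies exactly num_bits bytes.
    return ceil8(remainder) * num_bits


def expected_run_size(length: int, num_bits: int) -> int:
    # The bit-packed run header counts groups of 8 values, so a reader
    # expects this many payload bytes after the header.
    return ceil8(length) * num_bits


length, num_bits = 60, 2  # illustrative values
chunks, remainder = divmod(length, BLOCK_LEN)
full_blocks = chunks * ceil8(BLOCK_LEN * num_bits)

print(full_blocks + unaligned_remainder_size(remainder, num_bits))  # 15
print(full_blocks + aligned_remainder_size(remainder, num_bits))    # 16
print(expected_run_size(length, num_bits))                          # 16
```

With byte-only rounding the payload is one byte shorter than the header promises, so a reader that consumes whole groups runs past the end of the data.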

This crashes Spark, because Spark reads bitWidth bytes at a time (one group of 8 bit-packed values):

            // values are bit packed 8 at a time, so reading bitWidth will always work
            ByteBuffer buffer = in.slice(bitWidth);

(source https://github.com/apache/spark/blob/ce5ddad990373636e94071e7cef2f31021add07b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java#L934-L935)
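Why a short final group is fatal can be seen in a simplified reader. The following Python sketch is not Spark's implementation: `read_bitpacked_groups` is a hypothetical helper that, like the quoted line, consumes exactly `bit_width` bytes per group of 8 values, and it assumes the least-significant-bit-first packing described in the Parquet encodings spec.

```python
def read_bitpacked_groups(data: bytes, num_groups: int, bit_width: int) -> list[int]:
    """Decode num_groups groups of 8 bit-packed values (hypothetical helper)."""
    values = []
    mask = (1 << bit_width) - 1
    offset = 0
    for _ in range(num_groups):
        # One group of 8 values at bit_width bits each is bit_width bytes.
        group = data[offset:offset + bit_width]
        if len(group) < bit_width:
            raise ValueError(
                f"group needs {bit_width} bytes, only {len(group)} available"
            )
        offset += bit_width
        # Values are packed starting from the least significant bit.
        word = int.from_bytes(group, "little")
        for i in range(8):
            values.append((word >> (i * bit_width)) & mask)
    return values


# Values 0..7 at bit width 3, using the byte layout from the spec example.
print(read_bitpacked_groups(bytes([0x88, 0xC6, 0xFA]), 1, 3))
# [0, 1, 2, 3, 4, 5, 6, 7]

# A payload that is one byte short fails on the last group.
try:
    read_bitpacked_groups(bytes([0x88, 0xC6]), 1, 3)
except ValueError as err:
    print("read failed:", err)
```

A writer that rounds the final group only to whole bytes produces exactly this kind of short payload whenever the remainder is not a multiple of 8 values.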

To reproduce the issue on main, use an Int32 column rather than a String column, and check that the output of pq.read_metadata().row_group(0).column(0) lists RLE_DICTIONARY under encodings. This DataFrame should reproduce the issue:

import polars as pl

# 60 Int32 values cycling through 1, 2, 3
df = pl.Series(
    "test_data",
    20 * [1, 2, 3],
    dtype=pl.Int32,
).to_frame()

Relevant spec page https://parquet.apache.org/docs/file-format/data-pages/encodings/

ritchie46 · 2024-01-21T12:03:28Z

crates/polars-parquet/src/parquet/encoding/hybrid_rle/encoder.rs

@@ -55,7 +55,7 @@ fn bitpacked_encode_u32<W: Write, I: Iterator<Item = u32>>(
    }

    if remainder != 0 {
-        let compressed_remainder_size = ceil8(remainder * num_bits);


Can you add a comment to this PR here?

ritchie46 · 2024-01-21T12:10:23Z

Thanks a lot for the fix! I left one minor comment.

c

818472c

github-actions bot added the fix, python, and rust labels Jan 21, 2024

update test case

dfa1767

nameexhaustion marked this pull request as ready for review January 21, 2024 11:23

nameexhaustion requested review from ritchie46, stinodego, orlp and c-peters as code owners January 21, 2024 11:23

ritchie46 reviewed Jan 21, 2024

add comment

7db9eee

ritchie46 approved these changes Jan 21, 2024

ritchie46 merged commit 3259c29 into pola-rs:main Jan 21, 2024
17 checks passed

r-brink pushed a commit to r-brink/polars that referenced this pull request Jan 24, 2024

fix: parquet hybrid RLE encoding did not always align to bit width (p…

4b10b43

…ola-rs#13883)

nameexhaustion mentioned this pull request Feb 22, 2024

refactor(rust): Simplify compressed_chunk_size calculation and leave comments to explain for rle encode #14634

Merged

nameexhaustion deleted the rle-dict branch February 22, 2024 05:46

max-ipinfo mentioned this pull request Jul 3, 2024

Spark can't read parquet-go generated files: can not read class org.apache.parquet.format.PageHeader parquet-go/parquet-go#64

Open
