feat(rust,python): Add run-length encoding to Parquet writer #16125
Conversation
}
// If buffer is full, bit-pack as literal run and reset
if buffer_idx == MAX_VALUES_PER_LITERAL_RUN {
    T::bitpacked_encode(writer, buffered_bits.iter().copied(), num_bits as usize)?;
The issue with my original PR was here. Instead of `buffered_bits.iter().copied()`, I had `buffered_bits.iter().copied().take(literal_run_idx)`. When the buffer filled up in the middle of a consecutive run, `literal_run_idx != MAX_VALUES_PER_LITERAL_RUN`, so data was lost. The solution was to remove the `take` clause and always flush the entire buffer when full. I added a `debug_assert` to check that the right number of repeats is consolidated into the buffered literal run.
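To make this concrete, below is a minimal, self-contained sketch of the flush-when-full behaviour. It is not the actual polars-parquet encoder: `encode`, `bitpack_literal_run`, and the tiny buffer size are hypothetical stand-ins. It only demonstrates why flushing the entire buffer, rather than `buffer[..literal_run_idx]`, avoids dropping the trailing repeats when the buffer fills mid-run.

```rust
// Hypothetical, simplified model of the hybrid encoder's literal buffer.
// A tiny buffer size keeps the example short; the real constant is 8192.
const MAX_VALUES_PER_LITERAL_RUN: usize = 8;

/// Stand-in for `T::bitpacked_encode`: just records which values were flushed.
fn bitpack_literal_run(out: &mut Vec<u32>, values: &[u32]) {
    out.extend_from_slice(values);
}

/// Buffers values and flushes literal runs. `flush_whole_buffer` toggles
/// between the fixed behaviour (flush everything) and the buggy behaviour
/// (flush only `buffer[..literal_run_idx]`, losing the trailing repeats).
fn encode(values: &[u32], flush_whole_buffer: bool) -> Vec<u32> {
    let mut out = Vec::new();
    let mut buffer: Vec<u32> = Vec::with_capacity(MAX_VALUES_PER_LITERAL_RUN);
    let mut literal_run_idx = 0usize;
    let mut previous: Option<u32> = None;

    for &v in values {
        if previous != Some(v) {
            // Value changed: the short run that just ended becomes part of
            // the literal run, so advance `literal_run_idx` to the current
            // buffer position (the real encoder does this only when the run
            // is too short to be worth run-length encoding).
            literal_run_idx = buffer.len();
        }
        previous = Some(v);
        buffer.push(v);

        if buffer.len() == MAX_VALUES_PER_LITERAL_RUN {
            let flush_len = if flush_whole_buffer {
                buffer.len() // the fix: always flush the entire buffer
            } else {
                literal_run_idx // the bug: trailing repeats are dropped
            };
            bitpack_literal_run(&mut out, &buffer[..flush_len]);
            buffer.clear();
            literal_run_idx = 0;
        }
    }
    if !buffer.is_empty() {
        bitpack_literal_run(&mut out, &buffer);
    }
    out
}

fn main() {
    // The buffer fills in the middle of the run of 5s.
    let data = [1, 2, 3, 4, 5, 5, 5, 5, 5, 6];
    assert_eq!(encode(&data, true), data); // fixed: nothing is lost
    assert_eq!(encode(&data, false).len(), data.len() - 4); // buggy: 4 values lost
    println!("full-buffer flush keeps all {} values", data.len());
}
```

Running it with `flush_whole_buffer = false` reproduces the kind of data loss reported in #16109: the repeats buffered at the end of the full buffer are silently dropped.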
    n, with_replacement=True
),
# Fill up bit-packing buffer in middle of consecutive run
"large_bit_pack": np.concatenate(
This column tests the failure mechanism in #16109. By changing the value every 5 elements, we never have a run long enough to run-length encode. Eventually, all 8192 elements of the bit-packing buffer are filled. At that point, `literal_run_idx = 8190` because it was last updated when the value last changed, and `consecutive_repeats = 3`. Two of the most recent consecutive repeats are buffered as the 8191st and 8192nd elements of the bit-packing buffer. That means they are bit-packed with the rest of the buffer and `consecutive_repeats` is reset to 1.
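As a rough sanity check of that arithmetic (this is not the actual test code; the concrete values and the RLE threshold of 8 are assumptions for illustration), the sketch below builds a sequence with the same run structure and confirms two things: no run is long enough to be run-length encoded, and an 8192-slot buffer fills two elements into a run (8192 mod 5 = 2), i.e. in the middle of a consecutive run.

```rust
// Sketch of the run structure of the "large_bit_pack" column: the value
// changes every 5 elements. Only the run lengths matter here.
const MAX_VALUES_PER_LITERAL_RUN: usize = 8192;
const ASSUMED_MIN_REPEATS_FOR_RLE: usize = 8; // assumption for illustration

fn main() {
    let run_len = 5usize;
    let n = 3 * MAX_VALUES_PER_LITERAL_RUN; // longer than one literal buffer
    let values: Vec<u32> = (0..n).map(|i| ((i / run_len) % 2) as u32).collect();

    // The longest run of equal values is exactly `run_len`, below the assumed
    // RLE threshold, so every value must pass through the bit-packing buffer.
    let mut longest = 1usize;
    let mut current = 1usize;
    for pair in values.windows(2) {
        if pair[0] == pair[1] {
            current += 1;
            longest = longest.max(current);
        } else {
            current = 1;
        }
    }
    assert_eq!(longest, run_len);
    assert!(longest < ASSUMED_MIN_REPEATS_FOR_RLE);

    // 8192 % 5 == 2: the buffer fills two elements into a run, so the last
    // two buffered values belong to a run whose remaining repeats are still
    // being counted when the flush happens.
    assert_eq!(MAX_VALUES_PER_LITERAL_RUN % run_len, 2);
    println!(
        "buffer fills {} elements into a run",
        MAX_VALUES_PER_LITERAL_RUN % run_len
    );
}
```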
CodSpeed Performance Report: Merging #16125 will not alter performance.
Compare: b91f064 to 1a676bd
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #16125      +/-   ##
==========================================
+ Coverage   80.95%   80.98%   +0.02%
==========================================
  Files        1386     1386
  Lines      178434   178511      +77
  Branches     2878     2878
==========================================
+ Hits       144455   144562     +107
+ Misses      33487    33457      -30
  Partials      492      492

View full report in Codecov by Sentry.
// * Final run: Extra values from buffer will be included in
//   packed[..compressed_remainder_size] but ignored when decoding
//   because they extend beyond known column length
bitpacked::encode_pack(&buffer, num_bits, packed.as_mut());
In my original PR, I changed this to `&buffer[..remainder]` because I was worried about other data being included from the buffer (the same buffer is used before reaching this branch). I now realize this is a non-issue for the reasons described in my comment, so I changed it back to `&buffer`. This theoretically avoids an extra memory allocation in `bitpacked::encode_pack`. Unfortunately, the current `bitpacked::encode_pack` always copies input data to a new array because it compares the input slice length to `T::Packed::LENGTH` instead of `T::Unpacked::LENGTH`. I'll fix that and optimize bit-packing in a separate PR.
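To illustrate why packing the whole buffer for the final run is safe, here is a toy bit-packer; it is not the real `bitpacked::encode_pack` API, and `pack_chunk` and `PACK_LEN` are hypothetical stand-ins. The full fixed-size chunk is packed, but only the bytes covering the `remainder` real values are written, and a reader that knows the column length stops after `remainder` values, so the leftover buffer contents never surface.

```rust
// Toy illustration: pack the whole fixed-size buffer, write only the bytes
// that cover the `remainder` real values, and decode only `remainder` values.
const PACK_LEN: usize = 32; // stand-in for T::Unpacked::LENGTH

/// Hypothetical packer: packs PACK_LEN values of `num_bits` each, LSB first.
fn pack_chunk(buffer: &[u32; PACK_LEN], num_bits: usize, out: &mut Vec<u8>) {
    out.resize((PACK_LEN * num_bits + 7) / 8, 0);
    let mut bit_pos = 0usize;
    for &v in buffer {
        for b in 0..num_bits {
            if (v >> b) & 1 == 1 {
                out[(bit_pos + b) / 8] |= 1 << ((bit_pos + b) % 8);
            }
        }
        bit_pos += num_bits;
    }
}

fn main() {
    let num_bits = 3;
    let remainder = 5; // only the first 5 slots hold real values
    let mut buffer = [0u32; PACK_LEN];
    buffer[..remainder].copy_from_slice(&[1, 7, 2, 5, 4]);
    buffer[remainder..].fill(6); // leftover values from earlier runs

    let mut packed = Vec::new();
    pack_chunk(&buffer, num_bits, &mut packed);

    // Only the bytes covering `remainder` values go into the file.
    let compressed_remainder_size = (remainder * num_bits + 7) / 8;
    let written = &packed[..compressed_remainder_size];

    // A decoder that knows the column length reads exactly `remainder` values
    // and ignores any trailing bits, so the leftover values never surface.
    let decoded: Vec<u32> = (0..remainder)
        .map(|i| {
            let mut v = 0u32;
            for b in 0..num_bits {
                let bit = i * num_bits + b;
                v |= (((written[bit / 8] >> (bit % 8)) & 1) as u32) << b;
            }
            v
        })
        .collect();
    assert_eq!(decoded, vec![1, 7, 2, 5, 4]);
    println!("decoded {:?} from {} byte(s)", decoded, written.len());
}
```

Passing the full fixed-size chunk lets the packer work on a constant-length input without slicing or reallocating, which is the kind of allocation saving mentioned above.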
Alright, thank you for the retry @thalassemia. Waiting for that follow-up. :)
…#16125) Co-authored-by: ritchie <ritchie46@gmail.com>
Second attempt at #15959. This adds two commits on top of my original PR. One contains a fix for #16109 and better code comments (bb09b83). The other updates my test to use more realistic random data and adds new columns to check each possible branch condition (ee90a5e). I verified that these columns invoke the branch(es) mentioned in their associated comments by manually stepping through with a debugger. If you remove just the first new commit and run the test suite (running the new test on my original PR), it fails as expected.