fix: parquet hybrid RLE encoding did not always align to bit width #13883
Fixes #13818.
If there were remainder values after dividing by the block size, these remainder values were not written with alignment to the bit width.
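As a rough illustration of the alignment described above (a minimal Python sketch, not the code changed in this PR): the Parquet spec packs bit-packed runs in groups of 8 values, least significant bit first, so each group occupies exactly `bit_width` bytes. The function name `bit_pack_padded` and the choice to pad with zeros are assumptions made for this example.

```python
def bit_pack_padded(values, bit_width):
    """Bit-pack values LSB-first, zero-padding the last group to 8 values.

    Hypothetical helper for illustration only. Each group of 8 values
    occupies exactly `bit_width` bytes, so the output length is always a
    multiple of `bit_width`.
    """
    if bit_width <= 0:
        raise ValueError("bit_width must be positive")
    padded = list(values)
    # Pad the remainder so the value count is a multiple of 8.
    padded += [0] * (-len(padded) % 8)

    out = bytearray()
    for start in range(0, len(padded), 8):
        acc = 0
        for i, v in enumerate(padded[start:start + 8]):
            if v < 0 or v >> bit_width:
                raise ValueError(f"value {v} does not fit in {bit_width} bits")
            acc |= v << (i * bit_width)
        # 8 values * bit_width bits == bit_width bytes.
        out += acc.to_bytes(bit_width, "little")
    return bytes(out)


# The 0..7 example at bit width 3 from the Parquet encodings page.
packed = bit_pack_padded(range(8), 3)
# Three remainder values still occupy one full group of 3 bytes.
remainder = bit_pack_padded([1, 2, 3], 3)
```

Without the padding step, three 3-bit values would need only 9 bits (2 bytes), which is the kind of unaligned output this fix is about.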
This crashes Spark because Spark reads `bitWidth` values at a time (source: https://github.com/apache/spark/blob/ce5ddad990373636e94071e7cef2f31021add07b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java#L934-L935).
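To see why an unaligned run is a problem for a reader that consumes whole groups, here is a simplified Python model. It is not Spark's code: `unpack_groups` is a hypothetical function, and the assumption that each step consumes `bit_width` bytes (one group of 8 values) follows the Parquet spec's layout rather than any particular reader.

```python
def unpack_groups(buf, bit_width, num_groups):
    """Decode `num_groups` bit-packed groups of 8 values each.

    Simplified model for illustration only. Every step consumes
    `bit_width` bytes, so a run that was not padded to a full group
    leaves the reader short of input.
    """
    mask = (1 << bit_width) - 1
    values = []
    pos = 0
    for _ in range(num_groups):
        chunk = buf[pos:pos + bit_width]
        if len(chunk) < bit_width:
            raise ValueError(
                f"truncated run: need {bit_width} bytes, got {len(chunk)}"
            )
        acc = int.from_bytes(chunk, "little")
        # Values are stored least significant bit first.
        values.extend((acc >> (i * bit_width)) & mask for i in range(8))
        pos += bit_width
    return values


# An aligned group decodes cleanly.
aligned = unpack_groups(bytes([0x88, 0xC6, 0xFA]), 3, 1)

# Three 3-bit values written without padding take only 2 bytes,
# so reading one full group fails.
try:
    unpack_groups(bytes(2), 3, 1)
    failed = False
except ValueError:
    failed = True
```

A real reader may fail differently (for example with a buffer underflow or by reading into the next run), but the cause is the same: the writer emitted fewer bytes than a full group requires.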
Note: to reproduce the issue on `main`, one should use an `Int32` column of values instead of `String`, and ensure that the output of `pq.read_metadata().row_group(0).column(0)` contains `RLE_DICTIONARY` under `encodings`. This df should reproduce the issue:

Relevant spec page: https://parquet.apache.org/docs/file-format/data-pages/encodings/