-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix read parquet for different encodings across row groups #9129
Fix read parquet for different encodings across row groups #9129
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova @majetideepak @Yuhta Could you help to review? Looks like the CI failure is irrelevant. Thanks! |
0946b0c
to
031f7e0
Compare
@Yuhta Can you help to review and merge? Result mismatch is a fatal issue to customer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix
…ncubator#9129) Summary: When the parquet file contains a column that uses both plain and dictionary encodings across row groups, the output after filter can be wrong. The root cause is that `returnReaderNulls_` in `SelectiveColumnReader` may not be reset when the column encoding switches from dictionary encoding to plain encoding. Here is the parquet metadata for the test file to reproduce this issue. `n_2` is dictionary encoded in row group 2 and is plain encoded in row group 3. ``` file: file:/home/sparkuser/github/oap-project/velox/velox/dwio/parquet/tests/examples/different_encodings_with_filter.parquet creator: parquet-cpp-arrow version 15.0.0 extra: ARROW:schema = /////9gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAMAAABwAAAAMAAAAAQAAACs////AAABBRAAAAAYAAAABAAAAAAAAAADAAAAbl8yAAQABAAEAAAA1P///wAAAQIQAAAAFAAAAAQAAAAAAAAAAwAAAG5fMQDE////AAAAASAAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIQAAAAHAAAAAQAAAAAAAAAAwAAAG5fMAAIAAwACAAHAAgAAAAAAAABIAAAAAAAAAA= file schema: schema -------------------------------------------------------------------------------- n_0: OPTIONAL INT32 R:0 D:1 n_1: OPTIONAL INT32 R:0 D:1 n_2: OPTIONAL BINARY O:UTF8 R:0 D:1 row group 1: RC:5 TS:185 OFFSET:4 -------------------------------------------------------------------------------- n_0: INT32 SNAPPY DO:0 FPO:4 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 5, num_nulls: 0] n_1: INT32 SNAPPY DO:0 FPO:143 SZ:65/71/1.09 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 2, num_nulls: 0] n_2: BINARY SNAPPY DO:0 FPO:276 SZ:45/43/0.96 VC:5 ENC:RLE,PLAIN ST:[min: A, max: B, num_nulls: 3] row group 2: RC:5 TS:202 OFFSET:369 -------------------------------------------------------------------------------- n_0: INT32 SNAPPY DO:0 FPO:369 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 5, num_nulls: 0] n_1: INT32 SNAPPY DO:0 FPO:509 SZ:65/71/1.09 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 2, num_nulls: 0] n_2: BINARY SNAPPY DO:642 FPO:668 SZ:64/60/0.94 VC:5 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: A, max: B, num_nulls: 3] row group 3: RC:5 TS:195 OFFSET:766 -------------------------------------------------------------------------------- n_0: INT32 SNAPPY DO:0 FPO:766 SZ:73/71/0.97 VC:5 ENC:RLE,PLAIN ST:[min: 6, max: 10, num_nulls: 0] n_1: INT32 SNAPPY DO:0 FPO:907 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0] n_2: BINARY SNAPPY DO:0 FPO:1047 SZ:55/53/0.96 VC:5 ENC:RLE,PLAIN ST:[min: F, max: J, num_nulls: 1] ``` Pull Request resolved: facebookincubator#9129 Reviewed By: mbasmanova Differential Revision: D55018054 Pulled By: Yuhta fbshipit-source-id: 83dbab02575321930262628ea60cd4ad3e7b8892
When the parquet file contains a column that uses both plain and dictionary encodings across row groups, the output after filter can be wrong. The root cause is that
returnReaderNulls_
inSelectiveColumnReader
may not be reset when the column encoding switches from dictionary encoding to plain encoding.Here is the parquet metadata for the test file to reproduce this issue.
n_2
is dictionary encoded in row group 2 and is plain encoded in row group 3.