Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix read parquet for different encodings across row groups #9129

Closed

Conversation

marin-ma
Copy link
Contributor

When the parquet file contains a column that uses both plain and dictionary encodings across row groups, the output after filter can be wrong. The root cause is that returnReaderNulls_ in SelectiveColumnReader may not be reset when the column encoding switches from dictionary encoding to plain encoding.

Here is the parquet metadata for the test file to reproduce this issue. n_2 is dictionary encoded in row group 2 and is plain encoded in row group 3.

file:        file:/home/sparkuser/github/oap-project/velox/velox/dwio/parquet/tests/examples/different_encodings_with_filter.parquet
creator:     parquet-cpp-arrow version 15.0.0
extra:       ARROW:schema = /////9gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAMAAABwAAAAMAAAAAQAAACs////AAABBRAAAAAYAAAABAAAAAAAAAADAAAAbl8yAAQABAAEAAAA1P///wAAAQIQAAAAFAAAAAQAAAAAAAAAAwAAAG5fMQDE////AAAAASAAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIQAAAAHAAAAAQAAAAAAAAAAwAAAG5fMAAIAAwACAAHAAgAAAAAAAABIAAAAAAAAAA=

file schema: schema
--------------------------------------------------------------------------------
n_0:         OPTIONAL INT32 R:0 D:1
n_1:         OPTIONAL INT32 R:0 D:1
n_2:         OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:5 TS:185 OFFSET:4
--------------------------------------------------------------------------------
n_0:          INT32 SNAPPY DO:0 FPO:4 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 5, num_nulls: 0]
n_1:          INT32 SNAPPY DO:0 FPO:143 SZ:65/71/1.09 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 2, num_nulls: 0]
n_2:          BINARY SNAPPY DO:0 FPO:276 SZ:45/43/0.96 VC:5 ENC:RLE,PLAIN ST:[min: A, max: B, num_nulls: 3]

row group 2: RC:5 TS:202 OFFSET:369
--------------------------------------------------------------------------------
n_0:          INT32 SNAPPY DO:0 FPO:369 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 5, num_nulls: 0]
n_1:          INT32 SNAPPY DO:0 FPO:509 SZ:65/71/1.09 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 2, num_nulls: 0]
n_2:          BINARY SNAPPY DO:642 FPO:668 SZ:64/60/0.94 VC:5 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: A, max: B, num_nulls: 3]

row group 3: RC:5 TS:195 OFFSET:766
--------------------------------------------------------------------------------
n_0:          INT32 SNAPPY DO:0 FPO:766 SZ:73/71/0.97 VC:5 ENC:RLE,PLAIN ST:[min: 6, max: 10, num_nulls: 0]
n_1:          INT32 SNAPPY DO:0 FPO:907 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
n_2:          BINARY SNAPPY DO:0 FPO:1047 SZ:55/53/0.96 VC:5 ENC:RLE,PLAIN ST:[min: F, max: J, num_nulls: 1]

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 18, 2024
Copy link

netlify bot commented Mar 18, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 031f7e0
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65f90fb6575d5b0008203345

@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@marin-ma
Copy link
Contributor Author

@mbasmanova @majetideepak @Yuhta Could you help to review? Looks like the CI failure is irrelevant. Thanks!

@FelixYBW
Copy link
Contributor

It's noted from a customer workload. The bug leads to wrong results returned from Gluten.

@FelixYBW
Copy link
Contributor

@Yuhta Can you help to review and merge? Result mismatch is a fatal issue to customer.

Copy link
Contributor

@Yuhta Yuhta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix

@facebook-github-bot
Copy link
Contributor

@Yuhta merged this pull request in 94f52ec.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…ncubator#9129)

Summary:
When the parquet file contains a column that uses both plain and dictionary encodings across row groups, the output after filter can be wrong. The root cause is that `returnReaderNulls_` in `SelectiveColumnReader` may not be reset when the column encoding switches from dictionary encoding to plain encoding.

Here is the parquet metadata for the test file to reproduce this issue. `n_2` is dictionary encoded in row group 2 and is plain encoded in row group 3.

```
file:        file:/home/sparkuser/github/oap-project/velox/velox/dwio/parquet/tests/examples/different_encodings_with_filter.parquet
creator:     parquet-cpp-arrow version 15.0.0
extra:       ARROW:schema = /////9gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAMAAABwAAAAMAAAAAQAAACs////AAABBRAAAAAYAAAABAAAAAAAAAADAAAAbl8yAAQABAAEAAAA1P///wAAAQIQAAAAFAAAAAQAAAAAAAAAAwAAAG5fMQDE////AAAAASAAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIQAAAAHAAAAAQAAAAAAAAAAwAAAG5fMAAIAAwACAAHAAgAAAAAAAABIAAAAAAAAAA=

file schema: schema
--------------------------------------------------------------------------------
n_0:         OPTIONAL INT32 R:0 D:1
n_1:         OPTIONAL INT32 R:0 D:1
n_2:         OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:5 TS:185 OFFSET:4
--------------------------------------------------------------------------------
n_0:          INT32 SNAPPY DO:0 FPO:4 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 5, num_nulls: 0]
n_1:          INT32 SNAPPY DO:0 FPO:143 SZ:65/71/1.09 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 2, num_nulls: 0]
n_2:          BINARY SNAPPY DO:0 FPO:276 SZ:45/43/0.96 VC:5 ENC:RLE,PLAIN ST:[min: A, max: B, num_nulls: 3]

row group 2: RC:5 TS:202 OFFSET:369
--------------------------------------------------------------------------------
n_0:          INT32 SNAPPY DO:0 FPO:369 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 5, num_nulls: 0]
n_1:          INT32 SNAPPY DO:0 FPO:509 SZ:65/71/1.09 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 2, num_nulls: 0]
n_2:          BINARY SNAPPY DO:642 FPO:668 SZ:64/60/0.94 VC:5 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: A, max: B, num_nulls: 3]

row group 3: RC:5 TS:195 OFFSET:766
--------------------------------------------------------------------------------
n_0:          INT32 SNAPPY DO:0 FPO:766 SZ:73/71/0.97 VC:5 ENC:RLE,PLAIN ST:[min: 6, max: 10, num_nulls: 0]
n_1:          INT32 SNAPPY DO:0 FPO:907 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
n_2:          BINARY SNAPPY DO:0 FPO:1047 SZ:55/53/0.96 VC:5 ENC:RLE,PLAIN ST:[min: F, max: J, num_nulls: 1]

```

Pull Request resolved: facebookincubator#9129

Reviewed By: mbasmanova

Differential Revision: D55018054

Pulled By: Yuhta

fbshipit-source-id: 83dbab02575321930262628ea60cd4ad3e7b8892
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants