Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix](parquet) fix dict page offset zero incompatible behaviour #41506

Merged
merged 3 commits into from
Oct 10, 2024

Conversation

suxiaogang223
Copy link
Contributor

Proposed changes

Reason: https://issues.apache.org/jira/browse/ARROW-5322
Java readers(parquet-mr) handles "dictionaryPageOffset = 0" to determine if dictionary page exists where as the C readers uses "has_dictionaryPageOffset" (_isset bit in thrift message) to determine the same resulting in incompatible behaviours.
Therefore, we should consider that dicttionary page exists when both __isset.dictionary_page_offset is true and dictionary_page_offset is greater than 0.

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@suxiaogang223
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.32% (9630/25805)
Line Coverage: 28.69% (79853/278283)
Region Coverage: 28.14% (41294/146757)
Branch Coverage: 24.75% (21036/84980)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6797a8f28b8ba8c9141f2672df3d9c0382ea825a_6797a8f28b8ba8c9141f2672df3d9c0382ea825a/report/index.html

@suxiaogang223
Copy link
Contributor Author

@CodiumAI-Agent /review

@suxiaogang223
Copy link
Contributor Author

run buildall

@CodiumAI-Agent
Copy link

CodiumAI-Agent commented Oct 6, 2024

PR Reviewer Guide 🔍

(Review updated until commit 215f451)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Logic Change
The logic for determining the start offset has been modified to include a check for dictionary_page_offset > 0. Ensure this change aligns with expected behavior across all scenarios.

Logic Change
Similar to vparquet_column_chunk_reader.cpp, the logic for determining chunk_start has been updated. This needs thorough testing to confirm no regression in other use cases.

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.32% (9630/25805)
Line Coverage: 28.71% (79891/278284)
Region Coverage: 28.14% (41295/146763)
Branch Coverage: 24.76% (21043/84984)
Coverage Report: http://coverage.selectdb-in.cc/coverage/215f451c9a3cf5c1d2c3c900a5f19f9d000b83ff_215f451c9a3cf5c1d2c3c900a5f19f9d000b83ff/report/index.html

@CodiumAI-Agent
Copy link

Persistent review updated to latest commit 215f451

@suxiaogang223
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.28% (9631/25833)
Line Coverage: 28.66% (79847/278579)
Region Coverage: 28.10% (41294/146932)
Branch Coverage: 24.73% (21040/85090)
Coverage Report: http://coverage.selectdb-in.cc/coverage/a1b18b1995510b594024c13f1c55ec2805aeb499_a1b18b1995510b594024c13f1c55ec2805aeb499/report/index.html

Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

github-actions bot commented Oct 8, 2024

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Oct 8, 2024
Copy link
Contributor

github-actions bot commented Oct 8, 2024

PR approved by anyone and no changes requested.

@morningman morningman merged commit 0d2550f into apache:master Oct 10, 2024
23 of 28 checks passed
@suxiaogang223 suxiaogang223 deleted the dict_page_offset_zero branch October 10, 2024 08:02
eldenmoon pushed a commit to eldenmoon/incubator-doris that referenced this pull request Oct 10, 2024
…he#41506)

## Proposed changes
Reason: https://issues.apache.org/jira/browse/ARROW-5322
Java readers(parquet-mr) handles "dictionaryPageOffset = 0" to determine
if dictionary page exists where as the C readers uses
"has_dictionaryPageOffset" (_isset bit in thrift message) to determine
the same resulting in incompatible behaviours.
Therefore, we should consider that dicttionary page exists when both
`__isset.dictionary_page_offset` is true and `dictionary_page_offset` is
greater than 0.
cjj2010 pushed a commit to cjj2010/doris that referenced this pull request Oct 12, 2024
…he#41506)

## Proposed changes
Reason: https://issues.apache.org/jira/browse/ARROW-5322
Java readers(parquet-mr) handles "dictionaryPageOffset = 0" to determine
if dictionary page exists where as the C readers uses
"has_dictionaryPageOffset" (_isset bit in thrift message) to determine
the same resulting in incompatible behaviours.
Therefore, we should consider that dicttionary page exists when both
`__isset.dictionary_page_offset` is true and `dictionary_page_offset` is
greater than 0.
amorynan pushed a commit to amorynan/doris that referenced this pull request Oct 12, 2024
…he#41506)

## Proposed changes
Reason: https://issues.apache.org/jira/browse/ARROW-5322
Java readers(parquet-mr) handles "dictionaryPageOffset = 0" to determine
if dictionary page exists where as the C readers uses
"has_dictionaryPageOffset" (_isset bit in thrift message) to determine
the same resulting in incompatible behaviours.
Therefore, we should consider that dicttionary page exists when both
`__isset.dictionary_page_offset` is true and `dictionary_page_offset` is
greater than 0.
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Oct 16, 2024
…he#41506)

## Proposed changes
Reason: https://issues.apache.org/jira/browse/ARROW-5322
Java readers(parquet-mr) handles "dictionaryPageOffset = 0" to determine
if dictionary page exists where as the C readers uses
"has_dictionaryPageOffset" (_isset bit in thrift message) to determine
the same resulting in incompatible behaviours.
Therefore, we should consider that dicttionary page exists when both
`__isset.dictionary_page_offset` is true and `dictionary_page_offset` is
greater than 0.
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Oct 16, 2024
…he#41506)

## Proposed changes
Reason: https://issues.apache.org/jira/browse/ARROW-5322
Java readers(parquet-mr) handles "dictionaryPageOffset = 0" to determine
if dictionary page exists where as the C readers uses
"has_dictionaryPageOffset" (_isset bit in thrift message) to determine
the same resulting in incompatible behaviours.
Therefore, we should consider that dicttionary page exists when both
`__isset.dictionary_page_offset` is true and `dictionary_page_offset` is
greater than 0.
yiguolei pushed a commit that referenced this pull request Oct 17, 2024
…bug (#41931)

## Proposed changes
pick pr:
  #41683
  #41506
  #41338
  #39326

---------

Co-authored-by: morningman <morningman@163.com>
morningman added a commit that referenced this pull request Oct 17, 2024
) (#41923)

## Proposed changes
pick prs:
#41506
#41526
#41683
#41816

---------

Co-authored-by: morningman <morningman@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/2.1.7-merged dev/3.0.3-merged reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants