-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-38018][SQL] Fix ColumnVectorUtils.populate to handle CalendarIntervalType correctly #35314
Conversation
@cloud-fan could you help take a look when you have time? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thanks, @c21 .
thanks, merging to master! |
Thank you @cloud-fan, @dongjoon-hyun and @yikf for review! |
btw @cloud-fan & @dongjoon-hyun - do we want to add this to older branches (3.2-) ? |
+1 for backporting, @c21 . Could you make a backporting PR to the applicable branches? |
…ndarIntervalType correctly ### What changes were proposed in this pull request? This is a backport of #35314 to branch 3.2. See that original PR for context. ### Why are the changes needed? To fix potential correctness issue. ### Does this PR introduce _any_ user-facing change? No but fix the exiting correctness issue when reading partition column with CalendarInterval type. ### How was this patch tested? Added unit test in `ColumnVectorSuite.scala`. Closes #37114 from c21/branch-3.2. Authored-by: Cheng Su <scnju13@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ndarIntervalType correctly ### What changes were proposed in this pull request? This is a backport of apache#35314 to branch 3.2. See that original PR for context. ### Why are the changes needed? To fix potential correctness issue. ### Does this PR introduce _any_ user-facing change? No but fix the exiting correctness issue when reading partition column with CalendarInterval type. ### How was this patch tested? Added unit test in `ColumnVectorSuite.scala`. Closes apache#37114 from c21/branch-3.2. Authored-by: Cheng Su <scnju13@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c5983c1) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
ColumnVectorUtils.populate()
does not handle CalendarInterval type correctly. The CalendarInterval type is in the format of (months: int, days: int, microseconds: long). However, the function above missesdays
field, and setsmicroseconds
field in wrong position.ColumnVectorUtils.populate()
is used by Parquet and ORC vectorized reader to read partition column. So technically Spark can potentially produce wrong result if reading table with CalendarInterval partition column. However I also notice Spark explicitly disallows writing data with CalendarInterval type, so it might not be a big deal for users. But it's worth to fix anyway.Caveat: I found the bug when reading through the related code path, but I never encountered the issue in production for partition column with CalendarInterval type. I think it should be an obvious fix unless anyone more experienced could find some more historical context. The code was introduced a long time ago where I couldn't find any more info why it was implemented as it is (#11435)
Why are the changes needed?
To fix potential correctness issue.
Does this PR introduce any user-facing change?
No but fix the exiting correctness issue when reading partition column with CalendarInterval type.
How was this patch tested?
Added unit test in
ColumnVectorSuite.scala
.Verified the unit test failed with exception below without this PR: