[SPARK-32168][SQL] Fix hidden partitioning correctness bug in SQL overwrite #28993
Conversation
@cloud-fan, @brkyvz, @aokolnychyi, @dbtsai, @dongjoon-hyun, you may be interested in this PR. This fixes a correctness bug in SQL INSERT INTO with v2 tables. It only affects hidden-partitioned tables, so the impact is limited. I think we should aim to get this into 3.0.1 if possible.
Thank you for pinging me, @rdblue.
ok to test
BTW, the AmpLab Jenkins farm has been out of order since last Friday.
Test build #124925 has finished for PR 28993 at commit
Hi, @rdblue. Could you take a look at the relevant UT failures?
Good catch! LGTM, let's update the failed tests.
The tests were failing because
Thank you for updating, @rdblue.
Test build #125112 has finished for PR 28993 at commit
Retest this please.
Test build #125119 has finished for PR 28993 at commit
retest this please
Test build #125142 has finished for PR 28993 at commit
retest this please
Test build #125193 has finished for PR 28993 at commit
retest this please
Test build #125222 has finished for PR 28993 at commit
+1, LGTM.
Test build #125256 has finished for PR 28993 at commit
Seems like a flaky test. @huaxingao, can you take a look?
retest this please
Test build #125297 has finished for PR 28993 at commit
LGTM2
retest this please
Test build #125322 has finished for PR 28993 at commit
retest this please
Test build #125355 has finished for PR 28993 at commit
retest this please
Test build #125369 has finished for PR 28993 at commit
retest this please
### What changes were proposed in this pull request?

This PR aims to disable SBT `unidoc` generation testing in the Jenkins environment, because it is flaky there and is not used for the official documentation generation. GitHub Actions already has the correct test coverage for the official documentation generation.

- #28848 (comment) (amp-jenkins-worker-06)
- #28926 (comment) (amp-jenkins-worker-06)
- #28969 (comment) (amp-jenkins-worker-06)
- #28975 (comment) (amp-jenkins-worker-05)
- #28986 (comment) (amp-jenkins-worker-05)
- #28992 (comment) (amp-jenkins-worker-06)
- #28993 (comment) (amp-jenkins-worker-05)
- #28999 (comment) (amp-jenkins-worker-04)
- #29010 (comment) (amp-jenkins-worker-03)
- #29013 (comment) (amp-jenkins-worker-04)
- #29016 (comment) (amp-jenkins-worker-05)
- #29025 (comment) (amp-jenkins-worker-04)
- #29042 (comment) (amp-jenkins-worker-03)

### Why are the changes needed?

Apache Spark's `release-build.sh` generates the official documentation with the following command:

- https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L341

```bash
PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" jekyll build
```

This in turn runs the following `unidoc` command for the Scala/Java API docs:

- https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30

```ruby
system("build/sbt -Pkinesis-asl clean compile unidoc") || raise("Unidoc generation failed")
```

However, the PR builder disables the `Jekyll build` and instead has different test coverage:

```python
# determine if docs were changed and if we're inside the amplab environment
# note - the below commented out until *all* Jenkins workers can get `jekyll` installed
# if "DOCS" in changed_modules and test_env == "amplab_jenkins":
#     build_spark_documentation()
```

```
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pspark-ganglia-lgpl -Pkubernetes -Pmesos -Phadoop-cloud -Phive -Phive-thriftserver -Pkinesis-asl -Pyarn unidoc
```

### Does this PR introduce _any_ user-facing change?

No. (This is used only for testing and not used in the official doc generation.)

### How was this patch tested?

Pass the Jenkins without the doc generation invocation.

Closes #29017 from dongjoon-hyun/SPARK-DOC-GEN.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Test build #125382 has finished for PR 28993 at commit
Thank you, @rdblue and all. Merged to master/3.0.
…rwrite

### What changes were proposed in this pull request?

When converting an `INSERT OVERWRITE` query to a v2 overwrite plan, Spark attempts to detect when a dynamic overwrite and a static overwrite will produce the same result, so it can use the static overwrite. Spark incorrectly detects when dynamic and static overwrites are equivalent when there are hidden partitions, such as `days(ts)`.

This updates the analyzer rule `ResolveInsertInto` to always use a dynamic overwrite when the mode is dynamic, and a static overwrite when the mode is static. This avoids the problem by not trying to determine whether the two plans are equivalent, and instead always using the one that corresponds to the partition overwrite mode.

### Why are the changes needed?

This is a correctness bug. If a table has hidden partitions, all of the values for those partitions are dropped instead of dynamically overwriting only the changed partitions.

This only affects SQL commands (not `DataFrameWriter`) writing to tables that have hidden partitions. It is also only a problem when the partition overwrite mode is dynamic.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the correctness bug detailed above.

### How was this patch tested?

* This updates the in-memory table to support a hidden partition transform, `days`, and adds a test case to `DataSourceV2SQLSuite` in which the table uses this hidden partition function. This test fails without the fix to `ResolveInsertInto`.
* This updates the test case `InsertInto: overwrite - multiple static partitions - dynamic mode` in `InsertIntoTests`. The result of the SQL command is unchanged, but the SQL command will now use a dynamic overwrite, so the test now uses `dynamicOverwriteTest`.

Closes #28993 from rdblue/fix-insert-overwrite-v2-conversion.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 3bb1ac5)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
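To make the difference between the two overwrite modes concrete, here is a minimal Python sketch (not Spark code; all names are hypothetical) that models a table hidden-partitioned by a `days(ts)` transform. A dynamic overwrite replaces only the partitions present in the incoming rows, while a static overwrite with no static partition values truncates the whole table first, which is the behavior the buggy plan conversion could incorrectly select:

```python
from datetime import date, datetime

def days(ts: datetime) -> date:
    # Toy stand-in for Spark's hidden days() partition transform:
    # derive the partition value from the ts column.
    return ts.date()

def dynamic_overwrite(table: dict, rows: list) -> dict:
    """Replace only the partitions that appear in the incoming rows."""
    touched = {days(ts) for ts, _ in rows}
    result = {p: rs for p, rs in table.items() if p not in touched}
    for ts, v in rows:
        result.setdefault(days(ts), []).append((ts, v))
    return result

def static_overwrite(table: dict, rows: list) -> dict:
    """With no static partition values, a static overwrite truncates
    the table before writing, dropping every existing partition."""
    result = {}
    for ts, v in rows:
        result.setdefault(days(ts), []).append((ts, v))
    return result

table = {
    date(2020, 1, 1): [(datetime(2020, 1, 1, 9), "a")],
    date(2020, 1, 2): [(datetime(2020, 1, 2, 9), "b")],
}
incoming = [(datetime(2020, 1, 2, 10), "c")]

# Dynamic mode keeps the untouched 2020-01-01 partition...
assert date(2020, 1, 1) in dynamic_overwrite(table, incoming)
# ...while the (incorrectly chosen) static plan silently drops it.
assert date(2020, 1, 1) not in static_overwrite(table, incoming)
```

This illustrates why the fix makes `ResolveInsertInto` pick the plan strictly from the configured partition overwrite mode rather than trying to prove the two plans equivalent.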
Thanks to everyone that reviewed, and to everyone who helped keep the tests running until they passed!