
[SPARK-32168][SQL] Fix hidden partitioning correctness bug in SQL overwrite #28993

Closed

Conversation

@rdblue (Contributor) commented Jul 3, 2020:

What changes were proposed in this pull request?

When converting an `INSERT OVERWRITE` query to a v2 overwrite plan, Spark attempts to detect when a dynamic overwrite and a static overwrite will produce the same result, so that it can use the static overwrite. Spark incorrectly treats dynamic and static overwrites as equivalent when there are hidden partitions, such as `days(ts)`.

This updates the analyzer rule `ResolveInsertInto` to always use a dynamic overwrite when the mode is dynamic, and a static overwrite when the mode is static. This avoids the problem by not trying to determine whether the two plans are equivalent, and instead always using the plan that corresponds to the partition overwrite mode.
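To make the change concrete, here is a minimal sketch of the decision the rule now makes. The plan names `AppendData`, `OverwriteByExpression`, and `OverwritePartitionsDynamic` match Spark's v2 logical plans, but the stand-in types and the helper function below are illustrative only, not the actual `ResolveInsertInto` code:

```scala
// Simplified stand-ins; not Spark's catalyst classes.
sealed trait V2WritePlan
case object AppendData extends V2WritePlan
case object OverwriteByExpression extends V2WritePlan        // static overwrite
case object OverwritePartitionsDynamic extends V2WritePlan   // dynamic overwrite

sealed trait PartitionOverwriteMode
case object StaticMode extends PartitionOverwriteMode
case object DynamicMode extends PartitionOverwriteMode

// Before the fix, the rule could decide that a dynamic overwrite was "equivalent"
// to a static one (e.g. when every partition column appeared to have a static
// value) and pick the static plan even in dynamic mode. Hidden partitions such
// as days(ts) break that equivalence check. After the fix, the chosen plan
// simply follows the configured mode:
def planOverwrite(isOverwrite: Boolean, mode: PartitionOverwriteMode): V2WritePlan =
  if (!isOverwrite) AppendData
  else mode match {
    case DynamicMode => OverwritePartitionsDynamic
    case StaticMode  => OverwriteByExpression
  }
```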

Why are the changes needed?

This is a correctness bug. If a table has hidden partitions, all of the data in those partitions is dropped instead of dynamically overwriting only the changed partitions.

This only affects SQL commands (not `DataFrameWriter`) writing to tables that have hidden partitions, and only when the partition overwrite mode is dynamic.
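Roughly, the failure scenario looks like the sketch below. The catalog class and table names are hypothetical, and the catalog is assumed to be a v2 catalog that supports hidden partition transforms; `spark.sql.sources.partitionOverwriteMode` is the configuration that selects dynamic mode:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hidden-partition-overwrite-repro")
  // Hypothetical v2 catalog implementation that supports hidden partition transforms.
  .config("spark.sql.catalog.demo", "com.example.DemoCatalog")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .getOrCreate()

// The table is partitioned by a hidden transform of ts; days(ts) is not a column.
spark.sql("""
  CREATE TABLE demo.db.events (id BIGINT, data STRING, ts TIMESTAMP)
  PARTITIONED BY (days(ts))
""")

// With the bug, this SQL overwrite could be planned as a static overwrite of the
// whole table, dropping every existing day partition instead of replacing only
// the days present in the incoming rows. The DataFrameWriter path was unaffected.
spark.sql("""
  INSERT OVERWRITE demo.db.events
  SELECT id, data, ts FROM demo.db.staged_events
""")
```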

Does this PR introduce any user-facing change?

Yes, it fixes the correctness bug detailed above.

How was this patch tested?

  • This updates the in-memory table to support a hidden partition transform, `days`, and adds a test case to `DataSourceV2SQLSuite` in which the table uses this hidden partition function. The test fails without the fix to `ResolveInsertInto`.
  • This updates the test case `InsertInto: overwrite - multiple static partitions - dynamic mode` in `InsertIntoTests`. The result of the SQL command is unchanged, but the command now uses a dynamic overwrite, so the test now goes through `dynamicOverwriteTest` (see the sketch after this list).
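For context, the statement shape exercised by that test looks roughly like the following. The table, partition, and column names are illustrative, not the suite's actual fixture, and the `spark` session from the earlier sketch is reused:

```scala
// All partition columns are given static values, but the session is in dynamic
// partition-overwrite mode, so the command is now planned as a dynamic overwrite.
spark.sql("SET spark.sql.sources.partitionOverwriteMode = dynamic")
spark.sql("""
  INSERT OVERWRITE demo.db.sales PARTITION (region = 'US', day = '2020-07-03')
  SELECT id, amount FROM demo.db.staged_sales
""")
// The rows written and the partitions replaced are the same as before the change,
// which is why the test's expected results are unchanged even though it now goes
// through dynamicOverwriteTest.
```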

@rdblue (Contributor, Author) commented Jul 3, 2020:

@cloud-fan, @brkyvz, @aokolnychyi, @dbtsai, @dongjoon-hyun, you may be interested in this PR. This fixes a correctness bug in SQL INSERT INTO with v2 tables. It only affects hidden partitioned tables, so the impact is limited. I think we should aim to get this into 3.0.1 if possible.

@rdblue requested review from cloud-fan and dbtsai and removed the request for cloud-fan on July 3, 2020 at 20:58.
@dongjoon-hyun (Member) commented:

Thank you for pinging me, @rdblue.

@dongjoon-hyun (Member) commented:

ok to test

@dongjoon-hyun (Member) commented:

BTW, the AmpLab Jenkins farm has been out of order since last Friday.


SparkQA commented Jul 5, 2020

Test build #124925 has finished for PR 28993 at commit fd65dd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

Hi, @rdblue. Could you take a look at the relevant UT failures?
[Screenshot of the failing unit tests, taken 2020-07-05]

@cloud-fan (Contributor) commented:

Good catch! LGTM, let's update the failed tests.

@rdblue (Contributor, Author) commented Jul 6, 2020:

The tests were failing because InMemoryTable was ignoring hidden partitions when building partition keys. I implemented the rest of the transforms needed for the tests to fix it, so now we have a test implementation of hidden partitioning.
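A minimal, self-contained sketch of what "building a partition key from a hidden transform" means is shown below. The real change lives in Spark's test-only `InMemoryTable`; this stand-alone version, including the UTC day-boundary assumption, is illustrative only:

```scala
import java.time.{Instant, LocalDate, ZoneOffset}
import java.time.temporal.ChronoUnit

// days(ts): whole days since the Unix epoch, assuming UTC day boundaries.
// The transform is applied to the source column to derive the partition key,
// rather than expecting the key to exist as a literal column in the row.
object DaysTransform {
  private val epoch = LocalDate.ofEpochDay(0)

  def apply(ts: Instant): Int =
    ChronoUnit.DAYS.between(epoch, ts.atZone(ZoneOffset.UTC).toLocalDate).toInt
}

// Rows from different days land in different partition keys, so a dynamic
// overwrite replaces only the days that actually receive new data.
val key1 = DaysTransform(Instant.parse("2020-07-03T10:15:00Z")) // 18446
val key2 = DaysTransform(Instant.parse("2020-07-04T01:00:00Z")) // 18447
```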

@dongjoon-hyun (Member) commented:

Thank you for updating, @rdblue.


SparkQA commented Jul 6, 2020

Test build #125112 has finished for PR 28993 at commit afef6ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Jul 6, 2020:

Retest this please.


SparkQA commented Jul 6, 2020

Test build #125119 has finished for PR 28993 at commit afef6ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Jul 6, 2020:

retest this please


SparkQA commented Jul 7, 2020

Test build #125142 has finished for PR 28993 at commit afef6ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 7, 2020

Test build #125193 has finished for PR 28993 at commit afef6ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 7, 2020

Test build #125222 has finished for PR 28993 at commit afef6ce.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a review comment:

+1, LGTM.

@HyukjinKwon (Member) commented Jul 8, 2020:

One question #28993 (comment). Otherwise, I am okay considering that it's still under development. I misread. Looks good.


SparkQA commented Jul 8, 2020

Test build #125256 has finished for PR 28993 at commit 2efb84c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

python/pyspark/mllib/tests/test_streaming_algorithms.py", line 461, in condition
self.assertGreater(errors[1] - errors[-1], 2)
AssertionError: 1.672640157855923 not greater than 2

Seems like a flaky test. @huaxingao, can you take a look?

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 8, 2020

Test build #125297 has finished for PR 28993 at commit 2efb84c.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a review comment:

LGTM2

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 8, 2020

Test build #125322 has finished for PR 28993 at commit 2efb84c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 8, 2020

Test build #125355 has finished for PR 28993 at commit 2efb84c.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 8, 2020

Test build #125369 has finished for PR 28993 at commit 2efb84c.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Jul 8, 2020:

retest this please

dongjoon-hyun added a commit that referenced this pull request Jul 8, 2020
### What changes were proposed in this pull request?

This PR aims to disable SBT `unidoc` generation testing in the Jenkins environment because it is flaky there and is not used for the official documentation generation. GitHub Actions provides the correct test coverage for the official documentation generation. Example failures:

- #28848 (comment) (amp-jenkins-worker-06)
- #28926 (comment) (amp-jenkins-worker-06)
- #28969 (comment) (amp-jenkins-worker-06)
- #28975 (comment) (amp-jenkins-worker-05)
- #28986 (comment)  (amp-jenkins-worker-05)
- #28992 (comment) (amp-jenkins-worker-06)
- #28993 (comment) (amp-jenkins-worker-05)
- #28999 (comment) (amp-jenkins-worker-04)
- #29010 (comment) (amp-jenkins-worker-03)
- #29013 (comment) (amp-jenkins-worker-04)
- #29016 (comment) (amp-jenkins-worker-05)
- #29025 (comment) (amp-jenkins-worker-04)
- #29042 (comment) (amp-jenkins-worker-03)

### Why are the changes needed?

Apache Spark's `release-build.sh` generates the official documentation using the following command.
- https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L341

```bash
PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" jekyll build
```

This, in turn, runs the following `unidoc` command to build the Scala/Java API docs.
- https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30

```ruby
system("build/sbt -Pkinesis-asl clean compile unidoc") || raise("Unidoc generation failed")
```

However, the PR builder disabled `Jekyll build` and instead has a different test coverage.
```python
# determine if docs were changed and if we're inside the amplab environment
# note - the below commented out until *all* Jenkins workers can get `jekyll` installed
# if "DOCS" in changed_modules and test_env == "amplab_jenkins":
#    build_spark_documentation()
```

```
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:
-Phadoop-3.2 -Phive-2.3 -Pspark-ganglia-lgpl -Pkubernetes -Pmesos
-Phadoop-cloud -Phive -Phive-thriftserver -Pkinesis-asl -Pyarn unidoc
```

### Does this PR introduce _any_ user-facing change?

No. (This is used only for testing and not used in the official doc generation.)

### How was this patch tested?

Passes Jenkins without invoking doc generation.

Closes #29017 from dongjoon-hyun/SPARK-DOC-GEN.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

SparkQA commented Jul 8, 2020

Test build #125382 has finished for PR 28993 at commit 2efb84c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

Thank you, @rdblue and all. Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request Jul 8, 2020

Closes #28993 from rdblue/fix-insert-overwrite-v2-conversion.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 3bb1ac5)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@rdblue (Contributor, Author) commented Jul 9, 2020:

Thanks to everyone that reviewed, and to everyone who helped keep the tests running until they passed!
