
[SPARK-32168][SQL] Fix hidden partitioning correctness bug in SQL overwrite #28993

Closed

Conversation

@rdblue (Contributor) commented Jul 3, 2020:

What changes were proposed in this pull request?

When converting an `INSERT OVERWRITE` query to a v2 overwrite plan, Spark attempts to detect when a dynamic overwrite and a static overwrite will produce the same result, so that it can use the static overwrite. Spark incorrectly treats dynamic and static overwrites as equivalent when there are hidden partitions, such as `days(ts)`.

This updates the analyzer rule `ResolveInsertInto` to always use a dynamic overwrite when the mode is dynamic, and a static overwrite when the mode is static. This avoids the problem by not trying to determine whether the two plans are equivalent, and instead always using the plan that corresponds to the partition overwrite mode.
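To make the change concrete, here is a minimal sketch of the decision the rule now makes. The plan names `AppendData`, `OverwriteByExpression`, and `OverwritePartitionsDynamic` match Spark's v2 logical plans, but the stand-in types and the helper function below are illustrative only, not the actual `ResolveInsertInto` code:

```scala
// Simplified stand-ins; not Spark's catalyst classes.
sealed trait V2WritePlan
case object AppendData extends V2WritePlan
case object OverwriteByExpression extends V2WritePlan        // static overwrite
case object OverwritePartitionsDynamic extends V2WritePlan   // dynamic overwrite

sealed trait PartitionOverwriteMode
case object StaticMode extends PartitionOverwriteMode
case object DynamicMode extends PartitionOverwriteMode

// Before the fix, the rule could decide that a dynamic overwrite was "equivalent"
// to a static one (e.g. when every partition column appeared to have a static
// value) and pick the static plan even in dynamic mode. Hidden partitions such
// as days(ts) break that equivalence check. After the fix, the chosen plan
// simply follows the configured mode:
def planOverwrite(isOverwrite: Boolean, mode: PartitionOverwriteMode): V2WritePlan =
  if (!isOverwrite) AppendData
  else mode match {
    case DynamicMode => OverwritePartitionsDynamic
    case StaticMode  => OverwriteByExpression
  }
```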

Why are the changes needed?

This is a correctness bug. If a table has hidden partitions, all of the data in those partitions is dropped instead of dynamically overwriting only the changed partitions.

This only affects SQL commands (not `DataFrameWriter`) writing to tables that have hidden partitions, and only when the partition overwrite mode is dynamic.
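Roughly, the failure scenario looks like the sketch below. The catalog class and table names are hypothetical, and the catalog is assumed to be a v2 catalog that supports hidden partition transforms; `spark.sql.sources.partitionOverwriteMode` is the configuration that selects dynamic mode:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hidden-partition-overwrite-repro")
  // Hypothetical v2 catalog implementation that supports hidden partition transforms.
  .config("spark.sql.catalog.demo", "com.example.DemoCatalog")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .getOrCreate()

// The table is partitioned by a hidden transform of ts; days(ts) is not a column.
spark.sql("""
  CREATE TABLE demo.db.events (id BIGINT, data STRING, ts TIMESTAMP)
  PARTITIONED BY (days(ts))
""")

// With the bug, this SQL overwrite could be planned as a static overwrite of the
// whole table, dropping every existing day partition instead of replacing only
// the days present in the incoming rows. The DataFrameWriter path was unaffected.
spark.sql("""
  INSERT OVERWRITE demo.db.events
  SELECT id, data, ts FROM demo.db.staged_events
""")
```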

Does this PR introduce any user-facing change?

Yes, it fixes the correctness bug detailed above.

How was this patch tested?

  • This updates the in-memory table to support a hidden partition transform, `days`, and adds a test case to `DataSourceV2SQLSuite` in which the table uses this hidden partition function. The test fails without the fix to `ResolveInsertInto`.
  • This updates the test case `InsertInto: overwrite - multiple static partitions - dynamic mode` in `InsertIntoTests`. The result of the SQL command is unchanged, but the command now uses a dynamic overwrite, so the test now goes through `dynamicOverwriteTest` (see the sketch after this list).
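For context, the statement shape exercised by that test looks roughly like the following. The table, partition, and column names are illustrative, not the suite's actual fixture, and the `spark` session from the earlier sketch is reused:

```scala
// All partition columns are given static values, but the session is in dynamic
// partition-overwrite mode, so the command is now planned as a dynamic overwrite.
spark.sql("SET spark.sql.sources.partitionOverwriteMode = dynamic")
spark.sql("""
  INSERT OVERWRITE demo.db.sales PARTITION (region = 'US', day = '2020-07-03')
  SELECT id, amount FROM demo.db.staged_sales
""")
// The rows written and the partitions replaced are the same as before the change,
// which is why the test's expected results are unchanged even though it now goes
// through dynamicOverwriteTest.
```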

@rdblue (Contributor, Author) commented Jul 3, 2020:

@cloud-fan, @brkyvz, @aokolnychyi, @dbtsai, @dongjoon-hyun, you may be interested in this PR. This fixes a correctness bug in SQL INSERT INTO with v2 tables. It only affects hidden partitioned tables, so the impact is limited. I think we should aim to get this into 3.0.1 if possible.

@rdblue requested review from cloud-fan and dbtsai and removed the request for cloud-fan on July 3, 2020 at 20:58.
@dongjoon-hyun (Member) commented:

Thank you for pinging me, @rdblue.

@dongjoon-hyun (Member) commented:

ok to test

@dongjoon-hyun (Member) commented:

BTW, the AmpLab Jenkins farm has been out of order since last Friday.


SparkQA commented Jul 5, 2020

Test build #124925 has finished for PR 28993 at commit fd65dd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

Hi, @rdblue. Could you take a look at the relevant UT failures?
[Screenshot of the failing unit tests, taken 2020-07-05]

@cloud-fan (Contributor) commented:

Good catch! LGTM, let's update the failed tests.

@rdblue (Contributor, Author) commented Jul 6, 2020:

The tests were failing because InMemoryTable was ignoring hidden partitions when building partition keys. I implemented the rest of the transforms needed for the tests to fix it, so now we have a test implementation of hidden partitioning.
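A minimal, self-contained sketch of what "building a partition key from a hidden transform" means is shown below. The real change lives in Spark's test-only `InMemoryTable`; this stand-alone version, including the UTC day-boundary assumption, is illustrative only:

```scala
import java.time.{Instant, LocalDate, ZoneOffset}
import java.time.temporal.ChronoUnit

// days(ts): whole days since the Unix epoch, assuming UTC day boundaries.
// The transform is applied to the source column to derive the partition key,
// rather than expecting the key to exist as a literal column in the row.
object DaysTransform {
  private val epoch = LocalDate.ofEpochDay(0)

  def apply(ts: Instant): Int =
    ChronoUnit.DAYS.between(epoch, ts.atZone(ZoneOffset.UTC).toLocalDate).toInt
}

// Rows from different days land in different partition keys, so a dynamic
// overwrite replaces only the days that actually receive new data.
val key1 = DaysTransform(Instant.parse("2020-07-03T10:15:00Z")) // 18446
val key2 = DaysTransform(Instant.parse("2020-07-04T01:00:00Z")) // 18447
```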

@dongjoon-hyun (Member) commented:

Thank you for updating, @rdblue.


SparkQA commented Jul 6, 2020

Test build #125112 has finished for PR 28993 at commit afef6ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Jul 6, 2020:

Retest this please.


SparkQA commented Jul 6, 2020

Test build #125119 has finished for PR 28993 at commit afef6ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Jul 6, 2020:

retest this please


SparkQA commented Jul 7, 2020

Test build #125142 has finished for PR 28993 at commit afef6ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 7, 2020

Test build #125193 has finished for PR 28993 at commit afef6ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 7, 2020

Test build #125222 has finished for PR 28993 at commit afef6ce.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a review comment:

+1, LGTM.

@HyukjinKwon (Member) commented Jul 8, 2020:

One question #28993 (comment). Otherwise, I am okay considering that it's still under development. I misread. Looks good.


SparkQA commented Jul 8, 2020

Test build #125256 has finished for PR 28993 at commit 2efb84c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

python/pyspark/mllib/tests/test_streaming_algorithms.py", line 461, in condition
self.assertGreater(errors[1] - errors[-1], 2)
AssertionError: 1.672640157855923 not greater than 2

Seems like a flaky test. @huaxingao, can you take a look?

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 8, 2020

Test build #125297 has finished for PR 28993 at commit 2efb84c.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a review comment:

LGTM2

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 8, 2020

Test build #125322 has finished for PR 28993 at commit 2efb84c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 8, 2020

Test build #125355 has finished for PR 28993 at commit 2efb84c.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

retest this please


SparkQA commented Jul 8, 2020

Test build #125369 has finished for PR 28993 at commit 2efb84c.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Jul 8, 2020:

retest this please

dongjoon-hyun added a commit that referenced this pull request Jul 8, 2020
### What changes were proposed in this pull request?

This PR aims to disable SBT `unidoc` generation testing in the Jenkins environment because it is flaky there and is not used for the official documentation generation. GitHub Actions provides the correct test coverage for the official documentation generation. Example failures:

- #28848 (comment) (amp-jenkins-worker-06)
- #28926 (comment) (amp-jenkins-worker-06)
- #28969 (comment) (amp-jenkins-worker-06)
- #28975 (comment) (amp-jenkins-worker-05)
- #28986 (comment)  (amp-jenkins-worker-05)
- #28992 (comment) (amp-jenkins-worker-06)
- #28993 (comment) (amp-jenkins-worker-05)
- #28999 (comment) (amp-jenkins-worker-04)
- #29010 (comment) (amp-jenkins-worker-03)
- #29013 (comment) (amp-jenkins-worker-04)
- #29016 (comment) (amp-jenkins-worker-05)
- #29025 (comment) (amp-jenkins-worker-04)
- #29042 (comment) (amp-jenkins-worker-03)

### Why are the changes needed?

Apache Spark's `release-build.sh` generates the official documentation using the following command.
- https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L341

```bash
PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" jekyll build
```

This, in turn, runs the following `unidoc` command to build the Scala/Java API docs.
- https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30

```ruby
system("build/sbt -Pkinesis-asl clean compile unidoc") || raise("Unidoc generation failed")
```

However, the PR builder disabled `Jekyll build` and instead has a different test coverage.
```python
# determine if docs were changed and if we're inside the amplab environment
# note - the below commented out until *all* Jenkins workers can get `jekyll` installed
# if "DOCS" in changed_modules and test_env == "amplab_jenkins":
#    build_spark_documentation()
```

```
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:
-Phadoop-3.2 -Phive-2.3 -Pspark-ganglia-lgpl -Pkubernetes -Pmesos
-Phadoop-cloud -Phive -Phive-thriftserver -Pkinesis-asl -Pyarn unidoc
```

### Does this PR introduce _any_ user-facing change?

No. (This is used only for testing and not used in the official doc generation.)

### How was this patch tested?

Passes Jenkins without invoking doc generation.

Closes #29017 from dongjoon-hyun/SPARK-DOC-GEN.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

SparkQA commented Jul 8, 2020

Test build #125382 has finished for PR 28993 at commit 2efb84c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

Thank you, @rdblue and all. Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request Jul 8, 2020

Closes #28993 from rdblue/fix-insert-overwrite-v2-conversion.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 3bb1ac5)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@rdblue (Contributor, Author) commented Jul 9, 2020:

Thanks to everyone that reviewed, and to everyone who helped keep the tests running until they passed!
