[SPARK-32196][SQL] Extract In convertible part if it is not convertible #29013

ulysses-you · 2020-07-06T14:38:50Z

What changes were proposed in this pull request?

Modify OptimizeIn to extract the convertible (literal) part of an In expression when the whole list is not convertible.
Add a new config, spark.sql.optimizer.inExtractLiteralPart, to control whether the literal part of In is extracted.

Why are the changes needed?

To optimize more predicates.
First split the In list into two parts: one that is convertible and one that is not. Then the convertible part can be optimized.

Given a table: create table t1 (c1 int, c2 int) using parquet

select * from t1 where c1 in (1, 2, c2)
=>
select * from t1 where c1 in (1, 2) or c1 in (c2)

select * from t1 where c1 in(1, c2)
=>
select * from t1 where c1 = 1 or c1 in(c2)

select * from t1 where c1 in(1, 2, ..., c2)
=> 
select * from t1 where c1 inset(1, 2, ...) or c1 in (c2)
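
The three rewrites above can be modeled with a small standalone sketch. It does not use Catalyst: the `Expr` case classes, the `splitIn` function, and the `inSetThreshold` parameter (with its default of 10) are hypothetical stand-ins chosen for illustration. It only shows the shape of the split, not this PR's actual implementation.

```scala
object InSplitSketch {
  // Toy expression tree; hypothetical stand-ins, not Catalyst classes.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class In(value: Expr, list: Seq[Expr]) extends Expr
  case class InSet(value: Expr, set: Set[Int]) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // Optimize an all-literal list: a single distinct element becomes an equality,
  // more than `inSetThreshold` distinct elements become an InSet.
  private def optimizeLiterals(v: Expr, lits: Seq[Lit], inSetThreshold: Int): Expr = {
    val distinct = lits.distinct
    if (distinct.length == 1) EqualTo(v, distinct.head)
    else if (distinct.length > inSetThreshold) InSet(v, distinct.map(_.value).toSet)
    else In(v, distinct)
  }

  // Split a mixed list into a literal part and a non-literal part joined by OR.
  def splitIn(in: In, inSetThreshold: Int = 10): Expr = {
    val lits = in.list.collect { case l: Lit => l }
    val rest = in.list.filterNot(_.isInstanceOf[Lit])
    if (lits.isEmpty) in // nothing to extract
    else if (rest.isEmpty) optimizeLiterals(in.value, lits, inSetThreshold)
    else Or(optimizeLiterals(in.value, lits, inSetThreshold), In(in.value, rest))
  }
}
```

With these definitions, `InSplitSketch.splitIn(In(Attr("c1"), Seq(Lit(1), Attr("c2"))))` produces an `Or` of `EqualTo(Attr("c1"), Lit(1))` and `In(Attr("c1"), Seq(Attr("c2")))`, mirroring the second example.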

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests.

SparkQA · 2020-07-06T15:02:10Z

Test build #125092 has finished for PR 29013 at commit bbf63c9.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

ulysses-you · 2020-07-07T00:10:32Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeInSuite.scala

@@ -91,21 +91,6 @@ class OptimizeInSuite extends PlanTest {
    comparePlans(optimized, correctAnswer)
  }

-  test("OptimizedIn test: In clause not optimized in case filter has attributes") {


Removed this test since we now support converting part of the list, and the new test covers this case.

maropu · 2020-07-07T01:15:53Z

I haven't looked into the implementation yet (I just read the PR description), but do you mean this case? Or am I missing something?

scala> sql("select * from t1").show()
+---+---+
| c1| c2|
+---+---+
|  1|  3|
|  3|  3|
+---+---+


scala> sql("select * from t1 where c1 in (1, 2, c2)").show()
+---+---+
| c1| c2|
+---+---+
|  1|  3|
|  3|  3|
+---+---+


scala> sql("select * from (select * from t1 where c1 in (1, 2)) t2(c1, c2) where t2.c1 in (1, 2, t2.c2)").show()
+---+---+
| c1| c2|
+---+---+
|  1|  3|
+---+---+

ulysses-you · 2020-07-07T01:32:04Z

Ah, morning @maropu

For your case, after this PR the query changes to:

select * from t1 where c1 in (1, 2, c2)
 ||
 \/
select * from t1 where c1 in (1, 2) and c1 in (c2)

maropu · 2020-07-07T01:46:51Z

But I think select * from t1 where c1 in (1, 2, c2) is equivalent to select * from t1 where c1=1 or c1=2 or c1=c2. Why can we transform it into where c1 in (1, 2) and c1 in (c2)?

ulysses-you · 2020-07-07T01:49:21Z

I made a mistake; it should be:

select * from t1 where c1 in (1, 2, c2)
 ||
 \/
select * from t1 where c1 in (1, 2) or c1 in (c2)
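
One way to check that OR (and not AND) preserves the result is to evaluate both forms over the two rows from the example above. The sketch below is standalone Scala with no Spark dependency; it models SQL three-valued logic with `Option[Boolean]` (`None` standing for NULL), and every name in it is a hypothetical helper introduced for illustration.

```scala
object InEquivalenceSketch {
  type Tri = Option[Boolean] // None models SQL NULL (unknown)
  type Value = Option[Int]   // None models a NULL column value

  // Equality is NULL when either side is NULL.
  def sqlEq(a: Value, b: Value): Tri = for (x <- a; y <- b) yield x == y

  def sqlOr(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  def sqlAnd(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  // v IN (list) is the OR of the pairwise equalities.
  def sqlIn(v: Value, list: Seq[Value]): Tri =
    list.map(sqlEq(v, _)).foldLeft(Some(false): Tri)(sqlOr)

  // A WHERE clause keeps a row only when the predicate is TRUE.
  def keep(p: Tri): Boolean = p.contains(true)

  // c1 in (1, 2, c2)
  def original(c1: Value, c2: Value): Tri = sqlIn(c1, Seq(Some(1), Some(2), c2))
  // c1 in (1, 2) or c1 in (c2)
  def withOr(c1: Value, c2: Value): Tri =
    sqlOr(sqlIn(c1, Seq(Some(1), Some(2))), sqlIn(c1, Seq(c2)))
  // c1 in (1, 2) and c1 in (c2)
  def withAnd(c1: Value, c2: Value): Tri =
    sqlAnd(sqlIn(c1, Seq(Some(1), Some(2))), sqlIn(c1, Seq(c2)))

  // The two rows of t1 from the example: (1, 3) and (3, 3).
  val rows: Seq[(Value, Value)] = Seq((Some(1), Some(3)), (Some(3), Some(3)))
}
```

Filtering `rows` with `original` or `withOr` keeps both rows, while `withAnd` keeps neither, so only the OR form matches the original query on this data.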

SparkQA · 2020-07-07T03:17:08Z

Test build #125158 has finished for PR 29013 at commit f45d6c3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-07T04:59:27Z

Test build #125148 has finished for PR 29013 at commit 0d982d2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-08T07:05:05Z

Test build #125299 has finished for PR 29013 at commit 9bf23cc.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-07-08T08:09:39Z

Why is where c1 in (1, 2) or c1 in (c2) an optimized form? Do you mean it is faster than where c1 in (1, 2, c2)? At the least, I think you need to explain more in the PR description.

ulysses-you · 2020-07-08T09:14:33Z

@maropu Once we have where c1 in (1, 2) or c1 in (c2), we can optimize the c1 in (1, 2) part with OptimizeIn.
e.g.

c1 in(1, c2) => c1 = 1 or c1 in(c2)

c1 in(1, 2, ..., c2) => c1 inset(1, 2, ...) or c1 in (c2)

### What changes were proposed in this pull request?

This PR aims to disable SBT `unidoc` generation testing in Jenkins environment because it's flaky in Jenkins environment and not used for the official documentation generation. Also, GitHub Action has the correct test coverage for the official documentation generation.

- #28848 (comment) (amp-jenkins-worker-06)
- #28926 (comment) (amp-jenkins-worker-06)
- #28969 (comment) (amp-jenkins-worker-06)
- #28975 (comment) (amp-jenkins-worker-05)
- #28986 (comment) (amp-jenkins-worker-05)
- #28992 (comment) (amp-jenkins-worker-06)
- #28993 (comment) (amp-jenkins-worker-05)
- #28999 (comment) (amp-jenkins-worker-04)
- #29010 (comment) (amp-jenkins-worker-03)
- #29013 (comment) (amp-jenkins-worker-04)
- #29016 (comment) (amp-jenkins-worker-05)
- #29025 (comment) (amp-jenkins-worker-04)
- #29042 (comment) (amp-jenkins-worker-03)

### Why are the changes needed?

Apache Spark `release-build.sh` generates the official document by using the following command.
- https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L341
```bash
PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" jekyll build
```

And, this is executed by the following `unidoc` command for Scala/Java API doc.
- https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30
```ruby
system("build/sbt -Pkinesis-asl clean compile unidoc") || raise("Unidoc generation failed")
```

However, the PR builder disabled `Jekyll build` and instead has a different test coverage.
```python
# determine if docs were changed and if we're inside the amplab environment
# note - the below commented out until *all* Jenkins workers can get `jekyll` installed
# if "DOCS" in changed_modules and test_env == "amplab_jenkins":
#     build_spark_documentation()
```

```
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pspark-ganglia-lgpl -Pkubernetes -Pmesos -Phadoop-cloud -Phive -Phive-thriftserver -Pkinesis-asl -Pyarn unidoc
```

### Does this PR introduce _any_ user-facing change?

No. (This is used only for testing and not used in the official doc generation.)

### How was this patch tested?

Pass the Jenkins without doc generation invocation.

Closes #29017 from dongjoon-hyun/SPARK-DOC-GEN.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

SparkQA · 2020-07-12T06:02:56Z

Test build #125700 has finished for PR 29013 at commit 21c5262.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

github-actions · 2020-10-21T00:55:06Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

ulysses-you added 2 commits July 6, 2020 22:30

init

6594601

up ut name

bbf63c9

probot-autolabeler bot added the SQL label Jul 6, 2020

fix

0d982d2

fix or

f45d6c3

dongjoon-hyun mentioned this pull request Jul 7, 2020

[SPARK-32233][TESTS] Disable SBT unidoc generation testing in Jenkins #29017

Closed

fix comment

9bf23cc

add conf

21c5262

github-actions bot added the Stale label Oct 21, 2020

github-actions bot closed this Oct 22, 2020
