[SPARK-28227][SQL] Support projection, aggregate/window functions, and lateral view in the TRANSFORM clause #29087

AngersZhuuuu · 2020-07-13T13:19:43Z

What changes were proposed in this pull request?

Spark SQL currently does not support script transform SQL with aggregationClause/windowClause/LateralView.
As a result, such Hive SQL can't be migrated directly to Spark SQL.

In this PR, we treat the query part of a script transform statement (everything except the TRANSFORM clause itself) as a separate query block, use it as ScriptTransformation's child, and pass an UnresolvedStar as ScriptTransformation's input. Then, at the analyzer level, the child's output is passed as ScriptTransformation's input. With this, any normal SELECT query can be combined with script transformation.

Such as transform with aggregation:

SELECT TRANSFORM ( d2, max(d1) as max_d1, sum(d3))
USING 'cat' AS (a,b,c)
FROM script_trans
WHERE d1 <= 100
GROUP BY d2
HAVING max_d1 > 0

When we build AST, we treat it as

SELECT TRANSFORM (*)
USING 'cat' AS (a,b,c)
FROM (
    SELECT d2, max(d1) as max_d1, sum(d3)
    FROM script_trans
    WHERE d1 <= 100
    GROUP BY d2
    HAVING max_d1 > 0
) tmp

Then, in the Analyzer's ResolveReferences rule, the * (UnresolvedStar) is resolved, so the SQL behaves like

SELECT TRANSFORM ( d2, max(d1) as max_d1, sum(d3))
USING 'cat' AS (a,b,c)
FROM script_trans
WHERE d1 <= 100
GROUP BY d2
HAVING max_d1 > 0
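
The two steps above (the parser wraps the query part as a child and passes `*` as input, then the analyzer expands `*` to the child's output) can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's Catalyst code: the names `Star`, `QueryBlock`, `ScriptTransform`, `build_ast`, and `resolve_star` are hypothetical and exist only for this example.

```python
import re
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Star:
    """Toy placeholder for `*`; not Spark's UnresolvedStar class."""


@dataclass(frozen=True)
class QueryBlock:
    """The query part of the statement, kept as plain strings."""
    select_list: Tuple[str, ...]
    tail: str  # FROM / WHERE / GROUP BY / HAVING text, left untouched

    @property
    def output(self) -> Tuple[str, ...]:
        # An aliased expression is exposed under its alias, otherwise as written.
        # Naive split on " as "; it would mis-handle e.g. cast(x as int).
        return tuple(
            re.split(r"\s+as\s+", expr, flags=re.IGNORECASE)[-1].strip()
            for expr in self.select_list
        )


@dataclass(frozen=True)
class ScriptTransform:
    script: str
    inputs: tuple  # (Star(),) after parsing, column names after analysis
    outputs: Tuple[str, ...]
    child: QueryBlock


def build_ast(select_list, script, outputs, tail):
    """Parser step: the TRANSFORM inputs become the child's select list."""
    child = QueryBlock(tuple(select_list), tail)
    return ScriptTransform(script, (Star(),), tuple(outputs), child)


def resolve_star(node):
    """Analyzer step: expand `*` to the child's output columns."""
    if node.inputs == (Star(),):
        return replace(node, inputs=node.child.output)
    return node


parsed = build_ast(
    ["d2", "max(d1) as max_d1", "sum(d3)"],
    "cat",
    ["a", "b", "c"],
    "FROM script_trans WHERE d1 <= 100 GROUP BY d2 HAVING max_d1 > 0",
)
resolved = resolve_star(parsed)
print(parsed.inputs)
print(resolved.inputs)
```

After `build_ast`, the transform node only sees `*`; after `resolve_star`, its inputs are the three output names of the child query block, while the child itself (aggregation, filter, HAVING) is left unchanged.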

As for UTs, this PR adds many different SQL queries to check that all such kinds of SQL are supported and that each kind of expression works well, such as alias, CASE WHEN, binary computation, etc.

Why are the changes needed?

Support transform with aggregateClause/windowClause/LateralView etc., making SQL migration smoother.

Does this PR introduce any user-facing change?

Users can write transform with aggregateClause/windowClause/LateralView.

How was this patch tested?

Added UT

AngersZhuuuu · 2020-07-13T13:20:37Z

cc @alfozan @HyukjinKwon @wangyum @cloud-fan

SparkQA · 2020-07-13T15:03:52Z

Test build #125770 has finished for PR 29087 at commit d51c0dc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

maropu · 2020-07-13T23:15:07Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

@@ -2558,6 +2558,131 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
      }
    }
  }
+
+  test("SPARK-28227: test script transform with aggregation") {


Could you move the tests into SQLQueryTestSuite?

This should wait for #29085, since currently we can't use script transform in sql/core

@maropu
The UTs have been moved to transform.sql.

SparkQA · 2020-07-14T07:05:01Z

Test build #125797 has finished for PR 29087 at commit 5d85160.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2020-07-15T02:22:39Z

retest this please

SparkQA · 2020-07-15T07:05:02Z

Test build #125872 has finished for PR 29087 at commit 5d85160.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-08-19T11:52:51Z

retest this please

SparkQA · 2020-08-19T14:14:25Z

Test build #127650 has finished for PR 29087 at commit 5d85160.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-09-04T04:26:37Z

retest this please

AngersZhuuuu · 2020-09-04T04:44:41Z

@HyukjinKwon as mentioned in #29087 (comment), can you also help review that PR?

SparkQA · 2020-09-04T07:00:11Z

Test build #128276 has finished for PR 29087 at commit 5d85160.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-04T13:50:30Z

Test build #128287 has finished for PR 29087 at commit 2b8912e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-04T14:17:29Z

Test build #128289 has finished for PR 29087 at commit d89afa9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-23T07:05:02Z

Test build #130190 has finished for PR 29087 at commit d89afa9.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-22T12:06:58Z

Test build #133214 has finished for PR 29087 at commit f2a640b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-22T12:18:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37812/

SparkQA · 2020-12-22T12:23:16Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37812/

SparkQA · 2020-12-22T15:20:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37824/

SparkQA · 2020-12-22T15:52:27Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37824/

SparkQA · 2021-03-25T13:11:23Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41098/

SparkQA · 2021-03-25T13:52:00Z

Test build #136513 has finished for PR 29087 at commit 1278705.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu

Looks fine. I'll wait another 1-2 weeks for comments from other reviewers.

maropu · 2021-03-30T14:41:32Z

retest this please

SparkQA · 2021-03-30T15:30:40Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41308/

SparkQA · 2021-03-30T15:30:41Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41308/

SparkQA · 2021-03-30T18:24:45Z

Test build #136727 has finished for PR 29087 at commit 1278705.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-03-30T23:06:29Z

retest this please

SparkQA · 2021-03-31T00:27:56Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41321/

SparkQA · 2021-03-31T00:56:52Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41321/

SparkQA · 2021-03-31T03:46:56Z

Test build #136739 has finished for PR 29087 at commit 1278705.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-04-07T07:25:28Z

Took a quick look. Looks ok to me too.

maropu · 2021-04-08T02:30:47Z

retest this please

SparkQA · 2021-04-08T04:23:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41629/

SparkQA · 2021-04-08T04:23:03Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41629/

SparkQA · 2021-04-08T08:07:26Z

Test build #137051 has finished for PR 29087 at commit 1278705.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2021-04-12T09:40:40Z

retest this please

SparkQA · 2021-04-12T10:33:09Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41788/

SparkQA · 2021-04-12T10:33:10Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41788/

SparkQA · 2021-04-12T14:31:49Z

Test build #137209 has finished for PR 29087 at commit 1278705.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-04-13T02:35:19Z

Thanks! Merged to master.

cloud-fan · 2021-04-13T09:03:02Z

sql/core/src/test/resources/sql-tests/inputs/transform.sql

+SET spark.sql.parser.quotedRegexColumnNames=true;
+
+SELECT TRANSFORM(`(a|b)?+.+`)
+ USING 'cat' AS (c)


nit: 2 spaces indentation

cloud-fan · 2021-04-13T09:03:24Z

sql/core/src/test/resources/sql-tests/inputs/transform.sql

+FROM script_trans;
+
+SET spark.sql.parser.quotedRegexColumnNames=false;
+


can we test something like TRANSFORM(distinct a, b) and check the error message?

Will raise a follow up soon

cloud-fan · 2021-04-13T09:04:03Z

late LGTM

…e the error clear

### What changes were proposed in this pull request?
According to #29087 (comment), add UT in `transform.sql`.
It seems that distinct is not recognized as a reserved word here:
```
-- !query
explain extended SELECT TRANSFORM(distinct b, a, c) USING 'cat' AS (a, b, c) FROM script_trans WHERE a <= 4
-- !query schema
struct<plan:string>
-- !query output
== Parsed Logical Plan ==
'ScriptTransformation [*], cat, [a#x, b#x, c#x], ScriptInputOutputSchema(List(),List(),None,None,List(),List(),None,None,false)
+- 'Project ['distinct AS b#x, 'a, 'c]
   +- 'Filter ('a <= 4)
      +- 'UnresolvedRelation [script_trans], [], false

== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: cannot resolve 'distinct' given input columns: [script_trans.a, script_trans.b, script_trans.c]; line 1 pos 34;
'ScriptTransformation [*], cat, [a#x, b#x, c#x], ScriptInputOutputSchema(List(),List(),None,None,List(),List(),None,None,false)
+- 'Project ['distinct AS b#x, a#x, c#x]
   +- Filter (a#x <= 4)
      +- SubqueryAlias script_trans
         +- View (`script_trans`, [a#x,b#x,c#x])
            +- Project [cast(a#x as int) AS a#x, cast(b#x as int) AS b#x, cast(c#x as int) AS c#x]
               +- Project [a#x, b#x, c#x]
                  +- SubqueryAlias script_trans
                     +- LocalRelation [a#x, b#x, c#x]
```

Hive's error
![image](https://user-images.githubusercontent.com/46485123/114533170-355d8380-9c80-11eb-992f-982f0b296759.png)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32149 from AngersZhuuuu/SPARK-28227-new-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

maropu · 2021-04-19T01:18:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -1558,14 +1558,9 @@ class Analyzer(override val catalogManager: CatalogManager)
        } else {
          a.copy(aggregateExpressions = buildExpandedProjectList(a.aggregateExpressions, a.child))
        }
-      // If the script transformation input contains Stars, expand it.
+      // TODO: Remove this logic and see SPARK-34035


Please do not forget to fix this, @AngersZhuuuu

[SPARK-28227][SQL] Support TRANSFORM with aggregation

d51c0dc

probot-autolabeler bot added the SQL label Jul 13, 2020

maropu reviewed Jul 13, 2020

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

maropu reviewed Jul 13, 2020

Fix UT

5d85160

AngersZhuuuu added 3 commits September 4, 2020 15:50

Merge branch 'master' into SPARK-28227-NEW

dbb4d04

Update SparkSqlParserSuite.scala

2b8912e

Update SparkSqlParserSuite.scala

d89afa9

AngersZhuuuu added 2 commits December 22, 2020 19:29

Merge branch 'master' into SPARK-28227-NEW

b1cc739

solve import

f2a640b

AngersZhuuuu added 2 commits December 22, 2020 21:51

follow comment

b04909c

Update SQLQuerySuite.scala

671711b

AngersZhuuuu mentioned this pull request Mar 30, 2021

[SPARK-33976][SQL][DOCS] Add a SQL doc page for a TRANSFORM clause #31010

Closed

maropu approved these changes Mar 30, 2021

maropu closed this in 278203d Apr 13, 2021

cloud-fan reviewed Apr 13, 2021

AngersZhuuuu mentioned this pull request Apr 13, 2021

[SPARK-35069][SQL] TRANSFORM forbidden DISTICNT and ALL, also make the error clear #32149

Closed

maropu reviewed Apr 19, 2021
