[SPARK-47430][SQL] Rework group by map type to fix bind reference exception #47545

ulysses-you · 2024-07-31T03:32:04Z

What changes were proposed in this pull request?

This PR reworks group-by on map type to fix two issues:

- A "cannot bind reference" exception at runtime: the attribute was wrapped by MapSort, but we did not transform the plan with the new output.
- The rule that adds MapSort should be placed before PullOutGroupingExpressions to avoid complex expressions in grouping keys.
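
The first issue can be illustrated with a toy model. This is only a sketch: `Attr` and `BindSketch` are hypothetical stand-ins, not Catalyst classes, and the lookup is a simplification of what reference binding does. The idea is that binding resolves each referenced attribute by id against the child's output, so if a rule replaces the child's output with a new attribute (the MapSort alias) while the parent still references the old one, the lookup fails.

```scala
// Toy model only: these types are hypothetical stand-ins, not Catalyst classes.
final case class Attr(name: String, id: Int)

object BindSketch {
  // Resolve an attribute to its ordinal in the child's output, matching by id.
  def bind(ref: Attr, childOutput: Seq[Attr]): Either[String, Int] = {
    val ordinal = childOutput.indexWhere(_.id == ref.id)
    if (ordinal >= 0) Right(ordinal)
    else Left(
      s"Couldn't find ${ref.name}#${ref.id} in " +
        childOutput.map(a => s"${a.name}#${a.id}").mkString("[", ", ", "]"))
  }

  def main(args: Array[String]): Unit = {
    val grouping = Attr("_groupingexpression", 18)
    val sorted = Attr("mapsort(_groupingexpression#18)", 19)
    // The parent still references #18, but the child now only outputs #19.
    println(bind(grouping, Seq(sorted)))
    // Once the parent is rewritten to reference the new output, binding works.
    println(bind(sorted, Seq(sorted)))
  }
}
```

In this model the fix corresponds to rewriting every reference to the old attribute whenever the output changes, which is what "transform the plan with the new output" refers to.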

Why are the changes needed?

To fix issues.

for example:

select map(1, id) from range(10) group by map(1, id);


[INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)

Does this PR introduce any user-facing change?

no, not released

How was this patch tested?

improve the tests to add more cases

Was this patch authored or co-authored using generative AI tooling?

no

ulysses-you · 2024-07-31T03:34:09Z

cc @cloud-fan @stevomitric thank you

stevomitric · 2024-07-31T09:08:33Z

cc @nebojsa-db

HyukjinKwon · 2024-08-02T02:52:45Z

> To fix issues.

To fix which issue?

ulysses-you · 2024-08-02T05:26:12Z

@HyukjinKwon to fix the issue mentioned in the PR description.

for example:

select map(1, id) from range(10) group by map(1, id);


[INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)

I added this case to the description to make it clear.

yaooqinn · 2024-08-02T06:29:32Z

Could you please ensure that the PR title does not sound like it's for refactoring if it's a bugfix?

cloud-fan · 2024-08-05T12:57:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/AddMapSortInAggregate.scala

+ * SELECT map_expr as c, COUNT(*) FROM TABLE GROUP BY map_expr =>
+ * SELECT map_sort(map_expr) as c, COUNT(*) FROM TABLE GROUP BY map_sort(map_expr)
+ */
+object AddMapSortInAggregate extends Rule[LogicalPlan] {


is it a simple rename? not sure why git diff doesn't detect it
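
For context on the rewrite in the rule's doc comment, here is a minimal sketch of why grouping keys are wrapped in a sort. It assumes a map value is stored as an entry list in insertion order; `Entries` and `mapSort` are hypothetical names for illustration, not Spark APIs. Two maps with the same entries in a different physical order only land in the same group once the entries are normalized.

```scala
object MapSortSketch {
  // Hypothetical stand-in for a map value stored as an ordered entry list.
  type Entries = Vector[(Int, String)]

  // Normalize by sorting entries on the key, analogous in spirit to map_sort.
  def mapSort(m: Entries): Entries = m.sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val rows: Seq[Entries] = Seq(
      Vector(1 -> "a", 2 -> "b"),
      Vector(2 -> "b", 1 -> "a"), // same entries, different physical order
      Vector(3 -> "c"))

    // Grouping on the raw entry lists treats the first two rows as distinct keys.
    val naive = rows.groupBy(identity).map { case (k, v) => k -> v.size }
    // Grouping on the normalized entry lists merges them into one key.
    val normalized = rows.groupBy(mapSort).map { case (k, v) => k -> v.size }

    println(s"groups without sorting: ${naive.size}")
    println(s"groups with sorting: ${normalized.size}")
  }
}
```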

cloud-fan · 2024-08-08T13:12:47Z

.../main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortInGroupingExpressions.scala

   */
-  private def insertMapSortRecursively(e: Expression): Expression = {
+  private def replaceWithMapSortRecursively(


why rename? It's indeed inserting MapSort

cloud-fan · 2024-08-09T14:18:36Z

.../main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortInGroupingExpressions.scala

+            exprToMapSort.getOrElseUpdate(
+                expr.canonicalized, Alias(inserted, "_groupingmapsort")())
+              .toAttribute


Suggested change

-            exprToMapSort.getOrElseUpdate(
-                expr.canonicalized, Alias(inserted, "_groupingmapsort")())
-              .toAttribute
+            exprToMapSort.getOrElseUpdate(
+              expr.canonicalized,
+              Alias(inserted, "_groupingmapsort")()
+            ).toAttribute
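
The pattern in the snippet above, caching one alias per canonicalized expression, can be sketched without Spark as follows. This is an assumption-laden illustration: `Expr`, `Alias`, `AliasCache`, and the string-based `canonical` method are hypothetical stand-ins for Catalyst's expression, alias, and `canonicalized`. Only `getOrElseUpdate` is the real Scala collection method, whose second argument is evaluated only when the key is absent.

```scala
import scala.collection.mutable

// Hypothetical stand-ins; `canonical` plays the role of expr.canonicalized.
final case class Expr(sql: String) {
  def canonical: String = sql.toLowerCase.replaceAll("\\s+", "")
}
final case class Alias(child: String, name: String, id: Int)

final class AliasCache {
  private val exprToMapSort = mutable.HashMap.empty[String, Alias]
  private var nextId = 0

  // Returns the cached alias for semantically equal expressions, creating
  // (and assigning a fresh id to) a new alias only on the first lookup.
  def aliasFor(e: Expr): Alias =
    exprToMapSort.getOrElseUpdate(
      e.canonical, {
        nextId += 1
        Alias(s"map_sort(${e.sql})", "_groupingmapsort", nextId)
      }
    )
}

object AliasCacheSketch {
  def main(args: Array[String]): Unit = {
    val cache = new AliasCache
    val first = cache.aliasFor(Expr("map(1, id)"))
    val second = cache.aliasFor(Expr("MAP(1,id)")) // same canonical form
    val other = cache.aliasFor(Expr("map(2, id)"))
    println(first == second) // the same alias is reused
    println(first.id != other.id) // a different expression gets a new alias
  }
}
```

Reusing one alias per canonical form keeps every occurrence of the same grouping expression pointing at a single output attribute.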

cloud-fan · 2024-08-09T14:20:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -297,6 +295,7 @@ abstract class Optimizer(catalogManager: CatalogManager)
      ReplaceExpressions,
      RewriteNonCorrelatedExists,
      PullOutGroupingExpressions,
+      InsertMapSortInGroupingExpressions,


let's add some comments to explain the rule order reasoning.

ulysses-you · 2024-08-12T02:35:47Z

thank you all, merged to master

Closes apache#47545 from ulysses-you/maptype. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com>

github-actions bot added the SQL label Jul 31, 2024

cloud-fan reviewed Aug 2, 2024

ulysses-you force-pushed the maptype branch from 6d2ce79 to 57ab912 Compare August 2, 2024 07:25

ulysses-you changed the title ~~[SPARK-47430][SQL] Rework group by map type~~ [SPARK-47430][SQL] Rework group by map type to fix bind reference exception Aug 2, 2024

ulysses-you force-pushed the maptype branch from 57ab912 to 2935bdb Compare August 2, 2024 07:55

cloud-fan reviewed Aug 5, 2024

ulysses-you force-pushed the maptype branch 2 times, most recently from a575672 to a443354 Compare August 8, 2024 08:52

cloud-fan reviewed Aug 8, 2024

cloud-fan reviewed Aug 8, 2024

ulysses-you force-pushed the maptype branch from a443354 to 4f29759 Compare August 8, 2024 11:51

cloud-fan reviewed Aug 8, 2024

cloud-fan reviewed Aug 8, 2024

ulysses-you force-pushed the maptype branch from 4f29759 to 7ea8593 Compare August 9, 2024 06:47

Rework group by map type to fix bind reference exception

94ace2d

ulysses-you force-pushed the maptype branch from 7ea8593 to 94ace2d Compare August 9, 2024 06:48

cloud-fan reviewed Aug 9, 2024

cloud-fan approved these changes Aug 9, 2024

address comments

a1cb22c

ulysses-you closed this in b093029 Aug 12, 2024

ulysses-you deleted the maptype branch August 12, 2024 02:35
