Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47430][SQL] Rework group by map type to fix bind reference exception #47545

Closed
wants to merge 2 commits into from

Conversation

ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented Jul 31, 2024

What changes were proposed in this pull request?

This pr reworks the group by map type to fix issues:

  • Can not bind reference excpetion at runtume since the attribute was wrapped by MapSort and we didi not transform the plan with new output
  • The add MapSort rule should be put before PullOutGroupingExpressions to avoid complex expr existing in grouping keys

Why are the changes needed?

To fix issues.

for example:

select map(1, id) from range(10) group by map(1, id);


[INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)

Does this PR introduce any user-facing change?

no, not released

How was this patch tested?

improve the tests to add more cases

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Jul 31, 2024
@ulysses-you
Copy link
Contributor Author

cc @cloud-fan @stevomitric thank you

@stevomitric
Copy link
Contributor

cc @nebojsa-db

@HyukjinKwon
Copy link
Member

To fix issues.

To fix which issue?

@ulysses-you
Copy link
Contributor Author

@HyukjinKwon to fix the issue memtioned in pr description..

for example:

select map(1, id) from range(10) group by map(1, id);


[INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)

I add this case in description to make it clear.

@yaooqinn
Copy link
Member

yaooqinn commented Aug 2, 2024

Could you please ensure that the PR title does not sound like it's for refactoring if it's a bugfix?

@ulysses-you ulysses-you changed the title [SPARK-47430][SQL] Rework group by map type [SPARK-47430][SQL] Rework group by map type to fix bind reference exception Aug 2, 2024
* SELECT map_expr as c, COUNT(*) FROM TABLE GROUP BY map_expr =>
* SELECT map_sort(map_expr) as c, COUNT(*) FROM TABLE GROUP BY map_sort(map_expr)
*/
object AddMapSortInAggregate extends Rule[LogicalPlan] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a simple rename? not sure why git diff doesn't detect it

@ulysses-you ulysses-you force-pushed the maptype branch 2 times, most recently from a575672 to a443354 Compare August 8, 2024 08:52
*/
private def insertMapSortRecursively(e: Expression): Expression = {
private def replaceWithMapSortRecursively(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why rename? It's indeed inserting MapSort

Comment on lines 62 to 64
exprToMapSort.getOrElseUpdate(
expr.canonicalized, Alias(inserted, "_groupingmapsort")())
.toAttribute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
exprToMapSort.getOrElseUpdate(
expr.canonicalized, Alias(inserted, "_groupingmapsort")())
.toAttribute
exprToMapSort.getOrElseUpdate(
expr.canonicalized,
Alias(inserted, "_groupingmapsort")()
).toAttribute

@@ -297,6 +295,7 @@ abstract class Optimizer(catalogManager: CatalogManager)
ReplaceExpressions,
RewriteNonCorrelatedExists,
PullOutGroupingExpressions,
InsertMapSortInGroupingExpressions,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add some comments to explain the rule order reasoning.

@ulysses-you
Copy link
Contributor Author

thank you all, merged to master

@ulysses-you ulysses-you deleted the maptype branch August 12, 2024 02:35
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
…eption

### What changes were proposed in this pull request?

This pr reworks the group by map type to fix issues:
- Can not bind reference excpetion at runtume since the attribute was wrapped by `MapSort` and we didi not transform the plan with new output
- The add `MapSort` rule should be put before `PullOutGroupingExpressions` to avoid complex expr existing in grouping keys

### Why are the changes needed?

To fix issues.

for example:
```
select map(1, id) from range(10) group by map(1, id);

[INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)apache#19] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)apache#19] SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)
```

### Does this PR introduce _any_ user-facing change?

no, not released

### How was this patch tested?

improve the tests to add more cases

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#47545 from ulysses-you/maptype.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…eption

### What changes were proposed in this pull request?

This pr reworks the group by map type to fix issues:
- Can not bind reference excpetion at runtume since the attribute was wrapped by `MapSort` and we didi not transform the plan with new output
- The add `MapSort` rule should be put before `PullOutGroupingExpressions` to avoid complex expr existing in grouping keys

### Why are the changes needed?

To fix issues.

for example:
```
select map(1, id) from range(10) group by map(1, id);

[INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)apache#19] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)apache#19] SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)
```

### Does this PR introduce _any_ user-facing change?

no, not released

### How was this patch tested?

improve the tests to add more cases

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#47545 from ulysses-you/maptype.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…eption

### What changes were proposed in this pull request?

This pr reworks the group by map type to fix issues:
- Can not bind reference excpetion at runtume since the attribute was wrapped by `MapSort` and we didi not transform the plan with new output
- The add `MapSort` rule should be put before `PullOutGroupingExpressions` to avoid complex expr existing in grouping keys

### Why are the changes needed?

To fix issues.

for example:
```
select map(1, id) from range(10) group by map(1, id);

[INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)apache#19] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)apache#19] SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)
```

### Does this PR introduce _any_ user-facing change?

no, not released

### How was this patch tested?

improve the tests to add more cases

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#47545 from ulysses-you/maptype.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants