[SPARK-49977][SQL] Use stack-based iterative computation to avoid creating many Scala List objects for deep expression trees

### What changes were proposed in this pull request?

In some use cases with deep expression trees, the driver's heap contains many `scala.collection.immutable.$colon$colon` objects. These objects are allocated by deep recursion in the `gatherCommutative` method, which calls `flatMap` recursively. Each invocation of `flatMap` creates a new temporary Scala collection. This claim is based on the following stack trace (>1K lines) of a driver thread, truncated here for brevity:

```
"HiveServer2-Background-Pool: Thread-9867" #9867 daemon prio=5 os_prio=0 tid=0x00007f35080bf000 nid=0x33e7 runnable [0x00007f3393372000]
   java.lang.Thread.State: RUNNABLE
	at scala.collection.immutable.List$Appender$1.apply(List.scala:350)
	at scala.collection.immutable.List$Appender$1.apply(List.scala:341)
	at scala.collection.immutable.List.flatMap(List.scala:431)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.gatherCommutative(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.$anonfun$gatherCommutative$1(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression$$Lambda$5280/143713747.apply(Unknown Source)
	at scala.collection.immutable.List.flatMap(List.scala:366)
	....
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.gatherCommutative(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.$anonfun$gatherCommutative$1(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression$$Lambda$5280/143713747.apply(Unknown Source)
	at scala.collection.immutable.List.flatMap(List.scala:366)
	....
```

This PR fixes the issue by using a stack-based iterative computation, which avoids creating these temporary Scala objects.

### Why are the changes needed?

To reduce heap usage on the driver.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. The change is a refactor.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #48481 from utkarsh39/SPARK-49977.

Lead-authored-by: Utkarsh <utkarsh.agarwal@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
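
The difference between the recursive `flatMap` traversal and a stack-based iterative one can be sketched as follows. This is only an illustration, not the code in this commit: `Expr`, `Leaf`, `Add`, `GatherSketch`, `gatherRecursive`, and `gatherIterative` are hypothetical names standing in for Catalyst's expression classes and `gatherCommutative`, and the sketch assumes Scala 2.13 collections.

```scala
import scala.collection.mutable

// Hypothetical, simplified expression tree; these are NOT Spark's Catalyst classes.
sealed trait Expr { def children: Seq[Expr] }
final case class Leaf(name: String) extends Expr {
  val children: Seq[Expr] = Nil
}
final case class Add(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
}

object GatherSketch {
  // Recursive form: every nesting level calls flatMap, which builds an
  // intermediate collection and consumes one more JVM stack frame.
  def gatherRecursive(e: Expr): Seq[Expr] = e match {
    case a: Add => a.children.flatMap(gatherRecursive)
    case other  => Seq(other)
  }

  // Iterative form: an explicit work stack replaces the call stack, and all
  // operands are appended to a single result buffer.
  def gatherIterative(root: Expr): Seq[Expr] = {
    val result = mutable.ArrayBuffer.empty[Expr]
    val stack = mutable.Stack[Expr](root)
    while (stack.nonEmpty) {
      stack.pop() match {
        case a: Add =>
          // Push children in reverse so the leftmost child is popped first,
          // which keeps the same left-to-right order as the recursive form.
          a.children.reverseIterator.foreach(c => stack.push(c))
        case other =>
          result += other
      }
    }
    result.toSeq
  }
}
```

On small trees both functions return the same operands in the same order. On very deep trees the iterative form needs no additional call-stack depth, whereas the recursive form's depth grows with the tree and may exhaust the thread stack, depending on JVM settings.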