[SPARK-49977][SQL] Use stack-based iterative computation to avoid creating many Scala List objects for deep expression trees #48481

utkarsh39 · 2024-10-15T16:00:37Z

What changes were proposed in this pull request?

In some use cases with deep expression trees, the driver's heap contains many scala.collection.immutable.$colon$colon objects. These objects are allocated by deep recursion in the gatherCommutative method, which calls flatMap recursively. Each flatMap invocation creates a new temporary Scala collection. This conclusion is based on the following stack trace (more than 1K lines) from a driver thread, truncated here for brevity:

"HiveServer2-Background-Pool: Thread-9867" #9867 daemon prio=5 os_prio=0 tid=0x00007f35080bf000 nid=0x33e7 runnable [0x00007f3393372000]
   java.lang.Thread.State: RUNNABLE
	at scala.collection.immutable.List$Appender$1.apply(List.scala:350)
	at scala.collection.immutable.List$Appender$1.apply(List.scala:341)
	at scala.collection.immutable.List.flatMap(List.scala:431)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.gatherCommutative(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.$anonfun$gatherCommutative$1(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression$$Lambda$5280/143713747.apply(Unknown Source)
	at scala.collection.immutable.List.flatMap(List.scala:366)
....
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.gatherCommutative(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.$anonfun$gatherCommutative$1(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression$$Lambda$5280/143713747.apply(Unknown Source)
	at scala.collection.immutable.List.flatMap(List.scala:366)
....

This PR fixes the issue by replacing the recursion with a stack-based iterative computation, which avoids creating a temporary Scala List at each level of the expression tree.
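As a rough illustration of the recursive-to-iterative change described above, here is a minimal sketch on a toy expression tree. It is not the code from Expression.scala: the Expr, Leaf, Add, and Mul types and the GatherSketch object are hypothetical stand-ins, and the real gatherCommutative has a different signature. The sketch only shows how an explicit stack plus one result buffer can replace a flatMap call per tree level.

```scala
import scala.collection.mutable

// Hypothetical toy expression tree; these are NOT Spark's Expression classes.
sealed trait Expr
final case class Leaf(name: String) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr
final case class Mul(left: Expr, right: Expr) extends Expr

object GatherSketch {
  // Recursive form: every nested Add calls flatMap, which builds an
  // intermediate List and adds JVM stack frames per level.
  def gatherRecursive(e: Expr): List[Expr] = e match {
    case Add(l, r) => List(l, r).flatMap(gatherRecursive)
    case other => List(other)
  }

  // Iterative form: an explicit stack replaces the recursion, and all
  // operands are appended to a single buffer.
  def gatherIterative(root: Expr): Seq[Expr] = {
    val result = mutable.ArrayBuffer.empty[Expr]
    val stack = mutable.Stack[Expr](root)
    while (stack.nonEmpty) {
      stack.pop() match {
        case Add(l, r) =>
          // Push the right operand first so the left one is visited first,
          // matching the order produced by the recursive form.
          stack.push(r)
          stack.push(l)
        case other =>
          // Anything that is not an Add (including Mul) is an operand.
          result += other
      }
    }
    result.toSeq
  }
}
```

In this sketch both functions return the operands of nested Add nodes in the same left-to-right order. The iterative one keeps its working state in a heap-allocated stack, so its behavior does not depend on the depth of the tree.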

Why are the changes needed?

To reduce driver heap usage for queries with deep expression trees.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests; this change is a refactor.

Was this patch authored or co-authored using generative AI tooling?

No

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

patsukp-db

LGTM. Left a nit.

cloud-fan · 2024-10-17T04:26:34Z

thanks, merging to master!

fix

be4bdb1

github-actions bot added the SQL label Oct 15, 2024

patsukp-db reviewed Oct 16, 2024

patsukp-db approved these changes Oct 16, 2024

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

fbea11f

Co-authored-by: Pat Sukprasert <pat.sukprasert@databricks.com>

cloud-fan approved these changes Oct 17, 2024

HyukjinKwon changed the title ~~[SPARK-49977] Use stack-based iterative computation to avoid creating many Scala List objects for deep expression trees~~ [SPARK-49977][SQL] Use stack-based iterative computation to avoid creating many Scala List objects for deep expression trees Oct 17, 2024

HyukjinKwon approved these changes Oct 17, 2024

cloud-fan closed this in 175d563 Oct 17, 2024
