Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49977][SQL] Use stack-based iterative computation to avoid creating many Scala List objects for deep expression trees #48481

Closed
wants to merge 2 commits into from

Conversation

utkarsh39
Copy link
Contributor

What changes were proposed in this pull request?

In some use cases with deep expression trees, the driver's heap shows many scala.collection.immutable.$colon$colon objects from the heap. The objects are allocated due to deep recursion in the gatherCommutative method which uses flatmap recursively. Each invocation of flatmap creates a new temporary Scala collection. Our claim is based on the following stack trace (>1K lines) of a thread in the driver below, truncated here for brevity:

"HiveServer2-Background-Pool: Thread-9867" #9867 daemon prio=5 os_prio=0 tid=0x00007f35080bf000 nid=0x33e7 runnable [0x00007f3393372000]
   java.lang.Thread.State: RUNNABLE
   	at scala.collection.immutable.List$Appender$1.apply(List.scala:350)
	at scala.collection.immutable.List$Appender$1.apply(List.scala:341)
	at scala.collection.immutable.List.flatMap(List.scala:431)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.gatherCommutative(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.$anonfun$gatherCommutative$1(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression$$Lambda$5280/143713747.apply(Unknown Source)
	at scala.collection.immutable.List.flatMap(List.scala:366)
....
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.gatherCommutative(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression.$anonfun$gatherCommutative$1(Expression.scala:1479)
	at org.apache.spark.sql.catalyst.expressions.CommutativeExpression$$Lambda$5280/143713747.apply(Unknown Source)
	at scala.collection.immutable.List.flatMap(List.scala:366)
....

This PR fixes the issue by using a stack-based iterative computation, completely avoiding the creation of temporary Scala objects.

Why are the changes needed?

Reduce heap usage of the driver

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests, refactor

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Oct 15, 2024
Copy link

@patsukp-db patsukp-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left a nit.

…essions/Expression.scala

Co-authored-by: Pat Sukprasert <pat.sukprasert@databricks.com>
@HyukjinKwon HyukjinKwon changed the title [SPARK-49977] Use stack-based iterative computation to avoid creating many Scala List objects for deep expression trees [SPARK-49977][SQL] Use stack-based iterative computation to avoid creating many Scala List objects for deep expression trees Oct 17, 2024
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 175d563 Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants