[SPARK-45439][SQL][UI] Reduce memory usage of LiveStageMetrics.accumIdsToMetricType #43250
LGTM
Nice work!
Hmm, it looks like the
The test failures are not related - unfortunately reattempt did not fix them.
What changes were proposed in this pull request?
This PR aims to reduce the memory consumption of `LiveStageMetrics.accumIdsToMetricType`, which should help to reduce driver memory usage when running complex SQL queries that contain many operators and run many jobs.
In SQLAppStatusListener, the LiveStageMetrics.accumIdsToMetricType field holds a map which is used to look up the type of accumulators in order to perform conditional processing of a stage’s metrics.
Currently, that field is derived from `LiveExecutionData.metrics`, which contains metrics for all operators used anywhere in the query. Whenever a job is submitted, we construct a fresh map containing all metrics that have ever been registered for that SQL query. If a query runs a single job, this isn't an issue: in that case, all `LiveStageMetrics` instances will hold the same immutable `accumIdsToMetricType`.
The problem arises if we have a query that runs many jobs (e.g. a complex query with many joins which gets divided into many jobs due to AQE): in that case, each job submission results in a new `accumIdsToMetricType` map being created.
This PR fixes this by changing `accumIdsToMetricType` to be a mutable `mutable.HashMap` which is shared across all `LiveStageMetrics` instances belonging to the same `LiveExecutionData`.
The modified classes are `private` and are used only in SQLAppStatusListener, so I don't think this change poses any realistic risk of binary incompatibility for third-party code.
Why are the changes needed?
Addresses one contributing factor behind high driver memory / OOMs when executing complex queries.
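To make the difference between the two approaches concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `MetricInfo`, `StageMetricsBefore`, `StageMetricsAfter` and `ExecutionData` are hypothetical, simplified stand-ins for the real classes in SQLAppStatusListener, and the sketch assumes the shared map is populated once from the execution's metrics.

```scala
import scala.collection.mutable

object SharedMapSketch {
  // Hypothetical stand-in for one registered SQL metric (accumulator id and metric type).
  final case class MetricInfo(accumulatorId: Long, metricType: String)

  // "Before": each stage holds whatever immutable map was built at job submission.
  final class StageMetricsBefore(val accumIdsToMetricType: Map[Long, String])

  // "After": each stage holds a reference to the execution-wide mutable map.
  final class StageMetricsAfter(val accumIdsToMetricType: mutable.HashMap[Long, String])

  // Hypothetical stand-in for the per-query execution data.
  final class ExecutionData(val metrics: Seq[MetricInfo]) {
    // Assumption: one map per execution, built once and shared by reference.
    val sharedMetricTypes: mutable.HashMap[Long, String] =
      mutable.HashMap(metrics.map(m => m.accumulatorId -> m.metricType): _*)

    // Old behaviour: every job submission copies all metrics into a fresh map.
    def onJobStartBefore(stageIds: Seq[Int]): Seq[StageMetricsBefore] = {
      val fresh = metrics.map(m => m.accumulatorId -> m.metricType).toMap
      stageIds.map(_ => new StageMetricsBefore(fresh))
    }

    // New behaviour: every job submission reuses the same map instance.
    def onJobStartAfter(stageIds: Seq[Int]): Seq[StageMetricsAfter] =
      stageIds.map(_ => new StageMetricsAfter(sharedMetricTypes))
  }
}
```

In this sketch, stages created by the same call to `onJobStartBefore` share one map, but every additional job adds another full copy, so the retained memory grows with (number of jobs) x (number of metrics). With `onJobStartAfter`, it grows only with the number of metrics.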
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit tests.
To demonstrate the memory reduction, I performed manual benchmarking and heap dump inspection using a benchmark that ran copies of a complex query: each test query launches ~200 jobs (so at least 200 stages) and contains ~3800 total operators, resulting in a huge number of metric accumulators. Prior to this PR's fix, ~3700 LiveStageMetrics instances (from multiple concurrent runs of the query) consumed a combined ~3.3 GB of heap. After this PR's fix, I observed negligible memory usage from LiveStageMetrics.
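As a rough back-of-the-envelope check on those figures, the snippet below divides the reported heap usage by the reported instance count. It assumes "GB" means 10^9 bytes and that the heap was spread evenly across instances, neither of which is stated above; the object name is purely illustrative.

```scala
object PerInstanceEstimate {
  // Figures quoted in the benchmark description; GB is assumed to be 10^9 bytes.
  val totalHeapBytes: Double = 3.3e9
  val liveStageMetricsInstances: Int = 3700

  // Average heap attributed to each LiveStageMetrics instance before the fix.
  val bytesPerInstance: Double = totalHeapBytes / liveStageMetricsInstances

  def main(args: Array[String]): Unit =
    println(f"${bytesPerInstance / 1e6}%.2f MB per instance on average")
}
```

Under those assumptions this works out to a little under 0.9 MB per instance, which is consistent with each instance retaining its own map over the accumulators of a query with ~3800 operators.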
Was this patch authored or co-authored using generative AI tooling?
No.