
[SPARK-45439][SQL][UI] Reduce memory usage of LiveStageMetrics.accumIdsToMetricType #43250

Conversation

@JoshRosen (Contributor) commented Oct 6, 2023

What changes were proposed in this pull request?

This PR aims to reduce the memory consumption of LiveStageMetrics.accumIdsToMetricType, which should help to reduce driver memory usage when running complex SQL queries that contain many operators and run many jobs.

In SQLAppStatusListener, the LiveStageMetrics.accumIdsToMetricType field holds a map which is used to look up the type of accumulators in order to perform conditional processing of a stage’s metrics.

Currently, that field is derived from LiveExecutionData.metrics, which contains metrics for all operators used anywhere in the query. Whenever a job is submitted, we construct a fresh map containing all metrics that have ever been registered for that SQL query. If a query runs a single job, this isn't an issue: in that case, all LiveStageMetrics instances will hold the same immutable accumIdsToMetricType.

The problem arises if we have a query that runs many jobs (e.g. a complex query with many joins which gets divided into many jobs due to AQE): in that case, each job submission results in a new accumIdsToMetricType map being created.
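The duplicated-map pattern described above can be illustrated with a minimal sketch. The names here are simplified stand-ins for the real Spark classes (the actual `LiveStageMetrics` and `SQLMetricInfo` carry more fields), but the shape of the problem is the same: each job submission snapshots every metric registered so far into a fresh immutable map.

```scala
// Hypothetical, simplified model of the pre-fix behavior.
case class SQLMetricInfo(name: String, accumulatorId: Long, metricType: String)

class LiveStageMetrics(val accumIdsToMetricType: Map[Long, String])

// Called on every job submission: builds a brand-new map from ALL
// metrics ever registered for the query. With ~3800 operators and
// ~200 jobs per query, these duplicated maps add up to gigabytes.
def onJobStart(allMetrics: Seq[SQLMetricInfo]): LiveStageMetrics =
  new LiveStageMetrics(
    allMetrics.map(m => m.accumulatorId -> m.metricType).toMap)
```

Each call pays O(#metrics) time and heap, and none of the resulting maps share structure, which is why memory usage scales with (jobs × operators) rather than just operators.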

This PR fixes this by changing accumIdsToMetricType to be a mutable.HashMap that is shared across all LiveStageMetrics instances belonging to the same LiveExecutionData.
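A minimal sketch of the shared-map approach, again with simplified stand-in class names rather than the exact Spark internals: the execution owns one `mutable.HashMap`, new metrics are added to it in place, and every stage holds the same map by reference instead of its own copy.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the post-fix behavior.
class LiveStageMetrics(
    val accumIdsToMetricType: mutable.HashMap[Long, String])

class LiveExecutionData {
  // Single map per execution; entries are added in place as metrics
  // register, so memory scales with #operators, not #jobs x #operators.
  val accumIdsToMetricType = new mutable.HashMap[Long, String]()

  def onJobStart(): LiveStageMetrics =
    new LiveStageMetrics(accumIdsToMetricType) // shared by reference
}
```

Because the map is shared rather than snapshotted, a metric registered after a stage's creation is still visible to that stage's lookups, which matches how the listener uses it for conditional metric processing.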

The modified classes are private and are used only in SQLAppStatusListener, so I don't think this change poses any realistic risk of binary incompatibility for third-party code.

Why are the changes needed?

Addresses one contributing factor behind high driver memory / OOMs when executing complex queries.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.

To demonstrate the memory reduction, I performed manual benchmarking and heap dump inspection using a benchmark that ran concurrent copies of a complex query: each test query launches ~200 jobs (so at least 200 stages) and contains ~3800 total operators, resulting in a huge number of metric accumulators. Prior to this PR's fix, ~3700 LiveStageMetrics instances (from multiple concurrent runs of the query) consumed a combined ~3.3 GB of heap. After this PR's fix, I observed negligible memory usage from LiveStageMetrics.

Was this patch authored or co-authored using generative AI tooling?

No.

@JoshRosen JoshRosen changed the title [SPARK-45439][SQL] Reduce memory usage of LiveStageMetrics.accumIdsToMetricType [SPARK-45439][SQL][UI] Reduce memory usage of LiveStageMetrics.accumIdsToMetricType Oct 6, 2023
@jiangxb1987 (Contributor) left a comment:

LGTM

@Ngone51 (Member) left a comment:

Nice work!

@JoshRosen (Contributor, Author):

Hmm, it looks like the OracleIntegrationSuite tests are flaky but I don't think that's related to this PR's changes:

[info] OracleIntegrationSuite:
[info] org.apache.spark.sql.jdbc.OracleIntegrationSuite *** ABORTED *** (7 minutes, 38 seconds)
[info]   The code passed to eventually never returned normally. Attempted 429 times over 7.003095079966667 minutes. Last failure message: ORA-12514: Cannot connect to database. Service freepdb1 is not registered with the listener at host 10.1.0.126 port 45139. (CONNECTION_ID=CC2wkBm6SPGoMF7IghzCeQ==). (DockerJDBCIntegrationSuite.scala:166)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info]   at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
[info]   at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
[info]   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
[info]   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
[info]   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95)

@mridulm (Contributor) commented Oct 14, 2023

The test failures are not related; unfortunately, a reattempt did not fix them.
Merging to master.
Thanks for fixing this @JoshRosen !
Thanks for the reviews @jiangxb1987, @beliefer, @Ngone51 :-)

@mridulm closed this in 2f6cca5 Oct 14, 2023