Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARK-45439][SQL][UI] Reduce memory usage of LiveStageMetrics.accumI…
…dsToMetricType ### What changes were proposed in this pull request? This PR aims to reduce the memory consumption of `LiveStageMetrics.accumIdsToMetricType`, which should help to reduce driver memory usage when running complex SQL queries that contain many operators and run many jobs. In SQLAppStatusListener, the LiveStageMetrics.accumIdsToMetricType field holds a map which is used to look up the type of accumulators in order to perform conditional processing of a stage’s metrics. Currently, that field is derived from `LiveExecutionData.metrics`, which contains metrics for _all_ operators used anywhere in the query. Whenever a job is submitted, we construct a fresh map containing all metrics that have ever been registered for that SQL query. If a query runs a single job, this isn't an issue: in that case, all `LiveStageMetrics` instances will hold the same immutable `accumIdsToMetricType`. The problem arises if we have a query that runs many jobs (e.g. a complex query with many joins which gets divided into many jobs due to AQE): in that case, each job submission results in a new `accumIdsToMetricType` map being created. This PR fixes this by changing `accumIdsToMetricType` to be a mutable `mutable.HashMap` which is shared across all `LivestageMetrics` instances belonging to the same `LiveExecutionData`. The modified classes are `private` and are used only in SQLAppStatusListener, so I don't think this change poses any realistic risk of binary incompatibility risks to third party code. ### Why are the changes needed? Addresses one contributing factor behind high driver memory / OOMs when executing complex queries. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. To demonstrate memory reduction, I performed manual benchmarking and heap dump inspection using benchmark that ran copies of a complex query: each test query launches ~200 jobs (so at least 200 stages) and contains ~3800 total operators, resulting in a huge number metric accumulators. Prior to this PR's fix, ~3700 LiveStageMetrics instances (from multiple concurrent runs of the query) consumed a combined ~3.3 GB of heap. After this PR's fix, I observed negligible memory usage from LiveStageMetrics. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43250 from JoshRosen/reduce-accum-ids-to-metric-type-mem-overhead. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
- Loading branch information