[SPARK-49038][SQL] Fix regression in Spark UI SQL operator metrics calculation to filter out invalid accumulator values correctly #47516
Conversation
This change correctly filters out the invalid accumulator values for SIZE and TIMING metrics before the data is shown on the UI. Given that SIZE and TIMING metrics are assigned an initial value of -1, we need to return the value as-is so that the `SQLMetrics.stringValue` function can filter out the invalid values correctly.
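(For illustration, a minimal sketch, with made-up values and not the actual Spark source, of the `value >= 0` filtering that `SQLMetrics.stringValue` applies, and why the raw `-1` has to survive until that point:)

```scala
// Hypothetical illustration: SIZE/TIMING metrics start at -1, and only values >= 0
// should feed the min/med/max shown in the UI (mirroring the `value >= 0` check
// described for SQLMetrics.stringValue).
val taskValues: Seq[Long] = Seq(-1L, 256L, 512L, 1024L) // -1 marks a metric that was never updated
val valid = taskValues.filter(_ >= 0)                   // drop the invalid initial values
val sorted = valid.sorted
println(s"min=${sorted.head} med=${sorted(sorted.length / 2)} max=${sorted.last}")
// min=256 med=512 max=1024; if the -1 leaked through as 0, min would wrongly read 0
```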
@dongjoon-hyun could you please review - thanks!
// values in `SQLMetrics.stringValue` when calculating min, max, etc.
// However, users accessing the values in the physical plan programmatically still get -1. They
// may use `SQLMetric.isZero` before consuming this value.
override def value: Long = _value
Well, actually, this is a direct revert of #39311.
- In this case, we need a review from the author, because he proposed that change to address what he saw as a regression in accumulator values.
- In addition, could you add a new unit test case for your case? IIUC, this PR only seems to update the existing test cases.
cc @cloud-fan and @viirya and @HyukjinKwon from #39311 , too
Added the test
`-1` is an internal initial value, and we can't expose it to users. People can get the physical plan and directly access the SQLMetrics instances to get the values.

Do we have a test case to demonstrate the issue? AFAIK Spark filters out 0-value accumulators at the executor side.
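(For illustration only: a rough sketch of such programmatic access, walking the public `queryExecution.executedPlan` and each node's `metrics` map; the query itself is a made-up example.)

```scala
import org.apache.spark.sql.SparkSession

// Made-up query; the point is how a caller might walk the executed plan and read metrics.
val spark = SparkSession.builder().master("local[*]").appName("metrics-peek").getOrCreate()
val df = spark.range(0, 1000).groupBy().count()
df.collect() // run the query so the SQL metrics get populated

df.queryExecution.executedPlan.foreach { node =>
  node.metrics.foreach { case (name, metric) =>
    // Per the PR description, an untouched SIZE/TIMING metric may read -1 here,
    // so programmatic consumers should guard with metric.isZero before using the value.
    if (!metric.isZero) println(s"${node.nodeName}.$name = ${metric.value}")
  }
}
```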
So the change made in #39311 basically converts the invalid `-1` values into valid-looking ones. I think there is no existing test for this, which is why it was never caught. Let me see if I can add one to demonstrate the issue.
Just to note,
…iltered out correctly
Added the test @cloud-fan
I'm not convinced by the added test, as it calls
I don't see any problem with the current
You can run the TPC-DS benchmark q1 and check the SQL tab DAG views in the driver-hosted Spark UI. The minimum and median values are not correct for multiple operator metrics, and they don't match the history server values either (as pointed out by @abmodi in the JIRA ticket). For the incorrect metrics, the minimum value is always zero, while the median value is always less than the actual median. The reason is that there are extra zeros among the reported metric values, because the invalid initial values are no longer filtered out. I'm not a Spark expert, but given that the bug traces back to #39311, reverting that behavior seemed like the natural fix.
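(To make that concrete, a tiny made-up illustration of how stray zeros from unfiltered initial values pull the displayed min and median down:)

```scala
// Made-up numbers: three tasks actually reported 100, 200 and 300.
val realTaskValues = Seq(100L, 200L, 300L)
// Two stale accumulators whose -1 initial values were surfaced as 0 sneak in.
val withStrayZeros = Seq(0L, 0L) ++ realTaskValues

def median(xs: Seq[Long]): Long = xs.sorted.apply(xs.length / 2)

println(s"min=${realTaskValues.min} med=${median(realTaskValues)}") // min=100 med=200 (correct)
println(s"min=${withStrayZeros.min} med=${median(withStrayZeros)}") // min=0   med=100 (skewed low)
```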
Agree. If users are consuming values directly from the listener, they may use `SQLMetric.isZero` before consuming the value.
As I explained earlier, this should not happen, as we will filter out -1 values at the executor side. So the 0 values in the UI may be valid values from certain tasks. Do you have a simple repro (end-to-end query) to trigger this bug?
Can you please use the below reproducer? This is a join between two tables that shuffles data. It can be run in spark-shell.
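(The actual reproducer is not preserved above; the snippet below is an illustrative stand-in of the same shape, a spark-shell join between two tables that forces a shuffle. Table sizes and names are assumptions.)

```scala
// Illustrative stand-in, to be run in spark-shell (where `spark` is predefined).
// Disable broadcast joins so the join actually shuffles data on both sides.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val left  = spark.range(0, 1000000).selectExpr("id", "id % 1000 AS key")
val right = spark.range(0, 10000).selectExpr("id AS key", "id * 2 AS payload")
val joined = left.join(right, "key") // sort-merge join with shuffle exchanges

joined.count()
// Now open the SQL tab in the driver-hosted Spark UI and inspect the min/med/max
// values of the exchange/join operator metrics.
```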
Attaching screenshots; the data in the Spark UI is not correct, and it doesn't match between the Spark UI and the history server for Spark 3.5.1.
@cloud-fan waiting for your response to unblock the review - thanks!
@cloud-fan were you able to reproduce the issue? It is a very simple scenario that reproduces the issue for Spark 3.5.1. cc @abmodi and @dongjoon-hyun too.
I confirmed that the bug exists. I was wrong about executor-side accumulator update filtering: we only filter out zero values for task metrics, but not SQL metrics. But I don't think this PR is the right fix, as it makes `SQLMetric.value` expose the internal `-1` initial value to users again.
Thanks for checking and confirming! Actually I was able to track down that the bug started only after #39311, and hence I proposed this fix. But if there is a better way to fix this bug, I am fine with it as long as we are fixing it.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
This patch fixes an issue in the driver-hosted Spark UI SQL tab DAG view where invalid SQL metric values are not filtered out correctly, and hence incorrect minimum and median metric values are shown in the UI. This regression was introduced in #39311.

Why are the changes needed?
`SIZE`, `TIMING` and `NS_TIMING` metrics are created with an initial value of `-1` (since `0` is a valid metric value for them). The `SQLMetrics.stringValue` method filters out the invalid values using the condition `value >= 0` before calculating the `min`, `med` and `max` values. But #39311, which shipped in Spark 3.4.0, introduced a regression where `SQLMetric.value` is always `>= 0`. This means the invalid accumulators with value `-1` can no longer be recognized and filtered out correctly. This needs to be fixed.
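(A simplified, hypothetical sketch of the behavior difference; this is not the real `SQLMetric` class, and the exact mechanics of #39311 may differ in detail:)

```scala
// Hypothetical stand-in for SQLMetric, only to show why clamping the value breaks filtering.
class SimplifiedMetric(initValue: Long = -1L) {
  private var _value: Long = initValue
  def set(v: Long): Unit = { _value = v }

  // Post-regression style: never report a negative value, so an untouched metric reads 0
  // and a UI-side `value >= 0` filter treats it as a real measurement.
  def clampedValue: Long = math.max(_value, 0L)

  // Behavior this PR restores: report the raw value so -1 stays recognizable as invalid.
  def rawValue: Long = _value
}

val untouched = new SimplifiedMetric()
println(untouched.clampedValue) // 0  -> passes the >= 0 filter and skews min/median
println(untouched.rawValue)     // -1 -> filtered out as invalid
```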
Does this PR introduce any user-facing change?

Yes, as end users can access accumulator values directly. Users accessing the values in the physical plan programmatically should use `SQLMetric.isZero` before consuming the value.

How was this patch tested?
Existing tests; also built a new jar for Spark 3.5.1 and confirmed that the previously incorrect data is now shown correctly in the Spark UI.

Old UI view:
old_spark_ui_view_3_5_1.pdf
Fixed UI view:
new_spark_ui_view_3_5_1.pdf
Was this patch authored or co-authored using generative AI tooling?
No