-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay #24936
Conversation
As I stated in SPARK-28074 - #24890, this could be used as an indicator of unexpected discarded rows due to watermark, especially between multiple stateful operators. DiscardLateRowsExec doesn't implement CodegenSupport yet. As FilterExec implements CodegenSupport and DiscardLateRowsExec is simpler version of FilterExec, I guess I could address it. I'll try it out. |
Test build #106770 has finished for PR 24936 at commit
|
Test build #106773 has finished for PR 24936 at commit
|
Test build #106786 has finished for PR 24936 at commit
|
Test build #106789 has finished for PR 24936 at commit
|
Test build #106798 has finished for PR 24936 at commit
|
cc. @jose-torres @arunmahadevan as they've reviewed prior version of the patch. |
I'm now correcting myself that we shouldn't filter out late rows in new physical node, since there're some cases where input rows are later than watermark but they are still counting in aggregation, like non-window streaming aggregation. It should just only count the number of late events, which doesn't always mean they will be discarded, but they could have a chance to be discarded. So it's going to be less intuitive than what I intended for the first time, but it will be still helpful to identify the issue on #24890, as we mostly don't want to let intermediate outputs being late on watermark, having chance to be discarded. |
Test build #106881 has finished for PR 24936 at commit
|
Test build #106889 has finished for PR 24936 at commit
|
retest this please |
The last build failure looks interesting (though it is not relevant to this patch), as it doesn't seem to be different but string representation is different.
|
Test build #106894 has finished for PR 24936 at commit
|
Test build #106910 has finished for PR 24936 at commit
|
Retest this please. |
Test build #107197 has finished for PR 24936 at commit
|
Test build #108952 has finished for PR 24936 at commit
|
54d159f
to
63b2e24
Compare
Test build #109306 has finished for PR 24936 at commit
|
Test build #109307 has finished for PR 24936 at commit
|
retest this, please |
Test build #109310 has finished for PR 24936 at commit
|
Ping. Can we evaluate this and #21617 , and review either preferred one? This is important addition of #24890 (SPARK-28074) |
ok to test |
Test build #110681 has finished for PR 24936 at commit
|
Build failure is below:
Not relevant to this patch. |
retest this, please |
Test build #110720 has finished for PR 24936 at commit
|
Test build #114003 has finished for PR 24936 at commit
|
00d7877
to
698aa51
Compare
Test build #114942 has finished for PR 24936 at commit
|
Test build #114949 has finished for PR 24936 at commit
|
retest this, please |
Test build #114977 has finished for PR 24936 at commit
|
retest this, please |
Test build #116589 has finished for PR 24936 at commit
|
47dfca5
to
935856c
Compare
Test build #116597 has finished for PR 24936 at commit
|
935856c
to
f99a528
Compare
Test build #118490 has finished for PR 24936 at commit
|
Retest this, please |
Test build #118497 has finished for PR 24936 at commit
|
f99a528
to
71bdb2a
Compare
Test build #120562 has finished for PR 24936 at commit
|
…n watermark * just works * FIXME: remove redundant checks in both discard exec and stateful operator exec * FIXME: make discard exec to codegen
71bdb2a
to
140d759
Compare
Test build #121237 has finished for PR 24936 at commit
|
retest this, please |
Test build #122135 has finished for PR 24936 at commit
|
retest this, please |
Test build #122144 has finished for PR 24936 at commit
|
…ater than watermark plus allowed delay apache#24936
What changes were proposed in this pull request?
Please refer https://issues.apache.org/jira/browse/SPARK-24634 to see rationalization of the issue.
This patch adds a new metric to count the number of rows arrived later than watermark plus allowed delay. To count the number of rows correctly, this patch adds a new physical node before stateful operator node(s) which counts the number of input rows later than watermark plus allowed delay.
Note that it doesn't mean these rows will be always discarded, since there're some cases we don't discard late input rows, e.g. non-window streaming aggregation.
The metric will be exposed to two places:
number of input rows later than watermark plus allowed delay
in CountLateRowsExecThis is a revised version of #21617.
How was this patch tested?
Modified UT, and ran manual test reproducing SPARK-28094.
I've picked the specific case on "B outer C outer D" which is enough to represent the "intermediate late row" issue due to global watermark.
https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17
Spark logs warning message on the query which means SPARK-28074 is working correctly,
and query plan represents CountLateRows with the number of late rows. Empty batch runs for batch 4 as follows:
The boxed node in the detail page for batch 4 represents there're 4 rows emitted from left side of join (B inner C) which rows are all later than watermark + allowed delay.
The number of late rows are also reported as streaming query listener as follow (check
numLateInputRows
):