[SPARK-4012] call tryOrExit instead of logUncaughtExceptions in ContextCleaner #2864
Conversation
QA tests have started for PR 2864 at commit
QA tests have finished for PR 2864 at commit
Test FAILed.
QA tests have started for PR 2864 at commit
QA tests have finished for PR 2864 at commit
Test FAILed.
QA tests have started for PR 2864 at commit
QA tests have finished for PR 2864 at commit
Test FAILed.
QA tests have started for PR 2864 at commit
Tests timed out for PR 2864 at commit
Test FAILed.
QA tests have started for PR 2864 at commit
QA tests have finished for PR 2864 at commit
Test PASSed.
Hey @CodingCat, I think this is the expected behavior. If an OOM is thrown on the driver, then the context cleaner thread should just die. In fact, would you mind closing this?
Hi @andrewor14, the issue here is that the JVM's default UncaughtExceptionHandler does not seem to handle the exception correctly; as I said in the PR description, it asks the user to shut down the JVM manually. Regarding using ExecutorUncaughtExceptionHandler on the driver side, it has been there for a while: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L148
This is also very similar to #622, where the main thread cannot handle the exception thrown by Akka's scheduler thread.
But I don't mind moving that ExecutorUncaughtExceptionHandler somewhere else to make it more general.
I see... when the JVM's
@andrewor14 (I just hit a LiveListenerBus uncaught exception this afternoon...), personally I feel that we should stop the driver when such things happen. For example, the user may check the liveness of the JVM process, for instance for HA; after this uncaught exception is thrown, the JVM process is "alive" but not functional. If we agree on that, I think we may need to propagate the changes to other places.
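The "alive but not functional" failure mode described above can be shown with a minimal stand-alone demo (plain Java, not Spark code; class and method names here are illustrative): an exception thrown inside a background thread kills only that thread, while the JVM process itself keeps running, so a pid-based liveness check still reports a healthy process.

```java
public class ThreadDeathDemo {
    // Returns whether the background thread is still alive after it crashed.
    public static boolean backgroundThreadSurvived() throws InterruptedException {
        Thread worker = new Thread(() -> {
            // Simulate a cleaner-style loop hitting an unexpected error.
            throw new RuntimeException("boom in background loop");
        });
        // Swallow the default handler's stack trace to keep the demo quiet.
        worker.setUncaughtExceptionHandler((t, e) -> { /* swallowed */ });
        worker.start();
        worker.join();            // the worker thread dies here...
        return worker.isAlive();  // ...but this (main) thread keeps running.
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("worker alive after crash: " + backgroundThreadSurvived());
    }
}
```

An external monitor checking only process existence would see nothing wrong, even though the component the worker thread implemented is gone.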
Test build #22166 has started for PR 2864 at commit
Test build #22166 has finished for PR 2864 at commit
Test FAILed.
Test build #22813 has started for PR 2864 at commit
Test build #22813 has finished for PR 2864 at commit
Test PASSed.
Hey @andrewor14, any further thoughts about this PR?
OK... I didn't expect the PR to be closed in this way. For this PR, I was obviously waiting for more responses and trying to get a clearer answer from @andrewor14 about whether I misunderstood something or whether we should propagate the changes to other places (see my last comment).

Just a suggestion: closing PRs without active maintenance is definitely a good way to keep the repository clean. However, I think a better way to close a "pending" pull request is to end the discussion first by getting some answer, instead of suddenly closing an actually active PR without any prior notification.

I don't intend to be offensive. @andrewor14 is making great contributions to the community; what I said is just meant to make the review process more reasonable.
Hey @CodingCat, sorry, this was unintended. The automatic closing of pull requests searches for key words that include "mind closing this", which I said a while ago before we started discussing the specifics. I have been swamped with the 1.2 release lately, so I haven't had time to look at some older PRs that are not as urgent, but I definitely did not intend to terminate the discussion so abruptly. If you wish to open a new PR on the same issue, please feel free to do so. Personally, I am ambivalent on whether this particular change is a good one, but maybe others can look at it.
Hi @andrewor14, sorry about the misunderstanding, and thanks for your patient explanation. I will resubmit it and hopefully get more feedback about this. Thanks!
…nfinite loop

https://issues.apache.org/jira/browse/SPARK-4012

This patch is a resubmission of #2864.

What I am proposing in this patch is that ***when an exception is thrown from an infinite loop, we should stop the SparkContext instead of letting the JVM throw the exception forever***. So, in the infinite loops that we originally wrapped with `logUncaughtExceptions`, I changed the wrapper to `tryOrStopSparkContext`, so that the Spark component is stopped.

Stopping the JVM process early is helpful for HA scheme design. For example, a user may have a script that checks the existence of the pid of the Spark Streaming driver to monitor availability; with the code before this patch, the JVM process is still present but not functional once the exception is thrown.

@andrewor14, @srowen, mind taking further consideration of the change?

Author: CodingCat <zhunansjtu@gmail.com>

Closes #5004 from CodingCat/SPARK-4012-1 and squashes the following commits:

589276a [CodingCat] throw fatal error again
3c72cd8 [CodingCat] address the comments
6087864 [CodingCat] revise comments
6ad3eb0 [CodingCat] stop SparkContext instead of quit the JVM process
6322959 [CodingCat] exit JVM process when the exception is thrown from an infinite loop
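The `tryOrStopSparkContext` idea described in the commit message can be sketched without any Spark dependency (method and callback names below are illustrative, not Spark's actual API): instead of letting the loop body's exception silently kill just the thread, the wrapper stops the owning component, so external liveness checks observe the failure.

```java
public class TryOrStopSketch {
    // Runs body; on a non-fatal exception, stops the component instead of
    // letting only the calling thread die. Fatal errors (e.g. OutOfMemoryError)
    // are re-thrown after stopping, mirroring the "throw fatal error again"
    // commit in the list above.
    public static void runOrStop(Runnable body, Runnable stopComponent) {
        try {
            body.run();
        } catch (RuntimeException e) {
            // Stop the component so a pid/liveness monitor sees the failure.
            stopComponent.run();
        } catch (Error fatal) {
            stopComponent.run();
            throw fatal;
        }
    }
}
```

In the real patch the stop callback would be the SparkContext's own stop logic; here it is just an injected `Runnable` so the control flow is visible in isolation.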
When running a might-be-memory-intensive application locally, I received the following exception.

The reason is that we start the ContextCleaner in a separate thread but don't catch the exception there (it is logged and re-thrown by logUncaughtExceptions).
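A simplified stand-in for the log-and-rethrow wrapper mentioned above (the name mirrors `logUncaughtExceptions`, but this is a sketch, not Spark's real implementation) makes the problem concrete: the exception gets logged, then re-thrown, so the thread running the cleaner loop still terminates with the original error.

```java
public class LogUncaughtSketch {
    public static String lastLogged = null;

    // Log the throwable, then re-throw it unchanged. The calling thread dies
    // anyway; only a log line is left behind.
    public static void logUncaughtExceptions(Runnable body) {
        try {
            body.run();
        } catch (Throwable t) {
            lastLogged = "uncaught exception: " + t.getMessage();  // "log"
            throw t;  // precise rethrow: the caller's thread still terminates
        }
    }
}
```

When the caller is a dedicated cleaner thread, the rethrow reaches the JVM's default uncaught-exception handler and the thread dies, which is exactly the behavior the PR discussion is about.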