
[SPARK-4012] call tryOrExit instead of logUncaughtExceptions in ContextCleaner #2864

Closed
wants to merge 1 commit

Conversation

@CodingCat (Contributor)

When running a "might-be-memory-intensive" application locally, I received the following exceptions:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Spark Context Cleaner"
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Driver Heartbeater"
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated

The reason is that we start the ContextCleaner in a separate thread but don't handle exceptions there: logUncaughtExceptions only logs the error and then re-throws it, leaving it to the JVM's default UncaughtExceptionHandler.
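
For context, here is a simplified sketch of how the two helpers in org.apache.spark.util.Utils behaved around this time (paraphrased for illustration, not copied from the Spark source; the log output and exit code shown are placeholders):

```scala
import scala.util.control.ControlThrowable

// Simplified sketches of the two Utils helpers discussed in this PR;
// paraphrased for illustration, not verbatim Spark code.
object HandlerSketch {

  // logUncaughtExceptions: log the error, then RE-THROW it. A background
  // thread wrapped in this still dies, and the JVM's default
  // UncaughtExceptionHandler takes over from there.
  def logUncaughtExceptions[T](f: => T): T = {
    try {
      f
    } catch {
      case ct: ControlThrowable => throw ct // let Scala control flow through
      case t: Throwable =>
        System.err.println(
          s"Uncaught exception in thread ${Thread.currentThread().getName}: $t")
        throw t
    }
  }

  // tryOrExit: hand the error to a handler that terminates the JVM
  // (in Spark, ExecutorUncaughtExceptionHandler, which calls System.exit).
  def tryOrExit(block: => Unit): Unit = {
    try {
      block
    } catch {
      case ct: ControlThrowable => throw ct
      case t: Throwable =>
        t.printStackTrace()
        sys.exit(50) // illustrative exit code; Spark uses its own constants
    }
  }
}
```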

@SparkQA commented Oct 20, 2014

QA tests have started for PR 2864 at commit 287bd07.

  • This patch merges cleanly.

@SparkQA commented Oct 20, 2014

QA tests have finished for PR 2864 at commit 287bd07.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21935/

@SparkQA commented Oct 20, 2014

QA tests have started for PR 2864 at commit 55cafc8.

  • This patch merges cleanly.

@SparkQA commented Oct 20, 2014

QA tests have finished for PR 2864 at commit 55cafc8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ReconnectWorker(masterUrl: String) extends DeployMessage

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21937/

@CodingCat changed the title from "SPARK-4012: call tryOrExit instead of logUncaughtExceptions in ContextCleaner" to "[SPARK-4012] call tryOrExit instead of logUncaughtExceptions in ContextCleaner" on Oct 21, 2014
@SparkQA commented Oct 21, 2014

QA tests have started for PR 2864 at commit 3893a7e.

  • This patch merges cleanly.

@SparkQA commented Oct 21, 2014

QA tests have finished for PR 2864 at commit 3893a7e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class JavaFutureActionWrapper[S, T](futureAction: FutureAction[S], converter: S => T)
    • case class ReconnectWorker(masterUrl: String) extends DeployMessage

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21958/

@SparkQA commented Oct 21, 2014

QA tests have started for PR 2864 at commit 2f5a4c0.

  • This patch merges cleanly.

@SparkQA commented Oct 21, 2014

Tests timed out for PR 2864 at commit 2f5a4c0 after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21985/

@SparkQA commented Oct 21, 2014

QA tests have started for PR 2864 at commit 737d36b.

  • This patch merges cleanly.

@SparkQA commented Oct 21, 2014

QA tests have finished for PR 2864 at commit 737d36b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21999/

@andrewor14 (Contributor)

Hey @CodingCat, I think this is the expected behavior. If an OOM is thrown on the driver, then the context cleaner thread should just die. In fact, logUncaughtExceptions rethrows the error, and it is propagated to Java's UncaughtExceptionHandler. I don't think we want to use tryOrExit here, because we want to log the exception, and we don't want to use the ExecutorUncaughtExceptionHandler for the driver.

Would you mind closing this?

@CodingCat (Contributor, Author)

Hi @andrewor14, the issue here is that the JVM's default UncaughtExceptionHandler doesn't seem to handle the exception correctly; as shown in the PR description, it asks the user to shut down the JVM manually.

W.r.t. using ExecutorUncaughtExceptionHandler on the driver side, it has been there for a while: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L148

@CodingCat (Contributor, Author)

This is also very similar to #622, where the main thread cannot handle the exception thrown by Akka's scheduler thread.

@CodingCat (Contributor, Author)

But I don't mind moving ExecutorUncaughtExceptionHandler somewhere else to make it more general.

@andrewor14 (Contributor)

I see... when the JVM's UncaughtExceptionHandler catches an OOM exception, it doesn't actually kill the JVM. However, we do want to log the fact that we are facing an OOM exception somewhere. Even though ExecutorUncaughtExceptionHandler eventually does that, it's not clear from the name tryOrExit that we will log the error. I'm just not sure how badly this issue needs to be fixed. We do this in other places too (e.g. LiveListenerBus has a polling thread).

@CodingCat (Contributor, Author)

@andrewor14 (I just hit a LiveListenerBus uncaught exception this afternoon...)

Personally, I feel that we should stop the driver when such things happen. For example, the user may check the JVM process's liveness for HA purposes; after this uncaught exception is thrown, the JVM process is "alive" but not functional.

If we agree on that, I think we may need to propagate the change to other places.
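
To make the "alive but not functional" concern concrete, here is a minimal standalone demo (hypothetical code, not Spark source): the background thread dies on an uncaught error, but the process and its pid live on, so a pid-based liveness check would still report the driver as healthy.

```scala
// Hypothetical demo: an uncaught error kills only the background thread;
// the JVM process stays alive even though the loop it ran is gone.
object AliveButNotFunctional {
  def main(args: Array[String]): Unit = {
    val cleaner = new Thread("demo-cleaner") {
      override def run(): Unit = {
        while (true) {
          // Simulate a fatal error hitting the cleaning loop.
          throw new OutOfMemoryError("simulated")
        }
      }
    }
    cleaner.setDaemon(true)
    cleaner.start()
    Thread.sleep(1000)
    // The cleaner thread is dead, yet this process (and its pid) remains,
    // so a pid-existence HA check would still consider the driver healthy.
    println(s"cleaner alive? ${cleaner.isAlive}") // prints: cleaner alive? false
  }
}
```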

@SparkQA commented Oct 24, 2014

Test build #22166 has started for PR 2864 at commit e163adf.

  • This patch merges cleanly.

@SparkQA commented Oct 24, 2014

Test build #22166 has finished for PR 2864 at commit e163adf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22166/

@SparkQA commented Nov 3, 2014

Test build #22813 has started for PR 2864 at commit 39eeb02.

  • This patch merges cleanly.

@SparkQA commented Nov 3, 2014

Test build #22813 has finished for PR 2864 at commit 39eeb02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22813/

@CodingCat (Contributor, Author)

Hey @andrewor14,

Any further thoughts on this PR?

@asfgit closed this in 5923dd9 on Nov 7, 2014
@CodingCat (Contributor, Author)

OK... I didn't expect the PR to be closed in this way.

(For this PR, I was obviously waiting for more responses and trying to get a clearer answer from @andrewor14 about whether I misunderstood something or whether we should propagate the change to other places; see my last comment.)

Just a suggestion: closing PRs without active maintenance is definitely a good way to keep the repository clean. However, I think a better way to close a "pending" pull request is to end the discussion first by reaching some answer, instead of terminating the discussion without any notice and suddenly closing an actively discussed PR.

I don't intend to offend; @andrewor14 is making great contributions to the community. What I said is just meant to make the review process more reasonable.

@andrewor14 (Contributor)

Hey @CodingCat, sorry, this was unintended. The automatic closing of pull requests searches for key phrases, including "mind closing this", which I said a while ago before we started discussing the specifics. I have been swamped with the 1.2 release lately, so I haven't had time to look at some older PRs that are less urgent, but I definitely did not intend to terminate the discussion so abruptly. If you wish to open a new PR on the same issue, please feel free to do so. Personally, I am ambivalent on whether this particular change is a good one, but maybe others can look at it.

@CodingCat (Contributor, Author)

Hi @andrewor14, sorry about the misunderstanding, and thanks for your patient explanation.

I will resubmit it and hopefully get more feedback on it. Thanks!

asfgit pushed a commit that referenced this pull request Mar 19, 2015
…nfinite loop

https://issues.apache.org/jira/browse/SPARK-4012

This patch is a resubmission for #2864

What I am proposing in this patch is that ***when an exception is thrown from an infinite loop, we should stop the SparkContext instead of letting the JVM throw the exception forever***.

So, in the infinite loops that we originally wrapped with `logUncaughtExceptions`, I switched to `tryOrStopSparkContext`, so that the Spark component is stopped (see the sketch after this commit message).

Stopping the JVM process early is helpful for HA scheme design. For example:

The user may have a script that checks for the pid of the Spark Streaming driver to monitor availability; with the code before this patch, the JVM process is still alive but not functional when these exceptions are thrown.

andrewor14, srowen, would you mind giving the change further consideration?

Author: CodingCat <zhunansjtu@gmail.com>

Closes #5004 from CodingCat/SPARK-4012-1 and squashes the following commits:

589276a [CodingCat] throw fatal error again
3c72cd8 [CodingCat] address the comments
6087864 [CodingCat] revise comments
6ad3eb0 [CodingCat] stop SparkContext instead of quit the JVM process
6322959 [CodingCat] exit JVM process when the exception is thrown from an infinite loop
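
For reference, the tryOrStopSparkContext helper introduced by the resubmitted patch (#5004) behaves roughly like the following sketch (paraphrased for illustration, not the exact Spark source): log the error, stop the SparkContext so the driver shuts down cleanly, and re-throw fatal errors such as OutOfMemoryError.

```scala
import scala.util.control.{ControlThrowable, NonFatal}
import org.apache.spark.SparkContext

// Rough sketch of the helper the follow-up patch introduces;
// paraphrased for illustration, not copied from the Spark source.
def tryOrStopSparkContext(sc: SparkContext)(block: => Unit): Unit = {
  try {
    block
  } catch {
    case ct: ControlThrowable => throw ct // don't swallow Scala control flow
    case t: Throwable =>
      if (sc != null) {
        // Stop the context so the driver terminates instead of lingering
        // "alive but not functional".
        sc.stop()
      }
      if (!NonFatal(t)) {
        throw t // re-throw fatal errors (e.g. OutOfMemoryError)
      }
  }
}
```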