
[SPARK-4012] call tryOrExit instead of logUncaughtExceptions in ContextCleaner #2864

Closed
wants to merge 1 commit

Conversation

@CodingCat (Contributor)

When running a "might-be-memory-intensive" application locally, I received the following exceptions:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Spark Context Cleaner"
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Driver Heartbeater"
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated

The reason is that we start the ContextCleaner in a separate thread but don't handle exceptions there: logUncaughtExceptions only logs the error and then re-throws it, leaving it to the JVM's default UncaughtExceptionHandler.
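
For context, here is a simplified sketch of how the two helpers in org.apache.spark.util.Utils behaved around this time (paraphrased for illustration, not copied from the Spark source; the log output and exit code shown are placeholders):

```scala
import scala.util.control.ControlThrowable

// Simplified sketches of the two Utils helpers discussed in this PR;
// paraphrased for illustration, not verbatim Spark code.
object HandlerSketch {

  // logUncaughtExceptions: log the error, then RE-THROW it. A background
  // thread wrapped in this still dies, and the JVM's default
  // UncaughtExceptionHandler takes over from there.
  def logUncaughtExceptions[T](f: => T): T = {
    try {
      f
    } catch {
      case ct: ControlThrowable => throw ct // let Scala control flow through
      case t: Throwable =>
        System.err.println(
          s"Uncaught exception in thread ${Thread.currentThread().getName}: $t")
        throw t
    }
  }

  // tryOrExit: hand the error to a handler that terminates the JVM
  // (in Spark, ExecutorUncaughtExceptionHandler, which calls System.exit).
  def tryOrExit(block: => Unit): Unit = {
    try {
      block
    } catch {
      case ct: ControlThrowable => throw ct
      case t: Throwable =>
        t.printStackTrace()
        sys.exit(50) // illustrative exit code; Spark uses its own constants
    }
  }
}
```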

@SparkQA commented Oct 20, 2014

QA tests have started for PR 2864 at commit 287bd07.

  • This patch merges cleanly.

@SparkQA commented Oct 20, 2014

QA tests have finished for PR 2864 at commit 287bd07.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21935/

@SparkQA commented Oct 20, 2014

QA tests have started for PR 2864 at commit 55cafc8.

  • This patch merges cleanly.

@SparkQA commented Oct 20, 2014

QA tests have finished for PR 2864 at commit 55cafc8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ReconnectWorker(masterUrl: String) extends DeployMessage

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21937/

@CodingCat changed the title from "SPARK-4012: call tryOrExit instead of logUncaughtExceptions in ContextCleaner" to "[SPARK-4012] call tryOrExit instead of logUncaughtExceptions in ContextCleaner" on Oct 21, 2014
@SparkQA commented Oct 21, 2014

QA tests have started for PR 2864 at commit 3893a7e.

  • This patch merges cleanly.

@SparkQA commented Oct 21, 2014

QA tests have finished for PR 2864 at commit 3893a7e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class JavaFutureActionWrapper[S, T](futureAction: FutureAction[S], converter: S => T)
    • case class ReconnectWorker(masterUrl: String) extends DeployMessage

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21958/

@SparkQA commented Oct 21, 2014

QA tests have started for PR 2864 at commit 2f5a4c0.

  • This patch merges cleanly.

@SparkQA commented Oct 21, 2014

Tests timed out for PR 2864 at commit 2f5a4c0 after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21985/

@SparkQA commented Oct 21, 2014

QA tests have started for PR 2864 at commit 737d36b.

  • This patch merges cleanly.

@SparkQA commented Oct 21, 2014

QA tests have finished for PR 2864 at commit 737d36b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21999/

@andrewor14 (Contributor)

Hey @CodingCat, I think this is the expected behavior. If an OOM is thrown on the driver, then the context cleaner thread should just die. In fact, logUncaughtExceptions rethrows the error, and it is propagated to Java's UncaughtExceptionHandler. I don't think we want to use tryOrExit here, because we want to log the exception, and we don't want to use the ExecutorUncaughtExceptionHandler for the driver.

Would you mind closing this?

@CodingCat (Contributor, Author)

Hi @andrewor14, the issue here is that the JVM's default UncaughtExceptionHandler doesn't seem to handle the exception correctly; as shown in the PR description, it asks the user to shut down the JVM manually.

W.r.t. using ExecutorUncaughtExceptionHandler on the driver side, it has been there for a while: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L148

@CodingCat (Contributor, Author)

This is also very similar to #622, where the main thread cannot handle the exception thrown by Akka's scheduler thread.

@CodingCat (Contributor, Author)

But I don't mind moving ExecutorUncaughtExceptionHandler somewhere else to make it more general.

@andrewor14 (Contributor)

I see... when the JVM's UncaughtExceptionHandler catches an OOM exception, it doesn't actually kill the JVM. However, we do want to log the fact that we are facing an OOM exception somewhere. Even though ExecutorUncaughtExceptionHandler eventually does that, it's not clear from the name tryOrExit that we will log the error. I'm just not sure how badly this issue needs to be fixed. We do this in other places too (e.g. LiveListenerBus has a polling thread).

@CodingCat (Contributor, Author)

@andrewor14 (I just hit a LiveListenerBus uncaught exception this afternoon...)

Personally, I feel that we should stop the driver when such things happen. For example, the user may check the JVM process's liveness for HA purposes; after this uncaught exception is thrown, the JVM process is "alive" but not functional.

If we agree on that, I think we may need to propagate the change to other places.
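
To make the "alive but not functional" concern concrete, here is a minimal standalone demo (hypothetical code, not Spark source): the background thread dies on an uncaught error, but the process and its pid live on, so a pid-based liveness check would still report the driver as healthy.

```scala
// Hypothetical demo: an uncaught error kills only the background thread;
// the JVM process stays alive even though the loop it ran is gone.
object AliveButNotFunctional {
  def main(args: Array[String]): Unit = {
    val cleaner = new Thread("demo-cleaner") {
      override def run(): Unit = {
        while (true) {
          // Simulate a fatal error hitting the cleaning loop.
          throw new OutOfMemoryError("simulated")
        }
      }
    }
    cleaner.setDaemon(true)
    cleaner.start()
    Thread.sleep(1000)
    // The cleaner thread is dead, yet this process (and its pid) remains,
    // so a pid-existence HA check would still consider the driver healthy.
    println(s"cleaner alive? ${cleaner.isAlive}") // prints: cleaner alive? false
  }
}
```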

@SparkQA commented Oct 24, 2014

Test build #22166 has started for PR 2864 at commit e163adf.

  • This patch merges cleanly.

@SparkQA commented Oct 24, 2014

Test build #22166 has finished for PR 2864 at commit e163adf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22166/

@SparkQA commented Nov 3, 2014

Test build #22813 has started for PR 2864 at commit 39eeb02.

  • This patch merges cleanly.

@SparkQA commented Nov 3, 2014

Test build #22813 has finished for PR 2864 at commit 39eeb02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22813/

@CodingCat (Contributor, Author)

Hey @andrewor14,

Any further thoughts on this PR?

@asfgit closed this in 5923dd9 on Nov 7, 2014
@CodingCat (Contributor, Author)

OK... I didn't expect the PR to be closed in this way.

(For this PR, I was obviously waiting for more responses and trying to get a clearer answer from @andrewor14 about whether I misunderstood something or whether we should propagate the change to other places; see my last comment.)

Just a suggestion: closing PRs without active maintenance is definitely a good way to keep the repository clean. However, I think a better way to close a "pending" pull request is to end the discussion first by reaching some answer, instead of terminating the discussion without any notice and suddenly closing an actively discussed PR.

I don't intend to offend; @andrewor14 is making great contributions to the community. What I said is just meant to make the review process more reasonable.

@andrewor14 (Contributor)

Hey @CodingCat, sorry, this was unintended. The automatic closing of pull requests searches for key phrases, including "mind closing this", which I said a while ago before we started discussing the specifics. I have been swamped with the 1.2 release lately, so I haven't had time to look at some older PRs that are less urgent, but I definitely did not intend to terminate the discussion so abruptly. If you wish to open a new PR on the same issue, please feel free to do so. Personally, I am ambivalent on whether this particular change is a good one, but maybe others can look at it.

@CodingCat (Contributor, Author)

Hi @andrewor14, sorry about the misunderstanding, and thanks for your patient explanation.

I will resubmit it and hopefully get more feedback on it. Thanks!

asfgit pushed a commit that referenced this pull request Mar 19, 2015
…nfinite loop

https://issues.apache.org/jira/browse/SPARK-4012

This patch is a resubmission for #2864

What I am proposing in this patch is that ***when an exception is thrown from an infinite loop, we should stop the SparkContext instead of letting the JVM throw the exception forever***.

So, in the infinite loops that we originally wrapped with `logUncaughtExceptions`, I switched to `tryOrStopSparkContext`, so that the Spark component is stopped (see the sketch after this commit message).

Stopping the JVM process early is helpful for HA scheme design. For example:

The user may have a script that checks for the pid of the Spark Streaming driver to monitor availability; with the code before this patch, the JVM process is still alive but not functional when these exceptions are thrown.

andrewor14, srowen, would you mind giving the change further consideration?

Author: CodingCat <zhunansjtu@gmail.com>

Closes #5004 from CodingCat/SPARK-4012-1 and squashes the following commits:

589276a [CodingCat] throw fatal error again
3c72cd8 [CodingCat] address the comments
6087864 [CodingCat] revise comments
6ad3eb0 [CodingCat] stop SparkContext instead of quit the JVM process
6322959 [CodingCat] exit JVM process when the exception is thrown from an infinite loop
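
For reference, the tryOrStopSparkContext helper introduced by the resubmitted patch (#5004) behaves roughly like the following sketch (paraphrased for illustration, not the exact Spark source): log the error, stop the SparkContext so the driver shuts down cleanly, and re-throw fatal errors such as OutOfMemoryError.

```scala
import scala.util.control.{ControlThrowable, NonFatal}
import org.apache.spark.SparkContext

// Rough sketch of the helper the follow-up patch introduces;
// paraphrased for illustration, not copied from the Spark source.
def tryOrStopSparkContext(sc: SparkContext)(block: => Unit): Unit = {
  try {
    block
  } catch {
    case ct: ControlThrowable => throw ct // don't swallow Scala control flow
    case t: Throwable =>
      if (sc != null) {
        // Stop the context so the driver terminates instead of lingering
        // "alive but not functional".
        sc.stop()
      }
      if (!NonFatal(t)) {
        throw t // re-throw fatal errors (e.g. OutOfMemoryError)
      }
  }
}
```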