Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-1772 Stop catching Throwable, let Executors die #715

Closed
wants to merge 3 commits into from

Conversation

aarondav
Copy link
Contributor

@aarondav aarondav commented May 9, 2014

The main issue this patch fixes is SPARK-1772, in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.

This patch also continues the fight in the neverending war against case t: Throwable =>, by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs.

It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing (here). The uncaughtExceptionHandler is not called from the places that we expected it would be.
SPARK-1620 deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR.

The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772),
in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes
Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.

This patch also continues the fight in the neverending war against `case t: Throwable =>`,
by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables
to make sure any uncaught exceptions are at least printed to the logs.

It also turns out that it is unlikely that the IndestructibleActorSystem actually works,
given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The
uncaughtExceptionHandler is not called from the places that we expected it would be.
[SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this
issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly
is a much bigger change, outside the scope of this PR.
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14854/

@pwendell
Copy link
Contributor

/cc @ScrapCodes who worked on the IndestructableActorSystem

@@ -71,7 +71,7 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
stopDaemon()
startDaemon()
new Socket(daemonHost, daemonPort)
case e: Throwable => throw e
case e: Exception => throw e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this line ?

@ScrapCodes
Copy link
Member

It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing (here).

It works but in case of OOMs, the behavior can be very sporadic. The only reason it was needed was in akka 2.0.x days netty was tolerating OOMs and it was thus never the chance that Akka got to deal with them. Since netty almost always had them and mostly manage to eat them. Very weird things. Just saying for posterity.

@mateiz
Copy link
Contributor

mateiz commented May 12, 2014

Looks pretty good to me, just made one small comment. I think it's good to eliminate these now. I haven't seen many cases where they're super useful.

@ScrapCodes
Copy link
Member

Looks good to me too !

@aarondav aarondav changed the title [RFC] SPARK-1772 Stop catching Throwable, let Executors die SPARK-1772 Stop catching Throwable, let Executors die May 12, 2014
@aarondav
Copy link
Contributor Author

Addressed all comments and took "RFC" out of the PR title.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14897/

@pwendell
Copy link
Contributor

This passed all tests except for the (bogus) MIMA issue, so I'll merge it.

asfgit pushed a commit that referenced this pull request May 12, 2014
The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772), in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.

This patch also continues the fight in the neverending war against `case t: Throwable =>`, by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs.

It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The uncaughtExceptionHandler is not called from the places that we expected it would be.
[SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR.

Author: Aaron Davidson <aaron@databricks.com>

Closes #715 from aarondav/throwable and squashes the following commits:

f9b9bfe [Aaron Davidson] Remove other redundant 'throw e'
e937a0a [Aaron Davidson] Address Prashant and Matei's comments
1867867 [Aaron Davidson] [RFC] SPARK-1772 Stop catching Throwable, let Executors die
(cherry picked from commit 3af1f38)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
@asfgit asfgit closed this in 3af1f38 May 12, 2014
markhamstra pushed a commit to markhamstra/spark that referenced this pull request May 14, 2014
The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772), in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.

This patch also continues the fight in the neverending war against `case t: Throwable =>`, by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs.

It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The uncaughtExceptionHandler is not called from the places that we expected it would be.
[SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#715 from aarondav/throwable and squashes the following commits:

f9b9bfe [Aaron Davidson] Remove other redundant 'throw e'
e937a0a [Aaron Davidson] Address Prashant and Matei's comments
1867867 [Aaron Davidson] [RFC] SPARK-1772 Stop catching Throwable, let Executors die
(cherry picked from commit 3af1f38)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>

Conflicts:
	core/src/main/scala/org/apache/spark/ContextCleaner.scala
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
	core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
	core/src/main/scala/org/apache/spark/deploy/Client.scala
	core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
	core/src/main/scala/org/apache/spark/deploy/worker/DriverWrapper.scala
	core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
	core/src/main/scala/org/apache/spark/executor/Executor.scala
	core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala
	core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala
	core/src/main/scala/org/apache/spark/storage/TachyonBlockManager.scala
	core/src/main/scala/org/apache/spark/util/AkkaUtils.scala
	core/src/main/scala/org/apache/spark/util/Utils.scala
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772), in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.

This patch also continues the fight in the neverending war against `case t: Throwable =>`, by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs.

It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The uncaughtExceptionHandler is not called from the places that we expected it would be.
[SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#715 from aarondav/throwable and squashes the following commits:

f9b9bfe [Aaron Davidson] Remove other redundant 'throw e'
e937a0a [Aaron Davidson] Address Prashant and Matei's comments
1867867 [Aaron Davidson] [RFC] SPARK-1772 Stop catching Throwable, let Executors die
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jul 8, 2014
The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772), in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.

This patch also continues the fight in the neverending war against `case t: Throwable =>`, by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs.

It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The uncaughtExceptionHandler is not called from the places that we expected it would be.
[SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#715 from aarondav/throwable and squashes the following commits:

f9b9bfe [Aaron Davidson] Remove other redundant 'throw e'
e937a0a [Aaron Davidson] Address Prashant and Matei's comments
1867867 [Aaron Davidson] [RFC] SPARK-1772 Stop catching Throwable, let Executors die
(cherry picked from commit 3af1f38)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>

Conflicts:
	core/src/main/scala/org/apache/spark/ContextCleaner.scala
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
	core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
	core/src/main/scala/org/apache/spark/deploy/Client.scala
	core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
	core/src/main/scala/org/apache/spark/deploy/worker/DriverWrapper.scala
	core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
	core/src/main/scala/org/apache/spark/executor/Executor.scala
	core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala
	core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala
	core/src/main/scala/org/apache/spark/storage/TachyonBlockManager.scala
	core/src/main/scala/org/apache/spark/util/AkkaUtils.scala
	core/src/main/scala/org/apache/spark/util/Utils.scala

Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
	core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
	core/src/main/scala/org/apache/spark/util/Utils.scala
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants