SPARK-1929 DAGScheduler suspended by local task OOM #883
Conversation
Can one of the admins verify this patch?
Jenkins, add to whitelist.
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
@@ -580,6 +580,13 @@ class DAGScheduler(
        case e: Exception =>
          jobResult = JobFailed(e)
          job.listener.jobFailed(e)
        case oom: OutOfMemoryError =>
          val errors: StringWriter = new StringWriter()
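For context, this first version of the patch rendered the OOM's stack trace into a string before failing the job. A self-contained sketch of that pattern (the object and method names here are illustrative, not Spark's actual code):

```scala
import java.io.{PrintWriter, StringWriter}

object OomDescribe {
  // Render a throwable's stack trace into a String, as the first draft
  // of the patch did with StringWriter before building the failure message.
  def describe(t: Throwable): String = {
    val errors = new StringWriter()
    t.printStackTrace(new PrintWriter(errors))
    errors.toString
  }

  def main(args: Array[String]): Unit = {
    val trace = describe(new OutOfMemoryError("simulated"))
    // First line of a stack trace is the class name plus message.
    println(trace.split("\n").head)
  }
}
```

Note that this allocates several new objects while the JVM is already out of memory, which is exactly the concern raised below.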
When it is actually OOM, should we try to avoid allocating new objects to make sure it can recover gracefully?
Yes, maybe trying to catch the OOM error is not a good idea, but here we can't tell whether the exception was thrown by the local task or by the driver itself. We are just trying to recover the DAG scheduler from that situation.
Any advice would be appreciated!
What if instead of allocating more stuff, you just put the following:
val exception = new SparkException("Out of memory exception", oom)
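This suggestion avoids building a stack-trace string and instead keeps the original error reachable through normal exception chaining. A minimal sketch of the idea (`SparkException` is stubbed here for illustration; the real class lives in `org.apache.spark`):

```scala
// Stand-in for org.apache.spark.SparkException, for illustration only.
class SparkException(message: String, cause: Throwable)
  extends Exception(message, cause)

object WrapOom {
  // Wrap the error once; no stack-trace printing, no extra buffers.
  def wrap(oom: OutOfMemoryError): SparkException =
    new SparkException("Out of memory exception", oom)

  def main(args: Array[String]): Unit = {
    val oom = new OutOfMemoryError("simulated")
    val wrapped = wrap(oom)
    // The original error stays available as the cause.
    assert(wrapped.getCause eq oom)
  }
}
```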
Thanks @rxin, I have removed the redundant memory allocations.
Merged build triggered.
Merged build started.
@@ -580,6 +580,10 @@ class DAGScheduler(
        case e: Exception =>
          jobResult = JobFailed(e)
          job.listener.jobFailed(e)
        case oom: OutOfMemoryError =>
          val exception = new SparkException("job failed for Out of memory exception", oom)
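Putting the discussion together, the merged handler turns the error into an ordinary job failure so the listener is notified and the scheduler stops waiting. A self-contained sketch of that catch logic, with stand-ins for `DAGScheduler`'s result types (names assumed, not verbatim Spark code):

```scala
sealed trait JobResult
final case class JobFailed(e: Exception) extends JobResult

// Stand-in for org.apache.spark.SparkException, for illustration only.
final class SparkException(message: String, cause: Throwable)
  extends Exception(message, cause)

object LocalJobErrorHandler {
  // Mirror of the patched catch block: an Exception passes through as-is;
  // an OutOfMemoryError is wrapped once (no extra allocations beyond the
  // wrapper) so the job listener is still told the job failed and the
  // DAGScheduler does not block on the job result forever.
  def toJobResult(t: Throwable): JobResult = t match {
    case e: Exception => JobFailed(e)
    case oom: OutOfMemoryError =>
      JobFailed(
        new SparkException("Local job aborted due to out of memory error", oom))
  }
}
```

The key design point is that `OutOfMemoryError` is an `Error`, not an `Exception`, so the original `case e: Exception` never matched it and the waiting job result was never delivered.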
Can you change the error message to "Local job aborted due to out of memory error"?
Actually, never mind, I will just do that when I commit the change. Merging this into master. Thanks!
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15219/
DAGScheduler does not handle local task OOM properly, and will wait for the job result forever.

Author: Zhen Peng <zhenpeng01@baidu.com>

Closes apache#883 from zhpengg/bugfix-dag-scheduler-oom and squashes the following commits:

76f7eda [Zhen Peng] remove redundant memory allocations
aa63161 [Zhen Peng] SPARK-1929 DAGScheduler suspended by local task OOM