SPARK-1929 DAGScheduler suspended by local task OOM #883

Closed
wants to merge 2 commits into from

@zhpengg (Contributor) commented May 26, 2014

DAGScheduler does not handle local task OOM properly, and will wait for the job result forever.
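
The hang follows from the JVM exception hierarchy: `OutOfMemoryError` extends `Error`, not `Exception`, so a `case e: Exception` handler never matches it. A minimal, self-contained sketch of that behavior is below. `runLocally` and `Outcome` are hypothetical names used for illustration only, not the DAGScheduler code, and the OOM is simulated by throwing the error directly.

```scala
object OomNotCaught {
  // Hypothetical stand-in for a job outcome; not Spark's JobResult type.
  sealed trait Outcome
  case object Succeeded extends Outcome
  final case class Failed(cause: Throwable) extends Outcome

  // Mirrors a handler that only matches Exception: an Error escapes it.
  def runLocally(task: () => Unit): Outcome =
    try {
      task()
      Succeeded
    } catch {
      case e: Exception => Failed(e)
    }

  def main(args: Array[String]): Unit = {
    // An ordinary exception is turned into a failed outcome.
    println(runLocally(() => throw new RuntimeException("boom")))

    // A simulated OOM is not an Exception, so it propagates out of runLocally
    // and no outcome is ever reported to whoever is waiting for one.
    val escaped =
      try { runLocally(() => throw new OutOfMemoryError("simulated")); false }
      catch { case _: OutOfMemoryError => true }
    println(s"OutOfMemoryError escaped the handler: $escaped")
  }
}
```

If the caller blocks until an outcome is reported, an error that escapes this way leaves it waiting indefinitely, which matches the symptom described above.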

@AmplabJenkins

Can one of the admins verify this patch?

@rxin (Contributor) commented May 26, 2014

Jenkins, add to whitelist.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15210/

@@ -580,6 +580,13 @@ class DAGScheduler(
      case e: Exception =>
        jobResult = JobFailed(e)
        job.listener.jobFailed(e)
      case oom: OutOfMemoryError =>
        val errors: StringWriter = new StringWriter()

Contributor:

When it is actually OOM, should we try to avoid allocating new objects to make sure it can recover gracefully?

Contributor Author:

Yes, catching the OOM error may not be a good idea, but here we can't tell whether the exception was thrown by the local task or by the driver itself. We are only trying to recover the DAG scheduler from the hang described above.
Any advice would be appreciated!

Contributor:

What if instead of allocating more stuff, you just put the following:

val exception = new SparkException("Out of memory exception", oom)
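
A minimal sketch of the pattern suggested here: catch the `OutOfMemoryError`, wrap it once as the cause of an ordinary exception, and report the failure so that whoever is waiting gets released. `JobFailedException`, `Listener`, and `runLocally` are hypothetical stand-ins so the example runs without Spark. The suggestion in this thread uses `SparkException`; reporting through a listener mirrors the `Exception` case shown in the hunk and is an assumption for the OOM case.

```scala
object OomReported {
  // Hypothetical stand-in for SparkException so the sketch runs without Spark.
  final class JobFailedException(message: String, cause: Throwable)
    extends Exception(message, cause)

  // Hypothetical listener; records the failure it is told about.
  final class Listener {
    @volatile var failure: Option[Exception] = None
    def jobFailed(e: Exception): Unit = {
      failure = Some(e)
    }
  }

  def runLocally(task: () => Unit, listener: Listener): Unit =
    try {
      task()
    } catch {
      case e: Exception =>
        listener.jobFailed(e)
      case oom: OutOfMemoryError =>
        // One small allocation: wrap the error and keep it as the cause.
        listener.jobFailed(new JobFailedException("Out of memory exception", oom))
    }

  def main(args: Array[String]): Unit = {
    val listener = new Listener
    // The OOM is simulated by throwing the error directly.
    runLocally(() => throw new OutOfMemoryError("simulated"), listener)
    listener.failure.foreach(e => println(s"${e.getMessage} (cause: ${e.getCause})"))
  }
}
```

Keeping the original error as the cause preserves its stack trace without building a separate string copy of it, which is the allocation the review comment was questioning.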

Contributor Author:

Thanks @rxin, I have removed the redundant memory allocations.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@@ -580,6 +580,10 @@ class DAGScheduler(
      case e: Exception =>
        jobResult = JobFailed(e)
        job.listener.jobFailed(e)
      case oom: OutOfMemoryError =>
        val exception = new SparkException("job failed for Out of memory exception", oom)

Contributor:

Can you change the error message to "Local job aborted due to out of memory error"?

@rxin (Contributor) commented May 27, 2014

Actually, never mind; I will just do that when I commit the change. Merging this into master. Thanks!

@asfgit asfgit closed this in 8d271c9 May 27, 2014

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15219/

@zhpengg zhpengg deleted the bugfix-dag-scheduler-oom branch May 27, 2014 06:07
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
DAGScheduler does not handle local task OOM properly, and will wait for the job result forever.

Author: Zhen Peng <zhenpeng01@baidu.com>

Closes apache#883 from zhpengg/bugfix-dag-scheduler-oom and squashes the following commits:

76f7eda [Zhen Peng] remove redundant memory allocations
aa63161 [Zhen Peng] SPARK-1929 DAGScheduler suspended by local task OOM
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024