SPARK-1601 & SPARK-1602: two bug fixes related to cancellation #521

rxin · 2014-04-24T01:50:34Z

This should go into 1.0 since it would return wrong data when the bug happens (which is pretty likely if cancellation is used). Test case attached.

1. Do not put partially executed partitions into cache (in task killing).
2. Iterator returned by CacheManager#getOrCompute was not an InterruptibleIterator, and was thus leading to uninterruptible jobs.
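
To illustrate the first fix, here is a minimal sketch of the idea in plain Scala, with no Spark dependency. All names (`TaskKilledException`, `TaskState`, `PartialCacheSketch`) are hypothetical stand-ins and not the code changed by this PR. The assumption is only that a partition should be stored in the cache after it has been computed to completion, and never when the task is killed midway.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration; these are not Spark's classes.
final class TaskKilledException extends RuntimeException("task killed")

final class TaskState { @volatile var interrupted: Boolean = false }

object PartialCacheSketch {
  // Hypothetical in-memory cache keyed by partition id.
  val cache = mutable.Map.empty[Int, Vector[Int]]

  // Materialize a partition and cache it only if the computation ran to completion.
  def getOrCompute(partitionId: Int,
                   compute: () => Iterator[Int],
                   state: TaskState): Iterator[Int] =
    cache.get(partitionId) match {
      case Some(values) => values.iterator
      case None =>
        val buffer = Vector.newBuilder[Int]
        val it = compute()
        while (it.hasNext) {
          // Abort before storing anything if the task was killed mid-partition.
          if (state.interrupted) throw new TaskKilledException
          buffer += it.next()
        }
        val values = buffer.result()
        cache(partitionId) = values // reached only for a fully computed partition
        values.iterator
    }
}
```

In this sketch a killed task leaves the cache untouched, so a later job recomputes the partition instead of reading a truncated copy.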

Thanks @aarondav and @ahirreddy for reporting and helping debug.


AmplabJenkins · 2014-04-24T01:52:55Z

Merged build triggered.

AmplabJenkins · 2014-04-24T01:53:05Z

Merged build started.

AmplabJenkins · 2014-04-24T02:29:15Z

Merged build finished.

AmplabJenkins · 2014-04-24T02:29:15Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14419/

aarondav · 2014-04-24T02:39:05Z

Test failure seems unrelated since the test does not seem to construct an RDD or interact with a SparkContext or InterruptibleIterator at all.

Jenkins, retest this please.

aarondav · 2014-04-24T02:40:02Z

core/src/test/scala/org/apache/spark/JobCancellationSuite.scala

@@ -206,3 +235,9 @@ class JobCancellationSuite extends FunSuite with ShouldMatchers with BeforeAndAf
    }
  }
 }
+
+


nit: extra new line

actually two blank lines to separate top level objects is a habit and sometimes recommended style :)

aarondav · 2014-04-24T02:42:08Z

Note that this also fixes a bug where the iterator returned by CacheManager#getOrCompute was not an InterruptibleIterator, and was thus leading to uninterruptible jobs.
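
As a rough illustration of why the wrapper matters, the sketch below shows an iterator that checks a kill flag on every `hasNext` call, so a consumer loop stops soon after the flag is set. The names (`KilledException`, `KillFlag`, `CheckedIterator`, `InterruptibleSketch`) are hypothetical and this is not Spark's `InterruptibleIterator`; it only assumes that both the cached and the freshly computed path should hand back a wrapped iterator.

```scala
// Hypothetical names for illustration; not Spark's actual classes.
final class KilledException extends RuntimeException("task killed")

final class KillFlag { @volatile var killed: Boolean = false }

// Delegates to an underlying iterator but checks the kill flag on every hasNext call.
final class CheckedIterator[T](flag: KillFlag, underlying: Iterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = {
    if (flag.killed) throw new KilledException
    underlying.hasNext
  }
  override def next(): T = underlying.next()
}

object InterruptibleSketch {
  // Both the cache-hit and the cache-miss path return a wrapped iterator.
  def getOrCompute[T](cached: Option[Seq[T]],
                      compute: () => Iterator[T],
                      flag: KillFlag): Iterator[T] = {
    val raw = cached.map(_.iterator).getOrElse(compute())
    new CheckedIterator(flag, raw)
  }
}
```

Returning the raw iterator on either path would let a downstream loop run to the end of the partition even after the flag is set, which is the kind of uninterruptible job described above.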

aarondav · 2014-04-24T02:56:55Z

I created SPARK-1601 and SPARK-1602 to track the issues fixed by this PR. As SPARK-1602 is the more serious issue, and the main one fixed here, please put it in the PR title.


rxin · 2014-04-24T05:04:49Z

OK, I brought this up to date. A couple of extra commits show up because the ASF git bot hasn't synced them to GitHub yet.

AmplabJenkins · 2014-04-24T05:07:55Z

Merged build triggered.

AmplabJenkins · 2014-04-24T05:08:01Z

Merged build started.

AmplabJenkins · 2014-04-24T05:47:30Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-04-24T05:47:30Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14425/

mateiz · 2014-04-24T07:18:03Z

Looks good to me.

rxin · 2014-04-24T07:32:52Z

Thanks. I've merged this.

@aarondav

This should go into 1.0 since it would return wrong data when the bug happens (which is pretty likely if cancellation is used). Test case attached.

1. Do not put partially executed partitions into cache (in task killing).
2. Iterator returned by CacheManager#getOrCompute was not an InterruptibleIterator, and was thus leading to uninterruptible jobs.

Thanks @aarondav and @ahirreddy for reporting and helping debug.

Author: Reynold Xin <rxin@apache.org>

Closes #521 from rxin/kill and squashes the following commits:

401033f [Reynold Xin] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into kill
7a7bdd2 [Reynold Xin] Add a new line in the end of JobCancellationSuite.scala.
35cd9f7 [Reynold Xin] Fixed a bug that partially executed partitions can be put into cache (in task killing).

(cherry picked from commit 1fdf659)
Signed-off-by: Reynold Xin <rxin@apache.org>

rxin added 2 commits April 23, 2014 18:48

Fixed a bug that partially executed partitions can be put into cache …

35cd9f7

…(in task killing).

Add a new line in the end of JobCancellationSuite.scala.

7a7bdd2

aarondav reviewed Apr 24, 2014

rxin changed the title ~~Fixed a bug that partially executed partitions can be put into cache (in task killing).~~ SPARK-1601 & SPARK-1602: Fixed a bug that partially executed partitions can be put into cache (in task killing). Apr 24, 2014

rxin changed the title ~~SPARK-1601 & SPARK-1602: Fixed a bug that partially executed partitions can be put into cache (in task killing).~~ SPARK-1601 & SPARK-1602: Do not put partially executed partitions into cache (in task killing). Apr 24, 2014

rxin changed the title ~~SPARK-1601 & SPARK-1602: Do not put partially executed partitions into cache (in task killing).~~ SPARK-1601 & SPARK-1602: bug fix related to cancellation Apr 24, 2014

rxin changed the title ~~SPARK-1601 & SPARK-1602: bug fix related to cancellation~~ SPARK-1601 & SPARK-1602: two bug fixes related to cancellation Apr 24, 2014

Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark …

401033f

…into kill Conflicts: core/src/test/scala/org/apache/spark/JobCancellationSuite.scala

asfgit closed this in 1fdf659 Apr 24, 2014

rxin deleted the kill branch August 13, 2014 08:01

helenyugithub pushed a commit to helenyugithub/spark that referenced this pull request Aug 20, 2019

Add back Kubernetes local file mounting. (apache#521)

04a6670

* Add back local file mounting. * Commit back entrypoint changes, add unit test * Remove unnecessary whitespace change
