[SPARK-13054] Always post TaskEnd event for tasks #10951
Conversation
Conflicts: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
Test build #50216 has finished for PR 10951 at commit
Test build #50219 has finished for PR 10951 at commit
// Need to handle tasks coming in late (speculative and jobs killed)
// post a task end event so accounting for things manually tracking tasks works.
// This really should be something other than success since the other speculative task
// finished first.
I think there's a better way to always post this event. I have some changes in #10958 to do this in a cleaner way: https://github.com/apache/spark/pull/10958/files#r51510044. I believe the semantics there are the same as the ones here.
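The idea under discussion can be sketched in miniature. Below is a minimal, self-contained model — not the actual DAGScheduler code, all names are illustrative — of the behavior both patches aim for: post a TaskEnd event for every completed task, even when the stage has already finished.

```scala
import scala.collection.mutable

// Minimal model (NOT actual Spark code) of the fix being discussed: post a
// TaskEnd event for every completed task, even when the stage has already
// finished. All names here are illustrative.
case class TaskEnd(stageId: Int, taskId: Long, reason: String)

class MiniScheduler {
  val activeStages = mutable.Set[Int]()
  val postedEvents = mutable.Buffer[TaskEnd]()

  def handleTaskCompletion(stageId: Int, taskId: Long, reason: String): Unit = {
    // Always post the event first, so listeners that count running tasks
    // (UI, dynamic allocation) stay consistent even for late arrivals.
    postedEvents += TaskEnd(stageId, taskId, reason)
    if (!activeStages.contains(stageId)) {
      return // stage already finished: skip bookkeeping, event still posted
    }
    // ... normal per-stage success/failure bookkeeping would go here ...
  }
}

val s = new MiniScheduler()
s.activeStages += 1
s.handleTaskCompletion(1, 10L, "Success")
s.activeStages -= 1                       // stage 1 completes
s.handleTaskCompletion(1, 11L, "Success") // late speculative duplicate
println(s.postedEvents.size)              // both tasks got a TaskEnd event
```

The key ordering choice is that the event is posted before the early return, so a speculative task that loses the race still produces a TaskEnd.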
I think that is fine for now since it's combining failed vs. successful tasks. I do think it's a bit weird that Spark marks all speculative tasks as Success even when both obviously don't commit. That is part of the other JIRA I was going to file, though, and if needed it can be split back apart at that point.
It does seem a bit odd to throw SPARK-13054 in with the other changes in the same PR, though.
by the way you should put SPARK-13054 in the title of this patch if you plan to do that here.
updating now, need to test and then will post updated version.
@tgravescs Have you verified that the changes here actually fix the issues you observed with dynamic allocation? I thought we needed to make sure the events are posted in the right order, in addition to making sure we always post a task end event for each task that ran; otherwise we'll still run into SPARK-11334.
hey @andrewor14. Yes, these changes fix the issue. It's really easy to reproduce if you want to test it out yourself; just check the instructions in the JIRA. Dynamic allocation is pretty broken right now with speculation on, at least if you want it to give executors back. Here I changed what was necessary (kind of the minimum) to make it work, and all the tests I ran passed. I ran many times because it's all timing dependent on the order in which the events come in and whether speculative tasks finish before the original, etc. Honestly the speculation code needs some cleanup and rework. I am going to file another JIRA for that, though. In my opinion, you can try to make things happen in the right order as much as possible, but this is a distributed system and you have to be able to handle things coming out of order.
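The out-of-order point above can be illustrated with a tiny sketch (not Spark code): a handler in a distributed system can tolerate duplicated or re-ordered completion messages by being idempotent, for example with a set of already-seen task ids.

```scala
import scala.collection.mutable

// Illustrative only: process each task's completion at most once, no matter
// how many times or in what order the messages arrive.
val seen = mutable.Set[Long]()
var processed = 0

def onCompletion(taskId: Long): Unit = {
  if (seen.add(taskId)) { // Set.add returns false if the id was already present
    processed += 1        // real code would do the accounting work here
  }
}

onCompletion(7L)
onCompletion(7L) // duplicate delivery: ignored
onCompletion(8L)
println(processed)
```

This is the general defensive pattern, not what this PR implements; the PR relies on the existing DAGScheduler handling, as discussed further below in the thread.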
Test build #51058 has finished for PR 10951 at commit
Jenkins, test this please
Test build #51099 has started for PR 10951 at commit
jenkins, test this please
Test build #51112 has finished for PR 10951 at commit
I have no idea why these tests are failing since they shouldn't be related to this change; I'll try to run them locally today.
Actually I see now that a couple of the failures are due to throwing the exception for commit denied, so I'll look at those tests more closely.
Since some of the unit tests are having issues with my change to throw a commit denied exception rather than ignore when needsTaskCommit=false, I'm removing that and we can handle it under SPARK-13343. Just sending out the TaskEnd event fixes the issue with the accounting being wrong.
Jenkins, test this please
Test build #51381 has finished for PR 10951 at commit
@andrewor14 this should be basically the same change you had made, with the addition of the test now.
LGTM. Let's quickly pass this by @kayousterhout before merging.
@kayousterhout any comments?
It's been 2 weeks on this; unless there are other comments I'd like to commit this. @andrewor14, any objections? Not sure this will cherry-pick back into 1.6 cleanly, but if not I'll put another pull request up for that.
Sorry for being insanely slow to look at this. I'm concerned about this code because of this line of code: we call taskEnded (which results in the code you modified getting called) when an executor is lost, for all of the tasks on that executor. I think it's possible, as a result, to get multiple task-end events for a particular task in theory (if messages get re-ordered), so I think this could result in multiple SparkListenerTaskEnd events for the same task. I didn't look at this super thoroughly, so let me know if you think this is a non-issue.
thanks for the feedback, I'll check into that.
So I don't think this PR changes that.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L793
https://github.com/apache/spark/pull/10951/files#diff-6a9ff7fb74fd490a50462d45db2d5e11L1149
The only time it now sends the taskEnd that it didn't before is:
To add some more detail: the executorLost function only calls taskEnded (which is task failed) if the task was in the list of successful tasks. handleSuccessfulTask calls taskEnded and then adds the task to the list of successful tasks. Since taskEnded ends up sending the CompletionEvent, it is possible for the events to show up in either order. But DAGScheduler.handleTaskCompletion before this PR already handled both of those and sent SparkListenerTaskEnd in both cases. This PR doesn't change that behavior for those two events. The messages are success and resubmitted (which is failed), and DAGScheduler.handleTaskCompletion would send the SparkListenerTaskEnd event for both, with this PR and before this PR. If there is anything else you think I missed, let me know.
@@ -134,6 +134,7 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeouts {
   val successfulStages = new HashSet[Int]
   val failedStages = new ArrayBuffer[Int]
   val stageByOrderOfExecution = new ArrayBuffer[Int]
+  var endedTasks = new HashSet[Long]
can be a val, it's a mutable HashSet.
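The reviewer's point, in miniature: a `val` holding a mutable collection can still be mutated; only reassigning the reference is forbidden, so `var` buys nothing here.

```scala
import scala.collection.mutable.HashSet

// A val reference to a mutable HashSet can still be mutated in place.
val endedTasks = new HashSet[Long]
endedTasks += 42L                   // fine: mutating the collection itself
// endedTasks = new HashSet[Long]   // would not compile: reassignment of a val
println(endedTasks.contains(42L))
```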
some minor comments, otherwise lgtm
I'm merging this into master since
Test build #53086 has finished for PR 10951 at commit
I am using dynamic container allocation and speculation and am seeing issues with the active task accounting. The Executor UI still shows active tasks on an executor, but the job/stage is all completed. I think it's also affecting dynamic allocation being able to release containers because it thinks there are still tasks.

There are multiple issues with this:
- If the task end for tasks (in this case probably because of speculation) comes in after the stage is finished, then DAGScheduler.handleTaskCompletion will skip the task completion event

Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>
Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
Author: Tom Graves <tgraves@yahoo-inc.com>

Closes apache#10951 from tgravescs/SPARK-11701.