[SPARK-25250][CORE] : Late zombie task completions handled correctly even before new taskset launched #22806
Conversation
…tion id, kill other running task attempts on that same partition

The fix that this PR proposes is as follows: whenever any result task completes successfully, we simply mark the corresponding partition id as completed in all attempts for that particular stage. As a result, we no longer see killed tasks due to TaskCommitDenied exceptions showing up in the UI. Also, since the method uses hash maps and arrays for searching and processing, its time complexity does not depend on the number of tasks, so it is also efficient.
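A minimal sketch of that idea, with stand-in types (the name `taskSetsByStageIdAndAttempt` mirrors the diff below; everything else is simplified and not the PR's exact code):

```scala
import scala.collection.mutable

// Minimal stand-ins for the real scheduler classes (illustrative only).
class TaskSetManager(numTasks: Int, val partitionToIndex: Map[Int, Int]) {
  val successful = new Array[Boolean](numTasks)

  def markPartitionIdAsCompletedForTaskAttempt(index: Int): Unit = {
    successful(index) = true
  }
}

class SchedulerSketch {
  // stageId -> (stageAttemptId -> TaskSetManager)
  val taskSetsByStageIdAndAttempt =
    mutable.HashMap.empty[Int, mutable.HashMap[Int, TaskSetManager]]

  // On a successful result task, mark its partition completed in every
  // attempt (zombie or active) of the same stage. Each step is a hash-map
  // lookup or an array write, so the cost grows with the number of
  // attempts, not the number of tasks.
  def markPartitionCompletedInAllAttempts(stageId: Int, partitionId: Int): Unit = {
    taskSetsByStageIdAndAttempt.get(stageId).foreach { attempts =>
      attempts.values.foreach { tsm =>
        tsm.partitionToIndex.get(partitionId).foreach { index =>
          tsm.markPartitionIdAsCompletedForTaskAttempt(index)
        }
      }
    }
  }
}
```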
Test build #97922 has finished for PR 22806 at commit
…RK-25250 [SPARK-25250] : Upmerging with master to fix unit tests
Test build #97939 has finished for PR 22806 at commit
@@ -1091,6 +1091,10 @@ private[spark] class TaskSetManager(
  def executorAdded() {
    recomputeLocality()
  }

  def markPartitionIdAsCompletedForTaskAttempt(index: Int): Unit = {
    successful(index) = true
Should this method also make a call to TaskSetManager.maybeFinishTaskSet()?
Yes, have made the necessary changes. Thank you.
cc: @jiangxb1987 @cloud-fan @srowen
…RK-25250 [SPARK-25250]: Upmerging with master branch
Test build #100508 has finished for PR 22806 at commit
Multiline comment indentation
Thank you @sujithjay for your comment; I have updated the PR.
    partitionId: Int, stageId: Int): Unit = {
    taskSetsByStageIdAndAttempt.getOrElse(stageId, Map()).values.foreach { tsm =>
      val index: Option[Int] = tsm.partitionToIndex.get(partitionId)
      if (!index.isEmpty) {
Nit: it's more usual to match on the Option and use case Some(...), or use foreach. It avoids a couple lines of code here.
Same with the getOrElse above, I guess.
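For illustration, the shapes being compared here, on a toy `Option[Int]`:

```scala
val index: Option[Int] = Map(9000 -> 2000).get(9000)

// Verbose: test for emptiness, then unwrap with .get
if (!index.isEmpty) {
  println(s"index = ${index.get}")
}

// Pattern match on the Option
index match {
  case Some(i) => println(s"index = $i")
  case None    => // nothing to do
}

// foreach: the body runs only when the Option is non-empty
index.foreach(i => println(s"index = $i"))
```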
Makes sense, have updated the code.
        val taskInfoList = tsm.taskAttempts(index.get)
        taskInfoList.foreach { taskInfo =>
          if (taskInfo.running) {
            killTaskAttempt(taskInfo.taskId, false, "Corresponding Partition Id " + partitionId +
Nit: Id -> ID and use string interpolation.
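Concretely, something like the following; the tail of the message is hypothetical, since the diff line above is truncated:

```scala
val partitionId = 9000

// Before: concatenation, and "Id" rather than "ID"
val before = "Corresponding Partition Id " + partitionId + " ..."

// After: string interpolation
val after = s"Corresponding partition ID $partitionId ..."
```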
Done
      if (!index.isEmpty) {
        tsm.markPartitionIdAsCompletedForTaskAttempt(index.get)
        val taskInfoList = tsm.taskAttempts(index.get)
        taskInfoList.foreach { taskInfo =>
Consider taskInfoList.filter(_.running).foreach or for (taskInfo <- taskInfoList if taskInfo.running).
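Both suggested shapes, on toy types (the `kill` helper is a hypothetical stand-in for `killTaskAttempt`):

```scala
case class TaskInfo(taskId: Long, running: Boolean)

val taskInfoList = Seq(TaskInfo(1L, running = true), TaskInfo(2L, running = false))

def kill(taskId: Long): Unit = println(s"killing task $taskId")

// filter + foreach
taskInfoList.filter(_.running).foreach(info => kill(info.taskId))

// for-comprehension with a guard, equivalent in effect
for (taskInfo <- taskInfoList if taskInfo.running) {
  kill(taskInfo.taskId)
}
```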
Done
   * result, we do not see any Killed tasks due to TaskCommitDenied Exceptions showing up
   * in the UI.
   */
  override def markPartitionIdAsCompletedAndKillCorrespondingTaskAttempts(
Doesn't this logic overlap with killAllTaskAttempts? Should it reuse that logic? I understand it does something a little different, and I don't know this code well, but it seems like there are related but separate implementations of something similar here.
As far as I understand the code, killAllTaskAttempts kills all the running tasks for a particular stage, whereas markPartitionIdAsCompletedAndKillCorrespondingTaskAttempts kills all running tasks, across all stage attempts, that are working on a particular partition which has already been marked as completed by one of the previously running tasks for that partition. So the logic is different in the two cases, but we can modify the code to have one method performing both of these tasks. Let me know what you think!
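A heavily simplified sketch of the distinction being described, with stand-in types rather than the real scheduler API:

```scala
case class TaskInfo(taskId: Long, partitionId: Int, running: Boolean)

// Each stage maps to a sequence of attempts; each attempt is a list of tasks.
class KillPolicySketch(attemptsByStage: Map[Int, Seq[Seq[TaskInfo]]]) {
  private def kill(taskId: Long, reason: String): Unit =
    println(s"killing task $taskId: $reason")

  // killAllTaskAttempts-style: every running task of one stage goes.
  def killStage(stageId: Int): Unit =
    for {
      attempt <- attemptsByStage.getOrElse(stageId, Nil)
      task    <- attempt if task.running
    } kill(task.taskId, s"stage $stageId aborted")

  // This PR's method: across all attempts of the stage, only the running
  // tasks whose partition has already completed elsewhere go.
  def killRedundant(stageId: Int, completedPartition: Int): Unit =
    for {
      attempt <- attemptsByStage.getOrElse(stageId, Nil)
      task    <- attempt if task.running && task.partitionId == completedPartition
    } kill(task.taskId, s"partition $completedPartition already completed")
}
```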
@@ -39,6 +40,14 @@ class FakeSchedulerBackend extends SchedulerBackend {
  def reviveOffers() {}
  def defaultParallelism(): Int = 1
  def maxNumConcurrentTasks(): Int = 0
  val killedTaskIds: HashSet[Long] = new HashSet[Long]()
Nit: why not a Scala mutable Set?
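i.e., roughly (assuming the original used `java.util.HashSet`):

```scala
// Before: Java collection
val javaKilled: java.util.HashSet[Long] = new java.util.HashSet[Long]()

// Suggested: Scala mutable Set, which composes with Scala collection APIs
val killedTaskIds = scala.collection.mutable.HashSet.empty[Long]
killedTaskIds += 42L
assert(killedTaskIds.contains(42L))
```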
Done
@@ -1319,4 +1328,26 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B
    tsm.handleFailedTask(tsm.taskAttempts.head.head.taskId, TaskState.FAILED, TaskKilled("test"))
    assert(tsm.isZombie)
  }

  test("SPARK-25250 On successful completion of a task attempt on a partition id, kill other" +
Nit: blank line above, and please use spaces in the body of the test to break this up into more readable chunks.
Done
Test build #100509 has finished for PR 22806 at commit
Can you explain the code path that does it? This part is a little convoluted…
I think we can expand #21131's behavior here. Instead of killing other running task attempts actively, we could update a stage's completed partitions once it is submitted to …
@cloud-fan I do not fully understand your question, but this is what happens according to the current behaviour: …
@Ngone51 I see your point, and yes, the key difference is that PR #21131 marks a partition as completed across all task attempts for the same stage attempt, whereas this PR does that for all task attempts across all stage attempts of the corresponding stage. We can extend the same behaviour here, but as far as killing running task attempts is concerned, that saves a lot of resources for long-running tasks which might be consuming resources even though they have become redundant. I could be wrong here as well, so let me know your thoughts. Thank you.
when task 5005 finishes, stage 4.1 will mark partition 9000 as completed, assuming we are talking about the latest code base, with #21131 merged. Did I miss something?
@cloud-fan yes, ideally that should happen. However, this problem occurs only when a task from the previous stage attempt finishes just before the new attempt for the same stage gets created, as described in the Jira. So when the code in PR #21131 executes, the tasksets for stage 4.1 have been submitted but not yet created, so …
Test build #100644 has finished for PR 22806 at commit
I see. Another question: I know the …
@Ngone51 That is indeed a good question. I have not seen the error before for a ShuffleMapStage, but let me try to reproduce it if I can. Thank you.
ooh this is a very good point -- really this is a bug in my change from #21131. It seems to me that really the original change should be removed, and you should do this instead -- but only kill tasks if it's in a result stage.
@squito Thank you for your suggestion, I have one question though. This was brought up by @Ngone51. I have seen this bug coming up in the case of a FetchFailure for a ResultStage; however, I was not able to reproduce the same for a ShuffleMapStage. Could it be that this issue might also be affecting ShuffleMapStage? Should I also add the same fix on task completion for a ShuffleMapStage? WDYT?
well, you wouldn't see the exact same problem with a ShuffleMapStage, but you could still have the same problem in general. Certainly you wouldn't see TaskCommitDenied, since that's only with result stages. I think the problem you've mentioned could lead to multiple active task sets as described in SPARK-23433; it requires a task to get completed from the zombie taskset, but before the DAGScheduler has launched a new taskset. How are you trying to test this, that you aren't able to reproduce?

Also I'd change the PR description to focus on that key point, something like:

[SPARK-25250][CORE] Late zombie task completions handled correctly even before new taskset launched

SPARK-23433 tried to ensure that late task completions from a zombie taskset were properly updated in all tasksets for the stage (zombie or not). However, because it did this outside of the DAGScheduler event loop, the DAGScheduler could launch another taskset for the stage at the same time, before it had updated the set of tasks left to run. The new active taskset would never learn about the old completion from the zombie taskset. This could lead to multiple active tasksets (as described in SPARK-23433) or result stage failure from repeated TaskCommitDenied exceptions for the task which the final stage attempts to re-run.

This change fixes it by moving that logic into the DAGScheduler event loop. As an optimization, it also kills tasks in all task sets, but only if it's a result stage (an extension of SPARK-25773).
also cc @jiangxb1987 @markhamstra
IMHO, the updated PR is far away from my proposal mentioned above. And considering this PR's discussion may be too long for other people to follow, I'd like to post another PR separately.
I would like to disagree with you on the above, @Ngone51; the PR is an implementation of your proposal, but with small changes.
In your proposal above, you have mentioned that we can look up the map … I have basically come up with a solution to ensure that once the partition completes, other tasks running on the same partition fail once and then do not get rescheduled, as is happening currently. The problem with the old change was that we were not using a lock while calling …
Test this please
@@ -1383,6 +1383,7 @@ private[spark] class DAGScheduler(

    event.reason match {
      case Success =>
        taskScheduler.markPartitionCompletedFromEventLoop(task.partitionId, task.stageId)
@pgandhi999 I don't follow your explanation -- I think I agree w/ @Ngone51. Yes, TSM calls DAGScheduler.taskEnded while it has a lock on the TaskSchedulerImpl, but the taskEnded call just puts an event into the queue and then returns. So this part is happening w/out the lock. But maybe you meant something else?
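A stripped-down sketch of that point (hypothetical names; the real DAGScheduler event loop is far more involved): the post happens under the scheduler lock, but the handler runs later on a different thread, outside the lock.

```scala
import java.util.concurrent.LinkedBlockingQueue

object EventLoopSketch {
  case class TaskEnded(stageId: Int, partitionId: Int)

  private val eventQueue = new LinkedBlockingQueue[TaskEnded]()

  // Called by the TaskSetManager while it holds the TaskSchedulerImpl lock:
  // it only enqueues and returns, so nothing is processed under the lock.
  def taskEnded(stageId: Int, partitionId: Int): Unit =
    eventQueue.put(TaskEnded(stageId, partitionId))

  // Runs on a separate event-loop thread. Between the enqueue above and
  // this handler running, the scheduler may already have created a new
  // task set for the same stage.
  private val loop = new Thread(() => {
    while (true) println(s"handling ${eventQueue.take()}")
  })
  loop.setDaemon(true)
  loop.start()
}
```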
@@ -920,6 +923,9 @@ private[spark] class TaskSetManager(
        s" be re-executed (either because the task failed with a shuffle data fetch failure," +
        s" so the previous stage needs to be re-run, or because a different copy of the task" +
        s" has already succeeded).")
    } else if (sched.stageIdToFinishedPartitions.get(stageId).exists(
        partitions => partitions.contains(tasks(index).partitionId))) {
      sched.markPartitionCompletedInAllTaskSets(stageId, tasks(index).partitionId, info)
stageIdToFinishedPartitions is getting updated in the DAG scheduler event loop, but here you're querying it outside of the event loop, which is definitely not safe.
(I'm also not really convinced this would solve the problem, but I'm afraid I have to spend a bit more time paging it all back in...)
Yes, that is correct: stageIdToFinishedPartitions is getting updated in the event loop and is being queried outside the event loop within the TaskSchedulerImpl lock. However, I did not get why it is not safe, as the only thing that can happen is that, while querying, the TSM does not find the partition in the HashSet while it is being updated, but it will definitely catch this in the next check. I could be wrong though.
hashmaps are totally unsafe to use from multiple threads -- it's not just getting inconsistent values, it's that the hashmap may be in some undefined state b/c of rehashing. See e.g. http://javabypatel.blogspot.com/2016/01/infinite-loop-in-hashmap.html (I just skimmed this but I think it has the right idea).
I see; in that case, what if I turn stageIdToFinishedPartitions into a ConcurrentHashMap? That should take care of the safety issue.
that will take care of the access-during-rehash, but I'm still not sure the values of this are safe. Here you're querying it from a task-result-getter thread, but you're also updating it from the dag scheduler event loop, with no lock protecting access.
The other proposal is easier to reason about, because it keeps this structure protected by a lock on TaskSchedulerImpl.
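i.e., something shaped roughly like this (an illustrative sketch of that proposal, not its final code):

```scala
import scala.collection.mutable

class TaskSchedulerImplSketch {
  // All access goes through methods synchronized on `this`, so the
  // task-result-getter threads and the event loop can never race on it.
  private val stageIdToFinishedPartitions =
    mutable.HashMap.empty[Int, mutable.HashSet[Int]]

  def markPartitionFinished(stageId: Int, partitionId: Int): Unit = synchronized {
    stageIdToFinishedPartitions.getOrElseUpdate(stageId, mutable.HashSet.empty) += partitionId
  }

  def isPartitionFinished(stageId: Int, partitionId: Int): Boolean = synchronized {
    stageIdToFinishedPartitions.get(stageId).exists(_.contains(partitionId))
  }
}
```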
…bout the finished partitions

## What changes were proposed in this pull request?

This is an optional solution for #22806.

#21131 first implemented that a previously successfully completed task from a zombie TaskSetManager could also succeed the active TaskSetManager, based on the assumption that an active TaskSetManager always exists for that stage when this happens. But that's not always true, as an active TaskSetManager may not have been created yet when a previous task succeeds, and this is the reason why #22806 hit the issue.

This PR extends #21131's behavior by adding `stageIdToFinishedPartitions` to TaskSchedulerImpl, which records the finished partitions whenever a task (from a zombie or active TaskSetManager) succeeds. Thus, a later-created active TaskSetManager can also learn about the finished partitions by looking into `stageIdToFinishedPartitions` and won't launch any duplicate tasks.

## How was this patch tested?

Added tests.

Closes #23871 from Ngone51/dev-23433-25250.

Lead-authored-by: wuyi <ngone_5451@163.com>
Co-authored-by: Ngone51 <ngone_5451@163.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
(cherry picked from commit e5c6143)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
mind closing this now @pgandhi999?
…a stage

## What changes were proposed in this pull request?

This is another attempt to fix the more-than-one-active-task-set-managers bug.

#17208 is the first attempt. It marks the TSM as zombie before sending a task completion event to DAGScheduler. This is necessary, because when the DAGScheduler gets the task completion event, and it's for the last partition, then the stage is finished. However, if it's a shuffle stage and it has missing map outputs, DAGScheduler will resubmit it (see the [code](https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1416-L1422)) and create a new TSM for this stage. This leads to more than one active TSM for a stage, and failure.

This fix has a hole. Let's say a stage has 10 partitions and 2 task set managers: TSM1 (zombie) and TSM2 (active). TSM1 has a running task for partition 10, and it completes. TSM2 finishes tasks for partitions 1-9, and thinks it is still active because it hasn't finished partition 10 yet. However, DAGScheduler gets task completion events for all 10 partitions and thinks the stage is finished. Then the same problem occurs: DAGScheduler may resubmit the stage and cause the more-than-one-active-TSM error.

#21131 fixed this hole by notifying all the task set managers when a task finishes. For the above case, TSM2 will know that partition 10 is already completed, so it can mark itself as zombie after partitions 1-9 are completed.

However, #21131 still has a hole: TSM2 may be created after the task from TSM1 is completed. Then TSM2 can't get notified about the task completion, which leads to the more-than-one-active-TSM error. #22806 and #23871 were created to fix this hole. However, the fix is complicated and there are still ongoing discussions.

This PR proposes a simple fix, which can be easy to backport: mark all existing task set managers as zombie when trying to create a new task set manager.

After this PR, #21131 is still necessary, to avoid launching unnecessary tasks and fix [SPARK-25250](https://issues.apache.org/jira/browse/SPARK-25250). #22806 and #23871 are its followups to fix the hole.

## How was this patch tested?

Existing tests.

Closes #23927 from cloud-fan/scheduler.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
(cherry picked from commit cb20fbc)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
Sure. Thank you @squito.
We recently had a scenario where a race condition occurred: a task from the previous stage attempt finished just before the new attempt for the same stage was created due to a fetch failure. As a result, the new task created in the second attempt on the same partition id kept retrying multiple times due to a TaskCommitDenied exception, without realizing that the task in the earlier attempt had already succeeded.
For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure, so we spawn a new stage attempt 4.1. Within this timespan, the above task completes successfully, marking partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the taskset info for that stage is not available to the TaskScheduler, so naturally partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same partition id 9000. This task fails due to a CommitDeniedException and, since it does not see the corresponding partition id marked as successful, it keeps retrying multiple times until the job finally succeeds. It doesn't cause any job failures because the DAGScheduler is tracking the partitions separately from the task set managers.
What changes were proposed in this pull request?
SPARK-23433 tried to ensure that late task completions from a zombie taskset were properly updated in all tasksets for the stage (zombie or not). However, because it did this outside of the DAGScheduler event loop, the DAGScheduler could launch another taskset for the stage at the same time, before it had updated the set of tasks left to run. The new active taskset would never learn about the old completion from the zombie taskset. This could lead to multiple active tasksets (as described in SPARK-23433) or result stage failure from repeated TaskCommitDenied exceptions for the task which the final stage attempts to re-run.
This change fixes it by duplicating that logic into the DAGScheduler event loop. So now, on completion of a task, we maintain a map from stage id to completed partitions, which we update from the DAGScheduler event loop. When any task fails in the TSM, the TSM checks in this map whether the corresponding partition is already complete and, based on that, marks the corresponding partition as complete.
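A toy model of that failure-path check (hypothetical names, heavily simplified from the diffs shown earlier):

```scala
import scala.collection.mutable

object FailedTaskSketch {
  // stage id -> partitions already completed by some (possibly zombie) attempt.
  // Written only from the DAGScheduler event loop on task Success.
  val stageIdToFinishedPartitions = mutable.HashMap.empty[Int, mutable.HashSet[Int]]

  def handleFailedTask(stageId: Int, partitionId: Int): Unit = {
    val alreadyFinished =
      stageIdToFinishedPartitions.get(stageId).exists(_.contains(partitionId))
    if (alreadyFinished) {
      // Some earlier attempt already committed this partition: record it as
      // complete rather than rescheduling, so the task stops retrying on
      // TaskCommitDenied.
      println(s"stage $stageId partition $partitionId already done; not rescheduling")
    } else {
      println(s"rescheduling stage $stageId partition $partitionId")
    }
  }
}
```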
How was this patch tested?
The screenshot for the bug is attached below:
In the above screenshot, you can see that for partition id 17352, one of the task attempts in the previous stage succeeded, so for the current stage attempt we get a TaskCommitDenied exception each time; instead of killing the task, Spark keeps retrying multiple times until the application exits.