[SPARK-32003][CORE][3.0] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost #29193

Closed
wants to merge 1 commit into branch-3.0 from wypoon/SPARK-32003-3.0

Conversation

@wypoon (Contributor) commented Jul 22, 2020

What changes were proposed in this pull request?

If an executor is lost, the `DAGScheduler` handles the executor loss by removing the executor, but it does not unregister the executor's outputs if the external shuffle service is used. However, if the node on which the executor ran is lost, the shuffle service may no longer be able to serve the shuffle files.
In such a case, when fetches of the executor's outputs fail in the same stage, the `DAGScheduler` removes the executor again and, by rights, should unregister its outputs. It does not, because the epoch used to track the executor failure has not increased.

We now track the epoch separately for executor failures that result in lost shuffle files, so that we can unregister the outputs in this scenario. The idea to track a second epoch is due to Attila Zsolt Piros.
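
To illustrate the idea, here is a minimal, self-contained sketch (not the actual `DAGScheduler` code; the names `executorFailureEpoch`, `shuffleFileLostEpoch`, and the `handleExecutorLost` signature are illustrative assumptions):

```scala
import scala.collection.mutable

// Sketch of the "second epoch" idea: executor removal and shuffle-file loss are
// tracked against separate epochs, so a later fetch failure can still trigger
// unregistration even though the executor itself was already handled.
class EpochTracker {
  // Epoch at which each executor was last handled as lost.
  private val executorFailureEpoch = mutable.HashMap[String, Long]()
  // Epoch at which each executor's shuffle files were last treated as lost.
  private val shuffleFileLostEpoch = mutable.HashMap[String, Long]()

  // Handle an executor-lost or fetch-failure event observed at `epoch`.
  // `fileLost` is true when the shuffle files must be assumed gone (no external
  // shuffle service, or the whole host is down).
  // Returns true if the executor's map outputs should be unregistered.
  def handleExecutorLost(execId: String, fileLost: Boolean, epoch: Long): Boolean = {
    if (!executorFailureEpoch.contains(execId) || executorFailureEpoch(execId) < epoch) {
      executorFailureEpoch(execId) = epoch
      // ... remove the executor from the scheduler's bookkeeping here ...
    }
    if (fileLost &&
        (!shuffleFileLostEpoch.contains(execId) || shuffleFileLostEpoch(execId) < epoch)) {
      shuffleFileLostEpoch(execId) = epoch
      true // unregister the executor's shuffle outputs
    } else {
      false
    }
  }
}
```

With a single epoch map, the second check would be gated by the same entry that was already written when the executor was removed, so the outputs would stay registered until the next stage attempt.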

Why are the changes needed?

Without the changes, the loss of a node could require two stage attempts to recover instead of one.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test. This test fails without the change and passes with it.
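
Purely for illustration (this is not the actual `DAGSchedulerSuite` test), the scenario the new test exercises can be walked through with the sketch above; the object name and epoch values below are made up:

```scala
// External shuffle service in use: the initial executor loss does not lose the files,
// but a later fetch failure from the same (now dead) node does.
object LostNodeScenarioDemo extends App {
  val tracker = new EpochTracker

  // Executor removed at epoch 2; with the external shuffle service its outputs are kept.
  assert(!tracker.handleExecutorLost("exec-1", fileLost = false, epoch = 2))

  // Fetch failure in the same stage attempt, still at epoch 2, but now the files are
  // known to be lost. With only the executor-failure epoch this would be skipped
  // (2 is not greater than 2); the separate shuffle-file epoch lets it proceed.
  assert(tracker.handleExecutorLost("exec-1", fileLost = true, epoch = 2))
}
```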

@wypoon (Contributor, Author) commented Jul 22, 2020

This is a backport of #28848 to branch-3.0. The backport of DAGScheduler.scala is straightforward, with a minor diff conflict in a comment. The backport of DAGSchedulerSuite.scala needed some minor adjustments; it sits between the version in master and the version in branch-2.4.

@SparkQA commented Jul 22, 2020

Test build #126354 has finished for PR 29193 at commit e54f221.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 23, 2020

retest this please

@SparkQA commented Jul 23, 2020

Test build #126370 has finished for PR 29193 at commit e54f221.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 23, 2020

retest this please

@SparkQA commented Jul 23, 2020

Test build #126425 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 23, 2020

org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.Fallback Parquet V2 to V1 failed in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126425; however, earlier, it passed in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126354/. It must be flaky.
Update: this was reported in https://issues.apache.org/jira/browse/SPARK-32054

@wypoon (Contributor, Author) commented Jul 23, 2020

retest this please

@wypoon (Contributor, Author) commented Jul 23, 2020

@dongjoon-hyun are you aware of any CI issues currently? I think an issue with PySpark pip packaging tests was fixed recently: SPARK-32303. I saw the same symptom again in build #126354 above.
I don't know if the problem is branch-3.0 specific.

I also hit the bad .m2 problem in build #126370.

@SparkQA commented Jul 24, 2020

Test build #126434 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 24, 2020

retest this please

@SparkQA commented Jul 24, 2020

Test build #126454 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito (Contributor) commented Jul 24, 2020

The failure in test run 126434, "BarrierTaskContextSuite.global sync by barrier() call", was supposedly fixed here: SPARK-31730

@dongjoon-hyun (Member) commented:

Unfortunately, we don't have GitHub Actions coverage on branch-3.0. Please re-trigger Jenkins until it passes, @wypoon.

@wypoon (Contributor, Author) commented Jul 24, 2020

Urgh, BarrierTaskContextSuite seems flaky (it failed in two builds now, but passed in two earlier builds that ran into other issues), although the failing test is not the same each time. And this time, it ran into Kafka failures as well.

@wypoon (Contributor, Author) commented Jul 24, 2020

retest this please

@dongjoon-hyun (Member) commented Jul 24, 2020

Yeah, I agree that those tests are really flaky. However, without passing them in Scala/Java, we cannot reach the PySpark/SparkR UT stages.

@SparkQA commented Jul 24, 2020

Test build #126507 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 24, 2020

retest this please

@SparkQA commented Jul 25, 2020

Test build #126515 has finished for PR 29193 at commit e54f221.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 25, 2020

Finally, this has passed the tests!

@SparkQA commented Aug 3, 2020

Test build #5056 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 3, 2020

Test build #5057 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Aug 3, 2020

retest this please

@SparkQA commented Aug 4, 2020

Test build #127010 has finished for PR 29193 at commit e54f221.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request on Aug 4, 2020
…ister outputs for executor on fetch failure after executor is lost

Closes #29193 from wypoon/SPARK-32003-3.0.

Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
@squito (Contributor) commented Aug 4, 2020

Merged, thanks @wypoon.

@wypoon closed this Aug 4, 2020