[SPARK-32003][CORE][3.0] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost #29193

Closed
wants to merge 1 commit into branch-3.0 from wypoon/SPARK-32003-3.0

Conversation

@wypoon (Contributor) commented Jul 22, 2020

What changes were proposed in this pull request?

If an executor is lost, the `DAGScheduler` handles the executor loss by removing the executor, but it does not unregister the executor's outputs if the external shuffle service is used. However, if the node on which the executor ran is lost, the shuffle service may no longer be able to serve the shuffle files.
In such a case, when fetches of the executor's outputs fail in the same stage, the `DAGScheduler` removes the executor again and, by rights, should unregister its outputs. It does not, because the epoch used to track the executor failure has not increased.

We now track the epoch separately for executor failures that result in lost shuffle files, so that we can unregister the outputs in this scenario. The idea to track a second epoch is due to Attila Zsolt Piros.
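
To illustrate the idea, here is a minimal, self-contained sketch (not the actual `DAGScheduler` code; the names `executorFailureEpoch`, `shuffleFileLostEpoch`, and the `handleExecutorLost` signature are illustrative assumptions):

```scala
import scala.collection.mutable

// Sketch of the "second epoch" idea: executor removal and shuffle-file loss are
// tracked against separate epochs, so a later fetch failure can still trigger
// unregistration even though the executor itself was already handled.
class EpochTracker {
  // Epoch at which each executor was last handled as lost.
  private val executorFailureEpoch = mutable.HashMap[String, Long]()
  // Epoch at which each executor's shuffle files were last treated as lost.
  private val shuffleFileLostEpoch = mutable.HashMap[String, Long]()

  // Handle an executor-lost or fetch-failure event observed at `epoch`.
  // `fileLost` is true when the shuffle files must be assumed gone (no external
  // shuffle service, or the whole host is down).
  // Returns true if the executor's map outputs should be unregistered.
  def handleExecutorLost(execId: String, fileLost: Boolean, epoch: Long): Boolean = {
    if (!executorFailureEpoch.contains(execId) || executorFailureEpoch(execId) < epoch) {
      executorFailureEpoch(execId) = epoch
      // ... remove the executor from the scheduler's bookkeeping here ...
    }
    if (fileLost &&
        (!shuffleFileLostEpoch.contains(execId) || shuffleFileLostEpoch(execId) < epoch)) {
      shuffleFileLostEpoch(execId) = epoch
      true // unregister the executor's shuffle outputs
    } else {
      false
    }
  }
}
```

With a single epoch map, the second check would be gated by the same entry that was already written when the executor was removed, so the outputs would stay registered until the next stage attempt.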

Why are the changes needed?

Without the changes, the loss of a node could require two stage attempts to recover instead of one.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test. This test fails without the change and passes with it.
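
Purely for illustration (this is not the actual `DAGSchedulerSuite` test), the scenario the new test exercises can be walked through with the sketch above; the object name and epoch values below are made up:

```scala
// External shuffle service in use: the initial executor loss does not lose the files,
// but a later fetch failure from the same (now dead) node does.
object LostNodeScenarioDemo extends App {
  val tracker = new EpochTracker

  // Executor removed at epoch 2; with the external shuffle service its outputs are kept.
  assert(!tracker.handleExecutorLost("exec-1", fileLost = false, epoch = 2))

  // Fetch failure in the same stage attempt, still at epoch 2, but now the files are
  // known to be lost. With only the executor-failure epoch this would be skipped
  // (2 is not greater than 2); the separate shuffle-file epoch lets it proceed.
  assert(tracker.handleExecutorLost("exec-1", fileLost = true, epoch = 2))
}
```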

@wypoon (Contributor, Author) commented Jul 22, 2020

This is a backport of #28848 to branch-3.0. The backport of DAGScheduler.scala is straightforward, with a minor diff conflict in a comment. The backport of DAGSchedulerSuite.scala needed some minor adjustments; it sits between the version in master and the version in branch-2.4.

@SparkQA commented Jul 22, 2020

Test build #126354 has finished for PR 29193 at commit e54f221.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 23, 2020

retest this please

@SparkQA commented Jul 23, 2020

Test build #126370 has finished for PR 29193 at commit e54f221.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 23, 2020

retest this please

@SparkQA commented Jul 23, 2020

Test build #126425 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 23, 2020

org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.Fallback Parquet V2 to V1 failed in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126425; however, earlier, it passed in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126354/. It must be flaky.
Update: this was reported in https://issues.apache.org/jira/browse/SPARK-32054

@wypoon (Contributor, Author) commented Jul 23, 2020

retest this please

@wypoon (Contributor, Author) commented Jul 23, 2020

@dongjoon-hyun are you aware of any CI issues currently? I think an issue with PySpark pip packaging tests was fixed recently: SPARK-32303. I saw the same symptom again in build #126354 above.
I don't know if the problem is branch-3.0 specific.

I also hit the bad .m2 problem in build #126370.

@SparkQA commented Jul 24, 2020

Test build #126434 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 24, 2020

retest this please

@SparkQA commented Jul 24, 2020

Test build #126454 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito (Contributor) commented Jul 24, 2020

The failure in test run 126434, "BarrierTaskContextSuite.global sync by barrier() call", was supposedly fixed here: SPARK-31730

@dongjoon-hyun (Member) commented:

Unfortunately, we don't have GitHub Actions coverage on branch-3.0. Please re-trigger Jenkins until it passes, @wypoon.

@wypoon (Contributor, Author) commented Jul 24, 2020

Urgh, BarrierTaskContextSuite seems flaky (it failed in two builds now, but passed in two earlier builds that ran into other issues), although the failing test is not the same each time. And this time, it ran into Kafka failures as well.

@wypoon (Contributor, Author) commented Jul 24, 2020

retest this please

@dongjoon-hyun (Member) commented Jul 24, 2020

Yeah, I agree that those tests are really flaky. However, without passing them in Scala/Java, we cannot reach the PySpark/SparkR UT stages.

@SparkQA commented Jul 24, 2020

Test build #126507 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 24, 2020

retest this please

@SparkQA commented Jul 25, 2020

Test build #126515 has finished for PR 29193 at commit e54f221.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 25, 2020

Finally, this has passed the tests!

@SparkQA commented Aug 3, 2020

Test build #5056 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 3, 2020

Test build #5057 has finished for PR 29193 at commit e54f221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Aug 3, 2020

retest this please

@SparkQA commented Aug 4, 2020

Test build #127010 has finished for PR 29193 at commit e54f221.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request on Aug 4, 2020
…ister outputs for executor on fetch failure after executor is lost

Closes #29193 from wypoon/SPARK-32003-3.0.

Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
@squito (Contributor) commented Aug 4, 2020

Merged, thanks @wypoon.

@wypoon closed this Aug 4, 2020