[SPARK-32643][CORE][K8s] Consolidate state decommissioning in the TaskSchedulerImpl realm #29452

Closed


agrawaldevesh
Contributor

What changes were proposed in this pull request?

The decommissioning state is a bit fragmented across two places in the TaskSchedulerImpl:

#29014 stored the incoming decommission info messages in TaskSchedulerImpl.executorsPendingDecommission.
#28619 stored just the executor end time in the map TaskSetManager.tidToExecutorKillTimeMapping (which in turn is contained in TaskSchedulerImpl).
While the two states are not really overlapping, it's a bit of a code hygiene concern to save this state in two places.

With #29422, TaskSchedulerImpl is emerging as the place where all decommissioning bookkeeping is kept within the driver. This PR therefore consolidates the information in tidToExecutorKillTimeMapping into executorsPendingDecommission.

However, to do so, we need to move away from keeping the raw ExecutorDecommissionInfo messages and instead keep a separate class, ExecutorDecommissionState. This decoupling allows the RPC message class ExecutorDecommissionInfo to evolve independently of the bookkeeping class ExecutorDecommissionState.
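
As a rough illustration of this decoupling, the Scala sketch below keeps a single map of bookkeeping records and derives the expected kill time from it, rather than storing that time in a second map. The field names, the `DecommissionTracker` class, the `killIntervalMs` parameter, and the choice to keep the first record when a message is repeated are all assumptions made for illustration; they are not taken from the code in this PR.

```scala
// Illustrative sketch only: fields and the tracker class are hypothetical,
// not the code merged in this PR.
import scala.collection.mutable

// RPC message as received by the driver (hypothetical fields).
final case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)

// Driver-side bookkeeping record, decoupled from the RPC message (hypothetical fields).
final case class ExecutorDecommissionState(
    message: String,
    isHostDecommissioned: Boolean,
    startTimeMs: Long)

// Hypothetical stand-in for the single map kept by the scheduler.
final class DecommissionTracker(killIntervalMs: Option[Long]) {
  private val executorsPendingDecommission =
    mutable.HashMap.empty[String, ExecutorDecommissionState]

  // Convert the incoming RPC message into a bookkeeping record.
  def executorDecommission(
      executorId: String,
      info: ExecutorDecommissionInfo,
      nowMs: Long): Unit = {
    // Assumption of this sketch: keep the first record so that repeated
    // messages do not reset the start time.
    if (!executorsPendingDecommission.contains(executorId)) {
      executorsPendingDecommission(executorId) =
        ExecutorDecommissionState(info.message, info.isHostDecommissioned, nowMs)
    }
  }

  def getState(executorId: String): Option[ExecutorDecommissionState] =
    executorsPendingDecommission.get(executorId)

  // Estimated kill time, computed from the stored start time instead of a second map.
  def expectedKillTimeMs(executorId: String): Option[Long] =
    for {
      state <- executorsPendingDecommission.get(executorId)
      interval <- killIntervalMs
    } yield state.startTimeMs + interval
}
```

Because the bookkeeping record is its own type, the RPC message could gain or lose fields without touching the map's value type; only the conversion in `executorDecommission` would need to change.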

Why are the changes needed?

This is just a code cleanup. These two features were added independently, and it's time to consolidate their state for good hygiene.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

@agrawaldevesh
Contributor Author

agrawaldevesh commented Aug 18, 2020

cc: @holdenk and @prakharjain09 ... This PR simply does some state cleanup/consolidation without making any semantic changes. I would be grateful for your review. I have also created a Jira associated with this. Thanks!

Also @Ngone51 and @cloud-fan

@SparkQA

SparkQA commented Aug 18, 2020

Test build #127518 has finished for PR 29452 at commit 61ac7b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExecutorDecommissionState(message: String,

@Ngone51
Member

Ngone51 commented Aug 18, 2020

@agrawaldevesh Could you please add the [CORE] tag in the PR title like other PRs?

@agrawaldevesh agrawaldevesh changed the title [SPARK-32643] Consolidate state decommissioning in the TaskSchedulerImpl realm [SPARK-32643][CORE] Consolidate state decommissioning in the TaskSchedulerImpl realm Aug 18, 2020
@SparkQA

SparkQA commented Aug 18, 2020

Test build #127522 has finished for PR 29452 at commit 97ad7fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExecutorDecommissionState(message: String,

@SparkQA

SparkQA commented Aug 18, 2020

Test build #127590 has finished for PR 29452 at commit 7449fa2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExecutorDecommissionState(

@SparkQA

SparkQA commented Aug 19, 2020

Test build #127625 has finished for PR 29452 at commit 6a5be83.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 19, 2020

Test build #127627 has finished for PR 29452 at commit 9222f05.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Aug 19, 2020

Test build #127629 has finished for PR 29452 at commit 9222f05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 19, 2020

Test build #127654 has finished for PR 29452 at commit ebd8408.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Aug 19, 2020

Can we tag this PR with Kubernetes (add [K8S] to the title) so it runs the K8s integration tests?

@agrawaldevesh agrawaldevesh changed the title [SPARK-32643][CORE] Consolidate state decommissioning in the TaskSchedulerImpl realm [SPARK-32643][CORE][K8S] Consolidate state decommissioning in the TaskSchedulerImpl realm Aug 19, 2020
@agrawaldevesh agrawaldevesh changed the title [SPARK-32643][CORE][K8S] Consolidate state decommissioning in the TaskSchedulerImpl realm [SPARK-32643][CORE][K8s] Consolidate state decommissioning in the TaskSchedulerImpl realm Aug 19, 2020
@SparkQA

SparkQA commented Aug 19, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32287/

@SparkQA

SparkQA commented Aug 19, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32287/

@agrawaldevesh
Contributor Author

@holdenk, I think we should disable this K8s integration test or mark it as flaky. I have seen it fail many times now:

- Run SparkPi with no resources *** FAILED ***
  The code passed to eventually never returned normally. Attempted 190 times over 3.0025253718499996 minutes. Last failure message: false was not true. (KubernetesSuite.scala:383)

(in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32521/console). I can create a ticket for this.

@SparkQA

SparkQA commented Aug 25, 2020

Test build #127895 has finished for PR 29452 at commit 9389ed5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32531/

@SparkQA

SparkQA commented Aug 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32531/

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127905 has finished for PR 29452 at commit 6b9ded4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@agrawaldevesh
Contributor Author

retest this please

@agrawaldevesh
Contributor Author

jenkins retest this please

@SparkQA

SparkQA commented Aug 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32539/

@SparkQA

SparkQA commented Aug 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32539/

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127913 has finished for PR 29452 at commit 6b9ded4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127917 has finished for PR 29452 at commit 29fe131.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@agrawaldevesh
Contributor Author

Hi @holdenk, this PR is back in your court for another review. @Ngone51 is fine with the new set of changes. If it looks good, please feel free to merge it in. Thank you!

@holdenk
Contributor

holdenk commented Aug 26, 2020

Jenkins retest this please

@SparkQA

SparkQA commented Aug 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32559/

@SparkQA

SparkQA commented Aug 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32559/

@holdenk
Contributor

holdenk commented Aug 26, 2020

This looks good to me, so I'm going to go ahead and merge this.

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127933 has finished for PR 29452 at commit 29fe131.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in b786f31 Aug 26, 2020
@agrawaldevesh
Contributor Author

Thank you @holdenk for shepherding this all the way through!

@holdenk
Contributor

holdenk commented Aug 27, 2020

Thanks for working on this so much! I’m really excited to launch this feature in 3.1 👍

7 participants