-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32643][CORE][K8s] Consolidate state decommissioning in the TaskSchedulerImpl realm #29452
[SPARK-32643][CORE][K8s] Consolidate state decommissioning in the TaskSchedulerImpl realm #29452
Conversation
61ac7b8
to
97ad7fe
Compare
cc: @holdenk and @prakharjain09 ... This PR simply does some state cleanup/consolidation without making any semantic changes. I would be grateful for your review. I have also created a Jira associated with this. Thanks ! Also @Ngone51 and @cloud-fan |
Test build #127518 has finished for PR 29452 at commit
|
@agrawaldevesh Could you please add the |
Test build #127522 has finished for PR 29452 at commit
|
core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala
Outdated
Show resolved
Hide resolved
97ad7fe
to
7449fa2
Compare
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
Outdated
Show resolved
Hide resolved
Test build #127590 has finished for PR 29452 at commit
|
core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
Outdated
Show resolved
Hide resolved
7449fa2
to
6a5be83
Compare
Test build #127625 has finished for PR 29452 at commit
|
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
Outdated
Show resolved
Hide resolved
6a5be83
to
9222f05
Compare
Test build #127627 has finished for PR 29452 at commit
|
retest this please |
Test build #127629 has finished for PR 29452 at commit
|
9222f05
to
ebd8408
Compare
Test build #127654 has finished for PR 29452 at commit
|
Can we tag this PR with Kubernetes (add [K8S] to the title) so it runs the K8s integration tests? |
ebd8408
to
3541ba9
Compare
Kubernetes integration test starting |
Kubernetes integration test status failure |
@holdenk, I think we should disable or mark flaky this K8s integration test: I have seen it fail many a times now:
(in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32521/console). I can create a ticket for this. |
Test build #127895 has finished for PR 29452 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #127905 has finished for PR 29452 at commit
|
retest this please |
jenkins retest this please |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #127913 has finished for PR 29452 at commit
|
6b9ded4
to
29fe131
Compare
Test build #127917 has finished for PR 29452 at commit
|
Jenkins retest this please |
Kubernetes integration test starting |
Kubernetes integration test status success |
This looks good to me I'm going to go ahead and merge this. |
Test build #127933 has finished for PR 29452 at commit
|
Thank you @holdenk for shepherding this all the way through ! |
Thanks for working on this so much! I’m really excited to launch this feature in 3.1 👍 |
What changes were proposed in this pull request?
The decommissioning state is a bit fragment across two places in the TaskSchedulerImpl:
#29014 stored the incoming decommission info messages in TaskSchedulerImpl.executorsPendingDecommission.
While #28619 was storing just the executor end time in the map TaskSetManager.tidToExecutorKillTimeMapping (which in turn is contained in TaskSchedulerImpl).
While the two states are not really overlapping, it's a bit of a code hygiene concern to save this state in two places.
With #29422, TaskSchedulerImpl is emerging as the place where all decommissioning book keeping is kept within the driver. So consolidate the information in tidToExecutorKillTimeMapping into executorsPendingDecommission.
However, in order to do so, we need to walk away from keeping the raw ExecutorDecommissionInfo messages and instead keep another class ExecutorDecommissionState. This decoupling will allow the RPC message class ExecutorDecommissionInfo to evolve independently from the book keeping ExecutorDecommissionState.
Why are the changes needed?
This is just a code cleanup. These two features were added independently and its time to consolidate their state for good hygiene.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests.