Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update pod reconciler to handle race conditions in exclusive placement #342

Merged
merged 15 commits into from
Dec 12, 2023

Conversation

danielvegamyhre
Copy link
Contributor

@danielvegamyhre danielvegamyhre commented Nov 22, 2023

Fixes #341

In the following changes, we aim to enforce that for JobSets using exclusive placement, pods that belong to the same job are scheduled on the same topology domain exclusively. We do this by performing the following steps:

  1. Reconcile leader pods which are scheduled and using exclusive placement.
  2. For a given leader pod, check all follower pods's nodeSelectors are all configured to select the same topology as the leader pod is currently placed on.
  3. If the above condition is false, we delete all the follower pods in this job to allow them to be rescheduled correctly. When deleting pods, we first add a pod condition so that these deletions can be ignored by a podFailurePolicy, to not count against a backoffLimit.

I will add an e2e test forcing the race condition to happen and verifying the reconciler handles it in a follow-up PR.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 22, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 22, 2023
@danielvegamyhre
Copy link
Contributor Author

/hold until #340 is merged

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 22, 2023
@danielvegamyhre
Copy link
Contributor Author

/retest

webhook still isn't ready after 10min somewhat frequently in our CI e2e tests, I think we need to bump the test infra CPU/memory resources.

@danielvegamyhre danielvegamyhre changed the title Update pod reconciler to handle race conditions in exclusive placement [WIP] Update pod reconciler to handle race conditions in exclusive placement Nov 23, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 23, 2023
pkg/controllers/pod_controller.go Outdated Show resolved Hide resolved
pkg/controllers/pod_controller.go Show resolved Hide resolved
pkg/controllers/pod_controller.go Outdated Show resolved Hide resolved
pkg/controllers/pod_controller.go Show resolved Hide resolved
@danielvegamyhre danielvegamyhre changed the title [WIP] Update pod reconciler to handle race conditions in exclusive placement Update pod reconciler to handle race conditions in exclusive placement Nov 27, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 27, 2023
@danielvegamyhre
Copy link
Contributor Author

/hold cancel

since #309 and #340 have been merged

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 27, 2023
Copy link
Contributor

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

l have few more comments, will discuss f2f

pkg/controllers/pod_controller.go Outdated Show resolved Hide resolved
pkg/controllers/pod_controller.go Outdated Show resolved Hide resolved
pkg/controllers/pod_controller.go Show resolved Hide resolved
pkg/controllers/pod_controller.go Outdated Show resolved Hide resolved
pkg/controllers/pod_controller.go Outdated Show resolved Hide resolved
…delete follower pods when there is a placement violation
@danielvegamyhre
Copy link
Contributor Author

/retest

pkg/controllers/pod_controller.go Outdated Show resolved Hide resolved
pkg/controllers/pod_controller.go Outdated Show resolved Hide resolved
pkg/controllers/pod_controller.go Outdated Show resolved Hide resolved
@ahg-g
Copy link
Contributor

ahg-g commented Dec 8, 2023

Looks good, but lets first continue our investigation on the pod reconciler's update events and manual tests.

@danielvegamyhre
Copy link
Contributor Author

/retest

Copy link
Contributor

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just a couple of nits

pkg/controllers/pod_controller.go Outdated Show resolved Hide resolved
pkg/controllers/pod_controller.go Outdated Show resolved Hide resolved
@danielvegamyhre
Copy link
Contributor Author

@ahg-g I cleaned up the logging in the webhook by removing 1 superfluous log line and marking the others at V(3), which is higher than the controller logs which are V(2), so the user can choose to not view them unless they want to debug something.

@ahg-g
Copy link
Contributor

ahg-g commented Dec 12, 2023

/label tide/merge-method-squash
/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Dec 12, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 12, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, danielvegamyhre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ahg-g,danielvegamyhre]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit d49514b into kubernetes-sigs:main Dec 12, 2023
@danielvegamyhre danielvegamyhre mentioned this pull request Dec 12, 2023
20 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle race conditions in exclusive placement
3 participants