Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Speed up ignoring terminating Pods when draining unreachable Nodes #10706

Merged
merged 1 commit into from
Jun 17, 2024

Conversation

typeid
Copy link
Contributor

@typeid typeid commented May 30, 2024

What this PR does / why we need it:

This PR reduces the time we wait for pod deletion in the case of unreachable nodes: we currently wait for the deletionTimestamp to be 5 minutes old. The goal is to allow MHC to replace unreachable nodes faster.

This PR aligns the behavior of CAPI with OpenShift MAPI: https://github.com/openshift/machine-api-operator/blob/dcf1387cb69f8257345b2062cff79a6aefb1f5d9/pkg/controller/machine/drain_controller.go#L164-L171

This PR is a result from a discussion here.

Which issue(s) this PR fixes
Fixes #10705

Area example:
/area machine

@k8s-ci-robot k8s-ci-robot added area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 30, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @typeid. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 30, 2024
Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 30, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 3be733612fe5af9c5e0a93e5ef326ea019f46b3a

@vincepri
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 30, 2024
@vincepri
Copy link
Member

It would be good though if we could add an e2e testing this behavior and expectations

@typeid
Copy link
Contributor Author

typeid commented May 30, 2024

/retest

It would be good though if we could add an e2e testing this behavior and expectations

@vincepri I can definitely take a look at doing that. It's going to be my first shot at adding an E2E test here, so no promises. I'll check if there's something I can re-use, but the node being unreachable doesn't seem something we already have a base for.

@enxebre
Copy link
Member

enxebre commented May 31, 2024

This PR reduces the time we wait for pod deletion in the case of unhealthy nodes:

Just to clarify as this topic gets fuzzy every time we need to get back to it after a while. I would replace "unhealthy" (ambiguous) with "Unreachable" (ready = Unknown), which is the only semantic where we set this timeout because we assume those pods as gone. This PR is irrelevant for other "Unhealthy" criteria.

@typeid
Copy link
Contributor Author

typeid commented Jun 3, 2024

@enxebre / @vincepri The E2E doesn't seem trivial. I would skip adding the E2E for now if that's okay. Shall I create a follow-up issue to add functionality in our test framework to render nodes unreachable? This would allow us to improve our coverage in other areas as well.

Adding a unit test here would also not add any coverage of value IMO and require more code changes.

@enxebre
Copy link
Member

enxebre commented Jun 3, 2024

I'm good with that, I'll defer to Vince to cancel the hold as he considers
/approve
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 4, 2024
Copy link

linux-foundation-easycla bot commented Jun 4, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 4, 2024
Co-authored-by: Michael Shen <mishen@umich.edu>
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 4, 2024
@enxebre
Copy link
Member

enxebre commented Jun 6, 2024

/lgtm
/assign @sbueringer

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 6, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 4f450481d31b49cacb7dda4d01cf0c96c6da8086

@sbueringer sbueringer changed the title 🐛 Fix delayed MHC replacement of unreachable nodes 🐛 Speed up ignoring terminating Pods when draining unreachable Nodes Jun 13, 2024
@sbueringer
Copy link
Member

/lgtm

Thx everyone for the explanations!

/hold
Similar to the other PR, I asked some other folks for their opinion. I'll get back to you ASAP (I expect early next week)

Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 14, 2024
@sbueringer
Copy link
Member

sbueringer commented Jun 14, 2024

Similar to the other PR, I asked some other folks for their opinion. I'll get back to you ASAP (I expect early next week)

/hold

If that's okay?

EDIT: I think tide is going to merge regardless again :/ (I think once a PR is in the merge pool there is no way to get it out of it)
EDIT2: Seems like it worked this time

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 14, 2024
@vincepri
Copy link
Member

Sorry I was under the impression per Alberto's comment that we were waiting for my review 😄

@sbueringer
Copy link
Member

Thx everyone!

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2024
@k8s-ci-robot k8s-ci-robot merged commit 5b6043e into kubernetes-sigs:main Jun 17, 2024
26 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.8 milestone Jun 17, 2024
@sbueringer
Copy link
Member

/cherry-pick release-1.7

@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #10766

In response to this:

/cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Delayed MHC replacement of unreachable nodes
6 participants