Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check, update node label on machine obj prior to drain,termination #887

Merged
merged 3 commits into from
Dec 27, 2023
Merged

Check, update node label on machine obj prior to drain,termination #887

merged 3 commits into from
Dec 27, 2023

Conversation

elankath
Copy link
Contributor

@elankath elankath commented Dec 26, 2023

What this PR does / why we need it:

When a Node is never associated with its Machine. Ie the machine object never has the machine.Labels[v1alpha1.NodeLabelKey] set after the machine creation, then during the deletion flow, our Node object is not deleted. (Label updation can be missed if the machine object update transiently fails)

Then after some time, the dangling Node object gets the NotManagedByMCM annotation.

Which issue(s) this PR fixes:
Fixes #875

Special notes for your reviewer:

Release note:

Fix for edge case of Node object deletion missed during machine termination.

@elankath elankath requested a review from a team as a code owner December 26, 2023 03:53
@gardener-robot gardener-robot added needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Dec 26, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 26, 2023
@elankath elankath added do-not-merge/work-in-progress and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 26, 2023
@elankath
Copy link
Contributor Author

Teested fix on GCP (node label is same as machine name for this provider). Removed the node label and initiated machine deletion. Node label is now set again prior to drain and deletion. Node deletion successfully occurs even when node label is missing.

I1226 10:08:31.120914   86315 machine.go:128] reconcileClusterMachine: Start for "shoot--i034796--g1-w1-z1-788d9-hlgnx" with phase:"Terminating", description:"Set machine status to termination. Now, getting VM Status"
I1226 10:08:33.742750   86315 machine_util.go:1685] Updating "node" label on machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" to "shoot--i034796--g1-w1-z1-788d9-hlgnx"
I1226 10:08:33.926901   86315 machine_util.go:1696] Updated "node" label on machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" to "shoot--i034796--g1-w1-z1-788d9-hlgnx
I1226 10:08:41.699711   86315 machine_util.go:1104] Normal delete/drain has been triggerred for machine "shoot--i034796--g1-w1-z1-788d9-hlgnx"
...
I1226 10:11:05.334091   86315 machine_controller.go:131] VM "gce:///OMITTED/shoot--i034796--g1-w1-z1-788d9-hlgnx" for Machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" was terminated succesfully
I1226 10:11:10.681394   86315 machine_util.go:1357] Deleting node "shoot--i034796--g1-w1-z1-788d9-hlgnx" associated with machine "shoot--i034796--g1-w1-z1-788d9-hlgnx"
I1226 10:11:16.055535   86315 machine.go:648] Machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" with providerID "gce:///OMITTED/shoot--i034796--g1-w1-z1-788d9-hlgnx" and nodeName "shoot--i034796--g1-w1-z1-788d9-hlgnx" deleted successfully

@elankath
Copy link
Contributor Author

Teested fix on AWS (node label is diff from machine name for this provider). Removed the node label and initiated machine deletion. Node label is now set again prior to drain and deletion. Node deletion successfully occurs even when node label is missing.

I1226 10:37:06.233292   90959 machine_util.go:1685] Updating "node" label on machine "shoot--i034796--aw3-a-z1-9cc57-q6qbl" to "ip-10-180-29-167.eu-west-1.compute.internal"
I1226 10:37:06.405762   90959 machine_util.go:1696] Updated "node" label on machine "shoot--i034796--aw3-a-z1-9cc57-q6qbl" to "ip-10-180-29-167.eu-west-1.compute.internal"
I1226 10:37:55.994989   90959 core.go:285] Machine deletion request has been recieved for "shoot--i034796--aw3-a-z1-9cc57-q6qbl"
I1226 10:37:56.382539   90959 core.go:311] VM "aws:///eu-west-1/i-078071299a1bce4ca" for Machine "shoot--i034796--aw3-a-z1-9cc57-q6qbl" was terminated successfully
I1226 10:38:01.724111   90959 machine_util.go:1357] Deleting node "ip-10-180-29-167.eu-west-1.compute.internal" associated with machine "shoot--i034796--aw3-a-z1-9cc57-q6qbl"
I1226 10:38:01.724126   90959 machine_util.go:1365] Deletion of Node Object "ip-10-180-29-167.eu-west-1.compute.internal" is successful. Initiate machine object finalizer removal
I1226 10:38:07.079213   90959 machine.go:648] Machine "shoot--i034796--aw3-a-z1-9cc57-q6qbl" with providerID "aws:///eu-west-1/i-078071299a1bce4ca" and nodeName "ip-10-180-29-167.eu-west-1.compute.internal" deleted successfully

@gardener-robot gardener-robot added size/s Size of pull request is small (see gardener-robot robot/bots/size.py) and removed size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Dec 26, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 26, 2023
@gardener-robot-ci-2 gardener-robot-ci-2 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 26, 2023
@elankath elankath changed the title Check, update node label on machine obj prior to drain Check, update node label on machine obj prior to drain,termination Dec 26, 2023
Copy link
Contributor

@himanshu-kun himanshu-kun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some log changes requested

pkg/util/provider/machinecontroller/machine_util.go Outdated Show resolved Hide resolved
@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Dec 26, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 27, 2023
Copy link
Contributor

@himanshu-kun himanshu-kun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/changes Needs (more) changes needs/review Needs review labels Dec 27, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 27, 2023
@himanshu-kun himanshu-kun merged commit d8d603b into gardener:master Dec 27, 2023
8 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Dec 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) size/s Size of pull request is small (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Edge case where Node Deletion is missed if machine 'node' label is not present
6 participants