fix: annotate nodes for reboot before aborting due to blocked #749

Merged


jackfrancis
Collaborator

This PR moves the "reboot blocker" checks (e.g., pods matching a blocking podSelector that are running on the node to be rebooted) further down in the conditional reboot flow, so that those checks happen after we (conditionally) annotate the node.

This enables the following sequence when a required reboot is detected (by default, via the presence of the /var/run/reboot-required file):

  1. If --annotate-nodes is true, we annotate the node with the following annotations:
  • "weave.works/kured-reboot-in-progress" (indicates that this node will be rebooted at some point in the near future)
  • "weave.works/kured-most-recent-reboot-needed" (records a timestamp of when kured detected the need for a reboot)
  2. After annotating the node, we add a "prefer no schedule" taint to tell the scheduler that we would prefer (though not strictly require) that it not consider this node for future pod scheduling
  3. We check whether any Prometheus or pod selector blockers prevent an immediate reboot; if so, we short-circuit and requeue for a later reboot attempt, once the blockers give the green light
  4. If we're not blocked, we cordon and drain the node
  5. We reboot
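
The ordering above can be sketched roughly as follows. This is a simplified illustration, not kured's actual code: the `Node` type, the `rebootAsRequired` function, the `Steps` trace, and the placeholder annotation values are all hypothetical. Only the two annotation keys come from this PR.

```go
package main

import "fmt"

// Annotation keys named in this PR.
const (
	inProgressKey = "weave.works/kured-reboot-in-progress"
	neededKey     = "weave.works/kured-most-recent-reboot-needed"
)

// Node is a hypothetical in-memory stand-in for a Kubernetes node.
type Node struct {
	Annotations map[string]string
	Tainted     bool
	Steps       []string // trace of what was done, for illustration only
}

// rebootAsRequired walks the reordered flow once for a node that needs a
// reboot. It returns true if the node was rebooted, and false if the attempt
// was requeued because a blocker was active.
func rebootAsRequired(n *Node, annotateNodes, blocked bool, now string) bool {
	// 1. Annotate first, so other tooling can see the pending reboot.
	//    The values written here are placeholders (assumption).
	if annotateNodes {
		n.Annotations[inProgressKey] = now
		n.Annotations[neededKey] = now
		n.Steps = append(n.Steps, "annotate")
	}
	// 2. Express a scheduling preference against this node.
	n.Tainted = true
	n.Steps = append(n.Steps, "taint")
	// 3. Only now consult the blockers; requeue if any is active.
	if blocked {
		n.Steps = append(n.Steps, "requeue")
		return false
	}
	// 4 and 5. Not blocked: cordon and drain, then reboot.
	n.Steps = append(n.Steps, "cordon+drain", "reboot")
	return true
}

func main() {
	n := &Node{Annotations: map[string]string{}}
	rebooted := rebootAsRequired(n, true, true, "2023-03-24T18:27:00Z")
	fmt.Println(rebooted, n.Steps, n.Annotations[inProgressKey] != "")
}
```

In this sketch, a blocked node still ends up annotated and tainted even though no reboot happens, which is the behavior change this PR is after.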

I'm simplifying a bit above; there is more complexity, but the key point is that we now mark "this node is definitely going to be rebooted" before we check for blockers. This allows other, complementary tooling to know about the state of the node ("gonna be rebooted soon") and to do things like manually migrating stateful applications off of that node, adding additional taints to better prevent future scheduling, etc.
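
As a rough illustration of how complementary tooling might consume the annotation, the sketch below filters a set of nodes down to those marked as pending reboot. It works on plain maps rather than a real Kubernetes client; `nodesPendingReboot` and the sample node names are hypothetical, and only the annotation key comes from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// Annotation key named in this PR.
const rebootInProgressAnnotation = "weave.works/kured-reboot-in-progress"

// nodesPendingReboot takes a map of node name to that node's annotations and
// returns the sorted names of nodes carrying the in-progress annotation.
// Only the presence of the key is checked; its value is ignored.
func nodesPendingReboot(nodes map[string]map[string]string) []string {
	var pending []string
	for name, annotations := range nodes {
		if _, ok := annotations[rebootInProgressAnnotation]; ok {
			pending = append(pending, name)
		}
	}
	sort.Strings(pending)
	return pending
}

func main() {
	// Hypothetical cluster snapshot; a real tool would list nodes via the API.
	nodes := map[string]map[string]string{
		"node-a": {rebootInProgressAnnotation: "2023-03-24T18:27:00Z"},
		"node-b": {},
		"node-c": {rebootInProgressAnnotation: "2023-03-24T18:30:00Z"},
	}
	for _, name := range nodesPendingReboot(nodes) {
		fmt.Println("pending reboot:", name)
	}
}
```

A tool built along these lines could then migrate stateful workloads or add stricter taints for each node it finds, before kured gets the green light to drain and reboot.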

Fixes #702

@jackfrancis
Collaborator Author

cc @timo-42

Signed-off-by: Jack Francis <jackfrancis@gmail.com>
@jackfrancis jackfrancis force-pushed the fix-reboot-in-progress-pod-selector branch from 5bf1de3 to 5c683b9 Compare March 24, 2023 18:27
@jackfrancis
Collaborator Author

/assign @ckotzbauer

@ckotzbauer ckotzbauer self-requested a review March 25, 2023 08:59
Member

@ckotzbauer ckotzbauer left a comment

LGTM

@timo-42

timo-42 commented Apr 12, 2023

@jackfrancis @ckotzbauer Please merge this PR.

@ckotzbauer
Member

We're currently discussing whether this should land now, right before the upcoming 1.13.0 release.

@ckotzbauer ckotzbauer merged commit 1929c11 into kubereboot:main Apr 14, 2023
Development

Successfully merging this pull request may close these issues.

New Annotation reboot-required
3 participants