
feat: Add reboot-required annotation #715

Closed
wants to merge 1 commit

Conversation

timo-42

@timo-42 timo-42 commented Jan 10, 2023

@timo-42 timo-42 changed the title feat: Add reboot required annotation feat: Add reboot-required annotation Jan 11, 2023
@ckotzbauer ckotzbauer self-requested a review January 11, 2023 18:27
Member

@ckotzbauer ckotzbauer left a comment


Thanks for your PR. I added one smaller comment above.
I think it would make sense to only add/remove this annotation when the --annotate-nodes flag is set. This would streamline the new annotation with the others.

Review comment on cmd/kured/main.go (outdated, resolved)
@ckotzbauer ckotzbauer linked an issue Jan 11, 2023 that may be closed by this pull request
@timo-42 timo-42 force-pushed the reboot-required-annoation branch 2 times, most recently from 939f18a to 53e009a Compare January 13, 2023 18:29
@timo-42
Author

timo-42 commented Jan 13, 2023

Thanks for your PR. I added one smaller comment above. I think it would make sense to only add/remove this annotation when the --annotate-nodes flag is set. This would streamline the new annotation with the others.

Put the new annotation behind the annotation toggle.

@timo-42
Author

timo-42 commented Jan 17, 2023

@ckotzbauer @dholbach I pushed the changes you requested.

@timo-42
Author

timo-42 commented Jan 25, 2023

@ckotzbauer @dholbach Can you please have a look?

@ckotzbauer
Member

@timo-42 I have not forgotten, but I am a bit busy at the moment.

@vitality411

@timo-42 Thank you for this feature. I am very interested in this.

Does it work as expected? If I understand the code correctly, annotateNodes is only executed during the reboot time window.

Outside reboot time window:

k logs kured-knvc5 -f
time="2023-02-07T09:45:58Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2023-02-07T09:45:58Z" level=info msg="Kubernetes Reboot Daemon: c7b7d6a"
time="2023-02-07T09:45:58Z" level=info msg="Node ID: node1"
time="2023-02-07T09:45:58Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2023-02-07T09:45:58Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2023-02-07T09:45:58Z" level=info msg="Lock release delay set, lock release will be delayed by: 30m0s"
time="2023-02-07T09:45:58Z" level=info msg="PreferNoSchedule taint: "
time="2023-02-07T09:45:58Z" level=info msg="Blocking Pod Selectors: []"
time="2023-02-07T09:45:58Z" level=info msg="Reboot schedule: ---MonTueWedThuFri--- between 02:30 and 06:00 Europe/Berlin"
time="2023-02-07T09:45:58Z" level=info msg="Reboot check command: [test -f /var/run/reboot-required] every 1m0s"
time="2023-02-07T09:45:58Z" level=info msg="Reboot command: [/bin/systemctl reboot]"
time="2023-02-07T09:45:58Z" level=info msg="Will annotate nodes during kured reboot operations"

k exec -it  kured-knvc5 -- sh
/ # wget -qO- 127.0.0.1:8080/metrics | grep kured
# HELP kured_reboot_required OS requires reboot due to software updates.
# TYPE kured_reboot_required gauge
kured_reboot_required{node="node1"} 1

k describe node node1 | grep weave.works
<empty>

But when the node is inside the reboot time window, a reboot is required, and the lock can be acquired, the drain and reboot follow quickly:

k exec -it  kured-pxkjr -- sh
/ # wget -qO- 127.0.0.1:8080/metrics | grep kured
# HELP kured_reboot_required OS requires reboot due to software updates.
# TYPE kured_reboot_required gauge
kured_reboot_required{node="node1"} 1

k logs kured-pxkjr -f
time="2023-02-07T10:20:21Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2023-02-07T10:20:21Z" level=info msg="Kubernetes Reboot Daemon: c7b7d6a"
time="2023-02-07T10:20:21Z" level=info msg="Node ID: node1"
time="2023-02-07T10:20:21Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2023-02-07T10:20:21Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2023-02-07T10:20:21Z" level=info msg="Lock release delay set, lock release will be delayed by: 30m0s"
time="2023-02-07T10:20:21Z" level=info msg="PreferNoSchedule taint: "
time="2023-02-07T10:20:21Z" level=info msg="Blocking Pod Selectors: []"
time="2023-02-07T10:20:21Z" level=info msg="Reboot schedule: ---MonTueWedThuFri--- between 10:00 and 13:00 Europe/Berlin"
time="2023-02-07T10:20:21Z" level=info msg="Reboot check command: [test -f /var/run/reboot-required] every 1m0s"
time="2023-02-07T10:20:21Z" level=info msg="Reboot command: [/bin/systemctl reboot]"
time="2023-02-07T10:20:21Z" level=info msg="Will annotate nodes during kured reboot operations"
time="2023-02-07T10:22:56Z" level=info msg="Reboot required"
time="2023-02-07T10:22:56Z" level=info msg="Adding node node1 annotation: weave.works/reboot-required=true"
time="2023-02-07T10:22:56Z" level=info msg="Adding node node1 annotation: weave.works/kured-reboot-in-progress=2023-02-07T10:22:56Z"
time="2023-02-07T10:22:56Z" level=info msg="Adding node node1 annotation: weave.works/kured-most-recent-reboot-needed=2023-02-07T10:22:56Z"
time="2023-02-07T10:22:56Z" level=info msg="Acquired reboot lock"
time="2023-02-07T10:22:56Z" level=info msg="Draining node node1"
...
time="2023-02-07T10:23:32Z" level=info msg="Running command: [/usr/bin/nsenter -m/proc/1/ns/mnt -- /bin/systemctl reboot] for node: node1"
time="2023-02-07T10:23:32Z" level=info msg="Waiting for reboot"

k describe node node1 | grep weave.works
                    weave.works/kured-most-recent-reboot-needed: 2023-02-07T10:22:56Z
                    weave.works/kured-reboot-in-progress: 2023-02-07T10:22:56Z
                    weave.works/reboot-required: true

After the reboot:

k exec -it  kured-pxkjr -- sh
/ # wget -qO- 127.0.0.1:8080/metrics | grep kured
# HELP kured_reboot_required OS requires reboot due to software updates.
# TYPE kured_reboot_required gauge
kured_reboot_required{node="node1"} 0

k logs kured-pxkjr -f
...
time="2023-02-07T10:25:35Z" level=info msg="Deleting node node1 annotation weave.works/reboot-required"
time="2023-02-07T10:25:35Z" level=info msg="Holding lock"
time="2023-02-07T10:25:35Z" level=info msg="Uncordoning node node1"
time="2023-02-07T10:25:35Z" level=info msg="Deleting node node1 annotation weave.works/kured-reboot-in-progress"
time="2023-02-07T10:25:35Z" level=info msg="Delaying lock release by 30m0s"

k describe node node1 | grep weave.works
                    weave.works/kured-most-recent-reboot-needed: 2023-02-07T10:22:56Z

In my opinion, this defeats the purpose of your issue #702.

I think for this to work, a separate goroutine would be needed, like the maintainRebootRequiredMetric goroutine that updates the kured_reboot_required metric, so that the annotation is also maintained outside of the reboot time window.

Please let me know if it works for you as expected and if I have misunderstood something.

FYI: I opened #725 to discuss the issue of the "Reboot required" log message outside the reboot window.

@timo-42
Author

timo-42 commented Feb 7, 2023

You're right, I did not consider checking the state outside the window! I moved the code into a separate goroutine.
The question is: should we create a separate goroutine for the Prometheus metric and the annotations, or should we merge them?

@vitality411

I'm not a developer per se, but I think having only one goroutine would have advantages for the memory footprint, especially since they partially execute the same commands.

@timo-42
Author

timo-42 commented Feb 8, 2023

Moved code into metric goroutine.

@timo-42
Author

timo-42 commented Feb 22, 2023

@vitality411 @ckotzbauer Is there anything more I should do?

@vitality411

Hi @timo-42, from my side it looks good.

@jackfrancis
Collaborator

@timo-42 Thank you for your patience during the review process, I'm back online and able to do this responsively going forward.

As I understand it, what you'd like to accomplish here is to not schedule certain workloads on nodes that are running kured, because those nodes will responsively reboot, and cordon/drain is not sufficient to gracefully evict and reschedule those workloads. Is that correct?

If so, I wonder if this issue would be a more elegant solution: #737

My observation: there is a problem with using a node annotation via the kured daemonset to identify "I am running the kured daemonset", because if for any reason that daemonset is removed manually, and/or the appropriate node taints are added so that it is no longer running kured, that annotation will have to be manually removed (and inevitably this will not happen 100% of the time, and so that annotation will not be 100% accurate in reflecting the state of that node).

@timo-42
Author

timo-42 commented Mar 2, 2023

@jackfrancis

As I understand it, what you'd like to accomplish here is to not schedule certain workloads on nodes that are running kured, because those nodes will responsively reboot, and cordon/drain is not sufficient to gracefully evict and reschedule those workloads. Is that correct?

Yes. We want to avoid scheduling new workloads on it and want to be notified that a node needs a reboot, so we can start draining the pods in their respective maintenance windows.

If so, I wonder if this issue would be a more elegant solution: #737

If I understand this correctly, the proposal is to run an arbitrary kubectl command before draining. That is already too late in the reboot process for us. We need days to drain a node, so we need the annotation as early as possible, before kured acquires the cluster lock and runs the reboot command.

My observation: there is a problem with using a node annotation via the kured daemonset to identify "I am running the kured daemonset", because if for any reason that daemonset is removed manually, and/or the appropriate node taints are added so that it is no longer running kured, that annotation will have to be manually removed (and inevitably this will not happen 100% of the time, and so that annotation will not be 100% accurate in reflecting the state of that node).

If the concern is cleanup, then we could add a pre-delete hook to delete all dangling annotations.
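Such a pre-delete hook (e.g. a Helm hook Job) could run something along these lines; this is an illustrative fragment, not part of kured's chart:

```shell
# Hypothetical cleanup step for a pre-delete hook: the trailing dash
# tells kubectl to remove the annotation, here from every node.
kubectl annotate nodes --all weave.works/reboot-required-
```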

@jackfrancis
Collaborator

@timo-42 Thanks for the explanation.

Here's my thinking: by putting the annotation on the node as soon as kured launches its daemonset runtime, you are essentially declaring: don't schedule any non-evictable workloads on this node, because it may be rebooted at some point in the future. And then you would begin the process of evicting existing workloads, which may take days. But it's not clear to me how having this advance notice will actually guarantee that the underlying host OS will not create a /var/run/reboot-required file before that days-long eviction completes. And so you are still at risk of the node rebooting during the workload eviction process.

It would seem to me that the workload requirements you're outlining would forbid kured from running at all on the nodes that are running those workloads. What is the downside of simply tainting such nodes and not having kured run on them? I'm not sure that kured, as it currently stands, is a viable solution for these operational requirements.

@dholbach @ckotzbauer @evrardjp any other thoughts?

@timo-42
Author

timo-42 commented Mar 4, 2023

@jackfrancis

It would seem to me that the workload requirements you're outlining would forbid kured from running at all on the nodes that are running those workloads. What is the downside of simply tainting such nodes and not having kured run on them? I'm not sure that kured, as it currently stands, is a viable solution for these operational requirements.

This is easy :) - I use --blocking-pod-selector to prevent rebooting nodes which have these kinds of workloads.

@jackfrancis
Collaborator

@timo-42 Thanks for clarifying, that makes sense!

In this case this would be my suggestion (curious for others' thoughts as well):

Rather than add a third annotation to the node to describe kured reboot state, let's simply re-order the steps in the flow where we apply the "weave.works/kured-reboot-in-progress" annotation. At present, we short-circuit the flow if a matching blocking pod selector is found, but we can short-circuit on that criteria later in the flow. For example:

If a reboot is required:

  1. Add the "weave.works/kured-reboot-in-progress" annotation
  2. Add the "prefer no schedule" taint to the node
  3. If reboot is blocked due to matching pod selector, continue
    a. Else, cordon/drain and reboot node

The above should solve for your use-case. You may use the "weave.works/kured-reboot-in-progress" annotation in your own workflows to prevent new, non-interruptible workloads from landing on this node (or you may be able to rely upon the "prefer no schedule" taint, but that taint is not strict, so FYI). Only when existing, non-interruptible workloads have gracefully exited, and the pod selector finds no match, will the node actually cordon/drain + reboot.

@ckotzbauer @dholbach @evrardjp curious whether you have any criticisms of the above.

@ckotzbauer
Member

I like the idea @jackfrancis. I'm not fully aware of the current implementation of "weave.works/kured-reboot-in-progress": Would this be a breaking change?

@jackfrancis
Collaborator

@ckotzbauer In my view it wouldn't be a breaking change. The original implementation of "weave.works/kured-reboot-in-progress" was meant to annotate the node as soon as we know it will eventually be rebooted. One particular condition that can delay a reboot from happening right away is the inability to acquire the exclusive node reboot lock. But we still want to annotate the node because in the next loop we may require that lock, and thus we still don't want applications that are consuming that annotation to consider that node as a viable node for scheduling new workloads.

It seems that the blocking pod selector criteria was overlooked when implementing the "weave.works/kured-reboot-in-progress" annotation, but that criteria is functionally equivalent to the exclusive node reboot lock: in the next loop there may be no more pods that match the blocking pod criteria, and so we are in a similar situation where we know that the node will eventually be rebooted.

tl;dr I think there's a bug in the implementation of the "weave.works/kured-reboot-in-progress" node annotation, and if we fix that bug, then @timo-42 and other folks can use it to build conditional scheduling flows around.

@timo-42 does that make sense? I'd be happy to fix that bug myself.

@timo-42
Author

timo-42 commented Mar 16, 2023

@jackfrancis This is exactly what we need 👍. Thanks for the insight regarding the annotation and for taking care of this bug yourself. I will then change our internal code to use the weave.works/kured-reboot-in-progress annotation, but that is an easy change.
This makes this PR obsolete; it can be closed as soon as your fix PR is opened.

@timo-42
Author

timo-42 commented Apr 15, 2023

Closing since #749 got merged.

@timo-42 timo-42 closed this Apr 15, 2023
Successfully merging this pull request may close these issues.

New Annotation reboot-required
4 participants