Add delay and smarter verification between node restarts #12

jahkeup · 2019-11-11T22:10:23Z

What I'd like:

Dogswatch should add some delay between the restart of Nodes in a cluster. During this time, the Controller should check in with the Node that has been updated to confirm that it has come up healthy and that Workloads have returned to it. After this, the Controller should have a configurable duration used to delay between each Node restart.
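To make the request concrete, here is a minimal sketch of the loop I have in mind. It is not Dogswatch code: `rolling_restart`, `restart`, `is_healthy`, and every parameter name are hypothetical, and the health check is passed in as a callable because what "healthy" means is still an open question in this issue.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    delay_between_nodes: float = 60.0,
    health_timeout: float = 600.0,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Restart nodes one at a time and return those that came back healthy.

    All callables are hypothetical hooks; nothing here talks to Kubernetes.
    """
    done: List[str] = []
    for node in nodes:
        if done:
            # Configurable pause between consecutive node restarts.
            sleep(delay_between_nodes)
        restart(node)
        deadline = clock() + health_timeout
        # Check in with the node until it reports healthy or we give up.
        while not is_healthy(node):
            if clock() >= deadline:
                # Halt the rollout instead of moving on past a bad node.
                return done
            sleep(poll_interval)
        done.append(node)
    return done
```

Returning early on a failed check is a deliberate choice in this sketch: an unhealthy node stops the rollout rather than letting the remaining nodes restart behind it.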

samuelkarp · 2020-03-17T16:35:32Z

This seems potentially related to before and after reboot checks.

anguslees · 2020-08-21T08:04:45Z

Indeed. In case you want concrete use cases for before/after reboot checks: I use them (with the coreos/flatcar update operator) to delay the reboot until the rook/ceph cluster is healthy[1], and to signal to rook/ceph that the storage cluster should set the "noout" flag[2]. The after-reboot check clears the noout flag and blocks again until the cluster is healthy.

[1] eg: Data is replicated sufficiently. This signal is "global" and much more complex than what a single pod readinessProbe can represent, which is why it can't be just a PodDisruptionBudget. A better implementation might only consider the redundancy of the data on "this" node. In particular, a naive time "delay", or generic check that pods were running again (as suggested in the issue description) would not be sufficient here.

[2] noout means the node outage that is about to happen is expected to be brief, and rook should not start frantically re-replicating "lost" data onto new nodes.
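Roughly, the sequence looks like the sketch below. This is an illustration of the ordering only, not the rook scripts: `cluster_healthy`, `set_noout`, and `reboot` are hypothetical callables, and `reboot` is assumed to block until the node is back, whereas in the real setup the before and after steps run as separate scripts.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout: float,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


def reboot_with_checks(
    cluster_healthy: Callable[[], bool],
    set_noout: Callable[[bool], None],
    reboot: Callable[[], None],
    timeout: float = 1800.0,
    **wait_kwargs,
) -> bool:
    """Run the before-reboot check, reboot, then the after-reboot check."""
    # Before reboot: block until the storage cluster as a whole is healthy.
    if not wait_until(cluster_healthy, timeout, **wait_kwargs):
        return False
    # Tell the storage layer the upcoming outage is expected to be brief.
    set_noout(True)
    try:
        reboot()
    finally:
        # After reboot: always clear the flag, even if the reboot hook failed.
        set_noout(False)
    # After reboot: block again until the cluster has recovered.
    return wait_until(cluster_healthy, timeout, **wait_kwargs)
```

The point is that the health signal is global: `cluster_healthy` asks about the whole storage cluster, which a fixed delay or a per-pod readiness check cannot express.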

This wasn't my idea at all; the standard rook docs for this are: https://rook.io/docs/rook/v1.4/container-linux.html

Having used this for a long time now, it works great. What might not be obvious at first is that the reboot script itself is deployed as a daemonset limited to nodes with the "before-reboot" label. That means it automatically "finds" and installs itself only on the relevant nodes, and only for the relevant time, which is pretty neat. Debugging the system when updates are not proceeding does require an understanding of the various state machine interactions though, of course.

I would expect very similar challenges exist for something like an elasticsearch cluster, where data replication is important and also not represented in the "health" of any specific container. I agree this probably points to a missing feature in PodDisruptionBudget, since it is still fundamentally a question of "is it ok to make $pod unavailable now".

chancez · 2020-09-02T03:43:09Z

I'm not sure about the best approach, but one of my use cases is jupyterhub notebook pods. These pods can't be interrupted, but we regularly cull inactive/idle ones. I'd like to be able to cordon the node that needs updating, and wait for the notebook pods to be stopped (which could take a while) before allowing the node to be rebooted. This might be done using a tool like https://github.com/planetlabs/draino, but the update-operator would need coordination.
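A rough sketch of the flow I'm describing, with the cluster operations left as placeholders. `cordon`, `list_blocking_pods`, and `reboot` are hypothetical callables, not draino or update-operator APIs, and `max_wait=None` means wait indefinitely.

```python
import time
from typing import Callable, Optional, Sequence


def cordon_and_wait(
    node: str,
    cordon: Callable[[str], None],
    list_blocking_pods: Callable[[str], Sequence[str]],
    reboot: Callable[[str], None],
    max_wait: Optional[float] = None,
    poll_interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Cordon a node, wait for uninterruptible pods to leave, then reboot.

    Returns False (without rebooting) if max_wait elapses first.
    """
    # Stop new notebook pods from being scheduled onto the node.
    cordon(node)
    start = clock()
    # Never evict: just wait for the idle culler to remove the pods.
    while list_blocking_pods(node):
        if max_wait is not None and clock() - start >= max_wait:
            return False
        sleep(poll_interval)
    reboot(node)
    return True
```

The node stays cordoned if the wait times out, so a later attempt only needs to wait for whatever pods remain.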

jahkeup · 2020-09-02T16:54:03Z

Thanks for sharing your use case and laying out what your ideal operation would look like.

> This might be done using a tool like planetlabs/draino, but the update-operator would need coordination.

Draino looks very closely related to this problem space. The project appears to build on the Kubernetes autoscaler in order to accomplish its task. I'm curious what other projects are integrating with the autoscaler and what they use to enhance the features provided.

We'll likely check out both of these projects as the design is sketched out.

webern transferred this issue from bottlerocket-os/bottlerocket Feb 26, 2020

jahkeup changed the title ~~dogswatch: add delay, verification between Node restarts~~ Add delay and smarter verification between node restarts Feb 27, 2020

samuelkarp mentioned this issue Mar 17, 2020

Coordinate node readiness with pod workload checks and health reporting #13

Closed

jhaynes added this to the Backlog milestone May 21, 2021

jhaynes added enhancement type/enhancement priority/p0 and removed enhancement labels May 21, 2021

jhaynes modified the milestones: Backlog, next May 21, 2021

gthao313 self-assigned this Jul 15, 2021

gthao313 added status/design and removed status/notstarted labels Jul 15, 2021

gthao313 assigned srgothi92 Jul 22, 2021

Vaishvenk added this to Coming Soon in Bottlerocket Roadmap Jul 22, 2021

srgothi92 mentioned this issue Jul 30, 2021

Handles drain failure and node health check #69

Closed

cbgbt modified the milestones: next, next+1 Sep 7, 2021

kdaula modified the milestones: brupop 0.1.x next, brupop 1.0.0 Feb 4, 2022

cbgbt mentioned this issue Mar 11, 2022

Add reset algorithm to better handle crash loop. #160

Merged

cbgbt removed the status/design label Apr 5, 2022

gthao313 unassigned srgothi92 May 2, 2022

gthao313 removed their assignment Aug 22, 2022

gthao313 removed the priority/p0 label Oct 14, 2022

gthao313 modified the milestones: brupop 1.0.0, Backlog Oct 17, 2022
