Coordinate node readiness with pod workload checks and health reporting #13
Is this what a regular
Thanks for the pointer, I think you're right that disruptions are what this'll end up being/using. We'll strongly consider using, and maybe only using, the native tools for handling disruptions to nodes as we design our integration. If we can use existing tunables/configurables, then we will. Using what's already there, what's already battle-tested, is ideal and preferred over introducing unnecessary complexity and configuration in order to reimplement functionality.
Fyi, the most relevant prior art is probably the CoreOS Container Linux update operator: https://github.com/coreos/container-linux-update-operator Also WeaveWorks' kured: https://github.com/weaveworks/kured - not tied to any specific OS.
In my opinion, the CoreOS operator in particular had the largest influence on the MVP design that we see here today. Before we began, we looked at the current set of operators for inspiration and any known pitfalls (including Weaveworks' kured). We were inspired by the community's prior art, and I think our implementation shows that in a few ways (for example, the operator's annotations were inspired by, but differ from, CLUO's).

There are features, designs, and posted ideas that have intentionally left room for improvement, to let user input drive their shape and to gather the underpinning use cases. I suspect that many user-informed designs (now and in the future) will overlap with ones we, as a team, have shared or discussed amongst ourselves, like a configurable update policy (which we discussed above 👍). I also expect that this combination of user asks, provided use cases, and our project's tenets will drive us to implement richer and richer feature sets over time. I've been throwing ideas around with the team for some time, but they've tended to require an understanding of how and what folks want the update operator to do. The more users find and share new ways the update operator can work for them, the better we can, together, equip it to do so!
Thanks for calling this one out. This is a well-defined feature that likely has wide-ranging use cases backing it. I think we'd consider something like this for implementation in the Bottlerocket update operator. Of course, before we move on it, we'd like to understand what those existing use cases are (in Container Linux) and, importantly, how they change when accounting for Bottlerocket's design (for example, being isolated from the OS). If anyone wants to talk further about reboot checks, please cut a new issue to discuss them there. I think that topic is related to this issue, but it is a sub-topic with its own analysis required.
#12 seems like a good place to discuss this.
We now fully respect PodDisruptionBudgets, but we do allow Brupop to consider a host successfully updated even when there's no guarantee that other workloads are succeeding. Logically, at least the agent must come back up before the node can complete. I think the rest of this falls under the scope of #12, so I'll close this ticket for now and track the remaining items there.
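For readers following along, the PodDisruptionBudget mentioned here is the native Kubernetes object for constraining voluntary evictions during drains. A minimal sketch (the name, replica count, and label are hypothetical, not from this project) might look like:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb          # hypothetical name
spec:
  minAvailable: 2        # keep at least two replicas up during a node drain
  selector:
    matchLabels:
      app: web           # hypothetical label selecting the protected workload
```

With such a budget in place, an eviction-based drain (as performed when respecting PDBs) will refuse to evict a pod if doing so would leave fewer than `minAvailable` replicas running.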
What I'd like:
Dogswatch should check on the status of the Pod workloads currently running on a Node before considering an update to be possible. The Controller should verify that the Pods about to be terminated are in a healthy state, and that the impacted service will remain available elsewhere in the cluster, prior to removing the workload from a Node.
Ideally, it would be configurable how to handle the termination of transient Pods that are not controlled by higher-level schedulers (the likes of ReplicaSets or Deployments). I'm sure many other configurables could come into play in this particular critical path; thought should be given to how this could be extended to handle additional considerations.
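The check requested above could be sketched roughly as follows. This is only an illustration of the decision logic, not the operator's actual code; the `Pod` shape, the `evict_bare_pods` flag, and the function name are all hypothetical.

```python
# Hypothetical sketch: decide whether a node drain looks safe, given the
# pods running on it. Real logic would come from the Kubernetes API
# (pod conditions, ownerReferences, PodDisruptionBudgets).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pod:
    name: str
    ready: bool                                      # aggregate Ready condition
    owner_kinds: List[str] = field(default_factory=list)  # e.g. ["ReplicaSet"]

def safe_to_drain(pods: List[Pod], evict_bare_pods: bool = False) -> bool:
    """Return True only if every workload on the node looks safe to disrupt."""
    for pod in pods:
        # A bare pod (no controlling owner) would not be rescheduled
        # elsewhere, so block the drain unless policy says to evict it.
        if not pod.owner_kinds and not evict_bare_pods:
            return False
        # An unhealthy replica suggests the service may not remain
        # available elsewhere once this copy is terminated.
        if not pod.ready:
            return False
    return True
```

For example, a node running only healthy ReplicaSet-managed pods would pass the check, while a node hosting an unmanaged transient pod would fail it unless `evict_bare_pods` is enabled.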