Coordinate node readiness with pod workload checks and health reporting #13
Is this what a regular
Thanks for the pointer, I think you're right that disruptions are what this'll end up being/using. We'll strongly consider using, and maybe only using, the native tools for handling disruptions to nodes as we design our integration. If we can use existing tunables/configurables, then we will. Using what's already there, what's already battle-tested, is ideal and preferred over introducing unnecessary complexity and configuration in order to reimplement functionality.
Fyi, the most relevant prior art is probably the CoreOS Container Linux update operator: https://github.com/coreos/container-linux-update-operator Also WeaveWorks' kured: https://github.com/weaveworks/kured - not tied to any specific OS.
In my opinion, the CoreOS operator in particular had the largest influence on the MVP design that we see here today. Before we began, we looked at the current set of operators for inspiration and any known pitfalls (including Weaveworks' kured). We were inspired by the community's prior art, and I think our implementation shows that in a few ways (for example, the operator's annotations were inspired by, but differ from, CLUO's).

There are features, designs, and posted ideas that have intentionally left room for improvement, to let user input drive their shape and to gather the underpinning use cases. I suspect that many user-informed designs (now and in the future) will overlap with ones we, as a team, have shared or discussed amongst ourselves, like a configurable update policy (which we discussed above 👍). I also expect that this combination of user asks, provided use cases, and our project's tenets will drive us to implement richer and richer feature sets over time. I've been throwing ideas around with the team for some time, but they've tended to require an understanding of how and what folks want the update operator to do. The more users find and share new ways the update operator can work for them, the better we can, together, equip it to do so!
Thanks for calling this one out. This is a well-defined feature that likely has wide-ranging use cases backing it. I think we'd consider something like this for implementation in the Bottlerocket update operator. Of course, before we move on it, we'd like to understand what those existing use cases are (in Container Linux) and, importantly, how they change when accounting for Bottlerocket's design (for example, being isolated from the OS). If anyone wants to talk further about reboot checks, please cut a new issue to discuss them there. I think that topic is related to this issue, but it is a sub-topic with its own analysis required.
#12 seems like a good place to discuss this.
We now fully respect PodDisruptionBudgets, but we do allow Brupop to consider a host successfully updated even when there's no guarantee that other workloads are succeeding. Logically, at least the agent must come back up before the node can complete. I think the rest of this falls under the scope of #12, so I'll close this ticket for now and track the remaining items there.
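For readers following along, the PodDisruptionBudget mentioned here is the native Kubernetes object for constraining voluntary evictions during drains. A minimal sketch (the name, replica count, and label are hypothetical, not from this project) might look like:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb          # hypothetical name
spec:
  minAvailable: 2        # keep at least two replicas up during a node drain
  selector:
    matchLabels:
      app: web           # hypothetical label selecting the protected workload
```

With such a budget in place, an eviction-based drain (as performed when respecting PDBs) will refuse to evict a pod if doing so would leave fewer than `minAvailable` replicas running.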
What I'd like:
Dogswatch should check on the status of the Pod workloads currently running on a Node before considering an update to be possible. The Controller should verify that the Pods about to be terminated are in a healthy state, and that the impacted service will remain available elsewhere in the cluster, prior to removing the workload from a Node.
Ideally, it would be configurable how to handle the termination of transient Pods that are not controlled by higher-level schedulers (the likes of ReplicaSets or Deployments). I'm sure many other configurables could come into play in this particular critical path; thought should be given to how this could be extended to handle additional considerations.
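The check requested above could be sketched roughly as follows. This is only an illustration of the decision logic, not the operator's actual code; the `Pod` shape, the `evict_bare_pods` flag, and the function name are all hypothetical.

```python
# Hypothetical sketch: decide whether a node drain looks safe, given the
# pods running on it. Real logic would come from the Kubernetes API
# (pod conditions, ownerReferences, PodDisruptionBudgets).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pod:
    name: str
    ready: bool                                      # aggregate Ready condition
    owner_kinds: List[str] = field(default_factory=list)  # e.g. ["ReplicaSet"]

def safe_to_drain(pods: List[Pod], evict_bare_pods: bool = False) -> bool:
    """Return True only if every workload on the node looks safe to disrupt."""
    for pod in pods:
        # A bare pod (no controlling owner) would not be rescheduled
        # elsewhere, so block the drain unless policy says to evict it.
        if not pod.owner_kinds and not evict_bare_pods:
            return False
        # An unhealthy replica suggests the service may not remain
        # available elsewhere once this copy is terminated.
        if not pod.ready:
            return False
    return True
```

For example, a node running only healthy ReplicaSet-managed pods would pass the check, while a node hosting an unmanaged transient pod would fail it unless `evict_bare_pods` is enabled.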