
Kubernetes Update Operator #241

Open

mitchellmaler opened this issue Aug 2, 2019 · 16 comments

Comments

@mitchellmaler

Currently on CoreOS Container Linux we use the Container Linux Update Operator to orchestrate the updates (restarts) of our Kubernetes cluster nodes, based on its configuration and an agent integrating with locksmith. Will there be an equivalent for Fedora CoreOS that can be deployed to a Kubernetes cluster and work with Zincati to orchestrate updates?

I noticed the airlock project, which can run as a container but needs to connect to an etcd3 server (or cluster). While running under Kubernetes we already have etcd nodes, but policy prevents us from giving access to those. Does this mean we are required to run another etcd cluster just for updates, or is it possible to make use of Kubernetes objects to orchestrate the updates using an operator?

@lucab
Contributor

lucab commented Aug 2, 2019

All of these questions are spot on, but many pieces are still moving, so I'll try to give an overview of the current state (which may change soonish).
Please do note that this could fit into a larger discussion around k8s-updates / config-management / machine-config-operator, but I'll keep the scope of this ticket to "FCOS update-reboot orchestration on k8s" only, on purpose.

For reference, the historical decisions behind this are recorded at #3.

Does this mean we are required to run another etcd cluster just for updates, or is it possible to make use of Kubernetes objects to orchestrate the updates using an operator?

That isn't the intended usage, no. The scope of airlock is just to replace the same logic from locksmith, which only supported etcd as a distributed backend. The use case is for machines that already have direct access to an etcd cluster, likely without any access to the objects of a higher-level orchestrator. If you have to deploy an etcd cluster just for airlock, then there are better options to consider.

Will there be an equivalent for Fedora CoreOS that can be deployed to a Kubernetes cluster and work with Zincati to orchestrate updates?

That's the idea, yes. But we don't plan to write orchestrators for each possible backend on our own, nor shove all of those into airlock.
Instead, the plan is to stabilize the HTTPS-based protocol that Zincati uses, so that the reboot-manager can run in a separate container and its implementation can be swapped to support other backends.
Within this context, each community with a common interest can maintain its own containerized manager, decoupled from the OS and from other backends/implementations.

As of this date, we are still stabilizing the basics of auto-updates, so fleet-wide orchestration is still on the development radar. The protocol is currently drafted at coreos/airlock#1, while the client support in Zincati is tracked at coreos/zincati#37.
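
For a concrete feel of the drafted protocol, here is a minimal sketch of the client side in Go; the base URL and node ID are illustrative placeholders, and details may shift while the draft is under review:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Request body shape from the protocol draft: a unique node ID plus a
	// reboot group. Both values here are illustrative placeholders.
	body := []byte(`{"client_params": {"id": "node-uuid-here", "group": "default"}}`)

	req, err := http.NewRequest(http.MethodPost, "http://lock-manager.example/v1/pre-reboot", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	// The draft gates the endpoints behind this header so that plain
	// browser requests cannot trip them by accident.
	req.Header.Set("fleet-lock-protocol", "true")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// 200 means the reboot slot was granted; anything else means
	// "slot taken, retry later".
	fmt.Println("lock granted:", resp.StatusCode == http.StatusOK)
}
```

A non-200 response simply means the reboot slot is taken, and the client is expected to back off and retry later.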

@mitchellmaler
Author

@lucab Thanks for the overview! I am glad there will be similar functionality in the future.

@LorbusChris
Contributor

Right now in Red Hat OpenShift we have the machine-config-operator (MCO) for this.
In the initial release of OKD4 it will handle the FCOS updates instead of the airlock/Zincati duo that usually does it on FCOS, using a slightly different update-payload delivery mechanism (i.e. os-container, a.k.a. container-embedded ostree, vs. the usual rpm-ostree commit). We will do our best to abstract away the interfaces for those controllers and make them replaceable/pluggable (in a way that would allow Zincati/airlock to control how the MCO/the cluster does things).

@MPV

MPV commented Feb 6, 2020

@lucab coreos/zincati#37 now seems closed; would you be open to sharing what the current state is? 😍

@lucab
Contributor

lucab commented Feb 6, 2020

Related inquiry: coreos/zincati#214.

@lucab
Contributor

lucab commented Feb 6, 2020

@MPV I've left a few cross-links in place, so if you want to explore more, feel free to click through. However, below is a quick summary of the current status.

Circling back to my original reply, now we are basically at this point:

[...] the reboot-manager can run in a separate container and its implementation can be swapped to support other backends.
Within this context, each community with a common interest can maintain its own containerized manager, decoupled from the OS and from other backends/implementations.

@schmitch

schmitch commented Feb 6, 2020

I don't get why airlock was built on etcd instead of k8s as a backing store. I think airlock should actually be configurable to use k8s locking mechanisms.

Edit: the question is also, what happens if airlock is only installed on one node and that node restarts? Does the lock still stand, or does the node retry until the airlock server is up again? If the latter is the case, it would probably be really simple to create a good k8s integration.

@lucab
Contributor

lucab commented Feb 6, 2020

I don't get why airlock was built on etcd instead of k8s as a backing store.

This is recorded, with actual historical details and technical discussion, at #3; feel free to go through it. The TL;DR is "because it replaces the locksmith etcd strategy".

Also, please beware that the k8s API does not model a database with strongly consistent primitives (e.g. old HA clusters without "etcd quorum reads" do return stale reads).

I think airlock should actually be configurable to use k8s locking mechanisms.

That's understandable, but airlock's design scope explicitly does not cover it.
There are plenty of details to figure out (authentication, consistency, hooks, tolerations, draining, etc.), enough to warrant its own project by somebody intimately knowledgeable with k8s.
See the rest of the discussion about having dedicated containerized lock-managers.

The client->server protocol itself is documented at https://github.com/coreos/airlock/pull/1/files and designed to be easy to implement as a small web service on top of any consistent database.
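
To make that claim tangible, here is a minimal sketch (not airlock's actual code) of such a web service in Go; a single mutex-guarded in-memory slot stands in for the consistent database, and the endpoint names and request shape follow the protocol draft linked above:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
)

// clientParams mirrors the client_params object from the protocol draft.
type clientParams struct {
	ID    string `json:"id"`
	Group string `json:"group"`
}

type lockRequest struct {
	ClientParams clientParams `json:"client_params"`
}

// A single mutex-guarded slot stands in for a consistent database;
// a real manager would persist this in etcd or similar.
var (
	mu     sync.Mutex
	holder string
)

// decode validates the protocol header and extracts the client params.
func decode(r *http.Request) (clientParams, bool) {
	if r.Header.Get("fleet-lock-protocol") != "true" {
		return clientParams{}, false
	}
	var req lockRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.ClientParams.ID == "" {
		return clientParams{}, false
	}
	return req.ClientParams, true
}

// preReboot grants the reboot slot if it is free or already held by the caller.
func preReboot(w http.ResponseWriter, r *http.Request) {
	p, ok := decode(r)
	if !ok {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	mu.Lock()
	defer mu.Unlock()
	if holder == "" || holder == p.ID {
		holder = p.ID
		w.WriteHeader(http.StatusOK)
		return
	}
	// Any non-200 tells the client to retry later.
	http.Error(w, "reboot slot held by another node", http.StatusConflict)
}

// steadyState releases the slot once the node reports a healthy boot.
func steadyState(w http.ResponseWriter, r *http.Request) {
	p, ok := decode(r)
	if !ok {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	mu.Lock()
	defer mu.Unlock()
	if holder == p.ID {
		holder = ""
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/v1/pre-reboot", preReboot)
	http.HandleFunc("/v1/steady-state", steadyState)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A real manager would add authentication, per-group semaphores, and durable storage; the sketch only shows how thin the HTTP surface is.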

@schmitch

schmitch commented Feb 6, 2020

The PR actually points to a rough explanation, not to proper protocol documentation.

@mitchellmaler
Author

mitchellmaler commented Feb 6, 2020

Just saw this new project being worked on by Rancher: a more generic upgrade operator, not just Rancher-specific. I wonder if it could be enhanced to work with FCOS upgrades. It might even work as-is; I need to dig into it more.

https://github.com/rancher/system-upgrade-controller

@bgilbert
Contributor

@lukasmrtvy Excellent question! @lucab, do you know what that text is about?

@dustymabe
Member

Looks like that text was part of our announcement launch FAQ posted in June of 2018, so it may have been a little misguided or incorrect in retrospect.

@lucab
Contributor

lucab commented Aug 12, 2020

Bunch of updates:

@dghubble
Member

https://github.com/poseidon/fleetlock implements Zincati's FleetLock protocol on Kubernetes. It's small, nothing fancy (no drain).

@curantes

https://github.com/poseidon/fleetlock implements Zincati's FleetLock protocol on Kubernetes. It's small, nothing fancy (no drain).

It actually has drain support now.
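
For anyone wiring this up: Zincati is pointed at such a manager through its fleet_lock update strategy. A minimal config dropin looks roughly like this (the base_url is illustrative and must point at your in-cluster lock-manager service):

```toml
# /etc/zincati/config.d/55-updates-strategy.toml
[updates]
strategy = "fleet_lock"

[updates.fleet_lock]
# FleetLock-compatible lock manager, e.g. a poseidon/fleetlock Service.
base_url = "http://fleetlock.example.svc.cluster.local/"
```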
