
Kubernetes Update Operator #241

Open

mitchellmaler opened this issue Aug 2, 2019 · 16 comments

Comments

@mitchellmaler

Currently on CoreOS Container Linux we use the Container Linux Update Operator to orchestrate the updates (restarts) of our Kubernetes cluster nodes, based on its configuration and an agent integrating with locksmith. Will there be an equivalent for Fedora CoreOS that can be deployed to a Kubernetes cluster and work with Zincati to orchestrate updates?

I noticed the airlock project, which can run as a container but needs to connect to an etcd3 server (or cluster). While running under Kubernetes we already have etcd nodes, but policy prevents us from giving access to those. Does this mean we are required to run another etcd cluster just for updates, or is it possible to make use of Kubernetes objects to orchestrate the updates using an operator?

@lucab
Contributor

lucab commented Aug 2, 2019

All of these questions are spot on, but many pieces are still moving, so I'll try to give an overview of the current state (which may change soonish).
Please do note that this could fit into a larger discussion around k8s-updates / config-management / machine-config-operator, but I'll keep the scope of this ticket to "FCOS update-reboot orchestration on k8s" only, on purpose.

For reference, the historical decisions behind this are recorded at #3.

Does this mean we are required to run another etcd cluster just for updates, or is it possible to make use of Kubernetes objects to orchestrate the updates using an operator?

That isn't the intended usage, no. The scope of airlock is just to replace the same logic from locksmith, which only supported etcd as a distributed backend. The use case is for machines that already have direct access to an etcd cluster, likely without any access to the objects of a higher-level orchestrator. If you have to deploy an etcd cluster just for airlock, then there are better options to consider.

Will there be an equivalent for Fedora CoreOS that can be deployed to a Kubernetes cluster and work with Zincati to orchestrate updates?

That's the idea, yes. But we don't plan to write orchestrators for each possible backend on our own, nor shove all of those into airlock.
Instead, the plan is to stabilize the HTTPS-based protocol that Zincati uses, so that the reboot-manager can run in a separate container and its implementation can be swapped to support other backends.
Within this context, each community with a common interest can maintain its own containerized manager, decoupled from the OS and from other backends/implementations.

As of this date, we are still stabilizing the basics of auto-updates, so fleet-wide orchestration is still on the development radar. The protocol is currently drafted at coreos/airlock#1, while the client support in Zincati is tracked at coreos/zincati#37.
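
For a concrete feel of the drafted protocol, here is a minimal sketch of the client side in Go; the base URL and node ID are illustrative placeholders, and details may shift while the draft is under review:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Request body shape from the protocol draft: a unique node ID plus a
	// reboot group. Both values here are illustrative placeholders.
	body := []byte(`{"client_params": {"id": "node-uuid-here", "group": "default"}}`)

	req, err := http.NewRequest(http.MethodPost, "http://lock-manager.example/v1/pre-reboot", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	// The draft gates the endpoints behind this header so that plain
	// browser requests cannot trip them by accident.
	req.Header.Set("fleet-lock-protocol", "true")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// 200 means the reboot slot was granted; anything else means
	// "slot taken, retry later".
	fmt.Println("lock granted:", resp.StatusCode == http.StatusOK)
}
```

A non-200 response simply means the reboot slot is taken, and the client is expected to back off and retry later.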

@mitchellmaler
Author

@lucab Thanks for the overview! I am glad there will be similar functionality in the future.

@LorbusChris
Contributor

Right now in Red Hat OpenShift we have the machine-config-operator (MCO) for this.
In the initial release of OKD4 it will handle the FCOS updates instead of the airlock/Zincati duo that usually does it on FCOS, using a slightly different update-payload delivery mechanism (i.e. os-container, a.k.a. container-embedded ostree, vs. the usual rpm-ostree commit). We will do our best to abstract away the interfaces for those controllers and make them replaceable/pluggable (in a way that would allow Zincati/airlock to control how the MCO/the cluster does things).

@MPV

MPV commented Feb 6, 2020

@lucab coreos/zincati#37 now seems closed; would you be open to sharing what the current state is? 😍

@lucab
Contributor

lucab commented Feb 6, 2020

Related inquiry: coreos/zincati#214.

@lucab
Contributor

lucab commented Feb 6, 2020

@MPV I've left a few cross-links in place, so if you want to explore more, feel free to click through. However, below is a quick summary of the current status.

Circling back to my original reply, now we are basically at this point:

[...] the reboot-manager can run in a separate container and its implementation can be swapped to support other backends.
Within this context, each community with a common interest can maintain its own containerized manager, decoupled from the OS and from other backends/implementations.

@schmitch

schmitch commented Feb 6, 2020

I don't get why airlock was built on etcd instead of k8s as a backing store. I think airlock should actually be configurable to use k8s locking mechanisms.

Edit: the question is also, what happens if airlock is only installed on one node and that node restarts? Does the lock still stand, or does the node retry until the airlock server is up again? If the latter is the case, it would probably be really simple to create a good k8s integration.

@lucab
Contributor

lucab commented Feb 6, 2020

I don't get why airlock was built on etcd instead of k8s as a backing store.

This is recorded, with actual historical details and technical discussion, at #3; feel free to go through it. The TL;DR is "because it replaces the locksmith etcd strategy".

Also, please beware that the k8s API does not model a database with strongly consistent primitives (e.g. old HA clusters without "etcd quorum reads" do return stale reads).

I think airlock should actually be configurable to use k8s locking mechanisms.

That's understandable, but airlock's design scope explicitly does not cover it.
There are plenty of details to figure out (authentication, consistency, hooks, tolerations, draining, etc.), enough to warrant its own project by somebody intimately knowledgeable with k8s.
See the rest of the discussion about having dedicated containerized lock-managers.

The client->server protocol itself is documented at https://github.com/coreos/airlock/pull/1/files and designed to be easy to implement as a small web service on top of any consistent database.
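
To make that claim tangible, here is a minimal sketch (not airlock's actual code) of such a web service in Go; a single mutex-guarded in-memory slot stands in for the consistent database, and the endpoint names and request shape follow the protocol draft linked above:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
)

// clientParams mirrors the client_params object from the protocol draft.
type clientParams struct {
	ID    string `json:"id"`
	Group string `json:"group"`
}

type lockRequest struct {
	ClientParams clientParams `json:"client_params"`
}

// A single mutex-guarded slot stands in for a consistent database;
// a real manager would persist this in etcd or similar.
var (
	mu     sync.Mutex
	holder string
)

// decode validates the protocol header and extracts the client params.
func decode(r *http.Request) (clientParams, bool) {
	if r.Header.Get("fleet-lock-protocol") != "true" {
		return clientParams{}, false
	}
	var req lockRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.ClientParams.ID == "" {
		return clientParams{}, false
	}
	return req.ClientParams, true
}

// preReboot grants the reboot slot if it is free or already held by the caller.
func preReboot(w http.ResponseWriter, r *http.Request) {
	p, ok := decode(r)
	if !ok {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	mu.Lock()
	defer mu.Unlock()
	if holder == "" || holder == p.ID {
		holder = p.ID
		w.WriteHeader(http.StatusOK)
		return
	}
	// Any non-200 tells the client to retry later.
	http.Error(w, "reboot slot held by another node", http.StatusConflict)
}

// steadyState releases the slot once the node reports a healthy boot.
func steadyState(w http.ResponseWriter, r *http.Request) {
	p, ok := decode(r)
	if !ok {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	mu.Lock()
	defer mu.Unlock()
	if holder == p.ID {
		holder = ""
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/v1/pre-reboot", preReboot)
	http.HandleFunc("/v1/steady-state", steadyState)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A real manager would add authentication, per-group semaphores, and durable storage; the sketch only shows how thin the HTTP surface is.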

@schmitch

schmitch commented Feb 6, 2020

The PR actually points to a rough explanation, not to proper protocol documentation.

@mitchellmaler
Author

mitchellmaler commented Feb 6, 2020

Just saw this new project being worked on by Rancher: a more generic upgrade operator, not just Rancher-specific. I wonder if it could be enhanced to work with FCOS upgrades. It might even work as-is; I need to dig into it more.

https://github.com/rancher/system-upgrade-controller

@bgilbert
Contributor

@lukasmrtvy Excellent question! @lucab, do you know what that text is about?

@dustymabe
Member

Looks like that text was part of our announcement launch FAQ posted in June of 2018, so it may have been a little misguided or incorrect in retrospect.

@lucab
Contributor

lucab commented Aug 12, 2020

Bunch of updates:

@dghubble
Member

https://github.com/poseidon/fleetlock implements Zincati's FleetLock protocol on Kubernetes. It's small, nothing fancy (no drain).

@curantes

https://github.com/poseidon/fleetlock implements Zincati's FleetLock protocol on Kubernetes. It's small, nothing fancy (no drain).

It actually has drain support now.
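
For anyone wiring this up: Zincati is pointed at such a manager through its fleet_lock update strategy. A minimal config dropin looks roughly like this (the base_url is illustrative and must point at your in-cluster lock-manager service):

```toml
# /etc/zincati/config.d/55-updates-strategy.toml
[updates]
strategy = "fleet_lock"

[updates.fleet_lock]
# FleetLock-compatible lock manager, e.g. a poseidon/fleetlock Service.
base_url = "http://fleetlock.example.svc.cluster.local/"
```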
