kubeadm operator #1239
Conversation
I really like the overall design. Thank you @fabriziopandini!
Some comments, with two worries: how containerized workloads can perform critical operations (e.g. a kubelet restart) and report their status, and how to perform certain operations, like package upgrades, on the base OS.
> ```yaml
>   alphaCertsRenew: {}
> ```
>
> It is important to notice that the `Action controller` is going to always generate `TaskGroup` in sequential order - only after the current `TaskGroup` is completed successfully a new one will be generated.
What happens if the current `TaskGroup` does not succeed? Assuming an upgrade, if I understood correctly:
1. A `TaskGroup` is created for control plane nodes (`node-role.kubernetes.io/master: ""`).
2. When 1. has succeeded, a `TaskGroup` is created for worker nodes (unclear how they'll be targeted if not explicitly labeled -- with a `matchExpression` maybe?).

If 1. fails, will it be retried? Do you foresee a retry count limit? Will the `TaskGroup` eventually be removed?
This is an interesting question to me too. I would bet that all tasks have to be completed successfully here, to preserve the state of the cluster.
> What happens if the current TaskGroup does not succeed?

The current working assumption is to block the execution immediately after a task fails, and not to take any initiative automatically. I'm working on a possible UX that lets the user choose whether to retry or skip the failed task.
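For illustration only, one possible shape for that knob, expressed as a field on the `Task` spec; the field name and values below are assumptions made for this thread, not part of the KEP:

```yaml
# Hypothetical per-Task error-handling field; the name "onFailure" and its
# values are illustrative only and not part of the KEP.
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: Task
metadata:
  name: upgrade-kubelet-node-1
spec:
  # What the operator should do once this Task has failed:
  # Pause (wait for the user to decide), Retry, or Skip.
  onFailure: Pause
```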
> unclear how they'll be targeted if not explicitly labeled -- with a matchExpression maybe

`matchExpression` is an option, even if not currently supported by controller-runtime.
I'm also not opposed to requiring additional labels for the operator (maybe only during alpha).
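For reference, the standard Kubernetes label-selector form of that idea, targeting every node that does not carry the control-plane label; whether `TaskGroup` would expose exactly this field layout is an assumption:

```yaml
# Standard label-selector syntax; the surrounding nodeSelector field on a
# hypothetical TaskGroup spec is an assumption, not part of the KEP.
spec:
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist
```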
Still, we can try to do away with the Task objects (possibly even with Task Groups) and leverage DaemonSets and Jobs, where part of these problems has already been solved and kubectl is your friend.
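For context, roughly what a bare Job pinned to a single node could look like if we went that way; the image, command, and field choices below are placeholders, not something proposed in the KEP:

```yaml
# Hypothetical node-pinned Job running a kubeadm step; all names are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: upgrade-node-1
spec:
  backoffLimit: 0            # fail fast instead of silently retrying
  template:
    spec:
      nodeName: node-1       # pin the pod to the node being operated on
      hostPID: true          # interact with host processes, e.g. the kubelet
      restartPolicy: Never
      containers:
        - name: kubeadm
          image: example.com/kubeadm-agent:v0   # placeholder image
          securityContext:
            privileged: true
          command: ["kubeadm", "upgrade", "node"]
```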
I have described the pros and cons of using bare DaemonSets and Jobs in the alternatives paragraphs, and TBH I don't think that going down this path is a good idea, because we will be stuck as soon as we have a small requirement that deviates from the DaemonSet and Job abstractions. And as far as I am concerned, we are already in that situation for the operability requirements (e.g. dry-run, pause, error management).
[1]
Also a generic question, during the regular
How do you envision mainly 3. (upgrading a package on the base OS) from a containerized workload? Also, 4. is a little tricky, but I guess using the systemd D-Bus interface to restart the service should be good enough to be performed from within the container. Do you envision another way of doing this?
@ereslibre thanks for the feedback! I'm looking forward to getting you onboard for this effort!
My priority now is the UX/the semantics that will drive the task orchestration, and once we get agreement on this we can focus on Task implementation. WRT the kubeadm/kubectl/kubelet binary upgrade, I agree this is tricky (or trickier than invoking kubeadm commands). Up to now I have only some working assumptions (e.g. replace the binary vs using apt/yum packages), but I'm confident we can get something in place when the time comes 😉
Thanks @fabriziopandini !
I do like where this is going, but I also suspect that we can easily end up having to deal with Linux distro-specific problems that we were previously avoiding in kubeadm. One such problem is different package managers.
> ![](20190916-kubeadm-operator.png)
>
> #### Action
>
> The first step of the sequence above is the user asking the kubeadm operator to perform an `Action`; `Action` is a new CRD that defines actions performed by the kubeadm operator in a high-level, end-user oriented format e.g.
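To make the thread below easier to follow, here is a rough sketch of what such an `Action` manifest could look like, assembled only from the fragments quoted in this conversation (`apiVersion`, `kind: Action`, `alphaCertsRenew: {}`); the overall spec layout is an assumption, not the final KEP schema:

```yaml
# Illustrative Action; only apiVersion, kind, and alphaCertsRenew come from the
# quoted KEP fragments, the rest of the layout is an assumption.
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: Action
metadata:
  name: renew-certificates
spec:
  alphaCertsRenew: {}
```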
I can see your point here. I am looking at this from the standard Kubernetes point of view, where a typical object describes a "desired state" to which controllers strive to keep as close as possible. Hence, an `Action` looks a bit uncanny to me at first.
For example, what is the lifespan of a typical upgrade action, like the example below? Is it going to be deleted after a successful upgrade? If so, is it going to be done by users or the task controller?
What is the lifespan of a typical one-off task, such as certificate renewal?
Well, this is really an interesting question.
When initially thinking about the operator, I was thinking of making the operator reconcile changes on the `ClusterConfiguration` object, assuming this as a description of the "desired state".
However, it quickly came to me that detecting changes to the `ClusterConfiguration` object and mapping such changes to the kubeadm workflows is far from trivial, and TBH, also a huge paradigm shift from what we are doing today.
Are we, as kubeadm maintainers, able to figure out all the knobs of this evolution?
Most importantly, are the kubeadm users ready for moving from today's "full control" to "total trust" in the kubeadm operator?
This is why I decided to reduce the scope and focus only on the part of the operator that orchestrates kubeadm workflows, the Actions.
And this is also the reason why I'm assuming that the operator should never automatically delete Action or Task objects, because outstanding observability of what is happening/what happened is a must for getting users to trust the operator.
Just to be clear, as soon as the orchestration layer of the operator is in place and reliable, I'm totally OK with moving forward and starting to design a higher-level controller that automatically generates actions according to changes to the `ClusterConfiguration` object.
I just feel we should get there step by step.
[1] After giving this more thought, I think we will end up implementing bits of the Job + TTL controller if we want cleanup (which is an alpha feature in Kubernetes IIRC).
I'd prefer us to refine the flows (short-lived, long-lived, short-lived with cleanup, etc.) and see.
This is a super interesting thought. I have to say that solving the original problem outlined is more appealing to me. Detecting changes in `ClusterConfiguration` and figuring out how to bring the cluster into that state feels a lot more like declarative management, and more Kubernetes-like, than an `Action` as a CRD that can run once and only once (on success). Since we are not defining a final state within the action, we have no way of knowing when we get there without running an action, checking some state, and then what happens? The action must be marked as successful or unsuccessful. Then we would need to know if it was successful, or unsuccessful and retrying, or unsuccessful and erroring. The logic of this operator is now diffing an Action and a ClusterConfiguration to determine the Action's status. This diff does not seem very different from diffing a ClusterConfiguration against a ClusterConfiguration, except it's a less direct comparison.
I see the path outlined here, but I'm not convinced actions need to be stateful. I'd say that actions, in the fullness of time, would end up being events that are calculated from the diff of a cluster configuration. Many actions may be run while the kubeadm-operator is trying to reconcile the system, just like the TaskGroup you have outlined in the document.
Indeed, I can see all of your points here.
My personal preference is more on the declarative approach (hence actions are a bit uncanny to me). Diffing the kubeadm ClusterConfiguration in its entirety can be tricky, and defining an action for each changed field is probably a doomed effort.
What I think we should try to do is start with an operator-specific `ClusterConfiguration`. That is, something along the lines of:
```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: ClusterConfiguration
metadata:
  name: my-cluster
kubernetesVersion: 1.20.0
certificateRotationCount: 3
```
As you can see, this is not the kubeadm `ClusterConfiguration`, but an operator-specific one. It contains only fields that we know how to tackle should they get changed. Some of the fields may not even be present in the kubeadm `ClusterConfiguration` (in this example, `certificateRotationCount`).
That way we still end up with a declarative approach and a narrow set of actions for a start.
In addition to this, I do have the same concern as @yastij, that we may end up re-implementing large chunks of the Job and/or DaemonSet controllers.
Ah, so a wrapper on top of a cluster configuration? My expectation here is that, when this operator is installed, it will generate the object that reflects the state of the current configuration. Is that right? Then, a modification to this object will result in some actions on the cluster in an attempt to get the cluster to match the supplied configuration.
I think that's a reasonable way to build up to all the features we'd need to support and I would be in favor of this approach.
> Ah so a wrapper on top of a cluster configuration?

Depends on what you call a "wrapper on top of a cluster configuration". The full `kubeadm.ClusterConfiguration` doesn't need to be exposed, only a subset of its fields. The new `operator.ClusterConfiguration` should also have some fields to deal with concepts that are not reflected in the `kubeadm.ClusterConfiguration`. Those concepts are usually backed by kubeadm actions; certificate rotation is one such thing.
For example, the above proposition is roughly going to match the following struct:
```go
type ClusterConfiguration struct {
	TypeMeta
	// Here, ObjectMeta.Name is going to correspond to kubeadmapiv1beta2.ClusterConfiguration.ClusterName
	ObjectMeta

	// 1:1 correspondence to kubeadmapiv1beta2.ClusterConfiguration.KubernetesVersion.
	// A change here is going to trigger an upgrade on the machines to the target version.
	KubernetesVersion string

	// A new field, not present in kubeadmapiv1beta2.ClusterConfiguration.
	// This is a number that can only be bumped up. A change to this will trigger certificate rotation.
	CertificateRotationCount int
}
```
> My expectation here is that when this operator is installed it will generate the object that reflects the state of the current configuration. Is that right? Then, a modification to this object will result in some actions on the cluster in an attempt to get the cluster to match the supplied configuration.
That's my vision of where things should be going too.
If we're tossing out ideas, I know this has no relevance to the PR anymore, but it would be cool if the object had a field like `CertificateExpiration` that showed you when the certificates were going to expire. Then you could set the field to a future date and your certs would rotate to get a new expiration date that matched your desired expiration date, perhaps with an `autoRenew: bool` field that always or never renewed your certificates.
I recognize these behaviors can all be expressed with `Action`s, however.
@chuckha @rosti @yastij
Point taken.
Please consider that we are approaching this incrementally, and my suggestion is to initially focus on the orchestration layer. In the meantime, we can discuss how to best approach the higher-level reconciliation loop that will be implemented in the next iterations.
Force-pushed from 40297e3 to 7c90f9a.
First section reviewed! I'm really excited for this work :D
To clarify, the goal of my review was to make sure the language was decisive and clear to ensure that we agree on what is being solved. Please feel free to reword or ignore my comments at your discretion.
I left a concern regarding the approach.
The proposal should be clear about which use cases are solved. For example, Story 2 is not solved with Actions/Task/TaskGroups. I would say it is systematic but I would not call Actions declarative.
Poke me when you want me to go through in detail and we can walk through final comments during office hours. /cc @dlipovetsky Perhaps we can add an event hook for lifecycle updates to make the integration easier.
Thanks a lot for the KEP! I'm still reading through it, but had a question around kubeadm/kubectl/kubelet binary upgrades, which are in scope. If the binary has dependencies that need to be upgraded, is upgrading the dependencies in scope? I suppose that, if system packages are used, then those dependencies will be upgraded as well.
Force-pushed from fb07318 to f76e3ee.
That's an interesting question to me too. I suppose that the best way to go is to use the distro-specific package manager to do this. That way dependencies will be upgraded too. However, with so many distros and different package managers out there, it gets way more complicated than "systemd vs OpenRC".
As mentioned earlier, I think this is one of the most challenging matters in this proposal. Getting to modify the packages on the host from within a container is really tricky, let alone making that work for different distributions. PackageKit could be a generic solution because it exposes a D-Bus interface, and calling the host's D-Bus from within a container could be a solution, but I fear its interface won't be enough to ensure the states we want during the upgrade process.
The operator will get into a big mess with the Linux distro flavors. It needs a good abstraction layer to detect and support the package manager and init system. I foresee a big maintenance burden for us here...
The image-based approach taken by Cluster API has the upside of delegating this problem to whoever builds the image.
An alternative, I think, is to ensure the pre-flight checks are complete (there are some, but we may want to add more). At a high level, the checks would guarantee that, if any dependency is not present on a host, then the kubeadm operator takes no action on that host. In the short term, users will have to satisfy the checks themselves, but we will gain time to understand how the kubeadm operator can itself satisfy them.
OK, so from a high level I think we need to be super careful about the type of API we want to support, and TBH I think that is almost impossible to define up-front. I'd almost prefer to remove sections of the current incantation, merge as provisional, and create a POC.
I'd ideally like to solicit feedback from @kubernetes/api-reviewers on recommendations given the user stories. The user stories are solid, the API... I'm not sold.
> ```yaml
> apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
> kind: Action
> ```
Gonna be honest, I don't know if I like this.
Point taken.
It is a declared objective of the first iteration to discover good semantics for expressing operations. Any suggestion is more than welcome!
From that group, @smarterclayton and @deads2k probably have the most practical experience with front-end APIs driving operators specifically.
Force-pushed from f76e3ee to dba9969.
Let's discuss this proposal during this week's office hours and get this unblocked so folks can work on a PoC.
/lgtm
/approve
So long as we use our learnings to create a more declarative API in the long term, I'm fine with this.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: fabriziopandini, timothysc.
> - certificate authority rotation (NEW)
> - change configuration in an existing cluster (NEW)
>
> > [1] Please note that we are referring to "in place" mutations of kubeadm generated artifacts in order to highlight
CAPI recently merged tighter integration w/ cert-manager - I'm wondering if this footnote is outdated?
At first sight, I don't see a relation between cert-manager and the kubeadm operator, but I prefer to double-check. Can you provide a link?
This PR adds a KEP for defining the kubeadm operator, a tool for enabling declarative control of kubeadm workflows, automating the execution and the orchestration of workflow tasks across existing nodes in a cluster.
All the comments from hackmd are already addressed
@kubernetes/sig-cluster-lifecycle-pr-reviews
/sig cluster-lifecycle
/area kubeadm
/cc @neolit123
/cc @rosti
/cc @ereslibre
/cc @detiber
/cc @vincepri
/cc @yastij
/cc @chuckha
/assign @timothysc
/assign @luxas