
kubeadm operator #1239

Merged 1 commit into kubernetes:master on Oct 7, 2019

Conversation

fabriziopandini
Member

This PR adds a KEP for defining the kubeadm operator, a tool for enabling declarative control of kubeadm workflows, automating the execution and the orchestration of workflow tasks across existing nodes in a cluster.

All the comments from hackmd are already addressed

@kubernetes/sig-cluster-lifecycle-pr-reviews

/sig cluster-lifecycle
/area kubeadm

/cc @neolit123
/cc @rosti
/cc @ereslibre
/cc @detiber
/cc @vincepri
/cc @yastij
/cc @chuckha

/assign @timothysc
/assign @luxas

@k8s-ci-robot
Contributor

@fabriziopandini: The label(s) area/kubeadm cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the sig/cluster-lifecycle label on Sep 16, 2019
@k8s-ci-robot added the cncf-cla: yes, size/L, and kind/kep labels on Sep 16, 2019
@ereslibre ereslibre left a comment

I really like the overall design. Thank you @fabriziopandini!

Some comments, with two worries: when containerized workloads perform some critical operations (e.g. kubelet restart), and how to get their status; and how to perform certain operations like package upgrade on the base OS.

alphaCertsRenew: {}
```

It is important to notice that the `Action controller` always generates `TaskGroup` objects in sequential order - a new `TaskGroup` will be generated only after the current one has completed successfully.


What happens if the current TaskGroup does not succeed? Assuming an upgrade, if I understood correctly:

  1. TaskGroup created for control plane nodes (node-role.kubernetes.io/master: "")
  2. When 1. has succeeded, TaskGroup created for worker nodes (unclear how they'll be targeted if not explicitly labeled -- with a matchExpression maybe?)

If 1. fails, will it be retried? Do you foresee a retry count limit? Will the TaskGroup eventually be removed?

Contributor

This is an interesting question to me too. I would bet that all tasks have to complete successfully here, to preserve the state of the cluster.

Member Author

@fabriziopandini fabriziopandini Sep 17, 2019

What happens if the current TaskGroup does not succeed?

The current working assumption is to block execution immediately after a task fails, and not to take any initiative automatically. I'm working on a possible UX to let the user choose whether to retry or skip the failed task.

unclear how they'll be targeted if not explicitly labeled -- with a matchExpression maybe

matchExpression is an option, even if it is not currently supported by controller-runtime.
I also don't have a veto on requiring additional labels for the operator (it might be only during alpha).
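
For illustration only, targeting worker nodes could look something like the sketch below. The `TaskGroup` kind comes from the proposal, but the `spec.nodeSelector` shape and field names here are assumptions, not the KEP's actual schema:

```yaml
# Hypothetical sketch: selects every node that is NOT labeled as a control-plane/master
# node. The spec.nodeSelector field shown here is an assumption, not the KEP's schema.
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: TaskGroup
metadata:
  name: upgrade-workers
spec:
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist
```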

Contributor

Still, we can try to do away with the Task objects (possibly even with TaskGroups) and leverage DaemonSets and Jobs, where part of these problems has already been solved and kubectl is your friend.

Member Author

@fabriziopandini fabriziopandini Sep 17, 2019

I have described the pros and cons of using bare DaemonSets and Jobs in the alternatives paragraphs, and TBH I don't think that going down this path is a good idea, because we will be stuck as soon as we have a small requirement that deviates from the DaemonSet and Job abstractions. And as far as I am concerned, we are already in this situation for operability requirements (e.g. dry-run, pause, error management).

Member

[1]

@ereslibre

ereslibre commented Sep 16, 2019

Also a generic question, during the regular kubeadm upgrade process, we document:

  1. Upgrade kubeadm first
  2. Run the upgrade command
  3. Upgrade kubelet and kubectl
  4. Restart kubelet

How do you envision mainly 3. (upgrading a package on the base OS) being done from a containerized workload?

Also, 4. is a little tricky, but I guess using the systemd D-Bus interface to restart the service should be good enough to perform from within the container. Do you envision another way of doing this?
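
For reference, one commonly used pattern for this kind of node-level operation (a sketch under assumptions, not something the KEP prescribes) is a privileged, hostPID pod that uses nsenter to run systemctl in the host's namespaces; the D-Bus route mentioned above would instead mount the host's system bus socket into the container:

```yaml
# Sketch only: a privileged hostPID pod that restarts the kubelet unit on the host.
# Pod and node names are hypothetical; any image that ships nsenter (util-linux) works.
apiVersion: v1
kind: Pod
metadata:
  name: restart-kubelet
spec:
  nodeName: worker-0        # hypothetical target node
  hostPID: true             # so that --target=1 below refers to the host's PID 1 (systemd)
  restartPolicy: Never
  containers:
    - name: restart
      image: ubuntu:18.04   # assumption: base image includes nsenter
      securityContext:
        privileged: true    # required to enter the host's namespaces
      command:
        - nsenter
        - --target=1
        - --mount
        - --uts
        - --ipc
        - --net
        - --pid
        - --
        - systemctl
        - restart
        - kubelet
```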

@fabriziopandini
Member Author

@ereslibre thanks for the feedback! I'm looking forward to getting you onboard for this effort!

How do you envision mainly 3. (upgrading a package on the base OS) from a containerized workload?

My priority now is the UX and the semantics that will drive the task orchestration; once we get agreement on this, we can focus on the Task implementation.

WRT the kubeadm/kubectl/kubelet binary upgrade, I agree this is tricky (or at least trickier than invoking kubeadm commands).

Up to now I have only some working assumptions (e.g. replacing the binary vs using apt/yum packages), but I'm confident we can get something in place when the time comes 😉

Contributor

@rosti rosti left a comment

Thanks @fabriziopandini !
I do like where this is going, but I also suspect that we can easily end up having to deal with Linux distro-specific problems that we were previously avoiding in kubeadm. One such problem is the variety of package managers.

![](20190916-kubeadm-operator.png)

#### Action
The first step of the sequence above is the user asking the kubeadm operator to perform an `Action`; `Action` is a new CRD that defines actions performed by the kubeadm operator in a high-level, end-user oriented format e.g.
Contributor

I can see your point here. I am looking at this from the standard Kubernetes point of view, where a typical object describes a "desired state" to which controllers strive to keep as close as possible. Hence, an Action looks a bit uncanny to me at first.
For example, what is the lifespan of a typical upgrade action, like the example below? Is it going to be deleted after a successful upgrade? If so, is it going to be done by users or the task controller?
What is the lifespan of a typical one-off task, such as certificate renewal?

Member Author

Well this is really an interesting question.

When initially thinking about the operator, I was thinking of making the operator reconcile changes on the ClusterConfiguration object, assuming this as a description of the "desired state".

However, it quickly came to me that detecting changes to the ClusterConfiguration object and mapping such changes to the kubeadm workflows is far from trivial, and TBH, also a huge paradigm shift from what we are doing today.
Are we, as kubeadm maintainers, able to figure out all the knobs of this evolution?
Most importantly, are the kubeadm users ready to move from today's "full control" to "total trust" in the kubeadm operator?

This is why I decided to reduce the scope and focus only on the part of the operator that orchestrates kubeadm workflows, the Actions.

And this is also the reason why I'm assuming that the operator should never automatically delete Action or Task objects: outstanding observability of what is happening/what happened is a must for getting users to trust the operator.

Just to be clear, as soon as the orchestration layer of the operator is in place and reliable, I'm totally OK with moving forward and starting to design a higher-level controller that automatically generates actions according to changes to the ClusterConfiguration object.
I just feel we should get there step by step.

Member

@yastij yastij Sep 17, 2019

[1] After giving this more thought, I think we will end up implementing bits of the Jobs + TTL controller if we want cleanup (which is an alpha feature in Kubernetes IIRC).
I'd prefer us to refine the flows (short-lived, long-lived, short-lived with cleanup, etc.) and see.

cc @timothysc @neolit123

Contributor

This is a super interesting thought. I have to say that solving the original problem outlined is more appealing to me. Detecting changes in ClusterConfiguration and figuring out how to bring the cluster into that state feels a lot more like declarative management, and more Kubernetes-like, than an Action as a CRD that can run once and only once (on success). Since we are not defining a final state within the action, we have no way of knowing when we get there without running an action, checking some state, and then what happens? The action must be marked as successful or unsuccessful. Then we would need to know if it was successful, or unsuccessful and retrying, or unsuccessful and erroring. The logic of this operator is now diffing an Action and a ClusterConfiguration to determine the Action's status. This diff does not seem that different from diffing a ClusterConfiguration against a ClusterConfiguration, except that it's a less direct comparison.

I see the path outlined here, but I'm not convinced actions need to be stateful. I'd say that actions, in the fullness of time, would end up being events that are calculated from the diff of a cluster configuration. Many actions may be run while the kubeadm-operator is trying to reconcile the system, just like TaskGroup you have outlined in the document.

Contributor

@rosti rosti Sep 18, 2019

Indeed, I can see all of your points here.
My personal preference is more on the declarative approach (hence actions are a bit uncanny to me). Diffing the kubeadm ClusterConfiguration in its entirety can be tricky, and defining an action for each changed field is probably a doomed effort.
What I think we should try to do is start with an operator-specific ClusterConfiguration. That is, something along the lines of:

```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: ClusterConfiguration
metadata:
  name: my-cluster
kubernetesVersion: 1.20.0
certificateRotationCount: 3
```

As you can see, this is not the kubeadm ClusterConfiguration, but an operator-specific one. It contains only fields that we know how to tackle should they get changed. Some of the fields may not even be present in the kubeadm ClusterConfiguration (in this example, certificateRotationCount).
That way we still end up with a declarative approach and a narrow set of actions for a start.

In addition to this, I do have the same concern as @yastij, that we may end up re-implementing large chunks of the Job and/or DaemonSet controllers.

Contributor

Ah so a wrapper on top of a cluster configuration? My expectation here is that when this operator is installed it will generate the object that reflects the state of current configuration. Is that right? Then, a modification to this object will result in some actions on the cluster in an attempt to get the cluster to match the supplied configuration.

I think that's a reasonable way to build up to all the features we'd need to support and I would be in favor of this approach.

Contributor

@rosti rosti Sep 20, 2019

Ah so a wrapper on top of a cluster configuration?

Depends on what you call a "wrapper on top of a cluster configuration". The full kubeadm.ClusterConfiguration doesn't need to be exposed, only a subset of its fields. The new operator.ClusterConfiguration should also have some fields to deal with concepts that are not reflected in the kubeadm.ClusterConfiguration. Those concepts are usually backed by kubeadm actions; certificate rotation is one such thing.

For example, the above proposition is roughly going to match the following struct:

```go
type ClusterConfiguration struct {
	TypeMeta

	// Here, ObjectMeta.Name is going to correspond to kubeadmapiv1beta2.ClusterConfiguration.ClusterName
	ObjectMeta

	// 1:1 correspondence to kubeadmapiv1beta2.ClusterConfiguration.KubernetesVersion.
	// A change here is going to trigger an upgrade of the machines to the target version.
	KubernetesVersion string

	// A new field, not present in kubeadmapiv1beta2.ClusterConfiguration.
	// This is a number that can only be bumped up. A change to this will trigger certificate rotation.
	CertificateRotationCount int
}
```

My expectation here is that when this operator is installed it will generate the object that reflects the state of current configuration. Is that right? Then, a modification to this object will result in some actions on the cluster in an attempt to get the cluster to match the supplied configuration.

That's my vision of where things should be going too.

Contributor

@chuckha chuckha Sep 23, 2019

If we're tossing out ideas (I know this has no relevance to the PR anymore), it would be cool if the object had a field like CertificateExpiration that showed you when the certificates were going to expire. Then you could set the field to a future date and your certs would rotate to get a new expiration date that matched your desired expiration date, perhaps with an "autoRenew: bool" field that always or never renewed your certificates.

I recognize these behaviors can all be expressed with Actions, however.
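
Purely as an illustration of the idea above (the fields are hypothetical and not part of any proposed API), such an object could look like:

```yaml
# Hypothetical sketch of the idea above; none of these fields exist in the proposal.
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: ClusterConfiguration
metadata:
  name: my-cluster
kubernetesVersion: 1.20.0
certificateExpiration: "2020-09-23T00:00:00Z"  # move this to a later date to trigger rotation
autoRenew: false                               # or let the operator always renew before expiry
```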

Member Author

@chuckha @rosti @yastij
Point taken.
Please consider that we are approaching this incrementally, and my suggestion is to initially focus on the orchestration layer. In the meantime, we can discuss how best to approach the higher-level reconciliation loop that will be implemented in the next iterations.


Contributor

@chuckha chuckha left a comment

First section reviewed! I'm really excited for this work :D

@chuckha
Contributor

chuckha commented Sep 17, 2019

To clarify, the goal of my review was to make sure the language was decisive and clear to ensure that we agree on what is being solved. Please feel free to reword or ignore my comments at your discretion


Contributor

@chuckha chuckha left a comment

I left a concern regarding the approach.

The proposal should be clear about which use cases are solved. For example, Story 2 is not solved with Actions/Task/TaskGroups. I would say it is systematic but I would not call Actions declarative.


@timothysc
Member

timothysc commented Sep 19, 2019

Poke me when you want me to go through in detail and we can walk through final comments during office hours.

/cc @dlipovetsky
^ This is what you are looking for, and you may need to wrap cluster-api w/ kubeadm-operator to get the overall workflow you would like, and that's ok.

Perhaps we can add an event hook for lifecycle updates to make the integration easier.

@dlipovetsky
Contributor

@fabriziopandini

Thanks a lot for the KEP! I'm still reading through it, but had a question around kubeadm/kubectl/kubelet binary upgrades, which are in scope.

If the binary has dependencies that need to be upgraded, is upgrading the dependencies in scope? I suppose that, if system packages are used, then those dependencies will be upgraded as well.

@rosti
Contributor

rosti commented Sep 24, 2019

If the binary has dependencies that need to be upgraded, is upgrading the dependencies in scope? I suppose that, if system packages are used, then those dependencies will be upgraded as well.

That's an interesting question to me too. I suppose that the best way to go is to use the distro-specific package manager to do this. That way dependencies will be upgraded too. However, with so many distros and different package managers out there, it gets way more complicated than "systemd vs OpenRC".

@ereslibre

That's an interesting question to me too. I suppose that the best way to go is to use the distro-specific package manager to do this. That way dependencies will be upgraded too. However, with so many distros and different package managers out there, it gets way more complicated than "systemd vs OpenRC".

As mentioned earlier, I think this is one of the most challenging matters in this proposal. Getting to modify packages on the host from within a container is really tricky, let alone making that work across different distributions.

PackageKit could be a generic solution because it exposes a D-Bus interface, and calling the host's D-Bus from within a container could work, but I fear its interface won't be enough to ensure the states we want during the upgrade process.
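
As a rough sketch of that idea, a pod could reach the host's system bus as shown below. This assumes PackageKit's packagekitd is installed and running on the host and that the container image ships the pkcon client; both are assumptions, and exactly the kind of distro-specific precondition discussed in the following comments.

```yaml
# Sketch only: run the PackageKit CLI against the host's system D-Bus socket.
# Pod and node names are hypothetical; the host must run packagekitd for this to work.
apiVersion: v1
kind: Pod
metadata:
  name: update-kubelet-package
spec:
  nodeName: worker-0                # hypothetical target node
  restartPolicy: Never
  containers:
    - name: pkcon
      image: fedora:30              # assumption: an image with the PackageKit client (pkcon) installed
      command: ["pkcon", "update", "-y", "kubelet"]
      volumeMounts:
        - name: system-bus
          mountPath: /var/run/dbus/system_bus_socket
  volumes:
    - name: system-bus
      hostPath:
        path: /var/run/dbus/system_bus_socket
        type: Socket
```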

@neolit123
Member

the operator will get into a big mess with the linux distro flavors.

it needs a good abstraction layer to detect and support the package manager and init system.
we can add Debian and RedHat families as a start, as that's what packages we support.
from there it should be a community effort to contribute additional distro support.

i foresee a big maintenance burden for us here...

@dlipovetsky
Contributor

i foresee a big maintenance burden for us here...

The image-based approach taken by Cluster API has the upside of delegating this problem to whoever builds the image.

it needs a good abstraction layer to detect and support the package manager and init system.
we can add Debian and RedHat families as a start, as that's what packages we support.
from there it should be a community effort to contribute additional distro support.

An alternative, I think, is to ensure the pre-flight checks are complete (there are some, but we may want to add more). At a high level, the checks would guarantee that, if any dependency is not present on a host, then kubeadm operator takes no action on that host.

In the short term, users will have to satisfy the checks themselves, but we will gain time to understand how kubeadm operator can itself satisfy them.

Member

@timothysc timothysc left a comment

OK, so from a high level I think we need to be super careful on the type of API we want to support, and TBH I think that is almost impossible to define up-front. I'd almost prefer to remove sections of the current incantation and merge as provisional and create a POC.

I'd ideally like to solicit feedback from @kubernetes/api-reviewers on recommendations given the user stories. The user stories are solid, the API... I'm not sold.


```yaml
apiVersion: operator.kubeadm.x-k8s.io/v1alpha1
kind: Action
```
Member

Gunna be honest, I don't know if I like this.

Member Author

Point taken.
It is a declared objective of the first iteration to discover good semantics for expressing operations. Any suggestion is more than welcome!

@liggitt
Member

liggitt commented Sep 26, 2019

I'd ideally like to solicit feedback from @kubernetes/api-reviewers on recommendations given the user stories. The user stories are solid, the API... I'm not sold.

from that group, @smarterclayton and @deads2k probably have the most practical experience with front-end APIs driving operators specifically

@timothysc
Member

Let's discuss this proposal during this week's office hours and get this unblocked so folks can work on a PoC.

Member

@timothysc timothysc left a comment

/lgtm
/approve

So long as we use our learnings to create a more declarative API in the long term, I'm fine with this.

@k8s-ci-robot added the lgtm label on Oct 7, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini, timothysc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Oct 7, 2019
@k8s-ci-robot merged commit d0177e1 into kubernetes:master on Oct 7, 2019
@k8s-ci-robot added this to the v1.17 milestone on Oct 7, 2019
@fabriziopandini deleted the kubeadm-operator branch on October 10, 2019
- certificate authority rotation (NEW)
- change configuration in an existing cluster (NEW)

> [1] Please note that we are referring to "in place" mutations of kubeadm generated artifacts in order to highlight

CAPI recently merged tighter integration w/ cert-manager - I'm wondering if this footnote is outdated?

Member Author

At first sight, I don't see a relation between cert-manager and the kubeadm-operator, but I prefer to double-check. Can you provide a link?
