From 9de48bd80885488a196c39d2ea7d05498ae95b92 Mon Sep 17 00:00:00 2001 From: Jack Francis Date: Mon, 10 Apr 2023 12:18:12 -0700 Subject: [PATCH 1/5] CAEP: Contract Changes to Support Managed Kubernetes --- docs/proposals/20220725-managed-kubernetes.md | 27 +- ...30407-managed-k8s-capi-contract-changes.md | 386 ++++++++++++++++++ 2 files changed, 402 insertions(+), 11 deletions(-) create mode 100644 docs/proposals/20230407-managed-k8s-capi-contract-changes.md diff --git a/docs/proposals/20220725-managed-kubernetes.md b/docs/proposals/20220725-managed-kubernetes.md index c77215da9b85..0541450e0b27 100644 --- a/docs/proposals/20220725-managed-kubernetes.md +++ b/docs/proposals/20220725-managed-kubernetes.md @@ -14,9 +14,9 @@ reviewers: - "@shyamradhakrishnan" - "@yastij" creation-date: 2022-07-25 -last-updated: 2022-08-23 +last-updated: 2023-06-15 status: implementable -see-also: +see-also: ./20230407-managed-k8s-capi-contract-changes.md replaces: superseded-by: --- @@ -68,13 +68,13 @@ superseded-by: - [Implementation History](#implementation-history) - + ## Glossary - **Managed Kubernetes** - a Kubernetes service offered/hosted by a service provider where the control plane is run & managed by the service provider. As a cluster service consumer, you don’t have to worry about managing/operating the control plane machines. Additionally, the managed Kubernetes service may extend to cover running managed worker nodes. Examples are EKS in AWS and AKS in Azure. This is different from a traditional implementation in Cluster API, where the control plane and worker nodes are deployed and managed by the cluster admin. - **Unmanaged Kubernetes** - a Kubernetes cluster where a cluster admin is responsible for provisioning and operating the control plane and worker nodes. In Cluster API this traditionally means a Kubeadm bootstrapped cluster on infrastructure machines (virtual or physical). - **Managed Worker Node** - an individual Kubernetes worker node where the underlying compute (vm or bare-metal) is provisioned and managed by the service provider. This usually includes the joining of the newly provisioned node into a Managed Kubernetes cluster. The lifecycle is normally controlled via a higher level construct such as a Managed Node Group. -- **Managed Node Group** - is a service that a service provider offers that automates the provisioning of managed worker nodes. Depending on the service provider this group of nodes could contain a fixed number of replicas or it might contain a dynamic pool of replicas that auto-scales up and down. Examples are Node Pools in GCP and EKS managed node groups. +- **Managed Node Group** - is a service that a service provider offers that automates the provisioning of managed worker nodes. Depending on the service provider this group of nodes could contain a fixed number of replicas or it might contain a dynamic pool of replicas that auto-scales up and down. Examples are Node Pools in GCP and EKS managed node groups. - **Cluster Infrastructure Provider (Infrastructure)** - an Infrastructure provider supplies whatever prerequisites are necessary for creating & running clusters such as networking, load balancers, firewall rules, and so on. ([docs](../book/src/developer/providers/cluster-infrastructure.md)) - **ControlPlane Provider (ControlPlane)** - a control plane provider instantiates a Kubernetes control plane consisting of k8s control plane components such as kube-apiserver, etcd, kube-scheduler and kube-controller-manager. 
([docs](../book/src/developer/architecture/controllers/control-plane.md#control-plane-provider)) - **MachineDeployment** - a MachineDeployment orchestrates deployments over a fleet of MachineSets, which is an immutable abstraction over Machines. ([docs](../book/src/developer/architecture/controllers/machine-deployment.md)) @@ -90,7 +90,9 @@ Cluster API was originally designed with unmanaged Kubernetes clusters in mind a Some Cluster API Providers (i.e. Azure with AKS first and then AWS with EKS) have implemented support for their managed Kubernetes services. These implementations have followed the existing documentation & contracts (that were designed for unmanaged Kubernetes) and have ended up with 2 different implementations. -While working on supporting ClusterClass for EKS in Cluster API Provider AWS (CAPA), it was discovered that the current implementation of EKS within CAPA, where a single resource kind (AWSManagedControlPlane) is used for both ControlPlane and Infrastructure, is incompatible with ClusterClass (See the [issue](https://github.com/kubernetes-sigs/cluster-api/issues/6126)). Separation of ControlPlane and Infrastructure is expected for the ClusterClass implementation to work correctly. +> _While working on supporting ClusterClass for EKS in Cluster API Provider AWS (CAPA), it was discovered that the current implementation of EKS within CAPA, where a single resource kind (AWSManagedControlPlane) is used for both ControlPlane and Infrastructure, is incompatible with other parts of CAPI assuming the two objects are different (Reference [issue here](https://github.com/kubernetes-sigs/cluster-api/issues/6126)). Separation of ControlPlane and Infrastructure is expected for the ClusterClass implementation to work correctly._ + +(Note: the above quoted, italicized text matter is no longer relevant once CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented.) The responsibilities between the CAPI control plane and infrastructure are blurred with a managed Kubernetes service like AKS or EKS. For example, when you create a EKS control plane in AWS it also creates infrastructure that CAPI would traditionally view as the responsibility of the cluster “infrastructure provider”. @@ -238,7 +240,7 @@ type GCPManagedControlPlaneSpec struct { // +optional Network NetworkSpec `json:"network"` - // AddonsConfig defines the addons to enable with the GKE cluster. + // AddonsConfig defines the addons to enable with the GKE cluster. // +optional AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"` @@ -265,7 +267,9 @@ CAPA decided to represent an EKS cluster as a CAPI control-plane. This meant tha Initially CAPA had an infrastructure cluster kind that reported back the control plane endpoint. This required less than ideal code in its controller to watch the control plane and use its value of the control plane endpoint. -As the infrastructure cluster kind only acted as a passthrough (to satisfy the contract with CAPI) it was decided that it would be removed and the control-plane kind (AWSManagedControlPlane) could be used to satisfy both the “infrastructure” and “control-plane” contracts. This worked well until ClusterClass arrived with its expectation that the “infrastructure” and “control-plane” are 2 different resource kinds. 
+As the infrastructure cluster kind only acted as a passthrough (to satisfy the contract with CAPI) it was decided that it would be removed and the control-plane kind (AWSManagedControlPlane) could be used to satisfy both the “infrastructure” and “control-plane” contracts. _This worked well until ClusterClass arrived with its expectation that the “infrastructure” and “control-plane” are 2 different resource kinds._
+
+(Note: the above italicized text is no longer relevant once CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented.)
 
 Note that CAPZ had a similar discussion and an [issue](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/1396) to remove AzureManagedCluster: AzureManagedCluster is useless; let's remove it (and keep AzureManagedControlPlane)
 
@@ -276,6 +280,7 @@ Note that CAPZ had a similar discussion and an [issue](https://github.com/kubern
 **Cons**
 
 - Doesn’t work with the current implementation of ClusterClass, which expects a separation of ControlPlane and Infrastructure.
+  - when CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented this con will no longer be true
 - Doesn’t provide separation of responsibilities between creating the general cloud infrastructure for the cluster and the actual cluster control plane.
 - Managed Kubernetes look different from unmanaged Kubernetes where two separate kinds are used for a control plane and infrastructure. This would impact products building on top of CAPI.
 
@@ -334,6 +339,8 @@ type GCPManagedClusterSpec struct {
 - Need to maintain Infra cluster kind, which is a pass-through layer and has no other functions. In addition to the CRD, controllers, webhooks and conversions webhooks need to be maintained.
 - Infra provider doesn’t provision infrastructure and whilst it may meet the CAPI contract, it doesn’t actually create infrastructure as this is done via the control plane.
 
+Note: when CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented this option will no longer be relevant, as we can simply drop the InfraCluster altogether.
+
 #### Option 3: Two kinds with a Managed Control Plane and Managed Infra Cluster with Better Separation of Responsibilities
 
 This option more closely follows the original separation of concerns with the different CAPI provider types. With this option, 2 new resource kinds will be introduced:
@@ -438,11 +445,9 @@ The reasons for this recommendation are as follows:
 - The infra cluster provisions and manages the general infrastructure required for the cluster but not the control plane.
 - By having a separate infra cluster API definition, it allows differences in the API between managed and unmanaged clusters.
 
-Providers like CAPZ and CAPA have already implemented managed Kubernetes support and there should be no requirement on them to move to Option 3. Both Options 2 and 4 are solutions that would work with ClusterClass and so could be used if required.
-
-Option 1 is the only option that will not work with ClusterClass and would require a change to CAPI. Therefore this option is not recommended.
+Providers like CAPZ and CAPA have already implemented managed Kubernetes support and there should be no requirement on them to move to Option 3. Option 4 also works well with ClusterClass and so could be used if required. Option 2 will no longer be relevant once CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented.
-*** This means that CAPA will have to make changes to move away from Option 1 if it wants to support ClusterClass. +Option 1 will be available when CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented. Until then it is not recommended. ### Additional notes on option 3 diff --git a/docs/proposals/20230407-managed-k8s-capi-contract-changes.md b/docs/proposals/20230407-managed-k8s-capi-contract-changes.md new file mode 100644 index 000000000000..8c0687850eb4 --- /dev/null +++ b/docs/proposals/20230407-managed-k8s-capi-contract-changes.md @@ -0,0 +1,386 @@ +--- +title: Contract Changes to Support Managed Kubernetes +authors: + - "@jackfrancis" +reviewers: + - "@richardcase" + - "@pydctw" + - "@mtougeron" + - "@CecileRobertMichon" + - "@fabriziopandini" + - "@sbueringer" + - "@killianmuldoon" + - "@mboersma" + - "@nojnhuh" +creation-date: 2023-04-07 +last-updated: 2023-04-07 +status: provisional +see-also: + - "/docs/proposals/20220725-managed-kubernetes.md" +--- + +# Contract Changes to Support Managed Kubernetes + +## Table of Contents + +A table of contents is helpful for quickly jumping to sections of a proposal and for highlighting +any additional information provided beyond the standard proposal template. +[Tools for generating](https://github.com/ekalinin/github-markdown-toc) a table of contents from markdown are available. + +- [Contract Changes to Support Managed Kubernetes](#contract-changes-to-support-managed-kubernetes) + - [Table of Contents](#table-of-contents) + - [Glossary](#glossary) + - [Summary](#summary) + - [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) + - [Future work](#future-work) + - [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Requirements (Optional)](#requirements-optional) + - [Functional Requirements](#functional-requirements) + - [FR1](#fr1) + - [FR2](#fr2) + - [Non-Functional Requirements](#non-functional-requirements) + - [NFR1](#nfr1) + - [NFR2](#nfr2) + - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) + - [Security Model](#security-model) + - [Risks and Mitigations](#risks-and-mitigations) + - [Alternatives](#alternatives) + - [Upgrade Strategy](#upgrade-strategy) + - [Additional Details](#additional-details) + - [Test Plan [optional]](#test-plan-optional) + - [Graduation Criteria [optional]](#graduation-criteria-optional) + - [Version Skew Strategy [optional]](#version-skew-strategy-optional) + - [Implementation History](#implementation-history) + +## Glossary + +Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html). + +The following terms will be used in this document. + +- `Cluster` + - When we say `Cluster` we refer to any provider's infra-specific implementation of the Cluster API `Cluster` resource spec. When you see ``, interpret that as a placeholder for any provider implementation. Some concrete examples of provider infra cluster implementations are Azure's CAPZ provider (e.g., `AzureCluster` and `AzureManagedCluster`), AWS's CAPA provider (e.g., `AWSCluster` and `AWSManagedCluster`), and Google Cloud's CAPG provider (e.g., `GCPCluster` and `GCPManagedCluster`). Rather than referencing any one of the preceding actual implementations of infra cluster resources, we prefer to generalize to `Cluster` so that we don't suggest any provider-specific bias informing our conclusions. 
+- `ControlPlane`
+  - When we say `ControlPlane` we refer to any provider's infra-specific implementation of a Kubernetes cluster's control plane. When you see `ControlPlane` used this way, interpret it as a placeholder for any provider implementation. Some concrete examples of provider infra control plane implementations are Azure's CAPZ provider (e.g., `AzureManagedControlPlane`), AWS's CAPA provider (e.g., `AWSManagedControlPlane`), and Google Cloud's CAPG provider (e.g., `GCPManagedControlPlane`).
+- Managed Kubernetes
+  - Managed Kubernetes refers to any Kubernetes Cluster provisioning and maintenance platform that is exposed by a service API. For example: [EKS](https://aws.amazon.com/eks/), [OKE](https://www.oracle.com/cloud/cloud-native/container-engine-kubernetes/), [AKS](https://azure.microsoft.com/en-us/products/kubernetes-service), [GKE](https://cloud.google.com/kubernetes-engine), [IBM Cloud Kubernetes Service](https://www.ibm.com/cloud/kubernetes-service), [DOKS](https://www.digitalocean.com/products/kubernetes), and many more throughout the Kubernetes Cloud Native ecosystem.
+- _Kubernetes Cluster Infrastructure_
+  - When we refer to _Kubernetes Cluster Infrastructure_ we aim to distinguish required environmental infrastructure (e.g., cloud virtual networks) in which a Kubernetes cluster resides as a "set of child resources" from the Kubernetes cluster resources themselves (e.g., virtual machines that underlie nodes, managed by Cluster API). Sometimes this is referred to as "BYO Infrastructure"; essentially, we are talking about **infrastructure that supports a Kubernetes cluster, but is not actively managed by Cluster API**. As we will see, this boundary is different when discussing Managed Kubernetes: more infrastructure resources are not managed by Cluster API when running Managed Kubernetes.
+- e.g.
+  - This just means "For example:"!
+
+## Summary
+
+We propose to make provider `Cluster` resources optional in order to better represent Managed Kubernetes scenarios where all _Kubernetes Cluster Infrastructure_ is managed by the service provider, and not by Cluster API. In order to support that, we propose that the API Server endpoint reference can also originate from the `ControlPlane` resource, and not the `Cluster` resource. These changes will introduce two new possible implementation options for providers implementing Managed Kubernetes in Cluster API:
+
+1. A Managed Kubernetes cluster solution whose configuration surface area is expressed exclusively in a `ControlPlane` resource (no `Cluster` resource).
+2. A Managed Kubernetes cluster solution whose configuration surface area comprises both a `Cluster` and a `ControlPlane` resource, with `ControlPlane` being solely responsible for configuring the API Server endpoint (instead of the API Server endpoint being configured via the `Cluster`).
+
+## Motivation
+
+The implementation of Managed Kubernetes scenarios by Cluster API providers occurred after the architectural design of Cluster API, and thus that design process did not consider these Managed Kubernetes scenarios as a user story. In practice, Cluster API's specification has allowed Managed Kubernetes solutions to emerge that aid running fleets of clusters at scale, with CAPA's `AWSManagedCluster` and CAPZ's `AzureManagedCluster` being notable examples.
However, because these Managed Kubernetes solutions arrived after the Cluster API contract was defined, providers have not settled on a consistent rendering of how a "Service-Managed Kubernetes" specification fits into a "Cluster API-Managed Kubernetes" surface area.
+
+One particular part of the existing Cluster API surface area that is inconsistent with most Managed Kubernetes user experiences is the accounting of the [Kubernetes API server](https://kubernetes.io/docs/concepts/overview/components/#kube-apiserver). In the canonical "self-managed" user story that Cluster API addresses, it is the provider implementation of Cluster API (e.g., CAPA) that is responsible for scaffolding the necessary _Kubernetes Cluster Infrastructure_ that is required in order to create the Kubernetes API server (e.g., a Load Balancer and a public IP address). This provider responsibility is declared in the `Cluster` resource, and carried out via its controllers; and then finally this reconciliation is synchronized with the parent Cluster API `Cluster` resource.
+
+Because there exist Managed Kubernetes scenarios that handle all _Kubernetes Cluster Infrastructure_ responsibilities themselves, Cluster API's requirement of a `Cluster` resource leads to awkward implementation decisions, because in these scenarios there is no actual work for a Cluster API provider to do to scaffold _Kubernetes Cluster Infrastructure_.
+
+### Goals
+
+- Build upon [the existing Cluster API Managed Kubernetes proposal](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220725-managed-kubernetes.md). Any net new recommendations and/or proposals will be a continuation of the existing proposal, and consistent with its original conclusions.
+- Make `Cluster` resources optional.
+- Enable API Server endpoint reporting from a provider's Control Plane resource rather than from its `Cluster` resource.
+- Ensure any changes to the current behavioral contract are backwards-compatible.
+
+### Non-Goals
+
+- Changes to existing Cluster API CRDs.
+- Introduce new "Managed Kubernetes" data types in Cluster API.
+- Invalidate [the existing Cluster API Managed Kubernetes proposal and concluding recommendations](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220725-managed-kubernetes.md).
+
+### Future Work
+
+- Detailed documentation that references the flavors of Managed Kubernetes scenarios and how they can be implemented in Cluster API, with provider examples.
+
+## Proposal
+
+### User Stories
+
+#### Story 1
+
+As a cluster operator, I want to use Cluster API to provision and manage the lifecycle of a control plane that utilizes my service provider's managed Kubernetes control plane (i.e. EKS, AKS, GKE), so that I don’t have to worry about the management/provisioning of control plane nodes, and so I can take advantage of any value add services offered by my cloud provider.
+
+#### Story 2
+
+As a cluster operator, I want to be able to provision both "unmanaged" and "managed" Kubernetes clusters from the same management cluster, so that I can support different requirements and use cases as needed whilst using a single operating model.
+
+#### Story 3
+
+As a Cluster API provider implementor, I want to be able to return the control plane endpoint via the ControlPlane custom resource, so that it fits naturally with how I create an instance of the service provider's Managed Kubernetes which creates the endpoint, and so I don't have to pass through the value via another custom resource.
+
+#### Story 4
+
+As a Cluster API provider developer, I want guidance on how to incorporate a managed Kubernetes service into my provider, so that its usage is compatible with Cluster API architecture/features and consistent with other providers.
+
+#### Story 5
+
+As a Cluster API provider developer, I want to enable the ClusterClass feature for a Managed Kubernetes service, so that users can take advantage of an improved UX with ClusterClass-based clusters.
+
+#### Story 6
+
+As a cluster operator, I want to use Cluster API to provision and manage the lifecycle of worker nodes that utilize my cloud providers' managed instances (if they support them), so that I don't have to worry about the management of these instances.
+
+#### Story 7
+
+As a service provider, I want to be able to offer Managed Kubernetes clusters by using CAPI and referencing my own managed control plane implementation that satisfies Cluster API contracts.
+
+### Current State of Managed Kubernetes in CAPI
+
+#### EKS in CAPA
+
+- [Docs](https://cluster-api-aws.sigs.k8s.io/topics/eks/index.html)
+- Feature Status: GA
+- CRDs
+  - AWSManagedCluster - passthrough kind to fulfill the CAPI contract
+  - AWSManagedControlPlane - provision EKS cluster
+  - AWSManagedMachinePool - corresponds to EKS managed node pool
+- Supported Flavors
+  - AWSManagedControlPlane with MachineDeployment / AWSMachine
+  - AWSManagedControlPlane with MachinePool / AWSMachinePool
+  - AWSManagedControlPlane with MachinePool / AWSManagedMachinePool
+- Bootstrap Provider
+  - Cluster API bootstrap provider EKS (CABPE)
+- Features
+  - Provisioning/managing an Amazon EKS Cluster
+  - Upgrading the Kubernetes version of the EKS Cluster
+  - Attaching self-managed machines as nodes to the EKS cluster
+  - Creating a machine pool and attaching it to the EKS cluster (experimental)
+  - Creating a managed machine pool and attaching it to the EKS cluster
+  - Managing "EKS Addons"
+  - Creating an EKS Fargate profile (experimental)
+  - Managing aws-iam-authenticator configuration
+
+#### AKS in CAPZ
+
+- [Docs](https://capz.sigs.k8s.io/topics/managedcluster.html)
+- Feature Status: GA
+- CRDs
+  - AzureManagedControlPlane, AzureManagedCluster - provision AKS cluster
+  - AzureManagedMachinePool - corresponds to AKS node pool
+- Supported Flavor
+  - AzureManagedControlPlane + AzureManagedCluster with AzureManagedMachinePool
+
+#### GKE in CAPG
+
+- [Docs](https://github.com/kubernetes-sigs/cluster-api-provider-gcp/blob/v1.3.0/docs/book/src/topics/gke/index.md)
+- Feature Status: Experimental
+- CRDs
+  - GCPManagedControlPlane, GCPManagedCluster - provision GKE cluster
+  - GCPManagedMachinePool - corresponds to the managed node pool for the cluster
+- Supported Flavor
+  - GCPManagedControlPlane + GCPManagedCluster with GCPManagedMachinePool
+
+
+#### Learnings from original Proposal: Two kinds with a Managed Control Plane & Managed Infra Cluster adhering to the current CAPI contracts
+
+The original Managed Kubernetes proposal recommends managing two separate resources for cluster and control plane configuration, which we refer to in this document as a `Cluster` and a `ControlPlane`. That recommendation is outlined as [Option 3 in the proposal, here][managedKubernetesRecommendation]. This recommendation has been followed by CAPOCI and CAPG as of this writing.
+
+This proposal was able to be implemented with no upstream changes in CAPI.
It makes the following assumptions about representing Managed Kubernetes: + +- **`Cluster`** - Provides any base infrastructure that is required as a prerequisite for the target environment required for running machines and creating a Managed Kubernetes service. +- **`ControlPlane`** - Represents an instance of the actual Managed Kubernetes service in the target environment (i.e. cloud/service provider). It’s based on the assumption that a Managed Kubernetes service supplies the Kubernetes control plane. + +These broadly follow the existing separation within CAPI. + +However, for many Managed Kubernetes services this will require less than ideal code in the controllers to retrieve the control plane endpoint from the `ControlPlane` kind and report it back via the ControlPlaneEndpoint property on the `Cluster` to satisfy CAPI contracts. + +To give an idea what this means: +- `Cluster` watches the control plane and vice versa +- `Cluster` controller create base infra and sets Ready = true +- `ControlPlane` waits for `Cluster` to be Ready +- `ControlPlane` creates an instance of the managed k8s service +- `ControlPlane` gets the API server endpoint from the managed k8s service and stores it in the CRD instance +- `Cluster` is watching for changes to `ControlPlane` and if the "api server endpoint" on the `ControlPlane` CRD instance is not empty then: + - Map `ControlPlane` to `Cluster` and queue event + - `Cluster` reconciler loop gets the `ControlPlane` CRD instance and takes the value for "api server endpoint" and populates `ControlPlaneEndpoint` on the `Cluster` CRD instance. + - (which will then cause the reconciler for `ControlPlane` to run... again) + +The implementation of the controllers for Managed Kubernetes would be simplified if there was an option to report the ControlPlaneEndpoint via `ControlPlane` instead. Below we will outline two new flows that reduce much of the complexity of the above, while allowing Managed Kubernetes providers to represent their services intuitively. + +### Two New Flows + +#### Flow 1: `Cluster` and `ControlPlane`, with `ControlPlaneEndpoint` reported via `ControlPlane` + +We will describe a CRD composition that adheres to the original separation of concerns of the different provider types as documented in the Cluster API documentation, with a different API Server endpoint reporting flow. + +As described above, at present the control plane endpoint must be returned via the `ControlPlaneEndpoint` field on the spec of the `Cluster` [reference here](https://cluster-api.sigs.k8s.io/developer/providers/cluster-infrastructure.html). This is OK for self-managed clusters, as a load balancer is usually created as part of the reconciliation. But with Managed Kubernetes services the API Server endpoint usually comes from the service directly, which means that the `Cluster` has to get the `ControlPlaneEndpoint` from the managed service so that it can be reported back to CAPI. In practice, this results in `Cluster` watching the `ControlPlane` and the `ControlPlane` watching the `Cluster`, and without care this can cause event storms in the CAPI management cluster. + +This flow would require making changes to CAPI controllers so that there is an option to report the `ControlPlaneEndpoint` via the `ControlPlane` as an alternative to coming from the `Cluster`. + +Using CAPG as an example: + +```go +type GCPManagedControlPlaneSpec struct { + // AddonsConfig defines the addons to enable with the GKE cluster. 
+ // +optional + AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"` + + // Logging contains the logging configuration for the GKE cluster. + // +optional + Logging *ControlPlaneLoggingSpec `json:"logging,omitempty"` + + // EnableKubernetesAlpha will indicate the kubernetes alpha features are enabled + // +optional + EnableKubernetesAlpha bool + + // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane. + // +optional + ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"` + + ... +} + + +type GCPManagedClusterSpec struct { + // Project is the name of the project to deploy the cluster to. + Project string `json:"project"` + + // The GCP Region the cluster lives in. + Region string `json:"region"` + + // NetworkSpec encapsulates all things related to the GCP network. + // +optional + Network NetworkSpec `json:"network"` + + // FailureDomains is an optional field which is used to assign selected availability zones to a cluster + // FailureDomains if empty, defaults to all the zones in the selected region and if specified would override + // the default zones. + // +optional + FailureDomains []string `json:"failureDomains,omitempty"` + + ... +} +``` + +**Pros** + +- Simplifies provider implementation when reporting `ControlPlaneEndpoint` +- Clearer separation between the lifecycle management of the general cloud infrastructure required for the cluster and the actual managed control plane (GKE in this example) +- Follows the original intentions of an "infrastructure" and "control-plane" provider +- Enables removal/addition of properties for a Managed Kubernetes cluster that may be different from a self-managed Kubernetes cluster +- Works with ClusterClass + +**Cons** + +- Requires changes upstream to CAPI controllers to support the change of reporting `ControlPlaneEndpoint` +- Duplication of API definitions between self-managed and managed `Cluster` definitions and related controllers +- Users need to be aware of when to use the unmanaged or managed `Cluster` definitions. + +#### Flow 2: Change CAPI to make `Cluster` optional + +This option follows along from the first flow above (`ControlPlaneEndpoint` reported by `ControlPlane` resource rather than `Cluster` resource), but takes it further and makes the `Cluster` resource optional. + +This option would allow providers to implement only a `ControlPlane` resource. Using CAPG as an example, rather than: + +- `Cluster` ←→ `GCPManagedCluster` + `GCPManagedControlPlane` + +We would enable: + +- `Cluster` ←→ `GCPManagedControlPlane` + +This would have the advantage of imposing a separation of configuration between each provider’s `Cluster` and `ControlPlane`’s resources. Because our observations have been that various Managed Kubernetes service providers do things a little bit differently, this separation is hard to define and enforce across all providers in a way that is agreeable to each provider. + +In practice this will help Managed Kubernetes provider implementations that do not provide infrastructure resources as part of the service contract, and as of now are required to implement a `Cluster` resource (e.g., `AzureManagedCluster` ) as a sort of proxy resource that exists solely to fulfill the CAPI requirement for an `Cluster` partner of its corresponding Cluster resource even though there is no infrastructure to describe: + +```golang +type ClusterSpec struct { + ... 
+ // InfrastructureRef is a reference to a provider-specific resource that holds the details + // for provisioning infrastructure for a cluster in said provider. + // +optional + InfrastructureRef *corev1.ObjectReference `json:"infrastructureRef,omitempty"` + ... +} +``` + +The above API specification snippet for `ClusterSpec` emphasizes (in the type comment) that in fact the `InfrastructureRef` child property is an optional property of the data model. We are able to take advantage of this data specification to accommodate these non-infrastructure-providing Managed Cluster infrastructure scenarios, and are entirely able to be represented as a “managed control plane” abstraction. Work will need to be done in the CAPI controllers to support this new workflow, which was originally implemented prior to Managed Kubernetes scenarios being considered. + +**Pros** + +- Does not require any change to existing Cluster API CRDs +- Flexible: enables more expressive API semantics for the various scenarios of Managed Kubernetes +- Is a natural evolution of the prior effort to standardize Managed Kubernetes on CAPI, doesn’t require users following this effort to entirely rethink how they can invest in CAPI + Managed Kubernetes + +**Cons** + +- Would require an update to the existing Cluster API contract to accommodate new workflows + +#### Alternative Option: Introduce a new Managed Kubernetes provider type (with contract) + +This option would introduce a new native Managed Kubernetes type definition into Cluster API, which would have the result of standardizing what Managed Kubernetes looks like for all providers under a common interface. We can use the CAPI type definition of “Cluster”, and the various provider implementations of that (e.g., `GCPCluster`) as a model to copy when we design a native Managed Kubernetes specification. + +Defining a new CAPI Managed Kubernetes type would require us to discover and standardize the set of "common" (relevant across all providers) specification data into a new set of CAPI types, e.g.: + +```golang +type ManagedCluster struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec ManagedClusterSpec `json:"spec,omitempty"` + Status ManagedClusterStatus `json:"status,omitempty"` +} + +type ManagedClusterSpec struct { + // Cluster network configuration. + // +optional + ClusterNetwork *ClusterNetwork `json:"clusterNetwork,omitempty"` + + // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane. + // +optional + ControlPlaneEndpoint APIEndpoint `json:"controlPlaneEndpoint,omitempty"` + + // InfrastructureRef is a reference to a provider-specific resource that holds the details + // for provisioning infrastructure for a cluster in said provider. + // +optional + InfrastructureRef *corev1.ObjectReference `json:"infrastructureRef,omitempty"` +} +``` + +Each provider would then implement its own corresponding type definition: + +```golang +type GCPManagedCluster struct { + .... +} +``` + +Our job is to balance the beneficial outcomes of standardization and consistency by strictly defining certain "common" properties that each provider will fulfill, while enabling enough flexibility to allow providers to meaningfully represent their particular environments. 
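+
+For illustration only (these field choices are hypothetical, not a proposed CAPG API), a provider-specific type under this alternative might inline the common spec and add its own fields alongside it:
+
+```golang
+// Hypothetical sketch only: one way a provider could balance the "common"
+// CAPI-defined fields with provider-specific flexibility is to inline the
+// shared ManagedClusterSpec and add its own fields next to it.
+type GCPManagedClusterSpec struct {
+	// ManagedClusterSpec carries the provider-agnostic fields that Cluster API
+	// would define and every provider would be expected to honour.
+	ManagedClusterSpec `json:",inline"`
+
+	// Project is the name of the GCP project to deploy the cluster to.
+	Project string `json:"project"`
+
+	// Region is the GCP region the cluster lives in.
+	Region string `json:"region"`
+}
+```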
+
+**Pros**
+
+- Standardizing the spec at the foundational, Cluster API layer optimizes for consistency across providers
+
+**Cons**
+
+- Would require a new set of resource specifications to the existing Cluster API spec
+- Differentiates "self-managed clusters" from "managed clusters" at the foundational API layer:
+  - Self-managed clusters would use the `Cluster` API resource as the top-level primitive object
+  - Managed clusters would use the `ManagedCluster` API resource as the top-level primitive object
+  - For example, to see all clusters under management at present, you can issue a `kubectl get clusters --all-namespaces` command (or the API equivalent); going forward, you would issue `kubectl get clusters,managedclusters --all-namespaces`
+- There are no existing provider implementations of such a type; all existing implementations (e.g., CAPA, CAPZ, CAPOCI, CAPG) would need to be replaced or augmented in order to use a new spec.
+
+## Recommendations
+
+Because Managed Kubernetes was not yet in scope for Cluster API when it first appeared and gained rapid adoption, we are incentivized to favor paths forward that use the existing, mature, widely used API specification. The option to create a new `ManagedCluster` API type to best enforce provider consistency thus has a high bar to clear in order to justify itself as the best option for the next phase of Managed Kubernetes in Cluster API.
+
+We conclude that enabling the CAPI controllers to source authoritative `ControlPlaneEndpoint` data from the `ControlPlane` resource is non-invasive to existing API contracts, and offers non-trivial flexibility for CAPI Managed Kubernetes providers at a small additional cost to CAPI maintenance going forward. Existing implementations that leverage a "proxy" `Cluster` resource merely to satisfy CAPI contracts can be simplified by dropping the `Cluster` resource altogether at little-to-no cost to their existing user communities. New Managed Kubernetes provider implementations will now have a little more flexibility: they can use an implementation that uses only a `ControlPlane` resource, if that is appropriate; and implementations that define both a `Cluster` + `ControlPlane` with the appropriate configuration distribution [following our recommendation][managedKubernetesRecommendation] can be non-trivially simplified, with their `ControlPlaneEndpoint` data being observed and straightforwardly returned via `ControlPlane`, the most common source of truth for a Managed Kubernetes service.
+
+## Implementation History
+
+- [x] 01/11/2023: Compile a Google Doc to organize thoughts prior to CAEP [link here](https://docs.google.com/document/d/1rqzZfsO6k_RmOHUxx47cALSr_6SeTG89e9C44-oHHdQ/)
+
+[managedKubernetesRecommendation]: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220725-managed-kubernetes.md#option-3-two-kinds-with-a-managed-control-plane-and-managed-infra-cluster-with-better-separation-of-responsibilities

From 09c0d3507de82ac62bfc2ce216b5eaa9e1efbe06 Mon Sep 17 00:00:00 2001
From: Richard Case
Date: Thu, 15 Jun 2023 15:17:14 +0100
Subject: [PATCH 2/5] docs: updates to managed k8s caep

The managed k8s caep has been updated to reflect
the contract changes proposed for managed k8s. The
recommendations remain the same from the original proposal.

Additionally, formatting changes have been made and some
updates on the current state of managed k8s in CAPI.
Signed-off-by: Richard Case --- docs/proposals/20220725-managed-kubernetes.md | 249 ++++++++++++------ ...30407-managed-k8s-capi-contract-changes.md | 35 +++ 2 files changed, 197 insertions(+), 87 deletions(-) diff --git a/docs/proposals/20220725-managed-kubernetes.md b/docs/proposals/20220725-managed-kubernetes.md index 0541450e0b27..a9127c9dc30d 100644 --- a/docs/proposals/20220725-managed-kubernetes.md +++ b/docs/proposals/20220725-managed-kubernetes.md @@ -50,13 +50,14 @@ superseded-by: - [EKS in CAPA](#eks-in-capa) - [AKS in CAPZ](#aks-in-capz) - [OKE in CAPOCI](#oke-in-capoci) + - [GKE in CAPG](#gke-in-capg) - [Managed Kubernetes API Design Approaches](#managed-kubernetes-api-design-approaches) - - [Option 1: Single kind for Control Plane and Infrastructure](#option-1-single-kind-for-control-plane-and-infrastructure) - - [Background: Why did EKS in CAPA choose this option?](#background-why-did-eks-in-capa-choose-this-option) - - [Option 2: Two kinds with a ControlPlane and a pass-through InfraCluster](#option-2-two-kinds-with-a-controlplane-and-a-pass-through-infracluster) + - [Option 1: Two kinds with a ControlPlane and a pass-through InfraCluster](#option-1-two-kinds-with-a-controlplane-and-a-pass-through-infracluster) + - [Option 2: Just a ControlPlane kind and no InfraCluster](#option-2-just-a-controlplane-kind-and-no-infracluster) - [Option 3: Two kinds with a Managed Control Plane and Managed Infra Cluster with Better Separation of Responsibilities](#option-3-two-kinds-with-a-managed-control-plane-and-managed-infra-cluster-with-better-separation-of-responsibilities) - - [Option 4: Two kinds with a Managed Control Plane and Shared Infra Cluster with Better Separation of Responsibilities](#option-4-two-kinds-with-a-managed-control-plane-and-shared-infra-cluster-with-better-separation-of-responsibilities) - [Recommendations](#recommendations) + - [Vanilla Managed Kubernetes (i.e. without any additional infrastructure)](#vanilla-managed-kubernetes-ie-without-any-additional-infrastructure) + - [Existing Managed Kubernetes Implementations](#existing-managed-kubernetes-implementations) - [Additional notes on option 3](#additional-notes-on-option-3) - [Managed Node Groups for Worker Nodes](#managed-node-groups-for-worker-nodes) - [Provider Implementers Documentation](#provider-implementers-documentation) @@ -64,6 +65,10 @@ superseded-by: - [ClusterClass support for MachinePool](#clusterclass-support-for-machinepool) - [clusterctl integration](#clusterctl-integration) - [Add-ons management](#add-ons-management) +- [Alternatives](#alternatives) + - [Alternative 1: Single kind for Control Plane and Infrastructure](#alternative-1-single-kind-for-control-plane-and-infrastructure) + - [Background: Why did EKS in CAPA choose this option?](#background-why-did-eks-in-capa-choose-this-option) + - [Alternative 2: Two kinds with a Managed Control Plane and Shared Infra Cluster with Better Separation of Responsibilities](#alternative-2-two-kinds-with-a-managed-control-plane-and-shared-infra-cluster-with-better-separation-of-responsibilities) - [Upgrade Strategy](#upgrade-strategy) - [Implementation History](#implementation-history) @@ -90,9 +95,9 @@ Cluster API was originally designed with unmanaged Kubernetes clusters in mind a Some Cluster API Providers (i.e. Azure with AKS first and then AWS with EKS) have implemented support for their managed Kubernetes services. 
These implementations have followed the existing documentation & contracts (that were designed for unmanaged Kubernetes) and have ended up with 2 different implementations. -> _While working on supporting ClusterClass for EKS in Cluster API Provider AWS (CAPA), it was discovered that the current implementation of EKS within CAPA, where a single resource kind (AWSManagedControlPlane) is used for both ControlPlane and Infrastructure, is incompatible with other parts of CAPI assuming the two objects are different (Reference [issue here](https://github.com/kubernetes-sigs/cluster-api/issues/6126)). Separation of ControlPlane and Infrastructure is expected for the ClusterClass implementation to work correctly._ +While working on supporting ClusterClass for EKS in Cluster API Provider AWS (CAPA), it was discovered that the current implementation of EKS within CAPA, where a single resource kind (AWSManagedControlPlane) is used for both ControlPlane and Infrastructure, is incompatible with other parts of CAPI assuming the two objects are different (Reference [issue here](https://github.com/kubernetes-sigs/cluster-api/issues/6126)). -(Note: the above quoted, italicized text matter is no longer relevant once CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented.) +Separation of ControlPlane and Infrastructure is expected for the ClusterClass implementation to work correctly. However, after the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes) have been implemented there is the option to supply only the control plane, but you still cannot supply the same resource for both. The responsibilities between the CAPI control plane and infrastructure are blurred with a managed Kubernetes service like AKS or EKS. For example, when you create a EKS control plane in AWS it also creates infrastructure that CAPI would traditionally view as the responsibility of the cluster “infrastructure provider”. @@ -113,9 +118,8 @@ A good example here is the API server load balancer: - Enforce the Managed Kubernetes recommendations as a requirement for Cluster API providers when they implement Managed Kubernetes. - If providers that have already implemented Managed Kubernetes and would like guidance on if/how they could move to be aligned with the recommendations of this proposal then discussions should be facilitated. - Provide advice in this proposal on how to refactor the existing implementations of managed Kubernetes in CAPA & CAPZ. -- Propose a new architecture or API changes to CAPI for managed Kubernetes +- Propose a new architecture or API changes to CAPI for managed Kubernetes. This has been covered by the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes). - Be a concrete design for the GKE implementation in Cluster API Provider GCP (CAPG). - - A separate CAPG proposal will be created for GKE implementation based on the recommendations of this proposal. - Recommend how Managed Kubernetes services would leverage CAPI internally to run their offer. 
## Proposal
@@ -205,7 +209,7 @@ So that I can eliminate the responsibility of owning and SREing the Control Plan
 
 #### AKS in CAPZ
 
 - [Docs](https://capz.sigs.k8s.io/topics/managedcluster.html)
-- Feature Status: Experimental
+- Feature Status: GA
 - CRDs
   - AzureManagedControlPlane, AzureManagedCluster - provision AKS cluster
   - AzureManagedMachinePool - corresponds to AKS node pool
@@ -214,22 +218,41 @@ So that I can eliminate the responsibility of owning and SREing the Control Plan
 
 #### OKE in CAPOCI
 
-- [Issue](https://github.com/oracle/cluster-api-provider-oci/issues/110)
-- Design discussion starting
+- [Docs](https://oracle.github.io/cluster-api-provider-oci/managed/managedcluster.html)
+- Feature Status: Experimental
+- CRDs
+  - OCIManagedControlPlane, OCIManagedCluster - provision OKE cluster
+  - OCIManagedMachinePool, OCIVirtualMachinePool - machine pool implementations
+- Supported Flavors:
+  - OCIManagedControlPlane + OCIManagedCluster with OCIManagedMachinePool
+  - OCIManagedControlPlane + OCIManagedCluster with OCIVirtualMachinePool
+
+#### GKE in CAPG
+
+- [Docs](https://github.com/kubernetes-sigs/cluster-api-provider-gcp/blob/main/docs/book/src/topics/gke/index.md)
+- Feature Status: Experimental
+- CRDs
+  - GCPManagedControlPlane, GCPManagedCluster - provision GKE cluster
+  - GCPManagedMachinePool - corresponds to managed node pool
+- Supported Flavor
+  - GCPManagedControlPlane + GCPManagedCluster with GCPManagedMachinePool
 
 ### Managed Kubernetes API Design Approaches
 
-When discussing the different approaches to represent a managed Kubernetes service in CAPI, we will be using the implementation of GKE support in CAPG as an example, as this isn’t currently implemented.
+When discussing the different approaches to represent a managed Kubernetes service in CAPI, we will be using the implementation of GKE support in CAPG as an example.
 
-> NOTE: “naming things is hard” so the names of the kinds/structs/fields used in the CAPG examples below are illustrative only and are not the focus of this proposal. There is debate, for example, as to whether `GCPManagedCluster` or `GKECluster` should be used. This type of discussion will be within the CAPG proposal.
+> NOTE: “naming things is hard” so the names of the kinds/structs/fields used in the CAPG examples below are illustrative only and are not the focus of this proposal. There is debate, for example, as to whether `GCPManagedCluster` or `GKECluster` should be used.
 
 The following section discusses different API implementation options along with pros and cons of each.
 
-#### Option 1: Single kind for Control Plane and Infrastructure
+#### Option 1: Two kinds with a ControlPlane and a pass-through InfraCluster
+
+**This option will no longer be needed once the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes.md) have been implemented, as option 2 can then be used for a simpler solution.**
+
-This option introduces a new single resource kind:
+This option introduces 2 new resource kinds:
 
 - **GCPManagedControlPlane**: this represents both a control-plane (i.e. GKE) and infrastructure required for the cluster. It contains properties for both the general cloud infrastructure (that would traditionally be represented by an infrastructure cluster) and the managed Kubernetes control plane (that would traditionally be represented by a control plane provider).
+- **GCPManagedCluster**: contains the minimum properties in its spec and status to satisfy the [CAPI contract for an infrastructure cluster](../book/src/developer/providers/cluster-infrastructure.md) (i.e. ControlPlaneEndpoint, Ready condition). Its controller watches GCPManagedControlPlane and copies the ControlPlaneEndpoint field to GCPManagedCluster to report back to CAPI. This is used as a pass-through layer only. ```go type GCPManagedControlPlaneSpec struct { @@ -259,47 +282,39 @@ type GCPManagedControlPlaneSpec struct { } ``` -**This is the design pattern used by EKS in CAPA.** - -##### Background: Why did EKS in CAPA choose this option? - -CAPA decided to represent an EKS cluster as a CAPI control-plane. This meant that control-plane is responsible for creating the API server load balancer. - -Initially CAPA had an infrastructure cluster kind that reported back the control plane endpoint. This required less than ideal code in its controller to watch the control plane and use its value of the control plane endpoint. - -As the infrastructure cluster kind only acted as a passthrough (to satisfy the contract with CAPI) it was decided that it would be removed and the control-plane kind (AWSManagedControlPlane) could be used to satisfy both the “infrastructure” and “control-plane” contracts. _This worked well until ClusterClass arrived with its expectation that the “infrastructure” and “control-plane” are 2 different resource kinds._ - -(Note: the above italicized text matter is no longer relevant once CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 merges is implemented.) +```go +type GCPManagedClusterSpec struct { + // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane. + // +optional + ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"` +} +``` -Note that CAPZ had a similar discussion and an [issue](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/1396) to remove AzureManagedCluster: AzureManagedCluster is useless; let's remove it (and keep AzureManagedControlPlane) +**This is the design pattern currently used by CAPZ and CAPA**. [An example of how ManagedCluster watches ControlPlane in CAPZ.](https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/5c69b44ed847365525504b242da83b5e5da75e4f/controllers/azuremanagedcluster_controller.go#L71) **Pros** -- A simple design with a single resource kind and controller. +- Better aligned with CAPI’s traditional infra provider model +- Works with ClusterClass **Cons** -- Doesn’t work with the current implementation of ClusterClass, which expects a separation of ControlPlane and Infrastructure. - - when CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented this con will no longer be true -- Doesn’t provide separation of responsibilities between creating the general cloud infrastructure for the cluster and the actual cluster control plane. -- Managed Kubernetes look different from unmanaged Kubernetes where two separate kinds are used for a control plane and infrastructure. This would impact products building on top of CAPI. +- Need to maintain Infra cluster kind, which is a pass-through layer and has no other functions. In addition to the CRD, controllers, webhooks and conversions webhooks need to be maintained. +- Infra provider doesn’t provision infrastructure and whilst it may meet the CAPI contract, it doesn’t actually create infrastructure as this is done via the control plane. 
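+
+To make the pass-through concrete, the sketch below shows the essence of what such an infra cluster controller does (hypothetical code, not the actual CAPZ/CAPA/CAPG implementation): it copies the endpoint reported by the control plane resource into the infra cluster resource so that CAPI can read it from there.
+
+```go
+// Hypothetical sketch only. infrav1 stands in for a provider API package that
+// defines GCPManagedCluster and GCPManagedControlPlane.
+import (
+	"context"
+
+	"sigs.k8s.io/controller-runtime/pkg/client"
+
+	infrav1 "example.dev/provider/api/v1beta1" // placeholder provider API package
+)
+
+// reconcileControlPlaneEndpoint copies the endpoint reported by the control plane
+// resource into the pass-through infra cluster so CAPI can consume it from there.
+func reconcileControlPlaneEndpoint(ctx context.Context, c client.Client, infraCluster *infrav1.GCPManagedCluster, controlPlane *infrav1.GCPManagedControlPlane) error {
+	if controlPlane.Spec.ControlPlaneEndpoint.IsZero() {
+		// The managed service has not reported an endpoint yet; reconcile again later.
+		return nil
+	}
+	infraCluster.Spec.ControlPlaneEndpoint = controlPlane.Spec.ControlPlaneEndpoint
+	return c.Update(ctx, infraCluster)
+}
+```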
-#### Option 2: Two kinds with a ControlPlane and a pass-through InfraCluster
+#### Option 2: Just a ControlPlane kind and no InfraCluster
 
-This option introduces 2 new resource kinds:
+**This option is enabled when the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes.md) have been implemented.**
 
-- **GCPManagedControlPlane**: same as in option 1
-- **GCPManagedCluster**: contains the minimum properties in its spec and status to satisfy the [CAPI contract for an infrastructure cluster](../book/src/developer/providers/cluster-infrastructure.md) (i.e. ControlPlaneEndpoint, Ready condition). Its controller watches GCPManagedControlPlane and copies the ControlPlaneEndpoint field to GCPManagedCluster to report back to CAPI. This is used as a pass-through layer only.
+This option introduces 1 new resource kind:
+
+- **GCPManagedControlPlane**: this represents a control-plane (i.e. GKE) required for the cluster. It contains properties for the managed Kubernetes control plane (that would traditionally be represented by a control plane provider).
 
 ```go
 type GCPManagedControlPlaneSpec struct {
   // Project is the name of the project to deploy the cluster to.
   Project string `json:"project"`
 
-  // NetworkSpec encapsulates all things related to the GCP network.
-  // +optional
-  Network NetworkSpec `json:"network"`
-
   // AddonsConfig defines the addons to enable with the GKE cluster.
   // +optional
   AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"`
@@ -319,27 +334,15 @@ type GCPManagedControlPlaneSpec struct {
 }
 ```
 
-```go
-type GCPManagedClusterSpec struct {
-  // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
-  // +optional
-  ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
-}
-```
-
-**This is the design pattern used by AKS in CAPZ**. [An example of how ManagedCluster watches ControlPlane in CAPZ.](https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/5c69b44ed847365525504b242da83b5e5da75e4f/controllers/azuremanagedcluster_controller.go#L71)
-
 **Pros**
 
-- Better aligned with CAPI’s traditional infra provider model
+- Simpler implementation
+  - No need for a pass-through infra cluster as control plane endpoint can be reported back via the control plane
 - Works with ClusterClass
 
 **Cons**
 
-- Need to maintain Infra cluster kind, which is a pass-through layer and has no other functions. In addition to the CRD, controllers, webhooks and conversions webhooks need to be maintained.
-- Infra provider doesn’t provision infrastructure and whilst it may meet the CAPI contract, it doesn’t actually create infrastructure as this is done via the control plane.
-
-Note: when CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented this option will no longer be relevant, as we can simply drop the InfraCluster altogether.
+- If configuration/functionality related to the base infrastructure is included then concerns are mixed within a single API type.
 
 #### Option 3: Two kinds with a Managed Control Plane and Managed Infra Cluster with Better Separation of Responsibilities
 
@@ -350,10 +353,6 @@ This option more closely follows the original separation of concerns with the di
 
 ```go
 type GCPManagedControlPlaneSpec struct {
-  // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
- // +optional - ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint,omitempty"` - // AddonsConfig defines the addons to enable with the GKE cluster. // +optional AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"` @@ -401,6 +400,8 @@ type GCPManagedClusterSpec struct { } ``` +When the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes) have been implemented there is the option to return the control plane endpoint directly from the ControlPlane instead of passing it via the Infracluster. + **Pros** - Clearer separation between the lifecycle management of the general cloud infrastructure required for the cluster and the actual managed control plane (i.e. GKE in this example) @@ -412,32 +413,9 @@ type GCPManagedClusterSpec struct { - Duplication of API definitions between GCPCluster and GCPManagedCluster and reconciliation for the infrastructure cluster -#### Option 4: Two kinds with a Managed Control Plane and Shared Infra Cluster with Better Separation of Responsibilities - -This option is a variation of option 3 and as such it more closely follows the original separation of concerns with the different CAPI provider types. The difference with this option compared to option 3 is that only 1 new resource kind is introduced: - -- **GCPManagedControlPlane**: this presents the actual GKE control plane in GCP. Its spec would only contain properties that are specific to provisioning & management of GKE. It would not contain any general properties related to the general GCP operating infrastructure, like the networking or project. - -The general cluster infrastructure will be declared via the existing **GCPCluster** kind and reconciled via the existing controller. - -However, this approach will require changes to the controller for **GCPCluster**. The steps to create the required infrastructure may be different between an unmanaged cluster and a GKE based cluster. For example, for an unmanaged cluster a load balancer will need to be created but with a GKE based cluster this won’t be needed and instead we’d need to use the endpoint created as part of **GCPManagedControlPlane** reconciliation. - -So the **GCPCluster** controller will need to know if its creating infrastructure for an unmanaged or managed cluster (probably by looking at the parent's (i.e. `Cluster`) **controlPlaneRef**) and do different steps. - -**Pros** - -- Single infra cluster kind irrespective of if you are creating an unmanaged or GKE based cluster. It doesn’t require the user to pick the right one. -- Clear separation between cluster infrastructure and the actual managed (i.e. GKE) control plane -- Works with cluster class - -**Cons** - -- Additional complexity and logic in the infra cluster controller -- API definition could be messy if only certain fields are required for one type of cluster - ## Recommendations -It is proposed that option 3 (two kinds with a managed control plane and managed infra cluster with better separation of responsibilities) is the best way to proceed for **new implementations** of managed Kubernetes in a provider. +It is proposed that option 3 (two kinds with a managed control plane and managed infra cluster with better separation of responsibilities) is the best way to proceed for **new implementations** of managed Kubernetes in a provider where there is additional infrastructure required (e.g. VPC, resource groups). 
The reasons for this recommendation are as follows:

@@ -445,15 +423,25 @@ The reasons for this recommendation are as follows:
 - The infra cluster provisions and manages the general infrastructure required for the cluster but not the control plane.
 - By having a separate infra cluster API definition, it allows differences in the API between managed and unmanaged clusters.
 
-Providers like CAPZ and CAPA have already implemented managed Kubernetes support and there should be no requirement on them to move to Option 3. Both Options 4 also works well with ClusterClass and so could be used if required. Option 2 will no longer be relevant once CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented.
+> This is the model currently adopted by the managed Kubernetes part of CAPG & CAPOCI and all non-managed K8s implementations.
+
+### Vanilla Managed Kubernetes (i.e. without any additional infrastructure)
 
-Option 1 will be available when CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented. Until then it is not recommended.
+If the managed Kubernetes service does not require any base infrastructure to be set up before creating the instance of the service, then option 2 (Just a ControlPlane kind and no InfraCluster) is the recommendation.
+
+This recommendation assumes that the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes) have been implemented. Until that point option 1 (Two kinds with a ControlPlane and a pass-through InfraCluster) will have to be used.
+
+### Existing Managed Kubernetes Implementations
+
+Providers like CAPZ and CAPA have already implemented managed Kubernetes support and there should be no requirement on them to move to Option 3 (if there is additional infrastructure) or option 2 (if there isn't any additional infrastructure).
+
+There is a desire to have consistency across all managed Kubernetes implementations and across all cluster types (i.e. managed and unmanaged), but the choice remains with the providers of existing implementations.
 
 ### Additional notes on option 3
 
 There are a number of cons listed for option 3. With having 2 API kinds for the infra cluster (and associated controllers), there is a risk of code duplication. To reduce this, the 2 controllers can utilize shared reconciliation code so as to avoid duplicating that logic.
 
-The user will need to be aware of when to use which specific infra cluster kind. In our example this means that a user will need to know when to use `GCPCluster` vs `GCPManagedCluster`. To give clear guidance to users, we will provide templates (including ClusterClasses) and documentation for both the unmanaged and managed varieties of clusters. If we used the same infra cluster kind across both unmanaged & managed (i.e. option 4) then we run the risk of complicating the API for the infra cluster & controller if the required properties diverge.
+The user will need to be aware of when to use which specific infra cluster kind. In our example this means that a user will need to know when to use `GCPCluster` vs `GCPManagedCluster`. To give clear guidance to users, we will provide templates (including ClusterClasses) and documentation for both the unmanaged and managed varieties of clusters. If we used the same infra cluster kind across both unmanaged & managed (i.e. alternative 2), then we run the risk of complicating the API for the infra cluster & controller if the required properties diverge.
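+
+As a non-normative illustration of the shared reconciliation code mentioned above, both infra cluster controllers could delegate to a common service behind a small scope interface; the `Scope` interface and `networks.Service` names below are hypothetical and only sketch the idea:
+
+```go
+// Package networks is a hypothetical package shared by the GCPCluster and
+// GCPManagedCluster reconcilers so the VPC/subnet logic lives in one place.
+package networks
+
+import "context"
+
+// Scope abstracts over GCPCluster and GCPManagedCluster, exposing only what
+// the shared network reconciliation needs.
+type Scope interface {
+	Project() string
+	NetworkName() string
+}
+
+// Service reconciles the GCP network for any Scope implementation.
+type Service struct {
+	scope Scope
+}
+
+// New returns a Service bound to the given scope.
+func New(scope Scope) *Service {
+	return &Service{scope: scope}
+}
+
+// Reconcile ensures the network described by the scope exists; both the
+// GCPCluster and GCPManagedCluster reconcilers would call this from their
+// Reconcile loops.
+func (s *Service) Reconcile(ctx context.Context) error {
+	// ...create or verify the VPC, subnets and firewall rules here...
+	return nil
+}
+```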
### Managed Node Groups for Worker Nodes @@ -492,10 +480,11 @@ Its recommended that changes are made to the [Provider Implementers documentatio Some of the areas of change (this is not an exhaustive list): -- A new "implementing managed kubernetes" guide that contains details about how to represent a managed Kubernetes service in CAPI. The content will be based on option 3 from this proposal along with other considerations such as managed node and addon management. +- A new "implementing managed kubernetes" guide that contains details about how to represent a managed Kubernetes service in CAPI. The content will be based on the recommendations from this proposal along with other considerations such as managed node and addon management. - Update the [Provider contracts documentation](../book/src/developer/providers/contracts.md) to state that the same kind should not be used to satisfy 2 different provider contracts. - Update the [Cluster Infrastructure documentation](../book/src/developer/providers/cluster-infrastructure.md) to provide guidance on how to populate the `controlPlaneEndpoint` in the scenario where the control plane creates the api server load balancer. We should include sample code. - Update the [Control Plane Controller](../book/src/developer/architecture/controllers/control-plane.md) diagram for managed k8s services case. The Control Plane reconcile needs to start when `InfrastructureReady` is true. +- Updates based on the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes). ## Other Considerations for CAPI @@ -519,6 +508,91 @@ Some of the areas of change (this is not an exhaustive list): - [CAPZ](https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2095) - Managed Kubernetes implementations should be able to opt-in/opt-out of what will be provided by [CAPI’s add-ons orchestration solution](https://github.com/kubernetes-sigs/cluster-api/issues/5491) +## Alternatives + +A number of different representations where also considered but discounted. + +### Alternative 1: Single kind for Control Plane and Infrastructure + +This option introduces a new single resource kind: + +- **GCPManagedControlPlane**: this represents both a control-plane (i.e. GKE) and infrastructure required for the cluster. It contains properties for both the general cloud infrastructure (that would traditionally be represented by an infrastructure cluster) and the managed Kubernetes control plane (that would traditionally be represented by a control plane provider). + +```go +type GCPManagedControlPlaneSpec struct { + // Project is the name of the project to deploy the cluster to. + Project string `json:"project"` + + // NetworkSpec encapsulates all things related to the GCP network. + // +optional + Network NetworkSpec `json:"network"` + + // AddonsConfig defines the addons to enable with the GKE cluster. + // +optional + AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"` + + // Logging contains the logging configuration for the GKE cluster. + // +optional + Logging *ControlPlaneLoggingSpec `json:"logging,omitempty"` + + // EnableKubernetesAlpha will indicate the kubernetes alpha features are enabled + // +optional + EnableKubernetesAlpha bool + + // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane. + // +optional + ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"` + .... 
+} +``` + +**This was the design pattern originally used for the EKS implementation in CAPA.** + +#### Background: Why did EKS in CAPA choose this option? + +CAPA decided to represent an EKS cluster as a CAPI control-plane. This meant that control-plane is responsible for creating the API server load balancer. + +Initially CAPA had an infrastructure cluster kind that reported back the control plane endpoint. This required less than ideal code in its controller to watch the control plane and use its value of the control plane endpoint. + +As the infrastructure cluster kind only acted as a passthrough (to satisfy the contract with CAPI) it was decided that it would be removed and the control-plane kind (AWSManagedControlPlane) could be used to satisfy both the “infrastructure” and “control-plane” contracts. _This worked well until ClusterClass arrived with its expectation that the “infrastructure” and “control-plane” are 2 different resource kinds._ + +(Note: the above italicized text matter is no longer relevant once CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 merges is implemented.) + +Note that CAPZ had a similar discussion and an [issue](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/1396) to remove AzureManagedCluster: AzureManagedCluster is useless; let's remove it (and keep AzureManagedControlPlane) + +**Pros** + +- A simple design with a single resource kind and controller. + +**Cons** + +- Doesn’t work with the current implementation of ClusterClass, which expects a separation of ControlPlane and Infrastructure. +- Doesn’t provide separation of responsibilities between creating the general cloud infrastructure for the cluster and the actual cluster control plane. +- Managed Kubernetes look different from unmanaged Kubernetes where two separate kinds are used for a control plane and infrastructure. This would impact products building on top of CAPI. + +### Alternative 2: Two kinds with a Managed Control Plane and Shared Infra Cluster with Better Separation of Responsibilities + +This option is a variation of option 3 and as such it more closely follows the original separation of concerns with the different CAPI provider types. The difference with this option compared to option 3 is that only 1 new resource kind is introduced: + +- **GCPManagedControlPlane**: this presents the actual GKE control plane in GCP. Its spec would only contain properties that are specific to provisioning & management of GKE. It would not contain any general properties related to the general GCP operating infrastructure, like the networking or project. + +The general cluster infrastructure will be declared via the existing **GCPCluster** kind and reconciled via the existing controller. + +However, this approach will require changes to the controller for **GCPCluster**. The steps to create the required infrastructure may be different between an unmanaged cluster and a GKE based cluster. For example, for an unmanaged cluster a load balancer will need to be created but with a GKE based cluster this won’t be needed and instead we’d need to use the endpoint created as part of **GCPManagedControlPlane** reconciliation. + +So the **GCPCluster** controller will need to know if its creating infrastructure for an unmanaged or managed cluster (probably by looking at the parent's (i.e. `Cluster`) **controlPlaneRef**) and do different steps. + +**Pros** + +- Single infra cluster kind irrespective of if you are creating an unmanaged or GKE based cluster. 
It doesn’t require the user to pick the right one. +- Clear separation between cluster infrastructure and the actual managed (i.e. GKE) control plane +- Works with cluster class + +**Cons** + +- Additional complexity and logic in the infra cluster controller +- API definition could be messy if only certain fields are required for one type of cluster + ## Upgrade Strategy As mentioned in the goals section, it is up to providers with existing implementations, CAPA and CAPZ, to decide how they want to proceed. @@ -532,3 +606,4 @@ As mentioned in the goals section, it is up to providers with existing implement - [x] 03/17/2022: Compile a Google Doc following the CAEP template ([link](https://docs.google.com/document/d/1dMN4-KppBkA51sxXPSQhYpqETp2AG_kHzByXTmznxFA/edit?usp=sharing)) - [x] 04/20/2022: Present proposal at a community meeting - [x] 07/27/2022: Move the proposal to a PR in CAPI repo +- [x] 06/15/2023: Updates as a result of the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes.md) and also updates as a result of the current state of managed k8s in CAPI. diff --git a/docs/proposals/20230407-managed-k8s-capi-contract-changes.md b/docs/proposals/20230407-managed-k8s-capi-contract-changes.md index 8c0687850eb4..fcb2b9cc11ab 100644 --- a/docs/proposals/20230407-managed-k8s-capi-contract-changes.md +++ b/docs/proposals/20230407-managed-k8s-capi-contract-changes.md @@ -1,3 +1,38 @@ + + +**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* + +- [Contract Changes to Support Managed Kubernetes](#contract-changes-to-support-managed-kubernetes) + - [Table of Contents](#table-of-contents) + - [Glossary](#glossary) + - [Summary](#summary) + - [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) + - [Future Work](#future-work) + - [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Story 3](#story-3) + - [Story 4](#story-4) + - [Story 5](#story-5) + - [Story 6](#story-6) + - [Story 7](#story-7) + - [Current State of Managed Kubernetes in CAPI](#current-state-of-managed-kubernetes-in-capi) + - [EKS in CAPA](#eks-in-capa) + - [AKS in CAPZ](#aks-in-capz) + - [GKE in CAPG](#gke-in-capg) + - [Learnings from original Proposal: Two kinds with a Managed Control Plane & Managed Infra Cluster adhering to the current CAPI contracts](#learnings-from-original-proposal-two-kinds-with-a-managed-control-plane--managed-infra-cluster-adhering-to-the-current-capi-contracts) + - [Two New Flows](#two-new-flows) + - [Flow 1: `Cluster` and `ControlPlane`, with `ControlPlaneEndpoint` reported via `ControlPlane`](#flow-1-infracluster-and-infracontrolplane-with-controlplaneendpoint-reported-via-infracontrolplane) + - [Flow 2: Change CAPI to make `Cluster` optional](#flow-2-change-capi-to-make-infracluster-optional) + - [Alternative Option: Introduce a new Managed Kubernetes provider type (with contract)](#alternative-option-introduce-a-new-managed-kubernetes-provider-type-with-contract) + - [Recommendations](#recommendations) + - [Implementation History](#implementation-history) + + + --- title: Contract Changes to Support Managed Kubernetes authors: From 6bb820a56c64810c0d5bdb724e58db062930bbf9 Mon Sep 17 00:00:00 2001 From: Jack Francis Date: Fri, 16 Jun 2023 09:39:15 -0700 Subject: [PATCH 3/5] enxebre feedback --- docs/proposals/20220725-managed-kubernetes.md | 4 ++-- .../20230407-managed-k8s-capi-contract-changes.md | 10 +++++++++- 2 files changed, 11 
insertions(+), 3 deletions(-) diff --git a/docs/proposals/20220725-managed-kubernetes.md b/docs/proposals/20220725-managed-kubernetes.md index a9127c9dc30d..6ae49931b2cf 100644 --- a/docs/proposals/20220725-managed-kubernetes.md +++ b/docs/proposals/20220725-managed-kubernetes.md @@ -234,7 +234,7 @@ So that I can eliminate the responsibility of owning and SREing the Control Plan - CRDs - GCPManagedControlPlane, GCPManagedCluster - provision GKE cluster - GCPManagedMachinePool - corresponds to managed node pool -- Support falvor +- Support flavor - GCPManagedControlPlane + GCPManagedCluster with GCPManagedMachinePool ### Managed Kubernetes API Design Approaches @@ -342,7 +342,7 @@ type GCPManagedControlPlaneSpec struct { **Cons** -- If the configuration/functionality related to the base infrastructure are included then we have mixed concerns of the APPI type. +- If the configuration/functionality related to the base infrastructure are included then we have mixed concerns of the API type. #### Option 3: Two kinds with a Managed Control Plane and Managed Infra Cluster with Better Separation of Responsibilities diff --git a/docs/proposals/20230407-managed-k8s-capi-contract-changes.md b/docs/proposals/20230407-managed-k8s-capi-contract-changes.md index fcb2b9cc11ab..6b69f69ea382 100644 --- a/docs/proposals/20230407-managed-k8s-capi-contract-changes.md +++ b/docs/proposals/20230407-managed-k8s-capi-contract-changes.md @@ -111,7 +111,15 @@ The following terms will be used in this document. ## Summary -We propose to make provider `Cluster` resources optional in order to better represent Managed Kubernetes scenarios where all _Kubernetes Cluster Infrastructure_ is managed by the service provider, and not by Cluster API. In order to support that, we propose that the API Server endpoint reference can also originate from the `ControlPlane` resource, and not the `Cluster` resource. These changes will introduce two new possible implementation options for providers implementing Managed Kubernetes in Cluster API: +We propose to relax the `Cluster` resource Cluster API contract so that the `ControlPlane` resource may authoritatively express the control plane endpoint in order to better represent real workflows and reduce the complexity for provider implementers. + +By relaxing the `Cluster` contract with respect to the control plane endpoint we can also now provide the opportunity to make the `Cluster` resource fully optional. This additional flexibility will allow Cluster API providers to better represent various Managed Kubernetes service offerings: + +- Cluster Infra is entirely abstracted away from the Managed Kubernetes user +- Cluster Infra is exposed to the Managed Kubernetes user, but managed by the Managed Kubernetes service +- Cluster Infra is provided by the user (BYO) to support the Managed Kubernetes service + +In order to support the above, we propose that the API Server endpoint reference can also originate from the `ControlPlane` resource, and not the `Cluster` resource. These changes will introduce two new possible implementation options for providers implementing Managed Kubernetes in Cluster API: 1. A Managed Kubernetes cluster solution whose configuration surface area is expressed exclusively in a `ControlPlane` resource (no `Cluster` resource). 2. 
A Managed Kubernetes cluster solution whose configuration surface area comprises both a `Cluster` and a `ControlPlane` resource, with `ControlPlane` being solely responsible for configuring the API Server endpoint (instead of the API Server endpoint being configured via the `Cluster`). From 5c8d5a3f15058433538c51dca3670c980e209455 Mon Sep 17 00:00:00 2001 From: Jack Francis Date: Fri, 6 Oct 2023 16:48:30 -0700 Subject: [PATCH 4/5] initial draft of modified proposal including new CRD Signed-off-by: Jack Francis --- docs/book/src/reference/glossary.md | 22 +- docs/proposals/20220725-managed-kubernetes.md | 18 +- ...0230407-flexible-managed-k8s-endpoints.md} | 230 +++++++++++++----- 3 files changed, 191 insertions(+), 79 deletions(-) rename docs/proposals/{20230407-managed-k8s-capi-contract-changes.md => 20230407-flexible-managed-k8s-endpoints.md} (56%) diff --git a/docs/book/src/reference/glossary.md b/docs/book/src/reference/glossary.md index 7c21c9211b6c..6b58ead3d96d 100644 --- a/docs/book/src/reference/glossary.md +++ b/docs/book/src/reference/glossary.md @@ -26,9 +26,9 @@ A temporary cluster that is used to provision a Target Management cluster. ### Bootstrap provider Refers to a [provider](#provider) that implements a solution for the [bootstrap](#bootstrap) process. -Bootstrap provider's interaction with Cluster API is based on what is defined in the [Cluster API contract](#contract). +Bootstrap provider's interaction with Cluster API is based on what is defined in the [Cluster API contract](#contract). -See [CABPK](#cabpk). +See [CABPK](#cabpk). # C --- @@ -132,6 +132,12 @@ See [core provider](#core-provider) The Cluster API execution model, a set of controllers cooperating in managing the Kubernetes cluster lifecycle. +### Cluster Infrastructure + +or __Kubernetes Cluster Infrastructure__ + +Defines the **infrastructure that supports a Kubernetes cluster**, like e.g. VPC, security groups, load balancers, etc. Please note that in the context of managed Kubernetes some of those components are going to be provided by the corresponding abstraction for a specific Cloud provider (EKS, OKE, AKS etc), and thus Cluster API should not take care of managing a subset or all those components. + ### Contract Or __Cluster API contract__ @@ -155,7 +161,7 @@ See [KCP](#kcp). ### Core provider -Refers to a [provider](#provider) that implements Cluster API core controllers; if you +Refers to a [provider](#provider) that implements Cluster API core controllers; if you consider that the first project that must be deployed in a management Cluster is Cluster API itself, it should be clear why the Cluster API project is also referred to as the core provider. @@ -196,7 +202,7 @@ see [Server](#server) ### Infrastructure provider -Refers to a [provider](#provider) that implements provisioning of infrastructure/computational resources required by +Refers to a [provider](#provider) that implements provisioning of infrastructure/computational resources required by the Cluster or by Machines (e.g. VMs, networking, etc.). Infrastructure provider's interaction with Cluster API is based on what is defined in the [Cluster API contract](#contract). @@ -205,7 +211,7 @@ When there is more than one way to obtain resources from the same infrastructure For a complete list of providers see [Provider Implementations](providers.md). -### Inline patch +### Inline patch A [patch](#patch) defined inline in a [ClusterClass](#clusterclass). An alternative to an [external patch](#external-patch). 
@@ -269,6 +275,10 @@ See also: [Server](#server) Perform create, scale, upgrade, or destroy operations on the cluster. +### Managed Kubernetes + +Managed Kubernetes refers to any Kubernetes cluster provisioning and maintenance abstraction, usually exposed as an API, that is natively available in a Cloud provider. For example: [EKS](https://aws.amazon.com/eks/), [OKE](https://www.oracle.com/cloud/cloud-native/container-engine-kubernetes/), [AKS](https://azure.microsoft.com/en-us/products/kubernetes-service), [GKE](https://cloud.google.com/kubernetes-engine), [IBM Cloud Kubernetes Service](https://www.ibm.com/cloud/kubernetes-service), [DOKS](https://www.digitalocean.com/products/kubernetes), and many more throughout the Kubernetes Cloud Native ecosystem. + ### Managed Topology See [Topology](#topology) @@ -306,7 +316,7 @@ A generically understood combination of a kernel and system-level userspace inte # P --- -### Patch +### Patch A set of instructions describing modifications to a Kubernetes object. Examples include JSON Patch and JSON Merge Patch. diff --git a/docs/proposals/20220725-managed-kubernetes.md b/docs/proposals/20220725-managed-kubernetes.md index 6ae49931b2cf..5191e3e22a0b 100644 --- a/docs/proposals/20220725-managed-kubernetes.md +++ b/docs/proposals/20220725-managed-kubernetes.md @@ -16,7 +16,7 @@ reviewers: creation-date: 2022-07-25 last-updated: 2023-06-15 status: implementable -see-also: ./20230407-managed-k8s-capi-contract-changes.md +see-also: ./20230407-flexible-managed-k8s-endpoints.md replaces: superseded-by: --- @@ -97,7 +97,7 @@ Some Cluster API Providers (i.e. Azure with AKS first and then AWS with EKS) hav While working on supporting ClusterClass for EKS in Cluster API Provider AWS (CAPA), it was discovered that the current implementation of EKS within CAPA, where a single resource kind (AWSManagedControlPlane) is used for both ControlPlane and Infrastructure, is incompatible with other parts of CAPI assuming the two objects are different (Reference [issue here](https://github.com/kubernetes-sigs/cluster-api/issues/6126)). -Separation of ControlPlane and Infrastructure is expected for the ClusterClass implementation to work correctly. However, after the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes) have been implemented there is the option to supply only the control plane, but you still cannot supply the same resource for both. +Separation of ControlPlane and Infrastructure is expected for the ClusterClass implementation to work correctly. However, after the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented there is the option to supply only the control plane, but you still cannot supply the same resource for both. The responsibilities between the CAPI control plane and infrastructure are blurred with a managed Kubernetes service like AKS or EKS. For example, when you create a EKS control plane in AWS it also creates infrastructure that CAPI would traditionally view as the responsibility of the cluster “infrastructure provider”. @@ -118,7 +118,7 @@ A good example here is the API server load balancer: - Enforce the Managed Kubernetes recommendations as a requirement for Cluster API providers when they implement Managed Kubernetes. 
- If providers that have already implemented Managed Kubernetes and would like guidance on if/how they could move to be aligned with the recommendations of this proposal then discussions should be facilitated. - Provide advice in this proposal on how to refactor the existing implementations of managed Kubernetes in CAPA & CAPZ. -- Propose a new architecture or API changes to CAPI for managed Kubernetes. This has been covered by the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes). +- Propose a new architecture or API changes to CAPI for managed Kubernetes. This has been covered by the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md). - Be a concrete design for the GKE implementation in Cluster API Provider GCP (CAPG). - Recommend how Managed Kubernetes services would leverage CAPI internally to run their offer. @@ -247,7 +247,7 @@ The following section discusses different API implementation options along with #### Option 1: Two kinds with a ControlPlane and a pass-through InfraCluster -**This option will be no longer needed when the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes) have been implemented as option 2 can be used for a simpler solution** +**This option will be no longer needed when the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented as option 2 can be used for a simpler solution** This option introduces 2 new resource kinds: @@ -304,7 +304,7 @@ type GCPManagedClusterSpec struct { #### Option 2: Just a ControlPlane kind and no InfraCluster -**This option is enabled when the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes) have been implemented.** +**This option is enabled when the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented.** This option introduces 1 new resource kind: @@ -400,7 +400,7 @@ type GCPManagedClusterSpec struct { } ``` -When the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes) have been implemented there is the option to return the control plane endpoint directly from the ControlPlane instead of passing it via the Infracluster. +When the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented there is the option to return the control plane endpoint directly from the ControlPlane instead of passing it via the Infracluster. **Pros** @@ -429,7 +429,7 @@ The reasons for this recommendation are as follows: If the managed Kubernetes services does not require any base infrastructure to be setup before creating the instance of the service then option 2 (Just a ControlPlane kind (and no InfraCluster) is the recommendation. -This recommendation assumes that the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes) have been implemented. Until that point option 1 (Two kinds with a ControlPlane and a pass-through InfraCluster) will have to be used. 
+This recommendation assumes that the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) have been implemented. Until that point option 1 (Two kinds with a ControlPlane and a pass-through InfraCluster) will have to be used. ### Existing Managed Kubernetes Implementations @@ -484,7 +484,7 @@ Some of the areas of change (this is not an exhaustive list): - Update the [Provider contracts documentation](../book/src/developer/providers/contracts.md) to state that the same kind should not be used to satisfy 2 different provider contracts. - Update the [Cluster Infrastructure documentation](../book/src/developer/providers/cluster-infrastructure.md) to provide guidance on how to populate the `controlPlaneEndpoint` in the scenario where the control plane creates the api server load balancer. We should include sample code. - Update the [Control Plane Controller](../book/src/developer/architecture/controllers/control-plane.md) diagram for managed k8s services case. The Control Plane reconcile needs to start when `InfrastructureReady` is true. -- Updates based on the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes). +- Updates based on the changes documented in the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md). ## Other Considerations for CAPI @@ -606,4 +606,4 @@ As mentioned in the goals section, it is up to providers with existing implement - [x] 03/17/2022: Compile a Google Doc following the CAEP template ([link](https://docs.google.com/document/d/1dMN4-KppBkA51sxXPSQhYpqETp2AG_kHzByXTmznxFA/edit?usp=sharing)) - [x] 04/20/2022: Present proposal at a community meeting - [x] 07/27/2022: Move the proposal to a PR in CAPI repo -- [x] 06/15/2023: Updates as a result of the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-managed-k8s-capi-contract-changes.md) and also updates as a result of the current state of managed k8s in CAPI. +- [x] 06/15/2023: Updates as a result of the [Contract Changes to Support Managed Kubernetes CAEP](./20230407-flexible-managed-k8s-endpoints.md) and also updates as a result of the current state of managed k8s in CAPI. 
diff --git a/docs/proposals/20230407-managed-k8s-capi-contract-changes.md b/docs/proposals/20230407-flexible-managed-k8s-endpoints.md similarity index 56% rename from docs/proposals/20230407-managed-k8s-capi-contract-changes.md rename to docs/proposals/20230407-flexible-managed-k8s-endpoints.md index 6b69f69ea382..ffa6c790cbb3 100644 --- a/docs/proposals/20230407-managed-k8s-capi-contract-changes.md +++ b/docs/proposals/20230407-flexible-managed-k8s-endpoints.md @@ -2,8 +2,7 @@ **Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* -- [Contract Changes to Support Managed Kubernetes](#contract-changes-to-support-managed-kubernetes) - - [Table of Contents](#table-of-contents) +- [Flexible Managed Kubernetes Endpoints](#flexible-managed-kubernetes-endpoints) - [Glossary](#glossary) - [Summary](#summary) - [Motivation](#motivation) @@ -19,8 +18,14 @@ - [Story 5](#story-5) - [Story 6](#story-6) - [Story 7](#story-7) - - [Current State of Managed Kubernetes in CAPI](#current-state-of-managed-kubernetes-in-capi) - - [EKS in CAPA](#eks-in-capa) + - [Design](#design) + - [Core Cluster API changes](#core-cluster-api-changes) + - [Infra Providers API changes](#infra-providers-api-changes) + - [Core Cluster API Controllers changes](#core-cluster-api-controllers-changes) + - [Provider controller changes](#provider-controller-changes) + - [Guidelines for infra providers implementation](#guidelines-for-infra-providers-implementation) + - [Background work](#background-work) + - [EKS in CAPA](#eks-in-capa) - [AKS in CAPZ](#aks-in-capz) - [GKE in CAPG](#gke-in-capg) - [Learnings from original Proposal: Two kinds with a Managed Control Plane & Managed Infra Cluster adhering to the current CAPI contracts](#learnings-from-original-proposal-two-kinds-with-a-managed-control-plane--managed-infra-cluster-adhering-to-the-current-capi-contracts) @@ -34,7 +39,7 @@ --- -title: Contract Changes to Support Managed Kubernetes +title: Flexible Managed Kubernetes Endpoints authors: - "@jackfrancis" reviewers: @@ -54,43 +59,7 @@ see-also: - "/docs/proposals/20220725-managed-kubernetes.md" --- -# Contract Changes to Support Managed Kubernetes - -## Table of Contents - -A table of contents is helpful for quickly jumping to sections of a proposal and for highlighting -any additional information provided beyond the standard proposal template. -[Tools for generating](https://github.com/ekalinin/github-markdown-toc) a table of contents from markdown are available. 
- -- [Contract Changes to Support Managed Kubernetes](#contract-changes-to-support-managed-kubernetes) - - [Table of Contents](#table-of-contents) - - [Glossary](#glossary) - - [Summary](#summary) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Future work](#future-work) - - [Proposal](#proposal) - - [User Stories](#user-stories) - - [Story 1](#story-1) - - [Story 2](#story-2) - - [Requirements (Optional)](#requirements-optional) - - [Functional Requirements](#functional-requirements) - - [FR1](#fr1) - - [FR2](#fr2) - - [Non-Functional Requirements](#non-functional-requirements) - - [NFR1](#nfr1) - - [NFR2](#nfr2) - - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - - [Security Model](#security-model) - - [Risks and Mitigations](#risks-and-mitigations) - - [Alternatives](#alternatives) - - [Upgrade Strategy](#upgrade-strategy) - - [Additional Details](#additional-details) - - [Test Plan [optional]](#test-plan-optional) - - [Graduation Criteria [optional]](#graduation-criteria-optional) - - [Version Skew Strategy [optional]](#version-skew-strategy-optional) - - [Implementation History](#implementation-history) +# Flexible Managed Kubernetes Endpoints ## Glossary @@ -98,31 +67,27 @@ Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/referen The following terms will be used in this document. -- `Cluster` - - When we say `Cluster` we refer to any provider's infra-specific implementation of the Cluster API `Cluster` resource spec. When you see ``, interpret that as a placeholder for any provider implementation. Some concrete examples of provider infra cluster implementations are Azure's CAPZ provider (e.g., `AzureCluster` and `AzureManagedCluster`), AWS's CAPA provider (e.g., `AWSCluster` and `AWSManagedCluster`), and Google Cloud's CAPG provider (e.g., `GCPCluster` and `GCPManagedCluster`). Rather than referencing any one of the preceding actual implementations of infra cluster resources, we prefer to generalize to `Cluster` so that we don't suggest any provider-specific bias informing our conclusions. -- `ControlPlane` - - When we say `ControlPlane` we refer to any provider's infra-specific implementation of the a Kubernetes cluster's control plane. When you see ``, interpret that as a placeholder for any provider implementation. Some concrete examples of provider infra control plane implementations are Azure's CAPZ provider (e.g., `AzureManagedControlPlane`), AWS's CAPA provider (e.g., `AWSManagedControlPlane`), and Google Cloud's CAPG provider (e.g., `GCPManagedControlPlane`). - Managed Kubernetes - - Managed Kubernetes refers to any Kubernetes Cluster provisioning and maintenance platform that is exposed by a service API. For example: [EKS](https://aws.amazon.com/eks/), [OKE](https://www.oracle.com/cloud/cloud-native/container-engine-kubernetes/), [AKS](https://azure.microsoft.com/en-us/products/kubernetes-service), [GKE](https://cloud.google.com/kubernetes-engine), [IBM Cloud Kubernetes Service](https://www.ibm.com/cloud/kubernetes-service), [DOKS](https://www.digitalocean.com/products/kubernetes), and many more throughout the Kubernetes Cloud Native ecosystem. + - Managed Kubernetes refers to any Kubernetes Cluster provisioning and maintenance abstraction, usually exposed as an API, that is natively available in a Cloud provider. 
For example: [EKS](https://aws.amazon.com/eks/), [OKE](https://www.oracle.com/cloud/cloud-native/container-engine-kubernetes/), [AKS](https://azure.microsoft.com/en-us/products/kubernetes-service), [GKE](https://cloud.google.com/kubernetes-engine), [IBM Cloud Kubernetes Service](https://www.ibm.com/cloud/kubernetes-service), [DOKS](https://www.digitalocean.com/products/kubernetes), and many more throughout the Kubernetes Cloud Native ecosystem. +- `ControlPlane Provider` + - When we say `ControlPlane Provider` we refer to a solution that implements a solution for the management of a Kubernetes [control plane](https://kubernetes.io/docs/concepts/#kubernetes-control-plane) according to the Cluster API contract. Please note that in the context of managed Kubernetes, the `ControlPlane Provider` usually wraps the corresponding abstraction for a specific Cloud provider. Concrete example for Microsoft Azure is the `AzureManagedControlPlane`, for AWS the `AWSManagedControlPlane`, for Google the `GCPManagedControlPlane` etc. - _Kubernetes Cluster Infrastructure_ - - When we refer to _Kubernetes Cluster Infrastructure_ we aim to distinguish required environmental infrastructure (e.g., cloud virtual networks) in which a Kubernetes cluster resides as a "set of child resources" from the Kubernetes cluster resources themselves (e.g., virtual machines that underlie nodes, managed by Cluster API). Sometimes this is referred to as "BYO Infrastructure"; essentially, we are talking about **infrastructure that supports a Kubernetes cluster, but is not actively managed by Cluster API**. As we will see, this boundary is different when discussing Managed Kubernetes: more infrastructure resources are not managed by Cluster API when running Managed Kubernetes. + - When we refer to _Kubernetes Cluster Infrastructure_ (abbr. _Cluster Infrastructure_) we refer to the **infrastructure that supports a Kubernetes cluster**, like e.g. VPC, security groups, load balancers etc. Please note that in the context of Managed Kubernetes some of those components are going to be provided by the corresponding abstraction for a specific Cloud provider (EKS, OKE, AKS etc), and thus Cluster API should not take care of managing a subset or all those components. +- `Cluster` + - When we say `Cluster` we refer to any provider that provides Kubernetes Cluster Infrastructure for a specific Cloud provider. Concrete example for Microsoft Azure is the `AzureCluster` and the `AzureManagedCluster`, for AWS the `AWSCluster` and the `AWSManagedCluster`, for Google Cloud the `GCPCluster` and the `GCPManagedCluster`). - e.g. - This just means "For example:"! ## Summary -We propose to relax the `Cluster` resource Cluster API contract so that the `ControlPlane` resource may authoritatively express the control plane endpoint in order to better represent real workflows and reduce the complexity for provider implementers. +This proposal aims to address the lesson learned by running Managed Kubernetes solution on top of Cluster API, and make this use case simpler and more straight forward both for Cluster API users and for the maintainers of the Cluster API providers. -By relaxing the `Cluster` contract with respect to the control plane endpoint we can also now provide the opportunity to make the `Cluster` resource fully optional. 
This additional flexibility will allow Cluster API providers to better represent various Managed Kubernetes service offerings: +More specifically we would like to introduce first class support for two scenarios: -- Cluster Infra is entirely abstracted away from the Managed Kubernetes user -- Cluster Infra is exposed to the Managed Kubernetes user, but managed by the Managed Kubernetes service -- Cluster Infra is provided by the user (BYO) to support the Managed Kubernetes service +- Permit omitting the `Cluster` entirely, thus making it simpler to use with Cluster API all the Managed Kubernetes implementations which do not require any additional Kubernetes Cluster Infrastructure (network settings, security groups, etc) on top of what is provided out of the box by the managed Kubernetes primitive offered by a Cloud provider. +- Allow the `ControlPlane Provider` component to take ownership of the responsibility of creating the control plane endpoint, thus making it simpler to use with Cluster API all the Managed Kubernetes implementations which are taking care out of the box of this piece of Cluster Infrastructure. -In order to support the above, we propose that the API Server endpoint reference can also originate from the `ControlPlane` resource, and not the `Cluster` resource. These changes will introduce two new possible implementation options for providers implementing Managed Kubernetes in Cluster API: - -1. A Managed Kubernetes cluster solution whose configuration surface area is expressed exclusively in a `ControlPlane` resource (no `Cluster` resource). -2. A Managed Kubernetes cluster solution whose configuration surface area comprises both a `Cluster` and a `ControlPlane` resource, with `ControlPlane` being solely responsible for configuring the API Server endpoint (instead of the API Server endpoint being configured via the `Cluster`). +The above capabilities can be used alone or in combination depending on the requirements of a specific Managed Kubernetes or on the specific architecture/set of Cloud components being implemented. ## Motivation @@ -130,18 +95,19 @@ The implementation of Managed Kubernetes scenarios by Cluster API providers occu One particular part of the existing Cluster API surface area that is inconsistent with most Managed Kubernetes user experiences is the accounting of the [Kubernetes API server](https://kubernetes.io/docs/concepts/overview/components/#kube-apiserver). In the canonical "self-managed" user story that Cluster API addresses, it is the provider implementation of Cluster API (e.g., CAPA) that is responsible for scaffolding the necessary _Kubernetes Cluster Infrastructure_ that is required in order to create the Kubernetes API server (e.g., a Load Balancer and a public IP address). This provider responsibility is declared in the `Cluster` resource, and carried out via its controllers; and then finally this reconciliation is synchronized with the parent `Cluster` Cluster API resource. -Because there exist Managed Kubernetes scenarios that handle all _Kubernetes Cluster Infrastructure_ responsibilities themselves, Cluster API's requirement of a `Cluster` resource leads to weird implementation decisions, because in these scenarios there is no actual work for a Cluster API provider to do to scaffold _Kubernetes Cluster Infrastructure_. 
+Because there exist Managed Kubernetes scenarios that handle a subset or all _Kubernetes Cluster Infrastructure_ responsibilities themselves, Cluster API's requirement of a `Cluster` resource leads to undesirable implementation decisions, because in these scenarios there is no actual work for a Cluster API provider to do to scaffold _Kubernetes Cluster Infrastructure_. + +Finally, for Managed Kubernetes scenarios that _do_ include additional, user-exposed infra (e.g., GKE and EKS as of this writing), we want to make it easier to account for the representation of the Managed Kubernetes API server endpoint, which is not always best owned by a `Cluster` resource. ### Goals - Build upon [the existing Cluster API Managed Kubernetes proposal](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220725-managed-kubernetes.md). Any net new recommendations and/or proposals will be a continuation of the existing proposal, and consistent with its original conclusions. -- Make `Cluster` resources optional. -- Enable API Server endpoint reporting from a provider's Control Plane resource rather than from its `Cluster` resource. +- Identify and document API changes and controllers changes required to omit the `Cluster` entirely, where this is applicable. +- Identify and document API changes and controllers changes required to allow the `ControlPlane Provider` component to take ownership of the responsibility of creating the control plane endpoint. - Ensure any changes to the current behavioral contract are backwards-compatible. ### Non-Goals -- Changes to existing Cluster API CRDs. - Introduce new "Managed Kubernetes" data types in Cluster API. - Invalidate [the existing Cluster API Managed Kubernetes proposal and concluding recommendations](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220725-managed-kubernetes.md). @@ -163,7 +129,7 @@ As a cluster operator, I want to be able to provision both "unmanaged" and "mana #### Story 3 -As a Cluster API provider implementor, I want to be able to return the control plane endpoint via the ControlPlane custom resource, so that it fits naturally with how I create an instance of the service provider's Managed Kubernetes which creates the endpoint, and so i don't have to pass through the value via another custom resource. +As a Cluster API provider implementor, I want to be able to return the control plane endpoint created by the `ControlPlane Provider`, so that it fits naturally with how most of the native Managed Kubernetes implementations works. #### Story 4 @@ -181,9 +147,145 @@ As a cluster operator, I want to use Cluster API to provision and manage the lif As a service provider I want to be able to offer Managed Kubernetes clusters by using CAPI referencing my own managed control plane implementation that satisfies Cluster API contracts. -### Current State of Managed Kubernetes in CAPI +### Design + +Below we are documenting API changes and controllers changes required to omit the `Cluster` entirely and to allow the `ControlPlane Provider` component to take ownership of the responsibility of creating the control plane endpoint. + +#### Core Cluster API changes + +This proposal does not introduce any breaking changes for the existing "core" API. 
More specifically: + +The existing Cluster API types are already able to omit the `Cluster`: + +- The `infrastructureRef` field on the Cluster object is already a pointer and thus it could be set to nil, and in fact we are already creating Clusters without `infrastructureRef` when we use a cluster class). +- The `infrastructure.Ref` field on the ClusterClass objects already a pointer and thus it could be set to nil, but in this case it is required to change the validation webhook to allow the user to not specify it; on top of that, when validating inline patches, we should reject patches targeting the infrastructure template objects if not specified. + +In order to allow the `ControlPlane Provider` component to take ownership of the responsibility of creating the control plane endpoint we are going to introduce a new `ClusterEndpoint` CRD, below some example: + +```yaml +apiVersion: cluster.x-k8s.io/v1beta1 +kind: ClusterEndpoint +spec: + host: "name-1234567890.region.elb.amazonaws.com" + port: 1234 + type: ExternalControlPlaneEndpoint +``` + +```yaml +apiVersion: cluster.x-k8s.io/v1beta1 +kind: ClusterEndpoint +spec: + host: "10.40.85.102" + port: 1234 + type: ExternalControlPlaneEndpoint +``` + +This is how the type specification would look: + +```go +// ClusterEndpoint represents a reachable Kubernetes API endpoint serving a particular cluster function. +type ClusterEndpoint struct { + // The Host is the DNS record or the IP address that the endpoint is reachable on. + Host string `json:"host"` + + // The port on which the endpoint is serving. + Port int32 `json:"port"` +} +``` + +The `Cluster` object which is currently using the `spec.controlPlaneEndpoint` for the same scope will continue to work because "core" Cluster API controllers will continue to recognize when this field is set and take care of generating the `ClusterEndpoint` automatically; however this mechanism should be considered as a temporary machinery to migrate to the new CRD, and it will be removed in future versions of Cluster API. In addition, once the legacy behavior is removed, we will deprecate and eventually remove the `spec.controlPlaneEndpoint` field from the `Cluster` CustomResourceDefinition, and recommend that providers do the same for their `Cluster` CustomResourceDefinitions as well. + +Future Notes: + +- A future `type` field can be introduced to enable CAPI to extend the usage of this CRD to address https://github.com/kubernetes-sigs/cluster-api/issues/5295 in a future iteration +- The current implementation originates from the `Cluster.spec.ControlPlaneEndpoint` field, which defines the info we need for this proposal; but in future iterations we might consider to support more addressed or more ports for each ClusterEndpoint, similarly what is implemented in the core v1 Endpoint type. + +#### Infra Providers API changes + +This proposal does not introduce any breaking changes for the provider's API. + +However, Infra providers will be made aware that `spec.controlPlaneEndpoint` will be scheduled for deprecation in `Cluster` resources in a future CAPI API version, with corresponding warning messages in controller logs. We will recommend that they remove it in a future API version of their provider. 
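+
+To make the transitional behavior above concrete, here is a purely illustrative sketch (the `FooCluster` kind, API group, and all values are hypothetical): an infra cluster that still reports the endpoint via the legacy `spec.controlPlaneEndpoint` field, next to the `ClusterEndpoint` that would represent the same endpoint under the new contract during the migration period:
+
+```yaml
+# Hypothetical infra cluster still using the legacy field.
+apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
+kind: FooCluster
+metadata:
+  name: my-cluster
+spec:
+  controlPlaneEndpoint:
+    host: "203.0.113.10"
+    port: 6443
+---
+# Equivalent ClusterEndpoint under the new contract.
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: ClusterEndpoint
+spec:
+  host: "203.0.113.10"
+  port: 6443
+  type: ExternalControlPlaneEndpoint
+```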
+
+#### Core Cluster API Controllers changes
+
+- All the controllers working with ClusterClass objects must take into account that the `infrastructure.Ref` field could be omitted; most notably:
+  - The ClusterClass controller must ignore nil `infrastructure.Ref` fields while adding owner references to all the objects referenced by a ClusterClass.
+  - The Topology controller must skip the generation of the `Cluster` objects when the `infrastructure.Ref` field in a ClusterClass is empty.
+
+- All the controllers working with Cluster objects must take into account that the `infrastructureRef` field could be omitted; most notably:
+  - The Cluster controller must skip reconciling this external reference when the `infrastructureRef` is missing; also, the `status.InfrastructureReady` field must be automatically set to true in this case.
+
+- The Cluster controller must reconcile the new `ClusterEndpoint` CR. Please note that:
+  - The value from the `ClusterEndpoint` CRD must surface on the `spec.ControlPlaneEndpoint` field on the `Cluster` object.
+  - If both are present, the value from the `ClusterEndpoint` CRD must take precedence on the value from `Cluster` objects still using the `spec.controlPlaneEndpoint`.
+
+- The Cluster controller must implement the temporary machinery needed to migrate existing Clusters to the new CRD and to deal with `Cluster` objects still using the `spec.controlPlaneEndpoint` field as a way to communicate the ClusterAddress to "core" Cluster API controllers:
+  - If the `spec.ControlPlaneEndpoint` field is set on the `Cluster` object but there is no corresponding `ClusterEndpoint` CR, the CR must be created.
+
+#### Provider controller changes
+
+- All the `Cluster` controller who are responsible to create a control plane endpoint
+  - As soon as the `spec.controlPlaneEndpoint` field in the `Cluster` object is removed, the `Cluster` controller must instead create a `ClusterEndpoint` CR to communicate the control plane endpoint to the Cluster API core controllers
+    - NOTE: technically it is possible to start creating the `ClusterEndpoint` CR *before* the removal of the `spec.controlPlaneEndpoint` field, because the new CR will take precedence on the value read from the field, but this is up to the infra provider maintainers.
+  - The `ClusterEndpoint` CR must have an owner reference to the `Cluster` object from which it originated.
+
+- All the `ControlPlane Provider` controller who are responsible to create a control plane endpoint
+  - Must stop to wait for the `spec.ControlPlaneEndpoint` field on the `Cluster` object to be set before starting to provision the control plane.
+  - As soon as the Managed Kubernetes Service-provided control plane endpoint is available, the controller must create a `ClusterEndpoint` CR to communicate the control plane endpoint to the Cluster API core controllers
+  - The `ClusterEndpoint` CR must have an owner reference to the `ControlPlane` object from which it originated.
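+
+The following non-normative Go sketch shows how a provider's control plane reconciler could publish the endpoint as a `ClusterEndpoint` owned by the control plane object. The `FooManagedControlPlane` type, the reconciler wiring, and the package paths are assumptions for illustration only; the `ClusterEndpoint` Go shape follows the one proposed in this document:
+
+```go
+import (
+	"context"
+
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/runtime"
+	"sigs.k8s.io/controller-runtime/pkg/client"
+	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
+
+	// clusterv1 is assumed to expose the ClusterEndpoint types proposed in this document;
+	// infrav1 is the hypothetical provider API package defining FooManagedControlPlane.
+	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
+	infrav1 "example.com/cluster-api-provider-foo/api/v1beta1"
+)
+
+// FooManagedControlPlaneReconciler is a hypothetical provider reconciler.
+type FooManagedControlPlaneReconciler struct {
+	client.Client
+	Scheme *runtime.Scheme
+}
+
+// publishEndpoint creates the ClusterEndpoint CR for the endpoint returned by
+// the managed Kubernetes service, owned by the control plane object.
+func (r *FooManagedControlPlaneReconciler) publishEndpoint(ctx context.Context, cp *infrav1.FooManagedControlPlane, host string, port int32) error {
+	endpoint := &clusterv1.ClusterEndpoint{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      cp.Name,
+			Namespace: cp.Namespace,
+		},
+		Spec: clusterv1.ClusterEndpointSpec{
+			Host: host,
+			Port: port,
+		},
+	}
+	// The owner reference ties the ClusterEndpoint lifecycle to the control
+	// plane object, as required by this proposal.
+	if err := controllerutil.SetControllerReference(cp, endpoint, r.Scheme); err != nil {
+		return err
+	}
+	if err := r.Client.Create(ctx, endpoint); err != nil && !apierrors.IsAlreadyExists(err) {
+		return err
+	}
+	return nil
+}
+```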
+
+### Guidelines for infra providers implementation
+
+Let's consider the following scenarios for a hypothetical `cluster-api-provider-foo` infra provider:
+
+_Scenario 1._
+
+If the `Foo` cloud provider has a `FKS` managed Kubernetes offering that is taking care of _the entire Kubernetes Cluster infrastructure_, the maintainers of the `cluster-api-provider-foo` provider:
+- Must not implement a `FKSCluster` CRD and the corresponding `FKSClusterTemplate` CRD (nor the related controllers)
+- Must implement a `FKRControlControlplane provider`, a `FKRControlControlplane` CRD, the corresponding `FKRControlControlplane` and related controllers
+- The `FKRControlControlplane` controller:
+  - Must not wait for the `spec.ControlPlaneEndpoint` field on the `Cluster` object to be set before starting to provision the `FKS` managed Kubernetes instance.
+  - As soon as the control plane endpoint is available, must create a `ClusterEndpoint` CR to communicate the control plane endpoint to the Cluster API core controllers; the `ClusterEndpoint` CR must have an owner reference to the `FKRControlControlplane` object from which it originated.
+  - Must set the `status.Ready` field on the `FKRControlControlplane` object when the provisioning is complete
+
+_Scenario 2._
+
+If the `Foo` cloud provider has a `FKS` managed Kubernetes offering that is taking care of _only a subset of the Kubernetes Cluster infrastructure_, or it is required to provision some additional pieces of infrastructure on top of what is provisioned out of the box, e.g. an SSH bastion host, the maintainers of the `cluster-api-provider-foo` provider:
+- Must implement a `FKSCluster` CRD and the corresponding `FKSClusterTemplate` CRD and the related controllers
+  - The `FKSCluster` controller
+    - Must create only the additional piece of the _Kubernetes Cluster infrastructure_ not provisioned by the `FKS` managed Kubernetes instance (in this example an SSH bastion host)
+    - Must not create a `ClusterEndpoint` CR (nor set the `spec.controlPlaneEndpoint` field in the `FKSCluster` object), because provisioning the control plane endpoint is not the responsibility of this controller.
+    - Must set the `status.Ready` field on the `FKSCluster` object when the provisioning is complete
+- Must implement a `FKRControlControlplane provider`, a `FKRControlControlplane` CRD, the corresponding `FKRControlControlplane` and related controllers
+  - The `FKRControlControlplane` controller:
+    - Must wait for the `status.InfrastructureReady` field on the `Cluster` object to be set to true before starting to provision the control plane.
+    - Must not wait for the `spec.ControlPlaneEndpoint` field on the `Cluster` object to be set before starting to provision the control plane.
+    - As soon as the control plane endpoint is available, must create a `ClusterEndpoint` CR to communicate the control plane endpoint to the Cluster API core controllers; the `ClusterEndpoint` CR must have an owner reference to the `FKRControlControlplane` object from which it originated.
+    - Must set the `status.Ready` field on the `FKRControlControlplane` object when the provisioning is complete
+
+_Scenario 3._
+
+If the `Foo` cloud provider has a `FKS` managed Kubernetes offering that is not taking care of the control plane endpoint, e.g.
because it requires an existing `FooElasticIP` and a `FooElasticLoadBalancer` to be provisioned before creating the `FKS` managed Kubernetes cluster, the maintainers of the `cluster-api-provider-foo` provider:
+- Must implement a `FKSCluster` CRD and the corresponding `FKSClusterTemplate` CRD and the related controllers; those controllers must create a `ClusterEndpoint` CR as soon as the control plane endpoint is available
+  - The `FKSCluster` controller
+    - Must create only the additional piece of the _Kubernetes Cluster infrastructure_ not provisioned by the `FKS` managed Kubernetes instance (in this example the `FooElasticIP` and the `FooElasticLoadBalancer`)
+    - As soon as the control plane endpoint is available, must create a `ClusterEndpoint` CR; the `ClusterEndpoint` CR must have an owner reference to the `FKSCluster` object from which it originated.
+    - Must set the `status.Ready` field on the `FKSCluster` object when the provisioning is complete
+- Must implement a `FKRControlControlplane provider`, a `FKRControlControlplane` CRD, the corresponding `FKRControlControlplane` and related controllers
+  - The `FKRControlControlplane` controller:
+    - Must wait for the `status.InfrastructureReady` field on the `Cluster` object to be set to true before starting to provision the `FKS` managed Kubernetes instance.
+    - Must wait for the `spec.ControlPlaneEndpoint` field on the `Cluster` object to be set before starting to provision the `FKS` managed Kubernetes instance.
+    - Must set the `status.Ready` field on the `FKRControlControlplane` object when the provisioning is complete
+
+Please note that this scenario is equivalent to what is implemented for a non-managed Kubernetes `FooCluster`, backed by Cluster API managed `FooMachines`, with the only difference that in this case it is possible to rely on `KCP` as the `ControlPlane provider`, and thus point 2 of the above list does not apply.
+
+## Background work
+
+This proposal builds on top of the awesome research work of the Managed Kubernetes working group, and it is a result of a huge work of a team of passionate Cluster API contributors.
+
+Below a summary of the main evidences / alternative considered during this this work.
-#### EKS in CAPA +### EKS in CAPA - [Docs](https://cluster-api-aws.sigs.k8s.io/topics/eks/index.html) - Feature Status: GA From 9547470ca69ba03a3ec6d4ef7c3285180b1d3a49 Mon Sep 17 00:00:00 2001 From: Jack Francis Date: Wed, 1 Nov 2023 08:33:03 -0700 Subject: [PATCH 5/5] remove background history Signed-off-by: Jack Francis --- ...20230407-flexible-managed-k8s-endpoints.md | 281 ++---------------- 1 file changed, 31 insertions(+), 250 deletions(-) diff --git a/docs/proposals/20230407-flexible-managed-k8s-endpoints.md b/docs/proposals/20230407-flexible-managed-k8s-endpoints.md index ffa6c790cbb3..9b3e1b70672c 100644 --- a/docs/proposals/20230407-flexible-managed-k8s-endpoints.md +++ b/docs/proposals/20230407-flexible-managed-k8s-endpoints.md @@ -165,8 +165,12 @@ In order to allow the `ControlPlane Provider` component to take ownership of the ```yaml apiVersion: cluster.x-k8s.io/v1beta1 kind: ClusterEndpoint +metadata: + labels: + cluster.x-k8s.io/cluster-name: my-cluster spec: - host: "name-1234567890.region.elb.amazonaws.com" + cluster: my-cluster + host: "my-cluster-1234567890.region.elb.amazonaws.com" port: 1234 type: ExternalControlPlaneEndpoint ``` @@ -174,7 +178,11 @@ spec: ```yaml apiVersion: cluster.x-k8s.io/v1beta1 kind: ClusterEndpoint +metadata: + labels: + cluster.x-k8s.io/cluster-name: my-cluster-2 spec: + cluster: my-cluster-2 host: "10.40.85.102" port: 1234 type: ExternalControlPlaneEndpoint @@ -183,13 +191,31 @@ spec: This is how the type specification would look: ```go +// ClusterEndpointType describes the type of cluster endpoint. +type ClusterEndpointType string + // ClusterEndpoint represents a reachable Kubernetes API endpoint serving a particular cluster function. type ClusterEndpoint struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec ClusterEndpointSpec `json:"spec,omitempty"` +} + +// ClusterEndpointSpec defines the desired state of the Cluster endpoint. +type ClusterEndpointSpec struct { // The Host is the DNS record or the IP address that the endpoint is reachable on. Host string `json:"host"` // The port on which the endpoint is serving. Port int32 `json:"port"` + + // Cluster is a reference to the cluster name that this endpoint is reachable on. + Cluster string `json:"cluster"` + + // Type describes the function that this cluster endpoint serves. + // +kubebuilder:validation:Enum=apiserver + Type ClusterEndpointType `json:"type"` } ``` @@ -215,7 +241,7 @@ However, Infra providers will be made aware that `spec.controlPlaneEndpoint` wil - All the controllers working with Cluster objects must take into account that the `infrastructureRef` field could be omitted; most notably: - The Cluster controller must use skip reconciling this external reference when the `infrastructureRef` is missing; also, the `status.InfrastructureReady` field must be automatically set to true in this case. -- The Cluster controller must reconcile the new `ClusterEndpoint` CR. Please note that: +- A controller (details TBD) will reconcile the new `ClusterEndpoint` CR. Please note that: - The value from the `ClusterEndpoint` CRD must surface on the `spec.ControlPlaneEndpoint` field on the `Cluster` object. - If both are present, the value from the `ClusterEndpoint` CRD must take precedence on the value from `Cluster` objects still using the `spec.controlPlaneEndpoint`. 
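To illustrate the precedence rule above, whichever controller ends up reconciling `ClusterEndpoint` (details TBD, as noted) could resolve the effective endpoint roughly as in the following sketch; `ClusterEndpoint` is the new type proposed in this document and the helper name is hypothetical:

```go
package controllers

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// resolveControlPlaneEndpoint returns the effective control plane endpoint for a
// Cluster: a ClusterEndpoint CR referencing the Cluster takes precedence over the
// legacy spec.controlPlaneEndpoint field.
//
// NOTE: clusterv1.ClusterEndpoint is the new type proposed in this document and
// does not exist in Cluster API yet; this is only an illustrative sketch.
func resolveControlPlaneEndpoint(cluster *clusterv1.Cluster, endpoints []clusterv1.ClusterEndpoint) clusterv1.APIEndpoint {
	for _, ep := range endpoints {
		// Only consider ClusterEndpoints that reference this Cluster.
		if ep.Spec.Cluster != cluster.Name {
			continue
		}
		// The ClusterEndpoint CR wins over any value still set on the Cluster object.
		return clusterv1.APIEndpoint{Host: ep.Spec.Host, Port: ep.Spec.Port}
	}
	// Fall back to the legacy field for providers that have not migrated yet.
	return cluster.Spec.ControlPlaneEndpoint
}
```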
@@ -224,13 +250,13 @@ However, Infra providers will be made aware that `spec.controlPlaneEndpoint` wil
 
 #### Provider controller changes
 
-- All the `Cluster` controller who are responsible to create a control plane endpoint
+- All the `Cluster` controllers who are responsible for creating a control plane endpoint
   - As soon as the `spec.controlPlaneEndpoint` field in the `Cluster` object is removed, the `Cluster` controller must instead create a `ClusterEndpoint` CR to communicate the control plane endpoint to the Cluster API core controllers
     - NOTE: technically it is possible to start creating the `ClusterEndpoint` CR *before* the removal of the `spec.controlPlaneEndpoint` field, because the new CR will take precedence over the value read from the field, but this is up to the infra provider maintainers.
   - The `ClusterEndpoint` CR must have an owner reference to the `Cluster` object from which it originated.
-- All the `ControlPlane Provider` controller who are responsible to create a control plane endpoint
-  - Must stop to wait for the `spec.ControlPlaneEndpoint` field on the `Cluster` object to be set before starting to provision the control plane.
+- All the `ControlPlane Provider` controllers who are responsible for creating a control plane endpoint
+  - Must no longer wait for the `spec.ControlPlaneEndpoint` field on the `Cluster` object to be set before starting to provision the control plane.
   - As soon as the Managed Kubernetes Service-provided control plane endpoint is available, the controller must create a `ClusterEndpoint` CR to communicate the control plane endpoint to the Cluster API core controllers
     - The `ClusterEndpoint` CR must have an owner reference to the `ControlPlane` object from which it originated.
 
@@ -279,251 +305,6 @@ If the `Foo` cloud provider has a `FKS` managed Kubernetes offering that is not
 
 Please note that this scenario is equivalent to what is implemented for a non-managed Kubernetes `FooCluster`, backed by Cluster API managed `FooMachines`, with the only difference that in this case it is possible to rely on `KCP` as the `ControlPlane` provider, and thus point 2 of the above list does not apply.
 
-## Background work
-
-This proposal builds on top of the awesome research work of the Managed Kubernetes working group, and it is the result of extensive work by a team of passionate Cluster API contributors.
-
-Below is a summary of the main evidence and alternatives considered during this work.
- -### EKS in CAPA - -- [Docs](https://cluster-api-aws.sigs.k8s.io/topics/eks/index.html) -- Feature Status: GA -- CRDs - - AWSManagedCluster - passthrough kind to fullfill the capi contract - - AWSManagedControlPlane - provision EKS cluster - - AWSManagedMachinePool - corresponds to EKS managed node pool -- Supported Flavors - - AWSManagedControlPlane with MachineDeployment / AWSMachine - - AWSManagedControlPlane with MachinePool / AWSMachinePool - - AWSManagedControlPlane with MachinePool / AWSManagedMachinePool -- Bootstrap Provider - - Cluster API bootstrap provider EKS (CABPE) -- Features - - Provisioning/managing an Amazon EKS Cluster - - Upgrading the Kubernetes version of the EKS Cluster - - Attaching self-managed machines as nodes to the EKS cluster - - Creating a machine pool and attaching it to the EKS cluster (experimental) - - Creating a managed machine pool and attaching it to the EKS cluster - - Managing "EKS Addons" - - Creating an EKS Fargate profile (experimental) - - Managing aws-iam-authenticator configuration - -#### AKS in CAPZ - -- [Docs](https://capz.sigs.k8s.io/topics/managedcluster.html) -- Feature Status: GA -- CRDs - - AzureManagedControlPlane, AzureManagedCluster - provision AKS cluster - - AzureManagedMachinePool - corresponds to AKS node pool -- Supported Flavor - - AzureManagedControlPlane + AzureManagedCluster with AzureManagedMachinePool - -#### GKE in CAPG - -- [Docs](https://github.com/kubernetes-sigs/cluster-api-provider-gcp/blob/v1.3.0/docs/book/src/topics/gke/index.md) -- Feature Status: Experimental () -- CRDs - - GCPManagedControlPlane, GCPManagedCluster - provision GKE cluster - - GCPManagedMachinePool - corresponds to the managed node pool for the cluster -- Supported Flavor - - GCPManagedControlPlane + GCPManagedCluster with GCPManagedMachinePool - - -#### Learnings from original Proposal: Two kinds with a Managed Control Plane & Managed Infra Cluster adhering to the current CAPI contracts - -The original Managed Kubernetes proposal recommends managing two separate resources for cluster and control plane configuration, what we're referring to as a `Cluster` and a `ControlPlane`. That recommendation is outlined as [Option 3 in the proposal, here][managedKubernetesRecommendation]. This recommendation has been followed by CAPOCI and CAPG as of this writing. - -This propsal was able to be implemented with no upstream changes in CAPI. It makes the following assumptions about representing Managed Kubernetes: - -- **`Cluster`** - Provides any base infrastructure that is required as a prerequisite for the target environment required for running machines and creating a Managed Kubernetes service. -- **`ControlPlane`** - Represents an instance of the actual Managed Kubernetes service in the target environment (i.e. cloud/service provider). It’s based on the assumption that a Managed Kubernetes service supplies the Kubernetes control plane. - -These broadly follow the existing separation within CAPI. - -However, for many Managed Kubernetes services this will require less than ideal code in the controllers to retrieve the control plane endpoint from the `ControlPlane` kind and report it back via the ControlPlaneEndpoint property on the `Cluster` to satisfy CAPI contracts. 
- -To give an idea what this means: -- `Cluster` watches the control plane and vice versa -- `Cluster` controller create base infra and sets Ready = true -- `ControlPlane` waits for `Cluster` to be Ready -- `ControlPlane` creates an instance of the managed k8s service -- `ControlPlane` gets the API server endpoint from the managed k8s service and stores it in the CRD instance -- `Cluster` is watching for changes to `ControlPlane` and if the "api server endpoint" on the `ControlPlane` CRD instance is not empty then: - - Map `ControlPlane` to `Cluster` and queue event - - `Cluster` reconciler loop gets the `ControlPlane` CRD instance and takes the value for "api server endpoint" and populates `ControlPlaneEndpoint` on the `Cluster` CRD instance. - - (which will then cause the reconciler for `ControlPlane` to run... again) - -The implementation of the controllers for Managed Kubernetes would be simplified if there was an option to report the ControlPlaneEndpoint via `ControlPlane` instead. Below we will outline two new flows that reduce much of the complexity of the above, while allowing Managed Kubernetes providers to represent their services intuitively. - -### Two New Flows - -#### Flow 1: `Cluster` and `ControlPlane`, with `ControlPlaneEndpoint` reported via `ControlPlane` - -We will describe a CRD composition that adheres to the original separation of concerns of the different provider types as documented in the Cluster API documentation, with a different API Server endpoint reporting flow. - -As described above, at present the control plane endpoint must be returned via the `ControlPlaneEndpoint` field on the spec of the `Cluster` [reference here](https://cluster-api.sigs.k8s.io/developer/providers/cluster-infrastructure.html). This is OK for self-managed clusters, as a load balancer is usually created as part of the reconciliation. But with Managed Kubernetes services the API Server endpoint usually comes from the service directly, which means that the `Cluster` has to get the `ControlPlaneEndpoint` from the managed service so that it can be reported back to CAPI. In practice, this results in `Cluster` watching the `ControlPlane` and the `ControlPlane` watching the `Cluster`, and without care this can cause event storms in the CAPI management cluster. - -This flow would require making changes to CAPI controllers so that there is an option to report the `ControlPlaneEndpoint` via the `ControlPlane` as an alternative to coming from the `Cluster`. - -Using CAPG as an example: - -```go -type GCPManagedControlPlaneSpec struct { - // AddonsConfig defines the addons to enable with the GKE cluster. - // +optional - AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"` - - // Logging contains the logging configuration for the GKE cluster. - // +optional - Logging *ControlPlaneLoggingSpec `json:"logging,omitempty"` - - // EnableKubernetesAlpha will indicate the kubernetes alpha features are enabled - // +optional - EnableKubernetesAlpha bool - - // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane. - // +optional - ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"` - - ... -} - - -type GCPManagedClusterSpec struct { - // Project is the name of the project to deploy the cluster to. - Project string `json:"project"` - - // The GCP Region the cluster lives in. - Region string `json:"region"` - - // NetworkSpec encapsulates all things related to the GCP network. 
- // +optional - Network NetworkSpec `json:"network"` - - // FailureDomains is an optional field which is used to assign selected availability zones to a cluster - // FailureDomains if empty, defaults to all the zones in the selected region and if specified would override - // the default zones. - // +optional - FailureDomains []string `json:"failureDomains,omitempty"` - - ... -} -``` - -**Pros** - -- Simplifies provider implementation when reporting `ControlPlaneEndpoint` -- Clearer separation between the lifecycle management of the general cloud infrastructure required for the cluster and the actual managed control plane (GKE in this example) -- Follows the original intentions of an "infrastructure" and "control-plane" provider -- Enables removal/addition of properties for a Managed Kubernetes cluster that may be different from a self-managed Kubernetes cluster -- Works with ClusterClass - -**Cons** - -- Requires changes upstream to CAPI controllers to support the change of reporting `ControlPlaneEndpoint` -- Duplication of API definitions between self-managed and managed `Cluster` definitions and related controllers -- Users need to be aware of when to use the unmanaged or managed `Cluster` definitions. - -#### Flow 2: Change CAPI to make `Cluster` optional - -This option follows along from the first flow above (`ControlPlaneEndpoint` reported by `ControlPlane` resource rather than `Cluster` resource), but takes it further and makes the `Cluster` resource optional. - -This option would allow providers to implement only a `ControlPlane` resource. Using CAPG as an example, rather than: - -- `Cluster` ←→ `GCPManagedCluster` + `GCPManagedControlPlane` - -We would enable: - -- `Cluster` ←→ `GCPManagedControlPlane` - -This would have the advantage of imposing a separation of configuration between each provider’s `Cluster` and `ControlPlane`’s resources. Because our observations have been that various Managed Kubernetes service providers do things a little bit differently, this separation is hard to define and enforce across all providers in a way that is agreeable to each provider. - -In practice this will help Managed Kubernetes provider implementations that do not provide infrastructure resources as part of the service contract, and as of now are required to implement a `Cluster` resource (e.g., `AzureManagedCluster` ) as a sort of proxy resource that exists solely to fulfill the CAPI requirement for an `Cluster` partner of its corresponding Cluster resource even though there is no infrastructure to describe: - -```golang -type ClusterSpec struct { - ... - // InfrastructureRef is a reference to a provider-specific resource that holds the details - // for provisioning infrastructure for a cluster in said provider. - // +optional - InfrastructureRef *corev1.ObjectReference `json:"infrastructureRef,omitempty"` - ... -} -``` - -The above API specification snippet for `ClusterSpec` emphasizes (in the type comment) that in fact the `InfrastructureRef` child property is an optional property of the data model. We are able to take advantage of this data specification to accommodate these non-infrastructure-providing Managed Cluster infrastructure scenarios, and are entirely able to be represented as a “managed control plane” abstraction. Work will need to be done in the CAPI controllers to support this new workflow, which was originally implemented prior to Managed Kubernetes scenarios being considered. 
- -**Pros** - -- Does not require any change to existing Cluster API CRDs -- Flexible: enables more expressive API semantics for the various scenarios of Managed Kubernetes -- Is a natural evolution of the prior effort to standardize Managed Kubernetes on CAPI, doesn’t require users following this effort to entirely rethink how they can invest in CAPI + Managed Kubernetes - -**Cons** - -- Would require an update to the existing Cluster API contract to accommodate new workflows - -#### Alternative Option: Introduce a new Managed Kubernetes provider type (with contract) - -This option would introduce a new native Managed Kubernetes type definition into Cluster API, which would have the result of standardizing what Managed Kubernetes looks like for all providers under a common interface. We can use the CAPI type definition of “Cluster”, and the various provider implementations of that (e.g., `GCPCluster`) as a model to copy when we design a native Managed Kubernetes specification. - -Defining a new CAPI Managed Kubernetes type would require us to discover and standardize the set of "common" (relevant across all providers) specification data into a new set of CAPI types, e.g.: - -```golang -type ManagedCluster struct { - metav1.TypeMeta `json:",inline"` - metav1.ObjectMeta `json:"metadata,omitempty"` - - Spec ManagedClusterSpec `json:"spec,omitempty"` - Status ManagedClusterStatus `json:"status,omitempty"` -} - -type ManagedClusterSpec struct { - // Cluster network configuration. - // +optional - ClusterNetwork *ClusterNetwork `json:"clusterNetwork,omitempty"` - - // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane. - // +optional - ControlPlaneEndpoint APIEndpoint `json:"controlPlaneEndpoint,omitempty"` - - // InfrastructureRef is a reference to a provider-specific resource that holds the details - // for provisioning infrastructure for a cluster in said provider. - // +optional - InfrastructureRef *corev1.ObjectReference `json:"infrastructureRef,omitempty"` -} -``` - -Each provider would then implement its own corresponding type definition: - -```golang -type GCPManagedCluster struct { - .... -} -``` - -Our job is to balance the beneficial outcomes of standardization and consistency by strictly defining certain "common" properties that each provider will fulfill, while enabling enough flexibility to allow providers to meaningfully represent their particular environments. - -**Pros** - -- Standardizing the spec at the foundational, Cluster API layer optimizes for consistency across providers - -**Cons** - -- Would require a new set of resource specifications to the existing Cluster API spec -- Differentiates "self-managed clusters" from "managed clusters" at the foundational API layer: - - Self-managed clusters would use the `Cluster` API resource as the top-level primitive object - - Managed clusters would use the `ManagedCluster` API resource as the top-level primitive object - - For example, to see all clusters under management at present, you can issue a `kubectl get clusters --all-namespaces` command (or the API equivalent); going forward, you would issue `kubectl get clusters,managedclusters --all-namespaces` -- There are no existing provider implementations. All existing provider implementations (e.g., CAPA, CAPZ, CAPOCI, CAPG) would need to be replaced or augmented in order to use a new spec. 
- -## Recommendations - -Because Managed Kubernetes was not yet in scope for Cluster API when it first appeared and gained rapid adoption, we are incentivized for paths forward that use the existing, mature, widely used API specification. The option to create a new `ManagedCluster` API type to best enforce provider consistency thusly has a high bar to clear in order to justify itself as the best option for the next phase of Managed Kubernetes in Cluster API. - -We conclude that enabling the the CAPI controllers to source authoritative `ControlPlaneEndpoint` data from the `ControlPlane` resource is non-invasive to existing API contracts, and offers non-trivial flexibility for CAPI Managed Kubernetes providers at a small additional cost to CAPI maintenance going forward. Existing implementations that leverage a "proxy" `Cluster` resource merely to satisfy CAPI contracts can be simplified by dropping the `Cluster` resource altogether at little-to-no cost to their existing user communities. New Managed Kubernetes provider implementations will now have a little more flexibility to use a implementation that uses only a `ControlPlane` resource, if that is appropriate, or for implementations that define both a `Cluster` + `ControlPlane` with the appropriate configuration distribution [following our recommendation][managedKubernetesRecommendation], those implementations can be non-trivially simplified with their `ControlPlaneEndpoint` data being observed and straightforwardly returned via `ControlPlane`, the most common source of truth for a Managed Kubernetes service. - ## Implementation History - [x] 01/11/2023: Compile a Google Doc to organize thoughts prior to CAEP [link here](https://docs.google.com/document/d/1rqzZfsO6k_RmOHUxx47cALSr_6SeTG89e9C44-oHHdQ/)