Decouple Infrastructure resource from infra providers machine controller #4095

Closed · 1 of 2 tasks
enxebre opened this issue Jan 20, 2021 · 19 comments · Fixed by #4135
Labels
kind/feature Categorizes issue or PR as related to a new feature. kind/proposal Issues or PRs related to proposals.
Milestone
v0.4.0

Comments

@enxebre (Member) commented Jan 20, 2021

User Story

As a CAPI consumer, I'd like to plug in my own ControlPlane and Infrastructure resources [1] while still reusing the existing machine controller implementation for infra providers.

Today this is not possible in some providers, because the machine controllers are tightly coupled to the regular AWSCluster/AzureCluster kinds [2].

A scenario where this is handy is one where there's a common vision for a ControlPlane across providers (e.g. cluster-api-provider-nested) but the infrastructure management can differ arbitrarily from the core implementations, e.g. BYO (bring your own) infrastructure.

[1] https://github.com/kubernetes-sigs/cluster-api/blob/master/api/v1alpha4/cluster_types.go#L50-L58
[2] https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/controllers/azuremachine_controller.go#L186-L204
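
For illustration, a minimal sketch of what [1] allows: the Cluster object holds plain object references for its control plane and infrastructure, so in principle any kind can be plugged in. Field names follow the v1alpha4 API linked above; the example API groups and kinds are hypothetical placeholders.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha4"
)

// customCluster builds a Cluster whose control plane and infrastructure are
// provided by custom (non-reference) kinds. The references are untyped, so
// the contract is behavioural rather than tied to AWSCluster/AzureCluster.
func customCluster() *clusterv1.Cluster {
	return &clusterv1.Cluster{
		ObjectMeta: metav1.ObjectMeta{Name: "my-cluster", Namespace: "default"},
		Spec: clusterv1.ClusterSpec{
			ControlPlaneRef: &corev1.ObjectReference{
				APIVersion: "controlplane.example.io/v1alpha1", // hypothetical group
				Kind:       "MyControlPlane",
				Name:       "my-cluster-cp",
			},
			InfrastructureRef: &corev1.ObjectReference{
				APIVersion: "infrastructure.example.io/v1alpha1", // hypothetical group
				Kind:       "ByoCluster",
				Name:       "my-cluster-infra",
			},
		},
	}
}
```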


/kind feature

@k8s-ci-robot added the kind/feature label Jan 20, 2021
@fabriziopandini (Member)

I'm not sure I fully understand all the nuances here.
If I got it right, the ask is to have a sort of generic infrastructure machine controller, and then plug in the infrastructure-specific bits, right? Is there a description/comparative analysis of the different infrastructure providers, so I can better understand which parts we want to re-use across infra providers?
Expanding a little bit, does this effort require a design doc/a CAEP?

@enxebre (Member, author) commented Jan 20, 2021

@fabriziopandini This is meant to be an umbrella ticket to track better decoupling between the providers' cluster infrastructure CRs and the providers' machine controllers.
This decoupling enables the providers' machine controllers to work with external cluster infrastructure CRs, which in turn enables scenarios where the cluster infra requires special treatment that differs from the reference implementation, e.g. bring-your-own infra for the cluster.

See the AWS implementation details:
kubernetes-sigs/cluster-api-provider-aws#2124
kubernetes-sigs/cluster-api-provider-aws#2125

And Azure:
kubernetes-sigs/cluster-api-provider-azure#1129

@vincepri (Member) commented Jan 20, 2021

Ref: #4063 (comment)

The linked comment describes a set of solutions that should enable this use case as well. A MachineShim (or similar) would effectively decouple Machines from their infrastructure part by bringing in a ProviderID (Kubernetes Node). Would that suffice?

@vincepri (Member)

/kind proposal
/milestone v0.4.0

@k8s-ci-robot added this to the v0.4.0 milestone Jan 20, 2021
@k8s-ci-robot added the kind/proposal label Jan 20, 2021
@JoelSpeed (Contributor)

> A MachineShim (or similar) would effectively decouple machines from their infrastructure part

I read through the comment that was linked, and as I understand it, it is more about creating fake Machine representations that allow some Machine-like things (e.g. lifecycle hooks) to be leveraged in things like MachinePools. So there's no Machine infrastructure in that case, right?

My understanding of the use case in this issue is more that we want to be able to use the Machine controller and Infrastructure Machine controller without using the Cluster and Cluster Infrastructure resources/controllers. I believe this is a different problem, unless I've missed some nuance of the MachineShim concept.

@vincepri (Member)

cc @hasheddan

@randomvariable (Member) commented Jan 22, 2021

@yastij, @detiber and I were discussing #1250 and came to the realisation that we also need a similar sort of contract for infrastructure provider load balancers vs. clusters: the load balancer may be provided by one infrastructure provider, e.g. AWS, but attached to a cluster using a vSphereCluster. I think the contract would then be the same as what RH needs, and we could define a more generic one that describes how InfraComponents should be instantiable without an InfraCluster.

I don't think a MachineShim is the right thing here. Concretely, the issue is that you want to provision something (a Machine or a load balancer), and today we know where to provision that thing (e.g. the VPC, security groups, subnets, etc.) based on information in the InfraCluster object. If that InfraCluster object is mismatched, then today's InfraMachine controller, or a future InfraLoadBalancer controller, won't work.
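
To make the coupling concrete, here is a condensed, illustrative sketch of the lookup pattern (loosely modelled on the AzureMachine controller linked in [2] above; import paths, helpers and fields are approximations, not the exact provider code):

```go
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	infrav1 "sigs.k8s.io/cluster-api-provider-azure/api/v1alpha3"
	"sigs.k8s.io/cluster-api/util"
)

type azureMachineReconciler struct{ client.Client }

func (r *azureMachineReconciler) reconcileNormal(ctx context.Context, m *infrav1.AzureMachine) (ctrl.Result, error) {
	// Resolve the owning CAPI Cluster from the InfraMachine's metadata.
	cluster, err := util.GetClusterFromMetadata(ctx, r.Client, m.ObjectMeta)
	if err != nil {
		return ctrl.Result{}, err
	}

	// Hard-coded dependency: only the concrete AzureCluster kind is fetched,
	// so an external/custom cluster infrastructure CR cannot satisfy the lookup.
	azureCluster := &infrav1.AzureCluster{}
	key := client.ObjectKey{Namespace: m.Namespace, Name: cluster.Spec.InfrastructureRef.Name}
	if err := r.Get(ctx, key, azureCluster); err != nil {
		// Reconciliation stalls whenever the referenced kind is not an AzureCluster.
		return ctrl.Result{}, err
	}

	// Placement details (VNet, subnets, security groups, location) are then
	// read straight off the concrete AzureCluster, which is the coupling at issue.
	_ = azureCluster.Spec.NetworkSpec
	return ctrl.Result{}, nil
}
```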

@enxebre (Member, author) commented Jan 22, 2021

@randomvariable that's right. I wonder if, instead of having an additional externalInfraCluster CR as originally attempted in AWS, we could come up with a very clear definition of what "infrastructure" and "unmanaged" mean across providers, and try to leverage "unmanaged" for this use case.

Currently this is not possible in AWS, as an "unmanaged" AWSCluster CR will still try to reconcile default security groups and the API server load balancer.

@enxebre (Member, author) commented Jan 25, 2021

@randomvariable @JoelSpeed @yastij @detiber @vincepri would you be OK with proceeding as described above, letting the "unmanaged" flavour be fully unmanaged by no longer reconciling security groups and the API server load balancer?
I described the use case here: https://docs.google.com/document/d/1uqzpQjEQ9s0gfHppDcRa4zZeQTXGtca4v19bSYPIDgM/edit#

@vincepri (Member)

@enxebre What if, instead of adding yet another reference to the objects, we generically allowed retrieving a ClusterInfrastructure-compatible object from a ConfigMap or Secret, or whatnot?

In other words, is there a way to have the current AWSCluster, AzureCluster, etc. function as interfaces? That way, we'd have a clear, well-defined contract already in place that we could use to create a bridge from other resources.

An alternative to a ConfigMap or Secret is to have a generic field or annotation that lets us create these resources, but stops the controller from reconciling them.
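
A rough sketch of that annotation variant, for concreteness: the annotation key below is hypothetical (no such key existed at the time of writing), and the gate mirrors how CAPI controllers already honour the cluster.x-k8s.io/paused annotation. Import paths and fields approximate the CAPZ API rather than exact provider code.

```go
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	infrav1 "sigs.k8s.io/cluster-api-provider-azure/api/v1alpha3"
)

// Hypothetical marker; the real name and contract would be settled in a proposal.
const externallyManagedAnnotation = "cluster.x-k8s.io/externally-managed"

type azureClusterReconciler struct{ client.Client }

func (r *azureClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	azureCluster := &infrav1.AzureCluster{}
	if err := r.Get(ctx, req.NamespacedName, azureCluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if _, ok := azureCluster.GetAnnotations()[externallyManagedAnnotation]; ok {
		// Treat the pre-populated spec as the source of truth: report Ready so
		// the rest of CAPI can proceed, and reconcile nothing in the cloud.
		azureCluster.Status.Ready = true
		return ctrl.Result{}, r.Status().Update(ctx, azureCluster)
	}

	// ...normal reconciliation of security groups, load balancer, etc....
	return ctrl.Result{}, nil
}
```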

@alexander-demicev (Contributor)

Can we also consider allowing machines to be "clusterless"? For use cases where infrastructure is provided by the user, machines could stop relying on clusters and read values like subnets, security groups, regions, etc. from the machine template.
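
For concreteness, a hedged sketch of what that could look like: AWSMachineSpec already carries subnet and security-group references today (shapes approximated below), while Region is a hypothetical addition, since it is currently resolved via the AWSCluster object.

```go
package sketch

// Approximate shape of an AWS resource reference (by ID or ARN).
type awsResourceReference struct {
	ID  *string
	ARN *string
}

// clusterlessAWSMachineSpec illustrates the idea: everything the machine
// controller needs for placement lives on the machine (template) itself.
type clusterlessAWSMachineSpec struct {
	InstanceType string

	// Already present on AWSMachineSpec today (approximate shapes).
	Subnet                   *awsResourceReference
	AdditionalSecurityGroups []awsResourceReference

	// Hypothetical addition: currently read from the AWSCluster object,
	// so a clusterless machine would need to carry it directly.
	Region string
}
```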

@vincepri (Member)

Possibly. I'd like to hear from the infrastructure provider maintainers on that bit, though, because it increases the support scope.

@CecileRobertMichon @devigned @randomvariable @sedefsavas @yastij @srm09

As a side benefit, having a generic data interface would let us welcome other infrastructure-based managers like Terraform or Crossplane.

@JoelSpeed (Contributor)

> As a side benefit, having a generic data interface would let us welcome other infrastructure based managers like Terraform or Crossplane

I'm intrigued by this idea, but at the moment I'm not sure of the benefit over trying to re-use existing infrastructure cluster resources with the generic extra field that prevents reconciliation, as you've suggested 🤔

My understanding of the ideas in the thread so far is that, in theory, I could use Terraform or something to create my infrastructure, manually populate the spec of an AWSCluster resource with the details from my Terraform environment, and then add that to the cluster. At this stage, it isn't usable because the controller needs to mark the status ready. So in this scenario I want to mark the AWSCluster as no-op/unmanaged/name TBD, in which case the controller looks at the pre-populated resource, says "yep, OK, it has the minimum values I expect, the cluster is ready" (or something like this), and then does nothing else: no reconciliation of any AWS resources, ever, for the lifetime of this CR.

For this use case, the existing resources already have all of the fields I'd expect as an AWS user to fill in, and they are already understood by the controllers, so it seems like it would be easy to add a no-op mode to each of the providers.

With the generic data interface, I assume I would be using a ConfigMap or equivalent and still be putting in the same field names so the interface can pick up the required values? Is the main advantage that no controller would be watching these, so we wouldn't need to define the term "unmanaged" uniformly across all providers?
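
Sketching the pre-populated flow described above (CAPA field names approximated from the v1alpha3 API; the values are placeholders standing in for Terraform outputs):

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	infrav1 "sigs.k8s.io/cluster-api-provider-aws/api/v1alpha3"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

// byoAWSCluster is an AWSCluster pre-populated by hand (or from Terraform
// outputs) rather than by the controller. Under the proposed no-op/unmanaged
// mode, the controller would only validate it and mark it ready.
func byoAWSCluster() *infrav1.AWSCluster {
	return &infrav1.AWSCluster{
		ObjectMeta: metav1.ObjectMeta{Name: "byo", Namespace: "default"},
		Spec: infrav1.AWSClusterSpec{
			Region: "us-east-1",
			ControlPlaneEndpoint: clusterv1.APIEndpoint{
				Host: "byo-api.example.com", // pre-existing load balancer
				Port: 6443,
			},
			NetworkSpec: infrav1.NetworkSpec{
				VPC: infrav1.VPCSpec{ID: "vpc-0123456789abcdef0"}, // pre-existing VPC
			},
		},
	}
}
```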

@randomvariable (Member) commented Jan 26, 2021

I don't think a "generic extra field" is going to work. The networking constructs are not compatible across cloud providers, and we've already hit this before by over-abstracting failure domains. I think an explicit "unmanaged" toggle for resources would work here (which also has the side benefit of aiding users who currently have to "guess" what to provision to get an unmanaged AWSCluster). It would also be useful for dealing with rate limits in edge cases (not Edge): very large clusters, or lots of clusters in an account, where the rate limits just keep being hit. I still think there's a question about how to deal with this in the load balancer proposal too. I think the path forward here, as with single-controller multitenancy, is to take the Cluster API AWS v1alpha4 proposal as an instance of an infrastructure provider contract.

@yastij (Member) commented Jan 26, 2021

Catching up on this: today, for the vSphere provider, you should be able to create machines without needing cluster objects. In general, for BYO infra there are two cases:

  • provision the infra however you want, and have an "unmanaged" field at the infra/cluster level
  • use a para-virtual model, where the controller reads things like tags from infra resources to create the missing cluster objects

I think the first one is cheaper to implement and has acceptable UX.

@hasheddan (Contributor)

Hey folks! Wanted to weigh in here to give some context around how Crossplane works and how it could potentially fit into this use-case. Crossplane is similar to CAPI in that it has providers for the different clouds. However, it aims to support provisioning all managed services on every cloud provider, which is a superset of those required for CAPI. Each provider installs CRDs for all of its managed services and they map 1:1 with the cloud provider API (for instance, here are the currently supported services for provider-aws: https://doc.crds.dev/github.com/crossplane/provider-aws).

On top of these primitive resources, Crossplane provides a composition layer. This allows users to define abstract resource types (CompositeResourceDefinitions, a.k.a. XRDs) that map to one or more of the primitive types. For instance, a good example is creating a Cluster XRD that maps to all the resources required to create an EKS cluster. You can also have multiple Compositions for a single XRD: for instance, you may be able to satisfy the same Cluster XRD with the resources required for a GKE cluster. This allows you to have powerful abstractions, which can also be nested (i.e. I could have a ClusterGroup XRD that composes multiple EKS and GKE cluster XRDs, etc.).

You can see the similarities to the mapping of generic resources in CAPI to their concrete implementations in CAPI providers. Supporting a common interface for the concrete implementations would allow users to author XRDs that could be backed by many different Compositions that provide infrastructure for a k8s cluster. A good example: you could transparently swap out a cluster made up of EC2 instances and related services for an EKS cluster.

We see Crossplane as already doing the work on the gritty details of managing the life-cycle of the granular infrastructure resources and are working directly with many of the cloud providers to make sure those implementations are reliable and production-ready. We also see Crossplane as a strong solution to defining the abstractions that CAPI knows how to interact with. The advantage of bringing the projects closer together would include:

  • CAPI would no longer need to maintain providers for each cloud provider. Crossplane would become the engine that provided the resources that CAPI is capable of orchestrating.
  • CAPI could focus on what it is best at: provisioning and managing Kubernetes clusters. If suddenly a new resource was required to support a specific cluster configuration, you would write a few lines of YAML to include it in the composition, rather than figuring out how to add it to provider-specific CRDs and understanding the life-cycle and APIs of the resource (i.e. let's not do the same work twice!).
  • Crossplane has a strong distribution story that fits well with CAPI. Both CAPI and Crossplane can be installed into a cluster easily (using clusterctl / helm / etc.). Both the Crossplane providers and the abstractions that would be used by CAPI can be distributed as Crossplane packages, which are just OCI images. For instance, the Cluster / GKE / EKS abstractions described above can all be installed and ready to use with kubectl crossplane install configuration <oci-image-registry/repo:tag>. (More on Crossplane packages here.)
  • Because the abstractions are easy to define, if a specific user did not like the way CAPI provisioned clusters on AWS, they could just write their own Composition and drop it in as a replacement to the "officially packaged" one. This would take minutes, and would be an alternative to forking and rewriting a CAPI provider.

The integration would require significant collaboration, but I am confident that it could be done with relatively small changes to the general models used by both projects. Furthermore, I, and many other members of the Crossplane community, would be willing to invest significant effort in making this possible. I am happy to answer any questions, and / or do a more formal presentation / discussion / Q&A at CAPI community meetings.

Lastly, as many of the folks involved in this thread are key members of the CAPI and general upstream k8s community, I want to thank you for your time and effort. The impact CAPI has had and will have on k8s adoption and cluster management cannot be overstated, and the Crossplane community would love to enable that as best we can.

@randomvariable (Member)

If we were to leverage Crossplane, then we'd need to come to some agreement on the multi-tenancy model. CAPA and CAPZ already have implementations nearing completion based on a particular RBAC model; see https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/docs/proposal/20200506-single-controller-multitenancy.md and https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/docs/proposals/20200720-single-controller-multitenancy.md

@devigned (Contributor) commented Jan 26, 2021

> On top of these primitive resources, Crossplane provides a composition layer. This allows users to define abstract resource types (CompositeResourceDefinitions a.k.a XRDs) that map to 1 or more of the primitive types. For instance, a good example is creating a Cluster XRD that maps to all the resources required to create an EKS cluster. You can also have multiple compositions for a single XRD.

The XRDs and Compositions do look to be quite powerful, but also quite complex. The 1-to-1 correspondence with a cloud provider API might be a benefit in some eyes, but to others it's excessive complexity. Those APIs are not fun to deal with, and the resources they describe are low-level. I think there is a lot of value in how the providers take a semi-opinionated approach to how a cluster is built on the given cloud provider. It simplifies the language a user must know to build a best-practices cluster on that provider.

I think it would help folks understand this better to see a proposal with a VM-based representation on a few providers, not one using a managed control plane, and how it would all tie together.

@hasheddan (Contributor)

> If we were to leverage Crossplane, then we need to come to some agreement on the multi-tenancy model.

@randomvariable I think this necessitates a longer conversation than we can have on this issue thread, but at a quick glance, Crossplane does support the various methods of providing credentials that are supported by the AWS SDK for Go. In Crossplane, credentials are specified in a ProviderConfig. Each object then has a reference to a ProviderConfig, which specifies the credentials that will be used to operate on that specific instance of the resource. When creating higher-level abstractions, the abstraction author can force the usage of certain credentials, or allow them to be specified at the XRD level and flow through.

> I think there is a lot of value in how the providers have a semi-opinionated approach to how a cluster is built on the given cloud provider.

@devigned I completely agree. What I am proposing is that the work currently done to create these abstractions in the providers could instead be handled by crafting Compositions of the granular resources (a complex exercise that should be handled by folks with a strong understanding of the APIs); those abstractions would then be published, so users would interact with higher-level, opinionated objects, similar to those provided by the CAPI providers today.

> I think it would help folks to understand this more to see a proposal with VM based representation of AWS and Azure, not one using a managed control plane, and how it would all tie together.

Absolutely. This is something we haven't shown off quite as much as the managed solutions and is obviously a critical component for how many folks are using CAPI today.
