RFE: CAPI failure domain & control plane support #1647

Closed
vincepri opened this issue Oct 24, 2019 · 6 comments · Fixed by #1871

Comments

@vincepri
Member

User Story

As a Cluster API operator I would like to have the control plane resource create Machines in different failure domains.

Detailed Description

The ControlPlane proposal offers a streamlined way to have control plane management in Cluster API. However, it does not support failure domains; this issue is an addendum to add basic support for failure domains to ControlPlane and Machines.

Contract changes

  • Infrastructure providers [optional]
    • During reconciliation, an infrastructure provider populates Status.FailureDomains in its cluster resource (e.g. AcmeCluster).
  • Cluster controller
    • During reconciliation, the infrastructure provider object linked in Cluster.Spec.InfrastructureRef (if any) is read using unstructured object utilities, and the reconciler copies Status.FailureDomains from the InfrastructureCluster to Cluster.Status.FailureDomains.
  • Control Plane controller
    • During reconciliation, the control plane controller takes the Cluster.Status.FailureDomains slice, keeps only the domains that are applicable for control plane use, and picks one by some deterministic method.
    • The controller must take into account which machines have already been created in which failure domain and spread machines evenly across all the available domains.
    • If there are more replicas than available failure domains, the controller cycles through the available ones without returning an error.
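
The control plane controller's selection step could be sketched roughly as follows. This is only an illustration under assumptions: pickFailureDomain is a hypothetical helper (not an existing Cluster API function), and "fewest machines, ties broken by lexical Id order" is just one possible deterministic method, since the proposal does not prescribe one.

```go
package main

import (
	"fmt"
	"sort"
)

// FailureDomain mirrors the type proposed in this issue.
type FailureDomain struct {
	Id           string
	ControlPlane bool
	Attributes   map[string]string
}

// pickFailureDomain is a hypothetical helper, not part of Cluster API.
// It returns the Id of the control-plane-capable domain that currently
// holds the fewest machines. Ties are broken by lexical order of Id so
// the result is deterministic. machineDomains lists the failure domain
// Id of every existing control plane machine.
func pickFailureDomain(domains []FailureDomain, machineDomains []string) (string, bool) {
	// Keep only the domains usable by control plane machines.
	counts := map[string]int{}
	for _, d := range domains {
		if d.ControlPlane {
			counts[d.Id] = 0
		}
	}
	if len(counts) == 0 {
		return "", false
	}

	// Count the machines already placed in each eligible domain.
	for _, id := range machineDomains {
		if _, ok := counts[id]; ok {
			counts[id]++
		}
	}

	// Walk the Ids in sorted order and keep the least-used one.
	ids := make([]string, 0, len(counts))
	for id := range counts {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	best := ids[0]
	for _, id := range ids[1:] {
		if counts[id] < counts[best] {
			best = id
		}
	}
	return best, true
}

func main() {
	domains := []FailureDomain{
		{Id: "zone-a", ControlPlane: true},
		{Id: "zone-b", ControlPlane: true},
		{Id: "zone-c"}, // not suitable for control plane machines
	}

	// Five replicas over two eligible domains: the picks cycle.
	var placed []string
	for i := 0; i < 5; i++ {
		id, _ := pickFailureDomain(domains, placed)
		placed = append(placed, id)
	}
	fmt.Println(placed) // [zone-a zone-b zone-a zone-b zone-a]
}
```

With more replicas than eligible domains the picks simply wrap around, which matches the "cycle without erroring" behavior described above.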

Data model changes

FailureDomain

type FailureDomain struct {
  // Id is the unique identifier of this failure domain.
  Id string `json:"id"`

  // ControlPlane determines if this failure domain is suitable for use by control plane machines.
  // +optional
  ControlPlane bool `json:"controlPlane,omitempty"`

  // Attributes is a free form map of attributes an infrastructure provider might use.
  // +optional
  Attributes map[string]string `json:"attributes,omitempty"`
}

ClusterStatus

type ClusterStatus struct
  • To add
    • FailureDomains [optional]
      • Type: []FailureDomain
      • Description: FailureDomains describes the available failure domains for the cluster to use.
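
As a rough sketch of how the new ClusterStatus field and the Cluster controller's copy step might fit together: the example below uses plain nested maps as a stand-in for the unstructured object utilities so that it stays self-contained. The helper name copyFailureDomains, the JSON key "failureDomains", and the choice to leave the status untouched when the provider reports nothing are all assumptions, not part of the proposal.

```go
package main

import "fmt"

// FailureDomain mirrors the type proposed in this issue.
type FailureDomain struct {
	Id           string            `json:"id"`
	ControlPlane bool              `json:"controlPlane,omitempty"`
	Attributes   map[string]string `json:"attributes,omitempty"`
}

// ClusterStatus shows only the proposed addition.
type ClusterStatus struct {
	// FailureDomains describes the available failure domains for the cluster to use.
	// The JSON key is an assumption.
	// +optional
	FailureDomains []FailureDomain `json:"failureDomains,omitempty"`
}

// copyFailureDomains is a hypothetical helper. infra is the decoded
// infrastructure cluster object (e.g. AcmeCluster) as nested maps.
// If the provider did not populate status.failureDomains, status is
// left unchanged, since the provider side of the contract is optional.
func copyFailureDomains(infra map[string]interface{}, status *ClusterStatus) {
	st, ok := infra["status"].(map[string]interface{})
	if !ok {
		return
	}
	raw, ok := st["failureDomains"].([]interface{})
	if !ok {
		return
	}

	domains := make([]FailureDomain, 0, len(raw))
	for _, item := range raw {
		m, ok := item.(map[string]interface{})
		if !ok {
			continue // skip malformed entries
		}
		id, ok := m["id"].(string)
		if !ok || id == "" {
			continue // Id is the only required field
		}
		fd := FailureDomain{Id: id}
		fd.ControlPlane, _ = m["controlPlane"].(bool)
		if attrs, ok := m["attributes"].(map[string]interface{}); ok {
			fd.Attributes = map[string]string{}
			for k, v := range attrs {
				if s, ok := v.(string); ok {
					fd.Attributes[k] = s
				}
			}
		}
		domains = append(domains, fd)
	}
	status.FailureDomains = domains
}

func main() {
	infra := map[string]interface{}{
		"kind": "AcmeCluster",
		"status": map[string]interface{}{
			"failureDomains": []interface{}{
				map[string]interface{}{"id": "zone-a", "controlPlane": true},
				map[string]interface{}{"id": "zone-b"},
			},
		},
	}

	var status ClusterStatus
	copyFailureDomains(infra, &status)
	for _, fd := range status.FailureDomains {
		fmt.Println(fd.Id, fd.ControlPlane)
	}
}
```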

MachineSpec

type MachineSpec struct
  • To add
    • FailureDomain [optional]
      • Type: *FailureDomain
      • Description: FailureDomain is the failure domain to be used for this Machine.

Goals

  1. To allow a generic mechanism for the Control Plane controller to generate control plane machines that span multiple failure domains.
  2. To avoid requiring user intervention at the Control Plane level to define the Control Plane's failure domains.
  3. To allow infrastructure providers to indicate which failure domains are available for a given Cluster.

Non-Goals / Future Work

  1. To support failure domains for MachineSet or MachineDeployment.

/kind proposal
/milestone v0.3.0
/priority important-soon

@k8s-ci-robot k8s-ci-robot added this to the v0.3.0 milestone Oct 24, 2019
@k8s-ci-robot k8s-ci-robot added kind/proposal Issues or PRs related to proposals. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Oct 24, 2019
@vincepri vincepri self-assigned this Oct 24, 2019
@alexbrand
Contributor

Is my understanding correct in that the goals here are:

  1. The infrastructure providers "advertise" available failure domains via the Cluster resource
  2. The ControlPlane controller reads the available failure domains from the Cluster resource
  3. The ControlPlane controller creates ControlPlane machines in each failure domain

@detiber
Member

detiber commented Oct 31, 2019

@alexbrand I would probably reword your 3rd goal to:

  • The ControlPlane controller creates ControlPlane Machines, spreading them across the failure domains as evenly as possible

Depending on the scale of the ControlPlane or the number of failure domains it may not be possible to have a Machine in each failure domain, or it may require multiple machines in some failure domains.

It may also be good to add an additional goal:

  • The ControlPlane controller ensures ControlPlane Machines continue to be spread as evenly as possible across failure domains during an upgrade

@vincepri
Member Author

vincepri commented Dec 10, 2019

/lifecycle active

@vincepri vincepri added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Dec 10, 2019
@detiber
Member

detiber commented Dec 13, 2019

@ncdc @vincepri how should we handle getting the details of the proposal documented in repo? It seems like data that we'd want to make sure is highlighted... Maybe it's just a matter of updating the provider implementer docs for now, but I feel like we also need to capture the architectural details somewhere as well.

@ncdc
Contributor

ncdc commented Dec 16, 2019

@detiber I'd probably add content to the cluster infra provider spec doc describing failure domains, and probably talk about them in the control plane provider spec doc too. Would that suffice?

@vincepri
Member Author

Opened #1901
