[kubeadm control plane] Scale up / Scale down #2241

Closed
chuckha opened this issue Jan 31, 2020 · 4 comments · Fixed by #2335
Labels
area/control-plane: Issues or PRs related to control-plane lifecycle management
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/active: Indicates that an issue or PR is actively being worked on by a contributor.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

@chuckha
Contributor

chuckha commented Jan 31, 2020

User Story
KubeadmControlPlane should be able to scale a control plane up and down.

Detailed Description
Scaling up or down means changing the number of replicas on a KubeadmControlPlane object, which causes the controller to add or remove members. On scale down, the controller removes the oldest members first. This was selected as the best option because it fits well with upgrades, which can use the age of nodes to schedule updates.
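
As a rough illustration of the oldest-first rule, here is a minimal Python sketch. The `Machine` type, `scale_action`, and `oldest_machine` are hypothetical names made up for this example and are not the controller's API; breaking ties by name is also an assumption, added only to make the choice deterministic.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Machine:
    """Hypothetical stand-in for one control plane machine."""
    name: str
    created: datetime


def scale_action(current: int, desired: int) -> str:
    """Compare the observed machine count with the desired replicas."""
    if desired > current:
        return "scale-up"
    if desired < current:
        return "scale-down"
    return "none"


def oldest_machine(machines: list[Machine]) -> Machine:
    """Pick the scale-down target: the machine that was created first.

    Ties are broken by name (an assumption) so the choice is deterministic.
    """
    if not machines:
        raise ValueError("no machines to choose from")
    return min(machines, key=lambda m: (m.created, m.name))


machines = [
    Machine("cp-b", datetime(2020, 1, 20, tzinfo=timezone.utc)),
    Machine("cp-a", datetime(2020, 1, 10, tzinfo=timezone.utc)),
    Machine("cp-c", datetime(2020, 1, 30, tzinfo=timezone.utc)),
]

# Replicas lowered from 3 to 2: the oldest machine is the one to remove.
if scale_action(current=len(machines), desired=2) == "scale-down":
    print(oldest_machine(machines).name)  # prints "cp-a"
```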

Scale up and scale down will require all control plane nodes and etcd members to be healthy. However, this is best effort and we can make no guarantees: the nodes could crash right after the health check completes, and we would still treat them as healthy.

etcd health is defined as:

  1. All members are online
  2. All members report the same member list (ensures we didn't accidentally create a kubernetes cluster with two etcd clusters which has happened in the past)
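
The two etcd conditions above can be expressed as a small predicate. This is only a sketch: `etcd_healthy` and the shape of its input (a mapping from each member to the member list it reported, or `None` if it could not be reached) are assumptions made for illustration.

```python
from typing import Optional


def etcd_healthy(reported: dict[str, Optional[frozenset[str]]]) -> bool:
    """Return True only if every member answered and all answers match.

    `reported` maps a member name to the member list that member returned,
    or to None when the member could not be reached (hypothetical shape).
    """
    if not reported:
        return False
    views = list(reported.values())
    # Condition 1: all members are online.
    if any(view is None for view in views):
        return False
    # Condition 2: all members report the same member list.
    return all(view == views[0] for view in views)


everyone = frozenset({"etcd-a", "etcd-b", "etcd-c"})
print(etcd_healthy({"etcd-a": everyone, "etcd-b": everyone, "etcd-c": everyone}))  # True

# One member offline.
print(etcd_healthy({"etcd-a": everyone, "etcd-b": None, "etcd-c": everyone}))  # False

# Two separate etcd clusters: the member lists disagree.
print(etcd_healthy({
    "etcd-a": frozenset({"etcd-a", "etcd-b"}),
    "etcd-b": frozenset({"etcd-a", "etcd-b"}),
    "etcd-c": frozenset({"etcd-c"}),
}))  # False
```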

Control plane health is defined as:

  1. All apiserver pods are healthy in terms of pod healthz/healthcheck
  2. All controller-manager pods are healthy in terms of pod healthz/healthcheck
  3. Kubelet is alive
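
One possible way to combine the three control plane checks is sketched below. `NodeStatus` and its boolean fields are hypothetical; how each flag would actually be obtained (pod health endpoints, node conditions) is not specified in this issue and is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    """Hypothetical per-node summary of the three checks listed above."""
    name: str
    apiserver_ok: bool
    controller_manager_ok: bool
    kubelet_alive: bool


def unhealthy_reasons(nodes: list[NodeStatus]) -> list[str]:
    """Collect one human-readable reason per failed check."""
    reasons = []
    for node in nodes:
        if not node.apiserver_ok:
            reasons.append(f"{node.name}: apiserver pod unhealthy")
        if not node.controller_manager_ok:
            reasons.append(f"{node.name}: controller-manager pod unhealthy")
        if not node.kubelet_alive:
            reasons.append(f"{node.name}: kubelet not alive")
    return reasons


def control_plane_healthy(nodes: list[NodeStatus]) -> bool:
    """Healthy means at least one node and no failed check on any node."""
    return bool(nodes) and not unhealthy_reasons(nodes)


nodes = [
    NodeStatus("cp-a", True, True, True),
    NodeStatus("cp-b", True, False, True),
]
print(control_plane_healthy(nodes))  # False
print(unhealthy_reasons(nodes))      # ['cp-b: controller-manager pod unhealthy']
```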

This can be achieved by:

  1. Getting a list of etcd members from any random apiserver.
     - This allows us to work with a load balancer in front of the apiservers
     - The etcd client proxies through kubelet which requires kubelet to be alive
  2. For each member, get a list of members.
     - This is a read on every member which is itself a health check of each member
  3. Compare all members.
     - This gets us around the problem of having accidentally created two etcd clusters
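
The procedure above could be sketched as follows. This is not the controller's implementation: the `list_members` callable is a hypothetical stand-in for whatever etcd client is used, and the assumption that an unreachable endpoint raises `ConnectionError` is made only for this example.

```python
from typing import Callable, Iterable

# Hypothetical client hook: given an endpoint, return the member names it reports.
MemberLister = Callable[[str], Iterable[str]]


def check_etcd_members(seed: str, list_members: MemberLister) -> tuple[bool, str]:
    """Run the list, read-from-each, compare procedure and return (ok, reason)."""
    # Get the member list through any one entry point (e.g. a load balancer).
    try:
        expected = frozenset(list_members(seed))
    except ConnectionError as err:
        return False, f"cannot list members via {seed}: {err}"
    if not expected:
        return False, "member list is empty"
    for member in sorted(expected):
        # Reading from each member doubles as a health check of that member.
        try:
            seen = frozenset(list_members(member))
        except ConnectionError as err:
            return False, f"member {member} unreachable: {err}"
        # Comparing the lists detects two etcd clusters created by accident.
        if seen != expected:
            return False, f"member {member} reports a different member list"
    return True, "all members agree"


def make_lister(views: dict[str, list[str]]) -> MemberLister:
    """Build a fake lister from canned answers, for demonstration only."""
    def lister(endpoint: str) -> list[str]:
        if endpoint not in views:
            raise ConnectionError(f"{endpoint} did not answer")
        return views[endpoint]
    return lister


agree = {
    "lb": ["etcd-a", "etcd-b"],
    "etcd-a": ["etcd-a", "etcd-b"],
    "etcd-b": ["etcd-b", "etcd-a"],
}
print(check_etcd_members("lb", make_lister(agree)))
```

Passing the lister in as a parameter keeps the comparison logic testable without a running cluster.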

This will be sufficient for the initial draft. After the initial draft, we will add the additional control plane checks.

For the future, we can consider a smart scale-down operation that targets an unhealthy node if one exists. For the initial implementation, however, we are doing the simplest thing: make sure everything is healthy, then pick the oldest node.
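
To make the "smart" variant concrete, here is one way it might look. The issue only floats the idea, so everything below is an assumption: the `Machine` type, the `healthy` flag, and the choice to take the oldest machine among the unhealthy ones are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Machine:
    """Hypothetical control plane machine with a precomputed health flag."""
    name: str
    created: datetime
    healthy: bool


def smart_scale_down_target(machines: list[Machine]) -> Machine:
    """Prefer an unhealthy machine; otherwise fall back to the oldest one."""
    if not machines:
        raise ValueError("no machines to choose from")
    unhealthy = [m for m in machines if not m.healthy]
    # If nothing is unhealthy, every machine is a candidate (oldest-first rule).
    candidates = unhealthy or machines
    return min(candidates, key=lambda m: (m.created, m.name))


machines = [
    Machine("cp-a", datetime(2020, 1, 10, tzinfo=timezone.utc), healthy=True),
    Machine("cp-b", datetime(2020, 1, 20, tzinfo=timezone.utc), healthy=False),
    Machine("cp-c", datetime(2020, 1, 30, tzinfo=timezone.utc), healthy=True),
]
print(smart_scale_down_target(machines).name)  # prints "cp-b"
```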

/kind feature
/area control-plane
/assign @dlipovetsky
/milestone v0.3.0
/lifecycle active
/priority important-soon

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jan 31, 2020
@k8s-ci-robot k8s-ci-robot added this to the v0.3.0 milestone Jan 31, 2020
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/control-plane Issues or PRs related to control-plane lifecycle management priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jan 31, 2020
@vincepri
Member

@dlipovetsky Do you think you'll be able to open a PR (with at least scale up) for v0.3.0-rc.0 (due Friday)?

@dlipovetsky
Contributor

@vincepri Yes, PR coming in a couple of hours

@dlipovetsky
Contributor

Working through some capd-e2e issues 😅

@vincepri
Member

@dlipovetsky FYI, there is a flake, so I'd ignore capd for now. If integration is passing, we should be good.


4 participants