✨ KubeadmControlPlane scale up serially #2193

dlipovetsky
Contributor

@dlipovetsky dlipovetsky commented Jan 29, 2020

What this PR does / why we need it:
The control plane needs to be scaled one replica at a time. Also, before a replica is added, all etcd endpoints should be healthy, and all existing members should be started.

This PR is a collaborative effort. I started the etcd health check work in the scale up function itself. @chuckha factored that out. More importantly, he added abstractions for the management and target cluster, and wired up the etcd client (merged in #2237).

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2016, #2243

/hold
/assign @randomvariable @detiber

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold, cncf-cla: yes, and size/L labels Jan 29, 2020
@ncdc ncdc added this to the v0.3.0 milestone Jan 29, 2020
Contributor

@chuckha chuckha left a comment

sweet, this looks good, i wanted to get rid of that for loop for a while 😄

@detiber
Member

detiber commented Jan 30, 2020

> Would you agree with this:
>
> The controller should create a new replica only if:
> (a) for every existing replica (i.e. ownedMachine), there is a healthy etcd member (which implies that the member is listed and started), and
> (b) the number of etcd members (including unstarted) is equal to the number of ownedMachines

yes
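
As an illustration of the two conditions agreed on here, a minimal Go sketch follows. The `EtcdMember` type, its fields, and the `CanScaleUp` function are hypothetical names made up for this example, not the types used in the PR, and the sketch assumes an etcd member can be matched to its machine by name.

```go
package main

import "fmt"

// EtcdMember is a hypothetical, simplified view of an etcd member.
type EtcdMember struct {
	Name    string // assumed to equal the name of the machine hosting the member
	Started bool   // false for a member that was added but has not joined yet
	Healthy bool   // result of a health check against the member's endpoint
}

// CanScaleUp returns nil when it looks safe to create one more replica,
// and an error describing the first violated condition otherwise.
func CanScaleUp(ownedMachines []string, members []EtcdMember) error {
	// Condition (b): the member count, including unstarted members,
	// must equal the number of owned machines.
	if len(members) != len(ownedMachines) {
		return fmt.Errorf("etcd has %d members, control plane owns %d machines",
			len(members), len(ownedMachines))
	}

	byName := make(map[string]EtcdMember, len(members))
	for _, m := range members {
		byName[m.Name] = m
	}

	// Condition (a): every owned machine needs a started, healthy member.
	for _, machine := range ownedMachines {
		m, ok := byName[machine]
		if !ok {
			return fmt.Errorf("machine %q has no etcd member", machine)
		}
		if !m.Started || !m.Healthy {
			return fmt.Errorf("etcd member for machine %q is not healthy", machine)
		}
	}
	return nil
}
```

A reconcile loop built on this idea would create at most one machine per pass and only when `CanScaleUp` returns nil, which is what makes the scale-up serial.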

@k8s-ci-robot k8s-ci-robot added the needs-rebase and size/XL labels and removed the size/L label Feb 1, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dlipovetsky
To complete the pull request process, please assign chuckha
You can assign the PR to them by writing /assign @chuckha in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dlipovetsky
Contributor Author

(The delete-related unit tests are failing, because they relied on the earlier, permissive, scale-up functionality to create a control plane for testing delete. Will fix)

@dlipovetsky dlipovetsky changed the title from "✨ KubeadmControlPlane scale up serially" to "[WIP / Do not review] ✨ KubeadmControlPlane scale up serially" Feb 3, 2020
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress label Feb 3, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Feb 5, 2020
@k8s-ci-robot k8s-ci-robot added the size/XXL label and removed the size/XL label Feb 5, 2020
@dlipovetsky dlipovetsky changed the title from "[WIP / Do not review] ✨ KubeadmControlPlane scale up serially" to "✨ KubeadmControlPlane scale up serially" Feb 5, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label Feb 5, 2020
@dlipovetsky
Contributor Author

@chuckha All unit tests pass, but only because they inject a fake etcd client getter into KubeadmControlPlaneReconcile. Because the real "get an etcd client for the etcd Pod of this Machine" getter isn't implemented yet, this will break in actual use.

How do you want to proceed? Do you want me to pull out the etcd healthcheck code into its own PR and merge that, then come back to this once you've merged your "etcd client getter"?
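
To make the gap described in this comment concrete, here is a rough Go sketch of injecting an etcd client getter. Every name in it (`EtcdClient`, `EtcdClientGetter`, `Reconciler`, `FakeGetter`, `UnimplementedGetter`, `ErrNotImplemented`) is a hypothetical stand-in chosen for the example. It does not reproduce the PR's actual types, and it leaves out the context argument a real client would take.

```go
package main

import "errors"

// ErrNotImplemented is a hypothetical sentinel error for this sketch.
var ErrNotImplemented = errors.New("etcd client getter not implemented")

// EtcdClient is a hypothetical, minimal interface; it is not the PR's client.
type EtcdClient interface {
	MemberCount() (int, error)
}

// EtcdClientGetter returns a client for the etcd Pod of the named machine.
type EtcdClientGetter func(machineName string) (EtcdClient, error)

// Reconciler holds the injected getter so that tests can replace it.
type Reconciler struct {
	GetEtcdClient EtcdClientGetter
}

// MemberCountFor asks the etcd instance on the given machine for its member count.
func (r Reconciler) MemberCountFor(machineName string) (int, error) {
	client, err := r.GetEtcdClient(machineName)
	if err != nil {
		return 0, err
	}
	return client.MemberCount()
}

// fakeEtcdClient always reports a fixed member count.
type fakeEtcdClient struct{ members int }

func (f fakeEtcdClient) MemberCount() (int, error) { return f.members, nil }

// FakeGetter is the kind of getter a unit test would inject.
func FakeGetter(members int) EtcdClientGetter {
	return func(string) (EtcdClient, error) {
		return fakeEtcdClient{members: members}, nil
	}
}

// UnimplementedGetter stands in for the missing production getter.
func UnimplementedGetter(string) (EtcdClient, error) {
	return nil, ErrNotImplemented
}
```

With `FakeGetter` injected, `MemberCountFor` succeeds, so the tests pass. With `UnimplementedGetter`, every call fails, which mirrors why the change would break outside of tests until a real getter exists.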

@chuckha
Contributor

chuckha commented Feb 6, 2020

I'll PR the etcd client getter then we can revisit this

@chuckha
Contributor

chuckha commented Feb 6, 2020

@dlipovetsky actually i'm going to PR against your branch so I can work it in seamlessly

@k8s-ci-robot k8s-ci-robot added the needs-rebase label Feb 7, 2020
limitations under the License.
*/

// Modified copy of k8s.io/apimachinery/pkg/util/sets/int64.go
Contributor

What got modified? Curious why we can't use this out of the box

Member

hmm, is it mainly modifying for uint64? If so, can we leverage the set-gen code generator instead of a hand-edited copy here?

Contributor Author

@dlipovetsky dlipovetsky Feb 7, 2020

Yes, upstream is Int64.

> If so, can we leverage the set-gen code generator instead of a hand-edited copy here?

Funny you should ask. I wanted to do this for a different type months ago; it didn't work, and I filed an issue back then: kubernetes/code-generator#74. Not sure if that problem was fixed since, though.
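
For readers unfamiliar with the upstream `sets` package, this is roughly what a hand-written `uint64` variant looks like. It is a trimmed sketch that follows the map-with-empty-values shape of `sets.Int64`. The type name `UInt64` and the subset of methods shown are choices made for this example, not a copy of the file under review.

```go
package main

import "sort"

// Empty is the zero-size value stored for every key in the set.
type Empty struct{}

// UInt64 is a set of uint64 values, implemented as a map with empty values.
type UInt64 map[uint64]Empty

// NewUInt64 creates a set containing the given items.
func NewUInt64(items ...uint64) UInt64 {
	s := UInt64{}
	s.Insert(items...)
	return s
}

// Insert adds items to the set and returns the set for chaining.
func (s UInt64) Insert(items ...uint64) UInt64 {
	for _, item := range items {
		s[item] = Empty{}
	}
	return s
}

// Has reports whether item is in the set.
func (s UInt64) Has(item uint64) bool {
	_, ok := s[item]
	return ok
}

// Len returns the number of items in the set.
func (s UInt64) Len() int { return len(s) }

// Difference returns the items of s that are not in other.
func (s UInt64) Difference(other UInt64) UInt64 {
	result := NewUInt64()
	for item := range s {
		if !other.Has(item) {
			result.Insert(item)
		}
	}
	return result
}

// List returns the items as a slice sorted in ascending order.
func (s UInt64) List() []uint64 {
	out := make([]uint64, 0, len(s))
	for item := range s {
		out = append(out, item)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}
```

A set like this is convenient for comparing collections of etcd member IDs, which etcd represents as `uint64`. For example, `Difference` can be used to find members present in one listing but missing from another.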

dlipovetsky and others added 2 commits February 7, 2020 11:02
Signed-off-by: Daniel Lipovetsky <dlipovetsky@d2iq.com>
Signed-off-by: Chuck Ha <chuckh@vmware.com>
Member

@detiber detiber left a comment

Awesome work @dlipovetsky and @chuckha, a few comments, otherwise this is looking great.

@dlipovetsky dlipovetsky force-pushed the control-plane-scale-serially branch 2 times, most recently from 1f4e55d to 5fc8b1c Compare February 7, 2020 19:27
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Feb 7, 2020
- Pass context to cluster client
- Fix formatting
- Fix linter errors
- Also rename FilterMachines to FilterMachine
- Remove unused code
- Remove whitespace

Signed-off-by: Daniel Lipovetsky <dlipovetsky@d2iq.com>
@dlipovetsky
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold label Feb 7, 2020
Signed-off-by: Daniel Lipovetsky <dlipovetsky@d2iq.com>
@k8s-ci-robot
Contributor

@dlipovetsky: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-cluster-api-test | a1fdb3e | link | /test pull-cluster-api-test |
| pull-cluster-api-capd-e2e | a1fdb3e | link | /test pull-cluster-api-capd-e2e |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dlipovetsky
Contributor Author

This PR organically grew too large. We split this into multiple PRs, and reviewers were nice enough to copy over the unresolved comments. Closing this PR.

Labels
cncf-cla: yes, size/XXL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Control plane should scale up serially, not in parallel
6 participants