[WIP] enable concurrent join of both worker and control-plane nodes #685

Closed
neolit123 wants to merge 2 commits

Member

@neolit123 neolit123 commented Jul 4, 2019

(first commit fixes some typos)

concurrent join of CP nodes should now be supported in k/k master for kubeadm.
no test signal in kinder yet though.

this change works for me, but oddly, i don't see much improvement in performance when running time kind create cluster... with a 3CPx3W cluster (3 control-plane nodes, 3 workers). it saves roughly 10 seconds out of the 2 minutes 40 seconds it takes without the change, but the numbers vary a lot for me.

i'm noticing that once kind create cluster finishes, most pods (e.g. 24 of 26) are already running for this 6 node setup, which should not be the case. most of the pods should still be creating once the kind binary exits, unless this is desired?

do we have sync points or "wait for node/pods ready" logic in kind?

cc @BenTheElder @aojea @fabriziopandini

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels Jul 4, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: neolit123
To complete the pull request process, please assign munnerz
You can assign the PR to them by writing /assign @munnerz in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@neolit123
Member Author

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature (Categorizes issue or PR as related to a new feature.) and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) labels Jul 4, 2019
} else if len(secondaryControlPlanes) > 0 && len(workers) == 0 {
startMsg += "control-plane nodes 🎮"
} else {
startMsg += "worker nodes 🚜"
Member Author

let me know if the messages / icons need tweaks.

@aojea
Contributor

aojea commented Jul 4, 2019

I'm curious what the job results are going to be on versions <1.14

> do we have sync points or "wait for node/pods ready" logic in kind?

I think that we have the wait flag:

cmd/kind/create/cluster/createcluster.go:       cmd.Flags().DurationVar(&flags.Wait, "wait", time.Duration(0), "Wait for control plane node to be ready

In CI environments I run kubectl wait -n kube-system --timeout=360s --for condition=Ready -l k8s-app=kube-dns pods after the cluster is deployed.

@neolit123
Member Author

neolit123 commented Jul 4, 2019

> I'm curious what the job results are going to be on versions <1.14

we technically have to enable this feature in kind only for kubeadm versions >= 1.15.

> I think that we have the wait flag:

yep, but unless i'm missing something, if set to 0 it does not wait.
if set to > 0 it errors out if the cluster actions don't finish in that time.

@neolit123
Member Author

@aojea

pull-kind-conformance-parallel-1-14 — Job succeeded.

in my tests if etcd member addition magically aligns, things work fine without the latest changes in k/k.
xref:

@neolit123
Member Author

but i find it quite odd that all 12, 13 and 14 passed.
hm, is the join really concurrent even?

@aojea
Contributor

aojea commented Jul 4, 2019

hehehe, it's too late for me, I just realized that the jobs in the CI only have one control plane 🙃 😫

EDIT

> but i find it quite odd that all 12, 13 and 14 passed.
> hm, is the join really concurrent even?

@neolit123 we are not testing secondary control planes in the CI, should we?

@BenTheElder
Member

[will come back to this, keeping low priority due to WIP..., @ me if you need something sooner.]

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 27, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 2, 2019
@neolit123
Member Author

neolit123 commented Sep 2, 2019

rebased the PR.

with the concurrent join it now takes <2 minutes for a 3x3 cluster. however, it flaked about 1 time out of 5, and i couldn't capture the debug output or find the real cause.

probably it needs better timeouts/retries for etcd.
will test this more.

defer ctx.Status.End(false)

// TODO(bentheelder): it's too bad we can't do this concurrently
// (this is not safe currently)
Member

this is only safe on recent versions, right?

Member Author

at least this change is needed, which was added in 1.15.1
kubernetes/kubernetes@bc74ac3#diff-5e4c5bba67c635568de70e0c424882c6

Member

we should probably ship this as being concurrent only for >= 1.15.1 then?

Member Author

yes, i guess there has to be a version check and the old way of serial join has to be supported too.

@neolit123
Member Author

neolit123 commented Sep 4, 2019

> with the concurrent join it now takes <2 minutes for a 3x3 cluster. however, it flaked about 1 time out of 5, and i couldn't capture the debug output or find the real cause.

i only got it to flake 1/10 times. still no logs.

@k8s-ci-robot
Contributor

@neolit123: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kind-unit | 3d7118e | link | /test pull-kind-unit |
| pull-kind-conformance-parallel-1-16 | 3d7118e | link | /test pull-kind-conformance-parallel-1-16 |
| pull-kind-e2e-kubernetes | 3d7118e | link | /test pull-kind-e2e-kubernetes |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2019
@k8s-ci-robot
Contributor

@neolit123: PR needs rebase.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 31, 2019
@neolit123
Member Author

neolit123 commented Jan 2, 2020

let's get back to this once the required changes in etcd land.
also the PR needs branching per kubeadm version.

/close

@k8s-ci-robot
Contributor

@neolit123: Closed this PR.

Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.)
kind/feature (Categorizes issue or PR as related to a new feature.)
lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
5 participants