
sig-cl/kubeadm: add provisional KEP for etcd learner mode #3615

Merged: 1 commit merged into kubernetes:master on Nov 3, 2022

Conversation

neolit123
Member

  • One-line PR description:
    use etcd's learner mode for joining etcd members to the etcd cluster managed by kubeadm
  • Other comments:
    this is not targeting a release yet. the KEP is provisional and needs someone to own the work.

@neolit123 neolit123 added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Oct 12, 2022
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 12, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Oct 12, 2022
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 12, 2022
@neolit123
Member Author

/hold
for review
/cc @fabriziopandini @pacoxu

this is the provisional etcd learner mode KEP PR.
as discussed, we'd like to add the KEP and once we have someone who has time / is willing to work on it we can revisit and check the implementable state.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2022

#### Story 1

As a kubeadm user, I wish that my HA cluster were more resilient to etcd member failures caused by slow infrastructure during the addition of new members at cluster bring-up time.
Member

I have encountered some cases where NTP is not configured on a node joining as a control plane, and the join ends up with errors. (Not quite sure. Can this case be covered by learner mode?)

Member Author

i have not experimented with learner mode.
by ntp do you mean network time protocol?
haven't played with that much either.


+1 to what @pacoxu mentioned. We had issues where two nodes of the cluster detected different NTP offsets (one node was ahead and the other behind the etcd node that was already running), which led to problems with the etcd member join and, later on, with building consensus over raft.

Member Author

looks like if learner mode is added to kubeadm at the alpha stage, we have to experiment with ntp offsets.

Member

IMO testing ntp offsets should be out of scope because AFAIK learner mode is not meant to be something that improves the resilience of the raft protocol to machine mis-configurations (or at least this is not a goal of the original design).

Member Author

agreed, and i will add this as out of scope. proper machine configuration is already a prerequisite.

@pacoxu
Member

pacoxu commented Oct 13, 2022

As there is a feature gate, we should not add an extra phase for it.

The change would happen in kubeadm join phase control-plane-join etcd for adding a new local etcd member.

I would like to work on this if you are busy with other work.

@neolit123
Member Author

As there is a feature gate, we should not add an extra phase for it.

The change would happen in kubeadm join phase control-plane-join etcd for adding a new local etcd member.

yes, it will be like a separate logic path controlled by the FG and not a new phase.

I would like to work on this if you are busy with other work.

sure if you have time. we can try getting this done for 1.27.
experimentation can start sooner of course.




Currently, most of the logic for stacked etcd member support in kubeadm is centralized in a couple of source files. These files contain the wrapped etcd client logic and the logic for maintaining a static Pod manifest for the etcd server instance.

With the introduction of the new feature gate EtcdLearnerMode a new code path must be created. Preferably the number of "if EtcdLearnerMode" branches in the code should be minimized. Kubeadm currently has some sensitive timeouts while adding etcd members the "old way". Waiting for learners to become voting members would require some modifications in kubeadm in terms of how we wait for a member to be added. Some details can be found in the [official etcd documentation](https://etcd.io/docs/v3.3/learning/learner/#features-in-v34).


This might not be the right place to bring up this but leaving it here just in case.

The current kubeadm approach can handle parallel joins pretty well, as you have mentioned before. How would that behavior change if learner mode is put in place? Do we promote each of them standalone?

Member Author

discussing this with @fabriziopandini , we thought that learners will simply stay on hold / wait and get promoted async. what is not clear to me is if etcd only supports one learner maximum. this maximum was stated in the docs before. if that is the case what can happen instead is that we might have to retry the add member call.


There was an option added to configure the max learner in ETCD a while ago. xref: etcd-io/etcd#13377

Member Author

that's good, we can cap it to an odd number > 3. but if the number is small, it feels like we still have to have the addmember retry.

alternatively we can just have it as a very large number.


Would it be possible to deduce the number by parsing the initial-cluster: ${NAMES[0]}=https://${HOSTS[0]}:2380,${NAMES[1]}=https://${HOSTS[1]}:2380,${NAMES[2]}=https://${HOSTS[2]}:2380... configuration, if provided?
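Counting the entries in such an initial-cluster string is mechanically simple. A hypothetical Go sketch (function name and behavior are illustrative, not kubeadm code), with the caveat raised below that kubeadm grows this list one member at a time, so the count only reflects members known so far:

```go
package main

import (
	"fmt"
	"strings"
)

// memberCount parses an etcd --initial-cluster value of the form
// "name1=https://host1:2380,name2=https://host2:2380,..." and returns
// the number of declared members. Note: in stacked-etcd kubeadm the
// list is incremented per new member (A; A,B; A,B,C), so this is NOT
// the final intended cluster size.
func memberCount(initialCluster string) int {
	initialCluster = strings.TrimSpace(initialCluster)
	if initialCluster == "" {
		return 0
	}
	n := 0
	for _, entry := range strings.Split(initialCluster, ",") {
		if strings.Contains(entry, "=") {
			n++
		}
	}
	return n
}

func main() {
	cluster := "infra0=https://10.0.0.1:2380,infra1=https://10.0.0.2:2380,infra2=https://10.0.0.3:2380"
	fmt.Println(memberCount(cluster))
}
```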

Member Author

i don't think so. initial cluster is incremented per each new member, so it goes like A; A,B; A,B,C

and we don't know the final number of CP nodes the user wants. my guess is that ~95+% of HA users have 3 CP nodes. the rest might be at 5. 7 is too much, but who knows.


It doesn't always have to be that way though, does it? Take https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/setup-ha-etcd-with-kubeadm/ for example. The initial-cluster there is already aware of the fact that it will have 3 nodes. But I am not sure if there is any enforcement of this kind of config. So, I get your point 😄 Thanks for the clarification.

Member Author

external etcd is actually out of scope for this proposal, because there the user manages the etcd machines outside of the control plane machines.

Member

Frankly speaking, I do not think we should claim that parallel join of CP nodes is supported until we have proper test coverage for it.

As a consequence, IMO we should optimize UX for sequential join only, eventually expose flags for people willing to explore parallel joining (or let this be done via extra args), but do not invest time in tuning this scenario.
We can also rule it a non-goal for this first iteration.

Member Author

will add notes that learner number will be kept as default (1).
improvements over the tentatively supported parallel join will be added as out-of-scope. basically we can retry in the logic, similarly to how it is now, but there will be no claims for improvements.

@neolit123
Member Author

updated by adding some goals / non-goals as per the discussions thus far.

- Add a new code path in kubeadm that can be used to deploy etcd with learner mode enabled.
- Use a new feature gate EtcdLearnerMode that can be used to toggle the feature until graduation to GA.
- Deprecate and remove the "old way" of adding members.
- Keep the number of learners as per the etcd default value (1). Users can customize the value if they wish.
Member

https://github.com/etcd-io/etcd/blob/29c3b0f307107fd95a6eb43ddff20db952eeb2fa/server/etcdserver/api/membership/cluster_opts.go#L17

As this is the default behavior, when we join two control plane nodes together, one would have to wait for the other due to this limit. Should we continue to retry if the addMemberAsLearner call fails?
Or should we exit if we get the "too many learner members in cluster" error?

Member Author

@neolit123 neolit123 Oct 28, 2022


i think we should set up an exponential backoff timeout of say 40 seconds / 1 minute max. potentially reuse the current timeout. then if members are joined in parallel the add member call can retry. we shouldn't exit on max members error.

@neolit123
Member Author

any further comments on this, or can we get it merged?

@pacoxu
Member

pacoxu commented Nov 3, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 3, 2022
@harshanarayana

/lgtm

Thanks @neolit123 and @pacoxu for the discussion on this.

@neolit123
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 3, 2022
@logicalhan
Member

/assign

@k8s-ci-robot k8s-ci-robot merged commit e96e05d into kubernetes:master Nov 3, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 3, 2022

Kubeadm currently adds all members the "old way" that etcd originally supported, that is, as voting members from the beginning. If added as learners instead, such members would not disrupt the cluster quorum if they end up being faulty. The "old way" has proven problematic in cases where kubeadm attempts to add an etcd cluster member from a control plane node running on slower infrastructure. In such cases users have to manually intervene and remove the faulty member, using tools such as etcdctl.
Member

There is a downside though: if you are maintaining a 3/3 quorum and adding a learner, you are increasing the resource consumption of etcd by 33%.


### Non-Goals

- Support both the "old way" and "learner mode" in kubeadm as a toggle in the kubeadm API. Ideally we should support only a single, stable, community approved code path.
Member

We should probably name these things more descriptively in the KEP for those who do not have the context. I recommend that "old way" be renamed to "etcd-without-learner".
