
sig-cl/kubeadm: add provisional KEP for etcd learner mode #3615

Merged: 1 commit merged into kubernetes:master on Nov 3, 2022

Conversation

neolit123
Member

  • One-line PR description:
    use etcd's learner mode for joining etcd members to the etcd cluster managed by kubeadm
  • Other comments:
    this is not targeting a release yet. the KEP is provisional and needs someone to own the work.

@neolit123 neolit123 added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Oct 12, 2022
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 12, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Oct 12, 2022
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 12, 2022
@neolit123
Member Author

/hold
for review
/cc @fabriziopandini @pacoxu

this is the provisional etcd learner mode KEP PR.
as discussed, we'd like to add the KEP and once we have someone who has time / is willing to work on it we can revisit and check the implementable state.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2022

#### Story 1

As a kubeadm user, I wish that my HA cluster were more resilient to etcd member failures caused by slow infrastructure during the addition of new members at cluster bring-up time.
Member

I have encountered some cases where NTP is not configured on a node joining as a control plane, and the join ends up with errors. (Not quite sure. Can this case be covered by learner mode?)

Member Author

i have not experimented with learner mode.
by ntp do you mean network time protocol?
haven't played with that much either.


+1 to what @pacoxu mentioned. We had issues where two nodes of the cluster detected different NTP offsets (one node was ahead and the other behind the etcd node that was already running), which led to problems with the etcd member join and, later on, with building consensus over raft.

Member Author

looks like if learner mode is added to kubeadm at the alpha stage, we have to experiment with ntp offsets.

Member

IMO testing ntp offsets should be out of scope because AFAIK learner mode is not meant to be something that improves the resilience of the raft protocol to machine mis-configurations (or at least this is not a goal of the original design).

Member Author

agreed, and i will add this as out of scope. proper machine configuration is already a prerequisite.

@pacoxu
Member

pacoxu commented Oct 13, 2022

As there is a feature gate, we should not add an extra phase for it.

The change would happen in kubeadm join phase control-plane-join etcd for adding a new local etcd member.

I would like to work on this if you are busy with other work.

@neolit123
Member Author

As there is a feature gate, we should not add an extra phase for it.

The change would happen in kubeadm join phase control-plane-join etcd for adding a new local etcd member.

yes, it will be like a separate logic path controlled by the FG and not a new phase.

I would like to work on this if you are busy with other work.

sure if you have time. we can try getting this done for 1.27.
experimentation can start sooner of course.




Currently, most of the logic for stacked etcd member support in kubeadm is centralized in a couple of source files. These files contain the wrapped etcd client logic and the logic for maintaining a static Pod manifest for the etcd server instance.

With the introduction of the new feature gate EtcdLearnerMode a new code path must be created. Preferably the number of "if EtcdLearnerMode" branches in the code should be minimized. Kubeadm currently has some sensitive timeouts while adding etcd members the "old way". Waiting for learners to become voting members would require some modifications in kubeadm in terms of how we wait for a member to be added. Some details can be found in the [official etcd documentation](https://etcd.io/docs/v3.3/learning/learner/#features-in-v34).


This might not be the right place to bring up this but leaving it here just in case.

The current kubeadm approach can handle parallel joins pretty well, as you have mentioned before. How would that behavior change if learner mode is put in place? Do we promote each of them standalone?

Member Author

discussing this with @fabriziopandini , we thought that learners will simply stay on hold / wait and get promoted async. what is not clear to me is if etcd only supports one learner maximum. this maximum was stated in the docs before. if that is the case what can happen instead is that we might have to retry the add member call.


There was an option added to configure the max learner in ETCD a while ago. xref: etcd-io/etcd#13377

Member Author

that's good, we can cap it to an odd number > 3. but if the number is small, it feels like we still have to have the addmember retry.

alternatively we can just have it as a very large number.


Would it be possible to deduce the number by parsing the initial-cluster: ${NAMES[0]}=https://${HOSTS[0]}:2380,${NAMES[1]}=https://${HOSTS[1]}:2380,${NAMES[2]}=https://${HOSTS[2]}:2380... configuration, if provided?
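Counting the entries in such an initial-cluster string is mechanically simple. A hypothetical Go sketch (function name and behavior are illustrative, not kubeadm code), with the caveat raised below that kubeadm grows this list one member at a time, so the count only reflects members known so far:

```go
package main

import (
	"fmt"
	"strings"
)

// memberCount parses an etcd --initial-cluster value of the form
// "name1=https://host1:2380,name2=https://host2:2380,..." and returns
// the number of declared members. Note: in stacked-etcd kubeadm the
// list is incremented per new member (A; A,B; A,B,C), so this is NOT
// the final intended cluster size.
func memberCount(initialCluster string) int {
	initialCluster = strings.TrimSpace(initialCluster)
	if initialCluster == "" {
		return 0
	}
	n := 0
	for _, entry := range strings.Split(initialCluster, ",") {
		if strings.Contains(entry, "=") {
			n++
		}
	}
	return n
}

func main() {
	cluster := "infra0=https://10.0.0.1:2380,infra1=https://10.0.0.2:2380,infra2=https://10.0.0.3:2380"
	fmt.Println(memberCount(cluster))
}
```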

Member Author

i don't think so. initial cluster is incremented per each new member, so it goes like A; A,B; A,B,C

and we don't know the final number of CP nodes the user wants. my guess is that ~95+% of HA users have 3 CP nodes. the rest might be at 5. 7 is too much, but who knows.


It doesn't always have to be that way though, does it? Take https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/setup-ha-etcd-with-kubeadm/ for example. The initial-cluster there is already aware of the fact that it will have 3 nodes. But I am not sure if there is any enforcement of this kind of config. So, I get your point 😄 Thanks for the clarification.

Member Author

external etcd is actually out of scope for this proposal, because there the user manages the etcd machines outside of the control plane machines.

Member

Frankly speaking, I do not think we should claim that parallel join of CP nodes is supported until we have proper test coverage for it.

As a consequence, IMO we should optimize UX for sequential join only, eventually expose flags for people willing to explore parallel joining (or let this be done via extra args), but do not invest time in tuning this scenario.
We can also rule it a non-goal for this first iteration.

Member Author

will add notes that learner number will be kept as default (1).
improvements over the tentatively supported parallel join will be added as out-of-scope. basically we can retry in the logic, similarly to how it is now, but there will be no claims for improvements.

@neolit123
Member Author

updated by adding some goals / non-goals as per the discussions thus far.

- Add a new code path in kubeadm that can be used to deploy etcd with learner mode enabled.
- Use a new feature gate EtcdLearnerMode that can be used to toggle the feature until graduation to GA.
- Deprecate and remove the "old way" of adding members.
- Keep the number of learners as per the etcd default value (1). Users can customize the value if they wish.
Member

https://github.com/etcd-io/etcd/blob/29c3b0f307107fd95a6eb43ddff20db952eeb2fa/server/etcdserver/api/membership/cluster_opts.go#L17

As this is the default behavior, when we join two control plane nodes together, one would have to wait for the other due to this limit. Should we continue to retry if the addMemberAsLearner call fails?
Or should we exit if we get the "too many learner members in cluster" error?

Member Author

@neolit123 neolit123 Oct 28, 2022


i think we should set up an exponential backoff timeout of say 40 seconds / 1 minute max. potentially reuse the current timeout. then if members are joined in parallel the add member call can retry. we shouldn't exit on max members error.

@neolit123
Member Author

any further comments on this, or can we get it merged?

@pacoxu
Member

pacoxu commented Nov 3, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 3, 2022
@harshanarayana

/lgtm

Thanks @neolit123 and @pacoxu for the discussion on this.

@neolit123
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 3, 2022
@logicalhan
Member

/assign

@k8s-ci-robot k8s-ci-robot merged commit e96e05d into kubernetes:master Nov 3, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 3, 2022

Kubeadm currently adds all members the "old way" that etcd originally supported, that is, as voting members from the beginning. If added as learners instead, such members would not disrupt the cluster quorum if they end up being faulty. The "old way" has proven problematic in cases where kubeadm attempts to add an etcd cluster member from a control plane node running on slower infrastructure. In such cases users have to manually intervene and remove the faulty member, using tools such as etcdctl.
Member

There is a downside though: if you are maintaining a 3/3 quorum and adding a learner, you are increasing the resource consumption of etcd by 33%.


### Non-Goals

- Support both the "old way" and "learner mode" in kubeadm as a toggle in the kubeadm API. Ideally we should support only a single, stable, community approved code path.
Member

We should probably name these things more descriptively in the KEP for those who do not have the context. I recommend that "old way" be renamed to "etcd-without-learner".
