[Feature] Liveness Probe on multi-node etcd #280

ishan16696 · 2021-12-22T12:37:36Z

Feature (What you would like to be added):
Configure the correct livenessProbe in multi-node etcd.

Motivation (Why is this needed?):
Right now I found 2 ways to configure the livenessProbe:

/health endpoint
/health endpoint returns false if one of the following conditions is met (source):
- there is no etcd leader or leader-election is currently going on.
- the latency of a QGET request exceeds 1sec

Disadvantage of Method 1 (/health endpoint).

If there is no Quorum, etcd-follower will get restart by kubelet but etcd's restart isn't a right behavior.
If etcd cluster is overloaded which is bad, but restart will generate even more load

ETCDCTL_API=3 etcdctl endpoint health --endpoints=${ENDPOINTS} --command-timeout=Xs
etcdctl endpoint health command performs a GET on the "health" key(source)
- fails if there is no etcd leader or leader-election is going on as I think GET request will fail if there is no etcd leader.

Disadvantage of Method 2 (etcdctl endpoint health).

If there is no Quorum, etcd-follower will get restart by kubelet but etcd's restart isn't a right behavior.

This issue of liveness is already reported on etcd here and it has been fixed by this PR but this feature will come with etcd v3.6 and we don't know the timeline when it will be released.

What options we have:

Go with the method 2(etcdctl endpoint health) and whenever etcd release v3.6, switch to that version and configure the correct liveness probe.
Other way is fork the etcd v3.4.13 and make the necessary changes and use the custom image.

The text was updated successfully, but these errors were encountered:

ishan16696 · 2021-12-22T12:43:29Z

cc @abdasgupta @aaronfern @timuthy

timuthy · 2022-05-30T07:47:11Z

Reading this I wonder about the option keeping the current liveness probe but setting --consistency="s"? This mimics the implementation of the linked PR etcd-io/etcd#13399. WDYT @ishan16696?

ishan16696 · 2022-06-01T05:39:10Z

Reading this I wonder about the option keeping the current liveness probe but setting --consistency="s"

yes, this make sense to me .... I don’t see any major issue with this.
we can configure the livenessProbe like this making the consistency serializable instead of linearizable

- /bin/sh
      - -ec
      - ETCDCTL_API=3
      - etcdctl
      - --cert= ...
      - --key= ....
      - --cacert= ....
      - --endpoints=https://etcd-main-local:2379
      - get
      - foo
      - --consistency="s"

ishan16696 · 2022-06-01T13:50:49Z

Hi @timuthy ,
Unfortunately this way of using --consistency="s" is not working for following scenario.

Start an etcd cluster with 3 members.
Add some dummy data.
Stop all the members
Start only one member
then try etcdctl get foo --endpoints=http://127.0.0.1:2379 --consistency="s".
Unfortunately --consistency="s" won't help us in this case until quorum will get satisfy.

timuthy · 2022-06-01T15:08:48Z

As discussed, the assumption wasn't that all members are down but rather that if e.g. 1/3 are available, this last running pod is not killed by the liveness probe as well because it doesn't make sense.

ishan16696 · 2022-06-01T16:09:18Z

this last running pod is not killed by the liveness probe as well because it doesn't make sense.

I'm not saying last pod is killed due to liveness probe ... I'm saying last pod can be killed/crash due to some reason, why you are assuming that last pod can’t be killed.

ishan16696 · 2022-06-01T16:14:38Z

The point I'm making is, suppose all pods/member of etcd-cluster crashes/goes down due to some reasons, they will not be able to come up due to this livenessProbe: etcdctl get foo --consistency="s" will get fail even for 1/3 member until quorum will achieve.

timuthy · 2022-06-02T06:22:15Z

they will not be able to come up due to this livenessProbe

I assume that they will come up fast enough, so that the livenessProbes don't fail beyond the expected threshold (initialDelaySeconds + periodSeconds * failureThreshold) and thus the containers aren't restarted before they're up and running.

ishan16696 · 2022-06-02T07:21:00Z

(initialDelaySeconds + periodSeconds * failureThreshold)

yeah exactly but then we have to depend on this configuration and set them accordingly.

unmarshall · 2022-06-08T06:40:08Z

IMO using --consistency=s is non-deterministic as the time it takes to start the etcd process cannot be guaranteed. The correct thing to do would be to create a /livez or /healthz endpoint which is served out of backup-restore container. If this endpoint has come up and responds with a 200 OK then etcd pod should be considered as LIVE which then prevents it from getting restarted. For readiness probe you can probably use etcdctl with --consistency=s which ensures that the traffic can be redirected to this etcd pod. For redirecting traffic to a single etcd instance it is not important to check if the cluster is healthy (i.e quorum is not lost). If the quorum is lost then the request will anyways get rejected if it has to by etcd member.

timuthy · 2022-06-08T07:05:51Z

IMO using --consistency=s is non-deterministic as the time it takes to start the etcd process cannot be guaranteed. The correct thing to do would be to create a /livez or /healthz endpoint which is served out of backup-restore container.

Can you sketch how an implementation of /livez would look like to make it deterministic?

ishan16696 · 2022-06-20T08:57:18Z

Reopening this issue as livenessProbe can be improved to resolve this kind of scenario #280 (comment)

timuthy · 2022-06-20T09:00:38Z

I'd be interested in knowing what is planned here. Do you already have something in mind?

ishan16696 · 2022-06-20T11:09:05Z

I reopened it for future reference as whenever we will upgrade our Etcd version(may be to 3.6) then we can also update the livenessProbe for etcd.

timuthy · 2022-06-20T14:01:47Z

TBH, I don't get the motivation of having an open issue if there is no issue today (as claimed by #280 (comment)). If you want to keep this as a future reference then you can still take the information from this issue, even if it's closed.
At the moment, this issue represents an open item in #107 and gives the impression that work needs to be done on our end.

Either we see a problem with the readiness/liveness probes as they are configured today and discuss/implement a solution or we open a ☂️ issue for upgrading to etcd 3.6 and list all the requirements/changes once we plan to do so.

ishan16696 · 2022-06-22T08:25:18Z

ok, then feel free to close.

ishan16696 · 2022-06-23T03:39:03Z

Found out the upstream issue which is also described in this comment #280 (comment)
Mentioning this for future reference whenever we want to move to etcd v3.6

ishan16696 · 2022-09-22T05:20:42Z

We had introduced the livenessProbe in multi-node etcd with this PR: #396
But we found out that livenessProbe was causing more harm than good. Then, we decided to remove the livenessProbe from etcd cluster with this PR: #423
So, I'm opening this issue just to track the liveness Probe for etcd cluster and whenever we will upgrade our etcd version then we can introduce the livenessProbe. Please refer to this upstream PR: etcd-io/etcd#13525

/on-hold

ishan16696 · 2023-01-08T12:36:08Z

This issue needs to pick after #445 etcd version upgrade.

shreyas-s-rao · 2023-05-03T09:14:20Z

/assign @ishan16696

ishan16696 added the kind/enhancement Enhancement, improvement, extension label Dec 22, 2021

ishan16696 mentioned this issue Jan 20, 2022

Multi-Node/Clustered ETCD #107

Closed

34 tasks

timuthy mentioned this issue Feb 16, 2022

[Feature] Failed backups to not block incoming traffic and trigger high prio alert instead #147

Closed

3 tasks

ishan16696 mentioned this issue Jun 8, 2022

Update the readiness/liveness Probes configuration. #357

Merged

ishan16696 closed this as completed in #357 Jun 14, 2022

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 14, 2022

ishan16696 reopened this Jun 20, 2022

gardener-robot removed the status/closed Issue is closed (either delivered or triaged) label Jun 20, 2022

timuthy closed this as completed Jun 22, 2022

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 22, 2022

ashwani2k added the release/beta Planned for Beta release of the Feature label Jul 6, 2022

ishan16696 reopened this Sep 22, 2022

gardener-robot removed the status/closed Issue is closed (either delivered or triaged) label Sep 22, 2022

gardener-robot added the status/on-hold Issue on hold (e.g. because work was suspended) label Sep 22, 2022

ashwani2k added release/ga Planned for GA(General Availability) release of the Feature and removed release/beta Planned for Beta release of the Feature labels Oct 11, 2022

abdasgupta added the priority/4 Priority (lower number equals higher priority) label Jan 6, 2023

shreyas-s-rao added priority/1 Priority (lower number equals higher priority) and removed priority/4 Priority (lower number equals higher priority) labels May 3, 2023

gardener-robot assigned ishan16696 May 3, 2023

ashwani2k added priority/2 Priority (lower number equals higher priority) and removed priority/1 Priority (lower number equals higher priority) labels May 26, 2023

aaronfern mentioned this issue May 29, 2023

First version of etcd-wrapper gardener/etcd-wrapper#3

Merged

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Feb 2, 2024

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Oct 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Liveness Probe on multi-node etcd #280

[Feature] Liveness Probe on multi-node etcd #280

ishan16696 commented Dec 22, 2021 •

edited

Loading

ishan16696 commented Dec 22, 2021

timuthy commented May 30, 2022

ishan16696 commented Jun 1, 2022

ishan16696 commented Jun 1, 2022

timuthy commented Jun 1, 2022

ishan16696 commented Jun 1, 2022

ishan16696 commented Jun 1, 2022 •

edited

Loading

timuthy commented Jun 2, 2022

ishan16696 commented Jun 2, 2022

unmarshall commented Jun 8, 2022 •

edited

Loading

timuthy commented Jun 8, 2022

ishan16696 commented Jun 20, 2022

timuthy commented Jun 20, 2022 •

edited

Loading

ishan16696 commented Jun 20, 2022

timuthy commented Jun 20, 2022

ishan16696 commented Jun 22, 2022

ishan16696 commented Jun 23, 2022

ishan16696 commented Sep 22, 2022

ishan16696 commented Jan 8, 2023

shreyas-s-rao commented May 3, 2023

[Feature] Liveness Probe on multi-node etcd #280

[Feature] Liveness Probe on multi-node etcd #280

Comments

ishan16696 commented Dec 22, 2021 • edited Loading

ishan16696 commented Dec 22, 2021

timuthy commented May 30, 2022

ishan16696 commented Jun 1, 2022

ishan16696 commented Jun 1, 2022

timuthy commented Jun 1, 2022

ishan16696 commented Jun 1, 2022

ishan16696 commented Jun 1, 2022 • edited Loading

timuthy commented Jun 2, 2022

ishan16696 commented Jun 2, 2022

unmarshall commented Jun 8, 2022 • edited Loading

timuthy commented Jun 8, 2022

ishan16696 commented Jun 20, 2022

timuthy commented Jun 20, 2022 • edited Loading

ishan16696 commented Jun 20, 2022

timuthy commented Jun 20, 2022

ishan16696 commented Jun 22, 2022

ishan16696 commented Jun 23, 2022

ishan16696 commented Sep 22, 2022

ishan16696 commented Jan 8, 2023

shreyas-s-rao commented May 3, 2023

ishan16696 commented Dec 22, 2021 •

edited

Loading

ishan16696 commented Jun 1, 2022 •

edited

Loading

unmarshall commented Jun 8, 2022 •

edited

Loading

timuthy commented Jun 20, 2022 •

edited

Loading