Track stabilization --experimental-wait-cluster-ready-timeout #13785

Open
serathius opened this issue Mar 10, 2022 · 7 comments

@serathius
Member

serathius commented Mar 10, 2022

This is a tracking issue for the graduation of this feature. Let's work towards stabilizing it by discussing what steps should be taken for graduation.
Context #13775

@ahrtr
Member

ahrtr commented Mar 11, 2022

This is a new flag that was added recently in 3.6. Normally we should graduate a flag no earlier than the next release (>=3.7). But since this flag is low risk, I am OK with graduating it in 3.6 if there are no objections. cc @serathius @spzala @ptabor

@ahrtr ahrtr self-assigned this Mar 11, 2022
@serathius
Member Author

I have bulk-created issues to graduate all experimental flags. For this flag, which was added in v3.6, it should be reasonable to wait for v3.7.

@serathius serathius modified the milestones: etcd-v3.6, etcd-v3.7 Mar 14, 2022
@serathius serathius changed the title Stabilize --experimental-wait-cluster-ready-timeout Track stabilization --experimental-wait-cluster-ready-timeout Mar 14, 2022
@stale stale bot added the stale label Jun 12, 2022
@ahrtr ahrtr removed the stale label Jun 12, 2022
@etcd-io etcd-io deleted a comment from stale bot Jun 13, 2022
@niconorsk

Hi, I'm not sure if this is the correct place to ask, but I noticed one slight issue with this new flag. We don't notify systemd that we are ready to go, which means that etcd ends up in a restart loop in this scenario.

I've fixed this in a local install by changing startEtcd in etcdmain/etcd.go to also time out according to the config flag.

I'm OK with putting up an MR for this, but I wanted to check that this is a desired change in the first place, and I'd also appreciate suggestions on where to add tests for something like this.
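For context, the restart loop happens when etcd runs under a Type=notify systemd unit. A minimal sketch of such a unit (the binary path, config path, and timeout values below are illustrative assumptions, not taken from any particular install):

[Unit]
Description=etcd key-value store

[Service]
# Type=notify: systemd waits for the service to send sd_notify(READY=1)
# before it considers startup complete.
Type=notify
ExecStart=/usr/local/bin/etcd --config-file=/etc/etcd/etcd.conf.yml
# If READY=1 never arrives within TimeoutStartSec, startup is treated as
# failed and Restart=on-failure kicks in, producing the loop described above.
TimeoutStartSec=90
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target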

@ahrtr
Member

ahrtr commented Aug 24, 2022

We don't notify systemd that we are ready to go, which means that etcd ends up in a restart loop in this scenario.

I am not sure I get your point. Do you mean systemd restarted etcd because etcd blocked on serve.go#L105?

The PR isn't cherry-picked to 3.5. Did you build etcd from the main branch yourself, or did you see this issue in a 3.6 alpha?

@niconorsk

I tested this with a patch on 3.3.11, because that is the old version I am using, but the code in question does not look to have changed.

Specifically, etcd blocks on etcd.go#L208

This means it never sends the message back to systemd saying it should be considered started, so systemd eventually times it out and restarts it.
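The "message back to systemd" here is the sd_notify READY=1 call. A minimal sketch of how a Go program sends it with the coreos/go-systemd package, shown only to illustrate the mechanism rather than quoting etcd's own notify code:

package main

import (
	"log"

	"github.com/coreos/go-systemd/v22/daemon"
)

func main() {
	// ... do startup work here; if this blocks forever, the notify call
	// below is never reached and a Type=notify unit eventually times out ...

	// Report READY=1 to systemd once startup has actually finished.
	sent, err := daemon.SdNotify(false, daemon.SdNotifyReady)
	if err != nil {
		log.Printf("failed to notify systemd: %v", err)
	} else if !sent {
		log.Print("no systemd notification socket; notification skipped")
	}
}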

@ahrtr
Member

ahrtr commented Aug 25, 2022

Got it. This is a real issue. If the other members start too late or too slowly for whatever reason, the running member may be restarted by systemd, which makes the situation even worse.

Thanks for raising this. We should add code something like the diff below. Please feel free to deliver a PR for this, but please add an e2e test case.

diff --git a/server/etcdmain/etcd.go b/server/etcdmain/etcd.go
index f35ebde6b..7f3ad1cd6 100644
--- a/server/etcdmain/etcd.go
+++ b/server/etcdmain/etcd.go
@@ -19,6 +19,7 @@ import (
        "os"
        "runtime"
        "strings"
+       "time"
 
        "go.etcd.io/etcd/client/pkg/v3/fileutil"
        "go.etcd.io/etcd/client/pkg/v3/logutil"
@@ -207,6 +208,8 @@ func startEtcd(cfg *embed.Config) (<-chan struct{}, <-chan error, error) {
        select {
        case <-e.Server.ReadyNotify(): // wait for e.Server to join the cluster
        case <-e.Server.StopNotify(): // publish aborted from 'ErrStopped'
+       case <-time.After(cfg.ExperimentalWaitClusterReadyTimeout):
+               e.GetLogger().Warn("startEtcd: timed out waiting for the ready notification")
        }
        return e.Server.StopNotify(), e.Err(), nil
 }
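As a usage note (the member names, peer URLs, and timeout value below are illustrative, and assume the flag accepts a standard Go duration string): with the change above, the wait for cluster readiness is bounded by the same flag this issue tracks, for example:

etcd --experimental-wait-cluster-ready-timeout=30s \
  --name infra1 \
  --initial-cluster infra1=http://10.0.0.1:2380,infra2=http://10.0.0.2:2380,infra3=http://10.0.0.3:2380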

niconorsk added a commit to niconorsk/etcd that referenced this issue Aug 26, 2022

When we can't reach quorum, we were waiting forever and never sending
the systemd notify message. As a result, systemd would eventually time out
and restart the etcd process, which would likely leave the unhealthy cluster
in an even worse state.

Improves etcd-io#13785

Signed-off-by: Nicolai Moore <niconorsk@gmail.com>
@niconorsk

I think the MR is good to go, but it needs someone to allow CI to run.
