Track stabilization --experimental-wait-cluster-ready-timeout #13785

Open
serathius opened this issue Mar 10, 2022 · 7 comments

@serathius
Member

serathius commented Mar 10, 2022

This is a tracking issue for the graduation of this feature. Let's work towards stabilizing it by discussing what steps should be taken for graduation.
Context #13775

@ahrtr
Member

ahrtr commented Mar 11, 2022

This is a new flag that was added recently in 3.6. Normally we should graduate a flag no earlier than the next release (>=3.7). But since this flag is low risk, I am OK with graduating it in 3.6 if there are no objections. cc @serathius @spzala @ptabor

@ahrtr ahrtr self-assigned this Mar 11, 2022
@serathius
Member Author

I have bulk-created issues to graduate all experimental flags. For this flag, which was added in v3.6, it should be reasonable to wait for v3.7.

@serathius serathius modified the milestones: etcd-v3.6, etcd-v3.7 Mar 14, 2022
@serathius serathius changed the title Stabilize --experimental-wait-cluster-ready-timeout Track stabilization --experimental-wait-cluster-ready-timeout Mar 14, 2022
@stale stale bot added the stale label Jun 12, 2022
@ahrtr ahrtr removed the stale label Jun 12, 2022
@etcd-io etcd-io deleted a comment from stale bot Jun 13, 2022
@niconorsk

Hi, I'm not sure if this is the correct place to ask, but I noticed one slight issue with this new flag. We don't notify systemd that we are ready to go, which means that etcd ends up in a restart loop in this scenario.

I've fixed this in a local install by changing startEtcd in etcdmain/etcd.go to also time out according to the config flag.

I'm OK with putting up an MR for this, but I wanted to check that this is a desired change in the first place, and I'd also appreciate suggestions on where to add tests for something like this.
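For context, the restart loop happens when etcd runs under a Type=notify systemd unit. A minimal sketch of such a unit (the binary path, config path, and timeout values below are illustrative assumptions, not taken from any particular install):

[Unit]
Description=etcd key-value store

[Service]
# Type=notify: systemd waits for the service to send sd_notify(READY=1)
# before it considers startup complete.
Type=notify
ExecStart=/usr/local/bin/etcd --config-file=/etc/etcd/etcd.conf.yml
# If READY=1 never arrives within TimeoutStartSec, startup is treated as
# failed and Restart=on-failure kicks in, producing the loop described above.
TimeoutStartSec=90
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target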

@ahrtr
Member

ahrtr commented Aug 24, 2022

We don't notify systemd that we are ready to go, which means that etcd ends up in a restart loop in this scenario.

I am not sure I get your point. Do you mean systemd restarted etcd because etcd blocked on serve.go#L105?

The PR isn't cherry-picked to 3.5. Did you build etcd from the main branch yourself, or did you see this issue in a 3.6 alpha?

@niconorsk

I tested this with a patch on 3.3.11, because that is the old version I am using, but the code in question does not look to have changed.

Specifically, etcd blocks on etcd.go#L208

This means it never sends the message back to systemd saying it should be considered started, so systemd eventually times it out and restarts it.
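The "message back to systemd" here is the sd_notify READY=1 call. A minimal sketch of how a Go program sends it with the coreos/go-systemd package, shown only to illustrate the mechanism rather than quoting etcd's own notify code:

package main

import (
	"log"

	"github.com/coreos/go-systemd/v22/daemon"
)

func main() {
	// ... do startup work here; if this blocks forever, the notify call
	// below is never reached and a Type=notify unit eventually times out ...

	// Report READY=1 to systemd once startup has actually finished.
	sent, err := daemon.SdNotify(false, daemon.SdNotifyReady)
	if err != nil {
		log.Printf("failed to notify systemd: %v", err)
	} else if !sent {
		log.Print("no systemd notification socket; notification skipped")
	}
}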

@ahrtr
Member

ahrtr commented Aug 25, 2022

Got it. This is a real issue. If the other members start too late or too slowly for whatever reason, the running member may be restarted by systemd, which makes the situation even worse.

Thanks for raising this. We should add code something like the diff below. Please feel free to deliver a PR for this, but please add an e2e test case.

diff --git a/server/etcdmain/etcd.go b/server/etcdmain/etcd.go
index f35ebde6b..7f3ad1cd6 100644
--- a/server/etcdmain/etcd.go
+++ b/server/etcdmain/etcd.go
@@ -19,6 +19,7 @@ import (
        "os"
        "runtime"
        "strings"
+       "time"
 
        "go.etcd.io/etcd/client/pkg/v3/fileutil"
        "go.etcd.io/etcd/client/pkg/v3/logutil"
@@ -207,6 +208,8 @@ func startEtcd(cfg *embed.Config) (<-chan struct{}, <-chan error, error) {
        select {
        case <-e.Server.ReadyNotify(): // wait for e.Server to join the cluster
        case <-e.Server.StopNotify(): // publish aborted from 'ErrStopped'
+       case <-time.After(cfg.ExperimentalWaitClusterReadyTimeout):
+               e.GetLogger().Warn("startEtcd: timed out waiting for the ready notification")
        }
        return e.Server.StopNotify(), e.Err(), nil
 }
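As a usage note (the member names, peer URLs, and timeout value below are illustrative, and assume the flag accepts a standard Go duration string): with the change above, the wait for cluster readiness is bounded by the same flag this issue tracks, for example:

etcd --experimental-wait-cluster-ready-timeout=30s \
  --name infra1 \
  --initial-cluster infra1=http://10.0.0.1:2380,infra2=http://10.0.0.2:2380,infra3=http://10.0.0.3:2380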

niconorsk added a commit to niconorsk/etcd that referenced this issue Aug 26, 2022

When we can't reach quorum, we were waiting forever and never sending
the systemd notify message. As a result, systemd would eventually time out
and restart the etcd process, which would likely leave the unhealthy cluster
in an even worse state.

Improves etcd-io#13785

Signed-off-by: Nicolai Moore <niconorsk@gmail.com>
@niconorsk

I think the MR is good to go, but it needs someone to allow CI to run.
