[concept] add livez/readyz for etcd #16008

logicalhan · 2023-06-05T17:10:09Z

This is a prototype for adding livez/readyz support to etcd. Currently I've configured the NO_SPACE alarm to count only towards /readyz, since it means etcd is degraded. Whether quorum should be included in liveness is an open question.
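
To make the split described above concrete, here is a minimal Go sketch. It is not the code in this PR. It assumes a hypothetical `alarmSource` interface and a hypothetical `AlarmType` string type (etcd's real types differ), and shows a NOSPACE alarm failing `/readyz` while `/livez` keeps passing. How other alarms should be treated is not settled here; in the sketch, any alarm a probe does not explicitly ignore fails that probe, which is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// AlarmType is a hypothetical stand-in for etcd's alarm type.
type AlarmType string

// AlarmNoSpace models the alarm raised when the backend quota is exhausted.
const AlarmNoSpace AlarmType = "NOSPACE"

// alarmSource is a hypothetical interface exposing the active alarms.
type alarmSource interface {
	ActiveAlarms() []AlarmType
}

// staticAlarms is a fixed alarm list, used here only for demonstration.
type staticAlarms []AlarmType

func (s staticAlarms) ActiveAlarms() []AlarmType { return s }

// probeStatus returns 200 unless an active alarm is not in the ignored set.
func probeStatus(src alarmSource, ignored map[AlarmType]bool) int {
	for _, a := range src.ActiveAlarms() {
		if !ignored[a] { // reading a nil map is safe and yields false
			return http.StatusServiceUnavailable
		}
	}
	return http.StatusOK
}

// newProbeMux wires both endpoints to the same check with different ignore sets.
func newProbeMux(src alarmSource) *http.ServeMux {
	mux := http.NewServeMux()
	// Assumption: a full disk is a degradation, not a reason to restart.
	mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(probeStatus(src, map[AlarmType]bool{AlarmNoSpace: true}))
	})
	// Readiness ignores nothing, so NOSPACE marks the member not ready.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(probeStatus(src, nil))
	})
	return mux
}

// statusFor issues an in-memory GET and returns the response status code.
func statusFor(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newProbeMux(staticAlarms{AlarmNoSpace})
	fmt.Println("/livez:", statusFor(mux, "/livez"))   // prints "/livez: 200"
	fmt.Println("/readyz:", statusFor(mux, "/readyz")) // prints "/readyz: 503"
}
```

Passing the ignore set per endpoint keeps a single check function for both probes; the caller decides which alarms matter.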

Change-Id: Ia440a82b2bf3d275b7cd7d88b5a6e86fe9fe1c28 Signed-off-by: Han Kang <hankang@google.com> Change-Id: Ief9475a92429be58eb7b1f96246bbdb00e996e75

Change-Id: Ie699ae11d0ecc315b91365f85f0ac0b2d339c28d Signed-off-by: Han Kang <hankang@google.com> Change-Id: Iee6f469f63cb1fbcc22a4d633a621b7915a1a799

server/etcdserver/api/etcdhttp/health.go

Signed-off-by: Han Kang <hankang@google.com> Change-Id: I7e95be58cff6b7bc47fa3114249074a9f69a1620

Co-authored-by: Benjamin Wang <wachao@vmware.com> Signed-off-by: Han Kang <hankang@google.com> Change-Id: I45fda0a8ee7d80638af96fee4efb3bfdf2aebaf8

Change-Id: Ie5f02bba1a63f7592c6f3500db9070e6f1022df0 Signed-off-by: Han Kang <hankang@google.com>

serathius

I don't want to rush into adding livez/readyz probes. The main problem with the existing health probe is that we added it just to have one, without proper consideration.

I want livez to properly reflect the fact that etcd needs a restart, for example when etcd is stuck on stalled storage: https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit?usp=sharing.

Readyz should properly reflect the fact that etcd is ready to serve traffic. I don't think alarms matter here. An alarm is a degradation, but it doesn't mean we shouldn't serve reads.

TL;DR: I would like to have a design written that does a proper analysis of etcd failure modes and proposes matching probes to detect them. Example: kubernetes-sigs/metrics-server#542
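
One possible reading of the livez requirement above, sketched in Go: run a small check under a deadline and treat a check that never returns as "needs restart". This is only an illustration and is not taken from the linked design document. `checkFunc`, `runWithDeadline`, `isLive`, `probeTimeout`, and the simulated `stalledWrite` are all hypothetical names, and the 50 ms timeout is an arbitrary value chosen for the demo.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// checkFunc is a hypothetical liveness check, e.g. a tiny storage operation.
type checkFunc func(ctx context.Context) error

// probeTimeout is an arbitrary demo value, not a recommended setting.
const probeTimeout = 50 * time.Millisecond

// runWithDeadline runs check and returns an error if the check fails
// or does not finish within timeout.
func runWithDeadline(check checkFunc, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// Buffered so the goroutine can always send and exit, even after a timeout.
	done := make(chan error, 1)
	go func() { done <- check(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("check did not finish within %s: %w", timeout, ctx.Err())
	}
}

// isLive reports whether the check completed successfully in time.
func isLive(check checkFunc) bool {
	return runWithDeadline(check, probeTimeout) == nil
}

// healthyWrite simulates a storage operation that completes immediately.
func healthyWrite(ctx context.Context) error { return nil }

// stalledWrite simulates stalled storage: it blocks until the context ends.
func stalledWrite(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	fmt.Println("healthy:", isLive(healthyWrite)) // prints "healthy: true"
	fmt.Println("stalled:", isLive(stalledWrite)) // prints "stalled: false"
}
```

A real check would also have to decide what to do with an operation that ignores context cancellation; the sketch only bounds how long the probe itself waits.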

ahrtr · 2023-06-06T07:10:23Z

server/etcdserver/api/etcdhttp/health.go

@@ -141,7 +169,7 @@ func getSerializableFlag(r *http.Request) bool {

 // TODO: etcdserver.ErrNoLeader in health API

-func checkAlarms(lg *zap.Logger, srv ServerHealth, excludedAlarms AlarmSet) Health {
+func checkAlarms(lg *zap.Logger, srv ServerHealth, excludedAlarms AlarmSet, healthType string) Health {


The healthType parameter isn't used at all. Can we remove it?

Suggested change

-func checkAlarms(lg *zap.Logger, srv ServerHealth, excludedAlarms AlarmSet, healthType string) Health {
+func checkAlarms(lg *zap.Logger, srv ServerHealth, excludedAlarms AlarmSet) Health {

stale · 2023-09-17T01:16:35Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

jmhbnz · 2024-01-04T19:08:49Z

Discussed during the sig-etcd triage meeting. This original concept has now been superseded by the work from @siyuanfoundation 🎉

siyuanfoundation · 2024-01-04T19:09:38Z

The work is tracked in #16007.

add livez/readyz for etcd

22f940c

Change-Id: Ia440a82b2bf3d275b7cd7d88b5a6e86fe9fe1c28 Signed-off-by: Han Kang <hankang@google.com> Change-Id: Ief9475a92429be58eb7b1f96246bbdb00e996e75

logicalhan force-pushed the livez-readyz branch from 8b6f669 to 22f940c on June 5, 2023 17:14

logicalhan mentioned this pull request Jun 5, 2023

Livez/Readyz #16007 (Open)

logicalhan force-pushed the livez-readyz branch from b5b5372 to e7b59ef on June 5, 2023 17:29

refactor tests to basically use the same logic

73db456

Change-Id: Ie699ae11d0ecc315b91365f85f0ac0b2d339c28d Signed-off-by: Han Kang <hankang@google.com> Change-Id: Iee6f469f63cb1fbcc22a4d633a621b7915a1a799

logicalhan force-pushed the livez-readyz branch from e7b59ef to 73db456 on June 5, 2023 17:38

ahrtr reviewed Jun 5, 2023

fix broken grpc handler

28b8b9f

Signed-off-by: Han Kang <hankang@google.com> Change-Id: I7e95be58cff6b7bc47fa3114249074a9f69a1620

logicalhan force-pushed the livez-readyz branch from 8a7c57f to 28b8b9f on June 5, 2023 23:40

Apply suggestions from code review

4c3e52c

Co-authored-by: Benjamin Wang <wachao@vmware.com> Signed-off-by: Han Kang <hankang@google.com> Change-Id: I45fda0a8ee7d80638af96fee4efb3bfdf2aebaf8

logicalhan force-pushed the livez-readyz branch from 5a50e8e to 4c3e52c on June 5, 2023 23:48

gofmt file

7f9aeb3

Change-Id: Ie5f02bba1a63f7592c6f3500db9070e6f1022df0 Signed-off-by: Han Kang <hankang@google.com>

logicalhan force-pushed the livez-readyz branch from 7476e4a to 7f9aeb3 on June 5, 2023 23:58

serathius requested changes Jun 6, 2023

ahrtr reviewed Jun 6, 2023

stale bot added the stale label Sep 17, 2023

jmhbnz closed this Jan 4, 2024
