Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix liveness and readiness probes #396

Merged
merged 5 commits into from
Aug 8, 2022
Merged

Conversation

timuthy
Copy link
Member

@timuthy timuthy commented Aug 5, 2022

How to categorize this PR?

/area quality
/kind bug

What this PR does / why we need it:
This PR fixes several issues for the currently used liveness and readiness checks and also adds a startup probe for single- and multi-node etcds.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
Issue with current liveness probe:

  • When passing multiple arguments to /bin/sh -ec then only the first argument is considered, i.e. /bin/sh -ec ETCDCTL_API=3 is always successful.
            - /bin/sh
            - -ec
            - ETCDCTL_API=3
            - etcdctl
            - --cert=/var/etcd/ssl/client/client/tls.crt
            - --key=/var/etcd/ssl/client/client/tls.key
            - --cacert=/var/etcd/ssl/client/ca/ca.crt
            - --endpoints=https://etcd-aws-local:2379/
            - get
            - foo
            - --consistency=s

Issue with current readiness probe:

  • The exec command did not evaluate the return code of the HTTP response and thus the container was considered ready even though the /health(z) endpoint returned != 200.
            - /usr/bin/curl
            - --cert
            - /var/etcd/ssl/client/client/tls.crt
            - --key
            - /var/etcd/ssl/client/client/tls.key
            - --cacert
            - /var/etcd/ssl/client/ca/ca.crt
            - https://etcd-aws-local:8080/healthz

For single-node it's possible to switch to an http probe to solve the explained issue.

For multi-node it's necessary to use an exec probe because the /health endpoint of etcd is protected by mutual TLS (also see etcd-io/etcd#12370) and providing a client cert is not supported via Kubernetes http probes.

The new liveness probe is now accurate, but still fails after few seconds due to the backup sidecar requiring a long time to promote its etcd member from learner. This leads to the etcd container being restarted, and the backup sidecar's initialization fails, and begins another initialization when the etcd container comes back up. This cycle continues, and the etcd pods never become ready. This problem is solved by using a startup probe of 2 minutes to allow the initialization to complete without interruptions due to etcd container restarts.

Release note:

An issue has been fixed that caused the `liveness` and `readiness` probes of `etcd` to always succeed even though an error was reported. This prevented defective etcd pods from being restarted automatically and caused unready candidates being considered as ready to serve traffic via the `etcd service`.
A `startup` probe has been added to `etcd` to allow 2 minutes of initialization time before checking for etcd liveness.
Add support for running envtest on M1 Macbooks.

@timuthy timuthy requested a review from a team as a code owner August 5, 2022 10:56
@gardener-robot gardener-robot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug needs/review Needs review size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Aug 5, 2022
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 5, 2022
@timuthy timuthy added this to the v0.12.0 milestone Aug 5, 2022
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 5, 2022
Copy link
Contributor

@shreyas-s-rao shreyas-s-rao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timuthy Thanks for the quick fix! Overall LGTM except for a small nit.
/lgtm

pkg/component/etcd/statefulset/values_helper.go Outdated Show resolved Hide resolved
@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/review Needs review labels Aug 5, 2022
Copy link
Contributor

@aaronfern aaronfern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @timuthy!

One comment from me

pkg/component/etcd/statefulset/values_helper.go Outdated Show resolved Hide resolved
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 5, 2022
@timuthy
Copy link
Member Author

timuthy commented Aug 5, 2022

Thanks for the reviews @shreyas-s-rao and @aaronfern 🚀 I fixed the type, PTAL.

@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 5, 2022
@shreyas-s-rao
Copy link
Contributor

@timuthy the unit tests for sts component are failing because the probes in validateEtcd and validateEtcdWithDefaults functions in the test need to be adapted. Can you please make that change? Thanks!

@shreyas-s-rao
Copy link
Contributor

/hold

@gardener-robot gardener-robot added the reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies label Aug 5, 2022
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 5, 2022
Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>
@gardener-robot gardener-robot added size/l Size of pull request is large (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else and removed size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) reviewed/lgtm Has approval for merging labels Aug 6, 2022
Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 6, 2022
Copy link
Contributor

@aaronfern aaronfern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/second-opinion Needs second review by someone else labels Aug 6, 2022
@shreyas-s-rao shreyas-s-rao merged commit 0ed5c73 into gardener:master Aug 8, 2022
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Aug 8, 2022
@timuthy timuthy deleted the fix.probes branch August 15, 2022 09:18
timuthy added a commit to timuthy/etcd-druid that referenced this pull request Sep 1, 2022
The enablement of startup/liveness probes through gardener#396 showed that they cause more harm than good:
- The startup time of etcds can vary depending on the state and amount of data
- If startup does not happen in the expected time, the failing probes kill the container which does not help to solve the issue at all but will end in a endless loop of restarts
- Liveness probes had been disabled for a long time before which never caused issues in our experience.
- Other communities have come to a similar conclusion, see https://github.com/improbable-eng/etcd-cluster-operator/blob/master/docs/operations.md#why-arent-there-liveness-probes-for-the-etcd-pods
aaronfern pushed a commit to aaronfern/etcd-druid that referenced this pull request Sep 1, 2022
The enablement of startup/liveness probes through gardener#396 showed that they cause more harm than good:
- The startup time of etcds can vary depending on the state and amount of data
- If startup does not happen in the expected time, the failing probes kill the container which does not help to solve the issue at all but will end in a endless loop of restarts
- Liveness probes had been disabled for a long time before which never caused issues in our experience.
- Other communities have come to a similar conclusion, see https://github.com/improbable-eng/etcd-cluster-operator/blob/master/docs/operations.md#why-arent-there-liveness-probes-for-the-etcd-pods
aaronfern added a commit that referenced this pull request Sep 1, 2022
The enablement of startup/liveness probes through #396 showed that they cause more harm than good:
- The startup time of etcds can vary depending on the state and amount of data
- If startup does not happen in the expected time, the failing probes kill the container which does not help to solve the issue at all but will end in a endless loop of restarts
- Liveness probes had been disabled for a long time before which never caused issues in our experience.
- Other communities have come to a similar conclusion, see https://github.com/improbable-eng/etcd-cluster-operator/blob/master/docs/operations.md#why-arent-there-liveness-probes-for-the-etcd-pods

Co-authored-by: Tim Usner <tim.usner@sap.com>
seshachalam-yv pushed a commit to seshachalam-yv/etcd-druid that referenced this pull request Sep 9, 2022
The enablement of startup/liveness probes through gardener#396 showed that they cause more harm than good:
- The startup time of etcds can vary depending on the state and amount of data
- If startup does not happen in the expected time, the failing probes kill the container which does not help to solve the issue at all but will end in a endless loop of restarts
- Liveness probes had been disabled for a long time before which never caused issues in our experience.
- Other communities have come to a similar conclusion, see https://github.com/improbable-eng/etcd-cluster-operator/blob/master/docs/operations.md#why-arent-there-liveness-probes-for-the-etcd-pods
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies reviewed/lgtm Has approval for merging size/l Size of pull request is large (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants