
Liveness probes for Etcd #83

Closed
wallrj wants to merge 2 commits from the 82-liveness-probes branch

Conversation

@wallrj commented Nov 14, 2019

Part of #82

Based on the liveness probe used by kubeadm for the kubernetes Etcd processes: kubernetes/kubernetes#81385
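For orientation, here is a minimal sketch of what a probe along those lines might look like in the operator's Go code. The package name, port, and timing values are illustrative assumptions, not the exact settings in this branch (and note that recent k8s.io/api releases rename Handler to ProbeHandler):

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// etcdLivenessProbe sketches a kubeadm-style liveness probe: an HTTP GET
// against etcd's /health endpoint. Port 2379 and the timings below are
// assumptions for illustration only.
func etcdLivenessProbe() *corev1.Probe {
	return &corev1.Probe{
		Handler: corev1.Handler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/health",
				Port: intstr.FromInt(2379),
			},
		},
		InitialDelaySeconds: 15,
		TimeoutSeconds:      15,
		PeriodSeconds:       10,
		FailureThreshold:    8,
	}
}
```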

The E2E test relies on details of a Kind cluster, so it is skipped in non-kind environments.
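A minimal sketch of how such a guard can be written in a Go test; the environment-variable name here is a hypothetical example, not necessarily the mechanism this branch uses:

```go
package e2e

import (
	"os"
	"testing"
)

// skipUnlessKind skips tests that rely on Kind-specific details of the
// cluster. TEST_USE_KIND is a hypothetical variable name for illustration.
func skipUnlessKind(t *testing.T) {
	t.Helper()
	if os.Getenv("TEST_USE_KIND") != "true" {
		t.Skip("test relies on details of a Kind cluster; skipping")
	}
}
```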

@wallrj force-pushed the 82-liveness-probes branch 2 times, most recently from 8187d21 to 3e9ca4e on November 14, 2019 at 16:34
@wallrj changed the title from "WIP: liveness probes" to "Liveness probes" on Nov 14, 2019
@wallrj changed the title from "Liveness probes" to "Liveness probes for Etcd" on Nov 14, 2019
@wallrj commented Nov 14, 2019

E2E tests failed with what looks like data loss:

            kubectl.go:38: Running kubectl --namespace teste2e-parallel-probes run --quiet --restart=Never --rm --image=quay.io/coreos/etcd:v3.2.27 --attach etcdctl -- etcdctl --insecure-discovery --discovery-srv=cluster1 get --quorum -- foo
            (the same command logged 5 times as the test retried)
            e2e_test.go:574:
                Error Trace:    e2e_test.go:574
                                e2e_test.go:301
                Error:          Not equal:
                                expected: "foobarbaz\n"
                                actual  : ""

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,2 +1 @@
                                -foobarbaz
                                 
                Test:           TestE2E/Parallel/Probes

https://app.circleci.com/jobs/github/improbable-eng/etcd-cluster-operator/357

This seems to happen intermittently.
I don't know whether there's a bug in etcd or etcdctl that makes it respond with empty values until the etcd process has fully started up; eventually the expected value can be retrieved.
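If the empty responses really are a startup artefact, one way to harden the test would be to poll until a non-empty value comes back. A rough sketch, with a hypothetical getValue helper standing in for the etcdctl invocation above:

```go
package e2e

import (
	"fmt"
	"time"
)

// waitForValue polls getValue until etcd returns a non-empty result,
// tolerating the window where a freshly started member appears to answer
// quorum reads with an empty value.
func waitForValue(getValue func() (string, error), timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := getValue()
		if err == nil && out != "" {
			return out, nil
		}
		time.Sleep(5 * time.Second)
	}
	return "", fmt.Errorf("timed out after %s waiting for a non-empty value", timeout)
}
```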

@wallrj requested a review from JamesLaverack on November 15, 2019 at 12:22
@adamhosier mentioned this pull request on Nov 15, 2019
@wallrj force-pushed the 82-liveness-probes branch from 3e9ca4e to 2d5c835 on November 15, 2019 at 17:47
@wallrj commented Nov 15, 2019

Resolved the conflicts. Please review.

)
require.NoError(t, err, out)
assert.Equal(t, expectedValue+"\n", out)
}
A reviewer (Contributor) commented on this code:

How do we expect this to behave on a multi-node cluster? If, for example, 1 of 3 nodes is killed, will the cluster continue to operate while the pod is recovering? Should we test this?

wallrj (Contributor, author) replied:

If 1 of 3 nodes fails, I would expect the cluster to continue to respond to queries, yes: etcd needs a quorum of ⌊n/2⌋ + 1 members, so a 3-node cluster tolerates one failure.
But perhaps we need to add a separate set of E2E tests for quorum and non-quorum failure situations.
For now, I've updated this E2E test to use a 3-node cluster and to STOP two of the nodes.

@wallrj closed this in #88 on Nov 19, 2019
@wallrj commented Nov 19, 2019

Oops. Didn't mean to close this.

@wallrj reopened this on Nov 19, 2019
@wallrj force-pushed the 82-liveness-probes branch 3 times, most recently from 2a2b5c0 to d607eae on November 19, 2019 at 12:38
@wallrj force-pushed the 82-liveness-probes branch from d607eae to 4db98f5 on November 19, 2019 at 14:38
@wallrj commented Nov 19, 2019

@adamhosier You were right to be suspicious. I've updated the test to STOP peers 0 and 1, and although the E2E tests pass, if I leave the cluster running all the peers eventually end up in CrashLoopBackOff.

$ kubectl -n teste2e-parallel-probes get pods
NAME                   READY   STATUS             RESTARTS   AGE
get-etcd-value-m5l8b   0/1     Completed          4          18m
my-cluster-0-rwqsq     0/1     CrashLoopBackOff   7          19m
my-cluster-1-vdbx2     0/1     CrashLoopBackOff   7          19m
my-cluster-2-rjnk6     0/1     CrashLoopBackOff   7          19m
set-etcd-value-s8r5x   0/1     Completed          3          19m

I need to spend some time digging through logs to find out why.
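(For a pod stuck in CrashLoopBackOff, the logs of the previously terminated container are usually the interesting ones, e.g. kubectl -n teste2e-parallel-probes logs my-cluster-0-rwqsq --previous, with the pod name taken from the listing above.)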

@adamhosier removed this from the v0.2.0 milestone on Nov 27, 2019
@wallrj commented Dec 3, 2019

We discussed this last week and the conclusions were:

  • Liveness probes may cause more problems than they solve. In this branch I GET the /health endpoint for liveness, but if the cluster is unhealthy, this will fail and the remaining nodes will be restarted unnecessarily, making the situation worse. We can add a TCP connect liveness probe if and when we encounter deadlocks in etcd (see the sketch after this list).
  • Readiness probes do not make sense either, because we assume that most clients will be able to connect to multiple etcd nodes and detect whether each of them is healthy.
    If we did add a readiness probe defined as GET /health, for example, and a corresponding load-balancer Service, clients would have to deal with DNS A record lookups for that Service name returning no results whenever the cluster is unhealthy, and this doesn't seem right either.
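For illustration, a minimal sketch of the TCP connect fallback mentioned in the first point; the package name and port are assumptions:

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// etcdTCPLivenessProbe sketches a plain TCP connect check: it only
// verifies that the etcd process is accepting connections, so a loss of
// cluster quorum cannot cascade into restarts of otherwise-healthy
// members the way a GET /health probe can.
func etcdTCPLivenessProbe() *corev1.Probe {
	return &corev1.Probe{
		Handler: corev1.Handler{
			TCPSocket: &corev1.TCPSocketAction{
				Port: intstr.FromInt(2379),
			},
		},
	}
}
```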

Further reading:

I might document this reasoning in an FAQ.

@wallrj closed this on Dec 3, 2019
@wallrj deleted the 82-liveness-probes branch on February 11, 2020
Labels: enhancement (New feature or request)