Liveness probes for Etcd #83
E2E tests failed with what looks like data loss:
https://app.circleci.com/jobs/github/improbable-eng/etcd-cluster-operator/357
This seems to happen intermittently.
Resolved the conflicts. Please review.
)
require.NoError(t, err, out)
assert.Equal(t, expectedValue+"\n", out)
}
How do we expect this to behave on a multi-node cluster? If, for example, 1 of 3 nodes is killed, will the cluster continue to operate while the pod is recovering? Should we test this?
Given 1 / 3 nodes fail, I would expect it to continue to respond to queries, yes.
But perhaps we need to add a separate set of e2e tests for quorum and non-quorum failure situations.
For now, I've updated this E2E test to use a 3-node cluster and to STOP two of the nodes.
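The quorum reasoning behind that expectation can be sketched as follows. This is a minimal illustration, not code from this PR: the helper names `quorum` and `hasQuorum` are hypothetical, and the only assumption is the usual Raft majority rule that etcd follows.

```go
package main

import "fmt"

// quorum returns the number of members that must be available for a
// cluster of the given size to make progress (a strict majority).
// Hypothetical helper, for illustration only.
func quorum(members int) int {
	return members/2 + 1
}

// hasQuorum reports whether a cluster of the given size still has a
// majority once `stopped` members are unavailable.
func hasQuorum(members, stopped int) bool {
	return members-stopped >= quorum(members)
}

func main() {
	// Compare the two cases discussed above for a 3-node cluster.
	for _, stopped := range []int{1, 2} {
		fmt.Printf("3 members, %d stopped: has quorum = %v\n",
			stopped, hasQuorum(3, stopped))
	}
}
```

With one of three members down the majority of two is still met, while stopping two members leaves a single member, which is below quorum. That is why the two cases probably deserve separate E2E tests.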
Oops. Didn't mean to close this.
* Add a cluster name flag
@adamhosier You were right to be suspicious. I've updated the test to STOP peers 0 and 1. The E2E tests pass, but if I leave the cluster running, all the peers end up in a crash loop backoff.
I need to spend some time digging through logs to find out why.
We discussed this last week and the conclusions were:
Further reading: What I might do is document this reasoning in an FAQ document.
Part of #82
Based on the liveness probe used by kubeadm for the kubernetes Etcd processes: kubernetes/kubernetes#81385
The E2E test relies on details of a Kind cluster, so it is skipped in non-kind environments.
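For readers unfamiliar with this style of probe, here is a rough sketch of the kind of check an HTTP liveness probe against Etcd performs. It is not the code in this PR. It assumes the probe targets etcd's `/health` endpoint and that the body is JSON with a string `health` field, such as `{"health":"true"}`. The address `http://127.0.0.1:2381/health` and the helper names `parseHealth` and `checkHealth` are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// healthResponse mirrors the JSON body assumed to be served on /health,
// for example {"health":"true"}.
type healthResponse struct {
	Health string `json:"health"`
}

// parseHealth decodes a /health body and reports whether the member
// describes itself as healthy. Hypothetical helper.
func parseHealth(body []byte) (bool, error) {
	var h healthResponse
	if err := json.Unmarshal(body, &h); err != nil {
		return false, err
	}
	return h.Health == "true", nil
}

// checkHealth performs the HTTP GET and treats any non-200 status as
// unhealthy before looking at the body.
func checkHealth(client *http.Client, url string) (bool, error) {
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, nil
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return parseHealth(body)
}

func main() {
	// The URL is an assumption; adjust it to wherever the member
	// exposes its health endpoint.
	client := &http.Client{Timeout: 5 * time.Second}
	healthy, err := checkHealth(client, "http://127.0.0.1:2381/health")
	fmt.Println("healthy:", healthy, "err:", err)
}
```

A Kubernetes HTTP probe only looks at the response status code, so the body parsing here is stricter than the probe itself. It is included to show what the endpoint is assumed to report.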