
ETCD snapshot is not taken when only 2 of 3 endpoints are reachable #2070

Open
kannanvr opened this issue Apr 2, 2019 · 3 comments


kannanvr commented Apr 2, 2019

Hi,
In our cluster we have 3 etcd endpoints available.
One of the endpoints is not reachable due to a firewall issue, so 2 of the 3 etcd endpoints are reachable. However, the etcd operator is not taking the snapshot: it expects all of the listed endpoints to be available.

Following is the error message:

```
time="2019-04-02T08:59:13Z" level=info msg="getMaxRev: endpoint https://10.209.198.42:2379 revision (11959285)"
time="2019-04-02T08:59:13Z" level=info msg="getMaxRev: endpoint https://10.209.198.43:2379 revision (11959285)"
time="2019-04-02T08:59:13Z" level=error msg="error syncing etcd backup (tcl-cluster/example-etcd-backup-vj2c5): failed to save snapshot (create etcd client failed: failed to get etcd client with maximum kv store revision: failed to create etcd client for endpoint (https://10.209.198.41:2379): dial tcp 10.209.198.41:2379: connect: no route to host\n)" pkg=controller
```

Is it possible to take the snapshot when at least one of the etcd endpoints is available, rather than expecting all of the etcd endpoints to be available and reachable?
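
For illustration, here is a rough sketch (not the operator's actual code; TLS configuration is omitted for brevity) of how a revision probe could skip unreachable endpoints instead of aborting, using the etcd clientv3 API:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// maxRevEndpoint probes each endpoint individually and returns the reachable
// endpoint with the highest KV store revision, skipping endpoints that fail
// instead of failing the whole backup.
func maxRevEndpoint(endpoints []string) (string, int64, error) {
	var bestEp string
	var bestRev int64 = -1

	for _, ep := range endpoints {
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   []string{ep},
			DialTimeout: 5 * time.Second,
			// NOTE: real https endpoints would also need a TLS config here.
		})
		if err != nil {
			continue // skip an endpoint we cannot even configure a client for
		}
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		resp, err := cli.Status(ctx, ep)
		cancel()
		cli.Close()
		if err != nil {
			continue // endpoint unreachable or unhealthy; skip it
		}
		if resp.Header.Revision > bestRev {
			bestEp, bestRev = ep, resp.Header.Revision
		}
	}
	if bestRev < 0 {
		return "", 0, fmt.Errorf("no reachable etcd endpoint")
	}
	return bestEp, bestRev, nil
}

func main() {
	ep, rev, err := maxRevEndpoint([]string{
		"https://10.209.198.41:2379", // unreachable in this report
		"https://10.209.198.42:2379",
		"https://10.209.198.43:2379",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("would snapshot from %s at revision %d\n", ep, rev)
}
```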

We would appreciate your feedback on this.

Thanks,
Kannan V

hexfusion (Member) commented

> Is it possible to take the snapshot when at least one of the etcd endpoints is available, rather than expecting all of the etcd endpoints to be available and reachable?

@kannanvr I think this is reasonable, as the limitation from etcd is only that the cluster must be quorate, so you would need 2 of 3 members in this case. A single running node means you have lost quorum, in which case a proper snapshot via the API is not possible. Would you like to help us dig into this?
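
For what it's worth, once a healthy endpoint has been selected (for example by the highest reachable revision, as in the sketch above), streaming a snapshot from that single member could look roughly like the following. This is a sketch against the clientv3 Maintenance API, not the operator's actual backup path, and TLS configuration is again omitted:

```go
package main

import (
	"context"
	"io"
	"os"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// saveSnapshot streams a snapshot from a single reachable endpoint to a file.
func saveSnapshot(endpoint, path string) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// Maintenance API: streams the member's backend snapshot.
	rc, err := cli.Snapshot(ctx)
	if err != nil {
		return err
	}
	defer rc.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, rc)
	return err
}

func main() {
	if err := saveSnapshot("https://10.209.198.42:2379", "/tmp/etcd-backup.db"); err != nil {
		panic(err)
	}
}
```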

alaypatel07 (Collaborator) commented Apr 4, 2019

@kannanvr I think this is related to #2030. Here is what is happening:

  1. Even though the endpoint is unavailable, it is still being published by the client service because of the bug mentioned in that issue.
  2. By bad luck, the service load balancer picked the unavailable endpoint for the backup request, and it failed.

Because of the dependency on the service load balancer, this bug might be hard to reproduce as well.

You can try the following workaround:

  1. Temporarily delete the etcd-operator.
  2. Edit the etcd-cluster-client service and remove the annotation service.alpha.kubernetes.io/tolerate-unready-endpoints: "true" (a client-go sketch of this step follows the list).
  3. Run the backup operator again and see if this works.
  4. Bring the etcd-operator back, and it will pick up managing the cluster again.
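
In case you want to script step 2 instead of using kubectl edit, here is a rough client-go sketch; the namespace, service name, and kubeconfig path below are assumptions for your cluster:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config); adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Assumed names; substitute your cluster's namespace and client service.
	const ns, svcName = "tcl-cluster", "example-etcd-cluster-client"

	svc, err := clientset.CoreV1().Services(ns).Get(svcName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Drop the annotation so unready endpoints are no longer published.
	delete(svc.Annotations, "service.alpha.kubernetes.io/tolerate-unready-endpoints")
	if _, err := clientset.CoreV1().Services(ns).Update(svc); err != nil {
		panic(err)
	}
	fmt.Println("annotation removed from", svcName)
}
```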

IMO once #2063 lands, this issue should go away, unless I am missing something.

kannanvr (Author) commented Apr 9, 2019

@hexfusion @alaypatel07 , Thanks for your guidance.

I think #2063 solves the problem for me...
