
ETCD snapshot is not taken when only 2 of 3 endpoints are reachable #2070

Open
kannanvr opened this issue Apr 2, 2019 · 3 comments


kannanvr commented Apr 2, 2019

Hi,
In our cluster we have 3 etcd endpoints available.
One of the endpoints is not reachable due to a firewall issue, so 2 of the 3 etcd endpoints are reachable. However, the etcd operator is not taking the snapshot: it expects all of the listed endpoints to be available.

Following is the error message:

```
time="2019-04-02T08:59:13Z" level=info msg="getMaxRev: endpoint https://10.209.198.42:2379 revision (11959285)"
time="2019-04-02T08:59:13Z" level=info msg="getMaxRev: endpoint https://10.209.198.43:2379 revision (11959285)"
time="2019-04-02T08:59:13Z" level=error msg="error syncing etcd backup (tcl-cluster/example-etcd-backup-vj2c5): failed to save snapshot (create etcd client failed: failed to get etcd client with maximum kv store revision: failed to create etcd client for endpoint (https://10.209.198.41:2379): dial tcp 10.209.198.41:2379: connect: no route to host\n)" pkg=controller
```

Is it possible to take the snapshot when at least one of the etcd endpoints is available, rather than expecting all of the etcd endpoints to be available and reachable?
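
For illustration, here is a rough sketch (not the operator's actual code; TLS configuration is omitted for brevity) of how a revision probe could skip unreachable endpoints instead of aborting, using the etcd clientv3 API:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// maxRevEndpoint probes each endpoint individually and returns the reachable
// endpoint with the highest KV store revision, skipping endpoints that fail
// instead of failing the whole backup.
func maxRevEndpoint(endpoints []string) (string, int64, error) {
	var bestEp string
	var bestRev int64 = -1

	for _, ep := range endpoints {
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   []string{ep},
			DialTimeout: 5 * time.Second,
			// NOTE: real https endpoints would also need a TLS config here.
		})
		if err != nil {
			continue // skip an endpoint we cannot even configure a client for
		}
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		resp, err := cli.Status(ctx, ep)
		cancel()
		cli.Close()
		if err != nil {
			continue // endpoint unreachable or unhealthy; skip it
		}
		if resp.Header.Revision > bestRev {
			bestEp, bestRev = ep, resp.Header.Revision
		}
	}
	if bestRev < 0 {
		return "", 0, fmt.Errorf("no reachable etcd endpoint")
	}
	return bestEp, bestRev, nil
}

func main() {
	ep, rev, err := maxRevEndpoint([]string{
		"https://10.209.198.41:2379", // unreachable in this report
		"https://10.209.198.42:2379",
		"https://10.209.198.43:2379",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("would snapshot from %s at revision %d\n", ep, rev)
}
```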

We would appreciate your feedback on this.

Thanks,
Kannan V

hexfusion (Member) commented

> Is it possible to take the snapshot when at least one of the etcd endpoints is available, rather than expecting all of the etcd endpoints to be available and reachable?

@kannanvr I think this is reasonable, as the limitation from etcd is only that the cluster must be quorate, so you would need 2 of 3 members in this case. A single running node means you have lost quorum, in which case a proper snapshot via the API is not possible. Would you like to help us dig into this?
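
For what it's worth, once a healthy endpoint has been selected (for example by the highest reachable revision, as in the sketch above), streaming a snapshot from that single member could look roughly like the following. This is a sketch against the clientv3 Maintenance API, not the operator's actual backup path, and TLS configuration is again omitted:

```go
package main

import (
	"context"
	"io"
	"os"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// saveSnapshot streams a snapshot from a single reachable endpoint to a file.
func saveSnapshot(endpoint, path string) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// Maintenance API: streams the member's backend snapshot.
	rc, err := cli.Snapshot(ctx)
	if err != nil {
		return err
	}
	defer rc.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, rc)
	return err
}

func main() {
	if err := saveSnapshot("https://10.209.198.42:2379", "/tmp/etcd-backup.db"); err != nil {
		panic(err)
	}
}
```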

alaypatel07 (Collaborator) commented Apr 4, 2019

@kannanvr I think this is related to #2030. Here is what is happening:

  1. Even though the endpoint is unavailable, it is still being published by the client service because of the bug mentioned in that issue.
  2. By bad luck, the service load balancer picked the unavailable endpoint for the backup request, and it failed.

Because of the dependency on the service load balancer, this bug might be hard to reproduce as well.

You can try the following workaround:

  1. Temporarily delete the etcd-operator.
  2. Edit the etcd-cluster-client service and remove the annotation service.alpha.kubernetes.io/tolerate-unready-endpoints: "true" (a client-go sketch of this step follows the list).
  3. Run the backup operator again and see if this works.
  4. Bring the etcd-operator back, and it will pick up managing the cluster again.
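
In case you want to script step 2 instead of using kubectl edit, here is a rough client-go sketch; the namespace, service name, and kubeconfig path below are assumptions for your cluster:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config); adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Assumed names; substitute your cluster's namespace and client service.
	const ns, svcName = "tcl-cluster", "example-etcd-cluster-client"

	svc, err := clientset.CoreV1().Services(ns).Get(svcName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Drop the annotation so unready endpoints are no longer published.
	delete(svc.Annotations, "service.alpha.kubernetes.io/tolerate-unready-endpoints")
	if _, err := clientset.CoreV1().Services(ns).Update(svc); err != nil {
		panic(err)
	}
	fmt.Println("annotation removed from", svcName)
}
```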

IMO once #2063 lands, this issue should go away, unless I am missing something.

kannanvr (Author) commented Apr 9, 2019

@hexfusion @alaypatel07 , Thanks for your guidance.

I think #2063 solves the problem for me...
