etcd cluster losing consensus #6276
Comments
@jhgg What do you mean by losing consensus? Do you observe inconsistent data, or do you mean that leader elections are happening?
The data seems to be consistent; the cluster just stops accepting writes because it can no longer elect a leader. It basically gets stuck voting.
@jhgg How did you resolve the stuck state? By restarting the cluster, or did it recover on its own?
We have a script that reads from one of the reachable nodes in the cluster, retrieves the contents of all keys recursively, and dumps them to a JSON file. It then firewalls the cluster off from the rest of the network (our services compensate by using stale data for discovery while the cluster is unreachable), deletes the data directories, starts a new cluster, and writes the old discovery data back before making the cluster reachable again.
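For readers following along, the dump and restore halves of such a script could be sketched roughly as below. This is not the poster's actual script: it assumes the etcd v2 HTTP keys API, and the function names and the `http://127.0.0.1:2379` endpoint in the usage comment are made up for illustration. Firewalling the cluster and wiping the data directories are left out, and TTLs and empty directories are not preserved.

```python
import json
import urllib.parse
import urllib.request


def flatten_nodes(node):
    """Yield (key, value) pairs for every leaf under an etcd v2 node."""
    if node.get("dir"):
        # Directory node: recurse into its children, if it has any.
        for child in node.get("nodes", []):
            yield from flatten_nodes(child)
    elif "key" in node:
        yield node["key"], node.get("value", "")


def dump_keys(base_url, path):
    """Read the whole keyspace from one reachable member into a JSON file."""
    url = base_url + "/v2/keys/?recursive=true"
    with urllib.request.urlopen(url) as resp:
        tree = json.load(resp)
    with open(path, "w") as f:
        json.dump(dict(flatten_nodes(tree["node"])), f, indent=2)


def restore_keys(base_url, path):
    """Write the dumped key/value pairs back into a freshly started cluster."""
    with open(path) as f:
        pairs = json.load(f)
    for key, value in pairs.items():
        # v2 keys start with "/", so they append directly to /v2/keys.
        url = base_url + "/v2/keys" + urllib.parse.quote(key)
        data = urllib.parse.urlencode({"value": value}).encode()
        req = urllib.request.Request(url, data=data, method="PUT")
        urllib.request.urlopen(req).close()


# Hypothetical usage (endpoint and file name are assumptions):
#   dump_keys("http://127.0.0.1:2379", "discovery-backup.json")
#   ... isolate the cluster, wipe data directories, start a new cluster ...
#   restore_keys("http://127.0.0.1:2379", "discovery-backup.json")
```

Storing a flat key-to-value map rather than the raw tree keeps the restore step to one PUT per key, at the cost of losing directory metadata.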
Interesting. We will upgrade to the latest 2.3.x release tonight and see whether it happens again.
@jhgg I am closing this issue. Please reopen it if this happens again after the upgrade. You can upgrade directly from 2.3.0 to 2.3.7. Thanks for reporting.
Thanks for the quick response @xiang90 ❤️
We run a 3-node etcd cluster that has been running just fine since Jan 2016 without many changes (aside from updating to etcd 2.3.0). Around the time of the failure, we see these log lines on the servers:
etcd-prd-1-1
etcd-prd-1-2
etcd-prd-1-3
Can anyone provide pointers on where to look? The cluster had been completely fine for months, but this kind of thing has happened twice in the past 2 weeks. CPU usage and load on the nodes were normal during this time, according to our metrics.