etcd2 cluster ID mismatch #3710
Comments
@haozhenxiao One of the members was bootstrapped via the discovery service. You must remove the previous data-dir to clean up the member information; otherwise the member will ignore the new configuration and start with the old one. That is why you see the mismatch. See https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#lifecycle for more details. Thanks.
@xiang90 If -data-dir is not specified, where is the data directory by default? I have now deleted all the data directories, but I still get the cluster ID mismatch complaint.
@haozhenxiao The default of -data-dir is documented here: https://github.com/coreos/etcd/blob/master/Documentation/configuration.md#-data-dir |
@haozhenxiao When etcd starts, it will print out the data dir it uses and how it starts (using old configuration or bootstrap). You can check it yourself. |
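As a minimal sketch of the check described above (not part of etcd itself), a small shell helper can tell whether a data directory already holds member state that etcd would reuse instead of the new bootstrap flags. The function name has_member_state is hypothetical; the member subdirectory corresponds to the "member dir" line that etcd prints at startup.

```shell
#!/bin/sh
# Hypothetical helper: succeed if DATA_DIR already contains member state.
# etcd keeps that state under <data-dir>/member; if it is present, etcd
# restarts with the stored configuration rather than bootstrapping anew.
has_member_state() {
    data_dir=$1
    [ -d "$data_dir/member" ] && [ -n "$(ls -A "$data_dir/member" 2>/dev/null)" ]
}

# Usage (the path is only an example):
#   if has_member_state /var/lib/etcd2; then
#       echo "old member state found; etcd will ignore new bootstrap flags"
#   fi
```

The helper only inspects the directory; deciding whether to delete it is left to the operator, since wiping the state of a member that is still registered in a running cluster has its own consequences.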
On a related note for someone's future use, I also had this happen when I added the $WAL_DIR parameter to an existing cluster's startup config, but forgot to move the existing wal directory contents to the new location on a couple of nodes. That is pretty much the same problem: a partially missing data directory. :)
Looks like same here:
Data dir is empty on start:
Starting the server:
What am I doing wrong? |
@Bregor Did you find any solution? I am having the exact same issue with etcd2 |
@alogoc nope, sorry :( |
I am facing same issue :( |
I had a similar problem. I stopped one node because I wanted its data directory in another place. After removing the etcd member I tried to add it back with the new data dir. I ran the etcdctl command to add the etcd member and afterwards tried to start the new member, with the same result you have: cluster ID mismatch. After some time I realized that I have to use -initial-cluster-state existing, as provided by the etcdctl member add command:
After changing my static docker setup to:
all logs are clean and the cluster reports healthy. Maybe that will help you.
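The settings that etcdctl member add prints for the new member, including the existing cluster state mentioned above, can be captured rather than retyped. A rough sketch follows; the function name extract_member_env and the file name etcd1.env are hypothetical, and the member name and URL in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: keep only the ETCD_* assignments from the text that
# "etcdctl member add" prints, so they can be saved to an environment file.
extract_member_env() {
    grep '^ETCD_[A-Z_]*='
}

# Usage (assumes etcdctl can reach a healthy member of the cluster):
#   etcdctl member add etcd1 http://10.10.26.160:2380 | extract_member_env > etcd1.env
```

The resulting file should contain an ETCD_INITIAL_CLUSTER_STATE line set to existing, which is the value the new member needs when it starts with an empty data directory.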
I'm clearing all the data in /var/lib/etcd and using --initial-cluster-state existing for all nodes other than the first one, but I still get output like the one below, which makes me think not all the data was cleared:
How can I fix this? |
Actually..it was stale data..I was backing etcd with gluster and seems it was working too well |
Or not..I cleared all the nodes of the data but still when I run etcd I get
Where is this coming from? |
@jonathan-kosgei did you remove/add the node with etcdctl before restarting it with a wiped data directory? Since raft expects members to acknowledge writes, it also expects the member to keep any writes acknowledged; wiping the data directory drops that data, so the member is effectively lost. To get the node running again, the member has to be removed/added through etcdctl.
I'm running etcd on kubernetes, once I delete the dirs I simply delete/recreate the pods |
@jonathan-kosgei if it's deleting/recreating pods for a single member instead of the entire cluster, the member for that single member pod needs to be removed with etcdctl member remove, then added again with etcdctl member add so that the cluster knows it's a fresh member. Deleting/recreating a pod won't communicate that it's a new member to etcd. There's also the etcd-operator project (https://github.com/coreos/etcd-operator/) which can help simplify this process for managing etcd members under kubernetes.
Solved this, by adding the member via curl/etcdctl before starting it i.e.
the new node
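The remove-then-add sequence discussed above could be wrapped roughly as follows. This is only a sketch: the function name replace_member and the ETCDCTL override variable are hypothetical, and the argument order for member add assumes the etcdctl v2 syntax (name followed by peer URL); newer etcdctl versions take the peer URL as a flag instead.

```shell
#!/bin/sh
# Which etcdctl binary to call; overridable for dry runs (e.g. ETCDCTL=echo).
ETCDCTL=${ETCDCTL:-etcdctl}

# Hypothetical helper: tell the cluster that a member is being replaced.
#   $1 = ID of the old member (as shown by "etcdctl member list")
#   $2 = name of the new member
#   $3 = peer URL of the new member
replace_member() {
    old_id=$1 name=$2 peer_url=$3
    # Remove the stale member first, then register the fresh one.
    "$ETCDCTL" member remove "$old_id" &&
        "$ETCDCTL" member add "$name" "$peer_url"
}

# Afterwards, start the new member with an empty data dir and
# -initial-cluster-state existing.
```

Setting ETCDCTL=echo before calling replace_member prints the two commands instead of running them, which is a cheap way to review them first.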
I ran into the same error with 3.1.8.
The funny thing is that none of my three pods actually has either of these two IDs. Here's the list of IDs as mentioned by the log message
edit 1:
Why is the member ID here different from the one in the other log message? And why are they starting different clusters instead of the same one? For the setup see #8079
I get it now. If you create an etcd member using the command
it will create a new member ID, but the target host still has the old ID, which etcd saved under --data-dir {pathfile} when it ran. You need to delete the data files and start the member with --initial-cluster-state existing.
+1
+1 , Starting with |
The links in replies on this issue are all broken. Here are working ones: |
I have two CoreOS machines with the IPs 10.10.26.160 and 10.10.24.156. I'm using static bootstrapping. The bootstrap script for 10.10.26.160 is:
etcd2 -name etcd1 -data-dir data \
  -advertise-client-urls http://10.10.26.160:2379 \
  -listen-client-urls http://10.10.26.160:2379,http://127.0.0.1:2379 \
  -initial-advertise-peer-urls http://10.10.26.160:2380 \
  -listen-peer-urls http://10.10.26.160:2380 \
  -initial-cluster-token etcd-cluster-2 \
  -initial-cluster etcd0=http://10.10.24.156:2380,etcd1=http://10.10.26.160:2380 \
  -initial-cluster-state new
the bootstrap of 10.10.24.156 is:
etcd2 -name etcd0 -data-dir data \
  -advertise-client-urls http://10.10.24.156:2379 \
  -listen-client-urls http://10.10.24.156:2379,http://127.0.0.1:2379 \
  -initial-advertise-peer-urls http://10.10.24.156:2380 \
  -listen-peer-urls http://10.10.24.156:2380 \
  -initial-cluster-token etcd-cluster-2 \
  -initial-cluster etcd0=http://10.10.24.156:2380,etcd1=http://10.10.26.160:2380 \
  -initial-cluster-state new
When I ran the two scripts on the two CoreOS machines, I got errors. The errors on 10.10.26.160 are:
2015/10/16 07:51:46 raft: 8f87889e2f3130e3 is starting a new election at term 88
2015/10/16 07:51:46 raft: 8f87889e2f3130e3 became candidate at term 89
2015/10/16 07:51:46 raft: 8f87889e2f3130e3 received vote from 8f87889e2f3130e3 at term 89
2015/10/16 07:51:46 raft: 8f87889e2f3130e3 [logterm: 1, index: 3] sent vote request to db30be88917b6839 at term 89
2015/10/16 07:51:46 raft: 8f87889e2f3130e3 [logterm: 1, index: 3] sent vote request to f2d68f8a4e38f628 at term 89
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:46 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:47 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:47 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:47 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:47 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:47 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:47 rafthttp: request sent was ignored (cluster ID mismatch: remote[db30be88917b6839]=ac5f3aa02066b598, local=9b09b40f488fe304)
2015/10/16 07:51:47 rafthttp: failed to dial db30be88917b6839 on stream Message (dial tcp 10.10.24.156:2380: connection refused)
2015/10/16 07:51:47 rafthttp: failed to dial db30be88917b6839 on stream MsgApp v2 (dial tcp 10.10.24.156:2380: connection refused)
2015/10/16 07:51:47 rafthttp: failed to dial f2d68f8a4e38f628 on stream Message (dial tcp 10.10.24.161:2380: no route to host)
the error of 10.10.24.156 is:
2015/10/16 07:51:44 raft: 6ada9347d44a3950 is starting a new election at term 355
2015/10/16 07:51:44 raft: 6ada9347d44a3950 became candidate at term 356
2015/10/16 07:51:44 raft: 6ada9347d44a3950 received vote from 6ada9347d44a3950 at term 356
2015/10/16 07:51:44 raft: 6ada9347d44a3950 [logterm: 1, index: 2] sent vote request to 17c82b75bae3cfdf at term 356
2015/10/16 07:51:44 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:44 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: request received was ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: failed to write 17c82b75bae3cfdf on pipeline (dial tcp 10.10.48.217:2390: i/o timeout)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:45 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:46 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:46 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:46 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
2015/10/16 07:51:46 rafthttp: streaming request ignored (cluster ID mismatch got 9b09b40f488fe304 want ac5f3aa02066b598)
I also checked the status of etcd2 using systemctl status -l etcd2. It seems that etcd2 on the 10.10.24.156 machine sometimes stops running and sometimes recovers by itself; the output on that machine varies between:
etcd2.service - etcd2
Loaded: loaded (/usr/lib64/systemd/system/etcd2.service; disabled; vendor preset: disabled)
Drop-In: /run/systemd/system/etcd2.service.d
└─20-cloudinit.conf
Active: activating (auto-restart) since Mon 2015-10-19 06:58:15 UTC; 9s ago
Process: 1109 ExecStart=/usr/bin/etcd2 (code=exited, status=0/SUCCESS)
Main PID: 1109 (code=exited, status=0/SUCCESS)
and:
etcd2.service - etcd2
Loaded: loaded (/usr/lib64/systemd/system/etcd2.service; disabled; vendor preset: disabled)
Drop-In: /run/systemd/system/etcd2.service.d
└─20-cloudinit.conf
Active: active (running) since Mon 2015-10-19 06:58:25 UTC; 784ms ago
Main PID: 1118 (etcd2)
Memory: 4.1M
CPU: 781ms
CGroup: /system.slice/etcd2.service
└─1118 /usr/bin/etcd2
Oct 19 06:58:25 localhost etcd2[1118]: 2015/10/19 06:58:25 etcdserver: recovered store from snapshot at index 160016
Oct 19 06:58:25 localhost etcd2[1118]: 2015/10/19 06:58:25 etcdserver: name = fab7de8892ac4659aa45ab6c640bb05d
Oct 19 06:58:25 localhost etcd2[1118]: 2015/10/19 06:58:25 etcdserver: data dir = /var/lib/etcd2
Oct 19 06:58:25 localhost etcd2[1118]: 2015/10/19 06:58:25 etcdserver: member dir = /var/lib/etcd2/member
Oct 19 06:58:25 localhost etcd2[1118]: 2015/10/19 06:58:25 etcdserver: heartbeat = 100ms
Oct 19 06:58:25 localhost etcd2[1118]: 2015/10/19 06:58:25 etcdserver: election = 1000ms
Oct 19 06:58:25 localhost etcd2[1118]: 2015/10/19 06:58:25 etcdserver: snapshot count = 10000
Oct 19 06:58:25 localhost etcd2[1118]: 2015/10/19 06:58:25 etcdserver: discovery URL= https://discovery.etcd.io/c0eb5523934c5a502f2e314f9326781f
Oct 19 06:58:25 localhost etcd2[1118]: 2015/10/19 06:58:25 etcdserver: advertise client URLs = http://:2379,http://:4001
Oct 19 06:58:25 localhost etcd2[1118]: 2015/10/19 06:58:25 etcdserver: loaded cluster information from store:
The output of etcdctl member list on 10.10.26.160 is "client: no endpoints available", and on the 10.10.24.156 machine it is "client: etcd cluster is unavailable or misconfigured".
What am I missing while bootstrapping etcd2 here? Thank you in advance.
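To see at a glance which cluster IDs the two sides disagree on, the mismatch lines in logs like the ones above can be condensed. The following is a sketch only: the function name mismatch_summary is hypothetical, it handles just the two message shapes quoted in this post, and it assumes that "got" is the ID sent by the peer and "want" is the local one.

```shell
#!/bin/sh
# Hypothetical helper: read etcd log lines on stdin and print one line per
# distinct cluster ID mismatch, labelled with the remote and local IDs.
mismatch_summary() {
    sed -n \
        -e 's/.*cluster ID mismatch: remote\[\([0-9a-f]*\)]=\([0-9a-f]*\), local=\([0-9a-f]*\)).*/peer=\1 remote=\2 local=\3/p' \
        -e 's/.*cluster ID mismatch got \([0-9a-f]*\) want \([0-9a-f]*\)).*/remote=\1 local=\2/p' |
        sort -u
}

# Usage (the unit name is taken from the systemctl output above):
#   journalctl -u etcd2 --no-pager | mismatch_summary
```

If the local ID on one machine shows up as the remote ID on the other and vice versa, the two members were bootstrapped into different clusters, which usually points back to leftover data in one member's data directory.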