vote endpoint returns 404 on new cluster #815

jefferai · 2014-05-30T14:18:57Z

I created a fresh cluster of three machines yesterday, using the 0.4.1 tag. I added one key to the cluster (/global/serfkey) and then left it alone overnight.

It's a very simple cluster:

etcdctl -debug -peers http://others:4001 ls --recursive /_etcd/machines
Cluster-Peers: http://10.8.8.108:4001 http://10.8.8.106:4001
Curl-Example: curl -X GET http://10.8.8.108:4001/v2/keys/_etcd/machines?consistent=true&recursive=true&sorted=false
/_etcd/machines/others
/_etcd/machines/hurley

etcdctl -debug -peers http://others:4001 ls --recursive /
Cluster-Peers: http://10.8.8.108:4001 http://10.8.8.106:4001
Curl-Example: curl -X GET http://10.8.8.108:4001/v2/keys/?consistent=true&recursive=true&sorted=false
/global
/global/serfkey

Then I tried to simply add a directory:

etcdctl -debug -peers http://others:4001 mkdir /roles/eventcenter
Cluster-Peers: http://10.8.8.108:4001 http://10.8.8.106:4001
Curl-Example: curl -X PUT http://10.8.8.108:4001/v2/keys/roles/eventcenter?dir=true&prevExist=false

Curl-Example: curl -X PUT http://10.8.8.108:4001/v2/keys/roles/eventcenter?dir=true&prevExist=false
Curl-Example: curl -X PUT http://10.8.8.108:4001/v2/keys/roles/eventcenter?dir=true&prevExist=false
Curl-Example: curl -X PUT http://10.8.8.108:4001/v2/keys/roles/eventcenter?dir=true&prevExist=false
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

One of my three machines (hurley at 10.8.8.108) is now spitting out:

[etcd] May 30 14:17:51.322 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:51.579 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:51.579 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:51.900 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:51.900 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.197 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.197 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.537 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.537 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.882 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.882 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

Restarting etcd doesn't help. The other two machines can't connect to it, fail to generate a quorum between them, and the whole cluster is unusable (and I'm hoping not doomed).

The text was updated successfully, but these errors were encountered:

jefferai · 2014-05-30T14:49:49Z

Here is output with debugging set up high:

From the machines not outputting the above output:

[etcd] May 30 14:48:26.124 DEBUG     | Error parsing response to etcd error: json: cannot unmarshal number into Go value of type error.Error
[etcd] May 30 14:48:26.124 DEBUG     | fail getting machine messages from http://10.8.8.110:7001
[etcd] May 30 14:48:26.124 DEBUG     | fail getting machine messages from http://10.8.8.108:7001
[etcd] May 30 14:48:26.125 DEBUG     | Error parsing response to etcd error: invalid character '<' looking for beginning of value
[etcd] May 30 14:48:26.125 DEBUG     | fail getting machine messages from http://10.8.8.106:7001
[etcd] May 30 14:48:26.126 WARNING   | fail syncing cluster([http://10.8.8.110:7001 http://10.8.8.108:7001 http://10.8.8.106:7001]): unreachable cluster
[etcd] May 30 14:48:31.128 DEBUG     | Error parsing response to etcd error: json: cannot unmarshal number into Go value of type error.Error
[etcd] May 30 14:48:31.128 DEBUG     | fail getting machine messages from http://10.8.8.110:7001
[etcd] May 30 14:48:31.128 DEBUG     | fail getting machine messages from http://10.8.8.108:7001
[etcd] May 30 14:48:31.129 DEBUG     | Error parsing response to etcd error: invalid character '<' looking for beginning of value
[etcd] May 30 14:48:31.129 DEBUG     | fail getting machine messages from http://10.8.8.106:7001
[etcd] May 30 14:48:31.129 WARNING   | fail syncing cluster([http://10.8.8.110:7001 http://10.8.8.108:7001 http://10.8.8.106:7001]): unreachable cluster

And from the machine outputting the above output:

[etcd] May 30 14:49:16.987 DEBUG     | Send Vote from hurley to http://10.8.8.106:7001
[raft]14:49:16.987438 peer.vote:  hurley -> desmond
[etcd] May 30 14:49:16.987 DEBUG     | Send Vote from hurley to http://10.8.8.110:7001
[etcd] May 30 14:49:16.988 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]14:49:16.988917 peer.vote.failed:  hurley <- others
[etcd] May 30 14:49:16.989 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]14:49:16.989714 peer.vote.failed:  hurley <- desmond
[raft]14:49:17.384085 peer.vote:  hurley -> others
[etcd] May 30 14:49:17.384 DEBUG     | Send Vote from hurley to http://10.8.8.106:7001
[raft]14:49:17.384166 peer.vote:  hurley -> desmond
[etcd] May 30 14:49:17.384 DEBUG     | Send Vote from hurley to http://10.8.8.110:7001
[etcd] May 30 14:49:17.385 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]14:49:17.385406 peer.vote.failed:  hurley <- desmond
[etcd] May 30 14:49:17.385 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]14:49:17.385961 peer.vote.failed:  hurley <- others
[raft]14:49:17.588297 peer.vote:  hurley -> others
[etcd] May 30 14:49:17.588 DEBUG     | Send Vote from hurley to http://10.8.8.106:7001
[raft]14:49:17.588398 peer.vote:  hurley -> desmond
[etcd] May 30 14:49:17.588 DEBUG     | Send Vote from hurley to http://10.8.8.110:7001
[etcd] May 30 14:49:17.589 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]14:49:17.589725 peer.vote.failed:  hurley <- others
[etcd] May 30 14:49:17.590 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

jefferai · 2014-05-30T15:15:18Z

Something is really wrong here; here is HTTP-level output between the instances:

###
T 10.8.8.108:60866 -> 10.8.8.106:7001 [AP]
POST /vote HTTP/1.1.
Host: 10.8.8.106:7001.
User-Agent: Go 1.1 package http.
Content-Length: 17.
Accept-Encoding: gzip.
.
.........".hurley
#
T 10.8.8.108:45244 -> 10.8.8.110:7001 [AP]
POST /vote HTTP/1.1.
Host: 10.8.8.110:7001.
User-Agent: Go 1.1 package http.
Content-Length: 17.
Accept-Encoding: gzip.
.
.........".hurley
#
T 10.8.8.106:7001 -> 10.8.8.108:60866 [AP]
HTTP/1.1 404 Not Found.
Content-Type: text/plain; charset=utf-8.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Length: 19.
.
404 page not found

##
T 10.8.8.110:7001 -> 10.8.8.108:45244 [AP]
HTTP/1.1 404 Not Found.
Content-Type: text/plain; charset=utf-8.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Length: 19.
.
404 page not found

#####
T 10.8.8.106:52793 -> 10.8.8.108:7001 [AP]
GET /v2/admin/machines HTTP/1.1.
Host: 10.8.8.108:7001.
User-Agent: Go 1.1 package http.
Accept-Encoding: gzip.
.

##
T 10.8.8.108:7001 -> 10.8.8.106:52793 [AP]
HTTP/1.1 200 OK.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Type: text/plain; charset=utf-8.
Transfer-Encoding: chunked.
.
14a.
[{"name":"desmond","state":"follower","clientURL":"http://10.8.8.110:4001","peerURL":"http://10.8.8.110:7001"},{"name":"hurley","state":"follower","clientURL":"http://10.8.8.108:4001","peerURL":"http://10.8.8.108:7001"},{"name":"others","state":"follower","clientURL":"http://10.8.8.106:4001","peerURL":"http://10.8.8.106:7001"}]
.

#
T 10.8.8.108:7001 -> 10.8.8.106:52793 [AP]
0.
.

#########
T 10.8.8.106:52794 -> 10.8.8.108:7001 [AP]
GET /v2/admin/config HTTP/1.1.
Host: 10.8.8.108:7001.
User-Agent: Go 1.1 package http.
Accept-Encoding: gzip.
.

##
T 10.8.8.108:7001 -> 10.8.8.106:52794 [AP]
HTTP/1.1 200 OK.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Type: text/plain; charset=utf-8.
Transfer-Encoding: chunked.
.
32.
{"activeSize":9,"removeDelay":5,"syncInterval":5}
.

#
T 10.8.8.108:7001 -> 10.8.8.106:52794 [AP]
0.
.

######
T 10.8.8.108:60866 -> 10.8.8.106:7001 [AP]
POST /vote HTTP/1.1.
Host: 10.8.8.106:7001.
User-Agent: Go 1.1 package http.
Content-Length: 17.
Accept-Encoding: gzip.
.
.........".hurley
#
T 10.8.8.108:45244 -> 10.8.8.110:7001 [AP]
POST /vote HTTP/1.1.
Host: 10.8.8.110:7001.
User-Agent: Go 1.1 package http.
Content-Length: 17.
Accept-Encoding: gzip.
.
.........".hurley
#
T 10.8.8.110:7001 -> 10.8.8.108:45244 [AP]
HTTP/1.1 404 Not Found.
Content-Type: text/plain; charset=utf-8.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Length: 19.
.
404 page not found

##
T 10.8.8.106:7001 -> 10.8.8.108:60866 [AP]
HTTP/1.1 404 Not Found.
Content-Type: text/plain; charset=utf-8.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Length: 19.
.
404 page not found

yichengq · 2014-05-30T16:07:41Z

@jefferai
This is a severe bug.
Could you do me a favor to grep the standby_info, conf and log under all three etcd directories?
The scenario is that two nodes come into standby node, while the third one has no ideas about that at all.

jefferai · 2014-05-30T16:58:42Z

@unihorn
You can find the files from all three here: (removed)

Please let me know when you have them, and I will remove the link.

If you need the snapshots, let me know how to best get them to you privately (I could send a link via GitHub message for instance).

tarnfeld · 2014-07-08T07:26:27Z

Hey guys, looks like this was fixed and release – however i've just run into this exact issue after some unexpected network latency. Took the whole cluster down (all 8 nodes, with an active set of 3) on version 0.4.3.

I managed to get the cluster back up by removing standby_info from every etcd directory on the cluster, and that brought things back.

djannot · 2014-07-20T09:46:21Z

Same issue on 0.4.5. It happened 3 times during the last 2 weeks. Removing standby_info on each node and rebooting them also fixed the issue.

litch · 2014-07-21T21:53:54Z

I'm having what i think may be the same issue. I have a 6-node cluster on Rackspace that's been up for several hours etcd just stopped working completely. I'm on:

NAME=CoreOS
VERSION=353.0.0

Output of journalctl -u etcd on one node (web2)

Jul 21 21:48:10 web2 etcd[1081]: [etcd] Jul 21 21:48:10.785 WARNING   | [vote] Error: nil response
Jul 21 21:48:12 web2 etcd[1081]: [etcd] Jul 21 21:48:12.080 WARNING   | [vote] Error: nil response
Jul 21 21:48:13 web2 etcd[1081]: [etcd] Jul 21 21:48:13.350 WARNING   | [vote] Error: nil response
Jul 21 21:48:15 web2 etcd[1081]: [etcd] Jul 21 21:48:15.523 WARNING   | [vote] Error: nil response
Jul 21 21:48:17 web2 etcd[1081]: [etcd] Jul 21 21:48:17.131 WARNING   | [vote] Error: nil response
Jul 21 21:48:19 web2 etcd[1081]: [etcd] Jul 21 21:48:19.131 WARNING   | [vote] Error: nil response
Jul 21 21:48:21 web2 etcd[1081]: [etcd] Jul 21 21:48:21.129 WARNING   | [vote] Error: nil response
Jul 21 21:48:22 web2 etcd[1081]: [etcd] Jul 21 21:48:22.617 WARNING   | [vote] Error: nil response
Jul 21 21:48:23 web2 etcd[1081]: [etcd] Jul 21 21:48:23.236 WARNING   | [vote] Error: nil response
Jul 21 21:48:24 web2 etcd[1081]: [etcd] Jul 21 21:48:24.520 WARNING   | [vote] Error: nil response
Jul 21 21:48:26 web2 etcd[1081]: [etcd] Jul 21 21:48:26.098 WARNING   | [vote] Error: nil response

And another (web3):

Jul 21 21:42:39 web3 etcd[716]: [etcd] Jul 21 21:42:39.831 INFO      | af5a2d3e5791488b9b38bf1a8edfe53d: term #782 started.
Jul 21 21:42:41 web3 etcd[716]: [etcd] Jul 21 21:42:41.387 INFO      | af5a2d3e5791488b9b38bf1a8edfe53d: state changed from 'follower' to 'candidate'.
Jul 21 21:42:41 web3 etcd[716]: [etcd] Jul 21 21:42:41.389 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jul 21 21:42:41 web3 etcd[716]: [etcd] Jul 21 21:42:41.389 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jul 21 21:42:41 web3 etcd[716]: [etcd] Jul 21 21:42:41.390 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jul 21 21:42:41 web3 etcd[716]: [etcd] Jul 21 21:42:41.390 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jul 21 21:42:43 web3 etcd[716]: [etcd] Jul 21 21:42:43.367 INFO      | af5a2d3e5791488b9b38bf1a8edfe53d: state changed from 'candidate' to 'follower'.
Jul 21 21:42:43 web3 etcd[716]: [etcd] Jul 21 21:42:43.367 INFO      | af5a2d3e5791488b9b38bf1a8edfe53d: term #784 started.
Jul 21 21:42:45 web3 etcd[716]: [etcd] Jul 21 21:42:45.128 INFO      | af5a2d3e5791488b9b38bf1a8edfe53d: term #785 started.

nnnnathann · 2014-08-29T04:34:04Z

+1 on this, happening to me also

payneio · 2014-09-02T19:04:51Z

+1

ndemoor · 2014-09-16T15:49:20Z

+1 (v0.4.6)

omegahm · 2014-09-24T07:20:47Z

+1

zeisss · 2014-09-24T09:05:54Z

+1, but removing standby_info didn't help.

johnae · 2014-09-24T09:56:53Z

+1 same problem seemingly here. I can't even find a standby_info anywhere.

memcmp · 2014-09-26T13:55:57Z

+1

nnnnathann · 2014-09-26T15:27:39Z

Is everyone here on Rackspace? I suspect these issues stem from the cloud-init process with Rackspace, but not sure.

zeisss · 2014-09-26T16:20:17Z

No, we are on AWS

tarnfeld · 2014-09-26T16:22:24Z

We're running across AWS and a private cloud, it seems to occur on both.

jefferai · 2014-09-30T13:40:47Z

Running on KVM-based virtual machines on a local machine. No cloud-init.

mzsanford · 2014-10-01T18:48:46Z

Seeing this starting two etcd instances¹ on EC2. Hope fully this reproducible test case will help:

Fetch etcd v0.4.6 from the downloads page
Start etcd on EC2 instance A (etcd -discovery https://discovery.etcd.io/$TOKEN -peer-addr=$(curl -qs http://169.254.169.254/latest/meta-data/local-ipv4):7001)
Start etcd on EC2 instance B (same command)
Stop etcd on EC2 instance A
Start etcd on EC2 instance A (same command)
Observe broken state.
- Instance A reporting:

[etcd] Oct  1 21:24:08.770 DEBUG     | Send Vote from ip-10-0-X-A to http://10.0.X.B:7001
[etcd] Oct  1 21:24:08.771 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]21:24:08.771653 peer.vote.failed:  ip-10-0-X-A <- ip-10-0-X-B
[raft]21:24:08.998215 peer.vote:  ip-10-0-X-A -> ip-10-0-X-B

Instance B reporting:

[etcd] Oct  1 21:24:10.797 DEBUG     | [recv] POST http://10.0.X.B:7001/vote [ip-10-0-X-A]
[etcd] Oct  1 21:24:10.797 WARNING   | [vote] Error: nil response

Note that error does not appear if steps 3 & 4 are performed on EC2 instance B. This is reproducible in this case when the initial leader is restarted.

¹ I am planning a cluster of 3+ machines but I was using 2 to test connectivity when I found this. Two nodes is not a production setup but I can't imagine this failure mode is expected even with only two nodes.

Updated: Added debug flags and added more log information.

ryantanner · 2014-10-01T20:47:13Z

Seeing the same problem on EC2. Having extreme difficulty standing up an etcd cluster today whereas the exact same configuration worked flawlessly a few days ago.

philips · 2014-10-01T22:05:24Z

@mzsanford @ryantanner Can you ensure that the machines are not configured with the same name? "ETCD_NAME" and "-name". We are ensuring this mis-configuration can't happen in etcd 0.5 but it is in the users hand to ensure this doesn't happen in 0.4

mzsanford · 2014-10-01T22:31:49Z

My discovery endpoint was showing the EC2 generated host names (ip-10-0-X-X) in the key so that didn't seem the likely culprit. However, adding -name $(curl -qs http://169.254.169.254/latest/meta-data/instance-id) to my command line (thus using the EC2 instance id as a name) seems to correct the issue. At least after a few dozen restarts using a new discovery token I haven't seen it. @ryantanner can you try the same (or something similar)?

philips · 2014-10-02T00:58:22Z

@mzsanford I don't know how you are running etcd but if you do not specify the name flag then it will default to the hostname. And depending on the OS, network config, etc the hostname may end up being non-unique. On CoreOS we use /etc/machine-id which is a UUID-ish thing generated on the first boot of a host.

zeisss · 2014-10-05T09:52:26Z

I just saw this again on one of our aws clusters. As far as I understand a reboot happened to the leader. Another node took over.
When the previous leader then tries to rejoin the cluster, it began to log errors and

Oct 05 08:41:41 ip-172-31-28-172.<snip> etcd[485]: [etcd] Oct  5 08:41:41.593 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

What I found curious is that he first tried to reconnect with itself:

Oct 05 08:41:39 ip-172-31-28-172.<snip> etcd[485]: [etcd] Oct  5 08:41:39.533 INFO      | 58618282333c43cd96bea999eac1b4a5 attempted to join via 172.31.28.172:7001 failed: fail checking join version: Client Internal Error (Get http://172.31.28.172:7001/version: dial tcp 172.31.28.172:7001: connection refused)

The newly elected leader now began panicing with panic: assertion failed: leader.elected.at.same.term.2. The third node in the cluster began logging [vote] Error: nil response.

You can find the logs of the three nodes here: https://gist.github.com/ZeissS/3fd8cb73dc6d59bf4ae0

zeisss · 2014-10-05T09:57:59Z

PS: I found a way to get the cluster back to running: stop all etcd nodes, start the old leader first, then the one node that didn't took over, wait a few seconds, then the temporary leader. This worked for me and etcdctl ls showed me the keys as expected.

mzsanford · 2014-10-06T16:36:14Z

@philips it is started exactly as stated in my original report. I am just starting a test cluster so this is fetched with wget, extracted with tar, change to that directory and run ./etcd with the parameters shown (where that curl command returns the internal IP address). Based on the debug log messages in step 6 of my original report it would seem that the name was correctly deduced as the EC2 instance hostnames follow the ip-X-X-X-X format and this what is shown. After adding -name I have not seen the issue again but since the logged names are legit I fear this will occur again once I rely on the cluster. I'll keep prodding it and keep an eye out.

@zeisss that matches exactly what I had seen. Are you running etcd with -name/ETCD_NAME or letting it be auto-populated?

zeisss · 2014-10-07T09:47:50Z

@mzsanford Whatever the default was on CoreOS 402.0.0 (yeah a rather old image, I know). We didn't temper with the etcd.service file

chakri-n · 2014-10-11T08:19:39Z

I narrowed this down to permission problem on my etcd cluster on fedora. The data directory used by etcd is not owned by user etcd.

jefferai · 2014-10-20T17:24:00Z

FWIW permission problems (as @chakri-n ) was definitely not a cause of the issue for me.

coozy · 2014-10-22T07:18:44Z

Hi, I ran into the same issue as well.
root@[ip-172-31-2-96 ~]# etcd -version
etcd version 0.4.6

Two machines in the cluster - both of them 'followers', the etcd names are unique *

[root@ip-172-31-2-96 ~]# curl -L -s 127.0.0.1:7001/v2/admin/machines
[{"name":"1237ec5afe8e4f1c8ed92500b1f74a70","state":"follower","clientURL":"http://172.31.4.149:4001","peerURL":"http://172.31.4.149:7001"},{"name":"b5bb9c3de09449b09bd4ebb9c0c14978","state":"follower","clientURL":"http://172.31.2.96:4001","peerURL":"http://172.31.2.96:7001"}]

First machine is flooded with the following mesasge *

Oct 22 07:17:42 ip-172-31-2-96 etcd[19237]: [etcd] Oct 22 07:17:42.372 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Oct 22 07:17:43 ip-172-31-2-96 etcd[19237]: [etcd] Oct 22 07:17:43.928 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

Second machine is flooded with the following mesasge *

Oct 22 07:17:02 ip-172-31-4-149 etcd[8833]: [etcd] Oct 22 07:17:02.445 WARNING   | [vote] Error: nil response
Oct 22 07:17:04 ip-172-31-4-149 etcd[8833]: [etcd] Oct 22 07:17:04.843 WARNING   | [vote] Error: nil response

shuoli84 · 2014-10-26T16:42:45Z

same here
0.4.6 using vagrant on OS X.

Oct 26 16:38:39 core-01 etcd[563]: [etcd] Oct 26 16:38:39.740 WARNING   | [vote] Error: nil response
Oct 26 16:38:40 core-01 etcd[563]: [etcd] Oct 26 16:38:40.083 WARNING   | [vote] Error: nil response

kelseyhightower · 2014-10-27T21:28:56Z

@shuoli84 @coozy @jefferai Can we make sure we are using the -name flag with etcd?

jefferai · 2014-11-05T16:26:29Z

I was, yes.

btrepp · 2014-12-19T06:16:51Z

I'm using coreos stable (494.5.0).

etcd --version
etcd version 0.4.6

Getting similar entries

One machine has

Dec 19 06:07:02 core1 etcd[584]: [etcd] Dec 19 06:07:02.638 WARNING   | transporter.ae.decoding.error:proto: field/encoding mismatch: wrong type for field
Dec 19 06:07:03 core1 etcd[584]: [etcd] Dec 19 06:07:03.151 WARNING   | transporter.ae.decoding.error:proto: field/encoding mismatch: wrong type for field

The other two

Dec 19 06:11:23 core2 etcd[1848]: [etcd] Dec 19 06:11:23.714 WARNING   | [ae] Error: nil response
Dec 19 06:11:24 core2 etcd[1848]: [etcd] Dec 19 06:11:24.229 WARNING   | [ae] Error: nil response

This is one baremetal (well actually VMs inside microsofts hyperV, but following baremetal style installs).

Systemd is coming from cloud config I think, I have these style files. Systemctl shows me it's using this unit

 systemctl cat etcd
# /usr/lib64/systemd/system/etcd.service
[Unit]
Description=etcd

[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
Restart=always
RestartSec=10s

# /run/systemd/system/etcd.service.d/20-cloudinit.conf
[Service]
Environment="ETCD_ADDR=192.168.0.2:4001"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/718b5f19527426c6a61fae70d27b3221"
Environment="ETCD_NAME=87e89652b7d9483492764d231ab1e515"
Environment="ETCD_PEER_ADDR=192.168.0.2:7001"
Environment="ETCD_PEER_ELECTION_TIMEOUT=2000"
Environment="ETCD_PEER_HEARTBEAT_INTERVAL=500"

With a different ETCD_ADDR, ETCD_NAME, ETCD_PEER_ADDR for each machine.

Seems etcd 0.4 is somewhat broken?. 0.4.6 is the latest version in any coreos release. So I'm a tad stumped on what to do.

EDIT:

So kicking the leader in the teeth (restarting it via systemctl) seems to have worked. Seems to be some sort of conditional bug in how the leader processes messages from other nodes. So when it occurs it can't decode anything a peer is sending, and it still thinks its the leader and everything breaks.

rca · 2015-01-05T03:31:24Z

I see the same errors as @coozy above:

Jan 05 03:27:57 ip-172-30-0-11.ec2.internal etcd[572]: [etcd] Jan  5 03:27:57.339 WARNING   | [vote] Error: nil response

Jan 05 03:28:32 ip-172-30-0-125.ec2.internal etcd[9798]: [etcd] Jan  5 03:28:32.262 INFO      | 55feb942c50942aebc9406fa236d62c6: state changed from 'candidate' to 'follower'.
Jan 05 03:28:32 ip-172-30-0-125.ec2.internal etcd[9798]: [etcd] Jan  5 03:28:32.262 INFO      | 55feb942c50942aebc9406fa236d62c6: term #2308 started.
Jan 05 03:28:33 ip-172-30-0-125.ec2.internal etcd[9798]: [etcd] Jan  5 03:28:33.965 INFO      | 55feb942c50942aebc9406fa236d62c6: state changed from 'follower' to 'candidate'.
Jan 05 03:28:33 ip-172-30-0-125.ec2.internal etcd[9798]: [etcd] Jan  5 03:28:33.967 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 05 03:28:35 ip-172-30-0-125.ec2.internal etcd[9798]: [etcd] Jan  5 03:28:35.380 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

piaohai · 2015-01-11T09:10:39Z

Jan 11 09:10:06 core-03 etcd[456]: [etcd] Jan 11 09:10:06.730 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:06 core-03 kubelet[950]: I0111 09:10:06.828030 950 logs.go:39] etcd DEBUG: get /registry/nodes/172.17.8.103/boundpods [http://127.0.0.1:4001]
Jan 11 09:10:06 core-03 kubelet[950]: I0111 09:10:06.828187 950 logs.go:39] etcd DEBUG: Connecting to etcd: attempt1forkeys/registry/nodes/172.17.8.103/boundpods?consistent=true&recursive=false&sorted=false
Jan 11 09:10:06 core-03 kubelet[950]: I0111 09:10:06.828213 950 logs.go:39] etcd DEBUG: send.request.to http://127.0.0.1:4001/v2/keys/registry/nodes/172.17.8.103/boundpods?consistent=true&recursive=false&sorted=false | method GET
Jan 11 09:10:06 core-03 kubelet[950]: I0111 09:10:06.849250 950 logs.go:39] etcd DEBUG: network error:Get /v2/keys/registry/nodes/172.17.8.103/boundpods?consistent=true&recursive=false&sorted=false: stopped after 10 redirects
Jan 11 09:10:06 core-03 kubelet[950]: E0111 09:10:06.851528 950 etcd_tools_watch.go:171] watch was unable to retrieve the current index for the provided key ("/registry/nodes/172.17.8.103/boundpods"): 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Jan 11 09:10:06 core-03 kubelet[950]: I0111 09:10:06.851776 950 etcd.go:71] Watch closed (&api.Status{TypeMeta:api.TypeMeta{Kind:"", APIVersion:""}, ListMeta:api.ListMeta{SelfLink:"", ResourceVersion:""}, Status:"Failure", Message:"501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]", Reason:"", Details:(*api.StatusDetails)(nil), Code:0}). Reopening.
Jan 11 09:10:06 core-03 etcd[456]: [etcd] Jan 11 09:10:06.961 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:06 core-03 etcd[456]: [etcd] Jan 11 09:10:06.962 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:07 core-03 etcd[456]: [etcd] Jan 11 09:10:07.303 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:07 core-03 etcd[456]: [etcd] Jan 11 09:10:07.305 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:07 core-03 etcd[456]: [etcd] Jan 11 09:10:07.664 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:07 core-03 etcd[456]: [etcd] Jan 11 09:10:07.668 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

i have met same problem too atfer server crash,

joshuarubin · 2015-01-16T00:57:04Z

+1

xiang90 · 2015-01-16T01:10:17Z

This is a go-raft problem. And go-raft is unmaintained now. We are using new raft now. So this problem should be solved in etcd 2.0. Thanks!

david-gurley · 2015-01-28T23:26:52Z

+1

When will etcd 2.0 get integrated in the alpha channel?

jpmx · 2015-01-29T09:16:09Z

same problem, +1 for etcd 2.0 on alpha

yichengq · 2015-01-29T18:05:11Z

@david-gurley It plans to happen in Feb.
@jpmx If you meet it in 2.0, please open a new issue with details about it.

stianstr · 2015-02-05T20:30:41Z

So it's a go-raft problem, what are the details of that problem? Does it shed any light on a possible workaround for us who want to use etcd, but do not want to run Alpha in production? Because I assume it will take some time before etcd 2.0 makes it way to Stable?

jefferai · 2015-02-06T21:42:16Z

@stianstr For now (and maybe permanently) we've moved to Consul. etcd has great features but it will be a while before I feel good about its raft implementation; not just because of go-raft but because of how new and untested its new implementation is.

brandonkruse · 2015-02-23T16:57:10Z

Consul is really fantastic, however, fleet is tied so closely with etcd that I'd like to stay with etcd. +1 for alpha coreos with etcd 2.x

mosquitorat · 2015-03-03T06:00:34Z

I got the same problem with etcd 0.4.6, but after remove all data in /var/lib/etcd/*, I cloud start the etcd cluster.

wutil · 2015-03-16T05:55:48Z

@mosquitorat 3Q

wptad · 2015-03-23T05:43:40Z

Works for me !! @mosquitorat

Thanks a lot

wangzi19870227 · 2015-04-15T11:06:36Z

Works for me too !! @mosquitorat

Thanks a lot

jefferai · 2015-04-17T18:06:36Z

@mosquitorat Which removes all of the cluster data, correct?

jefferai changed the title ~~Serious error when running mkdir on a new cluster~~ vote endpoint returns 404 on new cluster May 30, 2014

yichengq mentioned this issue May 30, 2014

fix(standby_server): able to join the cluster containing itself #818

Merged

yichengq closed this as completed in #818 May 31, 2014

robszumski reopened this Jul 24, 2014

yichengq added the bug label Aug 28, 2014

zeisss mentioned this issue Oct 5, 2014

Machine rebooted though we configured reboot-strategy: off coreos/bugs#161

Closed

This was referenced Oct 14, 2014

"Use of closed network connection" causing heartbeats to time out on internal network connection #800

Closed

Reverse proxying hits occasional upstream connections refused #801

Closed

xiang90 closed this as completed Jan 16, 2015

vote endpoint returns 404 on new cluster #815

vote endpoint returns 404 on new cluster #815

Comments

jefferai commented May 30, 2014

jefferai commented May 30, 2014

jefferai commented May 30, 2014

yichengq commented May 30, 2014

jefferai commented May 30, 2014

tarnfeld commented Jul 8, 2014

djannot commented Jul 20, 2014

litch commented Jul 21, 2014

nnnnathann commented Aug 29, 2014

payneio commented Sep 2, 2014

ndemoor commented Sep 16, 2014

omegahm commented Sep 24, 2014

zeisss commented Sep 24, 2014

johnae commented Sep 24, 2014

memcmp commented Sep 26, 2014

nnnnathann commented Sep 26, 2014

zeisss commented Sep 26, 2014

tarnfeld commented Sep 26, 2014

jefferai commented Sep 30, 2014

mzsanford commented Oct 1, 2014

ryantanner commented Oct 1, 2014

philips commented Oct 1, 2014

mzsanford commented Oct 1, 2014

philips commented Oct 2, 2014

zeisss commented Oct 5, 2014

zeisss commented Oct 5, 2014

mzsanford commented Oct 6, 2014

zeisss commented Oct 7, 2014

chakri-n commented Oct 11, 2014

jefferai commented Oct 20, 2014

coozy commented Oct 22, 2014

shuoli84 commented Oct 26, 2014

kelseyhightower commented Oct 27, 2014

jefferai commented Nov 5, 2014

btrepp commented Dec 19, 2014

rca commented Jan 5, 2015

piaohai commented Jan 11, 2015

joshuarubin commented Jan 16, 2015

xiang90 commented Jan 16, 2015

david-gurley commented Jan 28, 2015

jpmx commented Jan 29, 2015

yichengq commented Jan 29, 2015

stianstr commented Feb 5, 2015

jefferai commented Feb 6, 2015

brandonkruse commented Feb 23, 2015

mosquitorat commented Mar 3, 2015

wutil commented Mar 16, 2015

wptad commented Mar 23, 2015

wangzi19870227 commented Apr 15, 2015

jefferai commented Apr 17, 2015