
Panics when joining a new 3.0.17 node to an existing 2.3 cluster #7429

Closed · euank opened this issue Mar 6, 2017 · 7 comments

Comments


euank commented Mar 6, 2017

Bug report

I have an existing 3-node etcd cluster running 2.3.7 on Container Linux (various versions).

I ran etcdctl member add etcd1 http://192.168.131.3:2380 and configured a new node, running etcd v3.0.17 on a fresh Container Linux machine.
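For reference, the reproduction amounts to roughly the following; the etcdctl command is the one above and the environment values mirror the flags logged below, while the way they were delivered to the unit (systemd drop-in, cloud-config, etc.) is an assumption:

# On an existing 2.3.7 member: register the new peer.
etcdctl member add etcd1 http://192.168.131.3:2380

# On the fresh Container Linux machine: configure the new member to join the
# existing cluster (values mirror the flags log below), then start v3.0.17.
export ETCD_NAME=etcd1
export ETCD_DATA_DIR=/var/lib/etcd
export ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.131.3:2380
export ETCD_LISTEN_PEER_URLS=http://192.168.131.3:2380
export ETCD_ADVERTISE_CLIENT_URLS=http://192.168.131.3:2379
export ETCD_LISTEN_CLIENT_URLS=http://192.168.131.3:2379
export ETCD_INITIAL_CLUSTER="k8s7=http://192.168.131.106:2380,etcd1=http://192.168.131.3:2380,k8s8=http://192.168.131.165:2380,k8s1=http://192.168.131.178:2380"
export ETCD_INITIAL_CLUSTER_STATE=existing
export ETCD_STRICT_RECONFIG_CHECK=true
systemctl start etcd-member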

This resulted in the following panic.

+ exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=systemd-dir,target=/run/systemd/system --volume systemd-dir,kind=host,source=/run/systemd/system,readOnly=true --mount volume=notify,target=/run/systemd/notify --volume notify,kind=host,source=/run/systemd/notify --set-env=NOTIFY_SOCKET=/run/systemd/notify --volume data-dir,kind=host,source=/var/lib/etcd,readOnly=false --volume etc-ssl-certs,kind=host,source=/etc/ssl/certs,readOnly=true --volume usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=data-dir,target=/var/lib/etcd --mount volume=etc-ssl-certs,target=/etc/ssl/certs --mount volume=usr-share-certs,target=/usr/share/ca-certificates --mount volume=etc-hosts,target=/etc/hosts --mount volume=etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/etcd:v3.0.17 --user=232 --
image: keys already exist for prefix "quay.io/coreos/etcd", not fetching again
Downloading signature:  0 B/473 B
Downloading signature:  473 B/473 B
Downloading signature:  473 B/473 B
Downloading ACI:  0 B/13.8 MB
Downloading ACI:  16.3 KB/13.8 MB
Downloading ACI:  3.9 MB/13.8 MB
Downloading ACI:  13.8 MB/13.8 MB
image: signature verified:
  Quay.io ACI Converter (ACI conversion signing key) <support@quay.io>
2017-03-06 08:25:37.018085 I | flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://192.168.131.3:2379
2017-03-06 08:25:37.018176 I | flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
2017-03-06 08:25:37.018204 I | flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.131.3:2380
2017-03-06 08:25:37.018223 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=k8s7=http://192.168.131.106:2380,etcd1=http://192.168.131.3:2380,k8s8=http://192.168.131.165:2380,k8s1=http://192.168.131.178:2380
2017-03-06 08:25:37.018235 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
2017-03-06 08:25:37.018250 I | flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://192.168.131.3:2379
2017-03-06 08:25:37.018262 I | flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://192.168.131.3:2380
2017-03-06 08:25:37.018289 I | flags: recognized and used environment variable ETCD_NAME=etcd1
2017-03-06 08:25:37.018328 I | flags: recognized and used environment variable ETCD_STRICT_RECONFIG_CHECK=true
2017-03-06 08:25:37.018364 W | flags: unrecognized environment variable ETCD_USER=etcd
2017-03-06 08:25:37.018373 W | flags: unrecognized environment variable ETCD_IMAGE_TAG=v3.0.17
2017-03-06 08:25:37.018416 I | etcdmain: etcd Version: 3.0.17
2017-03-06 08:25:37.018424 I | etcdmain: Git SHA: cc198e2
2017-03-06 08:25:37.018430 I | etcdmain: Go Version: go1.6.4
2017-03-06 08:25:37.018437 I | etcdmain: Go OS/Arch: linux/amd64
2017-03-06 08:25:37.018444 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2017-03-06 08:25:37.018494 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2017-03-06 08:25:37.018594 I | etcdmain: listening for peers on http://192.168.131.3:2380
2017-03-06 08:25:37.018634 I | etcdmain: listening for client requests on 192.168.131.3:2379
2017-03-06 08:25:37.025377 I | etcdserver: name = etcd1
2017-03-06 08:25:37.025400 I | etcdserver: data dir = /var/lib/etcd
2017-03-06 08:25:37.025410 I | etcdserver: member dir = /var/lib/etcd/member
2017-03-06 08:25:37.025417 I | etcdserver: heartbeat = 100ms
2017-03-06 08:25:37.025424 I | etcdserver: election = 1000ms
2017-03-06 08:25:37.025431 I | etcdserver: snapshot count = 10000
2017-03-06 08:25:37.025443 I | etcdserver: advertise client URLs = http://192.168.131.3:2379
2017-03-06 08:25:37.080261 I | etcdserver: starting member 628fc0cde08f860e in cluster 45cdd59f8a767374
2017-03-06 08:25:37.080399 I | raft: 628fc0cde08f860e became follower at term 0
2017-03-06 08:25:37.080420 I | raft: newRaft 628fc0cde08f860e [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2017-03-06 08:25:37.080428 I | raft: 628fc0cde08f860e became follower at term 1
2017-03-06 08:25:37.185378 I | rafthttp: started HTTP pipelining with peer 214fecd12c05f700
2017-03-06 08:25:37.185452 I | rafthttp: started HTTP pipelining with peer 9b8a45aac4216e22
2017-03-06 08:25:37.185477 I | rafthttp: started HTTP pipelining with peer db59fd94ae7e5f2b
2017-03-06 08:25:37.185500 I | rafthttp: starting peer 214fecd12c05f700...
2017-03-06 08:25:37.185521 I | rafthttp: started HTTP pipelining with peer 214fecd12c05f700
2017-03-06 08:25:37.190038 I | rafthttp: started streaming with peer 214fecd12c05f700 (writer)
2017-03-06 08:25:37.190084 I | rafthttp: started streaming with peer 214fecd12c05f700 (writer)
2017-03-06 08:25:37.192888 I | rafthttp: started peer 214fecd12c05f700
2017-03-06 08:25:37.192953 I | rafthttp: added peer 214fecd12c05f700
2017-03-06 08:25:37.192999 I | rafthttp: starting peer 9b8a45aac4216e22...
2017-03-06 08:25:37.193042 I | rafthttp: started streaming with peer 214fecd12c05f700 (stream Message reader)
2017-03-06 08:25:37.193077 I | rafthttp: started HTTP pipelining with peer 9b8a45aac4216e22
2017-03-06 08:25:37.193356 I | rafthttp: started streaming with peer 214fecd12c05f700 (stream MsgApp v2 reader)
2017-03-06 08:25:37.194815 I | rafthttp: started streaming with peer 9b8a45aac4216e22 (writer)
2017-03-06 08:25:37.195622 I | rafthttp: started streaming with peer 9b8a45aac4216e22 (writer)
2017-03-06 08:25:37.196057 I | rafthttp: peer 214fecd12c05f700 became active
2017-03-06 08:25:37.198392 I | rafthttp: established a TCP streaming connection with peer 214fecd12c05f700 (stream Message reader)
2017-03-06 08:25:37.198419 I | rafthttp: started streaming with peer 9b8a45aac4216e22 (stream MsgApp v2 reader)
2017-03-06 08:25:37.198439 I | rafthttp: started streaming with peer 9b8a45aac4216e22 (stream Message reader)
2017-03-06 08:25:37.198463 I | rafthttp: started peer 9b8a45aac4216e22
2017-03-06 08:25:37.198500 I | rafthttp: added peer 9b8a45aac4216e22
2017-03-06 08:25:37.198526 I | rafthttp: starting peer db59fd94ae7e5f2b...
2017-03-06 08:25:37.198553 I | rafthttp: started HTTP pipelining with peer db59fd94ae7e5f2b
2017-03-06 08:25:37.199935 I | rafthttp: peer 9b8a45aac4216e22 became active
2017-03-06 08:25:37.199976 I | rafthttp: established a TCP streaming connection with peer 9b8a45aac4216e22 (stream MsgApp v2 reader)
2017-03-06 08:25:37.201136 I | rafthttp: started streaming with peer db59fd94ae7e5f2b (writer)
2017-03-06 08:25:37.201357 I | rafthttp: started streaming with peer db59fd94ae7e5f2b (writer)
2017-03-06 08:25:37.201459 I | rafthttp: established a TCP streaming connection with peer 214fecd12c05f700 (stream MsgApp v2 reader)
2017-03-06 08:25:37.202752 I | rafthttp: started peer db59fd94ae7e5f2b
2017-03-06 08:25:37.202809 I | rafthttp: added peer db59fd94ae7e5f2b
2017-03-06 08:25:37.202839 I | etcdserver: starting server... [version: 3.0.17, cluster version: to_be_decided]
2017-03-06 08:25:37.203888 I | rafthttp: established a TCP streaming connection with peer 9b8a45aac4216e22 (stream Message reader)
2017-03-06 08:25:37.203941 I | rafthttp: started streaming with peer db59fd94ae7e5f2b (stream Message reader)
2017-03-06 08:25:37.204046 I | rafthttp: started streaming with peer db59fd94ae7e5f2b (stream MsgApp v2 reader)
2017-03-06 08:25:37.206795 I | rafthttp: established a TCP streaming connection with peer 214fecd12c05f700 (stream Message writer)
2017-03-06 08:25:37.206879 I | rafthttp: established a TCP streaming connection with peer 9b8a45aac4216e22 (stream Message writer)
2017-03-06 08:25:37.208454 I | rafthttp: peer db59fd94ae7e5f2b became active
2017-03-06 08:25:37.208503 I | rafthttp: established a TCP streaming connection with peer db59fd94ae7e5f2b (stream Message reader)
2017-03-06 08:25:37.208681 I | rafthttp: established a TCP streaming connection with peer db59fd94ae7e5f2b (stream MsgApp v2 reader)
2017-03-06 08:25:37.209476 I | rafthttp: established a TCP streaming connection with peer 9b8a45aac4216e22 (stream MsgApp v2 writer)
2017-03-06 08:25:37.209524 I | rafthttp: established a TCP streaming connection with peer 214fecd12c05f700 (stream MsgApp v2 writer)
2017-03-06 08:25:37.209572 I | rafthttp: established a TCP streaming connection with peer db59fd94ae7e5f2b (stream MsgApp v2 writer)
2017-03-06 08:25:37.210833 I | rafthttp: established a TCP streaming connection with peer db59fd94ae7e5f2b (stream Message writer)
2017-03-06 08:25:37.373937 I | raft: 628fc0cde08f860e [term: 1] received a MsgVote message with higher term from db59fd94ae7e5f2b [term: 3498]
2017-03-06 08:25:37.373994 I | raft: 628fc0cde08f860e became follower at term 3498
2017-03-06 08:25:37.374021 I | raft: 628fc0cde08f860e [logterm: 0, index: 0, vote: 0] voted for db59fd94ae7e5f2b [logterm: 3495, index: 74140240] at term 3498
2017-03-06 08:25:37.408554 I | raft: raft.node: 628fc0cde08f860e elected leader db59fd94ae7e5f2b at term 3498
2017-03-06 08:25:37.491798 I | raft: 628fc0cde08f860e [commit: 0, lastindex: 0, lastterm: 0] starts to restore snapshot [index: 74138982, term: 3495]
2017-03-06 08:25:37.491893 I | raft: log [committed=0, applied=0, unstable.offset=1, len(unstable.Entries)=0] starts to restore snapshot [index: 74138982, term: 3495]
2017-03-06 08:25:37.491942 I | raft: 628fc0cde08f860e restored progress of 214fecd12c05f700 [next = 74138983, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2017-03-06 08:25:37.491977 I | raft: 628fc0cde08f860e restored progress of 628fc0cde08f860e [next = 74138983, match = 74138982, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2017-03-06 08:25:37.492020 I | raft: 628fc0cde08f860e restored progress of 9b8a45aac4216e22 [next = 74138983, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2017-03-06 08:25:37.492049 I | raft: 628fc0cde08f860e restored progress of db59fd94ae7e5f2b [next = 74138983, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2017-03-06 08:25:37.492070 I | raft: 628fc0cde08f860e [commit: 74138982] restored snapshot [index: 74138982, term: 3495]
2017-03-06 08:25:37.494728 I | etcdserver: applying snapshot at index 0...
2017-03-06 08:25:37.494877 C | etcdserver: get database snapshot file path error: snap: snapshot file doesn't exist
2017-03-06 08:25:37.494898 I | etcdserver: finished applying incoming snapshot at index 0
panic: get database snapshot file path error: snap: snapshot file doesn't exist
goroutine 210 [running]:
panic(0xd460a0, 0xc82000b450)
	/usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc8201ad7e0, 0x125e0e0, 0x29, 0xc82003f6b8, 0x1, 0x1)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x191
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).applySnapshot(0xc82023e480, 0xc820012440, 0xc8202226e0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:654 +0x5a1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).applyAll(0xc82023e480, 0xc820012440, 0xc8202226e0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:615 +0x60
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).run.func2(0x7f0dd2500690, 0xc8200123c0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:596 +0x32
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule.(*fifo).run(0xc82006f5c0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule/schedule.go:160 +0x323
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule.NewFIFOScheduler
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule/schedule.go:71 +0x27d
etcd-member.service: Main process exited, code=exited, status=2/INVALIDARGUMENT


Afterwards, restarting it (with the same configuration) results in a new panic:

2017-03-06 08:25:51.248581 I | etcdserver: restarting member 628fc0cde08f860e in cluster 45cdd59f8a767374 at commit index 74138982
2017-03-06 08:25:51.248634 C | raft: 628fc0cde08f860e state.commit 74138982 is out of range [0, 0]
panic: 628fc0cde08f860e state.commit 74138982 is out of range [0, 0]
goroutine 1 [running]:
panic(0xd460a0, 0xc820250aa0)
	/usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc8201cca60, 0x1238960, 0x2b, 0xc82014db00, 0x4, 0x4)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x191
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).loadState(0xc820119d40, 0xdaa, 0xdb59fd94ae7e5f2b, 0x46b4566, 0x0, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:942 +0x2a2
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.newRaft(0xc8201138a8, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:225 +0x8ff
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.RestartNode(0xc8201138a8, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:213 +0x45
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.restartNode(0xc82007cf20, 0x0, 0x7f2a8516a028, 0xc820138d90, 0x0, 0x0, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/raft.go:369 +0x7c7
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc82007cf20, 0x0, 0x7f2a8516a028, 0xc820138d90)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:349 +0x43f3
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc82002fc00, 0x0, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:374 +0x245f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:116 +0x2101
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:36 +0x21e
main.main()
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/main.go:28 +0x14
etcd-member.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
euank changed the title from "Panics when joining 3.0.17 to a 2.3 cluster" to "Panics when joining a new 3.0.17 node to an existing 2.3 cluster" on Mar 6, 2017
gyuho (Contributor) commented Mar 6, 2017

2017-03-06 08:25:37.018204 I | flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.131.3:2380
2017-03-06 08:25:37.018223 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=k8s7=http://192.168.131.106:2380,etcd1=http://192.168.131.3:2380,k8s8=http://192.168.131.165:2380,k8s1=http://192.168.131.178:2380

Was etcd1 added to the 2.3.7 cluster before the 3.0 migration (https://github.com/coreos/etcd/blob/master/Documentation/upgrades/upgrade_3_0.md#upgrade-procedure)? Or was it added and started as a 2.3.7 node and then upgraded?

the server is already initialized as member before, starting as etcd member...

Where does this data directory come from?

euank (Author) commented Mar 6, 2017

@gyuho no, it was not, and I was able to fix the issue by adding the node as 2.3.7 and then updating it in place to 3.0 (the more normal migration path). After doing that, nothing panicked and everything worked.
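A minimal sketch of that in-place path on Container Linux, assuming the etcd-member wrapper's ETCD_IMAGE_TAG variable (visible in the flags log above); the drop-in path is illustrative:

# 1. Bring the new member up on the same version as the rest of the cluster.
mkdir -p /etc/systemd/system/etcd-member.service.d
cat > /etc/systemd/system/etcd-member.service.d/20-image-tag.conf <<'EOF'
[Service]
Environment=ETCD_IMAGE_TAG=v2.3.7
EOF
systemctl daemon-reload
systemctl restart etcd-member

# 2. After it has joined and caught up, switch to v3.0.17 and restart in place,
#    keeping the same data directory (/var/lib/etcd).
sed -i 's/v2\.3\.7/v3.0.17/' /etc/systemd/system/etcd-member.service.d/20-image-tag.conf
systemctl daemon-reload
systemctl restart etcd-member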

Where does this data directory come from?

I realized I missed a step. I accidentally started version 3.1.2, and it errored with "2017-03-06 08:23:55.391805 W | etcdserver: the running cluster version(2.3.0) is lower than the minimal cluster version(3.0.0) supported" and exited cleanly.

I guess that run created the data directory; I thought it hadn't. Good catch.
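Before retrying a fresh join, it's worth checking that the aborted start didn't leave state behind; a quick check, assuming the data-dir paths from the log above (removing the member dir is destructive and only safe on a node with nothing to keep):

# The member dir logged above is /var/lib/etcd/member; a brand-new joining
# member should start without one.
ls -l /var/lib/etcd/member/snap /var/lib/etcd/member/wal
# If only remnants of the aborted start are present, clear them before retrying:
# rm -rf /var/lib/etcd/member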


This might be a documentation problem. The docs make it sound like a mixed cluster of 2.3 and 3.0 members can run, but if the only way to get there is to upgrade existing 2.3 nodes to 3.0 in place, reusing the same data directory, that should be documented more clearly.

gyuho (Contributor) commented Mar 6, 2017

@xiang90 Any suggestions for making the doc clearer?

euank (Author) commented Mar 6, 2017

The line I found most confusing was:

https://github.com/coreos/etcd/blob/v3.1.2/Documentation/upgrades/upgrade_3_0.md#mixed-versions

While upgrading, an etcd cluster supports mixed versions of etcd members, and operates with the protocol of the lowest common version

To me, that implied I could add a 3.0 etcd node to a 2.3 cluster, remove a 2.3 member, and repeat until the upgrade was complete.

It sounds like that doesn't work.
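Concretely, that rolling remove/add rotation reads as something like the following sketch (the member ID is a placeholder); this is the sequence that ends in the panic above:

# On an existing 2.3 member: register a brand-new peer for the 3.0 node.
etcdctl member add etcd1 http://192.168.131.3:2380
# On a fresh machine: start etcd v3.0.17 with an empty data dir and
# ETCD_INITIAL_CLUSTER_STATE=existing (as in the report above).
# Then retire one of the old members and repeat with the next node:
etcdctl member remove <id-of-a-2.3-member-being-replaced>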

heyitsanthony (Contributor) commented

I feel like the remove/add way should work, but I'm not surprised it's broken since it's an unusual upgrade path.

xiang90 (Contributor) commented Mar 6, 2017

@heyitsanthony I remember there is a known bug in this unusual path, for 2.3 -> 3.0 only. We did not try to fix it, since no one was really blocked by it until now (our infra team did hit it... but they then switched to the normal upgrade path to work around it).

heyitsanthony (Contributor) commented

@gyuho the upgrade doc probably needs a "Known issues" section at the bottom warning about this path.

gyuho added commits to gyuho/etcd that referenced this issue Mar 7, 2017

Fix etcd-io#7429.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>