
Upgrading from v2.3 through v3.0 and v3.1 to v3.2 results in panic #9480

Closed

jlhawn opened this issue Mar 22, 2018 · 14 comments

@jlhawn

jlhawn commented Mar 22, 2018

Bug reporting

The docs recommend upgrading from v2.3 to v3.2 by stepping through each minor version along the way. However, there seems to be an issue if you perform this transition too quickly: if there are no writes to the v3 backend, or no snapshots are produced while running v3.0 or v3.1, then v3.2 panics on startup.

To reproduce this, start with an etcd v2.3 server which does have a snapshot (this bug does not occur if no snapshots have taken place yet). Stop the server and replace it with a v3.0 server. Everything seems fine. Next, stop the server and replace it with a v3.1 server. Again, everything is fine. Finally, stop the server and replace it with a v3.2 server and observe this panic when the server starts up:

2018-03-22 18:14:32.879716 I | etcdserver: recovered store from snapshot at index 52
2018-03-22 18:14:32.882938 C | etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb7ab8c]

goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc4201ac5f8, 0xc4201ac3d0)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:284 +0x3c
panic(0xdaf1c0, 0xc42025f950)
	/usr/local/google/home/jpbetz/.gvm/gos/go1.8.7/src/runtime/panic.go:489 +0x2cf
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420170820, 0xf95ff9, 0x2a, 0xc4201ac440, 0x1, 0x1)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc42026c000, 0x0, 0x14b2580, 0xc42025f8e0)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:379 +0x2e4d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc420182a80, 0xc420264000, 0x0, 0x0)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:157 +0x782
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc420182a80, 0x6, 0xf71713, 0x6, 0x1)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:186 +0x58
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:103 +0x15ba
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:39 +0x61
main.main()
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20

The bug seems to have been introduced in this patch from last year.

In this case, the new db backend (which has never been written to and has been in its initial state since the v3.0 server was deployed) reports a consistent index of 0, which is less than the latest snapshot index. The server assumes this means there is a *.snap.db file that can be renamed to db to catch the backend up to the *.snap file, but no such *.snap.db file exists.
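
To make the failure mode concrete, here is a minimal, runnable Go sketch of that decision (the function and parameter names are hypothetical, not etcd's actual identifiers):

package main

import (
	"errors"
	"fmt"
)

// recoverBackend models the v3.2 startup check described above (names are
// hypothetical). When the raft snapshot index is ahead of the backend's
// consistent index, etcd expects a newer database to have been shipped as a
// <index>.snap.db file that it can rename to db.
func recoverBackend(backendIndex, snapshotIndex uint64, snapDBExists bool) error {
	if snapshotIndex <= backendIndex {
		return nil // backend already reflects the snapshot; keep using it
	}
	if !snapDBExists {
		// A v2-only upgrade never produced a *.snap.db file, so startup fails here.
		return errors.New("database snapshot file path error: snap: snapshot file doesn't exist")
	}
	return nil // rename <index>.snap.db to db and continue
}

func main() {
	// A never-written v3 backend reports index 0 while the v2 snapshot is at index 52:
	fmt.Println(recoverBackend(0, 52, false))
}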

@gyuho gyuho added the type/bug label Mar 22, 2018
@gyuho
Contributor

gyuho commented Mar 22, 2018

To reproduce this, start with an etcd v2.3 server which does have a snapshot (this bug does not occur if no snapshots have taken place yet). Stop the server and replace it with a v3.0 server. Everything seems fine. Next, stop the server and replace it with a v3.1 server.

Do you have any v3 data?

@jlhawn
Author

jlhawn commented Mar 22, 2018

No, we are starting from etcd v2.3.8 and only use the v2 API (the v3 API was still experimental in that release, I believe), so we don't have any v3 data.

@gyuho
Contributor

gyuho commented Mar 22, 2018

So just migrating a v2.3.8 server with v2 data to a v3 server can trigger this panic?

@jlhawn
Author

jlhawn commented Mar 22, 2018

So just migrating a v2.3.8 server with v2 data to a v3 server can trigger this panic?

Yes.

@gyuho
Contributor

gyuho commented Mar 22, 2018

@jlhawn OK, I will try to reproduce. Thanks for the report.

@jlhawn
Author

jlhawn commented Mar 22, 2018

@gyuho I have prepared some repro steps if you think that would help.

@gyuho
Contributor

gyuho commented Mar 22, 2018

@jlhawn Can you share minimal reproduction steps here?

@jlhawn
Author

jlhawn commented Mar 22, 2018

My minimal repro steps require only docker:

  1. Create a volume named etcd-data:
docker volume create --name etcd-data
  2. Create a shell var to store some basic arguments for the server:
ARGS='-name etcd0 -data-dir /data
-advertise-client-urls http://127.0.0.1:2379,http://127.0.0.1:4001
-listen-client-urls http://127.0.0.1:2379,http://127.0.0.1:4001
-initial-advertise-peer-urls http://127.0.0.1:2380
-listen-peer-urls http://127.0.0.1:2380
-initial-cluster-token etcd-cluster-1
-initial-cluster etcd0=http://127.0.0.1:2380
-initial-cluster-state new'
  3. Create a v2.3.8 server:
docker run -d -v etcd-data:/data --name etcd quay.io/coreos/etcd:v2.3.8 $ARGS -snapshot-count 25

The very low -snapshot-count flag will force a snapshot soon after the server starts up, without the need to write any data (though you could do that if you want; that works too). Use docker logs etcd to wait for that to happen; it should take only a few seconds and looks like this:

2018-03-22 20:52:44.261940 I | etcdserver: start to snapshot (applied: 26, lastsnap: 0)
2018-03-22 20:52:44.266772 I | etcdserver: saved snapshot at index 26

You can also check the contents of the etcd-data volume to see that a member/snap/0000000000000002-000000000000001a.snap file now exists.

At this point, remove the server container with docker rm -f etcd.

  4. Create another server with etcd v3.0 (note that we no longer need the small snapshot count):
docker run -d -v etcd-data:/data --name etcd quay.io/coreos/etcd:v3.0 etcd $ARGS

You'll see from the logs of that container that it's up and has migrated from 2.3 to 3.0 and enabled v3 features:

2018-03-22 20:54:03.900059 I | etcdserver: updating the cluster version from 2.3 to 3.0
2018-03-22 20:54:03.902694 N | membership: updated the cluster version from 2.3 to 3.0
2018-03-22 20:54:03.902739 I | api: enabled capabilities for version 3.0

Remove this container again to upgrade it to v3.1: docker rm -f etcd

  5. Create another server with etcd v3.1:
docker run -d -v etcd-data:/data --name etcd quay.io/coreos/etcd:v3.1 etcd $ARGS

You'll see from the logs of that container that it's up and has migrated from 3.0 to 3.1 and enabled v3.1 features:

2018-03-22 20:54:27.241720 I | etcdserver: updating the cluster version from 3.0 to 3.1
2018-03-22 20:54:27.243598 N | etcdserver/membership: updated the cluster version from 3.0 to 3.1
2018-03-22 20:54:27.243659 I | etcdserver/api: enabled capabilities for version 3.1

Remove this container again to upgrade it to v3.2: docker rm -f etcd

  6. Create another server with etcd v3.2:
docker run -d -v etcd-data:/data --name etcd quay.io/coreos/etcd:v3.2 etcd $ARGS

This container will exit soon after it starts. Use docker logs etcd to see the panic.

@gyuho
Contributor

gyuho commented Mar 23, 2018

This panic prevents accidental db file deletion (overwrite) in v3.
I will add an optional flag to allow this upgrade use case.

@raoofm
Contributor

raoofm commented Mar 26, 2018

@gyuho I see that you added notes in the upgrade checklists saying not to upgrade the server unless you have migrated v3 data. The notes were added to v3.0, v3.1, v3.2, v3.3, and v3.4.

I think you can remove v3.0 and v3.1 from the list, as the bug was introduced in v3.2+.

I'm running 3.1.x in prod with v2 data, and I think @jlhawn also mentioned it works.

@gyuho
Contributor

gyuho commented Mar 26, 2018

@raoofm Thanks for pointing that out.

Actually, I intentionally added it to all of them:

Do not upgrade to newer v3 versions until v3.0 server contains v3 data.

as a reminder that there should be no upgrading from 3.0 to 3.x without v3 data.

Please let me know if this is still confusing.

@raoofm
Contributor

raoofm commented Mar 26, 2018

@gyuho no worries, I was just wondering if it will panic for existing users who already did the upgrade :) like me.

You can keep it as is, thanks 👍

@wsong

wsong commented Mar 26, 2018

So just to be clear, if you start up a 3.0 server, don't make any writes, then stop it and start up a 3.2 server, it's expected that the 3.2 server will panic? In this case, are you supposed to write out a dummy v3 key or something?

@gyuho
Contributor

gyuho commented Mar 26, 2018

Yes. Only an upgrade to 3.2 with no v3 keys will panic (no consistent index has been set). It's not what we intended, but we've decided to keep it as it is (sorry, it's too late to backport a fix to all 3.x branches), because bypassing it requires too many unsafe manual operations. The safest workaround is to write some dummy v3 keys.
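
As a concrete example, here is a minimal sketch of that workaround using the Go clientv3 API, run against the v3.0/v3.1 server before attempting the v3.2 upgrade (the endpoint and key name are arbitrary placeholders); running ETCDCTL_API=3 etcdctl put with any key would work just as well:

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Connect to the v3.0/v3.1 server before upgrading it to v3.2.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Any successful v3 write advances the backend's consistent index past
	// zero, so the v3.2 recovery path no longer looks for a *.snap.db file.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := cli.Put(ctx, "upgrade-marker", "v3-backend-initialized"); err != nil {
		log.Fatal(err)
	}
}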
