
etcd crashes due to corrupt data dir (3.2.16, Fedora 28) #10012

Closed

vorburger opened this issue Aug 15, 2018 · 16 comments

@vorburger (Member)

I have installed etcd on Fedora 28 via sudo dnf install etcd, and am up-to-date.

If I do something like cd /tmp ; etcd (without parameters) it starts just fine.

But if I systemctl start etcd, it crashes as shown below.

I realize this could be some kind of issue with how systemd launches etcd, such as a wrong parameter, and thus possibly more a mistake in the packaged service configuration file than a core etcd bug. But it probably still should not "crash hard" (with a core dump); it should print some more useful message about whatever it is not happy about.
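For context, the Fedora packaging appears to wire etcd up roughly like the abridged sketch below. This is an assumption reconstructed from the CLI reproduction further down, not copied from the installed unit, so verify against /usr/lib/systemd/system/etcd.service on your own system:

[Service]
User=etcd
# /etc/etcd/etcd.conf supplies ETCD_* variables such as ETCD_ADVERTISE_CLIENT_URLS
EnvironmentFile=-/etc/etcd/etcd.conf
ExecStart=/usr/bin/etcd --name="${ETCD_NAME}" --data-dir="${ETCD_DATA_DIR}" --listen-client-urls="${ETCD_LISTEN_CLIENT_URLS}"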

Glancing at /usr/lib/systemd/system/etcd.service and /etc/etcd/etcd.conf, I have been able to reproduce it without systemd by launching etcd with the same parameters from a CLI, like so:

$ sudo bash

[root@khany /]# etcd --version
etcd Version: 3.2.16
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.10
Go OS/Arch: linux/amd64

[root@khany /]# cd /var/lib/etcd/
[root@khany etcd]# ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379" GOMAXPROCS=4 /usr/bin/etcd --name="default" --data-dir="/var/lib/etcd/default.etcd" --listen-client-urls="http://localhost:2379"
2018-08-15 15:16:57.540801 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://localhost:2379
2018-08-15 15:16:57.540911 I | etcdmain: etcd Version: 3.2.16
2018-08-15 15:16:57.540918 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
2018-08-15 15:16:57.540922 I | etcdmain: Go Version: go1.10
2018-08-15 15:16:57.540925 I | etcdmain: Go OS/Arch: linux/amd64
2018-08-15 15:16:57.540930 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2018-08-15 15:16:57.540936 N | etcdmain: failed to detect default host (could not find default route)
2018-08-15 15:16:57.540990 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-08-15 15:16:57.542741 I | embed: listening for peers on http://localhost:2380
2018-08-15 15:16:57.542985 I | embed: listening for client requests on localhost:2379
2018-08-15 15:16:57.545941 I | etcdserver: recovered store from snapshot at index 30003
2018-08-15 15:16:57.550573 C | etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x557aed944a12]

goroutine 1 [running]:
panic(0x557aedf34800, 0x557aee55dfc0)
	/usr/lib/golang/src/runtime/panic.go:554 +0x3c5 fp=0xc420265f88 sp=0xc420265ee8 pc=0x557aed262ef5
runtime.panicmem()
	/usr/lib/golang/src/runtime/panic.go:63 +0x60 fp=0xc420265fa8 sp=0xc420265f88 pc=0x557aed261d90
runtime.sigpanic()
	/usr/lib/golang/src/runtime/signal_unix.go:388 +0x17e fp=0xc420265ff8 sp=0xc420265fa8 pc=0x557aed278dce
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc420266628, 0xc420266400)
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:284 +0x42 fp=0xc420266020 sp=0xc420265ff8 pc=0x557aed944a12
runtime.call32(0x0, 0x557aee098fe0, 0xc4200be0b0, 0x1000000010)
	/usr/lib/golang/src/runtime/asm_amd64.s:573 +0x3d fp=0xc420266050 sp=0xc420266020 pc=0x557aed29086d
panic(0x557aedeccb40, 0xc4201dc030)
	/usr/lib/golang/src/runtime/panic.go:505 +0x22d fp=0xc4202660f0 sp=0xc420266050 pc=0x557aed262d5d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc42000d800, 0x557aedac1e5d, 0x2a, 0xc4202664a0, 0x1, 0x1)
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x164 fp=0xc420266170 sp=0xc4202660f0 pc=0x557aed689a34
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc42023a480, 0xc42023a480, 0x557aee09faa0, 0xc4202a2020)
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:379 +0x258b fp=0xc420266618 sp=0xc420266170 pc=0x557aed9320cb
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc4201f6700, 0xc420300300, 0x0, 0x0)
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:157 +0x743 fp=0xc420266c78 sp=0xc420266618 pc=0x557aeda3bde3
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc4201f6700, 0x557aedaa247b, 0x6, 0xc420267201, 0x2)
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:186 +0x75 fp=0xc420266d40 sp=0xc420266c78 pc=0x557aeda95095
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:103 +0x136b fp=0xc420267f18 sp=0xc420266d40 pc=0x557aeda94a0b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:39 +0x12d fp=0xc420267f78 sp=0xc420267f18 pc=0x557aeda9a19d
main.main()
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x22 fp=0xc420267f88 sp=0xc420267f78 pc=0x557aeda9f1a2
runtime.main()
	/usr/lib/golang/src/runtime/proc.go:198 +0x21a fp=0xc420267fe0 sp=0xc420267f88 pc=0x557aed264c2a
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420267fe8 sp=0xc420267fe0 pc=0x557aed293071

goroutine 2 [force gc (idle)]:
runtime.gopark(0x557aee09c1d0, 0x557aee5b1770, 0x557aedaa8d3f, 0xf, 0x557aee09c014, 0x1)
	/usr/lib/golang/src/runtime/proc.go:291 +0x120 fp=0xc42005c768 sp=0xc42005c748 pc=0x557aed265080
runtime.goparkunlock(0x557aee5b1770, 0x557aedaa8d3f, 0xf, 0x14, 0x1)
	/usr/lib/golang/src/runtime/proc.go:297 +0x60 fp=0xc42005c7a8 sp=0xc42005c768 pc=0x557aed265140
runtime.forcegchelper()
	/usr/lib/golang/src/runtime/proc.go:248 +0xd0 fp=0xc42005c7e0 sp=0xc42005c7a8 pc=0x557aed264ec0
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc42005c7e8 sp=0xc42005c7e0 pc=0x557aed293071
created by runtime.init.5
	/usr/lib/golang/src/runtime/proc.go:237 +0x37

goroutine 3 [GC sweep wait]:
runtime.gopark(0x557aee09c1d0, 0x557aee5b1920, 0x557aedaa7400, 0xd, 0x557aed28e014, 0x1)
	/usr/lib/golang/src/runtime/proc.go:291 +0x120 fp=0xc42005cf60 sp=0xc42005cf40 pc=0x557aed265080
runtime.goparkunlock(0x557aee5b1920, 0x557aedaa7400, 0xd, 0x14, 0x1)
	/usr/lib/golang/src/runtime/proc.go:297 +0x60 fp=0xc42005cfa0 sp=0xc42005cf60 pc=0x557aed265140
runtime.bgsweep(0xc420040070)
	/usr/lib/golang/src/runtime/mgcsweep.go:71 +0x130 fp=0xc42005cfd8 sp=0xc42005cfa0 pc=0x557aed256b50
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc42005cfe0 sp=0xc42005cfd8 pc=0x557aed293071
created by runtime.gcenable
	/usr/lib/golang/src/runtime/mgc.go:216 +0x5a

goroutine 18 [finalizer wait]:
runtime.gopark(0x557aee09c1d0, 0x557aee5d3d30, 0x557aedaa81ff, 0xe, 0x14, 0x1)
	/usr/lib/golang/src/runtime/proc.go:291 +0x120 fp=0xc420058718 sp=0xc4200586f8 pc=0x557aed265080
runtime.goparkunlock(0x557aee5d3d30, 0x557aedaa81ff, 0xe, 0x14, 0x1)
	/usr/lib/golang/src/runtime/proc.go:297 +0x60 fp=0xc420058758 sp=0xc420058718 pc=0x557aed265140
runtime.runfinq()
	/usr/lib/golang/src/runtime/mfinal.go:175 +0xb1 fp=0xc4200587e0 sp=0xc420058758 pc=0x557aed24dad1
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4200587e8 sp=0xc4200587e0 pc=0x557aed293071
created by runtime.createfing
	/usr/lib/golang/src/runtime/mfinal.go:156 +0x64

goroutine 16 [chan receive]:
runtime.gopark(0x557aee09c1d0, 0xc4200c88f8, 0x557aedaa6cfd, 0xc, 0xc42004d317, 0x3)
	/usr/lib/golang/src/runtime/proc.go:291 +0x120 fp=0xc420058d70 sp=0xc420058d50 pc=0x557aed265080
runtime.goparkunlock(0xc4200c88f8, 0x557aedaa6cfd, 0xc, 0x557aed281a17, 0x3)
	/usr/lib/golang/src/runtime/proc.go:297 +0x60 fp=0xc420058db0 sp=0xc420058d70 pc=0x557aed265140
runtime.chanrecv(0xc4200c88a0, 0xc420058f38, 0x3b9aca01, 0xc4200bc960)
	/usr/lib/golang/src/runtime/chan.go:518 +0x2f6 fp=0xc420058e48 sp=0xc420058db0 pc=0x557aed23c216
runtime.chanrecv2(0xc4200c88a0, 0xc420058f38, 0xc420000180)
	/usr/lib/golang/src/runtime/chan.go:405 +0x2b fp=0xc420058e78 sp=0xc420058e48 pc=0x557aed23bf0b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.(*MergeLogger).outputLoop(0xc42000cb40)
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:174 +0x40f fp=0xc420058fd8 sp=0xc420058e78 pc=0x557aed785b5f
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420058fe0 sp=0xc420058fd8 pc=0x557aed293071
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.NewMergeLogger
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:92 +0x87

goroutine 29 [chan receive]:
runtime.gopark(0x557aee09c1d0, 0xc4201ec2f8, 0x557aedaa6cfd, 0xc, 0xc42004f817, 0x3)
	/usr/lib/golang/src/runtime/proc.go:291 +0x120 fp=0xc42005d570 sp=0xc42005d550 pc=0x557aed265080
runtime.goparkunlock(0xc4201ec2f8, 0x557aedaa6cfd, 0xc, 0x557aed281a17, 0x3)
	/usr/lib/golang/src/runtime/proc.go:297 +0x60 fp=0xc42005d5b0 sp=0xc42005d570 pc=0x557aed265140
runtime.chanrecv(0xc4201ec2a0, 0xc42005d738, 0x3b9aca01, 0xc4201ee320)
	/usr/lib/golang/src/runtime/chan.go:518 +0x2f6 fp=0xc42005d648 sp=0xc42005d5b0 pc=0x557aed23c216
runtime.chanrecv2(0xc4201ec2a0, 0xc42005d738, 0xc420000180)
	/usr/lib/golang/src/runtime/chan.go:405 +0x2b fp=0xc42005d678 sp=0xc42005d648 pc=0x557aed23bf0b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.(*MergeLogger).outputLoop(0xc4201a5420)
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:174 +0x40f fp=0xc42005d7d8 sp=0xc42005d678 pc=0x557aed785b5f
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc42005d7e0 sp=0xc42005d7d8 pc=0x557aed293071
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.NewMergeLogger
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:92 +0x87

goroutine 27 [syscall]:
runtime.notetsleepg(0x557aee5b7960, 0x3b9a6c3b, 0x0)
	/usr/lib/golang/src/runtime/lock_futex.go:227 +0x46 fp=0xc420059760 sp=0xc420059730 pc=0x557aed247696
runtime.timerproc(0x557aee5b7940)
	/usr/lib/golang/src/runtime/time.go:261 +0x2eb fp=0xc4200597d8 sp=0xc420059760 pc=0x557aed2820bb
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4200597e0 sp=0xc4200597d8 pc=0x557aed293071
created by runtime.(*timersBucket).addtimerLocked
	/usr/lib/golang/src/runtime/time.go:160 +0x109

goroutine 30 [chan receive]:
runtime.gopark(0x557aee09c1d0, 0xc42008cb38, 0x557aedaa6cfd, 0xc, 0x17, 0x3)
	/usr/lib/golang/src/runtime/proc.go:291 +0x120 fp=0xc420059d70 sp=0xc420059d50 pc=0x557aed265080
runtime.goparkunlock(0xc42008cb38, 0x557aedaa6cfd, 0xc, 0x557aed281a17, 0x3)
	/usr/lib/golang/src/runtime/proc.go:297 +0x60 fp=0xc420059db0 sp=0xc420059d70 pc=0x557aed265140
runtime.chanrecv(0xc42008cae0, 0xc420059f38, 0x3b9aca01, 0xc420206b90)
	/usr/lib/golang/src/runtime/chan.go:518 +0x2f6 fp=0xc420059e48 sp=0xc420059db0 pc=0x557aed23c216
runtime.chanrecv2(0xc42008cae0, 0xc420059f38, 0x0)
	/usr/lib/golang/src/runtime/chan.go:405 +0x2b fp=0xc420059e78 sp=0xc420059e48 pc=0x557aed23bf0b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.(*MergeLogger).outputLoop(0xc4201a5480)
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:174 +0x40f fp=0xc420059fd8 sp=0xc420059e78 pc=0x557aed785b5f
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420059fe0 sp=0xc420059fd8 pc=0x557aed293071
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.NewMergeLogger
	/builddir/build/BUILD/etcd-3.2.16/_build/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:92 +0x87

goroutine 71 [syscall]:
runtime.notetsleepg(0x557aee5b7a60, 0x5f5d124, 0x557aed533fce)
	/usr/lib/golang/src/runtime/lock_futex.go:227 +0x46 fp=0xc42005a760 sp=0xc42005a730 pc=0x557aed247696
runtime.timerproc(0x557aee5b7a40)
	/usr/lib/golang/src/runtime/time.go:261 +0x2eb fp=0xc42005a7d8 sp=0xc42005a760 pc=0x557aed2820bb
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc42005a7e0 sp=0xc42005a7d8 pc=0x557aed293071
created by runtime.(*timersBucket).addtimerLocked
	/usr/lib/golang/src/runtime/time.go:160 +0x109

goroutine 62 [syscall]:
runtime.notetsleepg(0x557aee5b78e0, 0x3b9ac1f7, 0x0)
	/usr/lib/golang/src/runtime/lock_futex.go:227 +0x46 fp=0xc42005df60 sp=0xc42005df30 pc=0x557aed247696
runtime.timerproc(0x557aee5b78c0)
	/usr/lib/golang/src/runtime/time.go:261 +0x2eb fp=0xc42005dfd8 sp=0xc42005df60 pc=0x557aed2820bb
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc42005dfe0 sp=0xc42005dfd8 pc=0x557aed293071
created by runtime.(*timersBucket).addtimerLocked
	/usr/lib/golang/src/runtime/time.go:160 +0x109

goroutine 40 [syscall]:
runtime.notetsleepg(0x557aee5b79e0, 0x5f5a275, 0x1)
	/usr/lib/golang/src/runtime/lock_futex.go:227 +0x46 fp=0xc420278760 sp=0xc420278730 pc=0x557aed247696
runtime.timerproc(0x557aee5b79c0)
	/usr/lib/golang/src/runtime/time.go:261 +0x2eb fp=0xc4202787d8 sp=0xc420278760 pc=0x557aed2820bb
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4202787e0 sp=0xc4202787d8 pc=0x557aed293071
created by runtime.(*timersBucket).addtimerLocked
	/usr/lib/golang/src/runtime/time.go:160 +0x109

goroutine 87 [syscall]:
runtime.notetsleepg(0x557aee5d4320, 0xffffffffffffffff, 0x0)
	/usr/lib/golang/src/runtime/lock_futex.go:227 +0x46 fp=0xc42005af80 sp=0xc42005af50 pc=0x557aed247696
os/signal.signal_recv(0x0)
	/usr/lib/golang/src/runtime/sigqueue.go:139 +0xa8 fp=0xc42005afa8 sp=0xc42005af80 pc=0x557aed279d78
os/signal.loop()
	/usr/lib/golang/src/os/signal/signal_unix.go:22 +0x24 fp=0xc42005afe0 sp=0xc42005afa8 pc=0x557aeda56564
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc42005afe8 sp=0xc42005afe0 pc=0x557aed293071
created by os/signal.init.0
	/usr/lib/golang/src/os/signal/signal_unix.go:28 +0x43

goroutine 102 [GC worker (idle)]:
runtime.gopark(0x557aee09c058, 0xc4202a4300, 0x557aedaa94e7, 0x10, 0x14, 0x0)
	/usr/lib/golang/src/runtime/proc.go:291 +0x120 fp=0xc420279748 sp=0xc420279728 pc=0x557aed265080
runtime.gcBgMarkWorker(0xc42004a000)
	/usr/lib/golang/src/runtime/mgc.go:1775 +0x136 fp=0xc4202797d8 sp=0xc420279748 pc=0x557aed2516f6
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4202797e0 sp=0xc4202797d8 pc=0x557aed293071
created by runtime.gcBgMarkStartWorkers
	/usr/lib/golang/src/runtime/mgc.go:1723 +0x7b

goroutine 72 [GC worker (idle)]:
runtime.gopark(0x557aee09c058, 0xc4202a4310, 0x557aedaa94e7, 0x10, 0x14, 0x0)
	/usr/lib/golang/src/runtime/proc.go:291 +0x120 fp=0xc420274748 sp=0xc420274728 pc=0x557aed265080
runtime.gcBgMarkWorker(0xc42004c500)
	/usr/lib/golang/src/runtime/mgc.go:1775 +0x136 fp=0xc4202747d8 sp=0xc420274748 pc=0x557aed2516f6
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4202747e0 sp=0xc4202747d8 pc=0x557aed293071
created by runtime.gcBgMarkStartWorkers
	/usr/lib/golang/src/runtime/mgc.go:1723 +0x7b

goroutine 103 [GC worker (idle)]:
runtime.gopark(0x557aee09c058, 0xc4202a4320, 0x557aedaa94e7, 0x10, 0x14, 0x0)
	/usr/lib/golang/src/runtime/proc.go:291 +0x120 fp=0xc420279f48 sp=0xc420279f28 pc=0x557aed265080
runtime.gcBgMarkWorker(0xc42004ea00)
	/usr/lib/golang/src/runtime/mgc.go:1775 +0x136 fp=0xc420279fd8 sp=0xc420279f48 pc=0x557aed2516f6
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420279fe0 sp=0xc420279fd8 pc=0x557aed293071
created by runtime.gcBgMarkStartWorkers
	/usr/lib/golang/src/runtime/mgc.go:1723 +0x7b

goroutine 73 [GC worker (idle)]:
runtime.gopark(0x557aee09c058, 0xc4202a4330, 0x557aedaa94e7, 0x10, 0x14, 0x0)
	/usr/lib/golang/src/runtime/proc.go:291 +0x120 fp=0xc420274f48 sp=0xc420274f28 pc=0x557aed265080
runtime.gcBgMarkWorker(0xc420050f00)
	/usr/lib/golang/src/runtime/mgc.go:1775 +0x136 fp=0xc420274fd8 sp=0xc420274f48 pc=0x557aed2516f6
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420274fe0 sp=0xc420274fd8 pc=0x557aed293071
created by runtime.gcBgMarkStartWorkers
	/usr/lib/golang/src/runtime/mgc.go:1723 +0x7b
Aborted (core dumped)
@vorburger (Member Author)

Oh! Wiping data rm -rf /var/lib/etcd/default.etcd fixed this... 😄

No worries for a dev workstation. Less cool if this were production... normal?
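For what it's worth, a sketch of a slightly safer variant of this recovery, keeping the broken state around for inspection instead of deleting it (paths match the Fedora packaging above):

sudo systemctl stop etcd
# Move the suspect data dir aside rather than rm -rf'ing it, so it can
# still be inspected or attached to a bug report later.
sudo mv /var/lib/etcd/default.etcd /var/lib/etcd/default.etcd.broken
# etcd recreates an empty data dir on the next start
sudo systemctl start etcd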

vorburger changed the title from "etcd 3.2.16 from Fedora 28 package crashes on 'systemctl start etcd.service'" to "etcd crashes due to corrupt data dir (3.2.16, Fedora 28)" Aug 15, 2018
@hexfusion (Contributor)

@vorburger can you tell us about the data that is recovered from the snapshot? I believe this panic can be caused by restoring from a snapshot that has /v2 data only and no /v3 data.

ref: #9890
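A quick way to check for this condition: in the default layout, the v3 backend (a bolt database) lives at <data-dir>/member/snap/db, next to the v2 .snap files. Assuming the data dir from this report:

ls -l /var/lib/etcd/default.etcd/member/snap/
# v2 *.snap files present but no "db" file would indicate v2-only state,
# which is exactly what trips the v3 snapshot recovery during startup.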

@vorburger (Member Author)

I'm new to etcd and have not, or at least not intentionally, used v2 at all. But I actually didn't rm but mv, so I still have that data; would it be useful if I zipped and shared it for reproduction? I don't remember how I got into that state though, sorry. Unless I've had the etcd RPM package installed for longer than I thought, and it used to be v2, and then during a Fedora upgrade it became v3... is that possible? FYI this is NOT blocking me; I just thought I would file it in case it adds value to the project. If you are sure that this is "just" #9890 and there is no need to repro with a ZIP of my data, then just close it? Although, in an ideal world, I guess it still should never just crash like this? 😄

@hexfusion (Contributor)

hexfusion commented Aug 16, 2018 via email

@vorburger (Member Author)

Here is my /var/lib/etcd/default.etcd/ : default.etcd.crash-issue-10012.tar.gz

hexfusion self-assigned this Aug 16, 2018
@hexfusion (Contributor)

@vorburger just checking in, I will have some time this week to dig into this.

@hexfusion (Contributor)

@vorburger I took a closer look and found data in the existing snapshots pointing to an existing v2.2 member.

{"action":"set","node":{"key":"/0/version","value":"2.2.0"..

If I try to start the cluster with the data dir you provided, I receive the same panic noted above.

As noted in the upgrade documentation (upgrade-checklists), to upgrade to 3.0 the etcd cluster needs to be at least at 2.3. So for this to work we would have needed to upgrade from 2.2 to 2.3 and then move towards 3.2.16. The panic was originally pinned down in #9480.

So while a panic is not what we would want to see, I feel this is fairly well documented.

But since we were here, I figured let's try to migrate and see if it would work. So I did a backup and restore of the directory you provided with a 2.3.8 binary. Then, starting the node, we see the update succeeds.

2018-08-22 20:41:20.418324 N | etcdserver: updated the cluster version from 2.2 to 2.3

Next with v3.0.17

2018-08-22 20:51:54.737017 N | membership: updated the cluster version from 2.3 to 3.0

Next we add some v3 data to avoid the panic: if I just start this with v3.1.19, we also get a panic because we have no v3 data.

ETCDCTL_API=3 etcdctl put foo bar

Upgrade works as expected.

2018-08-22 21:04:02.948208 I | etcdserver: updating the cluster version from 3.0 to 3.1

Now we are in the clear: we can simply keep upgrading the v3 binary, and the panic is no longer an issue. I hope you find this information useful.
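Condensed into commands, the sequence above looks roughly like the sketch below. Binary names such as etcd-v2.3.8 are placeholders for the respective release binaries, and the flags shown are the usual ones for this procedure rather than an exact transcript:

# 1. Back up the old v2.2 data dir with the 2.3.8 etcdctl
etcdctl backup --data-dir /var/lib/etcd/default.etcd --backup-dir /var/lib/etcd/restored.etcd

# 2. Start each release in turn so the cluster version steps up (2.2 -> 2.3 -> 3.0);
#    a node restored from a v2 backup typically needs --force-new-cluster on first start
etcd-v2.3.8  --data-dir /var/lib/etcd/restored.etcd --force-new-cluster
etcd-v3.0.17 --data-dir /var/lib/etcd/restored.etcd

# 3. Write at least one v3 key before moving past 3.0; with an empty v3 store,
#    v3.1+ panics on startup just as above
ETCDCTL_API=3 etcdctl put foo bar

# 4. Continue upgrading the binary (3.0 -> 3.1 -> 3.2)
etcd-v3.1.19 --data-dir /var/lib/etcd/restored.etcd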

@vorburger (Member Author)

Cool. I don't mind if you just close this issue with this. What I do find a little curious is that I have no memory of ever having explicitly done anything to "point to an existing v2.2 member" (I barely understand what that even means, TBH). So I'm just wondering if I had the etcd RPM package installed back when it was only v2, played with v2, and then during a Fedora upgrade it became v3, and that caused this. That, in theory, in an ideal world, would mean the "upgrade path" is broken. Or maybe it's just something I did explicitly, like 2-3 years ago, and forgot about. Unless you have plans to better handle this crash, just close this issue; but I wanted to at least record this thought here, in case this crash ever comes up anywhere again.

@hexfusion (Contributor)

@vorburger you bet, and I appreciate the issue; please don't think twice about that. I can leave the issue open.

So just wondering if I've had the etcd RPM package installed when it was only v2 before v3, and played with v2, and then during a Fedora upgrade it became v3, and that caused this. That, in theory, in an ideal world, would mean the "upgrade path" is broken.

The data was clearly from v2.2, which is not supported for upgrade to v3.x. Even if etcd had been upgraded per the documentation, the panic issue still would have happened moving from 3.0.x to 3.1.x. The reason for this panic is to prevent accidental v3 data loss (e.g. the db file might have been moved). @gyuho did attempt to create a workaround, but it was decided against (#9484). So from our end I feel handling the error is probably a reasonable goal, and I will try to take a look at it soonish. That is, unless you are interested in taking a stab :)

@vorburger (Member Author)

look at it soonish. That is unless you are interested in taking a stab :)

Tempting (and thanks for the offer, and the faith in me), but I'm already spread too thin (FYI I'm helping out over on jetcd!) and, most importantly, would first have to do some catching up re. Go... 😈

@hexfusion (Contributor)

Tempting (and thanks for the offer, and faith in me) but I'm already spread too thin (FYI I'm helping out over on jetcd!)

Yes, I am watching this, and thank you! Interestingly, I am trying to help out at jetcd but need to catch up with Java :)

@wenjiaswe (Contributor)

cc @jpbetz

@jpbetz (Contributor)

jpbetz commented Aug 29, 2018

This is resolved right? Okay to close?

@vorburger (Member Author)

@jpbetz re. "This is resolved right? Okay to close?" note @hexfusion saying "I feel handling the error is probably a reasonable goal and I will try to take a look at it soonish."

@hexfusion (Contributor)

To elaborate: etcd conditionally calls panic, so it is intentional in this case as a safeguard. I think a user seeing this panic will eventually find these GitHub issues, but a message to the effect that the v3 store must contain data could be helpful. Again, I need to review this, but I did want to add a little clarity to my statement.

etcd/etcdserver/server.go

Lines 441 to 447 in 34fcaba

if be, err = recoverSnapshotBackend(cfg, be, *snapshot); err != nil {
	if cfg.Logger != nil {
		cfg.Logger.Panic("failed to recover v3 backend from snapshot", zap.Error(err))
	} else {
		plog.Panicf("recovering backend from snapshot error: %v", err)
	}
}

@hexfusion (Contributor)

I will create a new issue to track this change, and am closing this one since the question was answered; I do plan on reviewing further.
