Etcd start failed after power off and restart #11949

Closed
xieyanker opened this issue May 27, 2020 · 37 comments

@xieyanker

I have an etcd cluster for Kubernetes which has 3 members. All machines were powered off yesterday, and all of the etcd members failed to start after power was restored.

My etcd version is 3.4.2.

The error on member01 is failed to find database snapshot file (snap: snapshot file doesn't exist), and the errors on member02 and member03 are the same: freepages: failed to get all reachable pages.

The full logs are as follows:

member01:

May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680835 I | etcdmain: etcd Version: 3.4.2
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680840 I | etcdmain: Git SHA: a7cf1ca
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680845 I | etcdmain: Go Version: go1.12.5
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680849 I | etcdmain: Go OS/Arch: linux/amd64
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680855 I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680926 W | etcdmain: found invalid file/dir .bash_logout under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680937 W | etcdmain: found invalid file/dir .bashrc under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680942 W | etcdmain: found invalid file/dir .profile under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680952 N | etcdmain: the server is already initialized as member before, starting as etcd member...
May 26 22:00:14 mgt01 etcd[6323]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680997 I | embed: peerTLS: cert = /etc/ssl/etcd/ssl/member-mgt01.pem, key = /etc/ssl/etcd/ssl/member-mgt01-key.pem, trusted-ca = /etc/ssl/etcd/ssl/ca.pem, client-cert-auth = true, crl-file =
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681694 I | embed: name = etcd-mgt01
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681708 I | embed: data dir = /var/lib/etcd
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681713 I | embed: member dir = /var/lib/etcd/member
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681719 I | embed: heartbeat = 100ms
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681724 I | embed: election = 1000ms
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681728 I | embed: snapshot count = 10000
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681737 I | embed: advertise client URLs = https://10.61.109.41:2379
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681746 I | embed: initial advertise peer URLs = https://10.61.109.41:2380
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681753 I | embed: initial cluster =
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.689611 I | etcdserver: recovered store from snapshot at index 21183302
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.693186 C | etcdserver: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
May 26 22:00:14 mgt01 etcd[6323]: panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
May 26 22:00:14 mgt01 etcd[6323]:         panic: runtime error: invalid memory address or nil pointer dereference
May 26 22:00:14 mgt01 etcd[6323]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc2c6be]
May 26 22:00:14 mgt01 etcd[6323]: goroutine 1 [running]:
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/etcdserver.NewServer.func1(0xc0002eef50, 0xc0002ecf48)
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/etcdserver/server.go:335 +0x3e
May 26 22:00:14 mgt01 etcd[6323]: panic(0xed6540, 0xc00062a080)
May 26 22:00:14 mgt01 etcd[6323]:         /usr/local/go/src/runtime/panic.go:522 +0x1b5
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc0001fe7c0, 0x10af294, 0x2a, 0xc0002ed018, 0x1, 0x1)
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x135
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/etcdserver.NewServer(0xc00004208a, 0xa, 0x0, 0x0, 0x0, 0x0, 0xc000198d00, 0x1, 0x1, 0xc000198e80, ...)
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/etcdserver/server.go:456 +0x42f7
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/embed.StartEtcd(0xc0001bc580, 0xc0001bcb00, 0x0, 0x0)
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/embed/etcd.go:211 +0x9d0
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/etcdmain.startEtcd(0xc0001bc580, 0x10849de, 0x6, 0x1, 0xc00020f1d0)
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/etcdmain/etcd.go:302 +0x40
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/etcdmain/etcd.go:144 +0x2f71
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/etcdmain.Main()
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/etcdmain/main.go:46 +0x38
May 26 22:00:14 mgt01 etcd[6323]: main.main()
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/main.go:28 +0x20

member02:

May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749838 I | etcdmain: etcd Version: 3.4.2
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749845 I | etcdmain: Git SHA: a7cf1ca
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749850 I | etcdmain: Go Version: go1.12.5
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749855 I | etcdmain: Go OS/Arch: linux/amd64
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749861 I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749973 W | etcdmain: found invalid file/dir .bash_logout under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749984 W | etcdmain: found invalid file/dir .bashrc under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749992 W | etcdmain: found invalid file/dir .profile under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.750001 N | etcdmain: the server is already initialized as member before, starting as etcd member...
May 26 22:00:13 mgt02 etcd[3482]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.750037 I | embed: peerTLS: cert = /etc/ssl/etcd/ssl/member-mgt02.pem, key = /etc/ssl/etcd/ssl/member-mgt02-key.pem, trusted-ca = /etc/ssl/etcd/ssl/ca.pem, client-cert-auth = true, crl-file =
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751369 I | embed: name = etcd-mgt02
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751385 I | embed: data dir = /var/lib/etcd
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751391 I | embed: member dir = /var/lib/etcd/member
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751396 I | embed: heartbeat = 100ms
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751401 I | embed: election = 1000ms
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751406 I | embed: snapshot count = 10000
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751415 I | embed: advertise client URLs = https://10.61.109.42:2379
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751421 I | embed: initial advertise peer URLs = https://10.61.109.42:2380
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751428 I | embed: initial cluster =
May 26 22:00:13 mgt02 etcd[3482]: panic: freepages: failed to get all reachable pages (page 6148948888203174444: out of bounds: 14622)
May 26 22:00:13 mgt02 etcd[3482]: goroutine 145 [running]:
May 26 22:00:13 mgt02 etcd[3482]: go.etcd.io/etcd/vendor/go.etcd.io/bbolt.(*DB).freepages.func2(0xc00023c120)
May 26 22:00:13 mgt02 etcd[3482]:         /opt/go/src/go.etcd.io/etcd/vendor/go.etcd.io/bbolt/db.go:1003 +0xe5
May 26 22:00:13 mgt02 etcd[3482]: created by go.etcd.io/etcd/vendor/go.etcd.io/bbolt.(*DB).freepages
May 26 22:00:13 mgt02 etcd[3482]:         /opt/go/src/go.etcd.io/etcd/vendor/go.etcd.io/bbolt/db.go:1001 +0x1b5

Finally, we recovered most of the data from our backup file. However, we would like to know why this happened. Is there any way to avoid this issue? Thanks!!!

@tangcong
Contributor

That is unfortunate. It seems all three members' db files are broken. Can you use the bolt tool (for example, see issue #10010) to check your db files and provide complete etcd logs from before and after the failure? Is there a problem with the disk?
/cc @gyuho @jingyih @xiang90 @jpbetz
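
For reference, a basic integrity check with the bbolt CLI (assuming it is installed; the data-dir path below is the one from the logs above) would look like:

bbolt check /var/lib/etcd/member/snap/db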

@vekergu

vekergu commented Jun 26, 2020

My etcd version is 3.4.9

# cd /var/lib/etcd/
# tree
.
└── member
    ├── snap
    │   ├── 0000000000000003-0000000000033466.snap
    │   ├── 0000000000000003-0000000000035b77.snap
    │   ├── 0000000000000003-0000000000038288.snap
    │   ├── 0000000000000003-000000000003a999.snap
    │   ├── 0000000000000003-000000000003d0aa.snap
    │   └── db
    └── wal
        ├── 0000000000000000-0000000000000000.wal
        ├── 0000000000000001-0000000000014a00.wal
        ├── 0000000000000002-000000000002a59b.wal
        └── 0.tmp 

The etcd start log is:

{"level":"info","ts":"2020-06-26T13:11:02.536+0800","caller":"etcdmain/etcd.go:134","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{"level":"info","ts":"2020-06-26T13:11:02.536+0800","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.145.10:2380"]}
{"level":"info","ts":"2020-06-26T13:11:02.536+0800","caller":"embed/etcd.go:465","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2020-06-26T13:11:02.537+0800","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.145.10:2379"]}
{"level":"info","ts":"2020-06-26T13:11:02.537+0800","caller":"embed/etcd.go:299","msg":"starting an etcd server","etcd-version":"3.4.9","git-sha":"54ba95891","go-version":"go1.12.17","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"192.168.145.10","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":true,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.145.10:2380"],"listen-peer-urls":["https://192.168.145.10:2380"],"advertise-client-urls":["https://192.168.145.10:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.145.10:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2020-06-26T13:11:02.538+0800","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"571.607µs"}
{"level":"info","ts":"2020-06-26T13:11:02.913+0800","caller":"etcdserver/server.go:451","msg":"recovered v2 store from snapshot","snapshot-index":250026,"snapshot-size":"7.5 kB"}
{"level":"warn","ts":"2020-06-26T13:11:02.915+0800","caller":"snap/db.go:92","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":250026,"snapshot-file-path":"/var/lib/etcd/member/snap/000000000003d0aa.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2020-06-26T13:11:02.915+0800","caller":"etcdserver/server.go:462","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/etcdserver.NewServer\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:462\ngo.etcd.io/etcd/embed.StartEtcd\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/embed/etcd.go:211\ngo.etcd.io/etcd/etcdmain.startEtcd\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:302\ngo.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:144\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}
panic: failed to recover v3 backend from snapshot
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc0494e]

goroutine 1 [running]:
go.etcd.io/etcd/etcdserver.NewServer.func1(0xc0003a0f50, 0xc00039ef48)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:334 +0x3e
panic(0xee36c0, 0xc000046200)
        /usr/local/go/src/runtime/panic.go:522 +0x1b5
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000164420, 0xc0001527c0, 0x1, 0x1)
        /home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/zap@v1.10.0/zapcore/entry.go:229 +0x546
go.uber.org/zap.(*Logger).Panic(0xc0002fc180, 0x10bd441, 0x2a, 0xc0001527c0, 0x1, 0x1)
        /home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/zap@v1.10.0/logger.go:225 +0x7f
go.etcd.io/etcd/etcdserver.NewServer(0x7ffdcc63c732, 0xe, 0x0, 0x0, 0x0, 0x0, 0xc0001bd000, 0x1, 0x1, 0xc0001bd180, ...)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:462 +0x3d67
go.etcd.io/etcd/embed.StartEtcd(0xc00016c580, 0xc00016cb00, 0x0, 0x0)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/embed/etcd.go:211 +0x9d0
go.etcd.io/etcd/etcdmain.startEtcd(0xc00016c580, 0x1092f33, 0x6, 0xc0001bd801, 0x2)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:144 +0x2f71
go.etcd.io/etcd/etcdmain.Main()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46 +0x38
main.main()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28 +0x20

Why is the snap file named "0000000000000003-000000000003d0aa.snap", but the recovery needs "/var/lib/etcd/member/snap/000000000003d0aa.snap.db"?

@lynk-coder

I also encountered the same problem. I ran Kubernetes 1.18.3 in a virtual machine; after several accidental power cuts and unexpected shutdowns, the "***.snap.db" file was lost, so my entire Kubernetes cluster could not start. I hope this problem can be solved as soon as possible, and I hope the Kubernetes team is informed as soon as possible.
This problem prevented me from putting my new Kubernetes cluster into production.
Thanks a million!

@lbogdan

lbogdan commented Aug 6, 2020

Related kubernetes issue: kubernetes/kubernetes#88574 .

@jpbetz
Contributor

jpbetz commented Aug 6, 2020

@vekergu snap.db files are different from .snap files. When a member attempts a "snapshot recovery", the leader sends it a full copy of the database, which the member stores as a snap.db file. If a member is restarted during a "snapshot recovery" process, the member will notice that it needs the file and attempt to load it. I don't think etcd is confusing the files. I think the .snap.db is legitimately missing.
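
To illustrate the naming that confused @vekergu above: .snap files are named "<term>-<index>.snap", while the snapshot database is looked up by index alone as "<index>.snap.db". A minimal sketch of that derivation (not the exact etcd source; the function name here is illustrative):

package main

import (
	"fmt"
	"path/filepath"
)

// dbFilePath shows roughly how the expected snapshot-database path is built:
// the snapshot index, zero-padded to 16 hex digits, with a ".snap.db" suffix.
func dbFilePath(snapDir string, snapshotIndex uint64) string {
	return filepath.Join(snapDir, fmt.Sprintf("%016x.snap.db", snapshotIndex))
}

func main() {
	// Snapshot index 250026 == 0x3d0aa, matching the log line above.
	fmt.Println(dbFilePath("/var/lib/etcd/member/snap", 250026))
	// Prints: /var/lib/etcd/member/snap/000000000003d0aa.snap.db
}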

We fixed one issue related to the order of operations that happen on startup in 3.4.8 (xref: #11888). But this appears to be a separate problem (note the reports of it occurring on 3.4.2 and 3.4.9).

Any ideas @jingyih, @ptabor?

@lbogdan

lbogdan commented Aug 6, 2020

In my case it happens in a single control plane node kubernetes / single node etcd installation, so the thing about the leader sending the snap.db to the member doesn't even apply, does it?

@jpbetz
Contributor

jpbetz commented Aug 7, 2020

In my case it happens in a single control plane node kubernetes / single node etcd installation, so the thing about the leader sending the snap.db to the member doesn't even apply, does it?

Did the log output in your case directly say that the snap.db file is missing? I didn't see that.

@lbogdan

lbogdan commented Aug 7, 2020

Did the log output in your case directly say that the snap.db file is missing? I didn't see that.

Sorry, I forgot to share my logs, but I get exactly the same error:

{"level":"info","ts":"2020-08-07T22:21:50.941Z","caller":"embed/etcd.go:299","msg":"starting an etcd server","etcd-version":"3.4.9","git-sha":"54ba95891","go-version":"go1.12.17","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"master-0","data-dir":"/var/lib/etcd.bck","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd.bck/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.1.100:2398"],"listen-peer-urls":["https://192.168.1.100:2398"],"advertise-client-urls":["https://192.168.1.100:2397"],"listen-client-urls":["https://127.0.0.1:2397","https://192.168.1.100:2397"],"listen-metrics-urls":["http://127.0.0.1:2399"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2020-08-07T22:21:50.944Z","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"/var/lib/etcd.bck/member/snap/db","took":"2.694476ms"}
{"level":"info","ts":"2020-08-07T22:21:51.431Z","caller":"etcdserver/server.go:451","msg":"recovered v2 store from snapshot","snapshot-index":2030213,"snapshot-size":"7.5 kB"}
{"level":"warn","ts":"2020-08-07T22:21:51.433Z","caller":"snap/db.go:92","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":2030213,"snapshot-file-path":"/var/lib/etcd.bck/member/snap/00000000001efa85.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2020-08-07T22:21:51.433Z","caller":"etcdserver/server.go:462","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/etcdserver.NewServer\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:462\ngo.etcd.io/etcd/embed.StartEtcd\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/embed/etcd.go:211\ngo.etcd.io/etcd/etcdmain.startEtcd\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:302\ngo.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:144\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}
panic: failed to recover v3 backend from snapshot
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc0494e]

goroutine 1 [running]:
go.etcd.io/etcd/etcdserver.NewServer.func1(0xc00045ef50, 0xc00045cf48)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:334 +0x3e
panic(0xee36c0, 0xc0001720e0)
        /usr/local/go/src/runtime/panic.go:522 +0x1b5
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0000ce420, 0xc0001d8d40, 0x1, 0x1)
        /home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/zap@v1.10.0/zapcore/entry.go:229 +0x546
go.uber.org/zap.(*Logger).Panic(0xc000242960, 0x10bd441, 0x2a, 0xc0001d8d40, 0x1, 0x1)
        /home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/zap@v1.10.0/logger.go:225 +0x7f
go.etcd.io/etcd/etcdserver.NewServer(0x7ffde0354294, 0x8, 0x0, 0x0, 0x0, 0x0, 0xc000254d80, 0x1, 0x1, 0xc000254f00, ...)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:462 +0x3d67
go.etcd.io/etcd/embed.StartEtcd(0xc0002c6000, 0xc0002c6580, 0x0, 0x0)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/embed/etcd.go:211 +0x9d0
go.etcd.io/etcd/etcdmain.startEtcd(0xc0002c6000, 0x1092f33, 0x6, 0xc000255601, 0x2)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:144 +0x2f71
go.etcd.io/etcd/etcdmain.Main()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46 +0x38
main.main()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28 +0x20

@stale

stale bot commented Nov 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 6, 2020
@gsakun

gsakun commented Nov 6, 2020

same problem

2020-11-06 06:32:51.257031 I | etcdmain: etcd Version: 3.3.10
2020-11-06 06:32:51.257119 I | etcdmain: Git SHA: 27fc7e2
2020-11-06 06:32:51.257125 I | etcdmain: Go Version: go1.10.4
.......
2020-11-06 06:32:51.476967 I | etcdserver: recovered store from snapshot at index 164056893
2020-11-06 06:32:51.520022 C | etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xb8cb90]

goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc4202abca0, 0xc4202ab758)
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:291 +0x40
panic(0xde0ce0, 0xc4201f9090)
	/usr/local/go/src/runtime/panic.go:502 +0x229
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc42018d440, 0xfe8789, 0x2a, 0xc4202ab7f8, 0x1, 0x1)
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x162
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0x7ffdf037b4bd, 0xf, 0x0, 0x0, 0x0, 0x0, 0xc4202f2b00, 0x1, 0x1, 0xc4202f2c00, ...)
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:386 +0x26bb
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc420284900, 0xc420284d80, 0x0, 0x0)
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:179 +0x811
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc420284900, 0xfc62b7, 0x6, 0xc4202acd01, 0x2)
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:181 +0x40
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:102 +0x1369
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:46 +0x3f
main.main()
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20

@stale stale bot removed the stale label Nov 6, 2020
@mamiapatrick

Same problem!!! Has anyone fixed this issue? There are many threads on this, but no real solutions.

@tangcong
Contributor

tangcong commented Nov 25, 2020

@vekergu snap.db files are different from .snap files. When a member attempts a "snapshot recovery", the leader sends it a full copy of the database, which the member stores as a snap.db file. If a member is restarted during a "snapshot recovery" process, the member will notice that it needs the file and attempt to load it. I don't think etcd is confusing the files. I think the .snap.db is legitimately missing.

We fixed one issue related to the order of operations that happen on startup in 3.4.8 (xref: #11888). But this appears to be a separate problem (note the reports of it occurring on 3.4.2 and 3.4.9).

Any ideas @jingyih, @ptabor?

The same error occurred in a one-node cluster. It seems that the invariant (snapshot.Metadata.Index <= db.consistentIndex) was violated after the etcd node powered off. It is weird: etcd commits the metadata (consistent index) before creating a snapshot. I am trying to add some logs to reproduce it, but have not been successful so far. The safety of fsync depends on whether the file system forces the disk cache to be flushed to the hardware. @mamiapatrick, what file system are you using?

	clone := s.v2store.Clone()
	// commit kv to write metadata (for example: consistent index) to disk.
	// KV().commit() updates the consistent index in backend.
	// All operations that update consistent index must be called sequentially
	// from applyAll function.
	// So KV().Commit() cannot run in parallel with apply. It has to be called outside
	// the go routine created below.
	s.KV().Commit()
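
To check that invariant directly on a broken member, a small standalone diagnostic can read the consistent index out of the bolt db and compare it with the index encoded in the newest .snap file name. This is only a sketch, not part of etcd; the "meta" bucket and "consistent_index" key are what etcd 3.4 stores, so treat those names as assumptions for other versions:

package main

import (
	"encoding/binary"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open the backend database read-only so nothing is modified.
	db, err := bolt.Open("/var/lib/etcd/member/snap/db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("meta")) // assumption: bucket name used by etcd 3.4
		if b == nil {
			return fmt.Errorf("meta bucket not found")
		}
		v := b.Get([]byte("consistent_index")) // assumption: key name used by etcd 3.4
		if len(v) != 8 {
			return fmt.Errorf("unexpected consistent_index value: %x", v)
		}
		// Compare this number with the hex index in the newest snap/*.snap file name.
		fmt.Printf("db consistent index: %d\n", binary.BigEndian.Uint64(v))
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}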

@mamiapatrick

@tangcong thanks for your help. I have really struggled with this, and despite various research and several threads, it seems the issue has not yet been resolved by anyone.

For your question:
1. My cluster has a single node for etcd-server, so there are not many members, only that one.
2. The issue is on my Rancher node (RancherOS) where I set up RKE, and the filesystem is ext4.

Hope to hear from you soon!

@tangcong
Contributor

tangcong commented Nov 26, 2020

@mamiapatrick thanks. I am trying to inject pod-failure and container-kill errors into an etcd cluster, but it seems very hard to reproduce. Do you have any advice on how to reproduce it?

@mamiapatrick

@tangcong
I don't really know how to reproduce it. It seems the error appears most of the time after a brutal reboot such as a power failure. In my case it is the etcd of the Rancher container (I install Kubernetes through Rancher and RKE, and to install Rancher you have to deploy a rancher/rancher container) that no longer starts, not the etcd of the Kubernetes cluster node itself.

@deeco

deeco commented Jan 6, 2021

Two broken files remain in the wal and snap folders; removing them does nothing, as they are automatically skipped. Is it possible to restore or remove snap or wal files to get back to a previous state? The issue seems to be that the index number references a particular file. The snap folder has 4 files and the wal folder has 5. Is the 56 in the file name linked between the snap and wal files?

It appears a snap with the f067 reference is missing?

wal folder content

drwx------ 4 root root 4.0K Jan 6 13:57 ..
-rw------- 1 root root 62M Jan 6 13:57 0000000000000053-0000000000619780.wal
-rw------- 1 root root 62M Jan 6 13:57 0000000000000054-000000000062ce9c.wal
-rw------- 1 root root 62M Jan 6 13:57 0000000000000056-00000000006540e6.wal
-rw------- 1 root root 62M Jan 6 13:57 0.tmp
-rw------- 1 root root 62M Jan 6 13:57 0000000000000052-0000000000605f89.wal
-rw-r--r-- 1 root root 62M Jan 6 13:57 0000000000000012-000000000017c929.wal.broken
drwx------ 2 root root 4.0K Jan 6 13:57 .
-rw------- 1 root root 62M Jan 6 13:57 0000000000000055-000000000064088b.wal

snap folder contents

-rw-r--r-- 1 root root 10K Jan 6 13:57 000000000000000a-000000000065c956.snap
-rw-r--r-- 1 root root 10K Jan 6 13:57 000000000000000a-0000000000657b34.snap
-rw------- 1 root root 33M Jan 6 13:57 db
-rw-r--r-- 1 root root 10K Jan 6 13:57 000000000000000a-000000000065f067.snap.broken
-rw-r--r-- 1 root root 10K Jan 6 13:57 000000000000000a-000000000065a245.snap
drwx------ 2 root root 4.0K Jan 6 13:57 .
-rw-r--r-- 1 root root 10K Jan 6 13:57 000000000000000a-0000000000655423.snap

log from etcd container

2021-01-06 14:57:59.838738 I | etcdserver: recovered store from snapshot at index 6670678
2021-01-06 14:57:59.853464 C | etcdserver: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
panic: runtime error: invalid memory address or nil pointer dereference

@stale

stale bot commented Apr 7, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2021
@stale stale bot closed this as completed Apr 29, 2021
@arthurzenika

Bumping this issue, as it doesn't seem to be solved and no workaround is documented. I am impacted by this too.

@ptabor
Contributor

ptabor commented May 20, 2021

I feel a lot more comfortable about updating consistent_index after #12855 was submitted.
Before it, there were administrative operations (e.g. adding members) that were not reflected in cindex but could lead to a higher value being written into the snapshot.

@arthurzenika

@ptabor is that feature likely to facilitate recovery of etcd data from previous versions of etcd? I guess that PR is available in v3.5.0-beta.3?

@ptabor
Contributor

ptabor commented May 20, 2021

The PR is in v3.5.0-beta.3. There is no code that would 'ignore' the mismatch between WAL & db.

For recovery from such a situation, I would consider taking the 'db' file alone from the member/snap directory
and trying to run the etcdctl snapshot restore command on top of it.
It should produce initial WAL logs compatible with the state of the 'db' file.

@arthurzenika

Thanks for pointing me in the direction of the etcdctl snapshot restore from the db file. I didn't know that was possible.

Trying that, I get:

/tmp/etcd-download-test/etcdctl --endpoints=localhost:2379 snapshot restore /tmp/member/snap/db
{"level":"info","ts":1621516681.2776036,"caller":"snapshot/v3_snapshot.go:287","msg":"restoring snapshot","path":"/tmp/member/snap/db","wal-dir":"default.etcd/member/wal","data-dir":"default.etcd","snap-dir":"default.etcd/member/snap"}
Error: snapshot missing hash but --skip-hash-check=false

This produces a 200 MB db file. Starting etcd works, but the db looks empty (testing with /tmp/etcd-download-test/etcdctl --endpoints=localhost:2379 get / --prefix --keys-only).

I also tried with --skip-hash-check=true and got:

/tmp/etcd-download-test/etcdctl --endpoints=localhost:2379 snapshot restore /tmp/member/snap/db --skip-hash-check=true
{"level":"info","ts":1621516738.8462574,"caller":"snapshot/v3_snapshot.go:287","msg":"restoring snapshot","path":"/tmp/member/snap/db","wal-dir":"default.etcd/member/wal","data-dir":"default.etcd","snap-dir":"default.etcd/member/snap"}
{"level":"info","ts":1621516739.914219,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1621516740.501717,"caller":"snapshot/v3_snapshot.go:300","msg":"restored snapshot","path":"/tmp/member/snap/db","wal-dir":"default.etcd/member/wal","data-dir":"default.etcd","snap-dir":"default.etcd/member/snap"}

Again a 200 MB db file, but no keys... (maybe my test command is wrong?)

@ptabor
Contributor

ptabor commented May 20, 2021

Are you using the etcdctl snapshot restore --with-v3 variant?

@arthurzenika

I get Error: unknown flag: --with-v3. I couldn't find the option in the documentation. I tried ETCDCTL_API=3 from https://etcd.io/docs/v3.3/op-guide/recovery/, which didn't give any different result (empty key list).

@ptabor
Contributor

ptabor commented May 21, 2021

My mistake. --with-v3 is in: etcdctl/ctlv2/command/backup_command.go
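
For reference, that is the v2 etcdctl backup command; a sketch of the invocation (the paths are illustrative) would be:

ETCDCTL_API=2 etcdctl backup --data-dir /var/lib/etcd --backup-dir /tmp/etcd-backup --with-v3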

@ptabor
Contributor

ptabor commented May 21, 2021

How about:

./bin/etcdctl --endpoints=localhost:2379 get "" --prefix --keys-only

@arthurzenika

@ptabor with get "" there are no keys either. Thanks for following up!

@veerendra2

Somehow I was able to recover. I commented here in case someone is interested.

@tomkcpr

tomkcpr commented Jan 10, 2022

I have the same issue. However, 2/3 nodes are showing the above message and the entire cluster is down. (One node doesn't show the etcd message below, so I'm thinking that service is recoverable?) Services needed for cert renewal on the Kubernetes cluster cannot start because etcd has corrupt DB files. I do not have an explicitly taken etcd backup. The messages are the same:

stderr F panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
stderr F    panic: runtime error: invalid memory address or nil pointer dereference

However, I'm wondering if it's possible to take the db file and copy that over to the faulty nodes and what steps are needed to do so successfully:

/var/lib/etcd/member/snap/db

It wasn't clear to me how this can be done from the commands provided above so far. Most how-tos assume an explicit backup was taken beforehand for use with etcdctl snapshot restore. From the comments that @arthurlogilab posted, it didn't appear to work for him. This issue is occurring on an OpenShift + Kubernetes cluster.

@gaosong030431207

gaosong030431207 commented Jan 27, 2022

No real solution yet! I also encountered the same problem. My etcd version is 3.4.13-0.

@jacob-faber

I've the same issue. However, 2/3 nodes are showing the above message and the entire cluster is down. (One node doesn't show the etcd message below so I'm thinking that service is recoverable?) Services needed for cert renewal on the kubernetes cluster cannot start due to etcd having corrupt DB files. I do not have an explicitly taken etcd backup. Messages are the same:

stderr F panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
stderr F    panic: runtime error: invalid memory address or nil pointer dereference

However, I'm wondering if it's possible to take the db file and copy that over to the faulty nodes and what steps are needed to do so successfully:

/var/lib/etcd/member/snap/db

I didn't get clarity how this can be done from the above commands provided thus far. Most howto's assume an explicit backup was taken using etcdctl snapshot restore. From the comments that @arthurlogilab posted, it didn't appear to work for him. This issue is occurring on an OpenShift + Kubernetes cluster.

In the case of OpenShift, the restore procedure worked every time for me. Just don't forget to take backups every day and to verify them regularly.

It doesn't solve this problem, but backups of such critical components are essential.

@MrMYHuang

MrMYHuang commented Feb 4, 2022

I built a k8s cluster with kubeadm containing one control plane node and one worker node. The cluster runs a single etcd 3.5.1 server on the control plane node. Unfortunately, I have also hit the etcd error "failed to recover v3 backend from snapshot" several times.

I studied the etcd source files and found the condition that triggers this kind of error:

func RecoverSnapshotBackend(cfg config.ServerConfig, oldbe backend.Backend, snapshot raftpb.Snapshot, beExist bool, hooks *BackendHooks) (backend.Backend, error) {
	consistentIndex := uint64(0)
	if beExist {
		consistentIndex, _ = schema.ReadConsistentIndex(oldbe.BatchTx())
	}
	if snapshot.Metadata.Index <= consistentIndex {
		return oldbe, nil
	}
	oldbe.Close()
	return OpenSnapshotBackend(cfg, snap.New(cfg.Logger, cfg.SnapDir()), snapshot, hooks)
}

I.e. the condition snapshot.Metadata.Index <= db.consistentIndex is violated. It means the consistent index in the snap/db file is older than the one in one of the snap/*.snap files.
But in normal cases that should be impossible(?), because the following function ensures(?) that the consistent index is written to the snap/db file before a .snap file is created:
func (s *EtcdServer) snapshot(snapi uint64, confState raftpb.ConfState) {

Thus, the steps to reproduce a violation of the condition snapshot.Metadata.Index <= db.consistentIndex are unknown and hard to find...

Nevertheless, I propose a workaround to get etcd back to work:
(Note: this workaround does not guarantee there will be no data loss. Please make a backup and try at your own risk!)

sudo rm /var/lib/etcd/member/wal/*.wal

or

sudo rm /var/lib/etcd/member/snap/*.snap

Then, your etcd might start without crash.
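
For the backup step mentioned above, copying the whole data directory before removing anything is enough; for example (the path below is the default kubeadm layout, adjust as needed):

sudo cp -a /var/lib/etcd /var/lib/etcd.bak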

@cruzanstx

I build a k8s cluster with kubeadm containing one control plane node and one worker node. The cluster contains one etcd 3.5.1 server in the control plane node. Unfortunately, I also meet several times of etcd error "failed to recover v3 backend from snapshot".

I studied etcd source files and found the condition to trigger this kind of error:

func RecoverSnapshotBackend(cfg config.ServerConfig, oldbe backend.Backend, snapshot raftpb.Snapshot, beExist bool, hooks *BackendHooks) (backend.Backend, error) {
consistentIndex := uint64(0)
if beExist {
consistentIndex, _ = schema.ReadConsistentIndex(oldbe.BatchTx())
}
if snapshot.Metadata.Index <= consistentIndex {
return oldbe, nil
}
oldbe.Close()
return OpenSnapshotBackend(cfg, snap.New(cfg.Logger, cfg.SnapDir()), snapshot, hooks)
}

I.e. the condition snapshot.Metadata.Index <= db.consistentIndex is violated. It means the consistent index in snap/db file is older than the one in one of snap/*.snap files.
But in general cases, it's impossible(?), because the following program ensures(?) the consistent index is written to snap/db file before creating a .snap file:

func (s *EtcdServer) snapshot(snapi uint64, confState raftpb.ConfState) {

Thus, the steps to reproduce violation of the condition snapshot.Metadata.Index <= db.consistentIndex are unknown and difficult to find...
Nevertheless, I propose a workaround to make etcd back to work: (Notice, this workaround doesn't guarantee no data loss. Please make a backup and try at your risk!)

sudo rm /var/lib/etcd/member/wal/*.wal

or

sudo rm /var/lib/etcd/member/snap/*.snap

Then, your etcd might start without crash.

Removing *.wal and *.snap worked for me, thanks!

@wang-xiaowu

wang-xiaowu commented Dec 19, 2022

I build a k8s cluster with kubeadm containing one control plane node and one worker node. The cluster contains one etcd 3.5.1 server in the control plane node. Unfortunately, I also meet several times of etcd error "failed to recover v3 backend from snapshot".

I studied etcd source files and found the condition to trigger this kind of error:

func RecoverSnapshotBackend(cfg config.ServerConfig, oldbe backend.Backend, snapshot raftpb.Snapshot, beExist bool, hooks *BackendHooks) (backend.Backend, error) {
consistentIndex := uint64(0)
if beExist {
consistentIndex, _ = schema.ReadConsistentIndex(oldbe.BatchTx())
}
if snapshot.Metadata.Index <= consistentIndex {
return oldbe, nil
}
oldbe.Close()
return OpenSnapshotBackend(cfg, snap.New(cfg.Logger, cfg.SnapDir()), snapshot, hooks)
}

I.e. the condition snapshot.Metadata.Index <= db.consistentIndex is violated. It means the consistent index in snap/db file is older than the one in one of snap/*.snap files.
But in general cases, it's impossible(?), because the following program ensures(?) the consistent index is written to snap/db file before creating a .snap file:

func (s *EtcdServer) snapshot(snapi uint64, confState raftpb.ConfState) {

Thus, the steps to reproduce violation of the condition snapshot.Metadata.Index <= db.consistentIndex are unknown and difficult to find...
Nevertheless, I propose a workaround to make etcd back to work: (Notice, this workaround doesn't guarantee no data loss. Please make a backup and try at your risk!)

sudo rm /var/lib/etcd/member/wal/*.wal

or

sudo rm /var/lib/etcd/member/snap/*.snap

Then, your etcd might start without crash.

It's working, and my data location is /var/lib/rancher/k3s/server/db/etcd/member/.

@vivisidea

None of the above solutions worked for me. Here is my solution:
just remove the broken member from the cluster and re-add it as a new member.

https://etcd.io/docs/v3.5/tutorials/how-to-deal-with-membership/
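
Roughly, the steps from that tutorial look like the following (the member ID, name, and URL are placeholders; also clear the broken member's data dir before it rejoins):

etcdctl member list
etcdctl member remove <MEMBER_ID>
etcdctl member add <MEMBER_NAME> --peer-urls=https://<PEER_IP>:2380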

@xgenvn

xgenvn commented Oct 2, 2023

The workaround of removing *.snap and *.wal worked for me for some time. Recently it stopped working, and etcd got stuck connecting to the server along with a maintenance alarm issue.
However, I was able to forward the etcd client port and take a dump from there, then later create a new PVC and restore it back.

@ttron

ttron commented Mar 22, 2024

Removing *.snap and *.wal can get the etcd pods to start successfully. However, you will lose all your data this way. I accidentally found another way to fix this: copy db to the most recent <index>.snap.db and restart kubelet and containerd (or CRI-O, Docker).
Use the following for example:

# ls -lah /var/lib/etcd/member/snap/
total 49M
drwx------ 2 root root 295 Mar 22 14:58 .
drwx------ 4 root root  29 Mar 21 12:54 ..
-rw-r--r-- 1 root root 10K Mar 22 11:19 000000000000000a-000000000418625a.snap
-rw-r--r-- 1 root root 10K Mar 22 12:14 000000000000000a-000000000418896b.snap
-rw-r--r-- 1 root root 10K Mar 22 13:08 000000000000000a-000000000418b07c.snap
-rw-r--r-- 1 root root 10K Mar 22 14:03 000000000000000a-000000000418d78d.snap
-rw-r--r-- 1 root root 10K Mar 22 14:57 000000000000000a-000000000418fe9e.snap
-rw------- 1 root root 30M Mar 22 15:44 db

so the most recent snap will be 000000000000000a-000000000418fe9e.snap

# cp db 000000000000000a-000000000418fe9e.snap.db
# systemctl restart kubelet
# systemctl restart containerd

The etcd pod and the k8s control plane pods should then start normally.
