Etcd start failed after power off and restart #11949

Closed
xieyanker opened this issue May 27, 2020 · 37 comments

@xieyanker

I have an etcd cluster for Kubernetes which has 3 members. All machines were powered off yesterday, and all of the etcd members failed to start after power was restored.

My etcd version is 3.4.2.

The error on member01 is failed to find database snapshot file (snap: snapshot file doesn't exist), and the errors on member02 and member03 are the same: freepages: failed to get all reachable pages.

The full logs are as follows:

member01:

May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680835 I | etcdmain: etcd Version: 3.4.2
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680840 I | etcdmain: Git SHA: a7cf1ca
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680845 I | etcdmain: Go Version: go1.12.5
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680849 I | etcdmain: Go OS/Arch: linux/amd64
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680855 I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680926 W | etcdmain: found invalid file/dir .bash_logout under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680937 W | etcdmain: found invalid file/dir .bashrc under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680942 W | etcdmain: found invalid file/dir .profile under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680952 N | etcdmain: the server is already initialized as member before, starting as etcd member...
May 26 22:00:14 mgt01 etcd[6323]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.680997 I | embed: peerTLS: cert = /etc/ssl/etcd/ssl/member-mgt01.pem, key = /etc/ssl/etcd/ssl/member-mgt01-key.pem, trusted-ca = /etc/ssl/etcd/ssl/ca.pem, client-cert-auth = true, crl-file =
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681694 I | embed: name = etcd-mgt01
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681708 I | embed: data dir = /var/lib/etcd
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681713 I | embed: member dir = /var/lib/etcd/member
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681719 I | embed: heartbeat = 100ms
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681724 I | embed: election = 1000ms
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681728 I | embed: snapshot count = 10000
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681737 I | embed: advertise client URLs = https://10.61.109.41:2379
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681746 I | embed: initial advertise peer URLs = https://10.61.109.41:2380
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.681753 I | embed: initial cluster =
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.689611 I | etcdserver: recovered store from snapshot at index 21183302
May 26 22:00:14 mgt01 etcd[6323]: 2020-05-26 14:00:14.693186 C | etcdserver: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
May 26 22:00:14 mgt01 etcd[6323]: panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
May 26 22:00:14 mgt01 etcd[6323]:         panic: runtime error: invalid memory address or nil pointer dereference
May 26 22:00:14 mgt01 etcd[6323]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc2c6be]
May 26 22:00:14 mgt01 etcd[6323]: goroutine 1 [running]:
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/etcdserver.NewServer.func1(0xc0002eef50, 0xc0002ecf48)
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/etcdserver/server.go:335 +0x3e
May 26 22:00:14 mgt01 etcd[6323]: panic(0xed6540, 0xc00062a080)
May 26 22:00:14 mgt01 etcd[6323]:         /usr/local/go/src/runtime/panic.go:522 +0x1b5
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc0001fe7c0, 0x10af294, 0x2a, 0xc0002ed018, 0x1, 0x1)
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x135
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/etcdserver.NewServer(0xc00004208a, 0xa, 0x0, 0x0, 0x0, 0x0, 0xc000198d00, 0x1, 0x1, 0xc000198e80, ...)
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/etcdserver/server.go:456 +0x42f7
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/embed.StartEtcd(0xc0001bc580, 0xc0001bcb00, 0x0, 0x0)
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/embed/etcd.go:211 +0x9d0
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/etcdmain.startEtcd(0xc0001bc580, 0x10849de, 0x6, 0x1, 0xc00020f1d0)
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/etcdmain/etcd.go:302 +0x40
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/etcdmain/etcd.go:144 +0x2f71
May 26 22:00:14 mgt01 etcd[6323]: go.etcd.io/etcd/etcdmain.Main()
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/etcdmain/main.go:46 +0x38
May 26 22:00:14 mgt01 etcd[6323]: main.main()
May 26 22:00:14 mgt01 etcd[6323]:         /opt/go/src/go.etcd.io/etcd/main.go:28 +0x20

member02:

May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749838 I | etcdmain: etcd Version: 3.4.2
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749845 I | etcdmain: Git SHA: a7cf1ca
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749850 I | etcdmain: Go Version: go1.12.5
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749855 I | etcdmain: Go OS/Arch: linux/amd64
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749861 I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749973 W | etcdmain: found invalid file/dir .bash_logout under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749984 W | etcdmain: found invalid file/dir .bashrc under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.749992 W | etcdmain: found invalid file/dir .profile under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.750001 N | etcdmain: the server is already initialized as member before, starting as etcd member...
May 26 22:00:13 mgt02 etcd[3482]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.750037 I | embed: peerTLS: cert = /etc/ssl/etcd/ssl/member-mgt02.pem, key = /etc/ssl/etcd/ssl/member-mgt02-key.pem, trusted-ca = /etc/ssl/etcd/ssl/ca.pem, client-cert-auth = true, crl-file =
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751369 I | embed: name = etcd-mgt02
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751385 I | embed: data dir = /var/lib/etcd
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751391 I | embed: member dir = /var/lib/etcd/member
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751396 I | embed: heartbeat = 100ms
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751401 I | embed: election = 1000ms
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751406 I | embed: snapshot count = 10000
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751415 I | embed: advertise client URLs = https://10.61.109.42:2379
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751421 I | embed: initial advertise peer URLs = https://10.61.109.42:2380
May 26 22:00:13 mgt02 etcd[3482]: 2020-05-26 14:00:13.751428 I | embed: initial cluster =
May 26 22:00:13 mgt02 etcd[3482]: panic: freepages: failed to get all reachable pages (page 6148948888203174444: out of bounds: 14622)
May 26 22:00:13 mgt02 etcd[3482]: goroutine 145 [running]:
May 26 22:00:13 mgt02 etcd[3482]: go.etcd.io/etcd/vendor/go.etcd.io/bbolt.(*DB).freepages.func2(0xc00023c120)
May 26 22:00:13 mgt02 etcd[3482]:         /opt/go/src/go.etcd.io/etcd/vendor/go.etcd.io/bbolt/db.go:1003 +0xe5
May 26 22:00:13 mgt02 etcd[3482]: created by go.etcd.io/etcd/vendor/go.etcd.io/bbolt.(*DB).freepages
May 26 22:00:13 mgt02 etcd[3482]:         /opt/go/src/go.etcd.io/etcd/vendor/go.etcd.io/bbolt/db.go:1001 +0x1b5

Finally, we recovered most of the data from our backup file. However, we would like to know why this happened. Is there any way to avoid this issue? Thanks!!!

@tangcong
Contributor

That is unfortunate. It seems all three members' db files are broken. Can you use the bolt tool (for example, see issue #10010) to check your db files and provide complete etcd logs from before and after the failure? Is there a problem with the disk?
/cc @gyuho @jingyih @xiang90 @jpbetz
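
For reference, a basic integrity check with the bbolt CLI (assuming it is installed; the data-dir path below is the one from the logs above) would look like:

bbolt check /var/lib/etcd/member/snap/db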

@vekergu

vekergu commented Jun 26, 2020

My etcd version is 3.4.9

# cd /var/lib/etcd/
# tree
.
└── member
    ├── snap
    │   ├── 0000000000000003-0000000000033466.snap
    │   ├── 0000000000000003-0000000000035b77.snap
    │   ├── 0000000000000003-0000000000038288.snap
    │   ├── 0000000000000003-000000000003a999.snap
    │   ├── 0000000000000003-000000000003d0aa.snap
    │   └── db
    └── wal
        ├── 0000000000000000-0000000000000000.wal
        ├── 0000000000000001-0000000000014a00.wal
        ├── 0000000000000002-000000000002a59b.wal
        └── 0.tmp 

The etcd start log is:

{"level":"info","ts":"2020-06-26T13:11:02.536+0800","caller":"etcdmain/etcd.go:134","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{"level":"info","ts":"2020-06-26T13:11:02.536+0800","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.145.10:2380"]}
{"level":"info","ts":"2020-06-26T13:11:02.536+0800","caller":"embed/etcd.go:465","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2020-06-26T13:11:02.537+0800","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.145.10:2379"]}
{"level":"info","ts":"2020-06-26T13:11:02.537+0800","caller":"embed/etcd.go:299","msg":"starting an etcd server","etcd-version":"3.4.9","git-sha":"54ba95891","go-version":"go1.12.17","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"192.168.145.10","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":true,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.145.10:2380"],"listen-peer-urls":["https://192.168.145.10:2380"],"advertise-client-urls":["https://192.168.145.10:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.145.10:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2020-06-26T13:11:02.538+0800","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"571.607µs"}
{"level":"info","ts":"2020-06-26T13:11:02.913+0800","caller":"etcdserver/server.go:451","msg":"recovered v2 store from snapshot","snapshot-index":250026,"snapshot-size":"7.5 kB"}
{"level":"warn","ts":"2020-06-26T13:11:02.915+0800","caller":"snap/db.go:92","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":250026,"snapshot-file-path":"/var/lib/etcd/member/snap/000000000003d0aa.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2020-06-26T13:11:02.915+0800","caller":"etcdserver/server.go:462","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/etcdserver.NewServer\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:462\ngo.etcd.io/etcd/embed.StartEtcd\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/embed/etcd.go:211\ngo.etcd.io/etcd/etcdmain.startEtcd\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:302\ngo.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:144\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}
panic: failed to recover v3 backend from snapshot
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc0494e]

goroutine 1 [running]:
go.etcd.io/etcd/etcdserver.NewServer.func1(0xc0003a0f50, 0xc00039ef48)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:334 +0x3e
panic(0xee36c0, 0xc000046200)
        /usr/local/go/src/runtime/panic.go:522 +0x1b5
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000164420, 0xc0001527c0, 0x1, 0x1)
        /home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/zap@v1.10.0/zapcore/entry.go:229 +0x546
go.uber.org/zap.(*Logger).Panic(0xc0002fc180, 0x10bd441, 0x2a, 0xc0001527c0, 0x1, 0x1)
        /home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/zap@v1.10.0/logger.go:225 +0x7f
go.etcd.io/etcd/etcdserver.NewServer(0x7ffdcc63c732, 0xe, 0x0, 0x0, 0x0, 0x0, 0xc0001bd000, 0x1, 0x1, 0xc0001bd180, ...)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:462 +0x3d67
go.etcd.io/etcd/embed.StartEtcd(0xc00016c580, 0xc00016cb00, 0x0, 0x0)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/embed/etcd.go:211 +0x9d0
go.etcd.io/etcd/etcdmain.startEtcd(0xc00016c580, 0x1092f33, 0x6, 0xc0001bd801, 0x2)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:144 +0x2f71
go.etcd.io/etcd/etcdmain.Main()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46 +0x38
main.main()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28 +0x20

Why is the snap file named "0000000000000003-000000000003d0aa.snap", but the recovery needs "/var/lib/etcd/member/snap/000000000003d0aa.snap.db"?

@lynk-coder

I also encountered the same problem. I ran Kubernetes 1.18.3 in a virtual machine; after several accidental power cuts and unexpected shutdowns, the "***.snap.db" file was lost, so my entire Kubernetes cluster could not start. I hope this problem can be solved as soon as possible, and I hope the Kubernetes team is informed as soon as possible.
This problem prevented me from putting my new Kubernetes cluster into production.
Thanks a million!

@lbogdan

lbogdan commented Aug 6, 2020

Related kubernetes issue: kubernetes/kubernetes#88574 .

@jpbetz
Contributor

jpbetz commented Aug 6, 2020

@vekergu snap.db files are different from .snap files. When a member attempts a "snapshot recovery", the leader sends it a full copy of the database, which the member stores as a snap.db file. If a member is restarted during a "snapshot recovery" process, the member will notice that it needs the file and attempt to load it. I don't think etcd is confusing the files. I think the .snap.db is legitimately missing.
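
To illustrate the naming that confused @vekergu above: .snap files are named "<term>-<index>.snap", while the snapshot database is looked up by index alone as "<index>.snap.db". A minimal sketch of that derivation (not the exact etcd source; the function name here is illustrative):

package main

import (
	"fmt"
	"path/filepath"
)

// dbFilePath shows roughly how the expected snapshot-database path is built:
// the snapshot index, zero-padded to 16 hex digits, with a ".snap.db" suffix.
func dbFilePath(snapDir string, snapshotIndex uint64) string {
	return filepath.Join(snapDir, fmt.Sprintf("%016x.snap.db", snapshotIndex))
}

func main() {
	// Snapshot index 250026 == 0x3d0aa, matching the log line above.
	fmt.Println(dbFilePath("/var/lib/etcd/member/snap", 250026))
	// Prints: /var/lib/etcd/member/snap/000000000003d0aa.snap.db
}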

We fixed one issue related to the order of operations that happen on startup in 3.4.8 (xref: #11888). But this appears to be a separate problem (note the reports of it occurring on 3.4.2 and 3.4.9).

Any ideas @jingyih, @ptabor?

@lbogdan

lbogdan commented Aug 6, 2020

In my case it happens in a single control plane node kubernetes / single node etcd installation, so the thing about the leader sending the snap.db to the member doesn't even apply, does it?

@jpbetz
Contributor

jpbetz commented Aug 7, 2020

In my case it happens in a single control plane node kubernetes / single node etcd installation, so the thing about the leader sending the snap.db to the member doesn't even apply, does it?

Did the log output in your case directly say that the snap.db file is missing? I didn't see that.

@lbogdan

lbogdan commented Aug 7, 2020

Did the log output in your case directly say that the snap.db file is missing? I didn't see that.

Sorry, I forgot to share my logs, but I get exactly the same error:

{"level":"info","ts":"2020-08-07T22:21:50.941Z","caller":"embed/etcd.go:299","msg":"starting an etcd server","etcd-version":"3.4.9","git-sha":"54ba95891","go-version":"go1.12.17","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"master-0","data-dir":"/var/lib/etcd.bck","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd.bck/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.1.100:2398"],"listen-peer-urls":["https://192.168.1.100:2398"],"advertise-client-urls":["https://192.168.1.100:2397"],"listen-client-urls":["https://127.0.0.1:2397","https://192.168.1.100:2397"],"listen-metrics-urls":["http://127.0.0.1:2399"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2020-08-07T22:21:50.944Z","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"/var/lib/etcd.bck/member/snap/db","took":"2.694476ms"}
{"level":"info","ts":"2020-08-07T22:21:51.431Z","caller":"etcdserver/server.go:451","msg":"recovered v2 store from snapshot","snapshot-index":2030213,"snapshot-size":"7.5 kB"}
{"level":"warn","ts":"2020-08-07T22:21:51.433Z","caller":"snap/db.go:92","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":2030213,"snapshot-file-path":"/var/lib/etcd.bck/member/snap/00000000001efa85.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2020-08-07T22:21:51.433Z","caller":"etcdserver/server.go:462","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/etcdserver.NewServer\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:462\ngo.etcd.io/etcd/embed.StartEtcd\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/embed/etcd.go:211\ngo.etcd.io/etcd/etcdmain.startEtcd\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:302\ngo.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:144\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}
panic: failed to recover v3 backend from snapshot
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc0494e]

goroutine 1 [running]:
go.etcd.io/etcd/etcdserver.NewServer.func1(0xc00045ef50, 0xc00045cf48)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:334 +0x3e
panic(0xee36c0, 0xc0001720e0)
        /usr/local/go/src/runtime/panic.go:522 +0x1b5
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0000ce420, 0xc0001d8d40, 0x1, 0x1)
        /home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/zap@v1.10.0/zapcore/entry.go:229 +0x546
go.uber.org/zap.(*Logger).Panic(0xc000242960, 0x10bd441, 0x2a, 0xc0001d8d40, 0x1, 0x1)
        /home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/zap@v1.10.0/logger.go:225 +0x7f
go.etcd.io/etcd/etcdserver.NewServer(0x7ffde0354294, 0x8, 0x0, 0x0, 0x0, 0x0, 0xc000254d80, 0x1, 0x1, 0xc000254f00, ...)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:462 +0x3d67
go.etcd.io/etcd/embed.StartEtcd(0xc0002c6000, 0xc0002c6580, 0x0, 0x0)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/embed/etcd.go:211 +0x9d0
go.etcd.io/etcd/etcdmain.startEtcd(0xc0002c6000, 0x1092f33, 0x6, 0xc000255601, 0x2)
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:144 +0x2f71
go.etcd.io/etcd/etcdmain.Main()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46 +0x38
main.main()
        /tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28 +0x20

@stale

stale bot commented Nov 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 6, 2020
@gsakun

gsakun commented Nov 6, 2020

same problem

2020-11-06 06:32:51.257031 I | etcdmain: etcd Version: 3.3.10
2020-11-06 06:32:51.257119 I | etcdmain: Git SHA: 27fc7e2
2020-11-06 06:32:51.257125 I | etcdmain: Go Version: go1.10.4
.......
2020-11-06 06:32:51.476967 I | etcdserver: recovered store from snapshot at index 164056893
2020-11-06 06:32:51.520022 C | etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xb8cb90]

goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc4202abca0, 0xc4202ab758)
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:291 +0x40
panic(0xde0ce0, 0xc4201f9090)
	/usr/local/go/src/runtime/panic.go:502 +0x229
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc42018d440, 0xfe8789, 0x2a, 0xc4202ab7f8, 0x1, 0x1)
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x162
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0x7ffdf037b4bd, 0xf, 0x0, 0x0, 0x0, 0x0, 0xc4202f2b00, 0x1, 0x1, 0xc4202f2c00, ...)
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:386 +0x26bb
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc420284900, 0xc420284d80, 0x0, 0x0)
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:179 +0x811
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc420284900, 0xfc62b7, 0x6, 0xc4202acd01, 0x2)
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:181 +0x40
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:102 +0x1369
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:46 +0x3f
main.main()
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20

@stale stale bot removed the stale label Nov 6, 2020
@mamiapatrick

Same problem!!! Has anyone fixed this issue? There are many threads on this, but no real solutions.

@tangcong
Contributor

tangcong commented Nov 25, 2020

@vekergu snap.db files are different from .snap files. When a member attempts a "snapshot recovery", the leader sends it a full copy of the database, which the member stores as a snap.db file. If a member is restarted during a "snapshot recovery" process, the member will notice that it needs the file and attempt to load it. I don't think etcd is confusing the files. I think the .snap.db is legitimately missing.

We fixed one issue related to the order of operations that happen on startup in 3.4.8 (xref: #11888). But this appears to be a separate problem (note the reports of it occurring on 3.4.2 and 3.4.9).

Any ideas @jingyih, @ptabor?

The same error occurred in a one-node cluster. It seems that the invariant (snapshot.Metadata.Index <= db.consistentIndex) was violated after the etcd node powered off. It is weird: etcd commits the metadata (consistent index) before creating a snapshot. I am trying to add some logs to reproduce it, but have not been successful so far. The safety of fsync depends on whether the file system forces the disk cache to be flushed to the hardware. @mamiapatrick, what file system are you using?

	clone := s.v2store.Clone()
	// commit kv to write metadata (for example: consistent index) to disk.
	// KV().commit() updates the consistent index in backend.
	// All operations that update consistent index must be called sequentially
	// from applyAll function.
	// So KV().Commit() cannot run in parallel with apply. It has to be called outside
	// the go routine created below.
	s.KV().Commit()
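
To check that invariant directly on a broken member, a small standalone diagnostic can read the consistent index out of the bolt db and compare it with the index encoded in the newest .snap file name. This is only a sketch, not part of etcd; the "meta" bucket and "consistent_index" key are what etcd 3.4 stores, so treat those names as assumptions for other versions:

package main

import (
	"encoding/binary"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open the backend database read-only so nothing is modified.
	db, err := bolt.Open("/var/lib/etcd/member/snap/db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("meta")) // assumption: bucket name used by etcd 3.4
		if b == nil {
			return fmt.Errorf("meta bucket not found")
		}
		v := b.Get([]byte("consistent_index")) // assumption: key name used by etcd 3.4
		if len(v) != 8 {
			return fmt.Errorf("unexpected consistent_index value: %x", v)
		}
		// Compare this number with the hex index in the newest snap/*.snap file name.
		fmt.Printf("db consistent index: %d\n", binary.BigEndian.Uint64(v))
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}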

@mamiapatrick

@tangcong thanks for your help. I have really struggled with this, and despite various research and several threads, it seems the issue has not yet been resolved by anyone.

For your question:
1. My cluster has a single node for etcd-server, so there are not many members, only that one.
2. The issue is on my Rancher node (RancherOS) where I set up RKE, and the filesystem is ext4.

Hope to hear from you soon!

@tangcong
Contributor

tangcong commented Nov 26, 2020

@mamiapatrick thanks. I am trying to inject pod-failure and container-kill errors into an etcd cluster, but it seems very hard to reproduce. Do you have any advice on how to reproduce it?

@mamiapatrick

@tangcong
I don't really know how to reproduce it. It seems the error appears most of the time after a brutal reboot such as a power failure. In my case it is the etcd of the Rancher container (I install Kubernetes through Rancher and RKE, and to install Rancher you have to deploy a rancher/rancher container) that no longer starts, not the etcd of the Kubernetes cluster node itself.

@deeco

deeco commented Jan 6, 2021

Two broken files remain in the wal and snap folders; removing them does nothing, as they are automatically skipped. Is it possible to restore or remove snap or wal files to get back to a previous state? The issue seems to be that the index number references a particular file. The snap folder has 4 files and the wal folder has 5. Is the 56 in the file name linked between the snap and wal files?

It appears a snap with the f067 reference is missing?

wal folder content

drwx------ 4 root root 4.0K Jan 6 13:57 ..
-rw------- 1 root root 62M Jan 6 13:57 0000000000000053-0000000000619780.wal
-rw------- 1 root root 62M Jan 6 13:57 0000000000000054-000000000062ce9c.wal
-rw------- 1 root root 62M Jan 6 13:57 0000000000000056-00000000006540e6.wal
-rw------- 1 root root 62M Jan 6 13:57 0.tmp
-rw------- 1 root root 62M Jan 6 13:57 0000000000000052-0000000000605f89.wal
-rw-r--r-- 1 root root 62M Jan 6 13:57 0000000000000012-000000000017c929.wal.broken
drwx------ 2 root root 4.0K Jan 6 13:57 .
-rw------- 1 root root 62M Jan 6 13:57 0000000000000055-000000000064088b.wal

snap folder contents

-rw-r--r-- 1 root root 10K Jan 6 13:57 000000000000000a-000000000065c956.snap
-rw-r--r-- 1 root root 10K Jan 6 13:57 000000000000000a-0000000000657b34.snap
-rw------- 1 root root 33M Jan 6 13:57 db
-rw-r--r-- 1 root root 10K Jan 6 13:57 000000000000000a-000000000065f067.snap.broken
-rw-r--r-- 1 root root 10K Jan 6 13:57 000000000000000a-000000000065a245.snap
drwx------ 2 root root 4.0K Jan 6 13:57 .
-rw-r--r-- 1 root root 10K Jan 6 13:57 000000000000000a-0000000000655423.snap

log from etcd container

2021-01-06 14:57:59.838738 I | etcdserver: recovered store from snapshot at index 6670678
2021-01-06 14:57:59.853464 C | etcdserver: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
panic: runtime error: invalid memory address or nil pointer dereference

@stale

stale bot commented Apr 7, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2021
@stale stale bot closed this as completed Apr 29, 2021
@arthurzenika

Bumping this issue, as it doesn't seem to be solved and no workaround is documented. I am impacted by this too.

@ptabor
Contributor

ptabor commented May 20, 2021

I feel a lot more comfortable about updating consistent_index after #12855 was submitted.
Before it, there were administrative operations (e.g. adding members) that were not reflected in cindex but could lead to a higher value being written into the snapshot.

@arthurzenika

@ptabor is that feature likely to facilitate recovery of etcd data from previous versions of etcd? I guess that PR is available in v3.5.0-beta.3?

@ptabor
Contributor

ptabor commented May 20, 2021

The PR is in v3.5.0-beta.3. There is no code that would 'ignore' the mismatch between WAL & db.

For recovery from such a situation, I would consider taking the 'db' file alone from the member/snap directory
and trying to run the etcdctl snapshot restore command on top of it.
It should produce initial WAL logs compatible with the state of the 'db' file.

@arthurzenika

Thanks for pointing me in the direction of the etcdctl snapshot restore from the db file. I didn't know that was possible.

Trying that, I get:

/tmp/etcd-download-test/etcdctl --endpoints=localhost:2379 snapshot restore /tmp/member/snap/db
{"level":"info","ts":1621516681.2776036,"caller":"snapshot/v3_snapshot.go:287","msg":"restoring snapshot","path":"/tmp/member/snap/db","wal-dir":"default.etcd/member/wal","data-dir":"default.etcd","snap-dir":"default.etcd/member/snap"}
Error: snapshot missing hash but --skip-hash-check=false

This produces a 200 MB db file. Starting etcd works, but the db looks empty (testing with /tmp/etcd-download-test/etcdctl --endpoints=localhost:2379 get / --prefix --keys-only).

I also tried with --skip-hash-check=true and got:

/tmp/etcd-download-test/etcdctl --endpoints=localhost:2379 snapshot restore /tmp/member/snap/db --skip-hash-check=true
{"level":"info","ts":1621516738.8462574,"caller":"snapshot/v3_snapshot.go:287","msg":"restoring snapshot","path":"/tmp/member/snap/db","wal-dir":"default.etcd/member/wal","data-dir":"default.etcd","snap-dir":"default.etcd/member/snap"}
{"level":"info","ts":1621516739.914219,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1621516740.501717,"caller":"snapshot/v3_snapshot.go:300","msg":"restored snapshot","path":"/tmp/member/snap/db","wal-dir":"default.etcd/member/wal","data-dir":"default.etcd","snap-dir":"default.etcd/member/snap"}

Again a 200 MB db file, but no keys... (maybe my test command is wrong?)

@ptabor
Contributor

ptabor commented May 20, 2021

Are you using the etcdctl snapshot restore --with-v3 variant?

@arthurzenika

I get Error: unknown flag: --with-v3. I couldn't find the option in the documentation. I tried ETCDCTL_API=3 from https://etcd.io/docs/v3.3/op-guide/recovery/, which didn't give any different result (empty key list).

@ptabor
Contributor

ptabor commented May 21, 2021

My mistake. --with-v3 is in: etcdctl/ctlv2/command/backup_command.go
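
For reference, that is the v2 etcdctl backup command; a sketch of the invocation (the paths are illustrative) would be:

ETCDCTL_API=2 etcdctl backup --data-dir /var/lib/etcd --backup-dir /tmp/etcd-backup --with-v3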

@ptabor
Contributor

ptabor commented May 21, 2021

How about:

./bin/etcdctl --endpoints=localhost:2379 get "" --prefix --keys-only

@arthurzenika

@ptabor with get "" there are no keys either. Thanks for following up!

@veerendra2

Somehow I was able to recover. I commented here in case someone is interested.

@tomkcpr

tomkcpr commented Jan 10, 2022

I have the same issue. However, 2/3 nodes are showing the above message and the entire cluster is down. (One node doesn't show the etcd message below, so I'm thinking that service is recoverable?) Services needed for cert renewal on the Kubernetes cluster cannot start because etcd has corrupt DB files. I do not have an explicitly taken etcd backup. The messages are the same:

stderr F panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
stderr F    panic: runtime error: invalid memory address or nil pointer dereference

However, I'm wondering if it's possible to take the db file and copy that over to the faulty nodes and what steps are needed to do so successfully:

/var/lib/etcd/member/snap/db

It wasn't clear to me how this can be done from the commands provided above so far. Most how-tos assume an explicit backup was taken beforehand for use with etcdctl snapshot restore. From the comments that @arthurlogilab posted, it didn't appear to work for him. This issue is occurring on an OpenShift + Kubernetes cluster.

@gaosong030431207

gaosong030431207 commented Jan 27, 2022

No real solution yet! I also encountered the same problem. My etcd version is 3.4.13-0.

@jacob-faber

I've the same issue. However, 2/3 nodes are showing the above message and the entire cluster is down. (One node doesn't show the etcd message below so I'm thinking that service is recoverable?) Services needed for cert renewal on the kubernetes cluster cannot start due to etcd having corrupt DB files. I do not have an explicitly taken etcd backup. Messages are the same:

stderr F panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
stderr F    panic: runtime error: invalid memory address or nil pointer dereference

However, I'm wondering if it's possible to take the db file and copy that over to the faulty nodes and what steps are needed to do so successfully:

/var/lib/etcd/member/snap/db

I didn't get clarity how this can be done from the above commands provided thus far. Most howto's assume an explicit backup was taken using etcdctl snapshot restore. From the comments that @arthurlogilab posted, it didn't appear to work for him. This issue is occurring on an OpenShift + Kubernetes cluster.

In the case of OpenShift, the restore procedure worked every time for me. Just don't forget to take backups every day and to verify them regularly.

It doesn't solve this problem, but backups of such critical components are essential.

@MrMYHuang

MrMYHuang commented Feb 4, 2022

I built a k8s cluster with kubeadm containing one control plane node and one worker node. The cluster runs a single etcd 3.5.1 server on the control plane node. Unfortunately, I have also hit the etcd error "failed to recover v3 backend from snapshot" several times.

I studied the etcd source files and found the condition that triggers this kind of error:

func RecoverSnapshotBackend(cfg config.ServerConfig, oldbe backend.Backend, snapshot raftpb.Snapshot, beExist bool, hooks *BackendHooks) (backend.Backend, error) {
	consistentIndex := uint64(0)
	if beExist {
		consistentIndex, _ = schema.ReadConsistentIndex(oldbe.BatchTx())
	}
	if snapshot.Metadata.Index <= consistentIndex {
		return oldbe, nil
	}
	oldbe.Close()
	return OpenSnapshotBackend(cfg, snap.New(cfg.Logger, cfg.SnapDir()), snapshot, hooks)
}

I.e. the condition snapshot.Metadata.Index <= db.consistentIndex is violated. It means the consistent index in the snap/db file is older than the one in one of the snap/*.snap files.
But in normal cases that should be impossible(?), because the following function ensures(?) that the consistent index is written to the snap/db file before a .snap file is created:
func (s *EtcdServer) snapshot(snapi uint64, confState raftpb.ConfState) {

Thus, the steps to reproduce a violation of the condition snapshot.Metadata.Index <= db.consistentIndex are unknown and hard to find...

Nevertheless, I propose a workaround to get etcd back to work:
(Note: this workaround does not guarantee there will be no data loss. Please make a backup and try at your own risk!)

sudo rm /var/lib/etcd/member/wal/*.wal

or

sudo rm /var/lib/etcd/member/snap/*.snap

Then, your etcd might start without crash.
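
For the backup step mentioned above, copying the whole data directory before removing anything is enough; for example (the path below is the default kubeadm layout, adjust as needed):

sudo cp -a /var/lib/etcd /var/lib/etcd.bak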

@cruzanstx

I build a k8s cluster with kubeadm containing one control plane node and one worker node. The cluster contains one etcd 3.5.1 server in the control plane node. Unfortunately, I also meet several times of etcd error "failed to recover v3 backend from snapshot".

I studied etcd source files and found the condition to trigger this kind of error:

func RecoverSnapshotBackend(cfg config.ServerConfig, oldbe backend.Backend, snapshot raftpb.Snapshot, beExist bool, hooks *BackendHooks) (backend.Backend, error) {
consistentIndex := uint64(0)
if beExist {
consistentIndex, _ = schema.ReadConsistentIndex(oldbe.BatchTx())
}
if snapshot.Metadata.Index <= consistentIndex {
return oldbe, nil
}
oldbe.Close()
return OpenSnapshotBackend(cfg, snap.New(cfg.Logger, cfg.SnapDir()), snapshot, hooks)
}

I.e. the condition snapshot.Metadata.Index <= db.consistentIndex is violated. It means the consistent index in snap/db file is older than the one in one of snap/*.snap files.
But in general cases, it's impossible(?), because the following program ensures(?) the consistent index is written to snap/db file before creating a .snap file:

func (s *EtcdServer) snapshot(snapi uint64, confState raftpb.ConfState) {

Thus, the steps to reproduce violation of the condition snapshot.Metadata.Index <= db.consistentIndex are unknown and difficult to find...
Nevertheless, I propose a workaround to make etcd back to work: (Notice, this workaround doesn't guarantee no data loss. Please make a backup and try at your risk!)

sudo rm /var/lib/etcd/member/wal/*.wal

or

sudo rm /var/lib/etcd/member/snap/*.snap

Then, your etcd might start without crash.

Removing *.wal and *.snap worked for me, thanks!

@wang-xiaowu

wang-xiaowu commented Dec 19, 2022

I build a k8s cluster with kubeadm containing one control plane node and one worker node. The cluster contains one etcd 3.5.1 server in the control plane node. Unfortunately, I also meet several times of etcd error "failed to recover v3 backend from snapshot".

I studied etcd source files and found the condition to trigger this kind of error:

func RecoverSnapshotBackend(cfg config.ServerConfig, oldbe backend.Backend, snapshot raftpb.Snapshot, beExist bool, hooks *BackendHooks) (backend.Backend, error) {
consistentIndex := uint64(0)
if beExist {
consistentIndex, _ = schema.ReadConsistentIndex(oldbe.BatchTx())
}
if snapshot.Metadata.Index <= consistentIndex {
return oldbe, nil
}
oldbe.Close()
return OpenSnapshotBackend(cfg, snap.New(cfg.Logger, cfg.SnapDir()), snapshot, hooks)
}

I.e. the condition snapshot.Metadata.Index <= db.consistentIndex is violated. It means the consistent index in snap/db file is older than the one in one of snap/*.snap files.
But in general cases, it's impossible(?), because the following program ensures(?) the consistent index is written to snap/db file before creating a .snap file:

func (s *EtcdServer) snapshot(snapi uint64, confState raftpb.ConfState) {

Thus, the steps to reproduce violation of the condition snapshot.Metadata.Index <= db.consistentIndex are unknown and difficult to find...
Nevertheless, I propose a workaround to make etcd back to work: (Notice, this workaround doesn't guarantee no data loss. Please make a backup and try at your risk!)

sudo rm /var/lib/etcd/member/wal/*.wal

or

sudo rm /var/lib/etcd/member/snap/*.snap

Then, your etcd might start without crash.

It's working, and my data location is /var/lib/rancher/k3s/server/db/etcd/member/.

@vivisidea

None of the above solutions worked for me. Here is my solution:
just remove the broken member from the cluster and re-add it as a new member.

https://etcd.io/docs/v3.5/tutorials/how-to-deal-with-membership/
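
Roughly, the steps from that tutorial look like the following (the member ID, name, and URL are placeholders; also clear the broken member's data dir before it rejoins):

etcdctl member list
etcdctl member remove <MEMBER_ID>
etcdctl member add <MEMBER_NAME> --peer-urls=https://<PEER_IP>:2380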

@xgenvn

xgenvn commented Oct 2, 2023

The workaround of removing *.snap and *.wal worked for me for some time. Recently it stopped working, and etcd got stuck connecting to the server along with a maintenance alarm issue.
However, I was able to forward the etcd client port and take a dump from there, then later create a new PVC and restore it back.

@ttron

ttron commented Mar 22, 2024

Removing *.snap and *.wal can get the etcd pods to start successfully. However, you will lose all your data this way. I accidentally found another way to fix this: copy db to the most recent <index>.snap.db and restart kubelet and containerd (or CRI-O, Docker).
Use the following for example:

# ls -lah /var/lib/etcd/member/snap/
total 49M
drwx------ 2 root root 295 Mar 22 14:58 .
drwx------ 4 root root  29 Mar 21 12:54 ..
-rw-r--r-- 1 root root 10K Mar 22 11:19 000000000000000a-000000000418625a.snap
-rw-r--r-- 1 root root 10K Mar 22 12:14 000000000000000a-000000000418896b.snap
-rw-r--r-- 1 root root 10K Mar 22 13:08 000000000000000a-000000000418b07c.snap
-rw-r--r-- 1 root root 10K Mar 22 14:03 000000000000000a-000000000418d78d.snap
-rw-r--r-- 1 root root 10K Mar 22 14:57 000000000000000a-000000000418fe9e.snap
-rw------- 1 root root 30M Mar 22 15:44 db

so the most recent snap will be 000000000000000a-000000000418fe9e.snap

# cp db 000000000000000a-000000000418fe9e.snap.db
# systemctl restart kubelet
# systemctl restart containerd

The etcd pod and the k8s control plane pods should then start normally.
