Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd crash and cannot be recovered after "mvcc: range cannot find rev" #10010

Closed
yingnanzhang666 opened this issue Aug 15, 2018 · 19 comments
Closed
Labels

Comments

@yingnanzhang666
Copy link
Contributor

yingnanzhang666 commented Aug 15, 2018

etcd version: 3.1.9
kubernetes version: 1.9

etcd panic after got err msg "mvcc: range cannot find rev (224,0)" as below.

2018-08-13 06:08:45.746323 I | mvcc: finished scheduled compaction at 1715575 (took 5.864407ms)
2018-08-13 06:11:26.283927 I | etcdserver: start to snapshot (applied: 1720172, lastsnap: 1710171)
2018-08-13 06:11:26.286216 I | etcdserver: saved snapshot at index 1720172
2018-08-13 06:11:26.287100 I | etcdserver: compacted raft log at 1715172
2018-08-13 06:11:41.885194 I | pkg/fileutil: purged file /var/etcd/data/member/snap/0000000000000002-0000000000197c17.snap successfully
2018-08-13 06:13:45.744077 I | mvcc: store.index: compact 1716563
2018-08-13 06:13:45.750931 I | mvcc: finished scheduled compaction at 1716563 (took 5.397279ms)
2018-08-13 06:17:14.221766 C | mvcc: range cannot find rev (224,0)

When it restart, and restore compact to a certain revision, keep to panic, got "panic: runtime error: slice bounds out of range"

2018-08-13 06:17:18.831239 I | etcdserver: restarting member eff4154ae7e4c0fb in cluster c4936eefaafec041 at commit index 1722187
2018-08-13 06:17:18.831643 I | raft: eff4154ae7e4c0fb became follower at term 2
2018-08-13 06:17:18.831694 I | raft: newRaft eff4154ae7e4c0fb [peers: [eff4154ae7e4c0fb], term: 2, commit: 1722187, applied: 1720172, lastindex: 1722187, lastterm: 2]
2018-08-13 06:17:18.832004 I | etcdserver/api: enabled capabilities for version 3.1
2018-08-13 06:17:18.832034 I | etcdserver/membership: added member eff4154ae7e4c0fb [https://***] to cluster c4936eefaafec041 from store
2018-08-13 06:17:18.832073 I | etcdserver/membership: set the cluster version to 3.1 from store
2018-08-13 06:17:18.840345 I | mvcc: restore compact to 1716563
panic: runtime error: slice bounds out of range

goroutine 1 [running]:
panic(0xd168c0, 0xc4200120a0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).restore(0xc4201fc270, 0x12d205c, 0x4)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:405 +0x12f0
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.NewStore(0x12e4320, 0xc420256720, 0x12e5860, 0xc4200a2e40, 0x12d4a00, 0xc420109b98, 0x87fde2)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:121 +0x30f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.newWatchableStore(0x12e4320, 0xc420256720, 0x12e5860, 0xc4200a2e40, 0x12d4a00, 0xc420109b98, 0xc4200a2e40)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/watchable_store.go:73 +0x76
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.New(0x12e4320, 0xc420256720, 0x12e5860, 0xc4200a2e40, 0x12d4a00, 0xc420109b98, 0x0, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/watchable_store.go:68 +0x5d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc4200c9e40, 0x0, 0x0, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:449 +0x12d1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc42014fc00, 0xc4201098c0, 0x0, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:123 +0x67d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc42014fc00, 0x6, 0xe412fc, 0x6, 0x1)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:187 +0x58
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:104 +0x14e0
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:39 +0x61
main.main()
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x14

When restore etcd from some previous backups, got other two kinds of panic, "panic: invalid page type: 1938: 10" and "panic: page 1938 already freed" as below.

2018-08-13 12:38:13.839410 N | etcdserver/membership: set the initial cluster version to 3.1
2018-08-13 12:38:13.839496 I | embed: ready to serve client requests
2018-08-13 12:38:13.839541 I | etcdserver: published {Name:kubernetes-master ClientURLs:[https://***]} to cluster c4936eefaafec041
2018-08-13 12:38:13.839588 I | etcdserver/api: enabled capabilities for version 3.1
2018-08-13 12:38:13.841230 I | embed: serving client requests on [::]:4001
panic: invalid page type: 1938: 10

goroutine 81392 [running]:
panic(0xcbb000, 0xc420be6360)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Cursor).search(0xc4227451e8, 0x12d2032, 0x3, 0x3, 0x792)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/cursor.go:256 +0x429
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Cursor).seek(0xc4227451e8, 0x12d2032, 0x3, 0x3, 0x0, 0x0, 0xc4216ae640, 0xc4215e5c00, 0x0, 0x2, ...)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/cursor.go:159 +0xb1
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Bucket).Bucket(0xc4201649b8, 0x12d2032, 0x3, 0x3, 0x100000000000000)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/bucket.go:112 +0x108
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Tx).Bucket(0xc4201649a0, 0x12d2032, 0x3, 0x3, 0x7f04c4e684b0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/tx.go:101 +0x4f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTx).UnsafeRange(0xc4202835e0, 0x12d2032, 0x3, 0x3, 0xc42073b180, 0x11, 0x12, 0xc42073b1a0, 0x11, 0x12, ...)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:88 +0x79
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).rangeKeys(0xc4202872b0, 0xc4216ae440, 0x36, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:524 +0x216
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).Range(0xc4202872b0, 0xc4216ae440, 0x36, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc4201f3b00, ...)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:155 +0xd2
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*applierV3backend).Range(0xc420178030, 0xffffffffffffffff, 0xc4205d96c0, 0x36, 0x40, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/apply.go:284 +0xb7d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Range.func2()
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/v3_server.go:110 +0x61
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).doSerialize(0xc420203340, 0x7f0244808118, 0xc4205dc660, 0xc42073b100, 0xc422745878, 0x0, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/v3_server.go:635 +0x10d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Range(0xc420203340, 0x7f0244808118, 0xc4205dc660, 0xc4205d96c0, 0x12e4380, 0xc420853b80, 0x12e1f80)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/v3_server.go:111 +0x234
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*kvServer).Range(0xc420a28500, 0x7f0244808118, 0xc4205dc660, 0xc4205d96c0, 0xc422f369c0, 0x62019a, 0xc420853b80)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/key.go:50 +0x88
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*quotaKVServer).Range(0xc420a28540, 0x7f0244808118, 0xc4205dc660, 0xc4205d96c0, 0x5, 0xe51026, 0xf)
        <autogenerated>:23 +0x6b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb._KV_Range_Handler.func1(0x7f0244808118, 0xc4205dc660, 0xda0020, 0xc4205d96c0, 0xc4208939f0, 0xc422f36a38, 0x9bcaf6, 0xc4205dc660)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:2287 +0xa2
github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.UnaryServerInterceptor(0x7f0244808118, 0xc4205dc660, 0xda0020, 0xc4205d96c0, 0xc42073afc0, 0xc42073afe0, 0x40f9e8, 0x20, 0xd48960, 0x1)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server.go:29 +0xa4
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x7f0244808118, 0xc4205dc660, 0xda0020, 0xc4205d96c0, 0xc42073afc0, 0xc42073afe0, 0x38, 0x2, 0x0, 0xc422f36b28)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/interceptor.go:57 +0xc3
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x7f0244808118, 0xc4205dc660, 0xda0020, 0xc4205d96c0, 0xc42073afc0, 0xc42073afe0, 0x38, 0x2, 0x0, 0xc422f36b28)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/interceptor.go:57 +0xc3
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb._KV_Range_Handler(0xd60d80, 0xc420a28540, 0x7f0244808118, 0xc4205dc660, 0xc4208939a0, 0xc420b0c2e0, 0x0, 0x0, 0xc420c26d90, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:2289 +0x166
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc4201fe0c0, 0x12e4560, 0xc4211a96e0, 0xc42060fe00, 0xc4202348a0, 0x1302820, 0x0, 0x0, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/server.go:638 +0xc14
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*Server).handleStream(0xc4201fe0c0, 0x12e4560, 0xc4211a96e0, 0xc42060fe00, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/server.go:796 +0x6b0
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc420c27460, 0xc4201fe0c0, 0x12e4560, 0xc4211a96e0, 0xc42060fe00)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/server.go:449 +0xab
created by github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/server.go:450 +0xa3
2018-08-13 14:53:45.309967 I | etcdserver/api: enabled capabilities for version 3.1
2018-08-13 14:53:45.311354 I | embed: serving client requests on [::]:4001
panic: page 1938 already freed

goroutine 78 [running]:
panic(0xcbb000, 0xc42073ce10)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*freelist).free(0xc42027af90, 0x746f6, 0x7f6280e06000)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/freelist.go:117 +0x2c0
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*node).spill(0xc4214b5730, 0xc421cd5260, 0xc4226479f0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/node.go:363 +0x1dc
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Bucket).spill(0xc4218636c0, 0xc421cd4b00, 0xc422647c60)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/bucket.go:541 +0x43b
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Bucket).spill(0xc42282e398, 0x2fc4521f, 0x130c500)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/bucket.go:508 +0x93b
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Tx).Commit(0xc42282e380, 0x2fc43e6f, 0x130c500)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/tx.go:163 +0x125
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTx).commit(0xc4202a5ea0, 0xebca00)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:183 +0x98
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTx).Commit(0xc4202a5ea0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:140 +0x66
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*backend).run(0xc4202d0600)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:199 +0x13b
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.newBackend
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:119 +0x1c4

Check snapshot. Only use the backups before that one passed bolt check, it can recover to a healthy etcd cluster. The bad one shows "page 1938: already freed"

root@b449f79a82ac:/# bolt check /var/etcd/1533614511486539137.snapshot.db
OK
root@b449f79a82ac:/# bolt check /var/etcd/1533781911485869699.snapshot.db
page 1938: already freed
page 1977: already freed
page 1940: unreachable unfreed
page 1979: unreachable unfreed
4 errors found
invalid value

Check the db in etcd data-dir, it shows "stack overflowpage"

# bolt check db
page 1977: already freed
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
......
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
page 1938: multiple references
runtime: goroutine stack exceeds 1000000000-byte limit
fatal error: stack overflowpage 1938: multiple references


runtime stack:
runtime.throw(0x55c53e, 0xe)
    /usr/local/go/src/runtime/panic.go:596 +0x95
runtime.newstack(0x0)
    /usr/local/go/src/runtime/stack.go:1089 +0x3f2
runtime.morestack()
    /usr/local/go/src/runtime/asm_amd64.s:398 +0x86

goroutine 19 [running]:
fmt.(*pp).printArg(0xc427b1a000, 0x526520, 0xc428d9f620, 0x64)
    /usr/local/go/src/fmt/print.go:604 +0xa37 fp=0xc440300360 sp=0xc440300358
fmt.(*pp).doPrintf(0xc427b1a000, 0x55f935, 0x1c, 0xc4403005d8, 0x1, 0x1)
    /usr/local/go/src/fmt/print.go:998 +0x11e7 fp=0xc440300440 sp=0xc440300360
fmt.Sprintf(0x55f935, 0x1c, 0xc4403005d8, 0x1, 0x1, 0x40cccf, 0x526520)
    /usr/local/go/src/fmt/print.go:196 +0x6a fp=0xc440300498 sp=0xc440300440
fmt.Errorf(0x55f935, 0x1c, 0xc4403005d8, 0x1, 0x1, 0x40f508, 0x40)
    /usr/local/go/src/fmt/print.go:205 +0x5a fp=0xc4403004f8 sp=0xc440300498
github.com/boltdb/bolt.(*Tx).checkBucket.func1(0x7ffbd0fa7000, 0x0)
    /go/src/github.com/boltdb/bolt/tx.go:434 +0x46d fp=0xc440300648 sp=0xc4403004f8
@yingnanzhang666 yingnanzhang666 changed the title etcd crash and cannot been recover after "mvcc: range cannot find rev" etcd crash and cannot be recovered after "mvcc: range cannot find rev" Aug 15, 2018
@gyuho
Copy link
Contributor

gyuho commented Aug 15, 2018

@yingnanzhang666 Is there anyway that you can share the db files?

@yingnanzhang666
Copy link
Contributor Author

@gyuho It's our product env. so I cannot share. Or which information do you need? I can collect for you.
And our backup interval is half an hour. But we lost several day's data because until the first crash occurred, go back to previous backups and restore, found it still panic by using the latest several backups. Then we want to figure out how to prevent this situation.

@gyuho
Copy link
Contributor

gyuho commented Aug 15, 2018

go back to previous backups and restore, found it still panic by using the latest several backups.

Can you provide more details? So, no matter how far you go back with previous snapshots, etcd now panics right away?

@gyuho
Copy link
Contributor

gyuho commented Aug 15, 2018

I will provide some redact tooling #7620, so that the db files can be shared.

@yingnanzhang666
Copy link
Contributor Author

yingnanzhang666 commented Aug 15, 2018

After restore, it can running for several minutes, log is as below

2018-08-13 07:09:45.932523 I | mvcc: restore compact to 1713602
2018-08-13 07:09:45.960089 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2018-08-13 07:09:45.962025 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2018-08-13 07:09:45.962092 I | etcdserver: starting server... [version: 3.1.9, cluster version: to_be_decided]
2018-08-13 07:09:45.962189 I | embed: ClientTLS: cert = /etc/ssl/kubernetes/etcd.crt, key = /etc/ssl/kubernetes/etcd.key, ca = , trusted-ca = /etc/ssl/kubernetes/etcdca.crt, client-cert-auth = true
2018-08-13 07:09:46.724006 I | raft: eff4154ae7e4c0fb is starting a new election at term 1
2018-08-13 07:09:46.724199 I | raft: eff4154ae7e4c0fb became candidate at term 2
2018-08-13 07:09:46.724242 I | raft: eff4154ae7e4c0fb received MsgVoteResp from eff4154ae7e4c0fb at term 2
2018-08-13 07:09:46.724275 I | raft: eff4154ae7e4c0fb became leader at term 2
2018-08-13 07:09:46.724295 I | raft: raft.node: eff4154ae7e4c0fb elected leader eff4154ae7e4c0fb at term 2
2018-08-13 07:09:46.724790 I | etcdserver: setting up the initial cluster version to 3.1
2018-08-13 07:09:46.725729 N | etcdserver/membership: set the initial cluster version to 3.1
2018-08-13 07:09:46.725812 I | embed: ready to serve client requests
2018-08-13 07:09:46.725914 I | etcdserver/api: enabled capabilities for version 3.1
2018-08-13 07:09:46.725953 I | etcdserver: published {Name:kubernetes-master-42-1 ClientURLs:[https://***]} to cluster c4936eefaafec041
2018-08-13 07:09:46.727277 I | embed: serving client requests on [::]:4001
2018-08-13 07:15:55.167102 C | mvcc: range cannot find rev (1716401,0)

and the last log is "mvcc: range cannot find rev (1716401,0)", then keep to restart. but panic immediately.

2018-08-13 07:15:59.139476 I | etcdserver: advertise client URLs = https://***
2018-08-13 07:15:59.162166 I | etcdserver: restarting member eff4154ae7e4c0fb in cluster c4936eefaafec041 at commit index 1302
2018-08-13 07:15:59.162463 I | raft: eff4154ae7e4c0fb became follower at term 2
2018-08-13 07:15:59.162514 I | raft: newRaft eff4154ae7e4c0fb [peers: [eff4154ae7e4c0fb], term: 2, commit: 1302, applied: 1, lastindex: 1302, lastterm: 2]
2018-08-13 07:15:59.162778 I | etcdserver/membership: added member eff4154ae7e4c0fb [https://kubernetes-master-42-1.42.tess.io:2380] to cluster c4936eefaafec041 from store
2018-08-13 07:15:59.171548 I | mvcc: restore compact to 1713602
panic: runtime error: slice bounds out of range

goroutine 1 [running]:
panic(0xd168c0, 0xc4200120a0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).restore(0xc42022e270, 0x12d205c, 0x4)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:405 +0x12f0
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.NewStore(0x12e4320, 0xc4202922a0, 0x12e5860, 0xc4201f2ba0, 0x12d4a00, 0xc420203358, 0x87fde2)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:121 +0x30f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.newWatchableStore(0x12e4320, 0xc4202922a0, 0x12e5860, 0xc4201f2ba0, 0x12d4a00, 0xc420203358, 0xc4201f2ba0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/watchable_store.go:73 +0x76
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.New(0x12e4320, 0xc4202922a0, 0x12e5860, 0xc4201f2ba0, 0x12d4a00, 0xc420203358, 0xc420236000, 0xc4202811e0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/watchable_store.go:68 +0x5d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc4201fcb00, 0x0, 0x0, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:449 +0x12d1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc4201e9500, 0xc420202840, 0x0, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:123 +0x67d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc4201e9500, 0x6, 0xe412fc, 0x6, 0x1)
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:187 +0x58
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:104 +0x14e0
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:39 +0x61
main.main()
        /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x14

@yingnanzhang666
Copy link
Contributor Author

I restore the backups from the last one and go backward.

The phenomenon to restore the latest several one is as my last comment.

Then go back to restore other backups, got two other kinds of panic, "panic: invalid page type: 1938: 10" and "panic: page 1938 already freed", mentioned in issue description. And it will panic immediately after being bring up.

Until go back to the backup which shows as below, etcd was recovered, no panic.

root@b449f79a82ac:/# bolt check /var/etcd/1533614511486539137.snapshot.db
OK

redact tooling is pretty cool. When can I use it?

@gyuho
Copy link
Contributor

gyuho commented Aug 17, 2018

@yingnanzhang666 No timeline yet, but will add that feature asap. Could you keep the "first" db file that you can panic from after "snapshot restore"? So that, it can be reproducible on our side once we have redact tool?

@wenjiaswe
Copy link
Contributor

cc @jpbetz

@jpbetz
Copy link
Contributor

jpbetz commented Aug 29, 2018

@yingnanzhang666 You've encountered #8813 which exists on 3.1.9. The fix was backported via #8902 to 3.1.11. "page 1938 already freed" type errors are the most common symptom of this corruption bug. It can happen in both HA clusters when they recover from peers and when recovering from a backup snapshot file.

@yingnanzhang666
Copy link
Contributor Author

Before restore from snapshot, the cluster is running with single etcd normally, but got error "mvcc: range cannot find rev (224,0)", after etcd panic and restart with data, it cannot start up with "panic: runtime error: slice bounds out of range" when restore.
add debug log, print the key and value ranged from https://github.com/etcd-io/etcd/blob/v3.1.9/mvcc/kvstore.go#L398

2018-09-03 09:45:30.075839 I | mvcc/backend: [debug] UnsafeRange: endKey is [127 255 255 255 255 255 255 255 95 127 255 255 255 255 255 255 255]
2018-09-03 09:45:30.075851 I | mvcc/backend: [debug] UnsafeRange: key is [0 0 0 0 0 0 0 1 95 0 0 0 0 0 0 0 0]
2018-09-03 09:45:30.075863 I | mvcc/backend: [debug] ck: [97 108 97 114 109], cv []
2018-09-03 09:45:30.075878 I | mvcc/backend: [debug] ck: [97 117 116 104], cv []
2018-09-03 09:45:30.075892 I | mvcc/backend: [debug] ck: [97 117 116 104 82 111 108 101 115], cv []
2018-09-03 09:45:30.075906 I | mvcc/backend: [debug] ck: [97 117 116 104 85 115 101 114 115], cv []
2018-09-03 09:45:30.075923 I | mvcc/backend: [debug] ck: [99 108 117 115 116 101 114], cv []
2018-09-03 09:45:30.075942 I | mvcc/backend: [debug] ck: [107 101 121], cv []
2018-09-03 09:45:30.075954 I | mvcc/backend: [debug] ck: [108 101 97 115 101], cv []
2018-09-03 09:45:30.075971 I | mvcc/backend: [debug] ck: [109 101 109 98 101 114 115], cv []
2018-09-03 09:45:30.076011 I | mvcc/backend: [debug] ck: [109 101 109 98 101 114 115 95 114 101 109 111 118 101 100], cv []
2018-09-03 09:45:30.076031 I | mvcc/backend: [debug] ck: [109 101 116 97], cv []

It shows the keys are not correct. the len of key is not larger than 8+1+8.
@jpbetz Do you have any idea which cause this situation?

@yingnanzhang666
Copy link
Contributor Author

@jpbetz For the issue fixed by this PR etcd-io/bbolt#67, whether it existed in v3.1.9? The code changed isn't existed in v3.1.9. And also I add the unit test case "func TestDB_Concurrent_WriteTo(t *testing.T)" into v3.1.9, but it can pass.

ok  	github.com/coreos/bbolt/bolt	5.149s

And test v3.1.11 without etcd-io/bbolt#67, the unit test shows

consistency check failed (2 errors)
page 5: reachable freed
page 4: unreachable unfreed

db saved to:
/var/folders/59/7kl8hd612tj4nj8qdgtfwlqm395_t9/T/bolt-737926952


FAIL	github.com/coreos/bbolt/bolt	5.033s

@yingnanzhang666
Copy link
Contributor Author

@jpbetz And also, from the comments #8813 (comment), it only impact the snapshot instead of the db, right? So suppose if it's a single etcd, and never recovered from backup, this freelist corruption issue won't occur. But actually, when it got the first error "mvcc: range cannot find rev (224,0)" and went to crash, it's a single etcd, never restart or recover since it start from scratch.

meta page

# bolt page db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=1938>
Freelist:   <pgid=1939>
HWM:        <pgid=2021>
Txn ID:     1403214
Checksum:   d928c4f09eb3fc0a

root page

# bolt page db 1938
Page ID:    1938
Page Type:  leaf
Total Size: 4096 bytes
Item Count: 10

"alarm": <pgid=0,seq=0>
"auth": <pgid=0,seq=0>
"authRoles": <pgid=0,seq=0>
"authUsers": <pgid=0,seq=0>
"cluster": <pgid=0,seq=0>
"key": <pgid=1938,seq=0>
"lease": <pgid=0,seq=0>
"members": <pgid=0,seq=0>
"members_removed": <pgid=0,seq=0>
"meta": <pgid=0,seq=0>

And the key page is the same as root page. multiple references to this page 1938.

freelist page

# bolt page db 1923
Page ID:    1923
Page Type:  freelist
Total Size: 4096 bytes
Item Count: 78

1898
1928
1939
1941
1942
1943
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1977
1978
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020

See there're two 1977 in this freelist.

I think perhaps this issue is not the same with etcd-io/bbolt#67

@jpbetz
Copy link
Contributor

jpbetz commented Sep 6, 2018

@yingnanzhang666 my understanding is that the issue exists on etcd versions prior to 3.1.11. The safest bet would be to bump up your etcd version.

I’m not 100% certain why the test passed, but it’s not written to be deterministic, so maybe that’s at play?

@jpbetz
Copy link
Contributor

jpbetz commented Sep 6, 2018

There are two ways we know it can happen: (1) when an etcd server becomes sufficiently out of date that it is send a new copy of the db from the leader. This happens automatically. (2) when creating a snapshot file explicitly. E.g via etcdctl

So maybe you’ve encountered it in a way we haven’t seen before? Everything about the details you’ve provided strongly suggest that this is the same free list corruption issue.

@yingnanzhang666
Copy link
Contributor Author

@jpbetz There's a etcdctl snapshot save every 30 minutes in the cluster, but in my understanding, it only impact the backup db created by etcdctl snapshot save, am I right? But in my issue, the original db has been impacted, too.

@yuchengwu
Copy link

I am afraid the issue persists in 3.1.18,

2019-07-24 21:42:17.650935 I | etcdmain: etcd Version: 3.1.18                                                                                                                                              
2019-07-24 21:42:17.650996 I | etcdmain: Git SHA: b7ff47f9d                                                                                                                                                
2019-07-24 21:42:17.651001 I | etcdmain: Go Version: go1.8.7                                                                                                                                               
2019-07-24 21:42:17.651005 I | etcdmain: Go OS/Arch: linux/amd64                                                                                                                                           
2019-07-24 21:42:17.651010 I | etcdmain: setting maximum number of CPUs to 32, total number of available CPUs is 32                                                                                        
2019-07-24 21:42:17.651047 N | etcdmain: the server is already initialized as member before, starting as etcd member...                                                                                    
2019-07-24 21:42:17.651080 I | embed: peerTLS: cert = /etc/ssl/etcd.pem, key = /etc/ssl/etcd-key.pem, ca = , trusted-ca = /etc/ssl/etcd-ca.pem, client-cert-auth = true                                    
2019-07-24 21:42:17.651952 I | embed: listening for peers on https://0.0.0.0:2380                                                                                                                          
2019-07-24 21:42:17.651992 I | embed: listening for client requests on 0.0.0.0:2379                                                                                                                        
2019-07-24 21:42:17.691258 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated                                                                                 
2019-07-24 21:42:17.692194 I | etcdserver: recovered store from snapshot at index 24520255                                                                                                                 
2019-07-24 21:42:17.697494 I | mvcc: restore compact to 20437889                                                                                                                                           
panic: runtime error: slice bounds out of range                                                                                                                                                            
                                                                                                                                                                                                           
goroutine 1 [running]:                                                                                                                                                                                     
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).restore(0xc4201e1ee0, 0x1355080, 0x4)                                                                                               
        /tmp/etcd-release-3.1.18/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:405 +0x10ae                                                         
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.NewStore(0x1367fe0, 0xc4203ab980, 0x1369760, 0x13c2048, 0x1357c40, 0xc4203b5190, 0xc42001a000)                                               
        /tmp/etcd-release-3.1.18/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:121 +0x305                                                          
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.newWatchableStore(0x1367fe0, 0xc4203ab980, 0x1369760, 0x13c2048, 0x1357c40, 0xc4203b5190, 0x13c2048)                                         
        /tmp/etcd-release-3.1.18/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/watchable_store.go:73 +0x81                                                    
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.New(0x1367fe0, 0xc4203ab980, 0x1369760, 0x13c2048, 0x1357c40, 0xc4203b5190, 0x1, 0xc4201e8208)                                               
        /tmp/etcd-release-3.1.18/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/watchable_store.go:68 +0x5d                                                    
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.recoverSnapshotBackend(0xc42000f340, 0x1367fe0, 0xc4203ab980, 0xc4203c2a00, 0x323c, 0x3500, 0xc4203b3680, 0x3, 0x4, 0x0, ...)          
        /tmp/etcd-release-3.1.18/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:76 +0xe0                                                      
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc42000f340, 0x0, 0x0, 0x0)                                                                                                 
        /tmp/etcd-release-3.1.18/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:382 +0x30e0                                                    
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc420091500, 0xc4200f4dc0, 0x0, 0x0)                                                                                             
        /tmp/etcd-release-3.1.18/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:124 +0x70f                                                            
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc420091500, 0x6, 0xe9ede4, 0x6, 0x1)                                                                                         
        /tmp/etcd-release-3.1.18/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:187 +0x58                                                          
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
        /tmp/etcd-release-3.1.18/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:104 +0x15ba                                                        
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
        /tmp/etcd-release-3.1.18/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:39 +0x61                                                           
main.main()
        /tmp/etcd-release-3.1.18/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20                            

using snapshot restore with the broken db file got similar result

panic: runtime error: slice bounds out of range

goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).restore(0xc4201c12b0, 0x117e07c, 0x4)
        /home/wyc/go/src/go.etcd.io/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:410 +0x117d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.NewStore(0x118e4c0, 0xc4201f6960, 0x118f960, 0xc4201f6a20, 0x1180840, 0xc420152398, 0x0)
        /home/wyc/go/src/go.etcd.io/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:121 +0x2fd
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3/command.makeDB(0xc420206440, 0x18, 0x7ffcca1cbe2b, 0x2, 0x1)
        /home/wyc/go/src/go.etcd.io/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3/command/snapshot_command.go:379 +0x405
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3/command.snapshotRestoreCommandFunc(0xc42026f400, 0xc420150f20, 0x1, 0x2)
        /home/wyc/go/src/go.etcd.io/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3/command/snapshot_command.go:196 +0x34d
github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra.(*Command).execute(0xc42026f400, 0xc420150e00, 0x2, 0x2, 0xc42026f400, 0xc420150e00)
        /home/wyc/go/src/go.etcd.io/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra/command.go:572 +0x20d
github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x11c4aa0, 0xc42010e638, 0xc420177e01, 0xc420177ee0)
        /home/wyc/go/src/go.etcd.io/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra/command.go:662 +0x357
github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra.(*Command).Execute(0x11c4aa0, 0x0, 0x0)
        /home/wyc/go/src/go.etcd.io/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra/command.go:618 +0x2b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3.Start()
        /home/wyc/go/src/go.etcd.io/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3/ctl.go:96 +0x8e
main.main()
        /home/wyc/go/src/go.etcd.io/etcd/gopath/src/github.com/coreos/etcd/cmd/etcdctl/main.go:40 +0x184

to see what happened, I added several log lines before and after bytesToRev (https://github.com/etcd-io/etcd/blob/v3.1.18/mvcc/kvstore.go#L404-L406) :

the log lines I added

    405                 plog.Println(">>", key)
    406                 if len(key) < 8+1+8 {
    407                         plog.Println(">>skip")
    408                         continue
    409                 }
    410                 rev := bytesToRev(key[:revBytesLen])
    411                 plog.Println(">>>", rev)

and reexecute snapshot restore, the output is

ETCDCTL_API=3 ~/go/src/go.etcd.io/etcd/bin/etcdctl snapshot restore  db --skip-hash-check

... ...
2019-07-24 21:50:17.651245 I | mvcc: >>> {20439393 0}
2019-07-24 21:50:17.651264 I | mvcc: >> [0 0 0 0 1 55 225 98 95 0 0 0 0 0 0 0 0]
2019-07-24 21:50:17.651276 I | mvcc: >>> {20439394 0}
2019-07-24 21:50:17.651288 I | mvcc: >> [97 108 97 114 109]
2019-07-24 21:50:17.651299 I | mvcc: >>skip
2019-07-24 21:50:17.651311 I | mvcc: >> [97 117 116 104]
2019-07-24 21:50:17.651321 I | mvcc: >>skip
2019-07-24 21:50:17.651333 I | mvcc: >> [97 117 116 104 82 111 108 101 115]
2019-07-24 21:50:17.651343 I | mvcc: >>skip
2019-07-24 21:50:17.651355 I | mvcc: >> [97 117 116 104 85 115 101 114 115]
2019-07-24 21:50:17.651413 I | mvcc: >>skip
2019-07-24 21:50:17.651424 I | mvcc: >> [99 108 117 115 116 101 114]
2019-07-24 21:50:17.651429 I | mvcc: >>skip
2019-07-24 21:50:17.651434 I | mvcc: >> [107 101 121]
2019-07-24 21:50:17.651440 I | mvcc: >>skip
2019-07-24 21:50:17.651445 I | mvcc: >> [108 101 97 115 101]
2019-07-24 21:50:17.651450 I | mvcc: >>skip
2019-07-24 21:50:17.651456 I | mvcc: >> [109 101 109 98 101 114 115]
2019-07-24 21:50:17.651461 I | mvcc: >>skip
2019-07-24 21:50:17.651468 I | mvcc: >> [109 101 109 98 101 114 115 95 114 101 109 111 118 101 100]
2019-07-24 21:50:17.651473 I | mvcc: >>skip
2019-07-24 21:50:17.651481 I | mvcc: >> [109 101 116 97]
2019-07-24 21:50:17.651486 I | mvcc: >>skip
2019-07-24 21:50:17.651494 I | mvcc: >> [0 0 0 0 1 55 225 108 95 0 0 0 0 0 0 0 0]
2019-07-24 21:50:17.651501 I | mvcc: >>> {20439404 0}
... ...

as @yingnanzhang666 stated in #10010 (comment)

It shows the keys are not correct. the len of key is not larger than 8+1+8.

I have to member remove this member then member add it back to resolve this issue.

@jingyih
Copy link
Contributor

jingyih commented Jul 24, 2019

@yuchengwu, do you still have the broken db file? Could you run 'bolt check db' and verify this is the same bug as originally reported in this issue? 'bolt' is a command line utility: https://github.com/boltdb/bolt#installing

@yuchengwu
Copy link

@jingyih yes, sure

[root@node03 ~]# ./bolt check /opt/kubernetes/data/etcd/data.bak/member/snap/db 
page 496: reachable freed
page 574: unreachable unfreed
2 errors found
invalid value
[root@node03 ~]# ./bolt info /opt/kubernetes/data/etcd/data.bak/member/snap/db 
Page Size: 4096

this is 'bolt_pages.txt', this is 'bolt_stats.txt'

@stale
Copy link

stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 21, 2022
@stale stale bot closed this as completed Oct 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

7 participants