Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

boltdb panic when defragmenting in functional-tester #7526

Closed
gyuho opened this issue Mar 17, 2017 · 2 comments
Closed

boltdb panic when defragmenting in functional-tester #7526

gyuho opened this issue Mar 17, 2017 · 2 comments

Comments

@gyuho
Copy link
Contributor

gyuho commented Mar 17, 2017

  1. boltdb panic-ed when etcd-tester was NOT injecting failures
  2. etcd-tester does not detect this kind of server failure and kept retrying for hours (e.g. etcd-tester: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.240.0.2:2379: getsockopt: connection refused"; Reconnecting to {10.240.0.2:2379 <nil>})

screen shot 2017-03-17 at 5 48 35 am

Server panic

2017-03-16 16:40:17.457029 I | etcdserver: saved snapshot at index 163022073
2017-03-16 16:40:17.489487 I | etcdserver: compacted raft log at 163017073
2017-03-16 16:40:19.620655 I | etcdserver: start to snapshot (applied: 163032076, lastsnap: 163022073)
2017-03-16 16:40:19.634507 I | etcdserver: saved snapshot at index 163032076
2017-03-16 16:40:19.634692 I | etcdserver: compacted raft log at 163027076
2017-03-16 16:40:30.478248 I | etcdserver: start to snapshot (applied: 163046268, lastsnap: 163032076)
2017-03-16 16:40:30.492110 I | etcdserver: saved snapshot at index 163046268
2017-03-16 16:40:30.492287 I | etcdserver: compacted raft log at 163041268
2017-03-16 16:40:31.427332 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b5d6a8.snap successfully
2017-03-16 16:40:31.427443 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b5fdc9.snap successfully
2017-03-16 16:40:31.427547 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b624dc.snap successfully
2017-03-16 16:40:31.427635 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b64bf9.snap successfully
2017-03-16 16:40:31.427722 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b6730c.snap successfully
2017-03-16 16:40:31.427807 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b69a1d.snap successfully
2017-03-16 16:40:31.427903 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b6c14c.snap successfully
2017-03-16 16:40:31.427989 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b6e860.snap successfully
2017-03-16 16:40:31.428078 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b70f73.snap successfully
2017-03-16 16:40:31.432185 I | pkg/fileutil: purged file agent.etcd/member/wal/0000000000000235-0000000009a1d1a2.wal successfully
2017-03-16 16:40:45.406695 I | etcdserver: start to snapshot (applied: 163056287, lastsnap: 163046268)
2017-03-16 16:40:45.417688 I | etcdserver: saved snapshot at index 163056287
2017-03-16 16:40:45.417960 I | etcdserver: compacted raft log at 163051287
2017-03-16 16:40:49.559151 I | etcdserver: start to snapshot (applied: 163066298, lastsnap: 163056287)
2017-03-16 16:40:49.575017 I | etcdserver: saved snapshot at index 163066298
2017-03-16 16:40:49.575194 I | etcdserver: compacted raft log at 163061298
2017-03-16 16:40:52.214025 I | etcdserver: start to snapshot (applied: 163076302, lastsnap: 163066298)
2017-03-16 16:40:52.229370 I | etcdserver: saved snapshot at index 163076302
2017-03-16 16:40:52.229563 I | etcdserver: compacted raft log at 163071302
2017-03-16 16:40:55.825572 I | etcdserver/api/v3rpc: transport: http2Server.HandleStreams failed to read frame: read tcp 10.240.0.2:2379->10.240.0.10:54220: read: connection reset by peer
2017-03-16 16:40:55.825666 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2017-03-16 16:40:55.825696 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825708 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport: write tcp 10.240.0.2:2379->10.240.0.10:54220: use of closed network connection"
2017-03-16 16:40:55.825736 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825749 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2017-03-16 16:40:55.825756 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2017-03-16 16:40:55.825770 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825782 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825792 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2017-03-16 16:40:55.825807 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825814 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825828 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2017-03-16 16:40:55.825836 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825844 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825853 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825860 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825870 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2017-03-16 16:40:55.825877 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825887 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2017-03-16 16:40:55.825895 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2017-03-16 16:40:55.825912 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.825921 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2017-03-16 16:40:55.825928 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2017-03-16 16:40:55.832454 I | mvcc: store.index: compact 144479929
2017-03-16 16:40:56.286247 W | etcdserver: apply entries took too long [455.80384ms for 1 entries]
2017-03-16 16:40:56.286276 W | etcdserver: avoid queries with large range/delete range!
2017-03-16 16:41:01.428416 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b7369a.snap successfully
2017-03-16 16:41:01.428543 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b75dd9.snap successfully
2017-03-16 16:41:01.428633 I | pkg/fileutil: purged file agent.etcd/member/snap/000000000000154b-0000000009b784f9.snap successfully
2017-03-16 16:41:09.880146 I | mvcc: finished scheduled compaction at 144479929 (took 13.593830004s)
2017-03-16 16:41:10.087061 N | etcdserver/api/v3rpc: starting to defragment the storage backend...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa3d8e2]

goroutine 45907 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Tx).Bucket(0x0, 0x138803e, 0x3, 0x3, 0xc430aacd40)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/tx.go:101 +0x22
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.unsafeRange(0x0, 0x138803e, 0x3, 0x3, 0xc42a008da0, 0x11, 0x12, 0xc42a008e00, 0x11, 0x12, ...)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:92 +0x6a
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*readTx).UnsafeRange(0xc420104ff0, 0x138803e, 0x3, 0x3, 0xc42a008da0, 0x11, 0x12, 0xc42a008e00, 0x11, 0x12, ...)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/read_tx.go:68 +0x216
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc43d38a030, 0xc423cc1240, 0x13, 0x20, 0xc423cc12a0, 0x13, 0x20, 0x89cbdd8, 0x0, 0x0, ...)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore_txn.go:135 +0x2ab
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*storeTxnRead).Range(0xc43d38a030, 0xc423cc1240, 0x13, 0x20, 0xc423cc12a0, 0x13, 0x20, 0x0, 0x0, 0x0, ...)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore_txn.go:45 +0xaf
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*txnReadWrite).Range(0xc43f22d4f0, 0xc423cc1240, 0x13, 0x20, 0xc423cc12a0, 0x13, 0x20, 0x0, 0x0, 0xb48400, ...)
        <autogenerated>:27 +0xc1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*metricsTxnWrite).Range(0xc43d38a060, 0xc423cc1240, 0x13, 0x20, 0xc423cc12a0, 0x13, 0x20, 0x0, 0x0, 0x13f5a00, ...)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/metrics_txn.go:38 +0xb6
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*applierV3backend).Range(0xc4201100b0, 0x1397a20, 0xc43d38a060, 0xc42c91a930, 0x0, 0x0, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/apply.go:253 +0x1c7
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Range.func2()
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/v3_server.go:110 +0x6a
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).doSerialize(0xc420252840, 0x7f8cc9079ed0, 0xc43a82b260, 0xc42a008b60, 0xc4348318f0, 0x0, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/v3_server.go:635 +0xfe
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Range(0xc420252840, 0x7f8cc9079ed0, 0xc43a82b260, 0xc42c91a930, 0x139b3e0, 0xc43b2647c0, 0x1398e60)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/v3_server.go:111 +0x236
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*kvServer).Range(0xc43abf0340, 0x7f8cc9079ed0, 0xc43a82b260, 0xc42c91a930, 0xc43b2647c0, 0xc43cbd29e8, 0x3)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/key.go:50 +0x80
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*quotaKVServer).Range(0xc43abf0380, 0x7f8cc9079ed0, 0xc43a82b260, 0xc42c91a930, 0xc43abf0380, 0xed3372, 0x5)
        <autogenerated>:24 +0x6b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb._KV_Range_Handler.func1(0x7f8cc9079ed0, 0xc43a82b260, 0xe12260, 0xc42c91a930, 0xc43b8a8780, 0xc43cbd2a90, 0x95204d, 0xc43a82b260)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:2294 +0x86
github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.UnaryServerInterceptor(0x7f8cc9079ed0, 0xc43a82b260, 0xe12260, 0xc42c91a930, 0xc423cc1300, 0xc423cc1360, 0x40fef8, 0x20, 0xdb47a0, 0x1)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server.go:29 +0xa4
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x7f8cc9079ed0, 0xc43a82b260, 0xe12260, 0xc42c91a930, 0xc423cc1300, 0xc423cc1360, 0x0, 0xc43cbd2b70, 0x40fef8, 0x50)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/interceptor.go:57 +0xc3
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb._KV_Range_Handler(0xdcfd80, 0xc43abf0380, 0x7f8cc9079ed0, 0xc43a82b260, 0xc43b8a8730, 0xc42f8a91b0, 0x0, 0x0, 0x13f5ad0, 0xc42023c450)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:2296 +0x177
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc43370a000, 0x139b560, 0xc439b64c60, 0xc42aebc690, 0xc433a22330, 0x13ca5e0, 0x0, 0x0, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/server.go:638 +0xb5c
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*Server).handleStream(0xc43370a000, 0x139b560, 0xc439b64c60, 0xc42aebc690, 0x0)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/server.go:796 +0x1261
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc43f1b7880, 0xc43370a000, 0x139b560, 0xc439b64c60, 0xc42aebc690)
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/server.go:449 +0xa9
created by github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
        /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/server.go:450 +0xa1
~

Related #7322

@heyitsanthony
Copy link
Contributor

This is probably from the lock breaking patch; different from #7322 since the backend isn't closed.

@gyuho
Copy link
Contributor Author

gyuho commented Mar 22, 2017

Easy repro:

// TestV3MaintenanceDefragInflightRange ensures inflight range requests
// while defrag is running does not panic the mvcc backend.
func TestV3MaintenanceDefragInflight(t *testing.T) {
	defer testutil.AfterTest(t)
	clus := NewClusterV3(t, &ClusterConfig{Size: 1})
	defer clus.Terminate(t)

	cli := clus.RandClient()
	kvc := toGRPC(cli).KV
	if _, err := kvc.Put(context.Background(), &pb.PutRequest{Key: []byte("foo"), Value: []byte("bar")}, grpc.FailFast(false)); err != nil {
		t.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)

	donec := make(chan struct{})
	go func() {
		defer close(donec)
		kvc.Range(ctx, &pb.RangeRequest{Key: []byte("foo")}, grpc.FailFast(false))
	}()

	mvc := toGRPC(cli).Maintenance
	mvc.Defragment(context.Background(), &pb.DefragmentRequest{}, grpc.FailFast(false))
	cancel()

	<-donec
}

gyuho added a commit to gyuho/etcd that referenced this issue Mar 22, 2017
Fix etcd-io#7526.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
@gyuho gyuho closed this as completed in 26abd25 Mar 23, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Mar 23, 2017
etcd panic-ed, so defrag response just blocked for "days"
when the actual 'v3rpc' path never returned.

We should catch this earlier.

ref. etcd-io#7526

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
gyuho added a commit to gyuho/etcd that referenced this issue Mar 23, 2017
etcd panic-ed, so defrag response just blocked for "days"
when the actual 'v3rpc' path never returned.

We should catch this earlier.

ref. etcd-io#7526

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
gyuho added a commit to gyuho/etcd that referenced this issue Mar 23, 2017
etcd panic-ed, so defrag response just blocked for "days"
when the actual 'v3rpc' path never returned.

We should catch this earlier.

ref. etcd-io#7526

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants