storage: TestGossipSystemConfigOnLeaseChange failed under stress #22407

Closed

cockroach-teamcity opened this issue Feb 6, 2018 · 22 comments

Labels: C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

@cockroach-teamcity (Member)

SHA: https://github.com/cockroachdb/cockroach/commits/4b2c64a7107e392229939b24b5bda303c1950a1f

Parameters:

TAGS=deadlock
GOFLAGS=

Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=508297&tab=buildLog
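
For context: `TAGS=deadlock` builds the binary with the `deadlock` Go build tag, under which the codebase's mutex wrappers are backed by github.com/sasha-s/go-deadlock instead of the plain `sync` types, which is what produces the "POTENTIAL DEADLOCK" report below. A minimal sketch of that build-tag shim pattern (package name and layout here are illustrative, not the exact code in this repo):

```go
//go:build deadlock

// With -tags deadlock, this shim replaces the standard RWMutex so that
// go-deadlock can record every (R)Lock call site and flag duplicate or
// inconsistently ordered acquisitions.
package syncutil

import deadlock "github.com/sasha-s/go-deadlock"

// RWMutex is a drop-in replacement for sync.RWMutex; embedding
// deadlock.RWMutex preserves the Lock/Unlock/RLock/RUnlock API.
type RWMutex struct {
	deadlock.RWMutex
}
```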


I180206 10:55:11.423252 1179407 gossip/gossip.go:332  [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:32835" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > 
W180206 10:55:11.434778 1179407 gossip/gossip.go:1292  [n2] no incoming or outgoing connections
I180206 10:55:11.435175 1180756 gossip/client.go:129  [n2] started gossip client to 127.0.0.1:32835
I180206 10:55:11.435281 1179407 storage/store.go:1307  [s2] [n2,s2]: failed initial metrics computation: [n2,s2]: system config not yet available
I180206 10:55:11.435417 1179407 gossip/gossip.go:332  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:45485" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > 
W180206 10:55:11.445019 1179407 gossip/gossip.go:1292  [n3] no incoming or outgoing connections
I180206 10:55:11.445410 1179407 storage/store.go:1307  [s3] [n3,s3]: failed initial metrics computation: [n3,s3]: system config not yet available
I180206 10:55:11.445543 1179407 gossip/gossip.go:332  [n3] NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:35211" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > 
I180206 10:55:11.446566 1181586 gossip/client.go:129  [n3] started gossip client to 127.0.0.1:32835
POTENTIAL DEADLOCK: Duplicate locking, saw callers this locks in one goroutine:
current goroutine 1179407 lock &{{{0 0} 0 0 1 0}} 
all callers to this lock in the goroutine
client_test.go:1066 storage_test.(*multiTestContext).replicateRangeNonFatal { m.mu.RLock() } <<<<<
client_test.go:1059 storage_test.(*multiTestContext).replicateRange { if err := m.replicateRangeNonFatal(rangeID, dests...); err != nil { }
client_lease_test.go:249 storage_test.TestGossipSystemConfigOnLeaseChange { mtc.replicateRange(rangeID, 1, 2) }

client_test.go:472 storage_test.(*multiTestContextKVTransport).SendNext { t.mtc.mu.RLock() } <<<<<
../kv/dist_sender.go:1278 kv.(*DistSender).sendToReplicas { transport.SendNext(ctx, done) }
../kv/dist_sender.go:382 kv.(*DistSender).sendRPC { return ds.sendToReplicas(ctx, SendOptions{metrics: &ds.metrics}, rangeID, replicas, ba, ds.rpcContext) }
../kv/dist_sender.go:446 kv.(*DistSender).sendSingleRange { br, err := ds.sendRPC(ctx, desc.RangeID, replicas, ba) }
../kv/dist_sender.go:1056 kv.(*DistSender).sendPartialBatch { reply, pErr = ds.sendSingleRange(ctx, ba, desc) }
../kv/dist_sender.go:724 kv.(*DistSender).divideAndSendBatchToRanges { resp := ds.sendPartialBatch(ctx, ba, rs, ri.Desc(), ri.Token(), batchIdx, false /* needsTruncate */) }
../kv/dist_sender.go:640 kv.(*DistSender).Send { rpl, pErr = ds.divideAndSendBatchToRanges(ctx, ba, rs, 0 /* batchIdx */) }
../kv/txn_coord_sender.go:485 kv.(*TxnCoordSender).Send { if br, pErr = tc.wrapped.Send(ctx, ba); pErr != nil { }
../internal/client/db.go:555 client.(*DB).sendUsingSender { br, pErr := sender.Send(ctx, ba) }
../internal/client/db.go:529 client.(*DB).send { return db.sendUsingSender(ctx, ba, db.GetSender()) }
../internal/client/db.go:482 client.send)-fm { return sendAndFill(ctx, db.send, b) }
../internal/client/db.go:459 client.sendAndFill { b.response, b.pErr = send(ctx, ba) }
../internal/client/db.go:482 client.(*DB).Run { return sendAndFill(ctx, db.send, b) }
../internal/client/db.go:221 client.(*DB).Get { return getOneRow(db.Run(ctx, b), b) }
../internal/client/db.go:229 client.(*DB).GetProto { r, err := db.Get(ctx, key) }
client_test.go:1013 storage_test.(*multiTestContext).changeReplicasLocked { if err := m.dbs[0].GetProto(ctx, keys.RangeDescriptorKey(startKey), &desc); err != nil { }
client_test.go:1074 storage_test.(*multiTestContext).replicateRangeNonFatal { expectedReplicaIDs[i], err = m.changeReplicasLocked(rangeID, dest, roachpb.ADD_REPLICA) }
client_test.go:1059 storage_test.(*multiTestContext).replicateRange { if err := m.replicateRangeNonFatal(rangeID, dests...); err != nil { }
client_lease_test.go:249 storage_test.TestGossipSystemConfigOnLeaseChange { mtc.replicateRange(rangeID, 1, 2) }


Other goroutines holding locks:
goroutine 1138042 lock 0xc434de4388
store.go:4072 storage.(*Store).tryGetOrCreateReplica { repl.raftMu.Lock() } <<<<<
store.go:3980 storage.(*Store).getOrCreateReplica { r, created, err := s.tryGetOrCreateReplica( }
store_test.go:2775 storage.TestRemovedReplicaTombstone.func1 { _, created, err := s.getOrCreateReplica(ctx, rangeID, c.createReplicaID, &creatingReplica) }

goroutine 1142018 lock 0xc434de4988
store.go:4072 storage.(*Store).tryGetOrCreateReplica { repl.raftMu.Lock() } <<<<<
store.go:3980 storage.(*Store).getOrCreateReplica { r, created, err := s.tryGetOrCreateReplica( }
store_test.go:2775 storage.TestRemovedReplicaTombstone.func1 { _, created, err := s.getOrCreateReplica(ctx, rangeID, c.createReplicaID, &creatingReplica) }

goroutine 1143292 lock 0xc429768c88
store.go:4072 storage.(*Store).tryGetOrCreateReplica { repl.raftMu.Lock() } <<<<<
store.go:3980 storage.(*Store).getOrCreateReplica { r, created, err := s.tryGetOrCreateReplica( }
store_test.go:2775 storage.TestRemovedReplicaTombstone.func1 { _, created, err := s.getOrCreateReplica(ctx, rangeID, c.createReplicaID, &creatingReplica) }

ERROR: exit status 2

context canceled
make: *** [stress] Error 1
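
The report above is flagging re-entrant read locking: goroutine 1179407 acquires `m.mu.RLock()` in `replicateRangeNonFatal` and, while still holding it, re-acquires the same `RWMutex` for reading in `multiTestContextKVTransport.SendNext` via the `DistSender` call chain. With Go's `sync.RWMutex` that is a real hazard rather than detector noise: once a writer is queued, new readers block, so the second `RLock` can wait on a writer that is itself waiting on the first `RLock`. A self-contained sketch of the failure mode (function names are hypothetical):

```go
package main

import (
	"sync"
	"time"
)

var mu sync.RWMutex

// inner re-acquires a read lock its caller already holds. sync.RWMutex
// read locks are not re-entrant: if a writer queues between the two
// RLock calls, this second RLock blocks behind the writer.
func inner() {
	mu.RLock()
	defer mu.RUnlock()
}

func outer() {
	mu.RLock()
	defer mu.RUnlock()
	time.Sleep(20 * time.Millisecond) // window for the writer to queue
	inner()                           // blocks once the writer is queued
}

func main() {
	go func() {
		time.Sleep(10 * time.Millisecond)
		mu.Lock() // writer blocks behind outer's RLock...
		mu.Unlock()
	}()
	outer() // ...and inner's RLock blocks behind the writer: deadlock
}
```

Note that go-deadlock reports the pattern as soon as it observes both acquisitions in one goroutine; it does not wait for a writer to actually wedge the test.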
@cockroach-teamcity added O-robot Originated from a bot. C-test-failure Broken test (automatically or manually discovered). labels Feb 6, 2018
@a-robinson (Contributor)

This was broken by the update to the deadlock detector in #21477.

@a-robinson (Contributor)

I'm following up on sasha-s/go-deadlock#7 before modifying this test infrastructure that has been in place for years.
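
If the upstream discussion concludes that this re-entrant read locking is acceptable here, one alternative to rewriting the multiTestContext locking is to relax the detector globally via go-deadlock's package-level options. A hedged sketch (field names as documented in go-deadlock's README; verify against the vendored version):

```go
package main

import (
	"os"
	"time"

	deadlock "github.com/sasha-s/go-deadlock"
)

func init() {
	deadlock.Opts.LogBuf = os.Stderr            // write reports to stderr
	deadlock.Opts.DeadlockTimeout = time.Minute // only flag locks held this long
	// Disabling order-based detection also silences the duplicate-locking
	// check above, at the cost of missing real lock-order inversions.
	deadlock.Opts.DisableLockOrderDetection = true
}
```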

@petermattis added this to the 2.0 milestone Feb 21, 2018