teamcity: failed test: TestConcurrentRaftSnapshots #39661

cockroach-teamcity · 2019-08-14T14:46:20Z

The following tests appear to have failed on master (testrace): TestConcurrentRaftSnapshots

You may want to check for open issues.

TestConcurrentRaftSnapshots
...to all 4 replicas failed; last error: (err: node unavailable; try another peer) <nil>
W190814 14:36:22.952605 103654 internal/client/txn.go:509  [liveness-hb] failure aborting transaction: node unavailable; try another peer; abort caused by: node unavailable; try another peer
W190814 14:36:22.953295 103654 storage/node_liveness.go:484  [liveness-hb] failed node liveness heartbeat: node unavailable; try another peer
I190814 14:36:22.972668 105719 internal/client/txn.go:619  async rollback failed: node unavailable; try another peer
W190814 14:36:22.975256 103477 storage/store.go:1535  [s1,r1/1:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 14:36:22.975718 103973 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
I190814 14:36:22.977749 105566 internal/client/txn.go:619  async rollback failed: failed to send RPC: sending to all 4 replicas failed; last error: (err: node unavailable; try another peer) <nil>
W190814 14:36:22.989950 103738 storage/store.go:1535  [s3,r1/3:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 14:36:23.035861 103738 storage/store.go:1535  [s3,r1/3:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 14:36:23.040239 103601 storage/store.go:1535  [s2,r1/2:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 14:36:23.052669 103973 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 14:36:23.158153 103973 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
I190814 14:36:23.417779 105685 util/stop/stopper.go:542  quiescing; tasks left:
12     rpc heartbeat
I190814 14:36:23.419183 105685 util/stop/stopper.go:542  quiescing; tasks left:
11     rpc heartbeat
W190814 14:36:23.423125 104049 gossip/gossip.go:1517  [n5] no incoming or outgoing connections
I190814 14:36:23.440370 105685 util/stop/stopper.go:542  quiescing; tasks left:
10     rpc heartbeat
W190814 14:36:23.442608 105064 storage/raft_transport.go:620  while processing outgoing Raft queue to node 4: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 14:36:23.444194 104677 storage/raft_transport.go:620  while processing outgoing Raft queue to node 3: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 14:36:23.444518 105701 storage/raft_transport.go:620  while processing outgoing Raft queue to node 5: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 14:36:23.450638 103653 gossip/gossip.go:1517  [n2] no incoming or outgoing connections
W190814 14:36:23.453525 104389 storage/raft_transport.go:620  while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 14:36:23.457033 103696 gossip/gossip.go:1517  [n3] no incoming or outgoing connections
W190814 14:36:23.457366 104320 storage/raft_transport.go:620  while processing outgoing Raft queue to node 1: rpc error: code = Unavailable desc = transport is closing:
W190814 14:36:23.458003 104008 gossip/gossip.go:1517  [n4] no incoming or outgoing connections
W190814 14:36:23.458338 104471 storage/raft_transport.go:620  while processing outgoing Raft queue to node 1: rpc error: code = Canceled desc = grpc: the client connection is closing:
I190814 14:36:23.461581 105685 util/stop/stopper.go:542  quiescing; tasks left:
6      rpc heartbeat
I190814 14:36:23.462412 105685 util/stop/stopper.go:542  quiescing; tasks left:
5      rpc heartbeat
    client_raft_test.go:1119: change replicas of r1 failed: TransactionStatusError: client already committed or rolled back the transaction. Trying to execute: EndTransaction(commit:true) [/Min] (REASON_UNKNOWN)

Please assign, take a look and update the issue accordingly.

The text was updated successfully, but these errors were encountered:

Fixes cockroachdb#39652. Fixes cockroachdb#39661. Fixes cockroachdb#35144. This commit fixes the referenced issues by eliminating the practice of updating the transaction coordinator's proto directly from its heartbeat loop. This was problematic because of the race described in https://github.com/cockroachdb/cockroach/blob/dc3686f79b3750500efaff7092c81a3e5ce6d02c/pkg/kv/txn_interceptor_heartbeater.go#L357-L364. The heartbeat loop doesn't know if it's racing with an EndTransaction request and it could incorrectly interpret a missing transaction record if it is. The safest thing to do is to limit the path in which it informs the TxnCoordSender of updates. This limits the responsibility of the heartbeat loop. Its job is now only to: 1. update the transaction record to maintain liveness 2. eagerly clean up a transaction if it is found to be aborted 3. inform the transaction coordinator about an aborted transaction record IF the transaction coordinator is continuing to send requests through the interceptor. Notably, the heartbeat loop no longer blindly updates the transaction coordinator's transaction proto. There wasn't a strong reason for it to be able to do so, especially now that we no longer push transactions or ratchet their priority frequently. Moreover, even if those were still frequent occurrences, updating the proto from the heartbeat loop prevented usual restart handling from being used. For instance, doing so might prevent us from refreshing the transaction. All in all, allowing this didn't seem worth the complexity. This commit also includes some cleanup. For instance, it removes a confusing dependency where the txnHeartbeater called back into the TxnCoordSender. It also addresses a longstanding TODO to actually unit test the txnHeartbeater. Release note: None

39699: kv: don't update transaction proto directly from heartbeat loop r=nvanbenschoten a=nvanbenschoten Fixes #39652. Fixes #39661. Fixes #35144. This commit fixes the referenced issues by eliminating the practice of updating the transaction coordinator's proto directly from its heartbeat loop. This was problematic because of the race described in https://github.com/cockroachdb/cockroach/blob/dc3686f79b3750500efaff7092c81a3e5ce6d02c/pkg/kv/txn_interceptor_heartbeater.go#L357-L364 The heartbeat loop doesn't know if it's racing with an EndTransaction request and it could incorrectly interpret a missing transaction record if it is. The safest thing to do is to limit the path in which it informs the TxnCoordSender of updates. This limits the responsibility of the heartbeat loop. Its job is now only to: 1. update the transaction record to maintain liveness 2. eagerly clean up a transaction if it is found to be aborted 3. inform the transaction coordinator about an aborted transaction record IF the transaction coordinator is continuing to send requests through the interceptor. Notably, the heartbeat loop no longer blindly updates the transaction coordinator's transaction proto. There wasn't a strong reason for it to be able to do so, especially now that we no longer push transactions or ratchet their priority frequently. Moreover, even if those were still frequent occurrences, updating the proto from the heartbeat loop prevented usual restart handling from being used. For instance, doing so might prevent us from refreshing the transaction. All in all, allowing this didn't seem worth the complexity. This commit also includes some cleanup. For instance, it removes a confusing dependency where the txnHeartbeater called back into the TxnCoordSender. It also addresses a longstanding TODO to actually unit test the txnHeartbeater. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Aug 14, 2019

cockroach-teamcity added this to the 19.2 milestone Aug 14, 2019

nvanbenschoten mentioned this issue Aug 16, 2019

kv: don't update transaction proto directly from heartbeat loop #39699

Merged

craig bot closed this as completed in 1a84480 Aug 21, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

teamcity: failed test: TestConcurrentRaftSnapshots #39661

teamcity: failed test: TestConcurrentRaftSnapshots #39661

cockroach-teamcity commented Aug 14, 2019

teamcity: failed test: TestConcurrentRaftSnapshots #39661

teamcity: failed test: TestConcurrentRaftSnapshots #39661

Comments

cockroach-teamcity commented Aug 14, 2019