teamcity: failed test: TestConcurrentRaftSnapshots #39652

Closed · cockroach-teamcity opened this issue Aug 14, 2019 · 5 comments · Fixed by #39699

Labels: C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone: 19.2

cockroach-teamcity commented Aug 14, 2019

The following tests appear to have failed on master (testrace): TestConcurrentRaftSnapshots

You may want to check for open issues.

#1437952:

```
TestConcurrentRaftSnapshots
...iled: result is ambiguous (server shutdown)
I190814 05:35:42.953318 100587 internal/client/txn.go:619  async rollback failed: node unavailable; try another peer
W190814 05:35:42.938309 98532 storage/store.go:1535  [s2,r1/2:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
I190814 05:35:42.954053 100529 util/stop/stopper.go:542  quiescing; tasks left:
1      mtc send
I190814 05:35:42.955682 100417 internal/client/txn.go:619  async rollback failed: result is ambiguous (server shutdown)
W190814 05:35:42.957766 98877 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
I190814 05:35:42.977813 100489 internal/client/txn.go:619  async rollback failed: result is ambiguous (server shutdown)
W190814 05:35:43.001811 98685 storage/store.go:1535  [s3,r1/3:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 05:35:43.017634 99383 storage/raft_transport.go:294  unable to accept Raft message from (n4,s4):?: no handler registered for (n1,s1):?
W190814 05:35:43.046216 99309 storage/store.go:3547  [s4] raft error: node 1 claims to not contain store 1 for replica (n1,s1):?: store 1 was not found
W190814 05:35:43.046671 99364 storage/raft_transport.go:620  while processing outgoing Raft queue to node 1: store 1 was not found:
W190814 05:35:43.049448 98877 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 05:35:43.050723 98407 gossip/gossip.go:1522  [n1] first range unavailable; trying remaining resolvers
I190814 05:35:43.070898 100668 gossip/client.go:124  [n1] started gossip client to 127.0.0.1:36661
W190814 05:35:43.128152 98685 storage/store.go:1535  [s3,r1/3:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 05:35:43.187793 99157 gossip/gossip.go:1522  [n5] first range unavailable; trying remaining resolvers
I190814 05:35:43.186614 98623 gossip/client.go:129  [n3] closing client to n1 (127.0.0.1:37147): stopping outgoing client to n1 (127.0.0.1:37147); already have incoming
W190814 05:35:43.204717 98708 gossip/gossip.go:1522  [n3] first range unavailable; trying remaining resolvers
I190814 05:35:43.210250 100654 gossip/client.go:124  [n5] started gossip client to 127.0.0.1:37147
I190814 05:35:43.225293 100714 gossip/client.go:124  [n3] started gossip client to 127.0.0.1:36977
W190814 05:35:43.229673 98877 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
I190814 05:35:43.414295 100528 util/stop/stopper.go:542  quiescing; tasks left:
10     rpc heartbeat
I190814 05:35:43.417655 100528 util/stop/stopper.go:542  quiescing; tasks left:
9      rpc heartbeat
W190814 05:35:43.419049 99582 storage/raft_transport.go:620  while processing outgoing Raft queue to node 3: rpc error: code = Unavailable desc = transport is closing:
I190814 05:35:43.428275 100528 util/stop/stopper.go:542  quiescing; tasks left:
8      rpc heartbeat
W190814 05:35:43.429368 99273 storage/raft_transport.go:620  while processing outgoing Raft queue to node 1: EOF:
W190814 05:35:43.430436 100122 storage/raft_transport.go:620  while processing outgoing Raft queue to node 4: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 05:35:43.432401 98428 gossip/gossip.go:1517  [n2] no incoming or outgoing connections
W190814 05:35:43.432661 99242 storage/raft_transport.go:620  while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 05:35:43.440940 98827 gossip/gossip.go:1517  [n4] no incoming or outgoing connections
    client_raft_test.go:1119: change replicas of r1 failed: TransactionStatusError: client already committed or rolled back the transaction. Trying to execute: EndTransaction(commit:true) [/Min] (REASON_UNKNOWN)
```

Please assign, take a look and update the issue accordingly.

cockroach-teamcity added C-test-failure and O-robot labels Aug 14, 2019
cockroach-teamcity added this to the 19.2 milestone Aug 14, 2019

tbg commented Aug 14, 2019

Repros easily with

```
make stressrace PKG=./pkg/storage/ TESTS=TestConcurrentRaftSnapshots
```

I'll take a look

tbg self-assigned this Aug 14, 2019

tbg commented Aug 14, 2019

The problem seems to be that we're returning here without an error:

```go
b.AddRawRequest(&roachpb.EndTransactionRequest{
	Commit: true,
	InternalCommitTrigger: &roachpb.InternalCommitTrigger{
		ChangeReplicasTrigger: crt,
	},
})
if err := txn.Run(ctx, b); err != nil {
```

But the transaction at that point is ABORTED. The TxnCoordSender saw the commit, so it has put the txn into a finalized state. Since the retriable closure returns without an error, though, the txn runner thinks it still needs to commit, does so, and trips the assertion against use of finalized txns:

```go
err = txn.Commit(ctx)
```

Now on to finding out where we're losing the error that the client should see when the commit doesn't go through.


tbg commented Aug 14, 2019

EndTransaction itself evaluates fine (i.e., it returns a COMMITTED proto), but after the call to txn.Run(batch) the resulting proto is ABORTED.

tbg added a commit to tbg/cockroach that referenced this issue Aug 14, 2019
At init() time, the txn proto has not been populated yet.
Found while investigating cockroachdb#39652.

Release note: None

tbg commented Aug 14, 2019

Ok, I understand this now. This is a race between the txn's heartbeat loop and the commit. It looks like we manage to commit, but before the response enters the TxnCoordSender, its heartbeat loop misses a heartbeat and marks the txn as ABORTED here:

```go
h.mu.txn.Status = roachpb.ABORTED
```

However, it seems that the TxnCoordSender still returns the (successful) commit response to the client here (except that the COMMITTED status was dropped on the floor, since the txn proto was already ABORTED):

```go
b.AddRawRequest(&roachpb.EndTransactionRequest{
	Commit: true,
	InternalCommitTrigger: &roachpb.InternalCommitTrigger{
		ChangeReplicasTrigger: crt,
	},
})
if err := txn.Run(ctx, b); err != nil {
	log.Event(ctx, err.Error())
	return err
}
```

and so we hit the `return nil` while the txn is ABORTED, which tricks the txn retry loop into an erroneous autocommit:

```go
if err == nil {
	if txn.status() != roachpb.COMMITTED {
		err = txn.Commit(ctx)
```

which then fails because the TxnCoordSender was already finalized here:

```go
tc.mu.txnState = txnFinalized
```

The fundamental problem here seems to be with the heartbeat loop. If the heartbeat loop changes the txn while there is an in-flight commit attempt, the client must receive an ambiguous result. In general it seems fraught to have the heartbeats meddle with the state like that; if a heartbeat fails, we just want to notify the client. We can do that more easily by setting a flag that is queried only when the client tries to make another request and that, if hit, returns a transaction-aborted error.
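
A minimal sketch of that approach (the names `heartbeatObserver`, `onAbortedRecord`, and `checkBeforeSend` are illustrative, not the actual TxnCoordSender API):

```go
package kv

import (
	"errors"
	"sync"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// heartbeatObserver sketches the proposed mechanism: the heartbeat goroutine
// only records that it saw an aborted txn record; the error then surfaces on
// the client's next request instead of the proto being mutated underneath an
// in-flight commit.
type heartbeatObserver struct {
	mu struct {
		sync.Mutex
		// finalObservedStatus is written by the heartbeat goroutine and read
		// only on the client's request path.
		finalObservedStatus roachpb.TransactionStatus
	}
}

// onAbortedRecord runs on the heartbeat goroutine. Note that it does not
// touch the coordinator's txn proto.
func (h *heartbeatObserver) onAbortedRecord() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.mu.finalObservedStatus = roachpb.ABORTED
}

// checkBeforeSend runs on the client goroutine before each new request and
// returns a txn-aborted error if the heartbeat loop observed an abort.
func (h *heartbeatObserver) checkBeforeSend() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.mu.finalObservedStatus == roachpb.ABORTED {
		return errors.New("TransactionAbortedError: heartbeat observed aborted txn record")
	}
	return nil
}
```

This keeps every update to the txn proto on the main goroutine's normal response path.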

@nvanbenschoten could I throw this your way? I'm going to send a PR to skip this test (though the problem ultimately isn't specific to the test, the test seems to hit it reliably now that learners are in there, for whatever reason).

tbg added a commit to tbg/cockroach that referenced this issue Aug 14, 2019
The test highlights a bug in production code and is too flaky to remain
unskipped while we fix said bug.

See cockroachdb#39652.

Release note: None
craig bot pushed a commit that referenced this issue Aug 14, 2019
39657: storage: skip TestConcurrentRaftSnapshots r=ajwerner a=tbg

The test highlights a bug in production code and is too flaky to remain
unskipped while we fix said bug.

See #39652.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
nvanbenschoten commented Aug 14, 2019

So it sounds like what's happening is exactly this:

```go
// TODO(nvanbenschoten): It's possible for a HeartbeatTxn request to observe
// the result of an EndTransaction request and beat it back to the client.
// This is an issue when a COMMITTED txn record is GCed and later re-written
// as ABORTED. The coordinator's local status could flip from PENDING to
// ABORTED (after heartbeat response) to COMMITTED (after commit response).
// This appears to be benign, but it's still somewhat disconcerting. If this
// ever causes any issues, we'll need to be smarter about detecting this race
// on the client and conditionally ignoring the result of heartbeat responses.
```

I really should have taken the time to address this back when I was originally thinking through it. I guess I didn't want to start pulling on another thread while in the middle of trying to figure out everything with parallel commits.

I think that was broken by this change, which prevents the COMMITTED status from overwriting the ABORTED status. We should log a loud error if anyone tries to update a transaction proto with that transition.
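
Something like the following (a hypothetical sketch of such a guard, not the actual roachpb.Transaction.Update code):

```go
package kv

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// checkStatusTransition sketches the proposed assertion: a txn the
// coordinator already considers ABORTED must never flip to COMMITTED, so the
// guard keeps the old status and logs loudly instead of applying the update.
func checkStatusTransition(
	ctx context.Context, cur, next roachpb.TransactionStatus,
) roachpb.TransactionStatus {
	if cur == roachpb.ABORTED && next == roachpb.COMMITTED {
		log.Errorf(ctx, "programming error: illegal status transition ABORTED -> COMMITTED")
		return cur
	}
	return next
}
```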

I like your idea about the heartbeat loop setting a flag that is queried only when the client tries to make another request. That way, the transaction is always properly updated by its main goroutine through the normal mechanisms.

craig bot pushed a commit that referenced this issue Aug 14, 2019
39640: storage: unify replica addition and removal paths r=nvanbenschoten a=tbg

This continues the reworking of the various replication change APIs with
the goal of allowing a) testing of general atomic replication changes b)
issuing replica swaps from the replicate queue (in 19.2).

For previous steps, see:

#39485
#39611

This change is not a pure plumbing PR. Instead, it unifies
`(*Replica).addReplica` and `(*Replica).removeReplica` into a method that
can do both, `(*Replica).addAndRemoveReplicas`.

Given a slice of ReplicationChanges, this method first adds learner
replicas corresponding to the desired new voters. After having sent
snapshots to all of them, the method issues a configuration change that
atomically
- upgrades all learners to voters
- removes any undesired replicas.

Note that no atomic membership changes are *actually* carried out yet. This
is because the callers of `addAndRemoveReplicas` pass in only a single
change (i.e. an addition or removal), which the method also verifies.
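
In outline, the described flow looks something like this (a rough sketch with hypothetical helper names, not the actual method's signature or body):

```go
package storage

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// Hypothetical helpers standing in for the real learner, snapshot, and
// config-change machinery; these are assumptions for the sake of the sketch.
var (
	addLearner            func(context.Context, roachpb.ReplicationTarget) error
	sendLearnerSnapshot   func(context.Context, roachpb.ReplicationTarget) error
	execChangeReplicasTxn func(context.Context, []roachpb.ReplicationChange) error
)

// addAndRemoveReplicasSketch outlines the flow: add learners for the new
// voters, send each a snapshot, then issue one configuration change that
// atomically promotes the learners and removes the undesired replicas.
func addAndRemoveReplicasSketch(ctx context.Context, chgs []roachpb.ReplicationChange) error {
	// Phase 1: add a learner replica for every desired new voter.
	for _, chg := range chgs {
		if chg.ChangeType == roachpb.ADD_REPLICA {
			if err := addLearner(ctx, chg.Target); err != nil {
				return err
			}
		}
	}
	// Phase 2: send each learner a snapshot so it can catch up before
	// being promoted.
	for _, chg := range chgs {
		if chg.ChangeType == roachpb.ADD_REPLICA {
			if err := sendLearnerSnapshot(ctx, chg.Target); err != nil {
				return err
			}
		}
	}
	// Phase 3: one configuration change atomically upgrades all learners
	// to voters and removes any undesired replicas.
	return execChangeReplicasTxn(ctx, chgs)
}
```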

Three pieces are missing after this PR: First, we need to be able to
instruct raft to carry out atomic configuration changes:

https://github.com/cockroachdb/cockroach/blob/2e8db6ca53c59d3d281e64939f79d937195403d4/pkg/storage/replica_proposal_buf.go#L448-L451

which in particular requires being able to store the ConfState
corresponding to a joint configuration in the unreplicated local state
(under a new key).

Second, we must pass the slice of changes handed to
`AdminChangeReplicas` through to `addAndRemoveReplicas` without unrolling
it first, see:

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica_command.go#L870-L891

and

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica.go#L1314-L1325

Third, we must teach the replicate queue to issue the "atomic swaps";
this is the reason we're introducing atomic membership changes in the first
place.

Release note: None

39656: kv: init heartbeat txn log tag later r=nvanbenschoten a=tbg

At init() time, the txn proto has not been populated yet.
Found while investigating #39652.

This change strikes me as clunky, but I don't have the bandwidth to dig deeper
right now.

Release note: None

39666: testutils/lint/passes: disable under nightly stress r=mjibson a=mjibson

Under stress, these passes error with "go build a: failed to cache compiled Go files".

Fixes #39616
Fixes #39541
Fixes #39479

Release note: None

39669: rpc: use gRPC enforced minimum keepalive timeout r=knz a=ajwerner

Before this commit, we'd see the following annoying log message from gRPC every time we created a new connection, telling us that our setting was being ignored:

```
Adjusting keepalive ping interval to minimum period of 10s
```
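
For reference, a minimal sketch of dialing with the enforced minimum via the standard gRPC keepalive API (not the exact rpc package change; the Timeout value is illustrative):

```go
package rpc

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// gRPC enforces a 10s minimum on the client keepalive ping interval; asking
// for less only triggers the "Adjusting keepalive ping interval" warning, so
// request the enforced minimum directly.
var keepaliveDialOpt = grpc.WithKeepaliveParams(keepalive.ClientParameters{
	Time:    10 * time.Second, // gRPC's enforced minimum ping interval
	Timeout: 3 * time.Second,  // hypothetical timeout value
})
```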

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
tbg assigned nvanbenschoten and unassigned tbg Aug 14, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 16, 2019
Fixes cockroachdb#39652.
Fixes cockroachdb#39661.
Fixes cockroachdb#35144.

This commit fixes the referenced issues by eliminating the practice of updating
the transaction coordinator's proto directly from its heartbeat loop. This was
problematic because of the race described in
https://github.com/cockroachdb/cockroach/blob/dc3686f79b3750500efaff7092c81a3e5ce6d02c/pkg/kv/txn_interceptor_heartbeater.go#L357-L364.
The heartbeat loop doesn't know if it's racing with an EndTransaction request
and it could incorrectly interpret a missing transaction record if it is. The
safest thing to do is to limit the path in which it informs the TxnCoordSender
of updates.

This limits the responsibility of the heartbeat loop. Its job is now
only to:
1. update the transaction record to maintain liveness
2. eagerly clean up a transaction if it is found to be aborted
3. inform the transaction coordinator about an aborted transaction
   record IF the transaction coordinator is continuing to send requests
   through the interceptor.

Notably, the heartbeat loop no longer blindly updates the transaction
coordinator's transaction proto. There wasn't a strong reason for it to be able
to do so, especially now that we no longer push transactions or ratchet their
priority frequently. Moreover, even if those were still frequent occurrences,
updating the proto from the heartbeat loop prevented usual restart handling from
being used. For instance, doing so might prevent us from refreshing the
transaction. All in all, allowing this didn't seem worth the complexity.

This commit also includes some cleanup. For instance, it removes a confusing
dependency where the txnHeartbeater called back into the TxnCoordSender. It also
addresses a longstanding TODO to actually unit test the txnHeartbeater.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 21, 2019
craig bot pushed a commit that referenced this issue Aug 21, 2019
39699: kv: don't update transaction proto directly from heartbeat loop r=nvanbenschoten a=nvanbenschoten

Fixes #39652.
Fixes #39661.
Fixes #35144.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot closed this as completed in 1a84480 Aug 21, 2019