teamcity: failed test: TestConcurrentRaftSnapshots #39652

Closed · cockroach-teamcity opened this issue Aug 14, 2019 · 5 comments · Fixed by #39699

Labels: C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone: 19.2

cockroach-teamcity commented Aug 14, 2019

The following tests appear to have failed on master (testrace): TestConcurrentRaftSnapshots

You may want to check for open issues.

#1437952:

```
TestConcurrentRaftSnapshots
...iled: result is ambiguous (server shutdown)
I190814 05:35:42.953318 100587 internal/client/txn.go:619  async rollback failed: node unavailable; try another peer
W190814 05:35:42.938309 98532 storage/store.go:1535  [s2,r1/2:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
I190814 05:35:42.954053 100529 util/stop/stopper.go:542  quiescing; tasks left:
1      mtc send
I190814 05:35:42.955682 100417 internal/client/txn.go:619  async rollback failed: result is ambiguous (server shutdown)
W190814 05:35:42.957766 98877 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
I190814 05:35:42.977813 100489 internal/client/txn.go:619  async rollback failed: result is ambiguous (server shutdown)
W190814 05:35:43.001811 98685 storage/store.go:1535  [s3,r1/3:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 05:35:43.017634 99383 storage/raft_transport.go:294  unable to accept Raft message from (n4,s4):?: no handler registered for (n1,s1):?
W190814 05:35:43.046216 99309 storage/store.go:3547  [s4] raft error: node 1 claims to not contain store 1 for replica (n1,s1):?: store 1 was not found
W190814 05:35:43.046671 99364 storage/raft_transport.go:620  while processing outgoing Raft queue to node 1: store 1 was not found:
W190814 05:35:43.049448 98877 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 05:35:43.050723 98407 gossip/gossip.go:1522  [n1] first range unavailable; trying remaining resolvers
I190814 05:35:43.070898 100668 gossip/client.go:124  [n1] started gossip client to 127.0.0.1:36661
W190814 05:35:43.128152 98685 storage/store.go:1535  [s3,r1/3:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 05:35:43.187793 99157 gossip/gossip.go:1522  [n5] first range unavailable; trying remaining resolvers
I190814 05:35:43.186614 98623 gossip/client.go:129  [n3] closing client to n1 (127.0.0.1:37147): stopping outgoing client to n1 (127.0.0.1:37147); already have incoming
W190814 05:35:43.204717 98708 gossip/gossip.go:1522  [n3] first range unavailable; trying remaining resolvers
I190814 05:35:43.210250 100654 gossip/client.go:124  [n5] started gossip client to 127.0.0.1:37147
I190814 05:35:43.225293 100714 gossip/client.go:124  [n3] started gossip client to 127.0.0.1:36977
W190814 05:35:43.229673 98877 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
I190814 05:35:43.414295 100528 util/stop/stopper.go:542  quiescing; tasks left:
10     rpc heartbeat
I190814 05:35:43.417655 100528 util/stop/stopper.go:542  quiescing; tasks left:
9      rpc heartbeat
W190814 05:35:43.419049 99582 storage/raft_transport.go:620  while processing outgoing Raft queue to node 3: rpc error: code = Unavailable desc = transport is closing:
I190814 05:35:43.428275 100528 util/stop/stopper.go:542  quiescing; tasks left:
8      rpc heartbeat
W190814 05:35:43.429368 99273 storage/raft_transport.go:620  while processing outgoing Raft queue to node 1: EOF:
W190814 05:35:43.430436 100122 storage/raft_transport.go:620  while processing outgoing Raft queue to node 4: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 05:35:43.432401 98428 gossip/gossip.go:1517  [n2] no incoming or outgoing connections
W190814 05:35:43.432661 99242 storage/raft_transport.go:620  while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 05:35:43.440940 98827 gossip/gossip.go:1517  [n4] no incoming or outgoing connections
    client_raft_test.go:1119: change replicas of r1 failed: TransactionStatusError: client already committed or rolled back the transaction. Trying to execute: EndTransaction(commit:true) [/Min] (REASON_UNKNOWN)
```

Please assign, take a look and update the issue accordingly.

cockroach-teamcity added C-test-failure and O-robot labels Aug 14, 2019
cockroach-teamcity added this to the 19.2 milestone Aug 14, 2019

tbg commented Aug 14, 2019

Repros easily with

```
make stressrace PKG=./pkg/storage/ TESTS=TestConcurrentRaftSnapshots
```

I'll take a look

tbg self-assigned this Aug 14, 2019

tbg commented Aug 14, 2019

The problem seems to be that we're returning here without an error:

```go
b.AddRawRequest(&roachpb.EndTransactionRequest{
	Commit: true,
	InternalCommitTrigger: &roachpb.InternalCommitTrigger{
		ChangeReplicasTrigger: crt,
	},
})
if err := txn.Run(ctx, b); err != nil {
```

But the transaction at that point is ABORTED. The TxnCoordSender saw the commit, so it has put the txn into a finalized state. Since the retriable closure returns without an error, though, the txn runner thinks it still needs to commit, does so, and trips the assertion against use of finalized txns:

```go
err = txn.Commit(ctx)
```

Now on to finding out where we're losing the error that the client should see when the commit doesn't go through.


tbg commented Aug 14, 2019

EndTransaction itself evaluates fine (i.e., it returns a COMMITTED proto), but after the call to txn.Run(batch) the resulting proto is ABORTED.

tbg added a commit to tbg/cockroach that referenced this issue Aug 14, 2019
At init() time, the txn proto has not been populated yet.
Found while investigating cockroachdb#39652.

Release note: None

tbg commented Aug 14, 2019

Ok, I understand this now. This is a race between the txn's heartbeat loop and the commit. It looks like we manage to commit, but before the response enters the TxnCoordSender, its heartbeat loop misses a heartbeat and marks the txn as ABORTED here:

```go
h.mu.txn.Status = roachpb.ABORTED
```

However, it seems that the TxnCoordSender still returns the (successful) commit response to the client here (except that the COMMITTED status was dropped on the floor, since the txn proto was already ABORTED):

```go
b.AddRawRequest(&roachpb.EndTransactionRequest{
	Commit: true,
	InternalCommitTrigger: &roachpb.InternalCommitTrigger{
		ChangeReplicasTrigger: crt,
	},
})
if err := txn.Run(ctx, b); err != nil {
	log.Event(ctx, err.Error())
	return err
}
```

and so we hit the `return nil` while the txn is ABORTED, which tricks the txn retry loop into an erroneous autocommit:

```go
if err == nil {
	if txn.status() != roachpb.COMMITTED {
		err = txn.Commit(ctx)
```

which then fails because the TxnCoordSender was already finalized here:

```go
tc.mu.txnState = txnFinalized
```

The fundamental problem here seems to be with the heartbeat loop. If the heartbeat loop changes the txn while there is an in-flight commit attempt, the client must receive an ambiguous result. In general it seems fraught to have the heartbeats meddle with the state like that; if a heartbeat fails, we just want to notify the client. We can do that more easily by setting a flag that is queried only when the client tries to make another request and that, if hit, returns a transaction-aborted error.
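
A minimal sketch of that approach (the names `heartbeatObserver`, `onAbortedRecord`, and `checkBeforeSend` are illustrative, not the actual TxnCoordSender API):

```go
package kv

import (
	"errors"
	"sync"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// heartbeatObserver sketches the proposed mechanism: the heartbeat goroutine
// only records that it saw an aborted txn record; the error then surfaces on
// the client's next request instead of the proto being mutated underneath an
// in-flight commit.
type heartbeatObserver struct {
	mu struct {
		sync.Mutex
		// finalObservedStatus is written by the heartbeat goroutine and read
		// only on the client's request path.
		finalObservedStatus roachpb.TransactionStatus
	}
}

// onAbortedRecord runs on the heartbeat goroutine. Note that it does not
// touch the coordinator's txn proto.
func (h *heartbeatObserver) onAbortedRecord() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.mu.finalObservedStatus = roachpb.ABORTED
}

// checkBeforeSend runs on the client goroutine before each new request and
// returns a txn-aborted error if the heartbeat loop observed an abort.
func (h *heartbeatObserver) checkBeforeSend() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.mu.finalObservedStatus == roachpb.ABORTED {
		return errors.New("TransactionAbortedError: heartbeat observed aborted txn record")
	}
	return nil
}
```

This keeps every update to the txn proto on the main goroutine's normal response path.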

@nvanbenschoten could I throw this your way? I'm going to send a PR to skip this test (though the problem ultimately isn't specific to the test, the test seems to hit it reliably now that learners are in there, for whatever reason).

tbg added a commit to tbg/cockroach that referenced this issue Aug 14, 2019
The test highlights a bug in production code and is too flaky to remain
unskipped while we fix said bug.

See cockroachdb#39652.

Release note: None
craig bot pushed a commit that referenced this issue Aug 14, 2019
39657: storage: skip TestConcurrentRaftSnapshots r=ajwerner a=tbg

The test highlights a bug in production code and is too flaky to remain
unskipped while we fix said bug.

See #39652.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
nvanbenschoten commented Aug 14, 2019

So it sounds like what's happening is exactly this:

```go
// TODO(nvanbenschoten): It's possible for a HeartbeatTxn request to observe
// the result of an EndTransaction request and beat it back to the client.
// This is an issue when a COMMITTED txn record is GCed and later re-written
// as ABORTED. The coordinator's local status could flip from PENDING to
// ABORTED (after heartbeat response) to COMMITTED (after commit response).
// This appears to be benign, but it's still somewhat disconcerting. If this
// ever causes any issues, we'll need to be smarter about detecting this race
// on the client and conditionally ignoring the result of heartbeat responses.
```

I really should have taken the time to address this back when I was originally thinking through it. I guess I didn't want to start pulling on another thread while in the middle of trying to figure out everything with parallel commits.

I think that was broken by this change, which prevents the COMMITTED status from overwriting the ABORTED status. We should log a loud error if anyone tries to update a transaction proto with that transition.
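
Something like the following (a hypothetical sketch of such a guard, not the actual roachpb.Transaction.Update code):

```go
package kv

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// checkStatusTransition sketches the proposed assertion: a txn the
// coordinator already considers ABORTED must never flip to COMMITTED, so the
// guard keeps the old status and logs loudly instead of applying the update.
func checkStatusTransition(
	ctx context.Context, cur, next roachpb.TransactionStatus,
) roachpb.TransactionStatus {
	if cur == roachpb.ABORTED && next == roachpb.COMMITTED {
		log.Errorf(ctx, "programming error: illegal status transition ABORTED -> COMMITTED")
		return cur
	}
	return next
}
```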

I like your idea about the heartbeat loop setting a flag that is queried only when the client tries to make another request. That way, the transaction is always properly updated by its main goroutine through the normal mechanisms.

craig bot pushed a commit that referenced this issue Aug 14, 2019
39640: storage: unify replica addition and removal paths r=nvanbenschoten a=tbg

This continues the reworking of the various replication change APIs with
the goal of allowing a) testing of general atomic replication changes b)
issuing replica swaps from the replicate queue (in 19.2).

For previous steps, see:

#39485
#39611

This change is not a pure plumbing PR. Instead, it unifies
`(*Replica).addReplica` and `(*Replica).removeReplica` into a method that
can do both, `(*Replica).addAndRemoveReplicas`.

Given a slice of ReplicationChanges, this method first adds learner
replicas corresponding to the desired new voters. After having sent
snapshots to all of them, the method issues a configuration change that
atomically
- upgrades all learners to voters
- removes any undesired replicas.

Note that no atomic membership changes are *actually* carried out yet. This
is because the callers of `addAndRemoveReplicas` pass in only a single
change (i.e. an addition or removal), which the method also verifies.
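
In outline, the described flow looks something like this (a rough sketch with hypothetical helper names, not the actual method's signature or body):

```go
package storage

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// Hypothetical helpers standing in for the real learner, snapshot, and
// config-change machinery; these are assumptions for the sake of the sketch.
var (
	addLearner            func(context.Context, roachpb.ReplicationTarget) error
	sendLearnerSnapshot   func(context.Context, roachpb.ReplicationTarget) error
	execChangeReplicasTxn func(context.Context, []roachpb.ReplicationChange) error
)

// addAndRemoveReplicasSketch outlines the flow: add learners for the new
// voters, send each a snapshot, then issue one configuration change that
// atomically promotes the learners and removes the undesired replicas.
func addAndRemoveReplicasSketch(ctx context.Context, chgs []roachpb.ReplicationChange) error {
	// Phase 1: add a learner replica for every desired new voter.
	for _, chg := range chgs {
		if chg.ChangeType == roachpb.ADD_REPLICA {
			if err := addLearner(ctx, chg.Target); err != nil {
				return err
			}
		}
	}
	// Phase 2: send each learner a snapshot so it can catch up before
	// being promoted.
	for _, chg := range chgs {
		if chg.ChangeType == roachpb.ADD_REPLICA {
			if err := sendLearnerSnapshot(ctx, chg.Target); err != nil {
				return err
			}
		}
	}
	// Phase 3: one configuration change atomically upgrades all learners
	// to voters and removes any undesired replicas.
	return execChangeReplicasTxn(ctx, chgs)
}
```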

Three pieces are missing after this PR: First, we need to be able to
instruct raft to carry out atomic configuration changes:

https://github.com/cockroachdb/cockroach/blob/2e8db6ca53c59d3d281e64939f79d937195403d4/pkg/storage/replica_proposal_buf.go#L448-L451

which in particular requires being able to store the ConfState
corresponding to a joint configuration in the unreplicated local state
(under a new key).

Second, we must pass the slice of changes handed to
`AdminChangeReplicas` through to `addAndRemoveReplicas` without unrolling
it first, see:

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica_command.go#L870-L891

and

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica.go#L1314-L1325

Third, we must teach the replicate queue to issue the "atomic swaps";
this is the reason we're introducing atomic membership changes in the first
place.

Release note: None

39656: kv: init heartbeat txn log tag later r=nvanbenschoten a=tbg

At init() time, the txn proto has not been populated yet.
Found while investigating #39652.

This change strikes me as clunky, but I don't have the bandwidth to dig deeper
right now.

Release note: None

39666: testutils/lint/passes: disable under nightly stress r=mjibson a=mjibson

Under stress, these passes error with "go build a: failed to cache compiled Go files".

Fixes #39616
Fixes #39541
Fixes #39479

Release note: None

39669: rpc: use gRPC enforced minimum keepalive timeout r=knz a=ajwerner

Before this commit, we'd see the following annoying log message from gRPC every time we created a new connection, telling us that our setting was being ignored:

```
Adjusting keepalive ping interval to minimum period of 10s
```
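
For reference, a minimal sketch of dialing with the enforced minimum via the standard gRPC keepalive API (not the exact rpc package change; the Timeout value is illustrative):

```go
package rpc

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// gRPC enforces a 10s minimum on the client keepalive ping interval; asking
// for less only triggers the "Adjusting keepalive ping interval" warning, so
// request the enforced minimum directly.
var keepaliveDialOpt = grpc.WithKeepaliveParams(keepalive.ClientParameters{
	Time:    10 * time.Second, // gRPC's enforced minimum ping interval
	Timeout: 3 * time.Second,  // hypothetical timeout value
})
```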

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
tbg assigned nvanbenschoten and unassigned tbg Aug 14, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 16, 2019
Fixes cockroachdb#39652.
Fixes cockroachdb#39661.
Fixes cockroachdb#35144.

This commit fixes the referenced issues by eliminating the practice of updating
the transaction coordinator's proto directly from its heartbeat loop. This was
problematic because of the race described in
https://github.com/cockroachdb/cockroach/blob/dc3686f79b3750500efaff7092c81a3e5ce6d02c/pkg/kv/txn_interceptor_heartbeater.go#L357-L364.
The heartbeat loop doesn't know if it's racing with an EndTransaction request
and it could incorrectly interpret a missing transaction record if it is. The
safest thing to do is to limit the path in which it informs the TxnCoordSender
of updates.

This limits the responsibility of the heartbeat loop. Its job is now
only to:
1. update the transaction record to maintain liveness
2. eagerly clean up a transaction if it is found to be aborted
3. inform the transaction coordinator about an aborted transaction
   record IF the transaction coordinator is continuing to send requests
   through the interceptor.

Notably, the heartbeat loop no longer blindly updates the transaction
coordinator's transaction proto. There wasn't a strong reason for it to be able
to do so, especially now that we no longer push transactions or ratchet their
priority frequently. Moreover, even if those were still frequent occurrences,
updating the proto from the heartbeat loop prevented usual restart handling from
being used. For instance, doing so might prevent us from refreshing the
transaction. All in all, allowing this didn't seem worth the complexity.

This commit also includes some cleanup. For instance, it removes a confusing
dependency where the txnHeartbeater called back into the TxnCoordSender. It also
addresses a longstanding TODO to actually unit test the txnHeartbeater.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 21, 2019
craig bot pushed a commit that referenced this issue Aug 21, 2019
39699: kv: don't update transaction proto directly from heartbeat loop r=nvanbenschoten a=nvanbenschoten

Fixes #39652.
Fixes #39661.
Fixes #35144.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot closed this as completed in 1a84480 Aug 21, 2019