Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

teamcity: failed test: TestConcurrentRaftSnapshots #39661

Closed
cockroach-teamcity opened this issue Aug 14, 2019 · 0 comments · Fixed by #39699
Closed

teamcity: failed test: TestConcurrentRaftSnapshots #39661

cockroach-teamcity opened this issue Aug 14, 2019 · 0 comments · Fixed by #39699
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

The following tests appear to have failed on master (testrace): TestConcurrentRaftSnapshots

You may want to check for open issues.

#1438837:

TestConcurrentRaftSnapshots
...to all 4 replicas failed; last error: (err: node unavailable; try another peer) <nil>
W190814 14:36:22.952605 103654 internal/client/txn.go:509  [liveness-hb] failure aborting transaction: node unavailable; try another peer; abort caused by: node unavailable; try another peer
W190814 14:36:22.953295 103654 storage/node_liveness.go:484  [liveness-hb] failed node liveness heartbeat: node unavailable; try another peer
I190814 14:36:22.972668 105719 internal/client/txn.go:619  async rollback failed: node unavailable; try another peer
W190814 14:36:22.975256 103477 storage/store.go:1535  [s1,r1/1:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 14:36:22.975718 103973 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
I190814 14:36:22.977749 105566 internal/client/txn.go:619  async rollback failed: failed to send RPC: sending to all 4 replicas failed; last error: (err: node unavailable; try another peer) <nil>
W190814 14:36:22.989950 103738 storage/store.go:1535  [s3,r1/3:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 14:36:23.035861 103738 storage/store.go:1535  [s3,r1/3:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 14:36:23.040239 103601 storage/store.go:1535  [s2,r1/2:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 14:36:23.052669 103973 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
W190814 14:36:23.158153 103973 storage/store.go:1535  [s4,r1/4:/M{in-ax}] could not gossip first range descriptor: node unavailable; try another peer
I190814 14:36:23.417779 105685 util/stop/stopper.go:542  quiescing; tasks left:
12     rpc heartbeat
I190814 14:36:23.419183 105685 util/stop/stopper.go:542  quiescing; tasks left:
11     rpc heartbeat
W190814 14:36:23.423125 104049 gossip/gossip.go:1517  [n5] no incoming or outgoing connections
I190814 14:36:23.440370 105685 util/stop/stopper.go:542  quiescing; tasks left:
10     rpc heartbeat
W190814 14:36:23.442608 105064 storage/raft_transport.go:620  while processing outgoing Raft queue to node 4: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 14:36:23.444194 104677 storage/raft_transport.go:620  while processing outgoing Raft queue to node 3: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 14:36:23.444518 105701 storage/raft_transport.go:620  while processing outgoing Raft queue to node 5: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 14:36:23.450638 103653 gossip/gossip.go:1517  [n2] no incoming or outgoing connections
W190814 14:36:23.453525 104389 storage/raft_transport.go:620  while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190814 14:36:23.457033 103696 gossip/gossip.go:1517  [n3] no incoming or outgoing connections
W190814 14:36:23.457366 104320 storage/raft_transport.go:620  while processing outgoing Raft queue to node 1: rpc error: code = Unavailable desc = transport is closing:
W190814 14:36:23.458003 104008 gossip/gossip.go:1517  [n4] no incoming or outgoing connections
W190814 14:36:23.458338 104471 storage/raft_transport.go:620  while processing outgoing Raft queue to node 1: rpc error: code = Canceled desc = grpc: the client connection is closing:
I190814 14:36:23.461581 105685 util/stop/stopper.go:542  quiescing; tasks left:
6      rpc heartbeat
I190814 14:36:23.462412 105685 util/stop/stopper.go:542  quiescing; tasks left:
5      rpc heartbeat
    client_raft_test.go:1119: change replicas of r1 failed: TransactionStatusError: client already committed or rolled back the transaction. Trying to execute: EndTransaction(commit:true) [/Min] (REASON_UNKNOWN)




Please assign, take a look and update the issue accordingly.

@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Aug 14, 2019
@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Aug 14, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 16, 2019
Fixes cockroachdb#39652.
Fixes cockroachdb#39661.
Fixes cockroachdb#35144.

This commit fixes the referenced issues by eliminating the practice of updating
the transaction coordinator's proto directly from its heartbeat loop. This was
problematic because of the race described in
https://github.com/cockroachdb/cockroach/blob/dc3686f79b3750500efaff7092c81a3e5ce6d02c/pkg/kv/txn_interceptor_heartbeater.go#L357-L364.
The heartbeat loop doesn't know if it's racing with an EndTransaction request
and it could incorrectly interpret a missing transaction record if it is. The
safest thing to do is to limit the path in which it informs the TxnCoordSender
of updates.

This limits the responsibility of the heartbeat loop. Its job is now
only to:
1. update the transaction record to maintain liveness
2. eagerly clean up a transaction if it is found to be aborted
3. inform the transaction coordinator about an aborted transaction
   record IF the transaction coordinator is continuing to send requests
   through the interceptor.

Notably, the heartbeat loop no longer blindly updates the transaction
coordinator's transaction proto. There wasn't a strong reason for it to be able
to do so, especially now that we no longer push transactions or ratchet their
priority frequently. Moreover, even if those were still frequent occurrences,
updating the proto from the heartbeat loop prevented usual restart handling from
being used. For instance, doing so might prevent us from refreshing the
transaction. All in all, allowing this didn't seem worth the complexity.

This commit also includes some cleanup. For instance, it removes a confusing
dependency where the txnHeartbeater called back into the TxnCoordSender. It also
addresses a longstanding TODO to actually unit test the txnHeartbeater.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 21, 2019
Fixes cockroachdb#39652.
Fixes cockroachdb#39661.
Fixes cockroachdb#35144.

This commit fixes the referenced issues by eliminating the practice of updating
the transaction coordinator's proto directly from its heartbeat loop. This was
problematic because of the race described in
https://github.com/cockroachdb/cockroach/blob/dc3686f79b3750500efaff7092c81a3e5ce6d02c/pkg/kv/txn_interceptor_heartbeater.go#L357-L364.
The heartbeat loop doesn't know if it's racing with an EndTransaction request
and it could incorrectly interpret a missing transaction record if it is. The
safest thing to do is to limit the path in which it informs the TxnCoordSender
of updates.

This limits the responsibility of the heartbeat loop. Its job is now
only to:
1. update the transaction record to maintain liveness
2. eagerly clean up a transaction if it is found to be aborted
3. inform the transaction coordinator about an aborted transaction
   record IF the transaction coordinator is continuing to send requests
   through the interceptor.

Notably, the heartbeat loop no longer blindly updates the transaction
coordinator's transaction proto. There wasn't a strong reason for it to be able
to do so, especially now that we no longer push transactions or ratchet their
priority frequently. Moreover, even if those were still frequent occurrences,
updating the proto from the heartbeat loop prevented usual restart handling from
being used. For instance, doing so might prevent us from refreshing the
transaction. All in all, allowing this didn't seem worth the complexity.

This commit also includes some cleanup. For instance, it removes a confusing
dependency where the txnHeartbeater called back into the TxnCoordSender. It also
addresses a longstanding TODO to actually unit test the txnHeartbeater.

Release note: None
craig bot pushed a commit that referenced this issue Aug 21, 2019
39699: kv: don't update transaction proto directly from heartbeat loop r=nvanbenschoten a=nvanbenschoten

Fixes #39652.
Fixes #39661.
Fixes #35144.

This commit fixes the referenced issues by eliminating the practice of updating
the transaction coordinator's proto directly from its heartbeat loop. This was
problematic because of the race described in
https://github.com/cockroachdb/cockroach/blob/dc3686f79b3750500efaff7092c81a3e5ce6d02c/pkg/kv/txn_interceptor_heartbeater.go#L357-L364

The heartbeat loop doesn't know if it's racing with an EndTransaction request
and it could incorrectly interpret a missing transaction record if it is. The
safest thing to do is to limit the path in which it informs the TxnCoordSender
of updates.

This limits the responsibility of the heartbeat loop. Its job is now
only to:
1. update the transaction record to maintain liveness
2. eagerly clean up a transaction if it is found to be aborted
3. inform the transaction coordinator about an aborted transaction
   record IF the transaction coordinator is continuing to send requests
   through the interceptor.

Notably, the heartbeat loop no longer blindly updates the transaction
coordinator's transaction proto. There wasn't a strong reason for it to be able
to do so, especially now that we no longer push transactions or ratchet their
priority frequently. Moreover, even if those were still frequent occurrences,
updating the proto from the heartbeat loop prevented usual restart handling from
being used. For instance, doing so might prevent us from refreshing the
transaction. All in all, allowing this didn't seem worth the complexity.

This commit also includes some cleanup. For instance, it removes a confusing
dependency where the txnHeartbeater called back into the TxnCoordSender. It also
addresses a longstanding TODO to actually unit test the txnHeartbeater.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in 1a84480 Aug 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant