*log.safeError: txn_interceptor_heartbeat.go:429: txn status: %s, but heartbeat loop hasn't been signaled to stop... #34341

Closed
sentry-io bot opened this issue Jan 29, 2019 · 6 comments · Fixed by #35249
Labels
A-kv-server: Relating to the KV-level RPC server
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-2-temp-unavailability: Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

sentry-io bot commented Jan 29, 2019

Sentry Issue: COCKROACHDB-QW

*log.safeError: txn_interceptor_heartbeat.go:429: txn status: %s, but heartbeat loop hasn't been signaled to stop | roachpb.TransactionStatus
  File "github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeat.go", line 393, in heartbeatLoop
  File "github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeat.go", line 352, in func1
  File "github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go", line 324, in func1

txn_interceptor_heartbeat.go:429: txn status: %s, but heartbeat loop hasn't been signaled to stop | roachpb.TransactionStatus
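
For context, here is a minimal sketch (simplified, hypothetical types; not the actual CockroachDB source) of the invariant behind this error: once the transaction proto reaches a finalized status, the heartbeat loop must already have been signaled to stop.

```go
package main

import (
	"fmt"
	"log"
)

type txnStatus int

const (
	statusPending txnStatus = iota
	statusCommitted
	statusAborted
)

// heartbeater stands in for the heartbeat interceptor's state.
type heartbeater struct {
	status       txnStatus
	loopSignaled bool // set when the heartbeat loop has been told to stop
}

// checkInvariant mirrors the shape of the failed assertion: a finalized
// status while the loop has not been signaled to stop is treated as a bug.
func (h *heartbeater) checkInvariant() error {
	if h.status != statusPending && !h.loopSignaled {
		return fmt.Errorf("txn status: %d, but heartbeat loop hasn't been signaled to stop", h.status)
	}
	return nil
}

func main() {
	h := &heartbeater{status: statusAborted, loopSignaled: false}
	if err := h.checkInvariant(); err != nil {
		log.Fatal(err) // CockroachDB reports this as a fatal/safe error
	}
}
```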
knz (Contributor) commented Jan 29, 2019

cc @tbg @petermattis for triage

@knz added the C-bug and A-kv-server labels Jan 29, 2019
tbg (Member) commented Jan 29, 2019

cc @andreimatei, this looks like something you could tell us about.

@tbg added the S-2-temp-unavailability label Jan 29, 2019
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 29, 2019
roachpb: make Transaction implement SafeMessager

So it is included in Sentry reports.
Also improve a Fatal message that's been seen to fire for unknown
reasons (I haven't yet looked into why it fires).

Touches cockroachdb#34341

Release note: None
andreimatei (Contributor) commented:

I feel like the problem here is that we're not stopping the heartbeat loop after a txn gets an error from a batch that's not a commit or abort. We only stop the heartbeat when the final abort is sent, and that opens the possibility for this assertion to fire. But could it be that simple? Have we always had this bug, or did a refactoring introduce it?

andreimatei (Contributor) commented:

*not just any error, but an error that returns an aborted transaction record (or committed, for that matter), and that's not retriable... And maybe that's not supposed to happen?
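
A standalone sketch of the gap hypothesized in the two comments above (hypothetical names, not the actual TxnCoordSender code): the coordinator's txn proto can be flipped to ABORTED by a non-EndTransaction error response, while the heartbeat loop is only told to stop on an explicit commit or rollback, which is exactly the state the assertion rejects.

```go
package main

import "fmt"

// coordinator is a stand-in for the txn coordinator plus its heartbeat loop.
type coordinator struct {
	txnAborted       bool
	heartbeatStopped bool
}

// onBatchResponse applies a (simplified) batch result. Only an EndTransaction
// stops the heartbeat loop; an error carrying an aborted txn record does not.
func (c *coordinator) onBatchResponse(respAborted, isEndTxn bool) {
	if respAborted {
		c.txnAborted = true
	}
	if isEndTxn {
		c.heartbeatStopped = true
	}
}

func main() {
	c := &coordinator{}
	// A non-EndTransaction batch comes back carrying an aborted txn record.
	c.onBatchResponse(true /* respAborted */, false /* isEndTxn */)
	fmt.Printf("aborted=%t, heartbeat stopped=%t\n", c.txnAborted, c.heartbeatStopped)
	// Prints "aborted=true, heartbeat stopped=false": the inconsistent state
	// that txn_interceptor_heartbeat.go asserts against.
}
```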

andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 29, 2019
craig bot pushed a commit that referenced this issue Jan 29, 2019
34376: distsqlrun: Remove rpc.Context from FlowCtx r=asubiotto a=bdarnell

This field was deprecated and has been replaced with a nodeDialer.

Release note: None

34378: roachpb: make Transaction implement SafeMessager r=andreimatei a=andreimatei

So it is included in Sentry reports.
Also improve a Fatal message that's been seen to fire for unknown
reasons (I haven't yet looked into why it fires).

Touches #34341

Release note: None

Co-authored-by: Ben Darnell <ben@bendarnell.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jan 30, 2019
34379: release-2.1: roachpb: make Transaction implement SafeMessager r=andreimatei a=andreimatei

Backport 1/1 commits from #34378.

/cc @cockroachdb/release



Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
nvanbenschoten (Member) commented:

While hunting for #34025 I stumbled into a Cockroach configuration that makes this easier to reproduce (I've seen it twice in ~50 runs). I dropped TxnLivenessThreshold = 1 * base.DefaultHeartbeatInterval so that txn aborts were more likely. Any information you'd like me to grab, @andreimatei?
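
A hedged sketch of the repro tweak described above, assuming the 2019 source layout (the txnwait import path and the assignability of TxnLivenessThreshold are assumptions; the expression itself is quoted from the comment). Lowering the liveness threshold to a single heartbeat interval gives heartbeats almost no slack, so concurrent pushers abort transactions far more often.

```go
package main

import (
	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/storage/txnwait"
)

func init() {
	// Make txn aborts much more likely, as in the repro described above.
	txnwait.TxnLivenessThreshold = 1 * base.DefaultHeartbeatInterval
}

func main() {
	// Start the test cluster / workload under investigation here.
}
```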

tbg (Member) commented Feb 27, 2019

#34695 (comment)

andreimatei added a commit to andreimatei/cockroach that referenced this issue Feb 27, 2019
Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equipped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root were running along happily and all of a
sudden ingested one of these Aborted protos, that would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedError;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, ingesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and I
believe that has fixed the majority of crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError should also be entirely
possible)

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

(I believe all the issues above were really fixed by cockroachdb#35105 but this
patch makes it more convincing)

Release note: None
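
A minimal sketch of the guard this patch describes (simplified, hypothetical types; not the actual TxnCoordSender code): trailing TxnCoordMeta from a remote flow is dropped when it carries an ABORTED txn proto, so the Root coordinator never ingests an abort through this path.

```go
package main

import "fmt"

type txnStatus int

const (
	statusPending txnStatus = iota
	statusAborted
)

// txnCoordMeta stands in for the trailing metadata a remote DistSQL flow returns.
type txnCoordMeta struct {
	status    txnStatus
	readSpans []string
}

// rootCoordinator stands in for the Root TxnCoordSender.
type rootCoordinator struct {
	readSpans []string
}

// augmentMeta mirrors the behavior described above: metadata with an ABORTED
// txn proto is ignored rather than ingested; everything else updates the
// coordinator's read spans.
func (rc *rootCoordinator) augmentMeta(meta txnCoordMeta) {
	if meta.status == statusAborted {
		// The TCS expects to learn about aborts via TxnAbortedError or its own
		// heartbeat loop, not via remote TxnCoordMeta; drop it.
		return
	}
	rc.readSpans = append(rc.readSpans, meta.readSpans...)
}

func main() {
	rc := &rootCoordinator{}
	rc.augmentMeta(txnCoordMeta{status: statusAborted, readSpans: []string{"a-b"}})
	rc.augmentMeta(txnCoordMeta{status: statusPending, readSpans: []string{"c-d"}})
	fmt.Println(rc.readSpans) // [c-d]: only the non-aborted metadata was ingested
}
```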
andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 21, 2019
craig bot pushed a commit that referenced this issue Mar 21, 2019
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei

Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equipped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root were running along happily and all of a
sudden ingested one of these Aborted protos, that would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedError;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, ingesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since #35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored, and
that has fixed some of the crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and lead to crashes; in particular, we've
seen RPC errors).

Fixes #34695
Fixes #34341
Fixes #33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
craig bot closed this as completed in #35249 Mar 21, 2019
andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 21, 2019