*log.safeError: txn_interceptor_heartbeat.go:429: txn status: %s, but heartbeat loop hasn't been signaled to stop... #34341

Closed
sentry-io bot opened this issue Jan 29, 2019 · 6 comments · Fixed by #35249
Labels
A-kv-server: Relating to the KV-level RPC server
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-2-temp-unavailability: Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

sentry-io bot commented Jan 29, 2019

Sentry Issue: COCKROACHDB-QW

*log.safeError: txn_interceptor_heartbeat.go:429: txn status: %s, but heartbeat loop hasn't been signaled to stop | roachpb.TransactionStatus
  File "github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeat.go", line 393, in heartbeatLoop
  File "github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeat.go", line 352, in func1
  File "github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go", line 324, in func1

txn_interceptor_heartbeat.go:429: txn status: %s, but heartbeat loop hasn't been signaled to stop | roachpb.TransactionStatus
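
For context, here is a minimal sketch (simplified, hypothetical types; not the actual CockroachDB source) of the invariant behind this error: once the transaction proto reaches a finalized status, the heartbeat loop must already have been signaled to stop.

```go
package main

import (
	"fmt"
	"log"
)

type txnStatus int

const (
	statusPending txnStatus = iota
	statusCommitted
	statusAborted
)

// heartbeater stands in for the heartbeat interceptor's state.
type heartbeater struct {
	status       txnStatus
	loopSignaled bool // set when the heartbeat loop has been told to stop
}

// checkInvariant mirrors the shape of the failed assertion: a finalized
// status while the loop has not been signaled to stop is treated as a bug.
func (h *heartbeater) checkInvariant() error {
	if h.status != statusPending && !h.loopSignaled {
		return fmt.Errorf("txn status: %d, but heartbeat loop hasn't been signaled to stop", h.status)
	}
	return nil
}

func main() {
	h := &heartbeater{status: statusAborted, loopSignaled: false}
	if err := h.checkInvariant(); err != nil {
		log.Fatal(err) // CockroachDB reports this as a fatal/safe error
	}
}
```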
knz (Contributor) commented Jan 29, 2019

cc @tbg @petermattis for triage

@knz added the C-bug and A-kv-server labels Jan 29, 2019
tbg (Member) commented Jan 29, 2019

cc @andreimatei, this looks like something you could tell us about.

@tbg added the S-2-temp-unavailability label Jan 29, 2019
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 29, 2019
roachpb: make Transaction implement SafeMessager

So it is included in Sentry reports.
Also improve a Fatal message that's been seen to fire for unknown
reasons (I haven't yet looked into why it fires).

Touches cockroachdb#34341

Release note: None
andreimatei (Contributor) commented:

I feel like the problem here is that we're not stopping the heartbeat loop after a txn gets an error from a batch that's not a commit or abort. We only stop the heartbeat when the final abort is sent, and that opens the possibility for this assertion to fire. But could it be that simple? Have we always had this bug, or did a refactoring introduce it?

andreimatei (Contributor) commented:

*not just any error, but an error that returns an aborted transaction record (or committed, for that matter), and that's not retriable... And maybe that's not supposed to happen?
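
A standalone sketch of the gap hypothesized in the two comments above (hypothetical names, not the actual TxnCoordSender code): the coordinator's txn proto can be flipped to ABORTED by a non-EndTransaction error response, while the heartbeat loop is only told to stop on an explicit commit or rollback, which is exactly the state the assertion rejects.

```go
package main

import "fmt"

// coordinator is a stand-in for the txn coordinator plus its heartbeat loop.
type coordinator struct {
	txnAborted       bool
	heartbeatStopped bool
}

// onBatchResponse applies a (simplified) batch result. Only an EndTransaction
// stops the heartbeat loop; an error carrying an aborted txn record does not.
func (c *coordinator) onBatchResponse(respAborted, isEndTxn bool) {
	if respAborted {
		c.txnAborted = true
	}
	if isEndTxn {
		c.heartbeatStopped = true
	}
}

func main() {
	c := &coordinator{}
	// A non-EndTransaction batch comes back carrying an aborted txn record.
	c.onBatchResponse(true /* respAborted */, false /* isEndTxn */)
	fmt.Printf("aborted=%t, heartbeat stopped=%t\n", c.txnAborted, c.heartbeatStopped)
	// Prints "aborted=true, heartbeat stopped=false": the inconsistent state
	// that txn_interceptor_heartbeat.go asserts against.
}
```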

andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 29, 2019
craig bot pushed a commit that referenced this issue Jan 29, 2019
34376: distsqlrun: Remove rpc.Context from FlowCtx r=asubiotto a=bdarnell

This field was deprecated and has been replaced with a nodeDialer.

Release note: None

34378: roachpb: make Transaction implement SafeMessager r=andreimatei a=andreimatei

So it is included in Sentry reports.
Also improve a Fatal message that's been seen to fire for unknown
reasons (I haven't yet looked into why it fires).

Touches #34341

Release note: None

Co-authored-by: Ben Darnell <ben@bendarnell.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jan 30, 2019
34379: release-2.1: roachpb: make Transaction implement SafeMessager r=andreimatei a=andreimatei

Backport 1/1 commits from #34378.

/cc @cockroachdb/release



Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
nvanbenschoten (Member) commented:

While hunting for #34025 I stumbled into a Cockroach configuration that makes this easier to reproduce (I've seen it twice in ~50 runs). I dropped TxnLivenessThreshold = 1 * base.DefaultHeartbeatInterval so that txn aborts were more likely. Any information you'd like me to grab, @andreimatei?
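
A hedged sketch of the repro tweak described above, assuming the 2019 source layout (the txnwait import path and the assignability of TxnLivenessThreshold are assumptions; the expression itself is quoted from the comment). Lowering the liveness threshold to a single heartbeat interval gives heartbeats almost no slack, so concurrent pushers abort transactions far more often.

```go
package main

import (
	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/storage/txnwait"
)

func init() {
	// Make txn aborts much more likely, as in the repro described above.
	txnwait.TxnLivenessThreshold = 1 * base.DefaultHeartbeatInterval
}

func main() {
	// Start the test cluster / workload under investigation here.
}
```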

tbg (Member) commented Feb 27, 2019

#34695 (comment)

andreimatei added a commit to andreimatei/cockroach that referenced this issue Feb 27, 2019
Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equipped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root were running along happily and all of a
sudden ingested one of these Aborted protos, that would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedError;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, ingesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and I
believe that has fixed the majority of crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError should also be entirely
possible)

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

(I believe all the issues above were really fixed by cockroachdb#35105 but this
patch makes it more convincing)

Release note: None
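
A minimal sketch of the guard this patch describes (simplified, hypothetical types; not the actual TxnCoordSender code): trailing TxnCoordMeta from a remote flow is dropped when it carries an ABORTED txn proto, so the Root coordinator never ingests an abort through this path.

```go
package main

import "fmt"

type txnStatus int

const (
	statusPending txnStatus = iota
	statusAborted
)

// txnCoordMeta stands in for the trailing metadata a remote DistSQL flow returns.
type txnCoordMeta struct {
	status    txnStatus
	readSpans []string
}

// rootCoordinator stands in for the Root TxnCoordSender.
type rootCoordinator struct {
	readSpans []string
}

// augmentMeta mirrors the behavior described above: metadata with an ABORTED
// txn proto is ignored rather than ingested; everything else updates the
// coordinator's read spans.
func (rc *rootCoordinator) augmentMeta(meta txnCoordMeta) {
	if meta.status == statusAborted {
		// The TCS expects to learn about aborts via TxnAbortedError or its own
		// heartbeat loop, not via remote TxnCoordMeta; drop it.
		return
	}
	rc.readSpans = append(rc.readSpans, meta.readSpans...)
}

func main() {
	rc := &rootCoordinator{}
	rc.augmentMeta(txnCoordMeta{status: statusAborted, readSpans: []string{"a-b"}})
	rc.augmentMeta(txnCoordMeta{status: statusPending, readSpans: []string{"c-d"}})
	fmt.Println(rc.readSpans) // [c-d]: only the non-aborted metadata was ingested
}
```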
andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 21, 2019
craig bot pushed a commit that referenced this issue Mar 21, 2019
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei

Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equipped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root were running along happily and all of a
sudden ingested one of these Aborted protos, that would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedError;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, ingesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since #35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored, and
that has fixed some of the crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and lead to crashes; in particular, we've
seen RPC errors).

Fixes #34695
Fixes #34341
Fixes #33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
craig bot closed this as completed in #35249 Mar 21, 2019
andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 21, 2019