
storage: don't apply local results if cmd processing failed #32166

Merged
merged 2 commits into cockroachdb:master from jepsen-fix on Nov 12, 2018

Conversation

andreimatei
Contributor

This patch fixes a bug with the application of LocalResults: they were
applied even in cases when the processing of the respective command
failed. So, we were evaluating a batch, proposing a command, failing to
apply the command (in particular because of the lease index check), but
then still applying the LocalResult even though the replicated state was
not updated.
This bug had catastrophic consequences: part of the LocalResult is a
list of UpdatedTxn - this is used to ping the txn wait queue and unblock
waiters, communicating a final txn disposition to them. Now imagine an
EndTxn batch that evaluates to a command that fails the lease check. The
respective LocalResult is populated with our txn, updated to the
COMMITTED disposition. The command fails to process, but we still ping
the wait queue telling all the waiters that the transaction committed.
The waiters, if any, now go and send ResolveIntent requests, which
actually commit intents (thinking that the intent's txn committed). And
so we end up with "partially committed transactions" - some of the
writes from a txn are committed (the ones that happened to have waiters
on their intents), while others aren't.
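
The gist of the fix, as a minimal sketch in Go (names, types, and structure are hypothetical, not the actual processRaftCommand code): only hand the LocalResult off once the command has actually applied.

```go
package sketch

// Hypothetical sketch of the control flow this patch enforces (not the actual
// CockroachDB code): the local result is only handled if the command actually
// applied to the replicated state.

type localResult struct {
	updatedTxns []string // txns whose final disposition would be signalled to the txn wait queue
}

type command struct {
	local *localResult
}

type replica struct{}

func (r *replica) checkCanApply(c *command) error { return nil } // e.g. the lease index check
func (r *replica) applyToStateMachine(c *command) {}
func (r *replica) signalWaitQueue(l *localResult) {}

func (r *replica) processCommand(c *command) {
	if err := r.checkCanApply(c); err != nil {
		// The command was rejected: drop the local result rather than applying
		// it, so waiters in the txn wait queue are not told the txn committed.
		c.local = nil
		return
	}
	r.applyToStateMachine(c)
	if c.local != nil {
		r.signalWaitQueue(c.local) // only after successful application
	}
}
```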

In order for this bug to manifest itself, we need:

  • read/write contention on some keys
  • a command to fail processing. Generally this happens either because of
    the lease index check or because of the lease check; so either a
    Raft leadership change or a lease change can potentially cause the
    badness. In the case of a Raft leadership change, proposals can get
    dropped, which leads to reproposals, which seem to frequently(?) end up
    processing at the wrong lease index and thus be rejected (and it's the
    reproposal processing that triggers the bug). This kind of situation is
    triggered by the Jepsen register test, with various nemeses. The lease
    check failure can probably be triggered simply with lease transfers
    (e.g. lease rebalancing).

Interestingly, the rate of occurrence of this bug probably changed
between 2.0 and 2.1 because of the introduction, in 2.1, of the
QueryIntent request (and also, separately, because of increased usage of
lease transfers; this claim was not verified though). Speaking of the
Raft leadership change scenario, once the reproposal fails to be applied
(but the wait queue is erroneously signalled), we also re-evaluate the
EndTransaction batch. Unless something else goes wrong, in 2.0 this
re-evaluation was likely succeeding. In 2.1, however, it tends to fail
if there was a waiter on our txn because the EndTxn requests are usually
bundled with QueryIntent requests - and these QueryIntents fail during
re-evaluation because they don't find the intent - it was, by then,
erroneously committed by a waiter through a ResolveIntent. Except for
transactions that wrote to multiple ranges: I think for those the
QueryIntent requests are split off from the EndTransaction request by
the DistSender, and so we're back to the situation in 2.0.

Fixes #30792

Release note (bug fix): Fix a bug causing transactions to appear to be
partially committed. CRDB was sometimes claiming to have failed to
commit a transaction but some (or all) of its writes were actually
persisted.

@andreimatei andreimatei requested a review from a team November 6, 2018 20:29
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@bdarnell bdarnell left a comment


Reviewed 2 of 2 files at r1, 3 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_proposal.go, line 89 at r1 (raw file):

	// TODO(andrei): ProposalData.Local is a pointer, but it's always assigned
	// from &batcheval.result.Result.Local, which is not a pointer. So why is this
	// a pointer?

I think it was a pointer just so it could be nil for commands that were not evaluated here, but I'm not sure whether that's actually useful.
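
For illustration, the pattern under discussion, with hypothetical field contents (only the ProposalData.Local field and the LocalResult name come from the actual code):

```go
package sketch

// Hypothetical sketch of the design point: a pointer field can be nil for
// proposals that were never evaluated on this replica, whereas a value field
// would always be present, even if only zero-valued.
type LocalResult struct {
	UpdatedTxns []string // illustrative; txns to signal to the txn wait queue
}

type ProposalData struct {
	// Local stays nil for commands not evaluated locally; expressing that
	// "absent" state is the only thing the pointer buys over embedding the
	// value directly.
	Local *LocalResult
}
```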


pkg/storage/replica_test.go, line 8785 at r2 (raw file):

		t.Fatal(err)
	}
	formatted := tracing.FormatRecordedSpans(collect())

I'm not a fan of tracing/logging based tests. The bug here has observable effects and we should try to use those instead of relying on what gets traced. We should set up a txn wait queue and ensure that a waiting txn is not unblocked by the failed commit attempt.

Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_proposal.go, line 89 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think it was a pointer just so it could be nil for commands that were not evaluated here, but I'm not sure whether that's actually useful.

hmm yeah good point. I'll take another look at whether I think this makes sense.


pkg/storage/replica_test.go, line 8785 at r2 (raw file):

I'm not a fan of tracing/logging based tests.

Why not? I think they're a useful tool, particularly when testing for the absence of some action. As a beneficial side effect, I conjecture that they lead to better verbose logging too.

The bug here has observable effects and we should try to use those instead of relying on what gets traced. We should set up a txn wait queue and ensure that a waiting txn is not unblocked by the failed commit attempt.

I agree, and I intend to write a test that involves a reader, in addition to the test I've written which (although a bit unconventional and arguably a tad fragile) is more general - it checks something about all local effects.
I haven't yet because writing that test is not trivial, I think. I'll probably need a new testing hook for trapping a reader being blocked in the txn queue, and then it's unclear how to test that the reader is not woken up at the wrong time (perhaps you have a suggestion). My vague plan was to test for the value ultimately read by the reader.

Contributor

@bdarnell bdarnell left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_test.go, line 8785 at r2 (raw file):
Yeah, a test that observes the side effect is harder to write so I'm fine with this for now.

As a beneficial side effect, I conjecture that they lead to better verbose logging too.

They lead to more verbose logging, but I'm not sure that's beneficial.

The problem with log-based tests is that they're not necessarily testing what was intended. We want a regression test to ensure that #30792 never recurs. What we have is a test that the LocalResult is nil at the time that a certain log line is reached. A refactoring could easily change this assumption (for example, if we started using a "detach" pattern to hand these updated txns off earlier). The assertion only looks for lResult.String() in the log message; it's not even doing anything to guarantee that we're matching against the log line in processRaftCommand.
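
For concreteness, the assertion style in question reduces to something like the following sketch (illustrative only, not the actual test code): a substring check over the whole formatted trace, with nothing pinning it to the processRaftCommand log line.

```go
package storage_test

import "strings"

// traceMentions is an illustrative stand-in for the assertion discussed above:
// `formatted` would be the output of tracing.FormatRecordedSpans, and the
// test's pass/fail hinges on whether lResult.String() appears anywhere in it,
// i.e. a substring match over every recorded log line.
func traceMentions(formatted, localResultStr string) bool {
	return strings.Contains(formatted, localResultStr)
}
```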

@bdarnell
Contributor

Go ahead and merge this and backport to the release branches (both 2.0 and 2.1, I think)

... in case of rejected command.

Release note: None
Contributor Author

@andreimatei andreimatei left a comment


bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_proposal.go, line 89 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

hmm yeah good point. I'll take another look at whether I think this makes sense.

removed TODO and incorporated your explanation in the comment

craig bot pushed a commit that referenced this pull request Nov 12, 2018
32166: storage: don't apply local results if cmd processing failed r=andreimatei a=andreimatei


Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig
Contributor

craig bot commented Nov 12, 2018

Build succeeded

@craig craig bot merged commit 0a405c4 into cockroachdb:master Nov 12, 2018
@andreimatei andreimatei deleted the jepsen-fix branch November 13, 2018 17:32