Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: fix bug in ephemeral app batch #94930

Merged
merged 1 commit into from
Jan 12, 2023

Conversation

tbg
Copy link
Member

@tbg tbg commented Jan 9, 2023

A bug was introduced recently1 which could lead to commands being
early-acked despite ultimately being rejected during command
application. The reason for this was that we previously updated
the lease applied index of the ephemeral batch and didn't any more,
meaning that commands that would fail the MaxLeaseIndex check on the
actual apply batch could erroneously pass it on the ephemeral batch.

Actually, I believe that change only made an existing bug (or at least
semantic incorrectness) more common: it was removing that update
altogether, but the old code blindly updated the MaxLeaseIndex, which
also doesn't seem correct as not all commands (in particular leases)
specify it in the first place. So conceivably, a command that would
be rejected could have reset the MaxLeaseIndex and then allowed a
later command that should also have been rejected to pass.

I think this was mostly theoretical in nature since we only batch
"trivial" commands but this PR improves the update to only occur
for non-rejected commands.

A regression test is added.

Thanks to Nathan for bisecting and diagnosis.

No release note because unreleased.

Fixes #94409.

Epic: CRDB-220
Release note: None

Footnotes

  1. https://github.com/cockroachdb/cockroach/commit/e61de155017a1821d08ba0a414967927adabe0eb#diff-0fd06523f1f485024aef0c7a11d3472945d5ac7cf228d6007b2475ccf6f44cd6L795

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg marked this pull request as ready for review January 9, 2023 16:07
@tbg tbg requested a review from a team January 9, 2023 16:07
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Actually, I believe that change only made an existing bug (or at least
semantic incorrectness) more common: it was removing that update
altogether, but the old code blindly updated the MaxLeaseIndex, which
also doesn't seem correct as not all commands (in particular leases)
specify it in the first place. So conceivably, a command that would
be rejected could have reset the MaxLeaseIndex and then allowed a
later command that should also have been rejected to pass.

I think this was mostly theoretical in nature since we only batch
"trivial" commands

Right, we were only batching trivial commands that could not affect the outcome of later commands (other than through the lease applied index check). Admin operations like splits and especially lease requests have more complicated side effects, so we never consider them or entries later than them for early ack-ing. In fact, we never batch them at all.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pavelkalinnikov and @tbg)


pkg/kv/kvserver/replica_app_batch.go line 768 at r1 (raw file):

	fr = replicaApplyTestingFilters(ctx, mb.r, cmd, fr)
	cmd.ForcedErrResult = fr
	if !cmd.Rejected() && cmd.LeaseIndex > mb.state.LeaseAppliedIndex {

Should we change this to exactly mirror the logic in replicaAppBatch.stageTrivialReplicatedEvalResult?

A bug was introduced recently[^1] which could lead to commands being
early-acked despite ultimately being rejected during command
application. The reason for this was that we previously updated
the lease applied index of the ephemeral batch and didn't any more,
meaning that commands that would fail the MaxLeaseIndex check on the
actual apply batch could erroneously pass it on the ephemeral batch.

Actually, I believe that change only made an existing bug (or at least
semantic incorrectness) more common: it was removing that update
altogether, but the old code blindly updated the MaxLeaseIndex, which
also doesn't seem correct as not all commands (in particular leases)
specify it in the first place. So conceivably, a command that would
be rejected could have reset the MaxLeaseIndex and then allowed a
later command that should also have been rejected to pass.

I think this was mostly theoretical in nature since we only batch
"trivial" commands but this PR improves the update to only occur
for non-rejected commands.

A regression test is added.

Thanks to Nathan for bisecting and diagnosis.

No release note because unreleased.

Fixes cockroachdb#94409.

[^1]: cockroachdb@e61de15#diff-0fd06523f1f485024aef0c7a11d3472945d5ac7cf228d6007b2475ccf6f44cd6L795

Epic: CRDB-220
Release note: None
@tbg
Copy link
Member Author

tbg commented Jan 10, 2023

Should we change this to exactly mirror the logic in replicaAppBatch.stageTrivialReplicatedEvalResult?

Let me know if you see a nice way of doing that and I'll send a follow-up, but this didn't seem trivial to me. The "real" code relies on the command having been cleared if it has been rejected but the ephemeral batch here is sort of weird already in that it mutates the command (assigning the results of the forced err check), but it seems even weirder if it cleared it out (maybe I'm wrong?).

TFTR!

bors r=nvanbenschoten

@craig
Copy link
Contributor

craig bot commented Jan 10, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 10, 2023

This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried.

Additional information:

@tbg
Copy link
Member Author

tbg commented Jan 11, 2023

bors r+

@erikgrinaker
Copy link
Contributor

bors single on

@rail
Copy link
Member

rail commented Jan 11, 2023

bors r-

@craig
Copy link
Contributor

craig bot commented Jan 11, 2023

Canceled.

@rail
Copy link
Member

rail commented Jan 11, 2023

bors r+ single off

@craig
Copy link
Contributor

craig bot commented Jan 11, 2023

This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried.

Additional information:

@rail
Copy link
Member

rail commented Jan 11, 2023

bors r-
We have temporarily disabled bors for merges. Please merge the PR manually when all required checks pass.

1 similar comment
@rail
Copy link
Member

rail commented Jan 11, 2023

bors r-
We have temporarily disabled bors for merges. Please merge the PR manually when all required checks pass.

@nvanbenschoten nvanbenschoten merged commit bd3e876 into cockroachdb:master Jan 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: kv/contention/nodes=4 failed
5 participants