-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: kv/contention/nodes=4 failed #94409
Comments
roachtest.kv/contention/nodes=4 failed with artifacts on master @ 2c3d75f1ce31024d7ffe530f91f22162c053abcd:
Parameters: |
This comment was marked as duplicate.
This comment was marked as duplicate.
1 similar comment
|
This one seems legit. It's unclear to me whether the problem is in KV (txn isolation) or SQL (client misuse), but I'm tossing this one over to KV for immediate investigation. CC @nvanbenschoten. |
Thanks @erikgrinaker. This has reproduced twice in the past week, so it appears amenable to bisecting. I'll do that. |
This git bisect has been slow because of the high iteration time and variability of this failure. I've found that it dates back at least as far as 9d0ee35. I'll keep this thread updated with progress. |
I've now seen the failure as far back as a2d9a0ec98. Notably, that commit does not include 0483f41901, which is a relief. |
Is it worth verifying that this isn't a problem with the workload? Have you tried running with an older workload version? |
I was suspicious of that at one point, but I've been using Anyway, after a few misdiagnoses, I think the bisect finally landed on e61de15 (cc. @tbg). I'll keep running on the commit before this because I've been fooled a few times over the past few days, but we should dig into that commit. One question I have about it is the removal of this line, which used to look like:
Unless I'm missing something, we're no longer updating the I don't yet see a way that this could lead to uniqueness constraint violations, but it feels like a bad issue, if real. One interesting part of this failure is that we're seeing We might never need to connect all the dots (though it's fun to try!) if we can prove that this is the cause of the failure. The fix would look something like that following, paired with an investment in testing to correct the fact that this wasn't quickly caught by a unit test. diff --git a/pkg/kv/kvserver/replica_app_batch.go b/pkg/kv/kvserver/replica_app_batch.go
index 383fa0936c..dc153b5fca 100644
--- a/pkg/kv/kvserver/replica_app_batch.go
+++ b/pkg/kv/kvserver/replica_app_batch.go
@@ -765,6 +765,7 @@ func (mb *ephemeralReplicaAppBatch) Stage(
)
fr = replicaApplyTestingFilters(ctx, mb.r, cmd, fr)
cmd.ForcedErrResult = fr
+ mb.state.LeaseAppliedIndex = cmd.LeaseIndex
return cmd, nil
} I'll also add that #94165 paves the path to removing this "early ack" logic entirely. The optimization is less important if entry application in a raft ready iteration no longer waits for the raft log fsync. If we're finding that it's a burden to maintain this code as we refactor raft, we should explore whether we need it anymore. |
Nice, thanks for the investigation @nvanbenschoten, and sorry for misrouting this to KV -- appreciate the effort. |
cc @cockroachdb/replication |
A bug was introduced recently[^1] which could lead to commands being early-acked despite ultimately being rejected during command application. The reason for this was that we previously updated the lease applied index of the ephemeral batch and didn't any more, meaning that commands that would fail the MaxLeaseIndex check on the actual apply batch could erroneously pass it on the ephemeral batch. Actually, I believe that change only made an existing bug (or at least semantic incorrectness) more common: it was removing that update altogether, but the old code blindly updated the MaxLeaseIndex, which also doesn't seem correct as not all commands (in particular leases) specify it in the first place. So conceivably, a command that would be rejected could have reset the MaxLeaseIndex and then allowed a later command that should also have been rejected to pass. I think this was mostly theoretical in nature since we only batch "trivial" commands but this PR improves the update to only occur for non-rejected commands. A regression test is added. Thanks to Nathan for bisecting and diagnosis. No release note because unreleased. Fixes cockroachdb#94409. [^1]: cockroachdb@e61de15#diff-0fd06523f1f485024aef0c7a11d3472945d5ac7cf228d6007b2475ccf6f44cd6L795 Epic: CRDB-220 Release note: None
A bug was introduced recently[^1] which could lead to commands being early-acked despite ultimately being rejected during command application. The reason for this was that we previously updated the lease applied index of the ephemeral batch and didn't any more, meaning that commands that would fail the MaxLeaseIndex check on the actual apply batch could erroneously pass it on the ephemeral batch. Actually, I believe that change only made an existing bug (or at least semantic incorrectness) more common: it was removing that update altogether, but the old code blindly updated the MaxLeaseIndex, which also doesn't seem correct as not all commands (in particular leases) specify it in the first place. So conceivably, a command that would be rejected could have reset the MaxLeaseIndex and then allowed a later command that should also have been rejected to pass. I think this was mostly theoretical in nature since we only batch "trivial" commands but this PR improves the update to only occur for non-rejected commands. A regression test is added. Thanks to Nathan for bisecting and diagnosis. No release note because unreleased. Fixes cockroachdb#94409. [^1]: cockroachdb@e61de15#diff-0fd06523f1f485024aef0c7a11d3472945d5ac7cf228d6007b2475ccf6f44cd6L795 Epic: CRDB-220 Release note: None
roachtest.kv/contention/nodes=4 failed with artifacts on master @ 73485e56768d61b14a5aa9b3f880613d7820edb9:
Parameters: |
roachtest.kv/contention/nodes=4 failed with artifacts on master @ 64e4fc9faa4e0ab19fe5ba78f053bc2b1390cb5e:
Parameters:
ROACHTEST_cloud=gce
,ROACHTEST_cpu=4
,ROACHTEST_encrypted=false
,ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-22877
Epic CRDB-18656
The text was updated successfully, but these errors were encountered: