
roachtest: kv/contention/nodes=4 failed #94409

Closed
cockroach-teamcity opened this issue Dec 29, 2022 · 12 comments · Fixed by #94930

@cockroach-teamcity
Member

cockroach-teamcity commented Dec 29, 2022

roachtest.kv/contention/nodes=4 failed with artifacts on master @ 64e4fc9faa4e0ab19fe5ba78f053bc2b1390cb5e:

test artifacts and logs in: /artifacts/kv/contention/nodes=4/run_1
(cluster.go:1933).Run: output in run_075158.753543425_n5_workload_run_kv: ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned: COMMAND_PROBLEM: exit status 1
(monitor.go:127).Wait: monitor failure: monitor task failed: t.Fatal() was called

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-22877

Epic CRDB-18656

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 29, 2022
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Dec 29, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Dec 29, 2022
@cockroach-teamcity
Member Author

roachtest.kv/contention/nodes=4 failed with artifacts on master @ 2c3d75f1ce31024d7ffe530f91f22162c053abcd:

test artifacts and logs in: /artifacts/kv/contention/nodes=4/run_1
(cluster.go:1933).Run: output in run_074505.228160756_n5_workload_run_kv: ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned: COMMAND_PROBLEM: exit status 1
(monitor.go:127).Wait: monitor failure: monitor task failed: t.Fatal() was called

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@tbg
Member

tbg commented Jan 4, 2023

I221229 07:59:58.946686 895 workload/pgx_helpers.go:79  [-] 5  pgx logger [error]: Exec logParams=map[args:[6119450608480346262 dd -5780470643412804901 b1 3307412084736161884 9f -2912259416495554981 60] err:ERROR: duplicate key value violates unique constraint "kv_v_idx" (SQLSTATE 23505) pid:3080458 sql:kv-2 time:38.379989ms]
Error: ERROR: duplicate key value violates unique constraint "kv_v_idx" (SQLSTATE 23505)
run_075158.753543425_n5_workload_run_kv: 07:59:58 cluster.go:1955: > result: ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned: COMMAND_PROBLEM: exit status 1

@erikgrinaker
Contributor

This one seems legit. It's unclear to me whether the problem is in KV (txn isolation) or SQL (client misuse), but I'm tossing this one over to KV for immediate investigation. CC @nvanbenschoten.

@erikgrinaker erikgrinaker added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 4, 2023
@nvanbenschoten
Member

Thanks @erikgrinaker. This has reproduced twice in the past week, so it appears amenable to bisecting. I'll do that.

@nvanbenschoten
Member

This git bisect has been slow because of the high iteration time and variability of this failure. I've found that it dates back at least as far as 9d0ee35. I'll keep this thread updated with progress.

@nvanbenschoten
Member

I've now seen the failure as far back as a2d9a0ec98. Notably, that commit does not include 0483f41901, which is a relief.

@erikgrinaker
Contributor

Is it worth verifying that this isn't a problem with the workload? Have you tried running with an older workload version?

@nvanbenschoten
Member

Is it worth verifying that this isn't a problem with the workload?

I was suspicious of that at one point, but I've been using roachstress.sh, which rebuilds the workload on each iteration, so the bisect would also pick up changes in the workload.

Anyway, after a few misdiagnoses, I think the bisect finally landed on e61de15 (cc. @tbg). I'll keep running on the commit before this because I've been fooled a few times over the past few days, but we should dig into that commit.

One question I have about it is the removal of this line, which used to look like:

mb.state.LeaseAppliedIndex = cmd.LeaseIndex

Unless I'm missing something, we're no longer updating the LeaseAppliedIndex in the ephemeralReplicaAppBatch's ReplicaState while performing the dry run of entry application. This means entries that will be rejected due to prior entries in the same apply batch with higher MaxLeaseIndex (due to a raft reordering) won't be considered rejected during the early ack dry run. This could lead to raft proposals being acknowledged but never applying.
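
To make that failure mode concrete, here is a minimal, self-contained Go sketch of the early-ack dry run described above. All names (cmd, dryRunState, stage) are hypothetical and not the actual CockroachDB types; it only illustrates why skipping the LeaseAppliedIndex update during the dry run can acknowledge a proposal that will later be rejected.

package main

import "fmt"

// cmd stands in for a raft command; it only applies if its MaxLeaseIndex
// exceeds the lease applied index at application time.
type cmd struct {
    id            string
    maxLeaseIndex uint64
}

// dryRunState stands in for the ephemeral batch's view of the replica state.
type dryRunState struct {
    leaseAppliedIndex uint64
}

// stage mirrors the dry-run check: a command whose MaxLeaseIndex does not
// exceed the tracked lease applied index would be rejected during real
// application and must not be early-acked.
func (s *dryRunState) stage(c cmd) (rejected bool) {
    if c.maxLeaseIndex <= s.leaseAppliedIndex {
        return true
    }
    // This is the update the bisected commit dropped: without it, later
    // commands in the same batch are checked against a stale index.
    s.leaseAppliedIndex = c.maxLeaseIndex
    return false
}

func main() {
    // Raft reordering: the command with the higher MaxLeaseIndex lands first.
    batch := []cmd{{"a", 12}, {"b", 11}}
    s := &dryRunState{leaseAppliedIndex: 10}
    for _, c := range batch {
        fmt.Printf("%s rejected=%v\n", c.id, s.stage(c))
    }
    // With the update: a rejected=false, b rejected=true.
    // Without it, b also looks acceptable and could be early-acked despite
    // never applying.
}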

I don't yet see a way that this could lead to uniqueness constraint violations, but it feels like a bad issue, if real. One interesting part of this failure is that we're seeing "duplicate key value violates unique constraint" on kv_v_idx, a non-unique index. That's nonsensical. I'm guessing it means that we're getting a ConditionFailedError on a CPut while manipulating this non-unique secondary index, and that SQL is translating this to a uniqueness violation even though it is actually index corruption. Perhaps we expect a key-value to be present, but because it was lost, we detect the corruption and throw the confusing error.
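
To illustrate that guess, a purely hypothetical sketch (none of these names are CockroachDB APIs) of how a conditional put that expects an existing index entry could surface as the confusing "duplicate key" message when the entry was actually lost:

package main

import (
    "errors"
    "fmt"
)

var errConditionFailed = errors.New("condition failed: unexpected existing value")

// conditionalPut writes key=val only if the current value equals expected;
// a nil expected value means the key must be absent.
func conditionalPut(kv map[string]string, key, val string, expected *string) error {
    cur, ok := kv[key]
    switch {
    case expected == nil && ok:
        return errConditionFailed
    case expected != nil && (!ok || cur != *expected):
        return errConditionFailed
    }
    kv[key] = val
    return nil
}

func main() {
    idx := map[string]string{} // secondary index entries; the old entry was lost
    old := "old-entry"
    err := conditionalPut(idx, "/kv_v_idx/v1/k1", "new-entry", &old)
    if errors.Is(err, errConditionFailed) {
        // If the layer above reports every condition failure on index
        // maintenance as a duplicate key, a lost entry produces exactly
        // the misleading error seen in the logs, even for a non-unique index.
        fmt.Println(`duplicate key value violates unique constraint "kv_v_idx"`)
    }
}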

We might never need to connect all the dots (though it's fun to try!) if we can prove that this is the cause of the failure. The fix would look something like the following, paired with an investment in testing to address the fact that this wasn't quickly caught by a unit test.

diff --git a/pkg/kv/kvserver/replica_app_batch.go b/pkg/kv/kvserver/replica_app_batch.go
index 383fa0936c..dc153b5fca 100644
--- a/pkg/kv/kvserver/replica_app_batch.go
+++ b/pkg/kv/kvserver/replica_app_batch.go
@@ -765,6 +765,7 @@ func (mb *ephemeralReplicaAppBatch) Stage(
        )
        fr = replicaApplyTestingFilters(ctx, mb.r, cmd, fr)
        cmd.ForcedErrResult = fr
+       mb.state.LeaseAppliedIndex = cmd.LeaseIndex

        return cmd, nil
 }

I'll also add that #94165 paves the path to removing this "early ack" logic entirely. The optimization is less important if entry application in a raft ready iteration no longer waits for the raft log fsync. If we're finding that it's a burden to maintain this code as we refactor raft, we should explore whether we need it anymore.

@erikgrinaker
Contributor

Nice, thanks for the investigation @nvanbenschoten, and sorry for misrouting this to KV -- appreciate the effort.

@blathers-crl

blathers-crl bot commented Jan 7, 2023

cc @cockroachdb/replication

@erikgrinaker erikgrinaker removed the T-kv KV Team label Jan 7, 2023
tbg added a commit to tbg/cockroach that referenced this issue Jan 9, 2023
A bug was introduced recently[^1] which could lead to commands being
early-acked despite ultimately being rejected during command
application. The reason is that we previously updated the lease
applied index of the ephemeral batch but no longer did, meaning that
commands that would fail the MaxLeaseIndex check on the actual apply
batch could erroneously pass it on the ephemeral batch.

Actually, I believe that change only made an existing bug (or at least
a semantic incorrectness) more common: it removed the update
altogether, but the old code blindly updated the MaxLeaseIndex, which
also doesn't seem correct, since not all commands (in particular
leases) specify one in the first place. So conceivably, a command that
would be rejected could have reset the MaxLeaseIndex and then allowed
a later command that should also have been rejected to pass.

I think this was mostly theoretical in nature, since we only batch
"trivial" commands, but this PR changes the update to occur only for
non-rejected commands.

A regression test is added.

Thanks to Nathan for bisecting and diagnosis.

No release note because unreleased.

Fixes cockroachdb#94409.

[^1]: cockroachdb@e61de15#diff-0fd06523f1f485024aef0c7a11d3472945d5ac7cf228d6007b2475ccf6f44cd6L795

Epic: CRDB-220
Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 10, 2023
@cockroach-teamcity
Member Author

roachtest.kv/contention/nodes=4 failed with artifacts on master @ 73485e56768d61b14a5aa9b3f880613d7820edb9:

test artifacts and logs in: /artifacts/kv/contention/nodes=4/run_1
(cluster.go:1933).Run: output in run_074538.500984582_n5_workload_run_kv: ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned: COMMAND_PROBLEM: exit status 1
(monitor.go:127).Wait: monitor failure: monitor task failed: t.Fatal() was called

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!
