Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestStoreRangeSplitRaceUninitializedRHS failed #66480

Closed
cockroach-teamcity opened this issue Jun 15, 2021 · 4 comments · Fixed by #98715
Closed

kv/kvserver: TestStoreRangeSplitRaceUninitializedRHS failed #66480

cockroach-teamcity opened this issue Jun 15, 2021 · 4 comments · Fixed by #98715
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot. S-1 High impact: many users impacted, serious risk of high unavailability or data loss skipped-test T-kv KV Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jun 15, 2021

kv/kvserver.TestStoreRangeSplitRaceUninitializedRHS failed with artifacts on master @ ba16357c4968ec68b2d0afda7862331f22e14e8c:

=== RUN   TestStoreRangeSplitRaceUninitializedRHS
    test_log_scope.go:73: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeSplitRaceUninitializedRHS222619826
    test_log_scope.go:74: use -show-logs to present logs inline
    testcluster.go:335: pq: setting updated but timed out waiting to read new value
    panic.go:636: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeSplitRaceUninitializedRHS222619826
W210615 14:20:48.009869 977027 server/node.go:783  [n1,summaries] 13  health alerts detected: {Alerts:[{StoreID:1 Category:METRICS Description:ranges.underreplicated Value:37}]}
--- FAIL: TestStoreRangeSplitRaceUninitializedRHS (20.95s)
Reproduce

To reproduce, try:

make stressrace TESTS=TestStoreRangeSplitRaceUninitializedRHS PKG=./pkg/kv/kvserver TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:

  • GOFLAGS=-json

/cc @cockroachdb/kv tbg

This test on roachdash | Improve this report!

Jira issue: CRDB-8060

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jun 15, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@tbg
Copy link
Member

tbg commented Jun 22, 2021

W210615 14:20:47.827240 980525 sql/set_cluster_setting.go:322 ⋮ [n1,client=‹127.0.0.1:38682›,hostssl,user=root] 115 SET CLUSTER SETTING ‹"kv.range_merge.queue_enabled"› timed out waiting for value ‹"false"›, observed ‹"true"›

if _, ok := n.setting.(*settings.VersionSetting); ok && n.value == nil {
// The "version" setting doesn't have a well defined "default" since it
// is set in a startup migration.
return nil
}
errNotReady := errors.New("setting updated but timed out waiting to read new value")
var observed string
err := retry.ForDuration(10*time.Second, func() error {
observed = n.setting.Encoded(&execCfg.Settings.SV)
if observed != expectedEncodedValue {
return errNotReady
}
return nil
})
if err != nil {
log.Warningf(
params.ctx, "SET CLUSTER SETTING %q timed out waiting for value %q, observed %q",
n.name, expectedEncodedValue, observed,
)
}
return err
}

Maybe the 10s just wasn't enough for gossip propagation, this is a race build.

@tbg
Copy link
Member

tbg commented Jun 22, 2021

This is from StartTestCluster btw, so nothing to do with the test really, except perhaps the fact that this test sets aggressive raft cfgs:

RaftConfig: base.RaftConfig{
// An aggressive tick interval lets groups communicate more and thus
// triggers test failures much more reliably. We can't go too aggressive
// or race tests never make any progress.
RaftTickInterval: 100 * time.Millisecond,
RaftElectionTimeoutTicks: 2,
RaftHeartbeatIntervalTicks: 1,

The logs look clean though, so it looks like the cluster is healthy. Not sure. Likely a fluke.

@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestStoreRangeSplitRaceUninitializedRHS failed with artifacts on master @ 40b50c952aaa90e0ac565c859c1076a77c9a64be:

=== RUN   TestStoreRangeSplitRaceUninitializedRHS
    test_log_scope.go:73: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeSplitRaceUninitializedRHS849579772
    test_log_scope.go:74: use -show-logs to present logs inline
    testcluster.go:334: pq: setting updated but timed out waiting to read new value
    panic.go:632: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeSplitRaceUninitializedRHS849579772
--- FAIL: TestStoreRangeSplitRaceUninitializedRHS (18.98s)
Reproduce

To reproduce, try:

make stressrace TESTS=TestStoreRangeSplitRaceUninitializedRHS PKG=./pkg/kv/kvserver TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:

  • GOFLAGS=-json

/cc @cockroachdb/kv tbg

This test on roachdash | Improve this report!

irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 30, 2021
craig bot pushed a commit that referenced this issue Jun 30, 2021
67082: ccl,kv: skip a few flakey tests/sub-tests r=irfansharif a=irfansharif

Relevant test failure issues: #66991, #67034, #66942, #66480, #64313.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
@tbg tbg added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Feb 1, 2022
@kvoli
Copy link
Collaborator

kvoli commented Mar 15, 2023

The test is still flaky when running with --stress, failing reliably in under 30 runs. The issue as far as I can tell is that the test sends MsgVote requests to the RHS side of the split every microsecond, until the split completes without error. This can cause the send queue to fill up and RaftTransport().SendAsync() to return false, which fails the test due to this test assertion.

Logs

W230315 19:25:01.200371 2888 kv/kvserver/raft_transport.go:542 ⋮ [T1,n2] 449  raft send queue to n1 is full
client_split_test.go:2149: transport failed to send vote request

Decreasing the frequency of MsgVote requests to 4 microseconds makes the test pass reliably, however I'm not certain if this is because we no longer see the race condition, which we want to test.

I think I may need more expertise to say whether this is expected.

craig bot pushed a commit that referenced this issue Mar 16, 2023
98601: ui: add PGCode to admin api errors r=j82w a=j82w

Adding the PGCode makes it easy to identify
the underlying issue of why the request failed.

old error:
```
RequestError: An internal server error has occurred.
Please check your CockroachDB logs for more details.
```

new error
```
RequestError: An internal server error has occurred. 
Please check your CockroachDB logs for more details. Error Code: 53200
```

<img width="2058" alt="Screenshot 2023-03-14 at 2 51 57 PM" src="https://user-images.githubusercontent.com/8868107/225108075-5db442da-699f-487b-852c-732eb030f644.png">


Epic: none
Closes: #98596

Release note: none

98715: kvserver: deflake and unskip split race uninit rhs r=nvanbenschoten a=kvoli

`TestStoreRangeSplitRaceUninitializedRHS` was skipped some time ago, in mid 2021 (#67082). The test was skipped due to flakes that appeared related to untimely test cluster startup.

This commit unskips the test and makes minor adjustments in order to be current with semantics of Raft transport. Without these adjustments, the `MsgVote` sent every microsecond with the intention of triggering a race, would completely fill up  the Raft transport send queue. Once the queue was full, the test would fail as requests are dropped.

This commit updates the `MsgVote` send loop logic to not require every `MsgVote` request to be sent for the test to succeed.

Resolves: #66480

Release note: None

Co-authored-by: j82w <jwilley@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
@craig craig bot closed this as completed in 4be1c75 Mar 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot. S-1 High impact: many users impacted, serious risk of high unavailability or data loss skipped-test T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants