roachtest: kv/restart/nodes=12 failed #98928

cockroach-teamcity · 2023-03-18T06:30:22Z

roachtest.kv/restart/nodes=12 failed with artifacts on master @ 6c99966f604f3521acdb925b9f689529ffd46df3:

test artifacts and logs in: /artifacts/kv/restart/nodes=12/run_1
(assertions.go:262).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:1021
	            				main/pkg/cmd/roachtest/test_runner.go:993
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	"0" is not greater than "0.5"
	Test:       	kv/restart/nodes=12
(cluster.go:1969).Run: output in run_060156.319224414_n13_cockroach-workload-r: ./cockroach workload run kv --min-block-bytes=8192 --max-block-bytes=8192 --tolerate-errors --read-percent=50  --duration=30m0s --concurrency=64 {pgurl:1-11} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage _{This test on roachdash | Improve this report!

Jira issue: CRDB-25597}

The text was updated successfully, but these errors were encountered:

aliher1911 · 2023-03-20T12:31:20Z

Test was reenabled last week after some fixes (#98271) but it looks like the underlying overload behaviour is unchanged.
After node is added after a downtime SQL throughput tanks while node is trying to catch up. Number of leaseholders on the overloaded node is 0 so the change mentioned in PR works. Since throughput drops, overload must be affecting something else.

aliher1911 · 2023-03-20T12:46:51Z

Seems that my interpretation of lease stats is not correct. Node doesn't have leases but other nodes still try to move them to n12.

Queries are stuck with:

W230318 06:26:27.536186 369697 kv/kvclient/kvcoord/dist_sender.go:1675 ⋮ [T1,n10,client=10.128.0.71:50218,user=root] 12986  slow range RPC: have been waiting 145.10s (1 attempts) for RPC Put [/Table/106/1/‹-112334951631669349›/‹0›,/Min), EndTxn(parallel commit) [/Table/106/1/‹-112334951631669349›/‹0›], [txn: d66aaf7d], [can-forward-ts] to r4816:‹/Table/106/1/-112{980658418560248-058367329429228}› [(n12,s12):5, (n7,s7):4, (n9,s9):3, next=6, gen=64, sticky=9223372036.854775807,2147483647]; resp: result is ambiguous: context done during DistSender.Send: ba: ‹Put [/Table/106/1/-112334951631669349/0,/Min), EndTxn(parallel commit) [/Table/106/1/-112334951631669349/0], [txn: d66aaf7d], [can-forward-ts]› RPC error: grpc: ‹context canceled› [code 1/Canceled]

While we try to move some leases right after node comes back:

I230318 06:23:27.300854 3410055 13@kv/kvserver/replicate_queue.go:2033 ⋮ [T1,n10,replicate,s10,r18220/6:‹/Table/106/1/-88{6091…-5998…}›] 33251  transferring lease to s12
I230318 06:23:42.441713 3429253 13@kv/kvserver/replicate_queue.go:2033 ⋮ [T1,n10,replicate,s10,r6461/1:‹/Table/106/1/91{49588…-50511…}›] 33252  transferring lease to s12
I230318 06:23:56.174176 3442526 13@kv/kvserver/replicate_queue.go:2033 ⋮ [T1,n10,replicate,s10,r13280/2:‹/Table/106/1/617{4277…-5199…}›] 33253  transferring lease to s12
I230318 06:24:03.337125 3449366 13@kv/kvserver/replicate_queue.go:2033 ⋮ [T1,n10,replicate,s10,r16913/4:‹/Table/106/1/-6830{948…-026…}›] 33254  transferring lease to s12

aliher1911 · 2023-03-20T13:16:38Z

When you are going to look at it, maybe you can fix

cockroach/pkg/cmd/roachtest/tests/kv.go

Line 988 in b3d3e64

    
           t.Status(fmt.Sprintf("waiting %x after stopping node to allow the node to fall behind", duration))

formatter from %x to %s? I don't want to raise a PR because of 1 char. It prints timestamp as hex currently.

kvoli · 2023-03-20T13:31:00Z

The restarting node (n12) should be treated as suspect and not be receiving any leases for 30s after restarting.

n12 joins back at 06:23.18.71.

I230318 06:23:18.715307 400 1@gossip/gossip.go:1420 ⋮ [T1,n12] 26  node has connected to cluster via gossip

The earliest lease transfer I see towards n12 is at 06:23:25.54, which is 7 seconds after n12 connects to gossip.

2283:I230318 06:23:25.542903 3477207 13@kv/kvserver/replicate_queue.go:2033 ⋮ [T1,n9,replicate,s9,r18364/5:‹/Table/106/1/720{0787…-1709…}›] 35975  transferring lease to s12
...

So it appears that the node isn't being treated as suspect after immediately rejoining and is receiving leases as a result.

I'll look more into why that is the case.

Previously, the `LastUnavailable` time was set in most parts of the storepool when a store was considered either `Unavailable`, `Dead`, `Decommissioned` or `Draining`. When `LastUnavailable` is within the last suspect duration (30s default), the node is treated as suspect by other nodes in the cluster. `LastUnavailable` was not being set when a store was considered dead due to the store not gossiping its store descriptor. This commit updates the `status` storepool function to do just that. Informs: cockroachdb#98928 Release note: None

kvoli · 2023-03-20T20:15:25Z

With the fix in ##99033 to ensure stores are suspect this test should be less flaky. After 30s and before the restarting node hits overload, it may still receive a few leases and cause this test to flake. Removing the GA blocker.

98792: kvserver: unskip `TestNewTruncateDecision` r=erikgrinaker a=erikgrinaker Passed after 10k stress runs. Has been skipped since 2019, issue seems to have been fixed in the meanwhile. Resolves #38584. Epic: none Release note: None 98855: roachtest: enable schema changes in acceptance/version-upgrade r=fqazi a=fqazi Previously, due to flakes we disabled schema changes inside the version update test. This patch re-enables them, since we are confident that the workload itslef is now stable in a mixed version state. Fixes: #58489 Release note: None 99023: kv: add log scope to BenchmarkSingleRoundtripWithLatency r=arulajmani a=nvanbenschoten Informs #98887. Avoids mixing logs with benchmark results, which breaks benchdiff. Release note: None 99033: storepool: set last unavailable on gossip dead r=andrewbaptist a=kvoli Previously, the `LastUnavailable` time was set in most parts of the storepool when a store was considered either `Unavailable`, `Dead`, `Decommissioned` or `Draining`. When `LastUnavailable` is within the last suspect duration (30s default), the node is treated as suspect by other nodes in the cluster. `LastUnavailable` was not being set when a store was considered dead due to the store not gossiping its store descriptor. This commit updates the `status` storepool function to do just that. Informs: #98928 Release note: None 99039: pkg/ccl/backupccl: Remove TestBackupRestoreControlJob r=benbardin a=benbardin This test has was marked skipped for flakiness, in 2018. Fixes: #24136 Release note: None Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com> Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Austen McClernon <austen@cockroachlabs.com> Co-authored-by: Ben Bardin <bardin@cockroachlabs.com>

github-actions · 2023-04-20T10:06:16Z

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.

cockroach-teamcity added this to the 23.1 milestone Mar 18, 2023

blathers-crl bot added the T-kv KV Team label Mar 18, 2023

aliher1911 assigned kvoli Mar 20, 2023

aliher1911 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 20, 2023

kvoli added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 20, 2023

kvoli mentioned this issue Mar 20, 2023

storepool: set last unavailable on gossip dead #99033

Merged

kvoli removed the GA-blocker label Mar 20, 2023

github-actions bot added no-test-failure-activity X-stale labels Apr 20, 2023

github-actions bot closed this as completed Apr 26, 2023

aliher1911 mentioned this issue May 2, 2023

roachtest: kv/restart/nodes=12 failed #102655

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: kv/restart/nodes=12 failed #98928

roachtest: kv/restart/nodes=12 failed #98928

cockroach-teamcity commented Mar 18, 2023 •

edited by cockroach-jira-scripts

Loading

aliher1911 commented Mar 20, 2023 •

edited

Loading

aliher1911 commented Mar 20, 2023

aliher1911 commented Mar 20, 2023

kvoli commented Mar 20, 2023 •

edited

Loading

kvoli commented Mar 20, 2023

github-actions bot commented Apr 20, 2023

roachtest: kv/restart/nodes=12 failed #98928

roachtest: kv/restart/nodes=12 failed #98928

Comments

cockroach-teamcity commented Mar 18, 2023 • edited by cockroach-jira-scripts Loading

aliher1911 commented Mar 20, 2023 • edited Loading

aliher1911 commented Mar 20, 2023

aliher1911 commented Mar 20, 2023

kvoli commented Mar 20, 2023 • edited Loading

kvoli commented Mar 20, 2023

github-actions bot commented Apr 20, 2023

cockroach-teamcity commented Mar 18, 2023 •

edited by cockroach-jira-scripts

Loading

aliher1911 commented Mar 20, 2023 •

edited

Loading

kvoli commented Mar 20, 2023 •

edited

Loading