roachtest: kv/restart/nodes=12 failed #98928
Comments
The test was re-enabled last week after some fixes (#98271), but it looks like the underlying overload behaviour is unchanged.
It seems that my interpretation of the lease stats is not correct. The node doesn't have leases, but other nodes still try to move them to n12. Queries are stuck with:
Meanwhile, we try to move some leases right after the node comes back:
When you look at it, could you also change the formatter in cockroach/pkg/cmd/roachtest/tests/kv.go (line 988 in b3d3e64) from %x to %s? I don't want to raise a PR for a one-character change. It currently prints the timestamp as hex.
The restarting node (n12) should be treated as suspect and not be receiving any leases for 30s after restarting. n12 joins back at
The earliest lease transfer I see towards
So it appears that the node isn't being treated as suspect immediately after rejoining and is receiving leases as a result. I'll look more into why that is the case.
Previously, the `LastUnavailable` time was set in most parts of the storepool when a store was considered either `Unavailable`, `Dead`, `Decommissioned` or `Draining`. When `LastUnavailable` is within the last suspect duration (30s default), the node is treated as suspect by other nodes in the cluster.

`LastUnavailable` was not being set when a store was considered dead due to the store not gossiping its store descriptor. This commit updates the `status` storepool function to do just that.

Informs: cockroachdb#98928
Release note: None
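The suspect check described above can be sketched roughly as follows. This is a simplified model, not the storepool code: `storeDetail`, `isSuspect`, and `suspectAfter` are hypothetical names, and only the 30s default and the `LastUnavailable` concept come from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// suspectDuration is the default window mentioned in the commit message.
const suspectDuration = 30 * time.Second

// storeDetail is a hypothetical, stripped-down stand-in for per-store state.
type storeDetail struct {
	lastUnavailable time.Time // zero value means "never recorded"
}

// isSuspect reports whether the store was unavailable within the last
// suspectDuration, in which case it should not receive leases.
func (d storeDetail) isSuspect(now time.Time) bool {
	if d.lastUnavailable.IsZero() {
		return false
	}
	return now.Sub(d.lastUnavailable) < suspectDuration
}

// suspectAfter models a store that was dead until it rejoined, then asks
// whether it is suspect the given number of seconds after rejoining.
// recordUnavailable=false models the path that forgot to set lastUnavailable.
func suspectAfter(recordUnavailable bool, secondsSinceRejoin int) bool {
	rejoin := time.Date(2000, time.January, 1, 0, 0, 0, 0, time.UTC) // arbitrary
	var d storeDetail
	if recordUnavailable {
		d.lastUnavailable = rejoin
	}
	now := rejoin.Add(time.Duration(secondsSinceRejoin) * time.Second)
	return d.isSuspect(now)
}

func main() {
	fmt.Println(suspectAfter(false, 10)) // false: timestamp never set, leases flow at once
	fmt.Println(suspectAfter(true, 10))  // true: inside the 30s window
	fmt.Println(suspectAfter(true, 45))  // false: window has elapsed
}
```

In this model, the first case corresponds to the behaviour reported in this issue, where a store that was dead only because it stopped gossiping never had its timestamp recorded and so became a lease target as soon as it rejoined.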
With the fix in #99033 to ensure stores are treated as suspect, this test should be less flaky. After the 30s suspect window and before the restarting node hits overload, it may still receive a few leases and cause this test to flake. Removing the GA blocker.
98792: kvserver: unskip `TestNewTruncateDecision` r=erikgrinaker a=erikgrinaker
Passed after 10k stress runs. Has been skipped since 2019; the issue seems to have been fixed in the meantime.
Resolves #38584.
Epic: none
Release note: None

98855: roachtest: enable schema changes in acceptance/version-upgrade r=fqazi a=fqazi
Previously, due to flakes, we disabled schema changes inside the version upgrade test. This patch re-enables them, since we are confident that the workload itself is now stable in a mixed-version state.
Fixes: #58489
Release note: None

99023: kv: add log scope to BenchmarkSingleRoundtripWithLatency r=arulajmani a=nvanbenschoten
Informs #98887.
Avoids mixing logs with benchmark results, which breaks benchdiff.
Release note: None

99033: storepool: set last unavailable on gossip dead r=andrewbaptist a=kvoli
Previously, the `LastUnavailable` time was set in most parts of the storepool when a store was considered either `Unavailable`, `Dead`, `Decommissioned` or `Draining`. When `LastUnavailable` is within the last suspect duration (30s default), the node is treated as suspect by other nodes in the cluster.
`LastUnavailable` was not being set when a store was considered dead due to the store not gossiping its store descriptor. This commit updates the `status` storepool function to do just that.
Informs: #98928
Release note: None

99039: pkg/ccl/backupccl: Remove TestBackupRestoreControlJob r=benbardin a=benbardin
This test was marked skipped for flakiness in 2018.
Fixes: #24136
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Ben Bardin <bardin@cockroachlabs.com>
We have marked this test failure issue as stale because it has been
roachtest.kv/restart/nodes=12 failed with artifacts on master @ 6c99966f604f3521acdb925b9f689529ffd46df3:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=8, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-25597