roachtest: replicate/wide failed #96546
Test timed out in Test stack
The test timed out at 12:17. Stacks should have been dumped at 12:17:38:

    teardown: 12:17:38 test_runner.go:1055: [w15] asked CRDB nodes to dump stacks; check their main (DEV) logs

Then, for the next five minutes, we try to download tsdump and debug.zip, all to no avail; it just times out. The test did restart the nodes at 12:08 (intentionally), though it looks like n4 then died again two minutes later, with code 134.
We recognize this as exit code 6 (134 − 128), or a disk stall¹. The last log message from this node is close to two minutes old at that point:
so n4 hit a wall. We may want to investigate why it went down, but for now let's note that this shouldn't have killed the cluster and move on. Logs on all nodes (including this one) are full of
so we have a raft leader, but with seven voters and four of them probing (i.e. not helping with the log) we're in a tough spot. Interestingly, the down node n4 isn't even in this descriptor. Looking at when the liveness range (r2) became unavailable, we see that it started at around 12:10:20 minus 15s, i.e.
That makes sense given the raft status we saw above, but how did we end up with four voters in probing state? They were all added as part of rebalancing, and so they got snapshots while they were learners:

    cockroach debug merge-logs --filter 'r2[^0-9]' logs/*.unredacted/cockroach.team*.log | grep 'applied INITIAL'
Three of the voters have a Match of zero, meaning that the leader hasn't seen an affirmative MsgAppResp from them yet. For s5, we have index 491 acked; it got a snapshot at 451, so it did catch up for a little while before losing it again. We added s1 after that, and that one works like a charm and is replicating.

Will take a break, triage my other tests, and then look into this one again. Noting that #94165 merged a few days ago and was picked up in this run (cc @nvanbenschoten).
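To make the "four probing voters" problem concrete, here is a minimal, self-contained Go sketch of the generic raft quorum/commit-index rule. This is not CockroachDB or etcd/raft code, and apart from the 491 mentioned above, the Match values are invented for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index known to be replicated on a
// quorum of voters, given the Match index the leader has recorded for each
// voter. This mirrors the raft rule "the commit index is the largest N such
// that a majority of match indexes are >= N"; it is a sketch, not CRDB code.
func commitIndex(match []uint64) uint64 {
	sorted := append([]uint64(nil), match...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	quorum := len(sorted)/2 + 1
	// The quorum-th largest Match value is the highest index present on a
	// quorum of voters.
	return sorted[len(sorted)-quorum]
}

func main() {
	// Hypothetical Match values for the seven voters: three replicating voters
	// with nonzero Match (491 is the index s5 acked; the others are made up),
	// and four probing voters that have never acked a MsgAppResp.
	match := []uint64{512, 510, 491, 0, 0, 0, 0}
	fmt.Println(commitIndex(match)) // 0: four of seven voters at Match==0 pin it
}
```

With four of seven voters stuck at Match zero, the quorum-replicated index in this toy model never moves past zero, i.e. the leader cannot commit new entries, which lines up with r2 being unavailable above.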
I kicked off a series of iterations of this test on master (8022f2a) to see whether it's noticeably more flaky. That would be a good indication that we recently broke something. Unfortunately, 100/100 iterations passed.
Thanks Nathan. I just figured out what the problem is and luckily it's not in the replication layer. The test does the following:
At restart time, r2 has members (all voters) n1-n9, as expected:
Note that only 6/9 are up. After the restart, n3 quickly
The test seems to be handling this just about as well as it could. We could argue whether the adaptive replication factor is a good idea (it's not), but nothing here is wrong with that code or the test. We simply cannot tolerate an outage while intentionally in this state. I'm going to take a quick look at the Prometheus metrics for n4 and then close this out.
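As a side note on the arithmetic behind this conclusion (a sketch only; the exact sequence of membership changes the test performs is elided above): a raft group of n voters needs n/2+1 caught-up replicas to make progress, so the margin for additional failures shrinks quickly while replicas are down or still catching up.

```go
package main

import "fmt"

// quorum returns the number of caught-up voters a raft group of n voters
// needs in order to make progress.
func quorum(n int) int { return n/2 + 1 }

func main() {
	// Replication factors up to the wide, nine-voter configuration that r2
	// has at restart time in this test.
	for _, n := range []int{3, 5, 7, 9} {
		fmt.Printf("%d voters: quorum %d, tolerates %d unavailable replicas\n",
			n, quorum(n), n-quorum(n))
	}
}
```

With nine voters and only six of them up, the margin is down to a single additional failure, and only if every survivor is fully caught up, which is why the extra loss of n4 while intentionally in this state was enough to take r2 down.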
n4 did report to Prometheus right up until it crashed, including
roachtest.replicate/wide failed with artifacts on master @ 5fbcd8a8deac0205c7df38e340c1eb9692854383:
Parameters:
ROACHTEST_cloud=gce
ROACHTEST_cpu=1
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-24180