roachtest: replicate/wide failed #96546

Closed
cockroach-teamcity opened this issue Feb 4, 2023 · 4 comments

cockroach-teamcity commented Feb 4, 2023

roachtest.replicate/wide failed with artifacts on master @ 5fbcd8a8deac0205c7df38e340c1eb9692854383:

test artifacts and logs in: /artifacts/replicate/wide/run_1
(allocator.go:346).func1: dial tcp 34.139.136.164:26257: connect: connection refused
(test_runner.go:1110).teardownTest: test timed out (10m0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-24180

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 4, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Feb 4, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Feb 4, 2023

tbg commented Feb 7, 2023

Test timed out in

https://github.com/cockroachdb/cockroach/blob/4217aa07ffa0eb9cb38495ca48ea69ce2ebee44f/pkg/cmd/roachtest/tests/allocator.go#L344-L343

Test stack
goroutine 2520111 [IO wait, 7 minutes]:
internal/poll.runtime_pollWait(0x7f5d0e463198, 0x72)
	GOROOT/src/runtime/netpoll.go:305 +0x89
internal/poll.(*pollDesc).wait(0xc00a0ec880?, 0xc004a7c000?, 0x0)
	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	GOROOT/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00a0ec880, {0xc004a7c000, 0x1000, 0x1000})
	GOROOT/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc00a0ec880, {0xc004a7c000?, 0x62ae086?, 0x2c?})
	GOROOT/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc017e95140, {0xc004a7c000?, 0xc003f21b20?, 0x0?})
	GOROOT/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc00bb92120, {0xc003f21b20, 0x5, 0xc00a0ec880?})
	GOROOT/src/bufio/bufio.go:237 +0x1bb
io.ReadAtLeast({0x75b60a0, 0xc00bb92120}, {0xc003f21b20, 0x5, 0x200}, 0x5)
	GOROOT/src/io/io.go:332 +0x9a
io.ReadFull(...)
	GOROOT/src/io/io.go:351
github.com/lib/pq.(*conn).recvMessage(0xc003f21b00, 0xc005ae2f90)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1004 +0xca
github.com/lib/pq.(*conn).recv1Buf(0xc003f21b00, 0x24?)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1059 +0x2e
github.com/lib/pq.(*conn).recv1(...)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1086
github.com/lib/pq.(*conn).simpleQuery(0xc003f21b00, {0x609bfae?, 0x26})
	github.com/lib/pq/external/com_github_lib_pq/conn.go:696 +0x234
github.com/lib/pq.(*conn).query(0xc003f21b00, {0x609bfae, 0x26}, {0xaf7de70, 0x0, 0x0})
	github.com/lib/pq/external/com_github_lib_pq/conn.go:889 +0x3ee
github.com/lib/pq.(*conn).QueryContext(0x22?, {0x7609198, 0xc0000ba060}, {0x609bfae, 0x26}, {0xaf7de70, 0x0, 0x2?})
	github.com/lib/pq/external/com_github_lib_pq/conn_go18.go:24 +0xd7
database/sql.ctxDriverQuery({0x7609198?, 0xc0000ba060?}, {0x7f5d2123a060?, 0xc003f21b00?}, {0x0?, 0x0?}, {0x609bfae?, 0xc?}, {0xaf7de70, 0x0, ...})
	GOROOT/src/database/sql/ctxutil.go:48 +0xe5
database/sql.(*DB).queryDC.func1()
	GOROOT/src/database/sql/sql.go:1757 +0x175
database/sql.withLock({0x75d69e0, 0xc00c4b6240}, 0xc005ae34e0)
	GOROOT/src/database/sql/sql.go:3439 +0x8c
database/sql.(*DB).queryDC(0xc00105c201?, {0x7609198, 0xc0000ba060}, {0x0, 0x0}, 0xc00c4b6240, 0xc003d5e300, {0x609bfae, 0x26}, {0x0, ...})
	GOROOT/src/database/sql/sql.go:1752 +0x211
database/sql.(*DB).query(0x0?, {0x7609198, 0xc0000ba060}, {0x609bfae, 0x26}, {0x0, 0x0, 0x0}, 0x99?)
	GOROOT/src/database/sql/sql.go:1735 +0xfd
database/sql.(*DB).QueryContext(0xc000a33e60?, {0x7609198, 0xc0000ba060}, {0x609bfae, 0x26}, {0x0, 0x0, 0x0})
	GOROOT/src/database/sql/sql.go:1708 +0xda
database/sql.(*DB).Query(...)
	GOROOT/src/database/sql/sql.go:1726
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runWideReplication.func1()
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/allocator.go:344 +0x8e
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runWideReplication.func3(0xc483f615b?)
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/allocator.go:370 +0x2f
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runWideReplication({0x7609160, 0xc006947780}, {0x7688b50, 0xc008848c60}, {0x76ac990?, 0xc0058c8840?})
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/allocator.go:459 +0xb75
main.(*testRunner).runTest.func2()
	main/pkg/cmd/roachtest/test_runner.go:974 +0xa3
created by main.(*testRunner).runTest
	main/pkg/cmd/roachtest/test_runner.go:959 +0x89e

The test timed out at 12:17. Stacks should have been dumped at 12:17:38:

teardown: 12:17:38 test_runner.go:1055: [w15] asked CRDB nodes to dump stacks; check their main (DEV) logs:

Then, for the next five minutes, we try to download tsdump, debug.zip, and so on, all to no avail; everything just times out.


The test did restart the nodes at 12:08 (intentionally), though it looks like n4 then died again two minutes later, with code 134

cockroach exited with code 134: Sat Feb  4 12:10:10 UTC 2023

We recognize this as exit code 6, or disk stall [1]. The last log message from this node is close to two minutes old at that point:

teamcity-8580139-1675491419-59-n9cpu1-0004> I230204 12:08:27.194846 1244 3@pebble/event.go:678 ⋮ [T1,n4,s4,pebble] 258 [JOB 66] sstable deleted 000083

so n4 hit a wall. We may want to investigate why it went down, but for now let's note that this shouldn't have killed the cluster and move on.
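
(For reference, the exit-code arithmetic from footnote 1 as a tiny self-contained check; treating anything above 128 as 128 + <wrapped code> is the usual convention and is an assumption here, not something the test asserts.)

package main

import "fmt"

func main() {
	// roachprod reported "exited with code 134"; footnote 1 below decodes
	// this as 128 + 6, i.e. exit code 6, which the comment above reads as
	// a disk stall.
	const reported = 134
	fmt.Println(reported - 128) // prints 6
}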


Logs on all nodes (including this one) are full of:

E230204 12:20:38.059962 1752799 kv/kvserver/replica_range_lease.go:462 ⋮ [T1,n3,s3,r48/9:‹/Table/4{6-7}›] 36259  failed to heartbeat own liveness record: replica unavailable: (n3,s3):10 unable to serve request to r2:‹/System/NodeLiveness{-Max}› [(n8,s8):6, (n3,s3):10, (n9,s9):7, (n1,s1):16, (n7,s7):5, (n5,s5):15, (n4,s4):14, next=17, gen=50]: lost quorum (down: (n8,s8):6,(n9,s9):7,(n7,s7):5,(n4,s4):14); closed timestamp: 1675512599.749504581,0 (2023-02-04 12:09:59); raft status: {"id":"a","term":8,"vote":"a","commit":491,"lead":"a","raftState":"StateLeader","applied":491,"progress":{

"10":{"match":3911,"next":3912,"state":"StateReplicate"}, // s1/16
"5":{"match":0,"next":273,"state":"StateProbe"},          // s7/5
"6":{"match":0,"next":273,"state":"StateProbe"},          // s8/6
"7":{"match":0,"next":273,"state":"StateProbe"},          // s9/7
"a":{"match":3911,"next":3912,"state":"StateReplicate"},  // n3/10
"e":{"match":491,"next":492,"state":"StateProbe"},        // s4/14
"f":{"match":3911,"next":3912,"state":"StateReplicate"}   // s5/15

},"leadtransferee":"0"}: encountered poisoned latch ‹/System/NodeLiveness/3›@1675512604.220384952,0

So we have a raft leader, but with seven voters and four of them probing (i.e. not contributing to the log), we're in a tough spot.
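
To make that concrete, here's a rough quorum check over the progress map quoted above (a sketch with hand-copied values, not code from the repo): a 7-voter range needs 4 acks, but only 3 replicas are in StateReplicate.

package main

import "fmt"

func main() {
	// Replica states hand-copied from the raft status above.
	state := map[string]string{
		"10": "StateReplicate", // s1/16
		"5":  "StateProbe",     // s7/5
		"6":  "StateProbe",     // s8/6
		"7":  "StateProbe",     // s9/7
		"a":  "StateReplicate", // n3/10 (leader)
		"e":  "StateProbe",     // s4/14
		"f":  "StateReplicate", // s5/15
	}
	quorum := len(state)/2 + 1 // 4 of 7
	replicating := 0
	for _, s := range state {
		if s == "StateReplicate" {
			replicating++
		}
	}
	// Prints: quorum=4 replicating=3 canCommitNewEntries=false
	fmt.Printf("quorum=%d replicating=%d canCommitNewEntries=%v\n",
		quorum, replicating, replicating >= quorum)
}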

Interestingly the down node n4 isn't even in this descriptor.


Looking at when the liveness range (r2) became unavailable, we see that it started at around 12:10:20 minus 15s, i.e. 12:10:05:

teamcity-8580139-1675491419-59-n9cpu1-0003> W230204 12:10:20.934385 92838 kv/kvserver/spanlatch/manager.go:558 ⋮ [T1,n3,s3,r2/10:‹/System/NodeLiveness{-Max}›] 381 have been waiting 15s to acquire ‹write› latch ‹/System/NodeLiveness/6@1675512602.931118243,0›, held by ‹write› latch ‹/System/NodeLiveness/6@1675512602.931118243,0›

That makes sense given the raft status we saw above,

"5":{"match":0,"next":273,"state":"StateProbe"}, // s7/5
"6":{"match":0,"next":273,"state":"StateProbe"},  // s8/6
"7":{"match":0,"next":273,"state":"StateProbe"}, s9/7
"e":{"match":491,"next":492,"state":"StateProbe"} s4/14

but how did we end up with four voters in probing state? They were all added as part of rebalancing, and so they got snapshots while they were learners:

cockroach debug merge-logs --filter 'r2[^0-9]' logs/*.unredacted/cockroach.team*.log | grep 'applied INITIAL'
teamcity-8580139-1675491419-59-n9cpu1-0002> I230204 12:08:25.605957 2252 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n2,s2,r2/2:‹/System/NodeLiveness{-Max}›] 80 applied INITIAL snapshot ba999e03 from (n1,s1):1 at applied index 30 (total=4ms data=1.4 KiB ingestion=6@3ms)
teamcity-8580139-1675491419-59-n9cpu1-0004> I230204 12:08:27.853597 2818 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n4,s4,r2/3:‹/System/NodeLiveness{-Max}›] 91 applied INITIAL snapshot 2058ba64 from (n1,s1):1 at applied index 46 (total=4ms data=1.7 KiB ingestion=6@3ms)
teamcity-8580139-1675491419-59-n9cpu1-0005> I230204 12:08:30.493975 3583 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n5,s5,r2/4:‹/System/NodeLiveness{-Max}›] 103 applied INITIAL snapshot 254e3ffb from (n1,s1):1 at applied index 60 (total=6ms data=2.0 KiB ingestion=6@5ms)
teamcity-8580139-1675491419-59-n9cpu1-0007> I230204 12:08:33.429530 609 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n7,s7,r2/5:‹/System/NodeLiveness{-Max}›] 64 applied INITIAL snapshot ec91e5f1 from (n1,s1):1 at applied index 81 (total=4ms data=2.5 KiB ingestion=6@3ms)
teamcity-8580139-1675491419-59-n9cpu1-0008> I230204 12:08:39.707984 9384 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n8,s8,r2/6:‹/System/NodeLiveness{-Max}›] 106 applied INITIAL snapshot bceddd99 from (n1,s1):1 at applied index 126 (total=24ms data=3.4 KiB ingestion=6@23ms)
teamcity-8580139-1675491419-59-n9cpu1-0009> I230204 12:08:40.134613 5363 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n9,s9,r2/7:‹/System/NodeLiveness{-Max}›] 106 applied INITIAL snapshot 4f620e54 from (n8,s8):6 at applied index 139 (total=4ms data=5.2 KiB ingestion=6@3ms)
teamcity-8580139-1675491419-59-n9cpu1-0004> I230204 12:08:44.840240 31091 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n4,s4,r2/8:‹/System/NodeLiveness{-Max}›] 120 applied INITIAL snapshot 29f002ef from (n8,s8):6 at applied index 173 (total=3ms data=4.9 KiB ingestion=6@1ms)
teamcity-8580139-1675491419-59-n9cpu1-0006> I230204 12:08:46.841069 27901 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n6,s6,r2/9:‹/System/NodeLiveness{-Max}›] 143 applied INITIAL snapshot 39ca52d7 from (n8,s8):6 at applied index 190 (total=4ms data=5.3 KiB ingestion=6@3ms)
teamcity-8580139-1675491419-59-n9cpu1-0003> I230204 12:08:48.912283 40651 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n3,s3,r2/10:‹/System/NodeLiveness{-Max}›] 153 applied INITIAL snapshot e3bd154f from (n8,s8):6 at applied index 208 (total=4ms data=5.8 KiB ingestion=6@3ms)
teamcity-8580139-1675491419-59-n9cpu1-0001> I230204 12:08:50.828603 63839 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n1,s1,r2/11:‹/System/NodeLiveness{-Max}›] 257 applied INITIAL snapshot 2cbdb1f1 from (n8,s8):6 at applied index 223 (total=4ms data=6.3 KiB ingestion=6@2ms)
teamcity-8580139-1675491419-59-n9cpu1-0002> I230204 12:09:38.524064 51659 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n2,s2,r2/12:‹/System/NodeLiveness{-Max}›] 246 applied INITIAL snapshot ee5c2edd from (n3,s3):10 at applied index 392 (total=20ms data=11 KiB ingestion=6@19ms)
teamcity-8580139-1675491419-59-n9cpu1-0006> I230204 12:09:41.754165 40247 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n6,s6,r2/13:‹/System/NodeLiveness{-Max}›] 217 applied INITIAL snapshot 6cee61a0 from (n3,s3):10 at applied index 408 (total=4ms data=11 KiB ingestion=6@3ms)
teamcity-8580139-1675491419-59-n9cpu1-0004> I230204 12:09:48.179650 61657 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n4,s4,r2/14:‹/System/NodeLiveness{-Max}›] 233 applied INITIAL snapshot 452bdc5c from (n3,s3):10 at applied index 432 (total=8ms data=12 KiB ingestion=6@7ms)
teamcity-8580139-1675491419-59-n9cpu1-0005> I230204 12:09:51.907240 59046 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n5,s5,r2/15:‹/System/NodeLiveness{-Max}›] 278 applied INITIAL snapshot 3d3f01fc from (n3,s3):10 at applied index 451 (total=4ms data=13 KiB ingestion=6@3ms)
teamcity-8580139-1675491419-59-n9cpu1-0001> I230204 12:09:55.831435 85603 kv/kvserver/replica_raftstorage.go:551 ⋮ [T1,n1,s1,r2/16:‹/System/NodeLiveness{-Max}›] 358 applied INITIAL snapshot da50cf44 from (n3,s3):10 at applied index 470 (total=2ms data=14 KiB ingestion=6@1ms)

Three of the voters have a Match of zero, meaning that the leader hasn't seen an affirmative MsgAppResp from them yet. For s5, we have index 491 acked; it got a snapshot at 451 so it did catch up a little while before losing it again. We added s1 after that and that one works like a charm and is replicating.
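
As a sanity check on that reading of Match: raft commits (roughly) the quorum-th highest acked index, and plugging in the Match values from the status above reproduces the commit=491 we see there (again a hand-copied sketch, not repo code):

package main

import (
	"fmt"
	"sort"
)

func main() {
	// Match values hand-copied from the raft status above, leader included.
	matches := []uint64{3911, 0, 0, 0, 3911, 491, 3911}
	sort.Slice(matches, func(i, j int) bool { return matches[i] > matches[j] })
	quorum := len(matches)/2 + 1 // 4 of 7
	// The highest index acked by a quorum; prints 491, matching "commit":491.
	fmt.Println(matches[quorum-1])
}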

Will take a break, triage my other tests, and then look into this one again. Noting that #94165 merged a few days ago and was picked up in this run (cc @nvanbenschoten)

Footnotes

  1. 128+6=134, https://github.com/cockroachdb/cockroach/blob/54282ad5728ded88c7b40c1c663c7e57ff6ac2b9/pkg/cli/exit/codes.go#L44-L46

@tbg tbg self-assigned this Feb 7, 2023
@nvanbenschoten

I kicked off a series of iterations of this test on master (8022f2a) to see whether it's noticeably more flaky. That would be a good indication that we recently broke something. Unfortunately, 100/100 iterations passed.

@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 8, 2023

tbg commented Feb 8, 2023

Thanks Nathan.

I just figured out what the problem is and luckily it's not in the replication layer.

The test does the following (the replication-factor steps are sketched in code right after the list):

  • start 9 nodes
  • set replication factor = 9 and wait until enforced
  • stop cluster
  • restart only n1-n6
  • decommission n9 (in absentia)
  • wait for replication factor 7 to be enforced (because the cluster now has only 8 members)
  • set replication factor to 5
  • wait for 5x replication
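
Roughly speaking, the replication factor in these steps is driven through zone configs; a minimal sketch of that knob (a hypothetical helper, not the actual code in pkg/cmd/roachtest/tests/allocator.go) could look like:

package main

import (
	gosql "database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // the same Postgres-wire driver the stack trace above shows in use
)

// setReplicationFactor is a hypothetical helper sketching the
// "set replication factor" steps above; the real test logic lives in
// pkg/cmd/roachtest/tests/allocator.go.
func setReplicationFactor(db *gosql.DB, n int) error {
	_, err := db.Exec(fmt.Sprintf(
		"ALTER RANGE default CONFIGURE ZONE USING num_replicas = %d", n))
	return err
}

func main() {
	db, err := gosql.Open("postgres",
		"postgresql://root@localhost:26257?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := setReplicationFactor(db, 9); err != nil {
		log.Fatal(err)
	}
}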

At restart time, r2 has members (all voters) n1-n9, as expected:

[(n8,s8):6 (n2,s2):2 (n9,s9):7 (n5,s5):4 (n7,s7):5 (n4,s4):8 (n6,s6):9 (n3,s3):10 (n1,s1):11] next=12

Note that only 6/9 are up.

After the restart, n3 quickly

  • demotes and removes n4 (still live at this point) - 8 replicas, 3 of which dead
  • demotes and removes n2 (a live voter) - 7 replicas, 3 of which dead
  • rebalance n6->n2 (live to live) - unchanged
  • rebalance n5->n6 (live to live) - unchanged
  • rebalance n2->n4 (live to live) - uh oh, 7 replicas, 3 of which dead, but n4 is going to die
  • rebalance n1->n5 (live to live) - unchanged
  • rebalance n6->n1 (live to live) - unchanged
  • n4 dies, lost quorum

The test seems to be handling this about as well as it could. We could argue about whether the adaptive replication factor is a good idea (it's not), but nothing here is wrong with that code or with the test. We simply cannot tolerate an outage while intentionally in this state.

I'm going to take a quick look at the Prometheus metrics for n4 and then close this out.


tbg commented Feb 8, 2023

n4 did report to Prometheus right up until it crashed, including sys_host_disk_write_bytes. Not much to do here; my best guess is a logging disk stall.

@tbg tbg closed this as completed Feb 8, 2023