roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed [one node IO/CPU starved] #111825
looking into this
Job was 56% done:
The job starts at 07:40, and from a couple of minutes later until the test times out we see
Similarly, there are 19 instances of slow AdminSplits and 3 of AdminScatter, all in the 90-100s range. After the job started at 07:40, we saw an inbox communication error and a retry at 07:45:
We do now have much better checkpointing to avoid redoing completed work, so this shouldn't have a major impact on performance, but it's worth noting anyway. We have a very high number of connection-reset IO retries when reading the backup from the bucket:
This would definitely cause the restore to run longer, since we're retrying a large number of the reads from S3. We also generally see raft struggling to keep up pretty much immediately after the restore starts (messages like …). Overall, it looks like the cluster was pretty overwhelmed and unable to serve KV requests. I will ping KV in case they're interested in giving this cluster a look.
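For reference, the retry behavior described here amounts to a capped exponential backoff around each object-store read. Below is a minimal sketch of that pattern; the helper names (`readChunk`, `isConnReset`) are illustrative and not the actual cockroach cloud-storage code:

```go
// Hypothetical sketch: retry a flaky object-store read with capped
// exponential backoff, treating "connection reset" errors as retryable.
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

func isConnReset(err error) bool {
	return err != nil && strings.Contains(err.Error(), "connection reset by peer")
}

func readWithRetry(readChunk func() ([]byte, error), maxAttempts int) ([]byte, error) {
	backoff := 100 * time.Millisecond
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		data, err := readChunk()
		if err == nil {
			return data, nil
		}
		if !isConnReset(err) {
			return nil, err // non-retryable error: fail fast
		}
		lastErr = err
		time.Sleep(backoff)
		if backoff < 5*time.Second {
			backoff *= 2 // cap the backoff growth
		}
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	attempts := 0
	data, err := readWithRetry(func() ([]byte, error) {
		attempts++
		if attempts < 3 {
			return nil, errors.New("read tcp: connection reset by peer")
		}
		return []byte("sst chunk"), nil
	}, 5)
	fmt.Printf("data=%q err=%v attempts=%d\n", data, err, attempts)
}
```

Even with backoff, a high connection-reset rate adds up across many SST reads, which would line up with the longer restore runtime observed here.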
Is this related to #111160? In particular, do we see one node spinning at 100% CPU? Update: yes, this looks a lot like it. We should check out CPU profiles for the overloaded node.
cc @cockroachdb/replication
One thing stood out in
@erikgrinaker are we ever allowed to run this method for that long? It looks like we evaluated the thing but never got the result or a cancellation. It's a probe though, and it also doesn't hold any locks.
We discussed this separately, but this does seem suspect -- we're basically waiting for the reproposal machinery to get this proposal through, and it isn't happening for whatever reason.
Running a "happy" run to compare the behaviours. I do see the same periodic spikes from one node, so that's probably just the shape of the workload. There aren't many reproposals, though notably the count isn't zero and trickles constantly at a rate of a few reproposals per minute.
In the happy run, the Raft portion of the CPU profile is hardly noticeable (it looks similar on the multiple nodes I've looked at). The profile we've seen above is almost double this one, and noticeably has a lot more raft processing.
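For anyone wanting to repeat this comparison, here is a small sketch for pulling a 30-second CPU profile off a node; it assumes the node's HTTP port (8080 by default) is reachable and that the standard Go `/debug/pprof/profile` endpoint is exposed there:

```go
// Sketch: fetch a 30-second CPU profile from a node's /debug/pprof endpoint
// and write it to disk for inspection with `go tool pprof`.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	host := "localhost:8080" // assumed HTTP port of the suspect node
	if len(os.Args) > 1 {
		host = os.Args[1]
	}
	url := fmt.Sprintf("http://%s/debug/pprof/profile?seconds=30", host)

	client := &http.Client{Timeout: 60 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("cpu.pb.gz")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote cpu.pb.gz; inspect with: go tool pprof cpu.pb.gz")
}
```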
@sumeerbhola Could you help read this graph? (The TL;DR from above is that
@cockroachdb/test-eng Is it possible to check whether VMs from this roachtest run were previously used by other roachtests, and potentially identify those tests? There is a suspicion that this run inherited some leftover |
VM (or rather, cluster) reuse is indeed possible: if a test runs on a cluster and we later need to run a test that requires a cluster compatible with that existing one, it might be reused (after …). This is not the case in this failure, however. If we check the test runner logs, we see that the cluster was created for this test and was not reused afterwards:
cc @cockroachdb/replication
Passing this issue to the @cockroachdb/test-eng team. We've done an extensive investigation, and the root cause still seems to be an IO-constrained (possibly also CPU-constrained) node. It would be nice to get a level below and understand the VM initialization sequence: how does it end up differing between nodes, is that normal, etc.? We could also benefit from some observability improvements here:
I still have artifacts for this issue and #111160, LMK if you need them.
cc @cockroachdb/test-eng |
Specifically, we would like to understand the correlation with |
There is no correlation between …

[1] https://unix.stackexchange.com/a/685632
The correlation is with slow IO: the node which reported ext4 instead of tmpfs and which did not appear to run
A bit more precisely, it's not that
For completeness, here is the log from run #111160
and the corresponding
NB: node 5 is the only one that didn't see … We observe a similar effect in this issue. The slow node
Slow node
It looks as though either the node is initially slow, or maybe it wasn't entirely configured before the process started.
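One cheap sanity check along these lines (a hypothetical helper, not something roachtest does today as far as I know) would be to verify the filesystem type backing the data directory before the process starts, by scanning /proc/mounts; the `/mnt/data1` mount point below is an assumption:

```go
// Hypothetical pre-flight check: report the filesystem type backing a given
// mount point by scanning /proc/mounts, to catch a node where the expected
// mount (e.g. tmpfs vs ext4) was never set up.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// fsTypeOf returns the filesystem type of the mount whose mount point matches
// path exactly, or "" if no such mount exists.
func fsTypeOf(path string) (string, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// /proc/mounts format: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 3 && fields[1] == path {
			return fields[2], nil
		}
	}
	return "", scanner.Err()
}

func main() {
	const dataDir = "/mnt/data1" // assumed data directory mount point
	fsType, err := fsTypeOf(dataDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s is backed by %q\n", dataDir, fsType)
}
```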
More evidence that the nodes are not symmetric:
All nodes report startup times of ~50s, and ~25ms. Node … The same thing happens in the other failed run: all nodes completed the startup in ~50s, whereas the slow node
The above happens before the cockroach binary is started, or even copied to the nodes.
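One way to surface this automatically would be to compare per-node startup durations against the cluster median and flag outliers. A minimal sketch, using rough numbers in the spirit of this thread (~50s for healthy nodes vs. ~10 min for the slow one) and assuming the durations have already been extracted from the runner logs:

```go
// Sketch: flag nodes whose startup duration is far above the cluster median.
// The durations below are illustrative, not taken from actual artifacts.
package main

import (
	"fmt"
	"sort"
	"time"
)

func median(ds []time.Duration) time.Duration {
	s := append([]time.Duration(nil), ds...)
	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	return s[len(s)/2]
}

func main() {
	startup := map[string]time.Duration{
		"n1": 52 * time.Second, "n2": 49 * time.Second, "n3": 51 * time.Second,
		"n4": 50 * time.Second, "n5": 10 * time.Minute, "n6": 48 * time.Second,
		"n7": 53 * time.Second, "n8": 50 * time.Second,
	}

	var all []time.Duration
	for _, d := range startup {
		all = append(all, d)
	}
	med := median(all)

	// Flag anything more than 3x the median as a likely unhealthy VM.
	for node, d := range startup {
		if d > 3*med {
			fmt.Printf("%s: startup %s is >3x the median (%s); VM likely unhealthy\n", node, d, med)
		}
	}
}
```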
@srosenberg Do we have some way of doing a mega-grep on all the recent artifacts (the |
In the artifacts for a couple of AWS roachtests that I found in TeamCity, the startup times are (as for this test) under a minute. I think this is what we should expect pretty consistently. The 6 and 10 min startups are surprisingly high and likely indicate some problem with this VM. Maybe we could add some extra diagnostics with
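As one possible form of such diagnostics (my assumption, not an agreed-upon plan), we could capture `systemd-analyze` output on each node before the workload starts, so a slow boot would be visible in the artifacts; a sketch:

```go
// Sketch: capture boot-time diagnostics on a node using systemd-analyze,
// so slow VM initialization shows up in the test artifacts.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// capture runs a command and prints its combined output, mimicking what a
// diagnostics step could append to the per-node artifacts.
func capture(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		fmt.Fprintf(os.Stderr, "%s %v failed: %v\n", name, args, err)
	}
	fmt.Printf("$ %s %v\n%s\n", name, args, out)
}

func main() {
	capture("systemd-analyze", "time")           // overall boot time breakdown
	capture("systemd-analyze", "blame")          // per-unit startup cost
	capture("systemd-analyze", "critical-chain") // slowest chain of units
}
```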
I'm removing the release blocker label as this has become an infrastructure investigation. We haven't had the time to investigate yet, but it doesn't seem like it should block the release.
This roachtest variant hasn't failed (or run) in several months, so we don't have any new insights. Thus, I'm closing until it resurfaces or becomes more relevant. |
roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ 6b08842e45668287861af596c28dce58c352d77e:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=aws
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
This test on roachdash
Jira issue: CRDB-32080