roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed [one node IO/CPU starved] #111825

Closed
cockroach-teamcity opened this issue Oct 5, 2023 · 48 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-testeng TestEng Team X-noreuse Prevent automatic commenting from CI test failures
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 5, 2023

roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ 6b08842e45668287861af596c28dce58c352d77e:

(test_runner.go:1150).runTest: test timed out (1h0m0s)
(monitor.go:153).Wait: monitor failure: dial tcp 18.227.107.138:26257: connect: connection refused
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=8/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=aws, ROACHTEST_cpu=8, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-32080

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Oct 5, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Oct 5, 2023
@adityamaru adityamaru self-assigned this Oct 5, 2023
@adityamaru
Contributor

looking into this

@adityamaru
Contributor

Job was 56% done:

905777184646561793      RESTORE RESTORE FROM 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=25000/v22.2.0/inc-count=48/2022/12/20-225543.90?AUTH=implicit' AS OF SYSTEM TIME '2022-12-21T02:00:00Z'              root    {115,132,120,146,123,113,112,143,140,119,133,109,105,116,157,141,138,121,106,129,136,142,156,144,147,118,127,150,139,107,111,114,128,154,152,151,126,135,145,122,131,110,124,137,117,108,125,155,130,153,149,148,134}        running NULL    2023-10-05 07:40:59.305935+00   2023-10-05 07:41:00.002923+00   NULL    2023-10-05 08:42:09.565124+00        0.5623053312301636      NULL            1       414291085975248250      2023-10-05 07:41:00.002924+00   2023-10-05 07:41:30.002924+00   1       NULL    NULL
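
For reference, progress like this can also be pulled interactively instead of parsing the raw jobs row; a minimal sketch, assuming an insecure connection to any node (the 0.5623... value in the row above is the fraction_completed column, i.e. the "56% done"):

$ cockroach sql --insecure -e "SELECT job_id, status, fraction_completed FROM [SHOW JOBS] WHERE job_type = 'RESTORE'"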

@adityamaru
Contributor

The job starts at 07:40, and from a couple of minutes in until the test times out we see slow range RPCs for AddSSTables. Towards the end of the test the reported waits get as high as 400s.

❯ rg 'slow range RPC.*AddSSTable*' only-jobs | wc -l
      16

teamcity-12059861-1696483956-28-n8cpu8-0005> W231005 08:37:53.519417 51649 kv/kvclient/kvcoord/dist_sender.go:1882 ⋮ [T1,Vsystem,n5,f‹981c994f›,job=905777184646561793,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-905777184646561793›] 5810  slow range RPC: have been waiting 473.09s (1 attempts) for RPC AddSSTable [/Table/139/1/‹200000172921039›/‹0›,/Table/139/1/‹200000173048214›/‹2›/‹1›/‹NULL›) to r813:‹/Table/139/1/20000017{1649339-4998955}› [(n5,s5):4, (n7,s7):2, (n4,s4):5, next=6, gen=101, sticky=1696495568.402488951,0]; resp: ‹(err: <nil>), *kvpb.AddSSTableResponse›

Similarly, there are 19 instances of slow AdminSplits and 3 of AdminScatters, all in the 90-100s range.

After the job started at 07:40 we saw an inbox communication error and a retry at 07:45:

teamcity-12059861-1696483956-28-n8cpu8-0001> W231005 07:45:56.669497 12809 ccl/backupccl/restore_job.go:210 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=905777184646561793›] 1465  encountered retryable error: importing 938 ranges: ‹inbox communication error›: grpc: ‹context canceled› [code 1/Canceled]

We now have much better checkpointing to avoid redoing completed work, so this shouldn't have had a major impact on performance, but it's worth noting anyway. We also see a very high number of connection-reset IO retries when reading the backup from the bucket:

❯ rg 'connection reset' only-jobs | wc -l
    1626

This would definitely cause the restore to run longer, since we're retrying a large number of the reads from S3. We also see raft struggling to keep up pretty much immediately after the restore starts: messages like "acquired raft proposal quota after 15.404145179s" climb to almost 30s and are logged constantly.
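
To check whether these retries were skewed toward a single node, a per-file count works; a quick sketch, assuming the per-node logs are the *.log files in the artifacts directory:

$ for f in *.log; do echo "$f $(rg -c 'connection reset' "$f" || echo 0)"; done | sort -k2 -rn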

Overall, it looks like the cluster was pretty overwhelmed and unable to serve KV requests. I will ping KV in case they're interested in giving this cluster a look.

@adityamaru adityamaru added the T-kv KV Team label Oct 5, 2023
@pav-kv
Collaborator

pav-kv commented Oct 11, 2023

Is this related to #111160? In particular, do we see one node spinning at 100% CPU?

Upd: yes, this looks a lot like it. We should check out CPU profiles for the overloaded n4 here.

Screenshot 2023-10-11 at 10 05 51
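
CockroachDB serves the standard Go pprof endpoints over its HTTP port, so a live profile can be pulled straight off the hot node; a sketch, where roachprod's default admin port 26258 and an insecure cluster are assumptions:

$ go tool pprof -http=:8080 'http://<n4-host>:26258/debug/pprof/profile?seconds=30'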

@pav-kv
Collaborator

pav-kv commented Oct 11, 2023

The node spends 50% of its time in raft processing (and much of that is loading/storing sideloaded entries). Not that this looks particularly surprising for this test, though.

Screenshot 2023-10-11 at 10 15 00

It would be nice to compare against a healthy node's profile, but we don't have one here. I can run the test manually and collect some.
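
For the record, reproducing this by hand is a single invocation; a sketch, assuming a built roachtest binary and AWS credentials configured for roachprod:

$ roachtest run 'restore/tpce/400GB/aws/nodes=8/cpus=8' --cloud aws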

@pav-kv
Collaborator

pav-kv commented Oct 11, 2023

What surprises me, though, is how neatly n4 stays right around 2 GiB of memory usage. Is it possible that we somehow got a particularly under-provisioned node? In that case we would OOM, though, instead of backpressuring like this.

Screenshot 2023-10-11 at 10 24 22

@pav-kv pav-kv assigned pav-kv and unassigned adityamaru Oct 11, 2023
@pav-kv pav-kv added the A-kv-replication Relating to Raft, consensus, and coordination. label Oct 11, 2023
@blathers-crl

blathers-crl bot commented Oct 11, 2023

cc @cockroachdb/replication

@pav-kv
Collaborator

pav-kv commented Oct 11, 2023

Writes started at 07:41, and CPU almost immediately spiraled up to 100%.

Screenshot 2023-10-11 at 12 16 32 Screenshot 2023-10-11 at 12 16 41 Screenshot 2023-10-11 at 12 17 02 Screenshot 2023-10-11 at 12 18 41

Counter-intuitively, n4 peaks on every metric, yet processes the fewest SSTs.

The first CPU profile (when usage is above 80%) was captured at 07:41:26. I don't see anything out of the ordinary here; everything takes a fair share of CPU.

Screenshot 2023-10-11 at 12 21 36

@pav-kv
Collaborator

pav-kv commented Oct 11, 2023

One thing stood out in the n4 stacks:

goroutine 19884 [select, 59 minutes]:
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc005960580, {0x79221b0, 0xc009318730}, 0xc004d08360, 0xc00234b500)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:205 +0x7a7
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc005960580, {0x79221b0, 0xc009318730}, 0xc004d08360, 0x66a9158)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:511 +0x3c3
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes(0xc005960580, {0x79221b0?, 0xc009318730?}, 0xc004d08360)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:189 +0x751
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(0xc004d08360?, {0x79221b0?, 0xc009318730?}, 0x8000101?)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:121 +0x25
github.com/cockroachdb/cockroach/pkg/kv/kvserver.sendProbe({0x79221b0, 0xc009318730}, {0x7955830, 0xc005960580})
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:226 +0x14b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaCircuitBreaker).asyncProbe.func1({0x7922258?, 0xc0095e4ff0?})
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:200 +0xf7
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
    github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
    github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

@erikgrinaker are we ever allowed to run this method for that long? It looks like we evaluated the request but never got the result or a cancellation. It's a probe though, and it doesn't hold any locks.

@erikgrinaker
Contributor

We discussed this separately, but this does seem suspect -- we're basically waiting for the reproposal machinery to get this proposal through, and it isn't happening for whatever reason.

@pav-kv
Collaborator

pav-kv commented Oct 12, 2023

In this restore run, there is a periodic spike of proposals (from another node, n1, not the slow n4; n1 is probably the gateway). Every 2.5 minutes there is a spike to 35/s (I wonder why).

Screenshot 2023-10-12 at 09 44 09

Those are not AddSSTs though; AddSSTs are sent at a rate of <2/s.

Screenshot 2023-10-12 at 09 46 07

They are not reproposals either; we only see reproposals on n4:
Screenshot 2023-10-12 at 09 46 55

There are also only a couple of reproposals with a new LAI (NB: this graph shows the absolute counter, not the rate), all from n4:
Screenshot 2023-10-12 at 09 51 47

@pav-kv
Collaborator

pav-kv commented Oct 12, 2023

Running a "happy" run to compare the behaviours. I do see the same periodic spikes from one node, so that's probably just the shape of the workload. There aren't many reproposals, though notably the count isn't zero: it trickles constantly at a rate of a few reproposals per minute.

@pav-kv
Collaborator

pav-kv commented Oct 12, 2023

In the happy run, the Raft portion of the CPU profile is hardly noticeable (and looks similar on the multiple nodes I've checked):

Screenshot 2023-10-12 at 10 30 49

The profile we saw above is almost double this one, and notably has a lot more raft processing.

@pav-kv
Collaborator

pav-kv commented Oct 12, 2023

@sumeerbhola Could you help read this graph (TL;DR from above: n4 is using 100% CPU and is super slow)? Why is the number of slots on n4 significantly higher than on the other nodes? What does that mean? We also see that the used slots on this node spiked twice, even though the CPU was at 100% the whole time.

Screenshot 2023-10-12 at 15 23 41

@erikgrinaker
Contributor

The p50 log commit latency here is 10 seconds. Pretty sure this is just an overloaded disk or something.

Screenshot 2023-10-13 at 12 37 50
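
If an overloaded or underperforming disk is suspected, a quick synthetic check on the node is cheap; a sketch using stock fio, where the /mnt/data1 store mount is an assumption:

$ fio --name=seqwrite --directory=/mnt/data1 --rw=write --bs=1M --size=1G --ioengine=libaio --direct=1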

@pav-kv
Collaborator

pav-kv commented Oct 24, 2023

@cockroachdb/test-eng Is it possible to check whether the VMs from this roachtest run were previously used by other roachtests, and potentially identify those tests? There is a suspicion that this run inherited some leftover cgroup limits.

@renatolabs
Contributor

Is it possible to check whether VMs from this roachtest run were previously used by other roachtests, and potentially identify those tests?

VM (or rather, cluster) reuse is indeed possible: if a test runs on a cluster and we later need to run a test that requires a compatible cluster, the existing cluster may be reused (after a Wipe).

That is not the case in this failure, however. The test runner logs show that the cluster was created for this test and was not reused afterwards:

07:39:17 test_runner.go:687: [w4] Created new cluster for test restore/tpce/400GB/aws/nodes=8/cpus=8: teamcity-12059861-1696483956-28-n8cpu8 (arch="amd64")

08:43:32 test_runner.go:830: [w4] test failed: restore/tpce/400GB/aws/nodes=8/cpus=8 (run 1)
08:43:32 test_runner.go:851: [w4] destroying cluster teamcity-12059861-1696483956-28-n8cpu8 [tag:] (8 nodes) because: restore/tpce/400GB/aws/nodes=8/cpus=8 (1) - (test_runner.go:1150).runTest: test timed out (1h0m0s)
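
For future triage, the create-vs-reuse decision can be grepped out of the runner log directly; a sketch, where the exact wording of the reuse message is an assumption:

$ rg 'Created new cluster|reusing cluster' test_runner.log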

@erikgrinaker erikgrinaker added T-kv-replication and removed T-kv KV Team labels Oct 25, 2023
@blathers-crl

blathers-crl bot commented Oct 25, 2023

cc @cockroachdb/replication

@pav-kv
Collaborator

pav-kv commented Oct 26, 2023

Passing this issue to the @cockroachdb/test-eng team.

We've done an extensive investigation, and the root cause still seems to be an IO-starved (and possibly CPU-starved) node. It would be nice to go a level deeper and understand the VM initialization sequence: how does it come to differ between nodes, is that normal, etc.?

We could also benefit from some observability improvements here:

  • it would be useful to surface the system's constraints in absolute terms somewhere; the metrics we have are relative and at times ambiguous
  • Grafana for AWS tests (roachprod: scrape AWS hosts #97788) would be helpful too (incl. the node_exporter metrics)

I still have artifacts for this issue and #111160, LMK if you need them.

@pav-kv pav-kv removed their assignment Oct 26, 2023
@pav-kv pav-kv added the T-testeng TestEng Team label Oct 26, 2023
@blathers-crl

blathers-crl bot commented Oct 26, 2023

cc @cockroachdb/test-eng

@pav-kv pav-kv added the A-testing Testing tools and infrastructure label Oct 26, 2023
@erikgrinaker
Contributor

erikgrinaker commented Oct 26, 2023

Specifically, we would like to understand the correlation with tmpfs and run-user-1000.mount in #111825 (comment). In both failures, the slow nodes (and only the slow nodes) reported ext4 instead of tmpfs, and did not appear to execute run-user-1000.mount. Is this just an unlikely coincidence, or is it related?

@srosenberg
Member

Specifically, we would like to understand the correlation with tmpfs and run-user-1000.mount in #111825 (comment). In both failures, the slow nodes (and only the slow nodes) reported ext4 instead of tmpfs, and did not appear to execute run-user-1000.mount. Is this just an unlikely coincidence, or is it related?

There is no correlation between the tmpfs and run-user-1000.mount messages seen in the logs. The latter is mounted/unmounted by sshd in response to remote ssh sessions; the discussion in [1] explains why it might be missing. The occurrence of tmpfs in cockroach.log is bogus: it's a bug in getFileSystemProperties [2], which should be fixed to stop emitting misleading messages.

[1] https://unix.stackexchange.com/a/685632
[2] https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1697589992920519?thread_ts=1697024580.180369&cid=C023S0V4YEB

@erikgrinaker
Contributor

The correlation is with slow IO: the node that reported ext4 instead of tmpfs and did not appear to run run-user-1000.mount was also the node with slow IO (and only that node). We have seen this in 2 out of 2 failures (admittedly a small sample size, but suspicious enough to warrant a closer look).

@pav-kv
Collaborator

pav-kv commented Oct 27, 2023

A bit more precisely, it's not that run-user-1000.mount never ran. But in both cases:

  1. The run-user-1000.mount unit did not run before the cockroach process was started.
  2. The "slow" node was started noticeably later than the others.
For completeness, here is the log from run #111160:
$ rg user-1000
8.journalctl.txt
2135:Sep 23 08:53:40 teamcity-11896183-1695446936-33-n10cpu8-0008 systemd[1]: run-user-1000.mount: Succeeded.
2239:Sep 23 09:04:43 teamcity-11896183-1695446936-33-n10cpu8-0008 systemd[1]: run-user-1000.mount: Succeeded.
2303:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0008 systemd[1]: run-user-1000.mount: Succeeded.

10.journalctl.txt
2152:Sep 23 08:53:44 teamcity-11896183-1695446936-33-n10cpu8-0010 systemd[1]: run-user-1000.mount: Succeeded.
2258:Sep 23 09:04:43 teamcity-11896183-1695446936-33-n10cpu8-0010 systemd[1]: run-user-1000.mount: Succeeded.
2322:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0010 systemd[1]: run-user-1000.mount: Succeeded.

6.journalctl.txt
2122:Sep 23 08:53:38 teamcity-11896183-1695446936-33-n10cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.
2224:Sep 23 09:04:43 teamcity-11896183-1695446936-33-n10cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.
2288:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.

1.journalctl.txt
2157:Sep 23 08:53:40 teamcity-11896183-1695446936-33-n10cpu8-0001 systemd[1]: run-user-1000.mount: Succeeded.
2353:Sep 23 09:05:38 teamcity-11896183-1695446936-33-n10cpu8-0001 systemd[1]: run-user-1000.mount: Succeeded.

9.journalctl.txt
2144:Sep 23 08:53:41 teamcity-11896183-1695446936-33-n10cpu8-0009 systemd[1]: run-user-1000.mount: Succeeded.
2246:Sep 23 09:04:42 teamcity-11896183-1695446936-33-n10cpu8-0009 systemd[1]: run-user-1000.mount: Succeeded.
2310:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0009 systemd[1]: run-user-1000.mount: Succeeded.

7.journalctl.txt
2136:Sep 23 08:53:41 teamcity-11896183-1695446936-33-n10cpu8-0007 systemd[1]: run-user-1000.mount: Succeeded.
2238:Sep 23 09:04:41 teamcity-11896183-1695446936-33-n10cpu8-0007 systemd[1]: run-user-1000.mount: Succeeded.
2303:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0007 systemd[1]: run-user-1000.mount: Succeeded.

5.journalctl.txt
2463:Sep 23 09:06:10 teamcity-11896183-1695446936-33-n10cpu8-0005 systemd[1]: run-user-1000.mount: Succeeded.

2.journalctl.txt
2133:Sep 23 08:53:37 teamcity-11896183-1695446936-33-n10cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.
2235:Sep 23 09:04:42 teamcity-11896183-1695446936-33-n10cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.
2299:Sep 23 09:05:24 teamcity-11896183-1695446936-33-n10cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.
2383:Sep 23 09:05:39 teamcity-11896183-1695446936-33-n10cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.

3.journalctl.txt
2144:Sep 23 08:53:39 teamcity-11896183-1695446936-33-n10cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.
2259:Sep 23 09:05:24 teamcity-11896183-1695446936-33-n10cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.
2343:Sep 23 09:05:41 teamcity-11896183-1695446936-33-n10cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.

4.journalctl.txt
2145:Sep 23 08:53:39 teamcity-11896183-1695446936-33-n10cpu8-0004 systemd[1]: run-user-1000.mount: Succeeded.
2257:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0004 systemd[1]: run-user-1000.mount: Succeeded.
2341:Sep 23 09:05:42 teamcity-11896183-1695446936-33-n10cpu8-0004 systemd[1]: run-user-1000.mount: Succeeded.

and the corresponding test.log:

09:05:24 cluster.go:709: test status: 
09:05:24 cluster.go:709: test status: starting cluster
09:05:24 cockroach.go:312: teamcity-11896183-1695446936-33-n10cpu8: starting nodes
09:05:25 cockroach.go:812: teamcity-11896183-1695446936-33-n10cpu8: initializing cluster
09:05:26 cockroach.go:822: warning: --url specifies user/password, but command "init" does not accept user/password details - details ignored
Cluster successfully initialized
09:05:26 cockroach.go:831: teamcity-11896183-1695446936-33-n10cpu8: setting cluster settings
09:05:28 cockroach.go:841: SET CLUSTER SETTING
SET CLUSTER SETTING
09:06:09 restore.go:450: test status: running setup statements
09:06:09 restore.go:462: test status: running restore
09:06:10 monitor.go:177: Monitor event: n8: cockroach process is running (PID: 6994)
09:06:10 monitor.go:177: Monitor event: n9: cockroach process is running (PID: 6983)
09:06:10 monitor.go:177: Monitor event: n10: cockroach process is running (PID: 6994)
09:06:10 monitor.go:177: Monitor event: n6: cockroach process is running (PID: 6913)
09:06:10 monitor.go:177: Monitor event: n7: cockroach process is running (PID: 6986)
09:06:10 monitor.go:177: Monitor event: n2: cockroach process is running (PID: 6922)
09:06:10 monitor.go:177: Monitor event: n1: cockroach process is running (PID: 7217)
09:06:10 monitor.go:177: Monitor event: n3: cockroach process is running (PID: 6998)
09:06:10 monitor.go:177: Monitor event: n4: cockroach process is running (PID: 6982)
09:06:23 monitor.go:177: Monitor event: n5: cockroach process is running (PID: 8487)

NB: node 5 is the only one that didn't see run-user-1000.mount events before the node was started (see timestamps). It was also started noticeably later than all the other nodes (the gap is 13 seconds).

We observe a similar effect in this issue.

The slow node n4 doesn't show run-user-1000.mount at all (the others do):

artifacts/2.journalctl.txt
2173:Oct 05 07:33:35 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.
2312:Oct 05 07:40:35 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.

artifacts/5.journalctl.txt
2154:Oct 05 07:33:33 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[1]: run-user-1000.mount: Succeeded.
2263:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[1]: run-user-1000.mount: Succeeded.

artifacts/6.journalctl.txt
2149:Oct 05 07:33:33 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.
2261:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.

artifacts/1.journalctl.txt
2145:Oct 05 07:33:30 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[1]: run-user-1000.mount: Succeeded.
2334:Oct 05 07:40:34 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[1]: run-user-1000.mount: Succeeded.

artifacts/8.journalctl.txt
2135:Oct 05 07:33:32 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[1]: run-user-1000.mount: Succeeded.
2244:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[1]: run-user-1000.mount: Succeeded.

artifacts/3.journalctl.txt
2158:Oct 05 07:33:32 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.
2267:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.
2351:Oct 05 07:40:37 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.

artifacts/7.journalctl.txt
2139:Oct 05 07:33:34 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[1]: run-user-1000.mount: Succeeded.
2248:Oct 05 07:40:26 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[1]: run-user-1000.mount: Succeeded.

The slow node n4 has a start gap relative to all the other nodes:

07:40:20 cluster.go:713: test status: 
07:40:20 cluster.go:713: test status: starting cluster
07:40:20 cockroach.go:312: teamcity-12059861-1696483956-28-n8cpu8: starting nodes
07:40:21 cockroach.go:814: teamcity-12059861-1696483956-28-n8cpu8: initializing cluster
07:40:22 cockroach.go:824: warning: --url specifies user/password, but command "init" does not accept user/password details - details ignored
Cluster successfully initialized
07:40:22 cockroach.go:833: teamcity-12059861-1696483956-28-n8cpu8: setting cluster settings
07:40:23 cockroach.go:843: SET CLUSTER SETTING
SET CLUSTER SETTING
07:40:49 restore.go:450: test status: running setup statements
07:40:49 restore.go:469: test status: running restore
07:40:50 monitor.go:177: Monitor event: n8: cockroach process is running (PID: 6979)
07:40:50 monitor.go:177: Monitor event: n6: cockroach process is running (PID: 7063)
07:40:50 monitor.go:177: Monitor event: n7: cockroach process is running (PID: 6975)
07:40:50 monitor.go:177: Monitor event: n5: cockroach process is running (PID: 7059)
07:40:50 monitor.go:177: Monitor event: n1: cockroach process is running (PID: 7287)
07:40:50 monitor.go:177: Monitor event: n3: cockroach process is running (PID: 7053)
07:40:50 monitor.go:177: Monitor event: n2: cockroach process is running (PID: 7122)
07:40:55 monitor.go:177: Monitor event: n4: cockroach process is running (PID: 8002)

It looks as though either the node is initially slow, or it wasn't entirely configured before the process started.
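
To eyeball this correlation across nodes in one pass, something like the following works over the artifacts; a sketch, assuming the per-node journalctl files sit in one directory:

$ for f in *.journalctl.txt; do echo "== $f"; rg 'run-user-1000.mount: Succeeded' "$f" | head -1; done

comparing each node's first mount event against its cockroach start time in test.log.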

@pav-kv
Collaborator

pav-kv commented Oct 27, 2023

More evidence that the nodes are not symmetric:

$ rg "Startup finished in"
6.journalctl.txt
1217:Oct 05 07:32:43 ip-10-12-4-138 systemd[838]: Startup finished in 42ms.
2113:Oct 05 07:33:23 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[1]: Startup finished in 2.059s (kernel) + 48.820s (userspace) = 50.879s.
2181:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[6500]: Startup finished in 25ms.
2287:Oct 05 07:40:44 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[6900]: Startup finished in 25ms.

1.journalctl.txt
1209:Oct 05 07:32:42 ip-10-12-3-80 systemd[836]: Startup finished in 39ms.
2105:Oct 05 07:33:20 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[1]: Startup finished in 1.996s (kernel) + 47.348s (userspace) = 49.345s.
2174:Oct 05 07:38:52 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[6500]: Startup finished in 24ms.
2360:Oct 05 07:40:49 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[7495]: Startup finished in 24ms.

8.journalctl.txt
1215:Oct 05 07:32:44 ip-10-12-13-119 systemd[840]: Startup finished in 41ms.
2095:Oct 05 07:33:21 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[1]: Startup finished in 1.960s (kernel) + 46.254s (userspace) = 48.215s.
2164:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[6416]: Startup finished in 24ms.
2270:Oct 05 07:40:47 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[6815]: Startup finished in 24ms.

7.journalctl.txt
1215:Oct 05 07:32:44 ip-10-12-8-116 systemd[840]: Startup finished in 43ms.
2103:Oct 05 07:33:23 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[1]: Startup finished in 1.963s (kernel) + 49.298s (userspace) = 51.261s.
2168:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[6411]: Startup finished in 26ms.
2274:Oct 05 07:40:46 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[6812]: Startup finished in 25ms.

5.journalctl.txt
1216:Oct 05 07:32:44 ip-10-12-1-206 systemd[834]: Startup finished in 41ms.
2114:Oct 05 07:33:22 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[1]: Startup finished in 2.005s (kernel) + 49.150s (userspace) = 51.155s.
2183:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[6494]: Startup finished in 24ms.
2289:Oct 05 07:40:43 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[6896]: Startup finished in 24ms.

2.journalctl.txt
1224:Oct 05 07:32:44 ip-10-12-8-218 systemd[829]: Startup finished in 41ms.
2137:Oct 05 07:33:25 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[1]: Startup finished in 2.348s (kernel) + 51.366s (userspace) = 53.715s.
2202:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[6569]: Startup finished in 26ms.
2338:Oct 05 07:40:49 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[7146]: Startup finished in 26ms.

3.journalctl.txt
1214:Oct 05 07:32:42 ip-10-12-2-85 systemd[840]: Startup finished in 44ms.
2122:Oct 05 07:33:22 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[1]: Startup finished in 1.960s (kernel) + 49.613s (userspace) = 51.573s.
2187:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[6489]: Startup finished in 24ms.
2293:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[6890]: Startup finished in 24ms.
2377:Oct 05 07:40:49 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[7076]: Startup finished in 24ms.

4.journalctl.txt
1239:Oct 05 07:33:32 ip-10-12-9-200 systemd[871]: Startup finished in 286ms.
2271:Oct 05 07:38:55 teamcity-12059861-1696483956-28-n8cpu8-0004 systemd[1]: Startup finished in 7.008s (kernel) + 6min 8.411s (userspace) = 6min 15.419s.

All the other nodes report startup times of ~50s (system) and ~25ms (user session); n4 is significantly slower, at over 6 minutes.

The same thing happens in the other failed run: all nodes completed startup in ~50s, whereas the slow n5 shows this:

5.journalctl.txt
1248:Sep 23 08:54:09 ip-10-12-10-114 systemd[847]: Startup finished in 247ms.
2341:Sep 23 09:03:24 teamcity-11896183-1695446936-33-n10cpu8-0005 systemd[1]: Startup finished in 12.254s (kernel) + 10min 23.007s (userspace) = 10min 35.262s.
2483:Sep 23 09:06:12 teamcity-11896183-1695446936-33-n10cpu8-0005 systemd[8518]: Startup finished in 520ms.

@pav-kv
Collaborator

pav-kv commented Oct 27, 2023

The above happens before the cockroach binary is started, or even copied to the nodes.

@pav-kv pav-kv changed the title roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed [slow IO on one node] roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed [one node IO/CPU starved] Oct 27, 2023
@pav-kv
Collaborator

pav-kv commented Oct 27, 2023

@srosenberg Do we have a way of doing a mega-grep over all recent artifacts (the journalctl files) for AWS tests, to see how frequently this happens, and whether it happens for other tests too (maybe this data-intensive test is just particularly sensitive)?
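
Over a pile of locally downloaded artifacts, the slow-boot signature itself is easy to grep for; a sketch, assuming the journalctl files are laid out under artifacts/:

$ rg --no-filename 'Startup finished in .*min' -g '*journalctl*' artifacts/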

@pav-kv
Collaborator

pav-kv commented Oct 27, 2023

In the artifacts for a couple of AWS roachtests I found in TeamCity, the startup times are (as in this test) under a minute, so I think that's what we should expect pretty consistently. The 6- and 10-minute startups are surprisingly high and likely indicate some problem with those VMs.

Maybe we could add some extra diagnostics with systemd-analyze to our roachtest routines, to help narrow this down the next time such a failure occurs. The tool can even produce SVG/dot images breaking down the whole startup sequence on a timeline. We would then have some artifacts to present to AWS.
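
All of these are stock systemd-analyze subcommands, so collecting them at cluster-setup time should be cheap; a sketch of what such a diagnostic step could dump into the artifacts:

$ systemd-analyze time
$ systemd-analyze blame | head -20
$ systemd-analyze critical-chain
$ systemd-analyze plot > startup-timeline.svg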

@renatolabs
Contributor

I'm removing the release-blocker label, as this has become an infrastructure investigation. We haven't had time to investigate yet, but it doesn't seem like it should block the release.

@renatolabs renatolabs removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Nov 15, 2023
@renatolabs renatolabs added the P-3 Issues/test failures with no fix SLA label Jan 11, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
@srosenberg
Member

This roachtest variant hasn't failed (or run) in several months, so we don't have any new insights. I'm closing this until it resurfaces or becomes more relevant.
