roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed [one node IO/CPU starved] #111825
looking into this
Job was 56% done:
The job starts at 07:40, and from a couple of minutes later until the test times out we see
Similarly, there are 19 instances of slow AdminSplits and 3 of AdminScatter, all in the 90-100s range. After the job started at 07:40, we saw an inbox communication error and a retry at 07:45:
We do now have much better checkpointing to avoid redoing completed work, so this shouldn't have a major impact on performance, but it's worth noting anyway. We have a very high number of connection-reset IO retries when reading the backup from the bucket:
This would definitely cause the restore to run longer, since we're retrying a large number of the reads from S3. We also generally see raft struggling to keep up pretty much immediately after the restore starts (messages like …). Overall, it looks like the cluster was pretty overwhelmed and unable to serve KV requests. I will ping KV in case they're interested in giving this cluster a look.
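For reference, the retry behavior described here amounts to a capped exponential backoff around each object-store read. Below is a minimal sketch of that pattern; the helper names (`readChunk`, `isConnReset`) are illustrative and not the actual cockroach cloud-storage code:

```go
// Hypothetical sketch: retry a flaky object-store read with capped
// exponential backoff, treating "connection reset" errors as retryable.
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

func isConnReset(err error) bool {
	return err != nil && strings.Contains(err.Error(), "connection reset by peer")
}

func readWithRetry(readChunk func() ([]byte, error), maxAttempts int) ([]byte, error) {
	backoff := 100 * time.Millisecond
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		data, err := readChunk()
		if err == nil {
			return data, nil
		}
		if !isConnReset(err) {
			return nil, err // non-retryable error: fail fast
		}
		lastErr = err
		time.Sleep(backoff)
		if backoff < 5*time.Second {
			backoff *= 2 // cap the backoff growth
		}
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	attempts := 0
	data, err := readWithRetry(func() ([]byte, error) {
		attempts++
		if attempts < 3 {
			return nil, errors.New("read tcp: connection reset by peer")
		}
		return []byte("sst chunk"), nil
	}, 5)
	fmt.Printf("data=%q err=%v attempts=%d\n", data, err, attempts)
}
```

Even with backoff, a high connection-reset rate adds up across many SST reads, which would line up with the longer restore runtime observed here.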
Is this related to #111160? In particular, do we see one node spinning at 100% CPU? Update: yes, this looks a lot like it. We should check out CPU profiles for the overloaded node.
cc @cockroachdb/replication
One thing stood out in
@erikgrinaker are we ever allowed to run this method for that long? It looks like we evaluated the thing but never got the result or a cancellation. It's a probe though, and it also doesn't hold any locks.
We discussed this separately, but this does seem suspect -- we're basically waiting for the reproposal machinery to get this proposal through, and it isn't happening for whatever reason.
Running a "happy" run to compare the behaviours. I do see the same periodic spikes from one node, so that's probably just the shape of the workload. There aren't many reproposals, though notably the count isn't zero and trickles constantly at a rate of a few reproposals per minute.
In the happy run, the Raft portion of the CPU profile is hardly noticeable (it looks similar on the multiple nodes I've looked at). The profile we've seen above is almost double this one, and noticeably has a lot more raft processing.
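For anyone wanting to repeat this comparison, here is a small sketch for pulling a 30-second CPU profile off a node; it assumes the node's HTTP port (8080 by default) is reachable and that the standard Go `/debug/pprof/profile` endpoint is exposed there:

```go
// Sketch: fetch a 30-second CPU profile from a node's /debug/pprof endpoint
// and write it to disk for inspection with `go tool pprof`.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	host := "localhost:8080" // assumed HTTP port of the suspect node
	if len(os.Args) > 1 {
		host = os.Args[1]
	}
	url := fmt.Sprintf("http://%s/debug/pprof/profile?seconds=30", host)

	client := &http.Client{Timeout: 60 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("cpu.pb.gz")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote cpu.pb.gz; inspect with: go tool pprof cpu.pb.gz")
}
```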
@sumeerbhola Could you help read this graph? (The TL;DR from above is that
@cockroachdb/test-eng Is it possible to check whether VMs from this roachtest run were previously used by other roachtests, and potentially identify those tests? There is a suspicion that this run inherited some leftover |
VM (or rather, cluster) reuse is indeed possible: if a test runs on a cluster and we later need to run a test that requires a cluster compatible with that existing one, it might be reused (after …). This is not the case in this failure, however. If we check the test runner logs, we see that the cluster was created for this test and was not reused afterwards:
cc @cockroachdb/replication
Passing this issue to the @cockroachdb/test-eng team. We've done an extensive investigation, and the root cause still seems to be an IO-constrained (possibly also CPU-constrained) node. It would be nice to get a level below and understand the VM initialization sequence: how does it end up differing between nodes, is that normal, etc.? We could also benefit from some observability improvements here:
I still have artifacts for this issue and #111160, LMK if you need them.
cc @cockroachdb/test-eng |
Specifically, we would like to understand the correlation with |
There is no correlation between …

[1] https://unix.stackexchange.com/a/685632
The correlation is with slow IO: the node which reported ext4 instead of tmpfs and which did not appear to run
A bit more precisely, it's not that
For completeness, here is the log from run #111160
and the corresponding
NB: node 5 is the only one that didn't see … We observe a similar effect in this issue. The slow node
Slow node
It looks as though either the node is initially slow, or maybe it wasn't entirely configured before the process started.
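One cheap sanity check along these lines (a hypothetical helper, not something roachtest does today as far as I know) would be to verify the filesystem type backing the data directory before the process starts, by scanning /proc/mounts; the `/mnt/data1` mount point below is an assumption:

```go
// Hypothetical pre-flight check: report the filesystem type backing a given
// mount point by scanning /proc/mounts, to catch a node where the expected
// mount (e.g. tmpfs vs ext4) was never set up.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// fsTypeOf returns the filesystem type of the mount whose mount point matches
// path exactly, or "" if no such mount exists.
func fsTypeOf(path string) (string, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// /proc/mounts format: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 3 && fields[1] == path {
			return fields[2], nil
		}
	}
	return "", scanner.Err()
}

func main() {
	const dataDir = "/mnt/data1" // assumed data directory mount point
	fsType, err := fsTypeOf(dataDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s is backed by %q\n", dataDir, fsType)
}
```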
More evidence that the nodes are not symmetric:
All nodes report startup times of ~50s, and ~25ms. Node … The same thing happens in the other failed run: all nodes completed the startup in ~50s, whereas the slow node
The above happens before the cockroach binary is started, or even copied to the nodes.
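One way to surface this automatically would be to compare per-node startup durations against the cluster median and flag outliers. A minimal sketch, using rough numbers in the spirit of this thread (~50s for healthy nodes vs. ~10 min for the slow one) and assuming the durations have already been extracted from the runner logs:

```go
// Sketch: flag nodes whose startup duration is far above the cluster median.
// The durations below are illustrative, not taken from actual artifacts.
package main

import (
	"fmt"
	"sort"
	"time"
)

func median(ds []time.Duration) time.Duration {
	s := append([]time.Duration(nil), ds...)
	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	return s[len(s)/2]
}

func main() {
	startup := map[string]time.Duration{
		"n1": 52 * time.Second, "n2": 49 * time.Second, "n3": 51 * time.Second,
		"n4": 50 * time.Second, "n5": 10 * time.Minute, "n6": 48 * time.Second,
		"n7": 53 * time.Second, "n8": 50 * time.Second,
	}

	var all []time.Duration
	for _, d := range startup {
		all = append(all, d)
	}
	med := median(all)

	// Flag anything more than 3x the median as a likely unhealthy VM.
	for node, d := range startup {
		if d > 3*med {
			fmt.Printf("%s: startup %s is >3x the median (%s); VM likely unhealthy\n", node, d, med)
		}
	}
}
```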
@srosenberg Do we have some way of doing a mega-grep on all the recent artifacts (the |
In the artifacts for a couple of AWS roachtests that I found in TeamCity, the startup times are (as for this test) under a minute. I think this is what we should expect pretty consistently. The 6 and 10 min startups are surprisingly high and likely indicate some problem with this VM. Maybe we could add some extra diagnostics with
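As one possible form of such diagnostics (my assumption, not an agreed-upon plan), we could capture `systemd-analyze` output on each node before the workload starts, so a slow boot would be visible in the artifacts; a sketch:

```go
// Sketch: capture boot-time diagnostics on a node using systemd-analyze,
// so slow VM initialization shows up in the test artifacts.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// capture runs a command and prints its combined output, mimicking what a
// diagnostics step could append to the per-node artifacts.
func capture(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		fmt.Fprintf(os.Stderr, "%s %v failed: %v\n", name, args, err)
	}
	fmt.Printf("$ %s %v\n%s\n", name, args, out)
}

func main() {
	capture("systemd-analyze", "time")           // overall boot time breakdown
	capture("systemd-analyze", "blame")          // per-unit startup cost
	capture("systemd-analyze", "critical-chain") // slowest chain of units
}
```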
I'm removing the release blocker label as this has become an infrastructure investigation. We haven't had the time to investigate yet, but it doesn't seem like it should block the release.
This roachtest variant hasn't failed (or run) in several months, so we don't have any new insights. Thus, I'm closing until it resurfaces or becomes more relevant. |
roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ 6b08842e45668287861af596c28dce58c352d77e:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=aws
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
This test on roachdash
Jira issue: CRDB-32080