roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [self-delegated snaps] #72083
https://share.polarsignals.com/73a06c8/ @erikgrinaker this seems to be something we should be looking into more actively. It is "sort of" expected that we're seeing lots of memory held up by sideloaded proposals; after all, this phase of the test mostly crams lots of SSTs into our log and then asks us to send them to two followers, who are possibly also a region hop away. But something seems to have changed: we didn't use to see this, #71132 hasn't prevented it from happening, and I looked before and couldn't find any other obvious leaks.

So currently I expect that we'll find we happen to have a lot of groups catching up followers at once, overwhelming the system. If that is the case, it would be difficult to even think of a quick fix. We would need to either delay adding new entries to the log or delay sending entries to followers. The latter happens inside of raft, so the easier choice is the former. Then the question becomes: do we apply it to SSTs only, or to all proposals? SSTs are the easier target since there is already a concept of delaying them, plus they are not that sensitive to it. But first we need to confirm that what I'm describing is really what we're seeing.
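To make the "delay adding new entries to the log" option a bit more concrete, here is a minimal sketch (not CockroachDB code; the gate, its names, and the budget size are all assumptions) of gating AddSSTable proposals on a shared byte budget with `golang.org/x/sync/semaphore`, so that new SST proposals block above raft while earlier ones are still being replicated:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// sstGate bounds the total size of sideloaded (AddSSTable) proposals in
// flight. Acquire blocks until earlier SSTs have been replicated and their
// budget has been returned via Release.
type sstGate struct {
	sem *semaphore.Weighted
}

func newSSTGate(maxBytes int64) *sstGate {
	return &sstGate{sem: semaphore.NewWeighted(maxBytes)}
}

// Acquire waits for size bytes of budget, or returns early when ctx is done.
func (g *sstGate) Acquire(ctx context.Context, size int64) error {
	return g.sem.Acquire(ctx, size)
}

// Release returns budget once the SST has been appended and sent to followers.
func (g *sstGate) Release(size int64) {
	g.sem.Release(size)
}

func main() {
	g := newSSTGate(128 << 20) // allow at most 128 MiB of SSTs in flight

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	if err := g.Acquire(ctx, 32<<20); err != nil {
		fmt.Println("proposal delayed too long:", err)
		return
	}
	fmt.Println("proposing 32 MiB SST")
	g.Release(32 << 20) // once replication of this entry has caught up
}
```

The delay happens before the entry is appended to the log, i.e. above raft, which is the "easier choice" described above.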
Yeah, this seems bad. We seem to be enforcing per-range size limits that should mostly prevent this, so I agree that this seems likely to be because we're catching up many groups at once. Would it be worth bisecting this to find out what triggered it?
Hard to say; it sure would be nice to know the commit, if there is one. On the other hand, it would likely be extremely painful. I think I used to do hundreds of runs when working on #69414, though, and never saw the OOM there. This was based on ab1fc34, so I think that would be our "good" commit (though it has the inconsistency). Now, when did I first see this OOM? I think it was in #71050. Note that this isn't the exact same OOM (the memory is held in the inefficiency fixed in #71132), but I think it's still the same issue. Hmm, maybe it's fine? Really depends on how clean the repro loop is. I think we should run

edit: test balloon launched, https://teamcity.cockroachdb.com/viewLog.html?buildId=3683316&
Ok, the roachstress-CI thing seems to work. Going to log the bisect here and update as I make progress. I'm using
d1231cf (confirming starting bad commit): https://teamcity.cockroachdb.com/viewQueued.html?itemId=3683412; we expect this to produce the failure.
Hmm, so stressing this test (…)
Screw it, going to try stressing tpccbench as-is. I don't have it in me to patch each commit to just do the import, etc.; let's see what we get.
Oops, that was the old test again. Ok, here for reals, first bad commit: `BRANCH=release-21.2 SHA=d1231cff60125b397ccce6c79c9aeea771cdcca4 TEST=tpccbench/nodes=9/cpu=4/multi-region COUNT=50 ~/roachstress-ci.sh`
They all passed, too. We were supposed to see an OOM here.
Interesting, I suppose there must have been aggravating circumstances in the initial failure -- perhaps a failure mode that caused concurrent …

I had a look at the debug.zip and noticed that we have several nodes with ~200 outbound snapshots in progress concurrently. All of these appear to come via …
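For a sense of what a cap on concurrent outbound snapshots could look like, here is a minimal sketch (hypothetical names, not the actual kvserver code) that hands out a fixed number of slots and turns senders away once they are all taken, rather than letting ~200 snapshots pile up on one node:

```go
package main

import (
	"errors"
	"fmt"
)

// snapshotSlots caps the number of outbound snapshots a node sends at once.
// TryAcquire is non-blocking: if every slot is taken, the caller should retry
// the snapshot later instead of queueing up more memory.
type snapshotSlots struct {
	tokens chan struct{}
}

func newSnapshotSlots(max int) *snapshotSlots {
	s := &snapshotSlots{tokens: make(chan struct{}, max)}
	for i := 0; i < max; i++ {
		s.tokens <- struct{}{}
	}
	return s
}

var errTooManySnapshots = errors.New("too many outbound snapshots in progress")

func (s *snapshotSlots) TryAcquire() error {
	select {
	case <-s.tokens:
		return nil
	default:
		return errTooManySnapshots
	}
}

func (s *snapshotSlots) Release() {
	s.tokens <- struct{}{}
}

func main() {
	slots := newSnapshotSlots(2) // allow at most 2 concurrent outbound snapshots

	for i := 0; i < 3; i++ {
		if err := slots.TryAcquire(); err != nil {
			fmt.Printf("snapshot %d: %v\n", i, err) // the third attempt is rejected
			continue
		}
		fmt.Printf("snapshot %d: sending\n", i)
	}
}
```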
Just for the record, if we wanted to limit the size of the messages, we'd have to work something down into raft, onto this line: https://github.com/cockroachdb/vendored/blob/master/go.etcd.io/etcd/raft/v3/raft.go#L435. Instead of a fixed maxMsgSize, we would need to pass an interface that dynamically limits the budget, i.e. something like

```go
limiter interface {
	Request(size uint64) bool
}
```

and if the limiter returns false, we don't send anything else. The main new thing that comes out of this is that …
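On the CockroachDB side, such a limiter could be backed by a shared byte budget. Here is a minimal sketch of one possible implementation (the `Release` method and all names are assumptions; raft today only takes a fixed max message size):

```go
package main

import (
	"fmt"
	"sync"
)

// budgetLimiter tracks a fixed byte budget shared by all ranges. Bytes are
// reserved as entries are handed to raft for sending and returned once the
// corresponding message has been sent (or dropped).
type budgetLimiter struct {
	mu       sync.Mutex
	capacity uint64 // total bytes allowed in flight
	inFlight uint64 // bytes currently handed out
}

// Request reserves size bytes if the budget allows it. Returning false tells
// the caller (raft, in the proposal above) to stop appending entries to the
// outgoing message.
func (l *budgetLimiter) Request(size uint64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight+size > l.capacity {
		return false
	}
	l.inFlight += size
	return true
}

// Release returns bytes to the budget once the message has left the process.
func (l *budgetLimiter) Release(size uint64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if size > l.inFlight {
		size = l.inFlight
	}
	l.inFlight -= size
}

func main() {
	l := &budgetLimiter{capacity: 64 << 20} // 64 MiB in flight at most
	fmt.Println(l.Request(32 << 20))        // true
	fmt.Println(l.Request(48 << 20))        // false: would exceed the budget
	l.Release(32 << 20)
	fmt.Println(l.Request(48 << 20)) // true again
}
```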
Last failure is [perm denied #72635]
@andrewbaptist @erikgrinaker Thanks for the heads-up. The |
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on master @ 34dc56fbb5789b39be47b110bf22332c7f5654f6:
Same failure on other branches
87533: sqlliveness: add timeouts to heartbeats r=ajwerner a=aadityasondhi
Previously, sqlliveness heartbeat operations could block on the transactions that were involved. This change introduces some timeouts of the length of the heartbeat during the create and refresh operations. Resolves #85541. Release note: None. Release justification: low-risk bugfix to existing functionality.

88293: backupccl: elide expensive ShowCreate call in SHOW BACKUP r=stevendanna a=adityamaru
In #88376 we see the call to `ShowCreate` taking ~all the time on a cluster with 2.5K empty tables. In all cases except `SHOW BACKUP SCHEMAS` we do not need to construct the SQL representation of the table's schema. This results in a marked improvement in the performance of `SHOW BACKUP`, as can be seen in #88376 (comment). Fixes: #88376. Release note (performance improvement): `SHOW BACKUP` on a backup containing several table descriptors is now more performant.

88471: sql/schemachanger: plumb context, check for cancelation sometimes r=ajwerner a=ajwerner
Fixes #87246. This will also improve tracing. Release note: None.

88557: testserver: add ShareMostTestingKnobsWithTenant option r=msbutler a=stevendanna
The new ShareMostTestingKnobs copies nearly all of the testing knobs specified for a TestServer to any tenant started for that server. The goal here is to make it easier to write tests that depend on testing hooks that work under probabilistic tenant testing. Release justification: non-production code change. Release note: None.

88562: upgrade grpc to v1.49.0 r=erikgrinaker a=pavelkalinnikov
Fixes #81881. Touches #72083. Release note: upgraded grpc to v1.49.0 to fix a few panics that the old version caused.

88568: sql: fix panic due to missing schema r=ajwerner a=ajwerner
A schema might not exist because it has been dropped. We need to mark the lookup as required. Fixes #87895. Release note (bug fix): Fixed a bug in pg_catalog tables which could result in an internal error if a schema is concurrently dropped.

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <aadityas@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
It looks like the issue is that admin scatter takes ~1 hour on these tests. Looking at runs that succeeded and failed, they all take on the order of 55+ minutes to complete this step; running it manually twice also confirmed this number. I'm planning to change the test timeout here: `cockroach/pkg/workload/cli/run.go`, line 428 (at 682b303).
Relates to #72083. Allow scatter to complete. Release note: None
The issue was that scatter was taking between 55 minutes and 1 hour 18 minutes to complete (based on running the test 10 times). Nothing was hung, however, and the tests all completed successfully after bumping the timeout. Since this is a test timing issue only, it would make sense to backport this to 22.2 (and probably also remove the release-blocker label).
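For illustration, a prepare step guarded by a 90-minute timeout could look roughly like the sketch below (a sketch only; the constant, the helper, and their names are made up, while the actual change is the timeout bump in `pkg/workload/cli/run.go`):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// prepareTimeout bounds the workload "prepare" phase, which includes the
// admin scatter. The runs above put scatter at 55 to 78 minutes, so 90
// minutes leaves headroom without letting a truly hung run block forever.
const prepareTimeout = 90 * time.Minute

// runPrepare is a stand-in for the prepare step (import + scatter).
func runPrepare(ctx context.Context, prepare func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, prepareTimeout)
	defer cancel()

	if err := prepare(ctx); err != nil {
		if errors.Is(ctx.Err(), context.DeadlineExceeded) {
			return fmt.Errorf("prepare did not finish within %s: %w", prepareTimeout, err)
		}
		return err
	}
	return nil
}

func main() {
	err := runPrepare(context.Background(), func(ctx context.Context) error {
		// Pretend the scatter finishes quickly in this toy example.
		select {
		case <-time.After(10 * time.Millisecond):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
	fmt.Println("prepare result:", err)
}
```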
Do we know why this started failing a month ago? Is this now expected behavior, or is the slowdown pathological?
The best I can tell is that the test was working consistently until about May 26th. After that, it didn't run again until July 8th (although I'm not sure why). From July 8th until now, it has been failing about half the time. I tried to see if there was any related reason for this, but there have been a lot of changes between then and now. Even when it was succeeding before, it ran for about the same time as now (4-5 hours total), so I'm not sure what exactly changed.
Ok, thanks. If we're sure the 1 hour+ scatter times here are expected, then I suggest we close this out with the timeout bump and deal with any new failures separately. Thanks for looking into this!
Sounds good - merging this timeout change. This is definitely a good test to have running consistently. It is possible that the scatter times have gotten slightly worse, but scatter was completely rewritten about 6-9 months ago, so fixes that went into it over the past few months may be related. It is also a strange operation that should be re-examined at some point in the near future, as it runs "out-of-band" of other things.
88550: kvserver: use execution timestamps for verification when available r=erikgrinaker a=tbg
Now that "most" operations save their execution timestamps, use them for verification. This has the undesirable side effect of failing the entire test suite, which didn't bother specifying timestamps for most operations. Now they are required, and need to be present, at least for all mutations. I took the opportunity to also clean up the test helpers a bit, so now we don't have to pass an `error` when it's not required. The big remaining caveat is that operations that return with an ambiguous result don't necessarily have a commit timestamp. I *think* this is only an implementation detail. We *could* ensure that `AmbiguousResultError` always contains the one possible commit timestamp. This should work since `TxnCoordSender` is always local to `kvnemesis`, and so there's no "fallible" component between the two. This would result in a significant simplification of `kvnemesis`, since as is, when there are ambiguous deletions, we have to materialize them but cannot assign them a timestamp. This complicates various code paths, and to be honest I'm not even sure what exactly we verify and how it all works when there are such "half-materialized" writes. I would rather do away with the concept altogether. Clearly we also won't be able to simplify the verification to simply use commit order if there are operations that don't have a timestamp, which is another reason to keep pushing on this. Release note: None

88641: workload: Bump prepare timeout to 90 minutes r=aayushshah15 a=andrewbaptist
Relates to #72083. Allow scatter to complete. Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com>
Two follow-up questions,
These are great questions, and unfortunately I don't know the full answer to either of them.
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on master @ d91fead28392841a943251842fbd43a0affb2eca:
See: roachtest README
See: How To Investigate (internal)
Same failure on other branches
Jira issue: CRDB-10940