roachtest: clearrange/zfs/checks=true failed #68303
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ eef03a46f2e43ff70485dadf7d9ad445db05cab4:
Reproduce
See: roachtest README
See: CI job to stress roachtests
For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 6b8d59327add74cf1342345fb3eaffc3a3e765d2:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 50ef2fc205baa65c5a740c2d614fe1de279367e9:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ cab185ff71f0924953d987fe6ffd14efdd32a3a0:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 847514dab6354d4cc4ccf7b2857487b32119fb37:
These are failing sporadically during the "import" workload. Looking into it.
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 90809c048d05f923a67ce9b89597b2779fc73e32:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 0880e83e30ee5eb9aab7bb2297324e098d028225:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 7897f24246bef3cb94f9f4bfaed474ecaa9fdee6:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 11e0a4da82124e70e772a009011ca7a4007bff85:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ d18da6c092bf1522e7a6478fe3973817e318c247:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 61bd543ba7288c8f0eed6cddded7b219c9d1fcd4:
roachtest.clearrange/checks=true failed with artifacts on master @ 8cae60f603ccc4d83137167b3b31cab09be9d41a:
Same failure on other branches
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 8cae60f603ccc4d83137167b3b31cab09be9d41a:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 44ea1fa0eba8fc78544700ef4afded62ab98a021:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 0b57dc40deda1206d9a1c215ffdb219bbf182a39:
roachtest.clearrange/checks=true failed with artifacts on master @ c1ef81f5f435b3cc5bdf8b218532e0779f03a6bf:
Same failure on other branches
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 15b773c71f92d643795e34c922717fde0447f9cd:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 42e5f9492d0d8d93638241303bca984fe78baae3:
Nice job extracting all of that, Nick! That goroutine is interesting because it is not stuck, just slow.

Could this slowness be explained by unexpectedly high consistency checker concurrency? All consistency checks on a single node will share the same rate limiter.

@tbg, would any of the replication circuit breaker work have led to the consistency checker queue detaching its context from an ongoing consistency check and moving on without the consistency check being canceled? If so, could this explain why we have more consistency checks running concurrently than the individual queue would allow?
Should have also mentioned that the stacks from above were taken from the latest build.

Here's a look at what we're calling the "bad" sha (i.e. 6664d0c). Same problem, just much more pronounced.

LSM state (on worst node):

Store 6:
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 1 6.5 M - 220 M - - - - 221 M - - - 1.0
0 0 0 B 0.00 215 M 685 M 137 0 B 0 52 M 72 0 B 0 0.2
1 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
2 7 25 M 0.57 39 M 0 B 0 0 B 0 96 M 28 97 M 1 2.4
3 21 103 M 0.33 252 M 1.2 G 96 31 M 12 309 M 113 329 M 1 1.2
4 355 4.3 G 1.00 1.2 G 21 G 1.7 K 638 M 100 1.6 G 318 1.6 G 1 1.4
5 2171 29 G 1.00 7.2 G 108 G 7.1 K 11 G 1.0 K 11 G 1.0 K 11 G 1 1.5
6 2803 98 G - 97 G 1.6 G 125 36 M 8 169 G 5.9 K 169 G 1 1.7
total 5357 131 G - 132 G 132 G 9.1 K 12 G 1.1 K 314 G 7.4 K 182 G 5 2.4
flush 36
compact 4326 66 G 4.6 M 1 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
ctype 3159 5 29 1129 4 (default, delete, elision, move, read)
memtbl 1 64 M
zmemtbl 13 280 M
ztbl 6294 105 G
bcache 220 K 3.2 G 20.3% (score == hit-rate)
tcache 10 K 6.5 M 97.8% (score == hit-rate)
titers 2643
filter - - 98.2% (score == utility)

530 goroutines computing the consistency checks:

0 | quotapool | quotapool.go:281 | (*AbstractPool).Acquire(#1309, {#4, *}, {#134, *})
1 | quotapool | int_rate.go:59 | (*RateLimiter).WaitN(#307, {#4, *}, *)
2 | kvserver | replica_consistency.go:581 | (*Replica).sha512.func1({{*, *, *}, {*, 0, 0}}, {*, *, *})
3 | storage | mvcc.go:3902 | ComputeStatsForRange({#147, *}, {*, *, *}, {*, *, *}, 0, {*, ...})
4 | kvserver | replica_consistency.go:636 | (*Replica).sha512(*, {#4, *}, {*, {*, *, 8}, {*, *, 8}, ...}, …)
5 | kvserver | replica_proposal.go:247 | (*Replica).computeChecksumPostApply.func1.1({#156, *}, {{*, *, *, *, *, *, *, *, ...}, ...}, …)
6 | kvserver | replica_proposal.go:253 | (*Replica).computeChecksumPostApply.func1({#4, *})
7 | stop | stopper.go:488 | (*Stopper).RunAsyncTaskEx.func2()
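To make concrete why a single shared limiter plus many in-flight checks produces stacks like the one above, here's a toy sketch of my own, using `golang.org/x/time/rate` instead of CockroachDB's `quotapool` and with made-up sizes and rates: N checks splitting one fixed byte budget each take roughly N times as long, so slowness and concurrency reinforce each other.

```go
// Toy model of the pile-up, NOT CockroachDB code: many checksum computations
// on one node draining a single shared rate limiter. Sizes and rates are
// made up for illustration.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

const (
	bytesPerCheck = 1 << 20  // pretend each consistency check scans 1 MiB
	chunkSize     = 64 << 10 // paced in 64 KiB chunks, mirroring the WaitN frame above
)

// runCheck stands in for a single checksum computation: it scans the range,
// waiting on the shared node-wide limiter for every chunk.
func runCheck(ctx context.Context, lim *rate.Limiter) {
	for scanned := 0; scanned < bytesPerCheck; scanned += chunkSize {
		if err := lim.WaitN(ctx, chunkSize); err != nil {
			return // e.g. ctx canceled
		}
	}
}

func main() {
	lim := rate.NewLimiter(8<<20, chunkSize) // shared budget: 8 MiB/s, 64 KiB burst

	for _, concurrent := range []int{1, 8} {
		start := time.Now()
		var wg sync.WaitGroup
		for i := 0; i < concurrent; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				runCheck(context.Background(), lim)
			}()
		}
		wg.Wait()
		fmt.Printf("%d concurrent check(s) finished in %.2fs\n",
			concurrent, time.Since(start).Seconds())
	}
}
```

Under those made-up numbers, a lone check finishes in a fraction of a second while eight concurrent ones each take roughly eight times as long, which is the shape of the slowdown being discussed here.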
On the leaseholder, `ctx` passed to `computeChecksumPostApply` is that of the proposal. As of #71806, this context is canceled right after the corresponding proposal is signaled (and the client goroutine returns from `sendWithRangeID`). This effectively prevents most consistency checks from succeeding (they previously were not affected by higher-level cancellation because the consistency check is triggered from a local queue that talks directly to the replica, i.e. had something like a minutes-long timeout).

This caused disastrous behavior in the `clearrange` suite of roachtests. That test imports a large table. After the import, most ranges have estimates (due to the ctx cancellation preventing the consistency checks, which as a byproduct trigger stats adjustments) and their stats claim that they are very small. Before recent PR #74674, `ClearRange` on such ranges would use individual point deletions instead of the much more efficient pebble range deletions, effectively writing a lot of data and running the nodes out of disk.

Failures of `clearrange` with #74674 were also observed, but they did not involve out-of-disk situations, so are possibly an alternative failure mode (that may still be related to the newly introduced presence of context cancellation).

Touches #68303.

Release note: None
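To illustrate the `ClearRange` consequence the commit message describes, here's a hedged sketch; the type, function, and threshold names are hypothetical stand-ins rather than the real code path:

```go
// Hedged sketch of the ClearRange trade-off described in the commit message.
// The names, types, and threshold below are hypothetical, not the actual
// CockroachDB code paths.
package main

import "fmt"

// rangeStats mimics MVCC stats that may be mere estimates, and badly wrong
// once the consistency checks that would normally recompute them are canceled.
type rangeStats struct {
	LiveBytes int64
}

const pointDeleteThreshold = 512 << 10 // hypothetical cutoff

// clearStrategy picks between per-key tombstones and a single range tombstone
// based on the (possibly estimated) stats.
func clearStrategy(stats rangeStats) string {
	if stats.LiveBytes <= pointDeleteThreshold {
		// If the estimates claim the range is tiny, point deletes look cheap.
		// For a huge range with underestimated stats, this branch writes one
		// tombstone per key and can run the node out of disk.
		return "point deletions"
	}
	return "pebble range deletion"
}

func main() {
	fmt.Println(clearStrategy(rangeStats{LiveBytes: 100 << 10})) // point deletions
	fmt.Println(clearStrategy(rangeStats{LiveBytes: 64 << 30}))  // pebble range deletion
}
```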
Last update for the evening. Spent the remainder of today looking less at the
Sampling some commits, I'm noticing the following "good" vs. "bad" behavior:
* Good: fair allocation of replicas across all nodes.
* Bad: some nodes run out of disk and stall the import.

I started a bisect, but it was taking some time. I'll pick this up again tomorrow.
I think I see what the problem is. I had actually thought about it before, but erroneously convinced myself that it wasn't an issue.

Here's what it looks like on the leaseholder on the bad sha (i.e. ctx cancels): the checksum computation is torn down as soon as the proposal's ctx is canceled, so no concurrency builds up there. But the computation also happens on each follower, and there it will not have a cancelable context associated with it, so it keeps running.
So basically the problem is that if a consistency check fails fast on the leader, this doesn't cancel the in-flight computation on the follower. Since each node is a follower for lots of ranges, we had tons of consistency checks running on each node.

What's curious is that when I ran this experiment I should've seen lots of snapshots open, but I didn't; maybe my instrumentation was wrong, or the test never got to the point where it exhibited this problem (the graceful shutdowns I introduced after the import hung, I think).

With the cancellation fix, we're close to the previous behavior. The only difference is that previously, the computation on the leaseholder was canceled when the consistency checker queue gave up. But, like before, this wouldn't affect the followers if they still had the computation ongoing.

I think this might put a pin in the high ztbl count, right? Thanks for all of the work getting us here @nicktrav!
How are you bisecting, btw? Are you going through all 319 commits in cd1093d...8eaf8d2? It sounds as though the strategy for each bisection attempt would be to cherry-pick 71f0b34 on top, but are we even sure this is "good" for any of the commits in that range?
This makes a lot of sense. I'll still suggest that we should think carefully about whether the client ctx cancellation is the root of the problem, or whether it's actually d064059. The ability for a client ctx cancellation to propagate to Raft log application on the proposer replica seems like a serious problem to me. It breaks determinism, the cornerstone of the whole "replicated state machine" idea. I'm actually surprised this hasn't caused worse issues, like a short-circuited split on the proposer. We must just not currently check for context cancellation in many places below Raft.
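As a toy illustration of the determinism hazard (my own sketch, not how command application is actually structured): if the apply step consults a context that only the proposer sees as canceled, the replicas' state machines diverge.

```go
// Toy illustration of the determinism hazard, not how command application is
// actually structured: the same command applied on two replicas diverges if
// only one of them observes a canceled client context.
package main

import (
	"context"
	"fmt"
)

// applyCommand stands in for a state machine transition during Raft log
// application. If it consults the context, replicas can disagree.
func applyCommand(ctx context.Context, state map[string]string, k, v string) {
	if ctx.Err() != nil {
		return // skipped on this replica only
	}
	state[k] = v
}

func main() {
	proposerCtx, cancel := context.WithCancel(context.Background())
	cancel() // the client gave up; only the proposer sees this cancellation

	proposer := map[string]string{}
	follower := map[string]string{}

	applyCommand(proposerCtx, proposer, "k", "v")
	applyCommand(context.Background(), follower, "k", "v")

	fmt.Println(len(proposer), len(follower)) // 0 1: the replicas no longer agree
}
```

The real apply loop has to produce the same result on every replica for every entry, which is why cancellation checks below Raft are so dangerous.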
I fully agree with Nathan here. That commit was motivated by propagating tracing information through command application, but it should not propagate cancellation signals.
roachtest.clearrange/checks=true failed with artifacts on master @ 71becf337d9d2731298dc092f3ce9cf0f0eedb2c:
Same failure on other branches
Is there a good reason why we check for cancellation anywhere below Raft?
There may not be, but it seems brittle to pass a cancelable context into a subsystem that must not check cancellation. It's both more robust and also, in my view, more appropriate to execute state machine transitions under a context that does not inherit the wholly unrelated client cancellation. I think we should massage cockroach/pkg/kv/kvserver/replica_application_decoder.go (lines 139 to 171 in 852a80c) such that it avoids deriving from the client's context.
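A minimal sketch of that idea in generic Go (not the actual change to `replica_application_decoder.go`): wrap the parent so tracing values still flow through `Value`, while `Done`/`Err` never report the client's cancellation. Newer Go versions ship `context.WithoutCancel` for exactly this.

```go
// Minimal sketch (generic Go, not the actual replica_application_decoder.go
// change) of a context that keeps the parent's values (e.g. tracing
// information) but never reports the parent's cancellation.
// Go 1.21+ offers context.WithoutCancel for the same purpose.
package main

import (
	"context"
	"fmt"
	"time"
)

// valueOnlyCtx forwards Value lookups to the parent but has no deadline and
// never signals Done.
type valueOnlyCtx struct{ parent context.Context }

func (c valueOnlyCtx) Deadline() (time.Time, bool) { return time.Time{}, false }
func (c valueOnlyCtx) Done() <-chan struct{}       { return nil }
func (c valueOnlyCtx) Err() error                  { return nil }
func (c valueOnlyCtx) Value(key any) any           { return c.parent.Value(key) }

func main() {
	type traceKey struct{}
	parent, cancel := context.WithCancel(
		context.WithValue(context.Background(), traceKey{}, "span-1234"))
	detached := valueOnlyCtx{parent: parent}

	cancel() // simulate the proposal being signaled and its ctx canceled

	fmt.Println(parent.Err())               // context.Canceled
	fmt.Println(detached.Err())             // <nil>: application keeps going
	fmt.Println(detached.Value(traceKey{})) // span-1234: tracing values still flow
}
```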
Filed #75656.
I kinda see it the other way around. The subsystem should be robust against any context passed into it. Depending on where you draw the boundary of the subsystem in question, you can say that Raft application can be made robust by switching to a non-cancelable context itself. But still, if there's code that only ever runs below Raft, I think we should take out all the cancellation checks (at the very least, for clarity).
Yeah, taking this approach. There are only 8-ish steps in a full bisect, but it's a little bit of extra work to cherry-pick, etc., so it's a little slower going.
I don't think we are, based on what came in while I was offline. That said, if I treat a "good" signal for this bisect as whether the replicas are balanced, I seem to be zeroing in.
Right, I think this is what we're saying, and what is proposed in #75656.
Trying to make this guarantee is the part that seems brittle. Even if we carefully audit and ensure that we don't perform context cancellation checks directly in Raft code, it's hard to guarantee that no lower-level logic or library that Raft code calls into will perform such checks. For instance, I broke this guarantee in #73279 while touching distant code, which Erik fixed in #73484. There are also proposals like golang/go#20280 to add context cancellation awareness to the filesystem operations provided by the standard library. If we don't want a subsystem to respect context cancellation, it's best not to give it a cancellable context.
Maybe we could even remove
Let's continue discussing this on #75656; this comment thread is already pretty unwieldy.
I had some luck with the bisect on the replica imbalance issue. I've narrowed it down to e12c9e6. On master with this commit included, the import shows the imbalanced behavior; on a branch with just this commit excluded, the replicas are far more balanced and the import is able to succeed.

I don't have enough context to say whether this commit is problematic outside the context of just this test. Once the import succeeds, we're into the (well documented) realm of #75656.

In terms of debugging this specific test failure, I think we've found the two separate issues we theorized.
Great work, and much appreciated that you keep going the extra mile. I think we should close this issue, file a separate issue about what you have found, and then link it to the new clearrange failure issue once the next nightly creates it.
Well that's pretty interesting / mildly surprising, since this is an ordered ingestion import, where we didn't expect this bulk-sent split size to do much other than aggravate the merge queue an hour later.
Ack. Will do. @dt - I'll move discussion over there 👍. If I close this, I think it will just re-open when it fails again (at least until the replica imbalance issue is addressed). I'll just make it known to the rest of the Storage folks that we can probably safely leave this one alone. Thanks all.
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 701b177d8f4b81d8654dfb4090a2cd3cf82e63a7: