roachtest: cdc/ledger/rangefeed=true failed #46463
(roachtest).cdc/ledger/rangefeed=true failed on master@b1a0b989bbfef500075a485edc762fe42ca7b32a:
Artifacts: /cdc/ledger/rangefeed=true. See this test on roachdash.
Starting to look at this now. The end-to-end latency started climbing right around 08:45:30, up until it hit the 1m threshold.
That time frame maps to the following events in the server logs.
(roachtest).cdc/ledger/rangefeed=true failed on master@0222b515560bb02e6adf59a09f6c067923bca28c:
Artifacts: /cdc/ledger/rangefeed=true. See this test on roachdash.
Still no aha moment, but I did try a few things now that we have a few more failures to go off of. I'll post more detailed notes later as I'm still going through the machinery here, but I tried a few ideas and still wasn't able to reproduce it easily. (a) increased the likelihood of splits by reducing the QPS threshold. The reasoning behind trying (c) and (d) is that for the recent failures here and in #47400, the end-to-end latency started climbing right about 30m into the test. (c) and (d) were the moving pieces in play that were on 30m rotations, so to speak. (Logging from those components also appeared around the time of the latency climb.) It's probably unlikely, but the underlying issue here may be related to #37716 and #36879 (though there's no chaos here), and we may have seen this specific failure happen as far back as #43809 (comment).
I'm still suspicious of splits + rangefeeds. I suspect we're getting disconnected somehow (I see split events around when the problems started occurring). Aside: I'm concluding the issue here isn't the same as #47187 (despite the similarities) because the workload logs don't appear to hit 0 QPS at any point. Edit: I know what the (non-)issue is; I'll write it up in a bit. Dumping these screenshots here for myself in the meantime.
I have a fast-ish repro for this specific bug here, but haven't had the time to dig in further (though it shouldn't be too bad, I think): irfansharif/cockroach@200408.cdc-ledger. What I think is happening is that somehow ranges are not receiving closed timestamp updates, and as a result the rangefeeds established over those ranges are essentially wedged. I don't think the closedts subsystem itself is at fault (as in, there's nothing blocking timestamps from being closed out); I think the problem is more that the ranges somehow get "disconnected" from being notified about closed timestamp updates. When a rangefeed is established over a given range and the range splits, we have to re-establish rangefeeds on the split ranges (cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go, lines 123 to 127 in 1320e13).
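To make that flow concrete, here's a minimal, self-contained sketch of the idea; the `span` type, `errRangeSplit`, and the function names are invented for illustration and are not CockroachDB's actual API. The point is just that a partial rangefeed runs per range, and when a range splits, the feed over the old span is torn down and re-established over each post-split span.

```go
package main

import (
	"errors"
	"fmt"
)

// span is a simplified key span; real rangefeeds operate on roachpb.Spans.
type span struct{ start, end string }

// errRangeSplit stands in for the kind of error a client sees when the range
// it targeted no longer matches its descriptor (e.g. after a split).
var errRangeSplit = errors.New("range split: span now covered by multiple ranges")

// runSingleRangeFeed pretends to run a rangefeed over one range. Here it
// simply "fails" on the original span to simulate that range splitting.
func runSingleRangeFeed(sp span) ([]span, error) {
	if sp.start == "a" && sp.end == "z" {
		// The range split; report the new constituent spans to retry over.
		return []span{{"a", "m"}, {"m", "z"}}, errRangeSplit
	}
	fmt.Printf("rangefeed established over [%s, %s)\n", sp.start, sp.end)
	return nil, nil
}

// rangeFeed re-establishes partial rangefeeds whenever a constituent range splits.
func rangeFeed(sp span) {
	pending := []span{sp}
	for len(pending) > 0 {
		cur := pending[0]
		pending = pending[1:]
		newSpans, err := runSingleRangeFeed(cur)
		if errors.Is(err, errRangeSplit) {
			// Retry with one rangefeed per post-split range.
			pending = append(pending, newSpans...)
		}
	}
}

func main() {
	rangeFeed(span{"a", "z"})
}
```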
As part of that re-establishment, we run a catch up scan (cockroach/pkg/kv/kvserver/rangefeed/registry.go, lines 277 to 281 in aebcb79).
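Here is a heavily simplified sketch of what a catch up scan amounts to, assuming the range's data can be modeled as a flat list of timestamped versions (the real registry code walks an engine iterator); the `version` type and `catchUpScan` signature are illustrative only:

```go
package main

import "fmt"

// version is a toy stand-in for an MVCC key/value version.
type version struct {
	key string
	ts  int64 // simplified wall-clock timestamp
	val string
}

// catchUpScan replays every existing version written at or after startTS to
// the new registration before any live events are delivered.
func catchUpScan(data []version, startTS int64, emit func(version)) {
	for _, v := range data {
		if v.ts >= startTS {
			emit(v)
		}
	}
}

func main() {
	data := []version{
		{"a", 1, "old"},
		{"a", 5, "new"},
		{"b", 3, "x"},
	}
	catchUpScan(data, 3, func(v version) {
		fmt.Printf("catch up event: %s@%d=%s\n", v.key, v.ts, v.val)
	})
}
```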
In the test failure captured above, and in my early attempts at a repro, the end-to-end latency climb only started happening ~30m into the test run. I later found out this was a side effect of the bump in the default range size. With the max range size at 512 MB, when a range (with a rangefeed established over it) was split, the catch up iterator took ~3s to iterate over all the keys. That's pretty much all it takes to cause the issue captured in this test, and it's what I simulated in my repro branch (cockroach/pkg/kv/kvserver/rangefeed/registry.go, lines 295 to 296 in bdbe80d).
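To illustrate the repro approach: inject an artificial delay into a (simplified) catch up scan to mimic the ~3s it took over a 512 MB range. The constant and function names below are hypothetical and not the actual change on the repro branch:

```go
package main

import (
	"fmt"
	"time"
)

// injectedCatchUpDelay mimics the ~3s a catch up scan over a large range took.
const injectedCatchUpDelay = 3 * time.Second

// catchUpScanWithDelay is a stand-in for the scan loop with an artificial
// sleep injected, in the spirit of the repro branch.
func catchUpScanWithDelay(keys []string, emit func(string)) {
	time.Sleep(injectedCatchUpDelay) // simulate iterating a large range
	for _, k := range keys {
		emit(k)
	}
}

func main() {
	start := time.Now()
	catchUpScanWithDelay([]string{"a", "b", "c"}, func(k string) {
		fmt.Println("catch up event:", k)
	})
	fmt.Println("catch up scan took", time.Since(start).Round(time.Millisecond))
}
```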
What I observed, with the added lag, was multiple rangefeed registrations for the RHS range, happening over and over. Adding the following log:
I observed:
r63 (split off of r35) kept re-trying the catch up scan, which should have happened just the one time. I tabled the investigation here.
@ajwerner might this be the bug you've just fixed?
@irfansharif, if you have a cycle it would be nice if you could kick off the repro after a rebase to see if it's still there. If not, we can assume it's closed.
Seems likely.
(roachtest).cdc/ledger/rangefeed=true failed on master@055561809b95488bff2cad19422e7f4a7472e3a2:
Artifacts: /cdc/ledger/rangefeed=true. See this test on roachdash.
powered by pkg/cmd/internal/issues