
roachtest: cdc/ledger/rangefeed=true failed #46463

Closed
cockroach-teamcity opened this issue Mar 24, 2020 · 10 comments

@cockroach-teamcity
Member

(roachtest).cdc/ledger/rangefeed=true failed on master@055561809b95488bff2cad19422e7f4a7472e3a2:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200324-1823985/cdc/ledger/rangefeed=true/run_1
	cdc.go:931,cdc.go:226,cdc.go:617,test_runner.go:753: max latency was more than allowed: 1m0.387645603s vs 1m0s


Artifacts: /cdc/ledger/rangefeed=true

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 24, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone Mar 24, 2020
@irfansharif irfansharif self-assigned this Mar 31, 2020
@cockroach-teamcity
Member Author

(roachtest).cdc/ledger/rangefeed=true failed on master@b1a0b989bbfef500075a485edc762fe42ca7b32a:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200405-1853455/cdc/ledger/rangefeed=true/run_1
	cdc.go:931,cdc.go:226,cdc.go:617,test_runner.go:753: max latency was more than allowed: 1m30.780632326s vs 1m0s


Artifacts: /cdc/ledger/rangefeed=true

See this test on roachdash
powered by pkg/cmd/internal/issues

@irfansharif
Contributor

Starting to look at this now. The end-to-end latency started climbing right around 08:45:30 and kept climbing until it hit the 1m threshold:

changefeed: 08:45:18 cdc.go:887: end-to-end steady latency 14.533205361s; max steady latency so far 17.469822178s
changefeed: 08:45:20 cdc.go:887: end-to-end steady latency 13.603024858s; max steady latency so far 17.469822178s
changefeed: 08:45:21 cdc.go:887: end-to-end steady latency 14.672415902s; max steady latency so far 17.469822178s
changefeed: 08:45:22 cdc.go:887: end-to-end steady latency 13.741716471s; max steady latency so far 17.469822178s
changefeed: 08:45:23 cdc.go:887: end-to-end steady latency 14.815501704s; max steady latency so far 17.469822178s
changefeed: 08:45:24 cdc.go:887: end-to-end steady latency 13.886944297s; max steady latency so far 17.469822178s
changefeed: 08:45:25 cdc.go:887: end-to-end steady latency 14.955899771s; max steady latency so far 17.469822178s
changefeed: 08:45:26 cdc.go:887: end-to-end steady latency 14.026840456s; max steady latency so far 17.469822178s
changefeed: 08:45:27 cdc.go:887: end-to-end steady latency 15.096035734s; max steady latency so far 17.469822178s
changefeed: 08:45:28 cdc.go:887: end-to-end steady latency 14.172552093s; max steady latency so far 17.469822178s
changefeed: 08:45:29 cdc.go:887: end-to-end steady latency 13.244949736s; max steady latency so far 17.469822178s
changefeed: 08:45:30 cdc.go:887: end-to-end steady latency 14.313843038s; max steady latency so far 17.469822178s
changefeed: 08:45:31 cdc.go:887: end-to-end steady latency 13.383102205s; max steady latency so far 17.469822178s
changefeed: 08:45:32 cdc.go:887: end-to-end steady latency 14.453854253s; max steady latency so far 17.469822178s
changefeed: 08:45:33 cdc.go:887: end-to-end steady latency 15.429749541s; max steady latency so far 17.469822178s
changefeed: 08:45:35 cdc.go:887: end-to-end steady latency 16.49986564s; max steady latency so far 17.469822178s
changefeed: 08:45:36 cdc.go:887: end-to-end steady latency 17.574455481s; max steady latency so far 17.574455481s
changefeed: 08:45:37 cdc.go:887: end-to-end steady latency 18.643448409s; max steady latency so far 18.643448409s
changefeed: 08:45:38 cdc.go:887: end-to-end steady latency 19.712523697s; max steady latency so far 19.712523697s
changefeed: 08:45:39 cdc.go:887: end-to-end steady latency 20.781855148s; max steady latency so far 20.781855148s
changefeed: 08:45:40 cdc.go:887: end-to-end steady latency 21.853016916s; max steady latency so far 21.853016916s
changefeed: 08:45:41 cdc.go:887: end-to-end steady latency 22.922195883s; max steady latency so far 22.922195883s
changefeed: 08:45:42 cdc.go:887: end-to-end steady latency 23.993745765s; max steady latency so far 23.993745765s
changefeed: 08:45:43 cdc.go:887: end-to-end steady latency 25.065603655s; max steady latency so far 25.065603655s
changefeed: 08:45:44 cdc.go:887: end-to-end steady latency 26.135431177s; max steady latency so far 26.135431177s
changefeed: 08:45:45 cdc.go:887: end-to-end steady latency 27.207005972s; max steady latency so far 27.207005972s
changefeed: 08:45:46 cdc.go:887: end-to-end steady latency 28.278210894s; max steady latency so far 28.278210894s
changefeed: 08:45:47 cdc.go:887: end-to-end steady latency 29.347163819s; max steady latency so far 29.347163819s
changefeed: 08:45:48 cdc.go:887: end-to-end steady latency 30.41688011s; max steady latency so far 30.41688011s
changefeed: 08:45:49 cdc.go:887: end-to-end steady latency 31.486844685s; max steady latency so far 31.486844685s
changefeed: 08:45:51 cdc.go:887: end-to-end steady latency 32.555683735s; max steady latency so far 32.555683735s
changefeed: 08:45:52 cdc.go:887: end-to-end steady latency 33.626292069s; max steady latency so far 33.626292069s
changefeed: 08:45:53 cdc.go:887: end-to-end steady latency 34.697097028s; max steady latency so far 34.697097028s
changefeed: 08:45:54 cdc.go:887: end-to-end steady latency 35.766616308s; max steady latency so far 35.766616308s
changefeed: 08:45:55 cdc.go:887: end-to-end steady latency 36.835497538s; max steady latency so far 36.835497538s
changefeed: 08:45:56 cdc.go:887: end-to-end steady latency 37.907832377s; max steady latency so far 37.907832377s
changefeed: 08:45:57 cdc.go:887: end-to-end steady latency 38.977788388s; max steady latency so far 38.977788388s
changefeed: 08:45:58 cdc.go:887: end-to-end steady latency 40.047050758s; max steady latency so far 40.047050758s
changefeed: 08:45:59 cdc.go:887: end-to-end steady latency 41.116994954s; max steady latency so far 41.116994954s
changefeed: 08:46:00 cdc.go:887: end-to-end steady latency 42.186050834s; max steady latency so far 42.186050834s
changefeed: 08:46:01 cdc.go:887: end-to-end steady latency 43.254787829s; max steady latency so far 43.254787829s
changefeed: 08:46:02 cdc.go:887: end-to-end steady latency 44.324816833s; max steady latency so far 44.324816833s
changefeed: 08:46:03 cdc.go:887: end-to-end steady latency 45.401400168s; max steady latency so far 45.401400168s
changefeed: 08:46:04 cdc.go:887: end-to-end steady latency 46.470333503s; max steady latency so far 46.470333503s
changefeed: 08:46:06 cdc.go:887: end-to-end steady latency 47.538919692s; max steady latency so far 47.538919692s
changefeed: 08:46:07 cdc.go:887: end-to-end steady latency 48.610360279s; max steady latency so far 48.610360279s
changefeed: 08:46:08 cdc.go:887: end-to-end steady latency 49.679751767s; max steady latency so far 49.679751767s
changefeed: 08:46:09 cdc.go:887: end-to-end steady latency 50.751562297s; max steady latency so far 50.751562297s
changefeed: 08:46:10 cdc.go:887: end-to-end steady latency 51.821102761s; max steady latency so far 51.821102761s
changefeed: 08:46:11 cdc.go:887: end-to-end steady latency 52.890717432s; max steady latency so far 52.890717432s
changefeed: 08:46:12 cdc.go:887: end-to-end steady latency 53.962324833s; max steady latency so far 53.962324833s
changefeed: 08:46:13 cdc.go:887: end-to-end steady latency 55.035554318s; max steady latency so far 55.035554318s
changefeed: 08:46:14 cdc.go:887: end-to-end steady latency 56.107136265s; max steady latency so far 56.107136265s
changefeed: 08:46:15 cdc.go:887: end-to-end steady latency 57.179191925s; max steady latency so far 57.179191925s
changefeed: 08:46:16 cdc.go:887: end-to-end steady latency 58.248661037s; max steady latency so far 58.248661037s
changefeed: 08:46:17 cdc.go:887: end-to-end steady latency 59.318353363s; max steady latency so far 59.318353363s
changefeed: 08:46:18 cdc.go:887: end-to-end steady latency 1m0.387645603s; max steady latency so far 1m0.387645603s
08:46:19 test.go:325: test failure: 	cdc.go:931,cdc.go:226,cdc.go:617,test_runner.go:753: max latency was more than allowed: 1m0.387645603s vs 1m0s

That time frame maps to the following events in the server logs.

-- N1
I200324 08:45:32.815771 361 kv/kvserver/store_remove_replica.go:122  [n1,s1,r57/1:/Table/53/1/{661-784}] removing replica r54/1
I200324 08:45:43.871331 305 kv/kvserver/store_remove_replica.go:122  [n1,s1,r60/1:/Table/53/1/{792-898}] removing replica r55/1

-- N2
I200324 08:45:32.807375 311 kv/kvserver/store_remove_replica.go:122  [n2,s2,r57/3:/Table/53/1/{661-784}] removing replica r54/3
I200324 08:45:43.873223 330 kv/kvserver/store_remove_replica.go:122  [n2,s2,r60/3:/Table/53/1/{792-898}] removing replica r55/3

-- N3
I200324 08:45:32.780173 37073763 kv/kvserver/replica_command.go:663  [n3,merge,s3,r57/2:/Table/53/1/{661-784}] initiating a merge of r54:/Table/53/1/{784-898} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=9] into this range (lhs+rhs has (size=15 MiB+14 MiB=28 MiB qps=591.12+568.53=1159.65qps) below threshold (size=128 MiB, qps=1250.00))
I200324 08:45:32.807225 264 kv/kvserver/store_remove_replica.go:122  [n3,s3,r57/2:/Table/53/1/{661-784}] removing replica r54/2
I200324 08:45:43.726444 37278938 kv/kvserver/replica_command.go:397  [n3,split,s3,r57/2:/Table/53/1/{661-898}] initiating a split of this range at key /Table/53/1/792 [r60] (load at key /Table/53/1/792 (2501.00 splitQPS, 2038.35 batches/sec, 1783.02 raft mutations/sec))
I200324 08:45:43.853130 37279759 kv/kvserver/replica_command.go:663  [n3,merge,s3,r60/2:/Table/53/1/{792-898}] initiating a merge of r55:/Table/5{3/1/898-4} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=9] into this range (lhs+rhs has (size=13 MiB+12 MiB=25 MiB qps=0.00+1243.61=1243.61qps) below threshold (size=128 MiB, qps=1250.00))
I200324 08:45:43.871466 234 kv/kvserver/store_remove_replica.go:122  [n3,s3,r60/2:/Table/53/1/{792-898}] removing replica r55/2

@cockroach-teamcity
Member Author

(roachtest).cdc/ledger/rangefeed=true failed on master@0222b515560bb02e6adf59a09f6c067923bca28c:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200409-1862625/cdc/ledger/rangefeed=true/run_1
	cdc.go:931,cdc.go:226,cdc.go:617,test_runner.go:753: max latency was more than allowed: 1m49.124409292s vs 1m0s


Artifacts: /cdc/ledger/rangefeed=true

See this test on roachdash
powered by pkg/cmd/internal/issues

@irfansharif
Contributor

irfansharif commented Apr 14, 2020

Still no aha moment, but now that we have a few more failures to go off of, I tried a few things. I'll post more detailed notes later as I'm still going through the machinery here, but so far I haven't been able to reproduce this easily:

(a) increased the likelihood of splits by reducing the QPS threshold
(b) increased the frequency of sarama metadata refreshes (sketched at the end of this comment)
(c) played around with how long the workload itself runs vs. how long the test itself runs (see below)
(d) increased the frequency of the temp object cleaner (see below)

The reasoning behind trying (c) and (d) is that for the recent failures here and in #47400, the end-to-end latency started climbing right around 30m into the test. (c) and (d) were the moving pieces that were on 30m rotations, so to speak. (Logging from those components also appeared around the time of the latency climb.)

It's probably unlikely, but the underlying issue here may be related to #37716 and #36879 (though there's no chaos here), and we may have seen this specific failure as far back as #43809 (comment).
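
For concreteness, (a) was just lowering the load-based splitting threshold (the kv.range_split.load_qps_threshold cluster setting), and (b) was roughly the following shape; the 30s value here is illustrative, not what I actually ran with:

package main

import (
	"time"

	"github.com/Shopify/sarama"
)

// saramaConfigWithFrequentMetadataRefresh returns a sarama config that
// refreshes cluster metadata much more often than the library default (10m),
// so the Kafka client notices partition/leadership changes sooner. The exact
// value is illustrative.
func saramaConfigWithFrequentMetadataRefresh() *sarama.Config {
	cfg := sarama.NewConfig()
	cfg.Metadata.RefreshFrequency = 30 * time.Second // default: 10 minutes
	return cfg
}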

@irfansharif
Contributor

irfansharif commented Apr 14, 2020

I'm still suspicious of splits + rangefeeds. I suspect we're getting disconnected somehow (I see split events around when the problems started occurring). Aside: I'm concluding the issue here isn't the same as #47187 (despite the similarities) because the workload logs don't appear to hit 0 QPS at any point.

Edit: I know what the (non-)issue is; I'll write it up in a bit. Dumping these screenshots here for myself in the meantime.

[two screenshots attached]

@irfansharif
Contributor

I have a fast-ish repro for this specific bug here, but haven't had the time to dig in further (though it shouldn't be too bad, I think): irfansharif/cockroach@200408.cdc-ledger

What I think is happening here is that somehow ranges are not receiving closed timestamp updates, and as a result the rangefeeds established over those ranges are essentially wedged. I don't think the closedts subsystem is at fault (as in, there's nothing blocking timestamps from being closed out); I think the problem is more that the ranges somehow get "disconnected" from being notified about closed timestamp updates.
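
To spell out why that produces the latency curve above: as I understand it, the changefeed's frontier only advances to the minimum resolved timestamp across its constituent ranges, so a single wedged range stalls the whole changefeed, and end-to-end latency then grows by about a second per second, which is what the log shows. A toy model (made-up names, not the actual changefeed code):

package main

import "time"

// changefeedFrontier models why one stalled rangefeed stalls the entire
// changefeed: the frontier (and hence the resolved timestamps the test
// measures latency against) is the minimum resolved timestamp across all
// ranges, so once any one range stops advancing, "now - frontier" grows by
// ~1s per wall-clock second.
func changefeedFrontier(resolvedByRange map[string]time.Time) time.Time {
	var frontier time.Time
	for _, ts := range resolvedByRange {
		if frontier.IsZero() || ts.Before(frontier) {
			frontier = ts
		}
	}
	return frontier
}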


When a rangefeed is established over a given range, and the range splits, we have to re-establish rangefeeds on the split ranges.

// partialRangeFeed establishes a RangeFeed to the range specified by desc. It
// manages lifecycle events of the range in order to maintain the RangeFeed
// connection; this may involve instructing higher-level functions to retry
// this rangefeed, or subdividing the range further in the event of a split.
func (ds *DistSender) partialRangeFeed(
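
A heavily simplified sketch of that lifecycle, with made-up names (the real DistSender code is more involved; this is just the shape of it):

package main

import (
	"context"
	"errors"
	"time"
)

// Stand-ins for the real DistSender machinery, purely so the control flow
// below is self-contained; none of these names exist in the actual code.
var errRangeSplit = errors.New("range split")

type rangeFeedConn interface {
	// Run streams rangefeed events for a single range starting at startTS and
	// returns the last checkpointed timestamp plus the error that ended it.
	Run(ctx context.Context, rangeID int, startTS time.Time) (time.Time, error)
	// PostSplitRanges re-resolves the descriptors now covering the old range.
	PostSplitRanges(rangeID int) []int
}

// partialRangeFeedSketch models the lifecycle described in the doc comment
// above: keep one range's feed alive; on a split, subdivide and start one
// feed per post-split range; on other retryable errors, re-establish the feed
// on the same range, resuming from wherever the previous attempt got to.
func partialRangeFeedSketch(ctx context.Context, c rangeFeedConn, rangeID int, startTS time.Time) error {
	for {
		lastTS, err := c.Run(ctx, rangeID, startTS)
		if err == nil || ctx.Err() != nil {
			return err
		}
		startTS = lastTS // resume point for the next attempt
		if errors.Is(err, errRangeSplit) {
			for _, sub := range c.PostSplitRanges(rangeID) {
				go partialRangeFeedSketch(ctx, c, sub, startTS)
			}
			return nil
		}
		// Any other retryable error: loop and re-establish on the same range,
		// which triggers another catch up scan server-side.
	}
}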

As part of that "re-establishment", we run this catch up scan.

// runCatchupScan starts a catchup scan which will output entries for all
// recorded changes in the replica that are newer than the catchupTimestamp.
// This uses the iterator provided when the registration was originally created;
// after the scan completes, the iterator will be closed.
func (r *registration) runCatchupScan() error {
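
i.e. the catch up scan replays, in order, every MVCC revision in the range newer than the catch-up timestamp into the registration's event stream. Roughly this shape (toy types, not the actual registry code):

package main

import "time"

// revision is a toy stand-in for an MVCC key/value plus its write timestamp.
type revision struct {
	key, value string
	ts         time.Time
}

// runCatchupScanSketch models the doc comment above: emit every revision
// written after catchupTS, oldest first. On a large range this walks a lot of
// keys, which is where the ~3s figure below comes from.
func runCatchupScanSketch(revisions []revision, catchupTS time.Time, emit func(revision)) {
	for _, r := range revisions {
		if r.ts.After(catchupTS) {
			emit(r)
		}
	}
}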

In the test failure captured above, and in my early attempts at a repro, the end-to-end latency climb only started happening ~30m into the test run. I later found out this was a side effect of the bump in the default range size: with the max range size at 512 MiB, when a range (with a rangefeed established over it) was split, the catch up iterator took ~3s to iterate over all the keys. That's pretty much all it takes to cause the issue captured in this test, and is what I did in my repro branch.

// XXX: Manual hack for introducing catch up scan delay.
time.Sleep(3 * time.Second)

With the added lag, I observed multiple rangefeed registrations for the RHS range, happening over and over. Adding the following log:

log.Infof(ctx, "catch up scan took %s, ctx err %+v", duration.String(), ctx.Err())

I observed:

608:I200415 23:22:43.056213 15944013 kv/kvserver/replica_command.go:397  [n2,split,s2,r35/3:/Table/55{-/2}] initiating a split of this range at key /Table/55/1/546968607878479873 [r63] (512 MiB above threshold size 512 MiB)
...
I200415 23:22:46.432608 19803456 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.636059897s
I200415 23:22:49.042242 19828719 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.553311237s
I200415 23:22:50.660315 65 server/status/runtime.go:498  [n1] runtime stats: 5.3 GiB RSS, 457 goroutines, 579 MiB/206 MiB/820 MiB GO alloc/idle/total, 3.6 GiB/4.4 GiB CGO alloc/total, 1882176.9 CGO/sec, 929.5/83.7 %(u/s)time, 0.0 %gc (11x), 104 MiB/147 MiB (r/w)net
I200415 23:22:52.008701 19853487 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.845833073s
I200415 23:22:54.735336 19880845 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.544007908s
I200415 23:22:57.824849 19909792 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.669301398s
I200415 23:23:00.752082 65 server/status/runtime.go:498  [n1] runtime stats: 5.4 GiB RSS, 455 goroutines, 761 MiB/45 MiB/856 MiB GO alloc/idle/total, 3.6 GiB/4.4 GiB CGO alloc/total, 1730104.6 CGO/sec, 1043.3/92.5 %(u/s)time, 0.0 %gc (13x), 112 MiB/183 MiB (r/w)net
I200415 23:23:01.340857 19941182 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.766179052s
I200415 23:23:05.076929 19976088 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.74370687s
I200415 23:23:08.966996 20013126 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.758871217s

r63 split off of r35 and kept retrying the catch up scan, which should have happened just the one time. I tabled the investigation here.
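
For my own notes, the degenerate loop I think those logs show, as a toy (hypothetical: I haven't pinned down what tears down each registration after its catch up scan completes):

package main

import (
	"fmt"
	"time"
)

// wedgedRHSSketch is a toy model of the r63 logs above: every re-registration
// repeats the full catch up scan from the same timestamp, the feed is torn
// down again before any checkpoint bumps that timestamp, and so the RHS never
// reaches a connected, caught-up state -- its resolved timestamp (and with it
// the changefeed's frontier) stops advancing.
func wedgedRHSSketch(catchupTS time.Time, scanDuration time.Duration, attempts int) {
	for i := 0; i < attempts; i++ {
		// Hypothetical: nothing checkpoints before the teardown, so catchupTS
		// is never advanced for the next attempt.
		fmt.Printf("attempt %d: catch up scan from %s took %s\n",
			i+1, catchupTS.Format(time.RFC3339), scanDuration)
	}
}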

@tbg
Member

tbg commented May 12, 2020

@ajwerner might this be the bug you've just fixed?

@tbg
Member

tbg commented May 12, 2020

@irfansharif if you have a cycle, it would be nice if you could kick off the repro after a rebase to see if the problem is still there. If not, we can assume this is closed.

@ajwerner
Contributor

Seems likely

@irfansharif
Contributor

I don't see this recurring on master; I'll take it to have been fixed by one of #48521 or #48561.
