
roachtest: cdc/ledger/rangefeed=true failed #46463

Closed
cockroach-teamcity opened this issue Mar 24, 2020 · 10 comments

@cockroach-teamcity
Member

(roachtest).cdc/ledger/rangefeed=true failed on master@055561809b95488bff2cad19422e7f4a7472e3a2:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200324-1823985/cdc/ledger/rangefeed=true/run_1
	cdc.go:931,cdc.go:226,cdc.go:617,test_runner.go:753: max latency was more than allowed: 1m0.387645603s vs 1m0s


Artifacts: /cdc/ledger/rangefeed=true

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 24, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone Mar 24, 2020
@irfansharif irfansharif self-assigned this Mar 31, 2020
@cockroach-teamcity
Member Author

(roachtest).cdc/ledger/rangefeed=true failed on master@b1a0b989bbfef500075a485edc762fe42ca7b32a:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200405-1853455/cdc/ledger/rangefeed=true/run_1
	cdc.go:931,cdc.go:226,cdc.go:617,test_runner.go:753: max latency was more than allowed: 1m30.780632326s vs 1m0s


Artifacts: /cdc/ledger/rangefeed=true

See this test on roachdash
powered by pkg/cmd/internal/issues

@irfansharif
Contributor

Starting to look at this now. The end-to-end latency started climbing right around 08:45:30 and kept climbing until it hit the 1m threshold:

changefeed: 08:45:18 cdc.go:887: end-to-end steady latency 14.533205361s; max steady latency so far 17.469822178s
changefeed: 08:45:20 cdc.go:887: end-to-end steady latency 13.603024858s; max steady latency so far 17.469822178s
changefeed: 08:45:21 cdc.go:887: end-to-end steady latency 14.672415902s; max steady latency so far 17.469822178s
changefeed: 08:45:22 cdc.go:887: end-to-end steady latency 13.741716471s; max steady latency so far 17.469822178s
changefeed: 08:45:23 cdc.go:887: end-to-end steady latency 14.815501704s; max steady latency so far 17.469822178s
changefeed: 08:45:24 cdc.go:887: end-to-end steady latency 13.886944297s; max steady latency so far 17.469822178s
changefeed: 08:45:25 cdc.go:887: end-to-end steady latency 14.955899771s; max steady latency so far 17.469822178s
changefeed: 08:45:26 cdc.go:887: end-to-end steady latency 14.026840456s; max steady latency so far 17.469822178s
changefeed: 08:45:27 cdc.go:887: end-to-end steady latency 15.096035734s; max steady latency so far 17.469822178s
changefeed: 08:45:28 cdc.go:887: end-to-end steady latency 14.172552093s; max steady latency so far 17.469822178s
changefeed: 08:45:29 cdc.go:887: end-to-end steady latency 13.244949736s; max steady latency so far 17.469822178s
changefeed: 08:45:30 cdc.go:887: end-to-end steady latency 14.313843038s; max steady latency so far 17.469822178s
changefeed: 08:45:31 cdc.go:887: end-to-end steady latency 13.383102205s; max steady latency so far 17.469822178s
changefeed: 08:45:32 cdc.go:887: end-to-end steady latency 14.453854253s; max steady latency so far 17.469822178s
changefeed: 08:45:33 cdc.go:887: end-to-end steady latency 15.429749541s; max steady latency so far 17.469822178s
changefeed: 08:45:35 cdc.go:887: end-to-end steady latency 16.49986564s; max steady latency so far 17.469822178s
changefeed: 08:45:36 cdc.go:887: end-to-end steady latency 17.574455481s; max steady latency so far 17.574455481s
changefeed: 08:45:37 cdc.go:887: end-to-end steady latency 18.643448409s; max steady latency so far 18.643448409s
changefeed: 08:45:38 cdc.go:887: end-to-end steady latency 19.712523697s; max steady latency so far 19.712523697s
changefeed: 08:45:39 cdc.go:887: end-to-end steady latency 20.781855148s; max steady latency so far 20.781855148s
changefeed: 08:45:40 cdc.go:887: end-to-end steady latency 21.853016916s; max steady latency so far 21.853016916s
changefeed: 08:45:41 cdc.go:887: end-to-end steady latency 22.922195883s; max steady latency so far 22.922195883s
changefeed: 08:45:42 cdc.go:887: end-to-end steady latency 23.993745765s; max steady latency so far 23.993745765s
changefeed: 08:45:43 cdc.go:887: end-to-end steady latency 25.065603655s; max steady latency so far 25.065603655s
changefeed: 08:45:44 cdc.go:887: end-to-end steady latency 26.135431177s; max steady latency so far 26.135431177s
changefeed: 08:45:45 cdc.go:887: end-to-end steady latency 27.207005972s; max steady latency so far 27.207005972s
changefeed: 08:45:46 cdc.go:887: end-to-end steady latency 28.278210894s; max steady latency so far 28.278210894s
changefeed: 08:45:47 cdc.go:887: end-to-end steady latency 29.347163819s; max steady latency so far 29.347163819s
changefeed: 08:45:48 cdc.go:887: end-to-end steady latency 30.41688011s; max steady latency so far 30.41688011s
changefeed: 08:45:49 cdc.go:887: end-to-end steady latency 31.486844685s; max steady latency so far 31.486844685s
changefeed: 08:45:51 cdc.go:887: end-to-end steady latency 32.555683735s; max steady latency so far 32.555683735s
changefeed: 08:45:52 cdc.go:887: end-to-end steady latency 33.626292069s; max steady latency so far 33.626292069s
changefeed: 08:45:53 cdc.go:887: end-to-end steady latency 34.697097028s; max steady latency so far 34.697097028s
changefeed: 08:45:54 cdc.go:887: end-to-end steady latency 35.766616308s; max steady latency so far 35.766616308s
changefeed: 08:45:55 cdc.go:887: end-to-end steady latency 36.835497538s; max steady latency so far 36.835497538s
changefeed: 08:45:56 cdc.go:887: end-to-end steady latency 37.907832377s; max steady latency so far 37.907832377s
changefeed: 08:45:57 cdc.go:887: end-to-end steady latency 38.977788388s; max steady latency so far 38.977788388s
changefeed: 08:45:58 cdc.go:887: end-to-end steady latency 40.047050758s; max steady latency so far 40.047050758s
changefeed: 08:45:59 cdc.go:887: end-to-end steady latency 41.116994954s; max steady latency so far 41.116994954s
changefeed: 08:46:00 cdc.go:887: end-to-end steady latency 42.186050834s; max steady latency so far 42.186050834s
changefeed: 08:46:01 cdc.go:887: end-to-end steady latency 43.254787829s; max steady latency so far 43.254787829s
changefeed: 08:46:02 cdc.go:887: end-to-end steady latency 44.324816833s; max steady latency so far 44.324816833s
changefeed: 08:46:03 cdc.go:887: end-to-end steady latency 45.401400168s; max steady latency so far 45.401400168s
changefeed: 08:46:04 cdc.go:887: end-to-end steady latency 46.470333503s; max steady latency so far 46.470333503s
changefeed: 08:46:06 cdc.go:887: end-to-end steady latency 47.538919692s; max steady latency so far 47.538919692s
changefeed: 08:46:07 cdc.go:887: end-to-end steady latency 48.610360279s; max steady latency so far 48.610360279s
changefeed: 08:46:08 cdc.go:887: end-to-end steady latency 49.679751767s; max steady latency so far 49.679751767s
changefeed: 08:46:09 cdc.go:887: end-to-end steady latency 50.751562297s; max steady latency so far 50.751562297s
changefeed: 08:46:10 cdc.go:887: end-to-end steady latency 51.821102761s; max steady latency so far 51.821102761s
changefeed: 08:46:11 cdc.go:887: end-to-end steady latency 52.890717432s; max steady latency so far 52.890717432s
changefeed: 08:46:12 cdc.go:887: end-to-end steady latency 53.962324833s; max steady latency so far 53.962324833s
changefeed: 08:46:13 cdc.go:887: end-to-end steady latency 55.035554318s; max steady latency so far 55.035554318s
changefeed: 08:46:14 cdc.go:887: end-to-end steady latency 56.107136265s; max steady latency so far 56.107136265s
changefeed: 08:46:15 cdc.go:887: end-to-end steady latency 57.179191925s; max steady latency so far 57.179191925s
changefeed: 08:46:16 cdc.go:887: end-to-end steady latency 58.248661037s; max steady latency so far 58.248661037s
changefeed: 08:46:17 cdc.go:887: end-to-end steady latency 59.318353363s; max steady latency so far 59.318353363s
changefeed: 08:46:18 cdc.go:887: end-to-end steady latency 1m0.387645603s; max steady latency so far 1m0.387645603s
08:46:19 test.go:325: test failure: 	cdc.go:931,cdc.go:226,cdc.go:617,test_runner.go:753: max latency was more than allowed: 1m0.387645603s vs 1m0s

That time frame maps to the following events in the server logs.

-- N1
I200324 08:45:32.815771 361 kv/kvserver/store_remove_replica.go:122  [n1,s1,r57/1:/Table/53/1/{661-784}] removing replica r54/1
I200324 08:45:43.871331 305 kv/kvserver/store_remove_replica.go:122  [n1,s1,r60/1:/Table/53/1/{792-898}] removing replica r55/1

-- N2
I200324 08:45:32.807375 311 kv/kvserver/store_remove_replica.go:122  [n2,s2,r57/3:/Table/53/1/{661-784}] removing replica r54/3
I200324 08:45:43.873223 330 kv/kvserver/store_remove_replica.go:122  [n2,s2,r60/3:/Table/53/1/{792-898}] removing replica r55/3

-- N3
I200324 08:45:32.780173 37073763 kv/kvserver/replica_command.go:663  [n3,merge,s3,r57/2:/Table/53/1/{661-784}] initiating a merge of r54:/Table/53/1/{784-898} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=9] into this range (lhs+rhs has (size=15 MiB+14 MiB=28 MiB qps=591.12+568.53=1159.65qps) below threshold (size=128 MiB, qps=1250.00))
I200324 08:45:32.807225 264 kv/kvserver/store_remove_replica.go:122  [n3,s3,r57/2:/Table/53/1/{661-784}] removing replica r54/2
I200324 08:45:43.726444 37278938 kv/kvserver/replica_command.go:397  [n3,split,s3,r57/2:/Table/53/1/{661-898}] initiating a split of this range at key /Table/53/1/792 [r60] (load at key /Table/53/1/792 (2501.00 splitQPS, 2038.35 batches/sec, 1783.02 raft mutations/sec))
I200324 08:45:43.853130 37279759 kv/kvserver/replica_command.go:663  [n3,merge,s3,r60/2:/Table/53/1/{792-898}] initiating a merge of r55:/Table/5{3/1/898-4} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=9] into this range (lhs+rhs has (size=13 MiB+12 MiB=25 MiB qps=0.00+1243.61=1243.61qps) below threshold (size=128 MiB, qps=1250.00))
I200324 08:45:43.871466 234 kv/kvserver/store_remove_replica.go:122  [n3,s3,r60/2:/Table/53/1/{792-898}] removing replica r55/2

@cockroach-teamcity
Member Author

(roachtest).cdc/ledger/rangefeed=true failed on master@0222b515560bb02e6adf59a09f6c067923bca28c:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200409-1862625/cdc/ledger/rangefeed=true/run_1
	cdc.go:931,cdc.go:226,cdc.go:617,test_runner.go:753: max latency was more than allowed: 1m49.124409292s vs 1m0s


Artifacts: /cdc/ledger/rangefeed=true

See this test on roachdash
powered by pkg/cmd/internal/issues

@irfansharif
Contributor

irfansharif commented Apr 14, 2020

Still no aha moment, but now that we have a few more failures to go off of, I tried a few things. I'll post more detailed notes later as I'm still going through the machinery here, but so far I haven't been able to reproduce this easily:

(a) increased the likelihood of splits by reducing the QPS threshold
(b) increased the frequency of sarama metadata refreshes (sketched at the end of this comment)
(c) played around with how long the workload itself runs vs. how long the test itself runs (see below)
(d) increased the frequency of the temp object cleaner (see below)

The reasoning behind trying (c) and (d) is that for the recent failures here and in #47400, the end-to-end latency started climbing right around 30m into the test. (c) and (d) were the moving pieces that were on 30m rotations, so to speak. (Logging from those components also appeared around the time of the latency climb.)

It's probably unlikely, but the underlying issue here may be related to #37716 and #36879 (though there's no chaos here), and we may have seen this specific failure as far back as #43809 (comment).
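
For concreteness, (a) was just lowering the load-based splitting threshold (the kv.range_split.load_qps_threshold cluster setting), and (b) was roughly the following shape; the 30s value here is illustrative, not what I actually ran with:

package main

import (
	"time"

	"github.com/Shopify/sarama"
)

// saramaConfigWithFrequentMetadataRefresh returns a sarama config that
// refreshes cluster metadata much more often than the library default (10m),
// so the Kafka client notices partition/leadership changes sooner. The exact
// value is illustrative.
func saramaConfigWithFrequentMetadataRefresh() *sarama.Config {
	cfg := sarama.NewConfig()
	cfg.Metadata.RefreshFrequency = 30 * time.Second // default: 10 minutes
	return cfg
}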

@irfansharif
Contributor

irfansharif commented Apr 14, 2020

I'm still suspicious of splits + rangefeeds. I suspect we're getting disconnected somehow (I see split events around when the problems started occurring). Aside: I'm concluding the issue here isn't the same as #47187 (despite the similarities) because the workload logs don't appear to hit 0 QPS at any point.

Edit: I know what the (non-)issue is; I'll write it up in a bit. Dumping these screenshots here for myself in the meantime.

[two screenshots attached]

@irfansharif
Contributor

I have a fast-ish repro for this specific bug here, but haven't had the time to dig in further (though it shouldn't be too bad, I think): irfansharif/cockroach@200408.cdc-ledger

What I think is happening here is that somehow ranges are not receiving closed timestamp updates, and as a result the rangefeeds established over those ranges are essentially wedged. I don't think the closedts subsystem is at fault (as in, there's nothing blocking timestamps from being closed out); I think the problem is more that the ranges somehow get "disconnected" from being notified about closed timestamp updates.
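
To spell out why that produces the latency curve above: as I understand it, the changefeed's frontier only advances to the minimum resolved timestamp across its constituent ranges, so a single wedged range stalls the whole changefeed, and end-to-end latency then grows by about a second per second, which is what the log shows. A toy model (made-up names, not the actual changefeed code):

package main

import "time"

// changefeedFrontier models why one stalled rangefeed stalls the entire
// changefeed: the frontier (and hence the resolved timestamps the test
// measures latency against) is the minimum resolved timestamp across all
// ranges, so once any one range stops advancing, "now - frontier" grows by
// ~1s per wall-clock second.
func changefeedFrontier(resolvedByRange map[string]time.Time) time.Time {
	var frontier time.Time
	for _, ts := range resolvedByRange {
		if frontier.IsZero() || ts.Before(frontier) {
			frontier = ts
		}
	}
	return frontier
}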


When a rangefeed is established over a given range, and the range splits, we have to re-establish rangefeeds on the split ranges.

// partialRangeFeed establishes a RangeFeed to the range specified by desc. It
// manages lifecycle events of the range in order to maintain the RangeFeed
// connection; this may involve instructing higher-level functions to retry
// this rangefeed, or subdividing the range further in the event of a split.
func (ds *DistSender) partialRangeFeed(
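
A heavily simplified sketch of that lifecycle, with made-up names (the real DistSender code is more involved; this is just the shape of it):

package main

import (
	"context"
	"errors"
	"time"
)

// Stand-ins for the real DistSender machinery, purely so the control flow
// below is self-contained; none of these names exist in the actual code.
var errRangeSplit = errors.New("range split")

type rangeFeedConn interface {
	// Run streams rangefeed events for a single range starting at startTS and
	// returns the last checkpointed timestamp plus the error that ended it.
	Run(ctx context.Context, rangeID int, startTS time.Time) (time.Time, error)
	// PostSplitRanges re-resolves the descriptors now covering the old range.
	PostSplitRanges(rangeID int) []int
}

// partialRangeFeedSketch models the lifecycle described in the doc comment
// above: keep one range's feed alive; on a split, subdivide and start one
// feed per post-split range; on other retryable errors, re-establish the feed
// on the same range, resuming from wherever the previous attempt got to.
func partialRangeFeedSketch(ctx context.Context, c rangeFeedConn, rangeID int, startTS time.Time) error {
	for {
		lastTS, err := c.Run(ctx, rangeID, startTS)
		if err == nil || ctx.Err() != nil {
			return err
		}
		startTS = lastTS // resume point for the next attempt
		if errors.Is(err, errRangeSplit) {
			for _, sub := range c.PostSplitRanges(rangeID) {
				go partialRangeFeedSketch(ctx, c, sub, startTS)
			}
			return nil
		}
		// Any other retryable error: loop and re-establish on the same range,
		// which triggers another catch up scan server-side.
	}
}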

As part of that "re-establishment", we run this catch up scan.

// runCatchupScan starts a catchup scan which will output entries for all
// recorded changes in the replica that are newer than the catchupTimestamp.
// This uses the iterator provided when the registration was originally created;
// after the scan completes, the iterator will be closed.
func (r *registration) runCatchupScan() error {
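
i.e. the catch up scan replays, in order, every MVCC revision in the range newer than the catch-up timestamp into the registration's event stream. Roughly this shape (toy types, not the actual registry code):

package main

import "time"

// revision is a toy stand-in for an MVCC key/value plus its write timestamp.
type revision struct {
	key, value string
	ts         time.Time
}

// runCatchupScanSketch models the doc comment above: emit every revision
// written after catchupTS, oldest first. On a large range this walks a lot of
// keys, which is where the ~3s figure below comes from.
func runCatchupScanSketch(revisions []revision, catchupTS time.Time, emit func(revision)) {
	for _, r := range revisions {
		if r.ts.After(catchupTS) {
			emit(r)
		}
	}
}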

In the test failure captured above, and in my early attempts at a repro, the end-to-end latency climb only started happening ~30m into the test run. I later found out this was a side effect of the bump in the default range size: with the max range size at 512 MiB, when a range (with a rangefeed established over it) was split, the catch up iterator took ~3s to iterate over all the keys. That's pretty much all it takes to cause the issue captured in this test, and is what I did in my repro branch.

// XXX: Manual hack for introducing catch up scan delay.
time.Sleep(3 * time.Second)

With the added lag, I observed multiple rangefeed registrations for the RHS range, happening over and over. Adding the following log:

log.Infof(ctx, "catch up scan took %s, ctx err %+v", duration.String(), ctx.Err())

I observed:

608:I200415 23:22:43.056213 15944013 kv/kvserver/replica_command.go:397  [n2,split,s2,r35/3:/Table/55{-/2}] initiating a split of this range at key /Table/55/1/546968607878479873 [r63] (512 MiB above threshold size 512 MiB)
...
I200415 23:22:46.432608 19803456 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.636059897s
I200415 23:22:49.042242 19828719 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.553311237s
I200415 23:22:50.660315 65 server/status/runtime.go:498  [n1] runtime stats: 5.3 GiB RSS, 457 goroutines, 579 MiB/206 MiB/820 MiB GO alloc/idle/total, 3.6 GiB/4.4 GiB CGO alloc/total, 1882176.9 CGO/sec, 929.5/83.7 %(u/s)time, 0.0 %gc (11x), 104 MiB/147 MiB (r/w)net
I200415 23:22:52.008701 19853487 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.845833073s
I200415 23:22:54.735336 19880845 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.544007908s
I200415 23:22:57.824849 19909792 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.669301398s
I200415 23:23:00.752082 65 server/status/runtime.go:498  [n1] runtime stats: 5.4 GiB RSS, 455 goroutines, 761 MiB/45 MiB/856 MiB GO alloc/idle/total, 3.6 GiB/4.4 GiB CGO alloc/total, 1730104.6 CGO/sec, 1043.3/92.5 %(u/s)time, 0.0 %gc (13x), 112 MiB/183 MiB (r/w)net
I200415 23:23:01.340857 19941182 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.766179052s
I200415 23:23:05.076929 19976088 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.74370687s
I200415 23:23:08.966996 20013126 kv/kvserver/rangefeed/registry.go:291  [n1,s1,r63/1:/Table/55/{1/54696…-2},rangefeed] catch up scan took 2.758871217s

r63 split off of r35 and kept retrying the catch up scan, which should have happened just the one time. I tabled the investigation here.
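
For my own notes, the degenerate loop I think those logs show, as a toy (hypothetical: I haven't pinned down what tears down each registration after its catch up scan completes):

package main

import (
	"fmt"
	"time"
)

// wedgedRHSSketch is a toy model of the r63 logs above: every re-registration
// repeats the full catch up scan from the same timestamp, the feed is torn
// down again before any checkpoint bumps that timestamp, and so the RHS never
// reaches a connected, caught-up state -- its resolved timestamp (and with it
// the changefeed's frontier) stops advancing.
func wedgedRHSSketch(catchupTS time.Time, scanDuration time.Duration, attempts int) {
	for i := 0; i < attempts; i++ {
		// Hypothetical: nothing checkpoints before the teardown, so catchupTS
		// is never advanced for the next attempt.
		fmt.Printf("attempt %d: catch up scan from %s took %s\n",
			i+1, catchupTS.Format(time.RFC3339), scanDuration)
	}
}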

@tbg
Member

tbg commented May 12, 2020

@ajwerner might this be the bug you've just fixed?

@tbg
Member

tbg commented May 12, 2020

@irfansharif if you have a cycle, it would be nice if you could kick off the repro after a rebase to see if the problem is still there. If not, we can assume this is closed.

@ajwerner
Contributor

Seems likely

@irfansharif
Contributor

I don't see this recurring on master; I'll take it to have been fixed by one of #48521 or #48561.
