roachtest: cdc/tpcc-1000/rangefeed=true failed #37810

Closed
cockroach-teamcity opened this issue May 25, 2019 · 13 comments · Fixed by #39064
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/fc7e48295cd05f94fd2883498d96d91ad538e559

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1308263&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1516,cdc.go:747,cdc.go:135,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1308263-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		l
		   1m32s       60           19.0            9.0  18253.6  34359.7  38654.7  38654.7 delivery
		   1m32s       60          159.0           76.6  27917.3  77309.4  77309.4  77309.4 newOrder
		   1m32s       60           14.0            9.4  19327.4  24696.1  34359.7  34359.7 orderStatus
		   1m32s       60          150.0           90.1  24696.1  45097.2  73014.4  77309.4 payment
		   1m32s       60            7.0            9.4  19327.4  33286.0  33286.0  33286.0 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   1m33s       60            9.0            9.0  19327.4  34359.7  34359.7  34359.7 delivery
		   1m33s       60          172.9           77.6  24696.1  62277.0  77309.4  77309.4 newOrder
		   1m33s       60           25.0            9.6  20401.1  38654.7  40802.2  40802.2 orderStatus
		   1m33s       60          158.9           90.9  24696.1  49392.1  57982.1  64424.5 payment
		   1m33s       60           18.0            9.5  20401.1  31138.5  31138.5  31138.5 stockLevel
		: signal: killed
	cluster.go:1875,cdc.go:223,cdc.go:475,test.go:1251: unexpected status: failed

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone May 25, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels May 25, 2019
@tbg
Member

tbg commented May 27, 2019

Hmm, not clear what happened here. TPCC got an error:

E190525 08:05:06.806037 1 workload/cli/run.go:428  error in newOrder: ERROR: communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 08006)
   1m28s       47            3.0            8.9  21474.8  42949.7  42949.7  42949.7 delivery
   1m28s       47           15.0           73.1  27917.3  55834.6  57982.1  57982.1 newOrder
   1m28s       47            0.0            9.4      0.0      0.0      0.0      0.0 orderStatus
   1m28s       47           42.0           88.4  28991.0  77309.4  85899.3  85899.3 payment
   1m28s       47            1.0            9.5  31138.5  31138.5  31138.5  31138.5 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   1m29s       47            9.0            8.9  23622.3  34359.7  34359.7  34359.7 delivery
   1m29s       47          164.8           74.1  30064.8  73014.4  73014.4  73014.4 newOrder
   1m29s       47            6.0            9.4  24696.1  36507.2  36507.2  36507.2 orderStatus
   1m29s       47          106.9           88.6  30064.8  68719.5  85899.3  85899.3 payment
   1m29s       47            7.0            9.4  30064.8  40802.2  40802.2  40802.2 stockLevel
E190525 08:05:08.904460 1 workload/cli/run.go:428  error in stockLevel: ERROR: rpc error: code = Unavailable desc = transport is closing (SQLSTATE XXUUU)

The workload runs with --tolerate-errors, though. I'm not sure why the job failed (which I think is what "unexpected status: failed" is telling me). It'd sure be nice if the actual job were logged in this case.

@danhhz
Contributor

danhhz commented May 28, 2019

There's a jobs.txt in the debug.zip in the artifacts, but it only says "context canceled". The error also gets logged with %+v, but sadly we didn't seem to wrap it anywhere (though I guess it doesn't really matter who caught the cancel so much as who did the cancel).

W190525 08:05:08.910140 839776 ccl/changefeedccl/changefeed_stmt.go:494  [n1] CHANGEFEED job 454495431970291713 returning with error: context canceled
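
(For illustration: a minimal sketch, not the actual changefeed code, of the kind of wrapping that would make the %+v output more useful here. It assumes pkg/errors-style wrapping, which records the wrap site so the formatted error shows where the cancellation was observed.)

package main

import (
	"context"
	"fmt"

	"github.com/pkg/errors" // assumption: pkg/errors-style wrapping
)

// runChangefeed is a hypothetical stand-in for the job's run loop.
func runChangefeed(ctx context.Context) error {
	<-ctx.Done()
	// Wrap the context error so %+v shows where the cancellation was
	// observed, not just the bare "context canceled" string.
	return errors.Wrap(ctx.Err(), "changefeed job returning")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	err := runChangefeed(ctx)
	fmt.Printf("%+v\n", err) // prints the wrap message plus a stack trace
}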

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/b83798cadfee6447d565688b58657843741f8a45

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1328389&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1513,cdc.go:744,cdc.go:132,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1328389-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		l
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		1h40m33s       22           28.0           20.4    604.0   1409.3   2147.5   2147.5 delivery
		1h40m33s       22          154.0          203.7    486.5    805.3    872.4    939.5 newOrder
		1h40m33s       22           21.0           20.4     48.2     96.5     96.5     96.5 orderStatus
		1h40m33s       22          135.0          204.0    352.3    704.6    805.3    872.4 payment
		1h40m33s       22           23.0           20.4    226.5    335.5    352.3    352.3 stockLevel
		1h40m34s       22           23.0           20.4    771.8   1073.7   1208.0   1208.0 delivery
		1h40m34s       22          166.0          203.7    872.4   1275.1   1744.8   1946.2 newOrder
		1h40m34s       22           25.0           20.4     60.8    100.7    109.1    109.1 orderStatus
		1h40m34s       22          216.0          204.0    637.5    973.1   1476.4   1610.6 payment
		1h40m34s       22           24.0           20.4    251.7    486.5    570.4    570.4 stockLevel
		: signal: killed
	cluster.go:1872,cdc.go:220,cdc.go:472,test.go:1248: unexpected status: failed

@danhhz
Contributor

danhhz commented Jun 7, 2019

Something got stuck and the changefeed got way behind. I'm fairly sure it never sent a resolved timestamp after the initial scan. The good news is that we eventually errored instead of OOMing.

failed, error: memory budget exceeded: 54085928 bytes requested, 1030196349 currently allocated, 1073741824 bytes in budget

The bad news is that kicking the closed timestamp didn't seem to help. These messages rotate through a few ranges for the entire duration of the test.

I190607 09:38:54.712904 314 storage/replica_rangefeed.go:509  [n1,s1,r1826/1:/Table/58/1/5{19/719-20/4682}] RangeFeed closed timestamp 1559896182.426224821,0 is behind by 1h9m12.286622718s
I190607 09:39:00.702253 314 storage/replica_rangefeed.go:509  [n1,s1,r701/1:/Table/58/1/5{5/49367-6/1883}] RangeFeed closed timestamp 1559896182.426224821,0 is behind by 1h9m18.275988702s
I190607 09:39:06.702856 314 storage/replica_rangefeed.go:509  [n1,s1,r2061/1:/Table/58/1/56{2/626…-3/632…}] RangeFeed closed timestamp 1559896182.426224821,0 is behind by 1h9m24.276590329s
I190607 09:39:12.704330 314 storage/replica_rangefeed.go:509  [n1,s1,r947/1:/Table/57/1/36{5/1/-…-6/9/-…}] RangeFeed closed timestamp 1559896182.426224821,0 is behind by 1h9m30.278049085s
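
(As background on the "memory budget exceeded" error above: a minimal sketch, assumed rather than taken from the changefeedccl buffer code, of a bounded memory account that fails an allocation instead of letting the buffer grow until the process OOMs.)

package main

import "fmt"

// memBudget is a hypothetical bounded memory account; the real buffer uses
// CockroachDB's memory monitor, this only illustrates the check.
type memBudget struct {
	allocated int64
	limit     int64
}

// grow reserves n bytes, or returns an error once the budget is exhausted.
func (b *memBudget) grow(n int64) error {
	if b.allocated+n > b.limit {
		return fmt.Errorf(
			"memory budget exceeded: %d bytes requested, %d currently allocated, %d bytes in budget",
			n, b.allocated, b.limit)
	}
	b.allocated += n
	return nil
}

func main() {
	b := &memBudget{allocated: 1030196349, limit: 1 << 30} // 1 GiB budget, as in the failure
	if err := b.grow(54085928); err != nil {
		fmt.Println(err) // mirrors the error the changefeed reported
	}
}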

@danhhz
Contributor

danhhz commented Jun 7, 2019

@tbg if you have a minute, I'd be curious what you make of why these ranges were so unhappy. There's a bunch of "handle raft ready"/"slow heartbeat took"/"heartbeat failed" messages in the same ranges right around when the changefeed starts at 07:58.

Maybe these have something to do with it? I don't know what they mean, though:

E190607 07:59:18.067200 977630 storage/replica_range_lease.go:385  [n1,s1,r826/1:/Table/58/1/14{7/631…-8/674…}] mismatch incrementing epoch for {NodeID:2 Epoch:1 Expiration:1559894350.035487800,0 Draining:false Decommissioning:false XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}; actual is {NodeID:2 Epoch:1 Expiration:1559894359.035516243,0 Draining:false Decommissioning:false XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}

@tbg
Member

tbg commented Jun 8, 2019

Hmmm, not sure unfortunately. I was looking at this:

I190607 07:58:42.031633 299 storage/queue.go:527  [n1,s1,r711/1:/Table/54/3/8{88/5/"…-91/7/"…}] rate limited in MaybeAdd (merge): context canceled
I190607 07:58:42.744367 967131 vendor/github.com/Shopify/sarama/client.go:722  [kafka-producer] client/metadata fetching metadata for [stock] from broker 10.142.0.174:9092
I190607 07:58:42.754047 893720 vendor/github.com/Shopify/sarama/async_producer.go:680  [kafka-producer] producer/broker/0 state change to [open] on stock/0
I190607 07:58:46.967075 835065 ccl/changefeedccl/changefeed_processors.go:620  [n1] job 458174908531441665 span /Table/53/1/{39/6/-2053-118/3/-78} is behind by 433303h58m46.96293892s
W190607 07:58:47.784714 264 storage/store.go:3740  [n1,s1,r2136/1:/Table/57/2/20{1/505…-7/285…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.785213 220 storage/store.go:3740  [n1,s1,r2663/1:/Table/58/1/89{2/258…-3/298…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.785259 268 storage/store.go:3740  [n1,s1,r354/1:/Table/58/1/6{4/69967-5/74578}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.841947 283 storage/store.go:3740  [n1,s1,r2186/1:/Table/57/2/22{3/637…-9/156…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.842732 179 storage/store.go:3740  [n1,s1,r839/1:/Table/57/1/30{5/9/-…-7/6/-…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.842802 211 storage/store.go:3740  [n1,s1,r2167/1:/Table/57/2/2{18/921…-23/637…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.843608 196 storage/store.go:3740  [n1,s1,r2083/1:/Table/57/2/18{3/959…-9/497…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.844170 248 storage/store.go:3740  [n1,s1,r2203/1:/Table/57/2/2{29/156…-34/698…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.879654 208 storage/store.go:3740  [n1,s1,r2607/1:/Table/57/2/54{2/835…-7/865…}] handle raft ready: 5.1s [processed=0]
W190607 07:58:47.879724 270 storage/store.go:3740  [n1,s1,r1889/1:/Table/58/1/55{2/692…-3/731…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.879799 182 storage/store.go:3740  [n1,s1,r1884/1:/Table/58/1/55{1/652…-2/692…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.879870 195 storage/store.go:3740  [n1,s1,r1439/1:/Table/57/1/5{89/2/-…-91/1/-…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.881179 213 storage/store.go:3740  [n1,s1,r1069/1:/Table/57/1/33{0/2/-…-1/10/…}] handle raft ready: 4.8s [processed=0]
W190607 07:58:47.881838 247 storage/store.go:3740  [n1,s1,r1601/1:/Table/58/1/53{4/596…-5/636…}] handle raft ready: 4.8s [processed=0]
W190607 07:58:47.882550 287 storage/store.go:3740  [n1,s1,r1574/1:/Table/58/1/53{3/557…-4/596…}] handle raft ready: 4.8s [processed=0]
I190607 07:58:50.043681 179 storage/queue.go:527  [n1,s1,r327/1:/Table/53/1/6{02/3/-…-39/2/-…}] rate limited in MaybeAdd (raftlog): context canceled

I also poked at these ranges you posted and I'm seeing the warnings, but not anything else that could tell me what's going on. The system generally seems pretty hosed, but I'm unsure why.

@nvanbenschoten look what I also saw (this is master):

W190607 07:59:09.727687 976231 storage/engine/mvcc.go:2230 [n1,s1,r482/1:/Table/54/3/{370/10…-402/1/…}] unable to find value for /Table/54/3/393/8/"=8\x0e\x0fןJD\xbcȇ\x95\xf8\x1a8\xc7"/0 ({ID:bb9e5e9c-8b21-41d2-b76a-34d10daa3111 Key:[191 137 247 2 4 136] Epoch:0 Timestamp:1559894313.120606573,0 Priority:49542 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})

expVal := intent.Status == roachpb.COMMITTED && intent.Txn.Epoch == 0
if !ok {
	if expVal {
		log.Warningf(ctx, "unable to find value for %s (%+v)",
			intent.Key, intent.Txn)
	}
	return false, nil
}

What do you make of this? Shouldn't that warning never fire?

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5c0b1644f9f9fe65bfb8cf3f7a5af2595bd859a8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1341218&tab=buildLog

The test failed on branch=master, cloud=gce:
	cdc.go:875,cdc.go:222,cdc.go:472,test.go:1248: max latency was more than allowed: 2h0m1.00980261s vs 10m0s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/056afc72397974ed9489bf11f16ac8e2fc281a67

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1341966&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1513,cdc.go:744,cdc.go:132,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1341966-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		     9.6  19327.4  27917.3  30064.8  30064.8 delivery
		    1m7s     2491          346.0           65.0  26843.5  49392.1  49392.1  49392.1 newOrder
		    1m7s     2491           19.0            9.6  19327.4  23622.3  23622.3  23622.3 orderStatus
		    1m7s     2491          185.0           87.2  25769.8  64424.5  64424.5  64424.5 payment
		    1m7s     2491           17.0            9.9  19327.4  25769.8  38654.7  38654.7 stockLevel
		E190617 07:55:41.096484 1 workload/cli/run.go:425  error in payment: dial tcp 10.142.0.30:26257: connect: connection refused
		    1m8s     2738           13.0            9.7   5637.1  21474.8  24696.1  24696.1 delivery
		    1m8s     2738          221.3           67.3  26843.5  51539.6  51539.6  51539.6 newOrder
		    1m8s     2738           18.0            9.7   4160.7  10737.4  13958.6  13958.6 orderStatus
		    1m8s     2738          115.2           87.6  21474.8  45097.2  66572.0  66572.0 payment
		    1m8s     2738           17.0           10.0  10200.5  30064.8  36507.2  36507.2 stockLevel
		: signal: killed
	cluster.go:1872,cdc.go:220,cdc.go:472,test.go:1248: unexpected node event: 2: dead
	cluster.go:1035,context.go:127,cluster.go:1024,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1872,cdc.go:220,cdc.go:472,test.go:1248: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1341966-cdc-tpcc-1000-rangefeed-true --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: dead
		3: 4446
		1: 4261
		Error:  2: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5f358ed804af05f8c4b404efc4d8a282d8e0916c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1361643&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,cdc.go:639,cdc.go:106,cdc.go:476,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1361643-cdc-tpcc-1000-rangefeed-true:4 -- yes | sudo apt-get -q install default-jre 2>&1 > logs/apt-get-install.log returned:
		stderr:
		
		stdout:
		E: Failed to fetch http://us-east1.gce.archive.ubuntu.com/ubuntu/pool/main/p/pango1.0/libpangocairo-1.0-0_1.38.1-1_amd64.deb  504  Gateway Time-out [IP: 35.196.128.168 80]
		
		E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.74.0.6_2019-06-26T19:04:56Z: exit status 100
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9322e07476de447799c5d3011eb2874930ee2993

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1375546&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190706-1375546/cdc/tpcc-1000/rangefeed=true/run_1
	cluster.go:1724,cdc.go:748,cdc.go:136,cluster.go:2067,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1562393890-12-n4cpu16:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		   8.3  31138.5  38654.7  42949.7  42949.7 delivery
		    1m6s     2356          298.8           55.1  40802.2  49392.1  49392.1  49392.1 newOrder
		    1m6s     2356           16.0            9.6  24696.1  36507.2  40802.2  40802.2 orderStatus
		    1m6s     2356          171.9           79.5  34359.7  62277.0  64424.5  64424.5 payment
		    1m6s     2356           17.0            9.6  36507.2  42949.7  42949.7  42949.7 stockLevel
		E190706 10:19:57.969098 1 workload/cli/run.go:427  error in newOrder: dial tcp 10.142.0.126:26257: connect: connection refused
		    1m7s     2504           10.0            8.4  28991.0  34359.7  34359.7  34359.7 delivery
		    1m7s     2504          317.9           59.0  36507.2  49392.1  49392.1  49392.1 newOrder
		    1m7s     2504           20.0            9.7  18253.6  32212.3  32212.3  32212.3 orderStatus
		    1m7s     2504          188.9           81.1  32212.3  47244.6  64424.5  64424.5 payment
		    1m7s     2504           21.0            9.7  21474.8  38654.7  45097.2  45097.2 stockLevel
		: signal: killed
	cluster.go:2088,cdc.go:224,cdc.go:476,test_runner.go:681: unexpected node event: 2: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/22d48caaa7d39efdcef7b3c87a99fc421e1473af

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1397412&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190720-1397412/cdc/tpcc-1000/rangefeed=true/run_1
	cluster.go:1726,cdc.go:748,cdc.go:136,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563603417-15-n4cpu16:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		.0          128.6  12884.9  18253.6  19327.4  19327.4 newOrder
		   1m36s     1442           16.0           16.4  13421.8  19327.4  19327.4  19327.4 orderStatus
		   1m36s     1442           62.0          162.5  14495.5  18253.6  19327.4  19327.4 payment
		   1m36s     1442            2.0           16.6  11811.2  12348.0  12348.0  12348.0 stockLevel
		E190720 09:14:34.829584 1 workload/cli/run.go:427  error in payment: dial tcp 10.128.0.87:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   1m37s     1611            0.0           16.4      0.0      0.0      0.0      0.0 delivery
		   1m37s     1611            7.0          127.4  15569.3  17179.9  17179.9  17179.9 newOrder
		   1m37s     1611            0.0           16.3      0.0      0.0      0.0      0.0 orderStatus
		   1m37s     1611            4.0          160.9  13958.6  18253.6  18253.6  18253.6 payment
		   1m37s     1611            2.0           16.5  15569.3  16643.0  16643.0  16643.0 stockLevel
		: signal: killed
	cluster.go:2090,cdc.go:224,cdc.go:476,test_runner.go:691: unexpected node event: 2: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7dab0dcfd37c389af357c302c073b9611b5ada25

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1398203&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190721-1398203/cdc/tpcc-1000/rangefeed=true/run_1
	cluster.go:1726,cdc.go:748,cdc.go:136,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563689854-13-n4cpu16:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		    17.2  17179.9  17179.9  17179.9  17179.9 delivery
		   2m14s     1460           15.0          148.7  16106.1  19327.4  20401.1  20401.1 newOrder
		   2m14s     1460            5.0           18.3  16106.1  18253.6  18253.6  18253.6 orderStatus
		   2m14s     1460           24.0          170.8  17179.9  19327.4  20401.1  20401.1 payment
		   2m14s     1460           11.0           17.4  17179.9  19327.4  19327.4  19327.4 stockLevel
		E190721 09:27:55.183492 1 workload/cli/run.go:427  error in payment: dial tcp 10.128.0.99:26257: connect: connection refused
		   2m15s     1618            1.0           17.1  15569.3  15569.3  15569.3  15569.3 delivery
		   2m15s     1618           41.0          147.9  18253.6  20401.1  21474.8  21474.8 newOrder
		   2m15s     1618            4.0           18.2  16106.1  18253.6  18253.6  18253.6 orderStatus
		   2m15s     1618           52.0          169.9  18253.6  20401.1  20401.1  20401.1 payment
		   2m15s     1618            9.0           17.3  16106.1  17179.9  17179.9  17179.9 stockLevel
		: signal: killed
	cluster.go:2090,cdc.go:224,cdc.go:476,test_runner.go:691: unexpected node event: 3: dead

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application

Fixes cockroachdb#39018.
Fixes cockroachdb#37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft
proposal to be applied multiple times at multiple applied indexes. The
bug was possible if a raft proposal was reproposed twice, once with the
same max lease index and once with a new max lease index. Because there
are two entries for the same proposal, one necessarily has to have an
invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430).
If these two entries happened to land in the same application batch on
the leaseholder then the first entry would be rejected and the second
would apply. The replica's LeaseAppliedIndex would then be bumped all the
way to the max lease index of the entry that applied. The bug occurred when
the first entry (which must have hit a proposalIllegalLeaseIndex) called into
tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be
equal to the Replica's LeaseAppliedIndex because it would match the index
in the successful entry. We would then repropose the proposal with a larger
lease applied index. This new entry could then apply and result in duplicate
entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it
caught this duplicate entry application and panicked loudly. Other tests might
also be failing because of it but might not have such obvious symptoms when
they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of
fixing an issue where two entries for the same command could be in the same
batch and only one would be linked to its ProposalData struct and be considered
locally proposed (see the change in retrieveLocalProposals). I believe that this
would prevent the command from being properly acknowledged if the first entry
was rejected due to its max lease index and the second entry had a larger max
lease index and applied. I think the first entry would have eventually hit the
check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData
had a new MaxLeaseIndex so it would have added it back to the proposal map, but
then it would have had to wait for refreshProposalsLocked to refresh the
proposal, at which point this refresh would have hit a lease index error and
would be reproposed at a higher index. Not only could this cause duplicate
versions of the same command to apply (described above), but I think this could
even loop forever without acknowledging the client. It seems like there should
be a way for this to cause cockroachdb#39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out
these interfaces in cockroachdb#38954. I'm working on that, but I don't think it should
hold up the next alpha (cockroachdb#39036). However, this commit does address a TODO to
properly handle errors during tryReproposeWithNewLeaseIndex reproposals and
that is properly tested.

Release note: None
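
(To make the failure mode concrete: a deliberately simplified sketch, with made-up names rather than the actual replica application code, of how two entries for the same proposal (one with a stale max lease index, one reproposed with a fresh one) can both get past a check that only compares against the replica's LeaseAppliedIndex.)

package main

import "fmt"

// proposal is a stripped-down stand-in for a Raft proposal carrying a max
// lease index (MLI).
type proposal struct {
	id  string
	mli uint64
}

// replica tracks only the highest lease index it has applied.
type replica struct {
	leaseAppliedIndex uint64
	applied           map[string]int // times each command has applied
}

// apply mimics the invariant: entries whose MLI is not above the replica's
// LeaseAppliedIndex are rejected; otherwise they apply and bump the index.
// Nothing here remembers *which* commands already applied, which is what
// lets a reproposal of an already-applied command apply a second time.
func (r *replica) apply(p proposal) {
	if p.mli <= r.leaseAppliedIndex {
		fmt.Printf("%s rejected (mli %d <= applied %d)\n", p.id, p.mli, r.leaseAppliedIndex)
		return
	}
	r.leaseAppliedIndex = p.mli
	r.applied[p.id]++
	fmt.Printf("%s applied (count=%d)\n", p.id, r.applied[p.id])
}

func main() {
	r := &replica{leaseAppliedIndex: 5, applied: map[string]int{}}

	// Two entries for command "a" land in the same application batch:
	r.apply(proposal{id: "a", mli: 5}) // stale max lease index: rejected
	r.apply(proposal{id: "a", mli: 6}) // valid: applies, index bumps to 6

	// The rejection of the first entry triggers a reproposal with a new,
	// larger max lease index, and the same command applies again.
	r.apply(proposal{id: "a", mli: 7}) // duplicate application
}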
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86eab2ff0a1a4c2d9b5f7e7a45deda74c98c6c37

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1402541&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190724-1402541/cdc/tpcc-1000/rangefeed=true/run_1
	cluster.go:1726,cdc.go:748,cdc.go:136,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563949032-13-n4cpu16:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		   0.0 newOrder
		   1m10s     1097            0.0           12.2      0.0      0.0      0.0      0.0 orderStatus
		   1m10s     1097            0.0          121.2      0.0      0.0      0.0      0.0 payment
		   1m10s     1097            0.0           12.2      0.0      0.0      0.0      0.0 stockLevel
		E190724 09:32:48.415002 1 workload/cli/run.go:427  error in payment: dial tcp 10.128.0.124:26257: connect: connection refused
		   1m11s     1441            0.0           11.9      0.0      0.0      0.0      0.0 delivery
		   1m11s     1441            0.0           89.3      0.0      0.0      0.0      0.0 newOrder
		   1m11s     1441            0.0           12.0      0.0      0.0      0.0      0.0 orderStatus
		   1m11s     1441            0.0          119.5      0.0      0.0      0.0      0.0 payment
		   1m11s     1441            0.0           12.0      0.0      0.0      0.0      0.0 stockLevel
		E190724 09:32:49.478652 1 workload/cli/run.go:427  error in orderStatus: dial tcp 10.128.0.124:26257: connect: connection refused
		: signal: killed
	cluster.go:2090,cdc.go:224,cdc.go:476,test_runner.go:691: unexpected node event: 2: dead

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application

Fixes cockroachdb#39018.
Fixes cockroachdb#37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft
proposal to be applied multiple times at multiple applied indexes. The
bug was possible if a raft proposal was reproposed twice, once with the
same max lease index and once with a new max lease index. Because there
are two entries for the same proposal, one necessarily has to have an
invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430).
If these two entries happened to land in the same application batch on
the leaseholder then the first entry would be rejected and the second
would apply. The replica's LeaseAppliedIndex would then be bumped all the
way to the max lease index of the entry that applied. The bug occurred when
the first entry (which must have hit a proposalIllegalLeaseIndex) called into
tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be
equal to the Replica's LeaseAppliedIndex because it would match the index
in the successful entry. We would then repropose the proposal with a larger
lease applied index. This new entry could then apply and result in duplicate
entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it
caught this duplicate entry application and panicked loudly. Other tests might
also be failing because of it but might not have such obvious symptoms when
they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of
fixing an issue where two entries for the same command could be in the same
batch and only one would be linked to its ProposalData struct and be considered
locally proposed (see the change in retrieveLocalProposals). I believe that this
would prevent the command from being properly acknowledged if the first entry
was rejected due to its max lease index and the second entry had a larger max
lease index and applied. I think the first entry would have eventually hit the
check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData
had a new MaxLeaseIndex so it would have added it back to the proposal map, but
then it would have had to wait for refreshProposalsLocked to refresh the
proposal, at which point this refresh would have hit a lease index error and
would be reproposed at a higher index. Not only could this cause duplicate
versions of the same command to apply (described above), but I think this could
even loop forever without acknowledging the client. It seems like there should
be a way for this to cause cockroachdb#39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out
these interfaces in cockroachdb#38954. I'm working on that, but I don't think it should
hold up the next alpha (cockroachdb#39036). However, this commit does address a TODO to
properly handle errors during tryReproposeWithNewLeaseIndex reproposals and
that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10
`cdc/ledger/rangefeed=true` roachtests after reducing its duration down to
5m. Usually, at least one of these tests would hit the `negative refcount`
assertion. I then incrementally added more and more logging around entry
application until I painted a full picture of which logical ops were being
consumed by the rangefeed processor and why the same raft command was being
applied twice (once it became clear that one was). After a few more rounds
of fine-tuning the logging, the interaction with reproposals with a new max
lease index became clear.

Release note: None
craig bot pushed a commit that referenced this issue Jul 24, 2019
39064: storage: prevent command reproposal with new lease index after application r=nvanbenschoten a=nvanbenschoten

Fixes #39018.
Fixes #37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430). If these two entries happened to land in the same application batch on the leaseholder then the first entry would be rejected and the second would apply. The replica's LeaseAppliedIndex would then be bumped all the way to the max lease index of the entry that applied. The bug occurred when the first entry (which must have hit a proposalIllegalLeaseIndex) called into tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be equal to the Replica's LeaseAppliedIndex because it would match the index in the successful entry. We would then repropose the proposal with a larger lease applied index. This new entry could then apply and result in duplicate entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of it but might not have as obvious symptoms when they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch and only one would be linked to its ProposalData struct and be considered locally proposed (see the change in retrieveLocalProposals). I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would have eventually hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex so it would have added it back to the proposal map, but then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point this refresh would have hit a lease index error and would be reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think this could even loop forever without acknowledging the client. It seems like there should be a way for this to cause #39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out these interfaces in #38954. I'm working on that, but I don't think it should hold up the next alpha (#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals and that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10 `cdc/ledger/rangefeed=true` roachtests after reducing its duration down to 5m. Usually, at least one of these tests would hit the `negative refcount` assertion. I then incrementally added more and more logging around entry application until I painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction with reproposals with a new max lease index became clear.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in #39064 Jul 24, 2019