roachtest: cdc/tpcc-1000/rangefeed=true failed #37810

Closed
cockroach-teamcity opened this issue May 25, 2019 · 13 comments · Fixed by #39064
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/fc7e48295cd05f94fd2883498d96d91ad538e559

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1308263&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1516,cdc.go:747,cdc.go:135,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1308263-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		l
		   1m32s       60           19.0            9.0  18253.6  34359.7  38654.7  38654.7 delivery
		   1m32s       60          159.0           76.6  27917.3  77309.4  77309.4  77309.4 newOrder
		   1m32s       60           14.0            9.4  19327.4  24696.1  34359.7  34359.7 orderStatus
		   1m32s       60          150.0           90.1  24696.1  45097.2  73014.4  77309.4 payment
		   1m32s       60            7.0            9.4  19327.4  33286.0  33286.0  33286.0 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   1m33s       60            9.0            9.0  19327.4  34359.7  34359.7  34359.7 delivery
		   1m33s       60          172.9           77.6  24696.1  62277.0  77309.4  77309.4 newOrder
		   1m33s       60           25.0            9.6  20401.1  38654.7  40802.2  40802.2 orderStatus
		   1m33s       60          158.9           90.9  24696.1  49392.1  57982.1  64424.5 payment
		   1m33s       60           18.0            9.5  20401.1  31138.5  31138.5  31138.5 stockLevel
		: signal: killed
	cluster.go:1875,cdc.go:223,cdc.go:475,test.go:1251: unexpected status: failed

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone May 25, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels May 25, 2019
@tbg
Member

tbg commented May 27, 2019

Hmm, not clear what happened here. TPCC got an error:

E190525 08:05:06.806037 1 workload/cli/run.go:428  error in newOrder: ERROR: communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 08006)
   1m28s       47            3.0            8.9  21474.8  42949.7  42949.7  42949.7 delivery
   1m28s       47           15.0           73.1  27917.3  55834.6  57982.1  57982.1 newOrder
   1m28s       47            0.0            9.4      0.0      0.0      0.0      0.0 orderStatus
   1m28s       47           42.0           88.4  28991.0  77309.4  85899.3  85899.3 payment
   1m28s       47            1.0            9.5  31138.5  31138.5  31138.5  31138.5 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   1m29s       47            9.0            8.9  23622.3  34359.7  34359.7  34359.7 delivery
   1m29s       47          164.8           74.1  30064.8  73014.4  73014.4  73014.4 newOrder
   1m29s       47            6.0            9.4  24696.1  36507.2  36507.2  36507.2 orderStatus
   1m29s       47          106.9           88.6  30064.8  68719.5  85899.3  85899.3 payment
   1m29s       47            7.0            9.4  30064.8  40802.2  40802.2  40802.2 stockLevel
E190525 08:05:08.904460 1 workload/cli/run.go:428  error in stockLevel: ERROR: rpc error: code = Unavailable desc = transport is closing (SQLSTATE XXUUU)

The workload runs with --tolerate-errors, though. I'm not sure why the job failed (which I think is what "unexpected status: failed" is telling me). It'd sure be nice if the actual job were logged in this case.

@danhhz
Contributor

danhhz commented May 28, 2019

There's a jobs.txt in the debug.zip in the artifacts, but it only says "context canceled". The error also gets logged with %+v, but sadly we didn't seem to wrap it anywhere (though I guess it doesn't really matter who caught the cancel so much as who did the cancel).

W190525 08:05:08.910140 839776 ccl/changefeedccl/changefeed_stmt.go:494  [n1] CHANGEFEED job 454495431970291713 returning with error: context canceled
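
(For illustration: a minimal sketch, not the actual changefeed code, of the kind of wrapping that would make the %+v output more useful here. It assumes pkg/errors-style wrapping, which records the wrap site so the formatted error shows where the cancellation was observed.)

package main

import (
	"context"
	"fmt"

	"github.com/pkg/errors" // assumption: pkg/errors-style wrapping
)

// runChangefeed is a hypothetical stand-in for the job's run loop.
func runChangefeed(ctx context.Context) error {
	<-ctx.Done()
	// Wrap the context error so %+v shows where the cancellation was
	// observed, not just the bare "context canceled" string.
	return errors.Wrap(ctx.Err(), "changefeed job returning")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	err := runChangefeed(ctx)
	fmt.Printf("%+v\n", err) // prints the wrap message plus a stack trace
}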

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/b83798cadfee6447d565688b58657843741f8a45

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1328389&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1513,cdc.go:744,cdc.go:132,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1328389-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		l
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		1h40m33s       22           28.0           20.4    604.0   1409.3   2147.5   2147.5 delivery
		1h40m33s       22          154.0          203.7    486.5    805.3    872.4    939.5 newOrder
		1h40m33s       22           21.0           20.4     48.2     96.5     96.5     96.5 orderStatus
		1h40m33s       22          135.0          204.0    352.3    704.6    805.3    872.4 payment
		1h40m33s       22           23.0           20.4    226.5    335.5    352.3    352.3 stockLevel
		1h40m34s       22           23.0           20.4    771.8   1073.7   1208.0   1208.0 delivery
		1h40m34s       22          166.0          203.7    872.4   1275.1   1744.8   1946.2 newOrder
		1h40m34s       22           25.0           20.4     60.8    100.7    109.1    109.1 orderStatus
		1h40m34s       22          216.0          204.0    637.5    973.1   1476.4   1610.6 payment
		1h40m34s       22           24.0           20.4    251.7    486.5    570.4    570.4 stockLevel
		: signal: killed
	cluster.go:1872,cdc.go:220,cdc.go:472,test.go:1248: unexpected status: failed

@danhhz
Contributor

danhhz commented Jun 7, 2019

Something got stuck and the changefeed got way behind. I'm fairly sure it never sent a resolved timestamp after the initial scan. The good news is that we eventually errored instead of OOMing.

failed, error: memory budget exceeded: 54085928 bytes requested, 1030196349 currently allocated, 1073741824 bytes in budget

The bad news is that kicking the closed timestamp didn't seem to help. These messages rotate through a few ranges for the entire duration of the test.

I190607 09:38:54.712904 314 storage/replica_rangefeed.go:509  [n1,s1,r1826/1:/Table/58/1/5{19/719-20/4682}] RangeFeed closed timestamp 1559896182.426224821,0 is behind by 1h9m12.286622718s
I190607 09:39:00.702253 314 storage/replica_rangefeed.go:509  [n1,s1,r701/1:/Table/58/1/5{5/49367-6/1883}] RangeFeed closed timestamp 1559896182.426224821,0 is behind by 1h9m18.275988702s
I190607 09:39:06.702856 314 storage/replica_rangefeed.go:509  [n1,s1,r2061/1:/Table/58/1/56{2/626…-3/632…}] RangeFeed closed timestamp 1559896182.426224821,0 is behind by 1h9m24.276590329s
I190607 09:39:12.704330 314 storage/replica_rangefeed.go:509  [n1,s1,r947/1:/Table/57/1/36{5/1/-…-6/9/-…}] RangeFeed closed timestamp 1559896182.426224821,0 is behind by 1h9m30.278049085s
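
(As background on the "memory budget exceeded" error above: a minimal sketch, assumed rather than taken from the changefeedccl buffer code, of a bounded memory account that fails an allocation instead of letting the buffer grow until the process OOMs.)

package main

import "fmt"

// memBudget is a hypothetical bounded memory account; the real buffer uses
// CockroachDB's memory monitor, this only illustrates the check.
type memBudget struct {
	allocated int64
	limit     int64
}

// grow reserves n bytes, or returns an error once the budget is exhausted.
func (b *memBudget) grow(n int64) error {
	if b.allocated+n > b.limit {
		return fmt.Errorf(
			"memory budget exceeded: %d bytes requested, %d currently allocated, %d bytes in budget",
			n, b.allocated, b.limit)
	}
	b.allocated += n
	return nil
}

func main() {
	b := &memBudget{allocated: 1030196349, limit: 1 << 30} // 1 GiB budget, as in the failure
	if err := b.grow(54085928); err != nil {
		fmt.Println(err) // mirrors the error the changefeed reported
	}
}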

@danhhz
Contributor

danhhz commented Jun 7, 2019

@tbg if you have a minute, I'd be curious what you make of why these ranges were so unhappy. There's a bunch of "handle raft ready"/"slow heartbeat took"/"heartbeat failed" messages in the same ranges right around when the changefeed starts at 07:58.

Maybe these have something to do with it? I don't know what they mean, though:

E190607 07:59:18.067200 977630 storage/replica_range_lease.go:385  [n1,s1,r826/1:/Table/58/1/14{7/631…-8/674…}] mismatch incrementing epoch for {NodeID:2 Epoch:1 Expiration:1559894350.035487800,0 Draining:false Decommissioning:false XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}; actual is {NodeID:2 Epoch:1 Expiration:1559894359.035516243,0 Draining:false Decommissioning:false XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}

@tbg
Member

tbg commented Jun 8, 2019

Hmmm, not sure unfortunately. I was looking at this:

I190607 07:58:42.031633 299 storage/queue.go:527  [n1,s1,r711/1:/Table/54/3/8{88/5/"…-91/7/"…}] rate limited in MaybeAdd (merge): context canceled
I190607 07:58:42.744367 967131 vendor/github.com/Shopify/sarama/client.go:722  [kafka-producer] client/metadata fetching metadata for [stock] from broker 10.142.0.174:9092
I190607 07:58:42.754047 893720 vendor/github.com/Shopify/sarama/async_producer.go:680  [kafka-producer] producer/broker/0 state change to [open] on stock/0
I190607 07:58:46.967075 835065 ccl/changefeedccl/changefeed_processors.go:620  [n1] job 458174908531441665 span /Table/53/1/{39/6/-2053-118/3/-78} is behind by 433303h58m46.96293892s
W190607 07:58:47.784714 264 storage/store.go:3740  [n1,s1,r2136/1:/Table/57/2/20{1/505…-7/285…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.785213 220 storage/store.go:3740  [n1,s1,r2663/1:/Table/58/1/89{2/258…-3/298…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.785259 268 storage/store.go:3740  [n1,s1,r354/1:/Table/58/1/6{4/69967-5/74578}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.841947 283 storage/store.go:3740  [n1,s1,r2186/1:/Table/57/2/22{3/637…-9/156…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.842732 179 storage/store.go:3740  [n1,s1,r839/1:/Table/57/1/30{5/9/-…-7/6/-…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.842802 211 storage/store.go:3740  [n1,s1,r2167/1:/Table/57/2/2{18/921…-23/637…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.843608 196 storage/store.go:3740  [n1,s1,r2083/1:/Table/57/2/18{3/959…-9/497…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.844170 248 storage/store.go:3740  [n1,s1,r2203/1:/Table/57/2/2{29/156…-34/698…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.879654 208 storage/store.go:3740  [n1,s1,r2607/1:/Table/57/2/54{2/835…-7/865…}] handle raft ready: 5.1s [processed=0]
W190607 07:58:47.879724 270 storage/store.go:3740  [n1,s1,r1889/1:/Table/58/1/55{2/692…-3/731…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.879799 182 storage/store.go:3740  [n1,s1,r1884/1:/Table/58/1/55{1/652…-2/692…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.879870 195 storage/store.go:3740  [n1,s1,r1439/1:/Table/57/1/5{89/2/-…-91/1/-…}] handle raft ready: 5.0s [processed=0]
W190607 07:58:47.881179 213 storage/store.go:3740  [n1,s1,r1069/1:/Table/57/1/33{0/2/-…-1/10/…}] handle raft ready: 4.8s [processed=0]
W190607 07:58:47.881838 247 storage/store.go:3740  [n1,s1,r1601/1:/Table/58/1/53{4/596…-5/636…}] handle raft ready: 4.8s [processed=0]
W190607 07:58:47.882550 287 storage/store.go:3740  [n1,s1,r1574/1:/Table/58/1/53{3/557…-4/596…}] handle raft ready: 4.8s [processed=0]
I190607 07:58:50.043681 179 storage/queue.go:527  [n1,s1,r327/1:/Table/53/1/6{02/3/-…-39/2/-…}] rate limited in MaybeAdd (raftlog): context canceled

I also poked at these ranges you posted and I'm seeing the warnings, but not anything else that could tell me what's going on. The system generally seems pretty hosed, but I'm unsure why.

@nvanbenschoten look what I also saw (this is master):

W190607 07:59:09.727687 976231 storage/engine/mvcc.go:2230 [n1,s1,r482/1:/Table/54/3/{370/10…-402/1/…}] unable to find value for /Table/54/3/393/8/"=8\x0e\x0fןJD\xbcȇ\x95\xf8\x1a8\xc7"/0 ({ID:bb9e5e9c-8b21-41d2-b76a-34d10daa3111 Key:[191 137 247 2 4 136] Epoch:0 Timestamp:1559894313.120606573,0 Priority:49542 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})

expVal := intent.Status == roachpb.COMMITTED && intent.Txn.Epoch == 0
if !ok {
	if expVal {
		log.Warningf(ctx, "unable to find value for %s (%+v)",
			intent.Key, intent.Txn)
	}
	return false, nil
}

What do you make of this? Shouldn't that warning never fire?

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5c0b1644f9f9fe65bfb8cf3f7a5af2595bd859a8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1341218&tab=buildLog

The test failed on branch=master, cloud=gce:
	cdc.go:875,cdc.go:222,cdc.go:472,test.go:1248: max latency was more than allowed: 2h0m1.00980261s vs 10m0s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/056afc72397974ed9489bf11f16ac8e2fc281a67

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1341966&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1513,cdc.go:744,cdc.go:132,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1341966-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		     9.6  19327.4  27917.3  30064.8  30064.8 delivery
		    1m7s     2491          346.0           65.0  26843.5  49392.1  49392.1  49392.1 newOrder
		    1m7s     2491           19.0            9.6  19327.4  23622.3  23622.3  23622.3 orderStatus
		    1m7s     2491          185.0           87.2  25769.8  64424.5  64424.5  64424.5 payment
		    1m7s     2491           17.0            9.9  19327.4  25769.8  38654.7  38654.7 stockLevel
		E190617 07:55:41.096484 1 workload/cli/run.go:425  error in payment: dial tcp 10.142.0.30:26257: connect: connection refused
		    1m8s     2738           13.0            9.7   5637.1  21474.8  24696.1  24696.1 delivery
		    1m8s     2738          221.3           67.3  26843.5  51539.6  51539.6  51539.6 newOrder
		    1m8s     2738           18.0            9.7   4160.7  10737.4  13958.6  13958.6 orderStatus
		    1m8s     2738          115.2           87.6  21474.8  45097.2  66572.0  66572.0 payment
		    1m8s     2738           17.0           10.0  10200.5  30064.8  36507.2  36507.2 stockLevel
		: signal: killed
	cluster.go:1872,cdc.go:220,cdc.go:472,test.go:1248: unexpected node event: 2: dead
	cluster.go:1035,context.go:127,cluster.go:1024,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1872,cdc.go:220,cdc.go:472,test.go:1248: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1341966-cdc-tpcc-1000-rangefeed-true --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: dead
		3: 4446
		1: 4261
		Error:  2: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5f358ed804af05f8c4b404efc4d8a282d8e0916c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1361643&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,cdc.go:639,cdc.go:106,cdc.go:476,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1361643-cdc-tpcc-1000-rangefeed-true:4 -- yes | sudo apt-get -q install default-jre 2>&1 > logs/apt-get-install.log returned:
		stderr:
		
		stdout:
		E: Failed to fetch http://us-east1.gce.archive.ubuntu.com/ubuntu/pool/main/p/pango1.0/libpangocairo-1.0-0_1.38.1-1_amd64.deb  504  Gateway Time-out [IP: 35.196.128.168 80]
		
		E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.74.0.6_2019-06-26T19:04:56Z: exit status 100
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9322e07476de447799c5d3011eb2874930ee2993

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1375546&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190706-1375546/cdc/tpcc-1000/rangefeed=true/run_1
	cluster.go:1724,cdc.go:748,cdc.go:136,cluster.go:2067,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1562393890-12-n4cpu16:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		   8.3  31138.5  38654.7  42949.7  42949.7 delivery
		    1m6s     2356          298.8           55.1  40802.2  49392.1  49392.1  49392.1 newOrder
		    1m6s     2356           16.0            9.6  24696.1  36507.2  40802.2  40802.2 orderStatus
		    1m6s     2356          171.9           79.5  34359.7  62277.0  64424.5  64424.5 payment
		    1m6s     2356           17.0            9.6  36507.2  42949.7  42949.7  42949.7 stockLevel
		E190706 10:19:57.969098 1 workload/cli/run.go:427  error in newOrder: dial tcp 10.142.0.126:26257: connect: connection refused
		    1m7s     2504           10.0            8.4  28991.0  34359.7  34359.7  34359.7 delivery
		    1m7s     2504          317.9           59.0  36507.2  49392.1  49392.1  49392.1 newOrder
		    1m7s     2504           20.0            9.7  18253.6  32212.3  32212.3  32212.3 orderStatus
		    1m7s     2504          188.9           81.1  32212.3  47244.6  64424.5  64424.5 payment
		    1m7s     2504           21.0            9.7  21474.8  38654.7  45097.2  45097.2 stockLevel
		: signal: killed
	cluster.go:2088,cdc.go:224,cdc.go:476,test_runner.go:681: unexpected node event: 2: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/22d48caaa7d39efdcef7b3c87a99fc421e1473af

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1397412&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190720-1397412/cdc/tpcc-1000/rangefeed=true/run_1
	cluster.go:1726,cdc.go:748,cdc.go:136,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563603417-15-n4cpu16:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		.0          128.6  12884.9  18253.6  19327.4  19327.4 newOrder
		   1m36s     1442           16.0           16.4  13421.8  19327.4  19327.4  19327.4 orderStatus
		   1m36s     1442           62.0          162.5  14495.5  18253.6  19327.4  19327.4 payment
		   1m36s     1442            2.0           16.6  11811.2  12348.0  12348.0  12348.0 stockLevel
		E190720 09:14:34.829584 1 workload/cli/run.go:427  error in payment: dial tcp 10.128.0.87:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   1m37s     1611            0.0           16.4      0.0      0.0      0.0      0.0 delivery
		   1m37s     1611            7.0          127.4  15569.3  17179.9  17179.9  17179.9 newOrder
		   1m37s     1611            0.0           16.3      0.0      0.0      0.0      0.0 orderStatus
		   1m37s     1611            4.0          160.9  13958.6  18253.6  18253.6  18253.6 payment
		   1m37s     1611            2.0           16.5  15569.3  16643.0  16643.0  16643.0 stockLevel
		: signal: killed
	cluster.go:2090,cdc.go:224,cdc.go:476,test_runner.go:691: unexpected node event: 2: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7dab0dcfd37c389af357c302c073b9611b5ada25

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1398203&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190721-1398203/cdc/tpcc-1000/rangefeed=true/run_1
	cluster.go:1726,cdc.go:748,cdc.go:136,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563689854-13-n4cpu16:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		    17.2  17179.9  17179.9  17179.9  17179.9 delivery
		   2m14s     1460           15.0          148.7  16106.1  19327.4  20401.1  20401.1 newOrder
		   2m14s     1460            5.0           18.3  16106.1  18253.6  18253.6  18253.6 orderStatus
		   2m14s     1460           24.0          170.8  17179.9  19327.4  20401.1  20401.1 payment
		   2m14s     1460           11.0           17.4  17179.9  19327.4  19327.4  19327.4 stockLevel
		E190721 09:27:55.183492 1 workload/cli/run.go:427  error in payment: dial tcp 10.128.0.99:26257: connect: connection refused
		   2m15s     1618            1.0           17.1  15569.3  15569.3  15569.3  15569.3 delivery
		   2m15s     1618           41.0          147.9  18253.6  20401.1  21474.8  21474.8 newOrder
		   2m15s     1618            4.0           18.2  16106.1  18253.6  18253.6  18253.6 orderStatus
		   2m15s     1618           52.0          169.9  18253.6  20401.1  20401.1  20401.1 payment
		   2m15s     1618            9.0           17.3  16106.1  17179.9  17179.9  17179.9 stockLevel
		: signal: killed
	cluster.go:2090,cdc.go:224,cdc.go:476,test_runner.go:691: unexpected node event: 3: dead

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application

Fixes cockroachdb#39018.
Fixes cockroachdb#37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft
proposal to be applied multiple times at multiple applied indexes. The
bug was possible if a raft proposal was reproposed twice, once with the
same max lease index and once with a new max lease index. Because there
are two entries for the same proposal, one necessarily has to have an
invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430).
If these two entries happened to land in the same application batch on
the leaseholder then the first entry would be rejected and the second
would apply. The replica's LeaseAppliedIndex would then be bumped all the
way to the max lease index of the entry that applied. The bug occurred when
the first entry (which must have hit a proposalIllegalLeaseIndex) called into
tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be
equal to the Replica's LeaseAppliedIndex because it would match the index
in the successful entry. We would then repropose the proposal with a larger
lease applied index. This new entry could then apply and result in duplicate
entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it
caught this duplicate entry application and panicked loudly. Other tests might
also be failing because of it but might not have such obvious symptoms when
they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of
fixing an issue where two entries for the same command could be in the same
batch and only one would be linked to its ProposalData struct and be considered
locally proposed (see the change in retrieveLocalProposals). I believe that this
would prevent the command from being properly acknowledged if the first entry
was rejected due to its max lease index and the second entry had a larger max
lease index and applied. I think the first entry would have eventually hit the
check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData
had a new MaxLeaseIndex so it would have added it back to the proposal map, but
then it would have had to wait for refreshProposalsLocked to refresh the
proposal, at which point this refresh would have hit a lease index error and
would be reproposed at a higher index. Not only could this cause duplicate
versions of the same command to apply (described above), but I think this could
even loop forever without acknowledging the client. It seems like there should
be a way for this to cause cockroachdb#39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out
these interfaces in cockroachdb#38954. I'm working on that, but I don't think it should
hold up the next alpha (cockroachdb#39036). However, this commit does address a TODO to
properly handle errors during tryReproposeWithNewLeaseIndex reproposals and
that is properly tested.

Release note: None
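
(To make the failure mode concrete: a deliberately simplified sketch, with made-up names rather than the actual replica application code, of how two entries for the same proposal (one with a stale max lease index, one reproposed with a fresh one) can both get past a check that only compares against the replica's LeaseAppliedIndex.)

package main

import "fmt"

// proposal is a stripped-down stand-in for a Raft proposal carrying a max
// lease index (MLI).
type proposal struct {
	id  string
	mli uint64
}

// replica tracks only the highest lease index it has applied.
type replica struct {
	leaseAppliedIndex uint64
	applied           map[string]int // times each command has applied
}

// apply mimics the invariant: entries whose MLI is not above the replica's
// LeaseAppliedIndex are rejected; otherwise they apply and bump the index.
// Nothing here remembers *which* commands already applied, which is what
// lets a reproposal of an already-applied command apply a second time.
func (r *replica) apply(p proposal) {
	if p.mli <= r.leaseAppliedIndex {
		fmt.Printf("%s rejected (mli %d <= applied %d)\n", p.id, p.mli, r.leaseAppliedIndex)
		return
	}
	r.leaseAppliedIndex = p.mli
	r.applied[p.id]++
	fmt.Printf("%s applied (count=%d)\n", p.id, r.applied[p.id])
}

func main() {
	r := &replica{leaseAppliedIndex: 5, applied: map[string]int{}}

	// Two entries for command "a" land in the same application batch:
	r.apply(proposal{id: "a", mli: 5}) // stale max lease index: rejected
	r.apply(proposal{id: "a", mli: 6}) // valid: applies, index bumps to 6

	// The rejection of the first entry triggers a reproposal with a new,
	// larger max lease index, and the same command applies again.
	r.apply(proposal{id: "a", mli: 7}) // duplicate application
}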
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86eab2ff0a1a4c2d9b5f7e7a45deda74c98c6c37

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1402541&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190724-1402541/cdc/tpcc-1000/rangefeed=true/run_1
	cluster.go:1726,cdc.go:748,cdc.go:136,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563949032-13-n4cpu16:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		   0.0 newOrder
		   1m10s     1097            0.0           12.2      0.0      0.0      0.0      0.0 orderStatus
		   1m10s     1097            0.0          121.2      0.0      0.0      0.0      0.0 payment
		   1m10s     1097            0.0           12.2      0.0      0.0      0.0      0.0 stockLevel
		E190724 09:32:48.415002 1 workload/cli/run.go:427  error in payment: dial tcp 10.128.0.124:26257: connect: connection refused
		   1m11s     1441            0.0           11.9      0.0      0.0      0.0      0.0 delivery
		   1m11s     1441            0.0           89.3      0.0      0.0      0.0      0.0 newOrder
		   1m11s     1441            0.0           12.0      0.0      0.0      0.0      0.0 orderStatus
		   1m11s     1441            0.0          119.5      0.0      0.0      0.0      0.0 payment
		   1m11s     1441            0.0           12.0      0.0      0.0      0.0      0.0 stockLevel
		E190724 09:32:49.478652 1 workload/cli/run.go:427  error in orderStatus: dial tcp 10.128.0.124:26257: connect: connection refused
		: signal: killed
	cluster.go:2090,cdc.go:224,cdc.go:476,test_runner.go:691: unexpected node event: 2: dead

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application

Fixes cockroachdb#39018.
Fixes cockroachdb#37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft
proposal to be applied multiple times at multiple applied indexes. The
bug was possible if a raft proposal was reproposed twice, once with the
same max lease index and once with a new max lease index. Because there
are two entries for the same proposal, one necessarily has to have an
invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430).
If these two entries happened to land in the same application batch on
the leaseholder then the first entry would be rejected and the second
would apply. The replica's LeaseAppliedIndex would then be bumped all the
way to the max lease index of the entry that applied. The bug occurred when
the first entry (which must have hit a proposalIllegalLeaseIndex) called into
tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be
equal to the Replica's LeaseAppliedIndex because it would match the index
in the successful entry. We would then repropose the proposal with a larger
lease applied index. This new entry could then apply and result in duplicate
entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it
caught this duplicate entry application and panicked loudly. Other tests might
also be failing because of it but might not have such obvious symptoms when
they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of
fixing an issue where two entries for the same command could be in the same
batch and only one would be linked to its ProposalData struct and be considered
locally proposed (see the change in retrieveLocalProposals). I believe that this
would prevent the command from being properly acknowledged if the first entry
was rejected due to its max lease index and the second entry had a larger max
lease index and applied. I think the first entry would have eventually hit the
check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData
had a new MaxLeaseIndex so it would have added it back to the proposal map, but
then it would have had to wait for refreshProposalsLocked to refresh the
proposal, at which point this refresh would have hit a lease index error and
would be reproposed at a higher index. Not only could this cause duplicate
versions of the same command to apply (described above), but I think this could
even loop forever without acknowledging the client. It seems like there should
be a way for this to cause cockroachdb#39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out
these interfaces in cockroachdb#38954. I'm working on that, but I don't think it should
hold up the next alpha (cockroachdb#39036). However, this commit does address a TODO to
properly handle errors during tryReproposeWithNewLeaseIndex reproposals and
that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10
`cdc/ledger/rangefeed=true` roachtests after reducing its duration down to
5m. Usually, at least one of these tests would hit the `negative refcount`
assertion. I then incrementally added more and more logging around entry
application until I painted a full picture of which logical ops were being
consumed by the rangefeed processor and why the same raft command was being
applied twice (once it became clear that one was). After a few more rounds
of fine-tuning the logging, the interaction with reproposals with a new max
lease index became clear.

Release note: None
craig bot pushed a commit that referenced this issue Jul 24, 2019
39064: storage: prevent command reproposal with new lease index after application r=nvanbenschoten a=nvanbenschoten

Fixes #39018.
Fixes #37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430). If these two entries happened to land in the same application batch on the leaseholder then the first entry would be rejected and the second would apply. The replica's LeaseAppliedIndex would then be bumped all the way to the max lease index of the entry that applied. The bug occurred when the first entry (which must have hit a proposalIllegalLeaseIndex) called into tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be equal to the Replica's LeaseAppliedIndex because it would match the index in the successful entry. We would then repropose the proposal with a larger lease applied index. This new entry could then apply and result in duplicate entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of it but might not have as obvious symptoms when they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch and only one would be linked to its ProposalData struct and be considered locally proposed (see the change in retrieveLocalProposals). I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would have eventually hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex so it would have added it back to the proposal map, but then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point this refresh would have hit a lease index error and would be reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think this could even loop forever without acknowledging the client. It seems like there should be a way for this to cause #39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out these interfaces in #38954. I'm working on that, but I don't think it should hold up the next alpha (#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals and that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10 `cdc/ledger/rangefeed=true` roachtests after reducing its duration down to 5m. Usually, at least one of these tests would hit the `negative refcount` assertion. I then incrementally added more and more logging around entry application until I painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction with reproposals with a new max lease index became clear.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in #39064 Jul 24, 2019