roachtest: cdc/ledger/rangefeed=true failed #39018

Closed
cockroach-teamcity opened this issue Jul 22, 2019 · 8 comments · Fixed by #39064
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone: 19.2

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/ledger/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190722-1399000/cdc/ledger/rangefeed=true/run_1
	cluster.go:1726,cdc.go:767,cdc.go:150,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563776264-11-n4cpu16:4 -- ./workload run ledger --mix=balance=0,withdrawal=50,deposit=50,reversal=0 {pgurl:1-3} --duration=30m returned:
		stderr:
		
		stdout:
		 0.0 withdrawal
		  22m33s        0            0.0          689.3      0.0      0.0      0.0      0.0 deposit
		  22m33s        0            0.0          689.3      0.0      0.0      0.0      0.0 withdrawal
		  22m34s        0            1.0          688.8   7784.6   7784.6   7784.6   7784.6 deposit
		  22m34s        0            0.0          688.8      0.0      0.0      0.0      0.0 withdrawal
		  22m35s        0           10.0          688.3    486.5   9663.7   9663.7   9663.7 deposit
		  22m35s        0            7.0          688.3   9126.8   9663.7   9663.7   9663.7 withdrawal
		  22m36s        0          198.0          687.9     14.7    125.8    503.3    570.4 deposit
		  22m36s        0          188.0          688.0     14.2     56.6    113.2    570.4 withdrawal
		  22m37s        0          321.0          687.7     14.7     24.1     33.6     58.7 deposit
		  22m37s        0          315.0          687.7     14.2     24.1     27.3     30.4 withdrawal
		Error: error in withdrawal: driver: bad connection
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,cdc.go:224,cdc.go:566,test_runner.go:691: Goexit() was called

cockroach-teamcity added this to the 19.2 milestone Jul 22, 2019
cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 22, 2019
@jordanlewis
Member

panic: negative refcount -2 for txn &{txnID:[243 109 56 86 196 43 78 73 185 39 226 13 92 104 108 244] txnKey:[] txnMinTimestamp:{WallTime:0 Logical:0} timestamp:{WallTime:1563784722739899098 Logical:0} refCount:-2 index:34} [recovered]
	panic: negative refcount -2 for txn &{txnID:[243 109 56 86 196 43 78 73 185 39 226 13 92 104 108 244] txnKey:[] txnMinTimestamp:{WallTime:0 Logical:0} timestamp:{WallTime:1563784722739899098 Logical:0} refCount:-2 index:34}

goroutine 21594813 [running]:
panic(0x37209c0, 0xc02f755070)
	/usr/local/go/src/runtime/panic.go:565 +0x2c5 fp=0xc00ac095e8 sp=0xc00ac09558 pc=0x785f75
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc0004c7680, 0x44ec800, 0xc02593ab10)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:181 +0x121 fp=0xc00ac09648 sp=0xc00ac095e8 pc=0x1282431
runtime.call32(0x0, 0x3dd5698, 0xc0169bbf70, 0x1800000018)
	/usr/local/go/src/runtime/asm_amd64.s:519 +0x3b fp=0xc00ac09678 sp=0xc00ac09648 pc=0x7b3ffb
panic(0x37209c0, 0xc02f755070)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5 fp=0xc00ac09708 sp=0xc00ac09678 pc=0x785e65
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*unresolvedIntentQueue).assertOnlyPositiveRefCounts(0xc023e65168)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/resolved_timestamp.go:509 +0x161 fp=0xc00ac097d8 sp=0xc00ac09708 pc=0x1a3cee1
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*unresolvedIntentQueue).AllowNegRefCount(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/resolved_timestamp.go:501
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*resolvedTimestamp).Init(0xc023e65140, 0xc00ac09ea0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/resolved_timestamp.go:106 +0x33 fp=0xc00ac097f8 sp=0xc00ac097d8 pc=0x1a3b533
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*Processor).initResolvedTS(0xc023e65080, 0x44ec740, 0xc01c2a8d40)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/processor.go:536 +0x34 fp=0xc00ac09820 sp=0xc00ac097f8 pc=0x1a389c4
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*Processor).consumeEvent(0xc023e65080, 0x44ec740, 0xc01c2a8d40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/processor.go:476 +0x1d1 fp=0xc00ac09888 sp=0xc00ac09820 pc=0x1a383e1
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*Processor).Start.func1(0x44ec800, 0xc02593ab10)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/processor.go:241 +0xa9c fp=0xc00ac09f78 sp=0xc00ac09888 pc=0x1a3ec7c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc023eef3e0, 0xc0004c7680, 0xc02593aae0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0xfb fp=0xc00ac09fc8 sp=0xc00ac09f78 pc=0x12845cb
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc00ac09fd0 sp=0xc00ac09fc8 pc=0x7b5d11
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xa8

@nvanbenschoten
Member

Looks like this has also started showing up recently in #37810.

@nvanbenschoten
Member

This appears to reproduce fairly easily on master.

@nvanbenschoten
Member

I'm in the process of bisecting this.

@nvanbenschoten
Member

The issue was introduced somewhere between 95a0b6b and 509baff. The suspense is killing me.

@nvanbenschoten
Member

Looking through those commits, it seems highly likely that the issue was introduced in 509baff.

@nvanbenschoten
Member

I'm worried that the batching change is causing issues with:

// Read the value directly from the Engine. This is performed in the
// same raftMu critical section that the logical op's corresponding
// WriteBatch is applied, so the value should exist.
val, _, err := engine.MVCCGet(ctx, r.Engine(), key, ts, engine.MVCCGetOptions{Tombstones: true})
if val == nil && err == nil {
err = errors.New("value missing in engine")
}

I think we'll want to have that take a Reader and read from the write batch as we build it so that a later write in the same application batch can't interfere with an earlier write.

That said, I would expect this to cause a `value missing in engine` error, not the panic we're seeing here.
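
A minimal sketch of that change (readOpValue is a hypothetical helper name, the surrounding plumbing is omitted, and the usual context/engine/hlc/roachpb/errors imports are assumed; only the MVCCGet call mirrors the snippet above), taking an engine.Reader positioned on the in-progress application batch instead of reaching through to r.Engine():

// Read through a Reader backed by the in-progress write batch rather than
// the Engine, so a later write in the same application batch cannot hide
// the value an earlier logical op needs to read.
func readOpValue(
	ctx context.Context, reader engine.Reader, key roachpb.Key, ts hlc.Timestamp,
) (*roachpb.Value, error) {
	val, _, err := engine.MVCCGet(ctx, reader, key, ts, engine.MVCCGetOptions{Tombstones: true})
	if val == nil && err == nil {
		err = errors.New("value missing in engine")
	}
	return val, err
}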

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86eab2ff0a1a4c2d9b5f7e7a45deda74c98c6c37

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/ledger/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1402541&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190724-1402541/cdc/ledger/rangefeed=true/run_1
	cluster.go:1726,cdc.go:767,cdc.go:150,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563949032-12-n4cpu16:4 -- ./workload run ledger --mix=balance=0,withdrawal=50,deposit=50,reversal=0 {pgurl:1-3} --duration=30m returned:
		stderr:
		
		stdout:
		          629.8     24.1     39.8     50.3     71.3 withdrawal
		    6m9s        0          214.0          628.7     25.2    130.0    142.6    142.6 deposit
		    6m9s        0          187.0          628.6     25.2    130.0    142.6    142.6 withdrawal
		   6m10s        0            0.0          627.0      0.0      0.0      0.0      0.0 deposit
		   6m10s        0            0.0          626.9      0.0      0.0      0.0      0.0 withdrawal
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   6m11s        0            0.0          625.3      0.0      0.0      0.0      0.0 deposit
		   6m11s        0            0.0          625.2      0.0      0.0      0.0      0.0 withdrawal
		   6m12s        0            0.0          623.6      0.0      0.0      0.0      0.0 deposit
		   6m12s        0            0.0          623.6      0.0      0.0      0.0      0.0 withdrawal
		Error: error in withdrawal: read tcp 10.128.0.68:54150->10.128.0.118:26257: read: connection reset by peer
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,cdc.go:224,cdc.go:566,test_runner.go:691: Goexit() was called

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application

craig bot pushed a commit that referenced this issue Jul 24, 2019
39064: storage: prevent command reproposal with new lease index after application r=nvanbenschoten a=nvanbenschoten

Fixes #39018.
Fixes #37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717, which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a Raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430). If these two entries happened to land in the same application batch on the leaseholder, then the first entry would be rejected and the second would apply. The replica's LeaseAppliedIndex would then be bumped all the way to the max lease index of the entry that applied. The bug occurred when the first entry (which must have hit a proposalIllegalLeaseIndex) called into tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be equal to the Replica's LeaseAppliedIndex because it would match the index in the successful entry. We would then repropose the proposal with a larger lease applied index. This new entry could then apply and result in duplicate entry application.
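
To make that invariant concrete, here is a rough sketch of the lease-index check and bump (simplified, illustrative names only; this is not the actual replica_raft.go code):

// A command applies only while its MaxLeaseIndex is still ahead of the
// replica's LeaseAppliedIndex; a successful application bumps
// LeaseAppliedIndex to that MaxLeaseIndex. With two entries for the same
// proposal in one batch, the first is rejected here and the second applies,
// leaving LeaseAppliedIndex equal to the reproposed MaxLeaseIndex.
func checkLeaseIndex(maxLeaseIndex, leaseAppliedIndex uint64) (newLeaseAppliedIndex uint64, rejected bool) {
	if maxLeaseIndex <= leaseAppliedIndex {
		// Forced error (proposalIllegalLeaseIndex); the proposer retries via
		// tryReproposeWithNewLeaseIndex with a higher MaxLeaseIndex.
		return leaseAppliedIndex, true
	}
	return maxLeaseIndex, false
}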

Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of it but might not have such obvious symptoms when they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch and only one would be linked to its ProposalData struct and be considered locally proposed (see the change in retrieveLocalProposals). I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would have eventually hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex, so it would have added it back to the proposal map, but then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point the refresh would have hit a lease index error and the proposal would have been reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think this could even loop forever without acknowledging the client. It seems like there should be a way for this to cause #39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out these interfaces in #38954. I'm working on that, but I don't think it should hold up the next alpha (#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals and that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10 `cdc/ledger/rangefeed=true` roachtests after reducing the test's duration to 5m. Usually, at least one of these tests would hit the `negative refcount` assertion. I then incrementally added more and more logging around entry application until I had painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same Raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction involving reproposals with a new max lease index became clear.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot closed this as completed in #39064 Jul 24, 2019