roachtest: cdc/ledger/rangefeed=true failed #39018

Closed
cockroach-teamcity opened this issue Jul 22, 2019 · 8 comments · Fixed by #39064
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone: 19.2

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/ledger/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190722-1399000/cdc/ledger/rangefeed=true/run_1
	cluster.go:1726,cdc.go:767,cdc.go:150,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563776264-11-n4cpu16:4 -- ./workload run ledger --mix=balance=0,withdrawal=50,deposit=50,reversal=0 {pgurl:1-3} --duration=30m returned:
		stderr:
		
		stdout:
		 0.0 withdrawal
		  22m33s        0            0.0          689.3      0.0      0.0      0.0      0.0 deposit
		  22m33s        0            0.0          689.3      0.0      0.0      0.0      0.0 withdrawal
		  22m34s        0            1.0          688.8   7784.6   7784.6   7784.6   7784.6 deposit
		  22m34s        0            0.0          688.8      0.0      0.0      0.0      0.0 withdrawal
		  22m35s        0           10.0          688.3    486.5   9663.7   9663.7   9663.7 deposit
		  22m35s        0            7.0          688.3   9126.8   9663.7   9663.7   9663.7 withdrawal
		  22m36s        0          198.0          687.9     14.7    125.8    503.3    570.4 deposit
		  22m36s        0          188.0          688.0     14.2     56.6    113.2    570.4 withdrawal
		  22m37s        0          321.0          687.7     14.7     24.1     33.6     58.7 deposit
		  22m37s        0          315.0          687.7     14.2     24.1     27.3     30.4 withdrawal
		Error: error in withdrawal: driver: bad connection
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,cdc.go:224,cdc.go:566,test_runner.go:691: Goexit() was called

cockroach-teamcity added this to the 19.2 milestone Jul 22, 2019
cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 22, 2019
@jordanlewis
Member

panic: negative refcount -2 for txn &{txnID:[243 109 56 86 196 43 78 73 185 39 226 13 92 104 108 244] txnKey:[] txnMinTimestamp:{WallTime:0 Logical:0} timestamp:{WallTime:1563784722739899098 Logical:0} refCount:-2 index:34} [recovered]
	panic: negative refcount -2 for txn &{txnID:[243 109 56 86 196 43 78 73 185 39 226 13 92 104 108 244] txnKey:[] txnMinTimestamp:{WallTime:0 Logical:0} timestamp:{WallTime:1563784722739899098 Logical:0} refCount:-2 index:34}

goroutine 21594813 [running]:
panic(0x37209c0, 0xc02f755070)
	/usr/local/go/src/runtime/panic.go:565 +0x2c5 fp=0xc00ac095e8 sp=0xc00ac09558 pc=0x785f75
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc0004c7680, 0x44ec800, 0xc02593ab10)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:181 +0x121 fp=0xc00ac09648 sp=0xc00ac095e8 pc=0x1282431
runtime.call32(0x0, 0x3dd5698, 0xc0169bbf70, 0x1800000018)
	/usr/local/go/src/runtime/asm_amd64.s:519 +0x3b fp=0xc00ac09678 sp=0xc00ac09648 pc=0x7b3ffb
panic(0x37209c0, 0xc02f755070)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5 fp=0xc00ac09708 sp=0xc00ac09678 pc=0x785e65
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*unresolvedIntentQueue).assertOnlyPositiveRefCounts(0xc023e65168)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/resolved_timestamp.go:509 +0x161 fp=0xc00ac097d8 sp=0xc00ac09708 pc=0x1a3cee1
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*unresolvedIntentQueue).AllowNegRefCount(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/resolved_timestamp.go:501
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*resolvedTimestamp).Init(0xc023e65140, 0xc00ac09ea0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/resolved_timestamp.go:106 +0x33 fp=0xc00ac097f8 sp=0xc00ac097d8 pc=0x1a3b533
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*Processor).initResolvedTS(0xc023e65080, 0x44ec740, 0xc01c2a8d40)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/processor.go:536 +0x34 fp=0xc00ac09820 sp=0xc00ac097f8 pc=0x1a389c4
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*Processor).consumeEvent(0xc023e65080, 0x44ec740, 0xc01c2a8d40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/processor.go:476 +0x1d1 fp=0xc00ac09888 sp=0xc00ac09820 pc=0x1a383e1
github.com/cockroachdb/cockroach/pkg/storage/rangefeed.(*Processor).Start.func1(0x44ec800, 0xc02593ab10)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/rangefeed/processor.go:241 +0xa9c fp=0xc00ac09f78 sp=0xc00ac09888 pc=0x1a3ec7c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc023eef3e0, 0xc0004c7680, 0xc02593aae0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0xfb fp=0xc00ac09fc8 sp=0xc00ac09f78 pc=0x12845cb
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc00ac09fd0 sp=0xc00ac09fc8 pc=0x7b5d11
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xa8

@nvanbenschoten
Member

Looks like this has also started showing up recently in #37810.

@nvanbenschoten
Member

This appears to reproduce fairly easily on master.

@nvanbenschoten
Member

I'm in the process of bisecting this.

@nvanbenschoten
Member

The issue was introduced somewhere between 95a0b6b and 509baff. The suspense is killing me.

@nvanbenschoten
Member

Looking through those commits, it seems highly likely that the issue was introduced in 509baff.

@nvanbenschoten
Member

I'm worried that the batching change is causing issues with:

// Read the value directly from the Engine. This is performed in the
// same raftMu critical section that the logical op's corresponding
// WriteBatch is applied, so the value should exist.
val, _, err := engine.MVCCGet(ctx, r.Engine(), key, ts, engine.MVCCGetOptions{Tombstones: true})
if val == nil && err == nil {
err = errors.New("value missing in engine")
}

I think we'll want to have that take a Reader and read from the write batch as we build it so that a later write in the same application batch can't interfere with an earlier write.

That said, I would expect this to cause a `value missing in engine` error, not the panic we're seeing here.
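
A minimal sketch of that change (readOpValue is a hypothetical helper name, the surrounding plumbing is omitted, and the usual context/engine/hlc/roachpb/errors imports are assumed; only the MVCCGet call mirrors the snippet above), taking an engine.Reader positioned on the in-progress application batch instead of reaching through to r.Engine():

// Read through a Reader backed by the in-progress write batch rather than
// the Engine, so a later write in the same application batch cannot hide
// the value an earlier logical op needs to read.
func readOpValue(
	ctx context.Context, reader engine.Reader, key roachpb.Key, ts hlc.Timestamp,
) (*roachpb.Value, error) {
	val, _, err := engine.MVCCGet(ctx, reader, key, ts, engine.MVCCGetOptions{Tombstones: true})
	if val == nil && err == nil {
		err = errors.New("value missing in engine")
	}
	return val, err
}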

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86eab2ff0a1a4c2d9b5f7e7a45deda74c98c6c37

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/ledger/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1402541&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190724-1402541/cdc/ledger/rangefeed=true/run_1
	cluster.go:1726,cdc.go:767,cdc.go:150,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563949032-12-n4cpu16:4 -- ./workload run ledger --mix=balance=0,withdrawal=50,deposit=50,reversal=0 {pgurl:1-3} --duration=30m returned:
		stderr:
		
		stdout:
		          629.8     24.1     39.8     50.3     71.3 withdrawal
		    6m9s        0          214.0          628.7     25.2    130.0    142.6    142.6 deposit
		    6m9s        0          187.0          628.6     25.2    130.0    142.6    142.6 withdrawal
		   6m10s        0            0.0          627.0      0.0      0.0      0.0      0.0 deposit
		   6m10s        0            0.0          626.9      0.0      0.0      0.0      0.0 withdrawal
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   6m11s        0            0.0          625.3      0.0      0.0      0.0      0.0 deposit
		   6m11s        0            0.0          625.2      0.0      0.0      0.0      0.0 withdrawal
		   6m12s        0            0.0          623.6      0.0      0.0      0.0      0.0 deposit
		   6m12s        0            0.0          623.6      0.0      0.0      0.0      0.0 withdrawal
		Error: error in withdrawal: read tcp 10.128.0.68:54150->10.128.0.118:26257: read: connection reset by peer
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,cdc.go:224,cdc.go:566,test_runner.go:691: Goexit() was called

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application

craig bot pushed a commit that referenced this issue Jul 24, 2019
39064: storage: prevent command reproposal with new lease index after application r=nvanbenschoten a=nvanbenschoten

Fixes #39018.
Fixes #37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717, which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a Raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430). If these two entries happened to land in the same application batch on the leaseholder, then the first entry would be rejected and the second would apply. The replica's LeaseAppliedIndex would then be bumped all the way to the max lease index of the entry that applied. The bug occurred when the first entry (which must have hit a proposalIllegalLeaseIndex) called into tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be equal to the Replica's LeaseAppliedIndex because it would match the index in the successful entry. We would then repropose the proposal with a larger lease applied index. This new entry could then apply and result in duplicate entry application.
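
To make that invariant concrete, here is a rough sketch of the lease-index check and bump (simplified, illustrative names only; this is not the actual replica_raft.go code):

// A command applies only while its MaxLeaseIndex is still ahead of the
// replica's LeaseAppliedIndex; a successful application bumps
// LeaseAppliedIndex to that MaxLeaseIndex. With two entries for the same
// proposal in one batch, the first is rejected here and the second applies,
// leaving LeaseAppliedIndex equal to the reproposed MaxLeaseIndex.
func checkLeaseIndex(maxLeaseIndex, leaseAppliedIndex uint64) (newLeaseAppliedIndex uint64, rejected bool) {
	if maxLeaseIndex <= leaseAppliedIndex {
		// Forced error (proposalIllegalLeaseIndex); the proposer retries via
		// tryReproposeWithNewLeaseIndex with a higher MaxLeaseIndex.
		return leaseAppliedIndex, true
	}
	return maxLeaseIndex, false
}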

Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of it but might not have such obvious symptoms when they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch and only one would be linked to its ProposalData struct and be considered locally proposed (see the change in retrieveLocalProposals). I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would have eventually hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex, so it would have added it back to the proposal map, but then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point the refresh would have hit a lease index error and the proposal would have been reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think this could even loop forever without acknowledging the client. It seems like there should be a way for this to cause #39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out these interfaces in #38954. I'm working on that, but I don't think it should hold up the next alpha (#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals and that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10 `cdc/ledger/rangefeed=true` roachtests after reducing the test's duration to 5m. Usually, at least one of these tests would hit the `negative refcount` assertion. I then incrementally added more and more logging around entry application until I had painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same Raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction involving reproposals with a new max lease index became clear.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot closed this as completed in #39064 Jul 24, 2019