Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: cdc/ledger/rangefeed=true failed #36077

Closed
cockroach-teamcity opened this issue Mar 23, 2019 · 11 comments
Closed

roachtest: cdc/ledger/rangefeed=true failed #36077

cockroach-teamcity opened this issue Mar 23, 2019 · 11 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/6cac063ae1cb578130afbafb2abf4035268a10c9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/ledger/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194308&tab=buildLog

The test failed on master:
	cluster.go:1267,cdc.go:652,cdc.go:147,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1194308-cdc-ledger-rangefeed-true:4 -- ./workload run ledger --mix=balance=0,withdrawal=50,deposit=50,reversal=0 {pgurl:1-3} --duration=30m returned:
		stderr:
		
		stdout:
		awal
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   3m11s        0          643.9          591.2     25.2     37.7     52.4     88.1 deposit
		   3m11s        0          633.0          591.3     24.1     37.7     46.1     79.7 withdrawal
		   3m12s        0          646.1          591.5     25.2     35.7     44.0     75.5 deposit
		   3m12s        0          624.1          591.4     24.1     37.7     54.5     92.3 withdrawal
		   3m13s        0          609.9          591.6     25.2     37.7     50.3     88.1 deposit
		   3m13s        0          655.9          591.8     24.1     35.7     44.0    226.5 withdrawal
		   3m14s        0          646.0          591.9     26.2     35.7     46.1     83.9 deposit
		   3m14s        0          617.0          591.9     24.1     37.7     56.6    121.6 withdrawal
		   3m15s        0          570.0          591.8     25.2     39.8     54.5    243.3 deposit
		   3m15s        0          558.0          591.7     24.1     41.9    218.1    251.7 withdrawal
		: signal: killed
	cluster.go:1626,cdc.go:221,cdc.go:460,test.go:1214: unexpected status: failed

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone Mar 23, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Mar 23, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/837e946efc272bd8a9e0e08484733f8755ff5ab1

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/ledger/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1247401&tab=buildLog

The test failed on release-19.1:
	cluster.go:1329,cdc.go:765,cdc.go:149,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1247401-cdc-ledger-rangefeed-true:4 -- ./workload run ledger --mix=balance=0,withdrawal=50,deposit=50,reversal=0 {pgurl:1-3} --duration=30m returned:
		stderr:
		
		stdout:
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		      1s        0          460.6          473.9     33.6     48.2     71.3    113.2 deposit
		      1s        0          467.4          480.9     31.5     44.0     62.9     92.3 withdrawal
		      2s        0          532.0          503.0     31.5     41.9     54.5     88.1 deposit
		      2s        0          495.0          488.0     30.4     41.9     65.0    113.2 withdrawal
		      3s        0          503.0          503.0     31.5     46.1     52.4     65.0 deposit
		      3s        0          520.0          498.6     30.4     41.9     60.8    104.9 withdrawal
		      4s        0          523.0          508.0     31.5     41.9     54.5     96.5 deposit
		      4s        0          489.0          496.2     31.5     41.9     56.6     96.5 withdrawal
		: signal: killed
	cluster.go:1688,cdc.go:223,cdc.go:565,test.go:1237: unexpected status: failed

@danhhz
Copy link
Contributor

danhhz commented Apr 18, 2019

First failure has no logs and is from long enough ago that it's missing all kinds of fixes. Most recent failure is on release-19.1 and should be fixed by backporting #36852, but I'm waiting for 19.1.1 to do that

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/46f8608c4fe2d94b771beb37bcee19136040fd74

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/ledger/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1253450&tab=buildLog

The test failed on master:
	cluster.go:1329,cdc.go:765,cdc.go:149,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1253450-cdc-ledger-rangefeed-true:4 -- ./workload run ledger --mix=balance=0,withdrawal=50,deposit=50,reversal=0 {pgurl:1-3} --duration=30m returned:
		stderr:
		
		stdout:
		1s        0          570.9          547.0     25.2     46.1     58.7    125.8 withdrawal
		  23m32s        0          579.1          547.1     27.3     48.2     58.7    121.6 deposit
		  23m32s        0          551.1          547.0     25.2     46.1     60.8     79.7 withdrawal
		  23m33s        0          457.0          547.0     32.5     54.5     67.1    142.6 deposit
		  23m33s        0          507.0          547.0     31.5     50.3     71.3    100.7 withdrawal
		  23m34s        0          471.0          546.9     30.4     67.1    142.6    192.9 deposit
		  23m34s        0          440.0          546.9     29.4     62.9    142.6    151.0 withdrawal
		  23m35s        0          463.0          546.9     32.5     58.7     71.3     79.7 deposit
		  23m35s        0          462.0          546.9     31.5     56.6     71.3     96.5 withdrawal
		  23m36s        0          514.1          546.9     30.4     58.7     79.7    109.1 deposit
		  23m36s        0          518.1          546.9     28.3     50.3     65.0     79.7 withdrawal
		: signal: killed
	cluster.go:1688,cdc.go:223,cdc.go:565,test.go:1237: unexpected status: failed

@tbg
Copy link
Member

tbg commented Apr 23, 2019

Fished this from the logs in the above failures:

changefeed: 06:40:56 cdc.go:858: unexpected status: failed, error: [NotLeaseHolderError] r227: replica (n2,s2):2 not lease holder; replica (n3,s3):3 is

If I'm tracing this right, it's from

if err := db.QueryRow(
`SELECT status, payload, progress FROM system.jobs WHERE id = $1`, jobID,
).Scan(&status, &payloadBytes, &progressBytes); err != nil {
return changefeedInfo{}, err
.

@danhhz
Copy link
Contributor

danhhz commented Apr 23, 2019

Ooo and this is on master. Yesterday, I got a "descriptor not found" error that was missing the retryable tag, so there's two outstanding now.

Sadly, in my logging to print the error details, I swapped which one should have the %v and which should have the %+v, so I haven't made it easy on myself here. I'll take a peek at the logs anyway and see if I can piece something together.

@danhhz
Copy link
Contributor

danhhz commented Apr 23, 2019

Repro'd locally which pinned it down to coming out of the rangefeedImpl call on poller. I added more debugging inside that method, but in the meantime I suspect WaitForTS.

g.GoCtx(p.rangefeedImpl)

@danhhz
Copy link
Contributor

danhhz commented Apr 23, 2019

Aha! Got one coming out of (*DistSender).RangeFeed. cc @nvanbenschoten what is RangeFeed's contract around these? What level should they be retried at?

unexpected status: failed, error: internal error: uncaught error: [NotLeaseHolderError] r673: replica (n3,s3):2 not lease holder; replica (n2,s2):3 is
error from rangefeed
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*poller).rangefeedImpl.func1
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/poller.go:299
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1
	/go/src/github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:170
github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1333
receiving stream of data changes
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*poller).RunUsingRangefeeds.func2
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/poller.go:209
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1
	/go/src/github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:170
github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1333
watching changes
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*changeAggregator).Start.func1
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_processors.go:200
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:325
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1333

https://github.com/cockroachdb/cockroach/blob/bbfc5a8b43e38006a5120ea4ad7d9fdc852824e2/pkg/ccl/changefeedccl/poller.go#L299

@danhhz
Copy link
Contributor

danhhz commented Apr 23, 2019

Oh, I bet it's this. I noticed that we weren't handling NotLeaseHolderError in (*DistSender).partialRangeFeed and I was wondering how anything ever worked at all, but this comment sounds like a rare-ish thing.

cockroach/pkg/storage/store.go

Lines 3125 to 3142 in 8dfd761

// If we have an uninitialized copy of the range, then we are
// probably a valid member of the range, we're just in the
// process of getting our snapshot. If we returned
// RangeNotFoundError, the client would invalidate its cache,
// but we can be smarter: the replica that caused our
// uninitialized replica to be created is most likely the
// leader.
return roachpb.NewError(&roachpb.NotLeaseHolderError{
RangeID: args.RangeID,
LeaseHolder: repl.creatingReplica,
// The replica doesn't have a range descriptor yet, so we have to build
// a ReplicaDescriptor manually.
Replica: roachpb.ReplicaDescriptor{
NodeID: repl.store.nodeDesc.NodeID,
StoreID: repl.store.StoreID(),
ReplicaID: replicaID,
},
})

Sounds like we need to handle NotLeaseHolderError in this switch. @nvanbenschoten does that sound right to you?

switch t := pErr.GetDetail().(type) {
case *roachpb.SendError, *roachpb.RangeNotFoundError:
// Evict the decriptor from the cache and reload on next attempt.
if err := rangeInfo.token.Evict(ctx); err != nil {
return err
}
rangeInfo.desc = nil
continue
case *roachpb.RangeKeyMismatchError:
// Evict the decriptor from the cache.
if err := rangeInfo.token.Evict(ctx); err != nil {
return err
}
return ds.divideAndSendRangeFeedToRanges(ctx, rangeInfo.rs, ts, rangeCh)
case *roachpb.RangeFeedRetryError:
switch t.Reason {
case roachpb.RangeFeedRetryError_REASON_REPLICA_REMOVED,
roachpb.RangeFeedRetryError_REASON_RAFT_SNAPSHOT,
roachpb.RangeFeedRetryError_REASON_LOGICAL_OPS_MISSING,
roachpb.RangeFeedRetryError_REASON_SLOW_CONSUMER:
// Try again with same descriptor. These are transient
// errors that should not show up again.
continue
case roachpb.RangeFeedRetryError_REASON_RANGE_SPLIT,
roachpb.RangeFeedRetryError_REASON_RANGE_MERGED:
// Evict the decriptor from the cache.
if err := rangeInfo.token.Evict(ctx); err != nil {
return err
}
return ds.divideAndSendRangeFeedToRanges(ctx, rangeInfo.rs, ts, rangeCh)
default:
log.Fatalf(ctx, "unexpected RangeFeedRetryError reason %v", t.Reason)
}
default:
return t
}

@nvanbenschoten
Copy link
Member

Yes, that all sounds correct to me. It's a little strange that we need to handle NotLeaseHolderErrors for RangeFeeds because we're fine having them attached to follower replicas, but the place where they're being thrown from makes sense.

@danhhz
Copy link
Contributor

danhhz commented Apr 24, 2019

Would it be cleaner to remove the optimization and make that code throw a RangeNotFoundError instead then?

danhhz added a commit to danhhz/cockroach that referenced this issue Apr 24, 2019
RangeFeeds work on followers, so they don't get NotLeaseHolderError in
the normal course of things. However, if an uninitialized replica is
contacted for a RangeFeed, it uses NotLeaseHolderError as a hint for
where a better one might be found. The error handling in
partialRangeFeed was missing this case, causing changefeeds to fail on
what should be a retryable error.

Touches cockroachdb#36077

Release note: None
danhhz added a commit to danhhz/cockroach that referenced this issue Apr 24, 2019
(*Store).Send has an optimization for uninitialized replicas to send
back a NotLeaseHolderError with a hint of where an initialized replica
might be found. RangeFeeds can always be served from followers and so
don't otherwise return NotLeaseHolderError. This commit changes the
corresponding codepath in (*Store).Send to return a RangeNotFoundError
instead.

Touches cockroachdb#36077

Release note: None
craig bot pushed a commit that referenced this issue Apr 25, 2019
37099: storage: stop returning NotLeaseHolderError from (*Store).RangeFeed r=nvanbenschoten a=danhhz

(*Store).Send has an optimization for uninitialized replicas to send
back a NotLeaseHolderError with a hint of where an initialized replica
might be found. RangeFeeds can always be served from followers and so
don't otherwise return NotLeaseHolderError. This commit changes the
corresponding codepath in (*Store).Send to return a RangeNotFoundError
instead.

Touches #36077

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
danhhz added a commit to danhhz/cockroach that referenced this issue May 6, 2019
(*Store).Send has an optimization for uninitialized replicas to send
back a NotLeaseHolderError with a hint of where an initialized replica
might be found. RangeFeeds can always be served from followers and so
don't otherwise return NotLeaseHolderError. This commit changes the
corresponding codepath in (*Store).Send to return a RangeNotFoundError
instead.

Touches cockroachdb#36077

Release note: None
@danhhz
Copy link
Contributor

danhhz commented May 7, 2019

A bunch of things got fixed, so at this point seems like any new failures are likely to be new issues. Closing

@danhhz danhhz closed this as completed May 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

4 participants