roachtest: cdc/crdb-chaos/rangefeed=true failed [skipped] #35974

Closed
cockroach-teamcity opened this issue Mar 20, 2019 · 30 comments · Fixed by #36852
Labels: C-test-failure (Broken test, automatically or manually discovered), O-roachtest, O-robot (Originated from a bot)

Comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/3a7ea2d8c9d4a3e0d97f8f106fcf95b3f03765ec

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog

The test failed on master:
	cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1187480-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		.8      0.0      0.0      0.0      0.0 delivery
		   14m5s     3221            4.0            9.1  38654.7  38654.7  38654.7  38654.7 newOrder
		   14m5s     3221            0.0            0.8      0.0      0.0      0.0      0.0 orderStatus
		   14m5s     3221            1.0            8.8 103079.2 103079.2 103079.2 103079.2 payment
		   14m5s     3221            0.0            0.4      0.0      0.0      0.0      0.0 stockLevel
		E190320 06:33:54.976726 1 workload/cli/run.go:420  error in orderStatus: dial tcp 10.142.15.195:26257: connect: connection refused
		   14m6s     3243            0.0            0.8      0.0      0.0      0.0      0.0 delivery
		   14m6s     3243            2.0            9.1  40802.2  53687.1  53687.1  53687.1 newOrder
		   14m6s     3243            1.0            0.8  32212.3  32212.3  32212.3  32212.3 orderStatus
		   14m6s     3243            3.0            8.8  40802.2 103079.2 103079.2 103079.2 payment
		   14m6s     3243            0.0            0.4      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1626,cdc.go:213,cdc.go:433,test.go:1214: unexpected status: failed

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone Mar 20, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Mar 20, 2019
@danhhz
Contributor

danhhz commented Mar 20, 2019

changefeed: 06:33:52 cdc.go:737: unexpected status: failed, error: result is ambiguous (error=unable to dial n2: breaker open [exhausted])

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dfa23c01e4ea39b19ca8b2e5c8a4e7cf9b9445f4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog

The test failed on master:
	cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1189954-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		0.0      0.0 newOrder
		  11m25s     2297            0.0            1.6      0.0      0.0      0.0      0.0 orderStatus
		  11m25s     2297            0.0           16.6      0.0      0.0      0.0      0.0 payment
		  11m25s     2297            0.0            1.6      0.0      0.0      0.0      0.0 stockLevel
		E190321 06:30:05.195455 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.81:26257: connect: connection refused
		  11m26s     2312            0.0            1.6      0.0      0.0      0.0      0.0 delivery
		  11m26s     2312            0.0           16.7      0.0      0.0      0.0      0.0 newOrder
		  11m26s     2312            0.0            1.6      0.0      0.0      0.0      0.0 orderStatus
		  11m26s     2312            0.0           16.5      0.0      0.0      0.0      0.0 payment
		  11m26s     2312            0.0            1.6      0.0      0.0      0.0      0.0 stockLevel
		E190321 06:30:06.245787 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.81:26257: connect: connection refused
		: signal: killed
	cluster.go:1626,cdc.go:213,cdc.go:433,test.go:1214: unexpected status: failed

@danhhz
Contributor

danhhz commented Mar 21, 2019

unexpected status: failed, error: descriptor not found

"descriptor not found" must have come out of the (*LeaseManager).Acquire call on

tableDesc, _, err = c.leaseMgr.Acquire(ctx, ts, tableID)

But this is only called when we get a kv that has changed and it's called with the table ID and mvcc timestamp of the kv, so it's hard to imagine that we're actually missing that table descriptor. This is happening when a node is going down, so it's not hard to imagine that it's another missing retry.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/b5768aecd39461ab9a54e2e7db059a3fe8b00459

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1191957&tab=buildLog

The test failed on release-19.1:
	cluster.go:1267,cdc.go:633,cdc.go:133,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1191957-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		l
		   7m28s     1690            2.0            1.8     41.9     44.0     44.0     44.0 delivery
		   7m28s     1690           15.0           18.1     31.5    218.1    218.1    218.1 newOrder
		   7m28s     1690            0.0            1.8      0.0      0.0      0.0      0.0 orderStatus
		   7m28s     1690           19.0           18.2     21.0    318.8    352.3    352.3 payment
		   7m28s     1690            2.0            1.7     13.1     16.3     16.3     16.3 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   7m29s     1690            0.0            1.8      0.0      0.0      0.0      0.0 delivery
		   7m29s     1690           20.0           18.1     30.4     39.8     41.9     41.9 newOrder
		   7m29s     1690            5.0            1.8      6.8      8.1      8.1      8.1 orderStatus
		   7m29s     1690           12.0           18.2     14.2     17.8     18.9     18.9 payment
		   7m29s     1690            1.0            1.7     15.7     15.7     15.7     15.7 stockLevel
		: signal: killed
	cluster.go:1626,cdc.go:221,cdc.go:441,test.go:1214: unexpected status: failed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a200cea4368ec90aaee12337d7ab5f9ca555108f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1191939&tab=buildLog

The test failed on master:
	cdc.go:764,cdc.go:223,cdc.go:441,test.go:1214: max latency was more than allowed: 10m18.684024786s vs 10m0s

@danhhz danhhz self-assigned this Mar 22, 2019
@danhhz
Contributor

danhhz commented Mar 22, 2019

unexpected status: failed, error: internal error: uncaught error: [NotLeaseHolderError] r671: replica (n3,s3):3 not lease holder; replica (n2,s2):2 is
CHAOS: 06:33:44 chaos.go:65: chaos stopping: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/6cac063ae1cb578130afbafb2abf4035268a10c9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194308&tab=buildLog

The test failed on master:
	cluster.go:1267,cdc.go:633,cdc.go:133,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1194308-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		     6.6      6.6 orderStatus
		   9m16s     2034           14.0           18.6     13.1     19.9  13421.8  13421.8 payment
		   9m16s     2034            2.0            1.8      7.3     13.6     13.6     13.6 stockLevel
		E190323 06:34:02.200130 1 workload/cli/run.go:420  error in newOrder: dial tcp 10.142.0.143:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   9m17s     2055            3.0            1.8     37.7     50.3     50.3     50.3 delivery
		   9m17s     2055           10.0           18.9     31.5     39.8     39.8     39.8 newOrder
		   9m17s     2055            2.0            1.9      5.2      5.5      5.5      5.5 orderStatus
		   9m17s     2055           23.0           18.6     13.6     17.8     18.9     18.9 payment
		   9m17s     2055            1.0            1.8     11.5     11.5     11.5     11.5 stockLevel
		E190323 06:34:03.218505 1 workload/cli/run.go:420  error in newOrder: dial tcp 10.142.0.143:26257: connect: connection refused
		: signal: killed
	cluster.go:1626,cdc.go:221,cdc.go:441,test.go:1214: unexpected status: failed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9399d559ae196e5cf2ad122195048ff9115ab56a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194326&tab=buildLog

The test failed on release-19.1:
	cluster.go:1267,cdc.go:633,cdc.go:133,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1194326-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		0.0      0.0 newOrder
		  13m55s     3076            0.0            1.8      0.0      0.0      0.0      0.0 orderStatus
		  13m55s     3076            2.0           17.6   8589.9   9663.7   9663.7   9663.7 payment
		  13m55s     3076            0.0            1.8      0.0      0.0      0.0      0.0 stockLevel
		E190323 06:37:25.484912 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.94:26257: connect: connection refused
		  13m56s     3104            0.0            1.7      0.0      0.0      0.0      0.0 delivery
		  13m56s     3104            0.0           18.0      0.0      0.0      0.0      0.0 newOrder
		  13m56s     3104            0.0            1.8      0.0      0.0      0.0      0.0 orderStatus
		  13m56s     3104            0.0           17.6      0.0      0.0      0.0      0.0 payment
		  13m56s     3104            0.0            1.8      0.0      0.0      0.0      0.0 stockLevel
		E190323 06:37:26.509351 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.94:26257: connect: connection refused
		: signal: killed
	cluster.go:1626,cdc.go:221,cdc.go:441,test.go:1214: unexpected status: failed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5a746073c3f8ede851f37dd895cf1a91d6dcc3cf

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195714&tab=buildLog

The test failed on master:
	cluster.go:1267,cdc.go:633,cdc.go:133,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1195714-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		     0.0 newOrder
		    9m2s     1820            0.0            1.8      0.0      0.0      0.0      0.0 orderStatus
		    9m2s     1820            0.0           18.5      0.0      0.0      0.0      0.0 payment
		    9m2s     1820            0.0            1.8      0.0      0.0      0.0      0.0 stockLevel
		E190324 06:29:27.290467 1 workload/cli/run.go:420  error in newOrder: dial tcp 10.142.0.159:26257: connect: connection refused
		    9m3s     1838            0.0            1.8      0.0      0.0      0.0      0.0 delivery
		    9m3s     1838            0.0           18.5      0.0      0.0      0.0      0.0 newOrder
		    9m3s     1838            2.0            1.8      5.5      6.0      6.0      6.0 orderStatus
		    9m3s     1838            0.0           18.4      0.0      0.0      0.0      0.0 payment
		    9m3s     1838            0.0            1.8      0.0      0.0      0.0      0.0 stockLevel
		E190324 06:29:28.320063 1 workload/cli/run.go:420  error in newOrder: dial tcp 10.142.0.159:26257: connect: connection refused
		: signal: killed
	cluster.go:1626,cdc.go:221,cdc.go:441,test.go:1214: unexpected status: failed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c59f5347d5424edb90575fb0fd50bad677953752

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195732&tab=buildLog

The test failed on release-19.1:
	cluster.go:1267,cdc.go:633,cdc.go:133,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1195732-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		     1.8      0.0      0.0      0.0      0.0 delivery
		  18m29s     3890            0.0           17.8      0.0      0.0      0.0      0.0 newOrder
		  18m29s     3890            0.0            1.7      0.0      0.0      0.0      0.0 orderStatus
		  18m29s     3890            1.0           17.4     17.8     17.8     17.8     17.8 payment
		  18m29s     3890            0.0            1.7      0.0      0.0      0.0      0.0 stockLevel
		E190324 06:37:05.437461 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.21:26257: connect: connection refused
		  18m30s     3923            0.0            1.8      0.0      0.0      0.0      0.0 delivery
		  18m30s     3923            0.0           17.8      0.0      0.0      0.0      0.0 newOrder
		  18m30s     3923            0.0            1.7      0.0      0.0      0.0      0.0 orderStatus
		  18m30s     3923            0.0           17.4      0.0      0.0      0.0      0.0 payment
		  18m30s     3923            0.0            1.7      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1626,cdc.go:221,cdc.go:441,test.go:1214: unexpected status: failed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7bc9ea5fbe0c0082fdcfd408245a79c62b00edd4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1197065&tab=buildLog

The test failed on master:
	cdc.go:764,cdc.go:223,cdc.go:441,test.go:1214: max latency was more than allowed: 11m59.092787293s vs 10m0s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ec89b45cea7e8a6dd92a9bfd60e0cc06842e06d8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1197083&tab=buildLog

The test failed on release-19.1:
	cdc.go:764,cdc.go:223,cdc.go:441,test.go:1214: max latency was more than allowed: 21m16.778147585s vs 10m0s

danhhz added a commit to danhhz/cockroach that referenced this issue Mar 25, 2019
…list

In the roachtests for crdb-chaos and sink-chaos we're seeing changefeeds
fail with surprising errors:

    [NotLeaseHolderError] r681: replica (n1,s1):1 not lease holder; replica (n2,s2):2 is

    descriptor not found

We'd like to avoid failing a changefeed unnecessarily, so when an error
bubbles up to the top level, we'd like to retry the distributed flow if
possible. We initially tried to whitelist which errors should cause the
changefeed to retry, but this turns out to be brittle, so this commit
switches to a blacklist. Any error that is expected to be permanent is
now marked with `MarkTerminalError` by the time it comes out of
`distChangefeedFlow`. Everything else should be logged loudly and
retried.

Touches cockroachdb#35974
Touches cockroachdb#36019

Release note: None
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/25398c010b2af75b11fed189680ea6b9645f0cf5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1199659&tab=buildLog

The test failed on master:
	cluster.go:1267,cdc.go:633,cdc.go:133,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1199659-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		0      0.0 newOrder
		    9m7s     1931            0.0            1.9      0.0      0.0      0.0      0.0 orderStatus
		    9m7s     1931            0.0           18.8      0.0      0.0      0.0      0.0 payment
		    9m7s     1931            0.0            1.8      0.0      0.0      0.0      0.0 stockLevel
		E190326 06:26:01.028395 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.156:26257: connect: connection refused
		    9m8s     1958            0.0            1.9      0.0      0.0      0.0      0.0 delivery
		    9m8s     1958           17.0           18.9   1409.3   7784.6   9126.8   9126.8 newOrder
		    9m8s     1958            5.0            1.9   1208.0   2684.4   2684.4   2684.4 orderStatus
		    9m8s     1958           36.0           18.8   4563.4   8589.9   9663.7   9663.7 payment
		    9m8s     1958            1.0            1.8   7516.2   7516.2   7516.2   7516.2 stockLevel
		E190326 06:26:02.042152 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.156:26257: connect: connection refused
		: signal: killed
	cluster.go:1626,cdc.go:221,cdc.go:441,test.go:1214: unexpected status: failed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7f8a0969e8e9eb7e9fc0d2fe96e03849d30dd561

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1199677&tab=buildLog

The test failed on release-19.1:
	cdc.go:764,cdc.go:223,cdc.go:441,test.go:1214: max latency was more than allowed: 11m50.536309913s vs 10m0s

danhhz added a commit to danhhz/cockroach that referenced this issue Mar 26, 2019
craig bot pushed a commit that referenced this issue Mar 27, 2019
36132: changefeedccl: switch high-level retry marker from whitelist to blacklist r=nvanbenschoten a=danhhz

In the roachtests for crdb-chaos and sink-chaos we're seeing changefeeds
fail with surprising errors:

    [NotLeaseHolderError] r681: replica (n1,s1):1 not lease holder; replica (n2,s2):2 is

    descriptor not found

We'd like to avoid failing a changefeed unnecessarily, so when an error
bubbles up to the top level, we'd like to retry the distributed flow if
possible. We initially tried to whitelist which errors should cause the
changefeed to retry, but this turns out to be brittle, so this commit
switches to a blacklist. Any error that is expected to be permanent is
now marked with `MarkTerminalError` by the time it comes out of
`distChangefeedFlow`. Everything else should be logged loudly and
retried.

Touches #35974
Touches #36019

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/17565100d1e7c66341e6db3e39bb66202958cb81

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1204567&tab=buildLog

The test failed on master:
	cdc.go:764,cdc.go:223,cdc.go:441,test.go:1216: max latency was more than allowed: 14m22.873288214s vs 10m0s

@danhhz
Contributor

danhhz commented Mar 28, 2019

It does seem odd that the latency of this latest failure was so high, but I'm not entirely surprised that we don't consistently handle recovery from crdb nodes crashing in a prompt way. Definitely worth looking into what happened and whether we can make it more predictable, but the fact that it recovered at all is encouraging.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/3aadd20bbf0940ef65f8b2cdcda498401ba5d9c6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1206925&tab=buildLog

The test failed on release-19.1:
	cluster.go:1293,cdc.go:633,cdc.go:133,cluster.go:1631,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1206925-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		   1.6      0.0      0.0      0.0      0.0 delivery
		   6m45s     1174            0.0           16.5      0.0      0.0      0.0      0.0 newOrder
		   6m45s     1174            1.0            1.6   5100.3   5100.3   5100.3   5100.3 orderStatus
		   6m45s     1174           11.0           16.9   4026.5   6174.0   6442.5   6442.5 payment
		   6m45s     1174            4.0            1.7   1946.2   6442.5   6442.5   6442.5 stockLevel
		E190328 18:46:19.488850 1 workload/cli/run.go:420  error in payment: dial tcp 10.128.15.197:26257: connect: connection refused
		   6m46s     1205            0.0            1.6      0.0      0.0      0.0      0.0 delivery
		   6m46s     1205            0.0           16.5      0.0      0.0      0.0      0.0 newOrder
		   6m46s     1205            1.0            1.6      4.7      4.7      4.7      4.7 orderStatus
		   6m46s     1205           11.0           16.9   3892.3   6174.0   6174.0   6174.0 payment
		   6m46s     1205            0.0            1.7      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1652,cdc.go:221,cdc.go:441,test.go:1223: unexpected status: failed
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:774,test.go:760,cluster.go:1652,cdc.go:221,cdc.go:441,test.go:1223: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1206925-cdc-crdb-chaos-rangefeed-true --oneshot --ignore-empty-nodes: exit status 1

@danhhz
Contributor

danhhz commented Mar 28, 2019

This last one failed due to the newly added dead node detection. @tbg should we skip that check for chaos tests? Or should I just throw a c.Start at the end of the test to make sure everything is back up and make the dead node detection happy?

@tbg
Member

tbg commented Mar 28, 2019

The chaos runner makes sure that nodes are always restarted. Are you using something homegrown here?

// NB: the roachtest harness checks that at the end of the test,
// all nodes that have data also have a running process.
l.Printf("restarting %v (chaos is done)\n", target)
c.Start(ctx, c.t.(*test), target)

@tbg
Member

tbg commented Mar 28, 2019

If you ask me, the dead node detection is just collateral: the test failed because the SQL workload failed, which cancelled the context, so the chaos runner didn't bother restarting the node. I'm out for today, but if you wanted to make a change so that it always restarts the node, I would appreciate that (should be a two-liner).

@danhhz
Contributor

danhhz commented Mar 28, 2019

Ah, you're right. I saw the "dead node detection" line on a crdb-chaos test and jumped to conclusions. The changefeed failure is what actually failed the test (the workload is run with --tolerate-errors here), and it is yet another example of why we switched to a blacklist for terminal errors.

This test does use something homegrown for crdb chaos, which doesn't guarantee the chaos'd node is restarted when the test shuts down. Given that we have a common one, it sounds like switching to it is the answer to my question above.

@tbg
Member

tbg commented Mar 28, 2019

(Switching SGTM -- but that runner will also exit with a down node if the ctx cancels, so while you're there just make sure that

case <-ctx.Done():

also restarts the node)

@danhhz
Contributor

danhhz commented Mar 28, 2019

👍 Will do! (though I doubt I'll get to it today)

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d03a34e92d2ee558fb6aedb0709b733a1fab97f4

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1207666&tab=buildLog

The test failed on master:
	cluster.go:1293,cdc.go:629,cdc.go:133,cluster.go:1631,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1207666-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		l
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   25m9s     5655            2.0            1.8   2080.4   2550.1   2550.1   2550.1 delivery
		   25m9s     5655           26.0           18.6    369.1   1543.5   1677.7   1677.7 newOrder
		   25m9s     5655            1.0            1.8     32.5     32.5     32.5     32.5 orderStatus
		   25m9s     5655           47.0           17.9    226.5   1543.5   1677.7   1677.7 payment
		   25m9s     5655            1.0            1.8      9.4      9.4      9.4      9.4 stockLevel
		  25m10s     5655            3.0            1.8     46.1     50.3     50.3     50.3 delivery
		  25m10s     5655           21.0           18.6     35.7     46.1     48.2     48.2 newOrder
		  25m10s     5655            2.0            1.8      5.5      7.6      7.6      7.6 orderStatus
		  25m10s     5655           16.0           17.9     15.2     16.8     19.9     19.9 payment
		  25m10s     5655            4.0            1.8     14.2     33.6     33.6     33.6 stockLevel
		: signal: killed
	cluster.go:1652,cdc.go:221,cdc.go:441,test.go:1223: unexpected status: failed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/23f9707873abbd2de91a42055535529d7ff296ce

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1209900&tab=buildLog

The test failed on release-19.1:
	cdc.go:760,cdc.go:223,cdc.go:441,test.go:1223: max latency was more than allowed: 12m2.976338187s vs 10m0s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5921cf0dcc76548931cc85500c0fa2186a82142f

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212185&tab=buildLog

The test failed on release-19.1:
	cluster.go:1293,cdc.go:629,cdc.go:133,cluster.go:1631,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1212185-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		        16.5      0.0      0.0      0.0      0.0 newOrder
		  13m48s     2846            0.0            1.6      0.0      0.0      0.0      0.0 orderStatus
		  13m48s     2846            0.0           16.4      0.0      0.0      0.0      0.0 payment
		  13m48s     2846            0.0            1.6      0.0      0.0      0.0      0.0 stockLevel
		E190401 03:57:10.771640 1 workload/cli/run.go:420  error in stockLevel: dial tcp 10.128.15.192:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  13m49s     2865            0.0            1.6      0.0      0.0      0.0      0.0 delivery
		  13m49s     2865            0.0           16.5      0.0      0.0      0.0      0.0 newOrder
		  13m49s     2865            0.0            1.6      0.0      0.0      0.0      0.0 orderStatus
		  13m49s     2865            0.0           16.3      0.0      0.0      0.0      0.0 payment
		  13m49s     2865            0.0            1.6      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1652,cdc.go:221,cdc.go:441,test.go:1223: unexpected status: failed
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:774,test.go:760,cluster.go:1652,cdc.go:221,cdc.go:441,test.go:1223: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1212185-cdc-crdb-chaos-rangefeed-true --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: 4755
		1: 5191
		2: dead
		Error:  2: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a039a93a5cc6eb3f395ceb6f7dc8030985dccc29

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212269&tab=buildLog

The test failed on release-19.1:
	cdc.go:174,cluster.go:1631,errgroup.go:57: read tcp 172.17.0.2:47318->35.227.97.25:26257: read: connection reset by peer
	cluster.go:1293,cdc.go:629,cdc.go:133,cluster.go:1631,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1212269-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		: signal: killed
	cluster.go:1652,cdc.go:221,cdc.go:441,test.go:1223: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/2851c7d56ee4966109691b5c48c73ec8d4cc9847

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1215354&tab=buildLog

The test failed on master:
	cdc.go:869,cdc.go:224,cdc.go:538,test.go:1226: max latency was more than allowed: 18m55.097916095s vs 10m0s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/877ebd1ece299b9ee621aa0d091657621593d844

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1215372&tab=buildLog

The test failed on release-19.1:
	cdc.go:869,cdc.go:224,cdc.go:538,test.go:1226: max latency was more than allowed: 16m43.820485507s vs 10m0s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1cbf3680129e47bd310640bf32b665662f30faa9

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217781&tab=buildLog

The test failed on release-19.1:
	cdc.go:176,cluster.go:1667,errgroup.go:57: read tcp 172.17.0.2:49582->104.196.26.36:26257: read: connection reset by peer
	cluster.go:1329,cdc.go:746,cdc.go:135,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1217781-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		: signal: killed
	cluster.go:1688,cdc.go:223,cdc.go:546,test.go:1228: Goexit() was called

danhhz changed the title from "roachtest: cdc/crdb-chaos/rangefeed=true failed" to "roachtest: cdc/crdb-chaos/rangefeed=true failed [skipped]" on Apr 3, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue Apr 15, 2019
For a while, the cdc/crdb-chaos and cdc/sink-chaos roachtests have been
failing because an error that should be marked as retryable wasn't. As a
result of the discussion in cockroachdb#35974, I tried switching from a whitelist
(retryable error) to a blacklist (terminal error) in cockroachdb#36132, but on
reflection this doesn't seem like a great idea. We added a safety net to
prevent false negatives from retrying indefinitely but it was
immediately apparent that this meant we needed to tune the retry loop
parameters. Better is to just do the due diligence of investigating the
errors that should be retried and retrying them.

The commit is intended for backport into 19.1 once it's baked for a bit.

Closes cockroachdb#35974
Closes cockroachdb#36018
Closes cockroachdb#36019
Closes cockroachdb#36432

Release note (bug fix): `CHANGEFEED`s now retry instead of erroring in
more situations.
craig bot pushed a commit that referenced this issue Apr 16, 2019
36804: sql/sem/pretty: use left alignment for column names in CREATE r=knz a=knz

Before:

```
CREATE TABLE t (
    name STRING,
    id INT8
       NOT NULL
       PRIMARY KEY
)
```

After:

```
CREATE TABLE t (
    name STRING,
    id   INT8
         NOT NULL
         PRIMARY KEY
)
```


36852: changefeedccl: switch retryable errors back to a whitelist r=nvanbenschoten a=danhhz

For a while, the cdc/crdb-chaos and cdc/sink-chaos roachtests have been
failing because an error that should be marked as retryable wasn't. As a
result of the discussion in #35974, I tried switching from a whitelist
(retryable error) to a blacklist (terminal error) in #36132, but on
reflection this doesn't seem like a great idea. We added a safety net to
prevent false negatives from retrying indefinitely but it was
immediately apparent that this meant we needed to tune the retry loop
parameters. Better is to just do the due diligence of investigating the
errors that should be retried and retrying them.

The commit is intended for backport into 19.1 once it's baked for a bit.

Closes #35974
Closes #36018
Closes #36019
Closes #36432

Release note (bug fix): `CHANGEFEED`s now retry instead of erroring in
more situations.

36872: coldata: fix Slice when slicing up to batch.Length() r=yuzefovich a=asubiotto

A panic occurred because we weren't treating the end slice index as
exclusive, resulting in an out-of-bounds panic when attempting to slice
the nulls slice.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Alfonso Subiotto Marqués <alfonso@cockroachlabs.com>
craig bot closed this as completed in #36852 on Apr 16, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue Apr 24, 2019