roachtest: cdc/crdb-chaos/rangefeed=true failed (skipped on release-19.1) #36905

Closed
cockroach-teamcity opened this issue Apr 17, 2019 · 12 comments · Fixed by #37498

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/c65b71a27e4d0941bf9427b5dec1ff7f096bba7b

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1245461&tab=buildLog

The test failed on release-19.1:
	cluster.go:1329,cdc.go:746,cdc.go:135,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1245461-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		    3254            6.0            1.7   3087.0   5637.1   5637.1   5637.1 stockLevel
		  14m10s     3254            0.0            1.7      0.0      0.0      0.0      0.0 delivery
		  14m10s     3254            0.0           17.5      0.0      0.0      0.0      0.0 newOrder
		  14m10s     3254            0.0            1.7      0.0      0.0      0.0      0.0 orderStatus
		  14m10s     3254            0.0           17.1      0.0      0.0      0.0      0.0 payment
		  14m10s     3254            0.0            1.7      0.0      0.0      0.0      0.0 stockLevel
		  14m11s     3254            0.0            1.7      0.0      0.0      0.0      0.0 delivery
		  14m11s     3254            0.0           17.5      0.0      0.0      0.0      0.0 newOrder
		  14m11s     3254            0.0            1.7      0.0      0.0      0.0      0.0 orderStatus
		  14m11s     3254            0.0           17.1      0.0      0.0      0.0      0.0 payment
		  14m11s     3254            0.0            1.7      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1688,cdc.go:223,cdc.go:546,test.go:1237: unexpected status: failed

cockroach-teamcity added this to the 19.1 milestone Apr 17, 2019
cockroach-teamcity added the C-test-failure, O-roachtest, and O-robot labels Apr 17, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9e6ae3cc37e7691147bb6f5d1a156ebe4c5cf7f9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1245443&tab=buildLog

The test failed on master:
	cdc.go:877,cdc.go:225,cdc.go:546,test.go:1237: max latency was more than allowed: 17m8.395430645s vs 10m0s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/837e946efc272bd8a9e0e08484733f8755ff5ab1

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1247401&tab=buildLog

The test failed on release-19.1:
	cdc.go:176,cluster.go:1667,errgroup.go:57: read tcp 172.17.0.2:52448->35.237.136.168:26257: read: connection reset by peer
	cluster.go:1329,cdc.go:746,cdc.go:135,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1247401-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		: signal: killed
	cluster.go:1688,cdc.go:223,cdc.go:546,test.go:1237: Goexit() was called

@danhhz
Contributor

danhhz commented Apr 18, 2019

#36905 (comment) is on release-19.1 and is expected to be fixed by #36852, but that backport is waiting on 19.1.1.

#36905 (comment) is #36879, and there's definitely something to look into here.

#36905 (comment): dunno what happened here, maybe one of the tpcc overload things that just got fixed? The failure happens right at the point when the tpcc load and the changefeed both start. There's nothing in the logs about the changefeed, and it doesn't seem to be an OOM.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9938cb1a2cca4c0350244f76845f0c61391d44a7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1249130&tab=buildLog

The test failed on release-19.1:
	cdc.go:877,cdc.go:225,cdc.go:546,test.go:1237: max latency was more than allowed: 18m59.787838768s vs 10m0s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dd7c697e986fc528da7b12c6c10dcce7f64a486c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1252804&tab=buildLog

The test failed on release-19.1:
	cluster.go:1329,cdc.go:746,cdc.go:135,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1252804-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		     513            1.0            2.0      7.6      7.6      7.6      7.6 stockLevel
		   2m47s      513            4.0            2.0     48.2     56.6     56.6     56.6 delivery
		   2m47s      513           10.0           20.3     29.4     35.7     35.7     35.7 newOrder
		   2m47s      513            2.0            2.1      5.0      6.6      6.6      6.6 orderStatus
		   2m47s      513           20.0           20.6     14.7     18.9     19.9     19.9 payment
		   2m47s      513            2.0            2.0     11.0     14.7     14.7     14.7 stockLevel
		   2m48s      513            1.0            1.9     50.3     50.3     50.3     50.3 delivery
		   2m48s      513            8.0           20.2     25.2    285.2    285.2    285.2 newOrder
		   2m48s      513            1.0            2.1      5.2      5.2      5.2      5.2 orderStatus
		   2m48s      513           12.0           20.6     14.2     17.8    234.9    234.9 payment
		   2m48s      513            2.0            2.0      9.4     11.0     11.0     11.0 stockLevel
		: signal: killed
	cluster.go:1688,cdc.go:223,cdc.go:546,test.go:1237: unexpected status: failed

danhhz self-assigned this Apr 22, 2019
@danhhz
Contributor

danhhz commented Apr 22, 2019

This latest one, again, should hopefully be fixed once #36852 is backported.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ec4728ae986b46d4f57009233b86971198b275ed

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1255121&tab=buildLog

The test failed on master:
	cluster.go:1329,cdc.go:734,cdc.go:135,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1255121-cdc-crdb-chaos-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		l
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  19m13s     4451            0.0            1.9      0.0      0.0      0.0      0.0 delivery
		  19m13s     4451           22.0           19.7     30.4     33.6    520.1    520.1 newOrder
		  19m13s     4451            2.0            1.9      6.3      6.8      6.8      6.8 orderStatus
		  19m13s     4451           13.0           19.2     14.2     16.3     16.8     16.8 payment
		  19m13s     4451            1.0            1.9     19.9     19.9     19.9     19.9 stockLevel
		  19m14s     4451            2.0            1.9     39.8     41.9     41.9     41.9 delivery
		  19m14s     4451           13.0           19.7     35.7     35.7     44.0     44.0 newOrder
		  19m14s     4451            0.0            1.9      0.0      0.0      0.0      0.0 orderStatus
		  19m14s     4451            8.0           19.2     14.2     19.9     19.9     19.9 payment
		  19m14s     4451            1.0            1.9     12.6     12.6     12.6     12.6 stockLevel
		: signal: killed
	cluster.go:1688,cdc.go:223,cdc.go:535,test.go:1237: unexpected status: failed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c0d8e9d838fca9f79bc10d9fb43eeeaa502fdd91

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1255139&tab=buildLog

The test failed on release-19.1:
	cdc.go:865,cdc.go:225,cdc.go:535,test.go:1237: max latency was more than allowed: 10m53.353918368s vs 10m0s

@danhhz
Contributor

danhhz commented Apr 23, 2019

The latest failure (#36905 (comment)) should be fixed by #37009 (this is a different fix from the one I've been linking everywhere).

The one before (#36905 (comment)) is likely something that needs to be marked retryable, possibly the same as #36077 (comment).

W190423 06:39:35.545007 15475 ccl/changefeedccl/changefeed_stmt.go:459  [n1] CHANGEFEED job 445415483939127297 returning with error: [NotLeaseHolderError] r661: replica (n2,s2):2 not lease holder; replica (n3,s3):3 is

danhhz changed the title from "roachtest: cdc/crdb-chaos/rangefeed=true failed" to "roachtest: cdc/crdb-chaos/rangefeed=true failed (skipped on release-19.1)" Apr 23, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/856ba9108f112f85d406bbe88d2208651859336e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1274175&tab=buildLog

The test failed on branch=master, cloud=gce:
	cdc.go:874,cdc.go:225,cdc.go:544,test.go:1251: max latency was more than allowed: 11m23.488455649s vs 10m0s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/23155799e92e54915ae66259d06a630e981afbeb

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1277061&tab=buildLog

The test failed on branch=master, cloud=gce:
	cdc.go:874,cdc.go:225,cdc.go:544,test.go:1251: max latency was more than allowed: 11m12.871557456s vs 10m0s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/e4b8fe4656edc962b1ee6bae516e523d4bd7dfc9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/crdb-chaos/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1287045&tab=buildLog

The test failed on branch=master, cloud=gce:
	cdc.go:874,cdc.go:225,cdc.go:544,test.go:1251: max latency was more than allowed: 11m17.679187634s vs 10m0s

danhhz added a commit to danhhz/cockroach that referenced this issue May 13, 2019
Also make the crdb-chaos test more lenient to recovery times to work
around the flakes we're seeing. This thread should probably get pulled
at some point so leaving cockroachdb#36879 open to track it.

Closes cockroachdb#36905
Closes cockroachdb#36979

Release note: None
craig bot pushed a commit that referenced this issue May 13, 2019
37498: roachtest: unskip cdc/{crdb,sink}-chaos on 19.1 r=tbg a=danhhz

Also make the crdb-chaos test more lenient to recovery times to work
around the flakes we're seeing. This thread should probably get pulled
at some point so leaving #36879 open to track it.

Closes #36905
Closes #36979

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
craig bot closed this as completed in #37498 May 13, 2019