roachtest: clearrange/checks=true failed #38720

Closed
cockroach-teamcity opened this issue Jul 6, 2019 · 73 comments · Fixed by #42068
Labels: C-test-failure (Broken test, automatically or manually discovered), O-roachtest, O-robot (Originated from a bot)

Comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/9322e07476de447799c5d3011eb2874930ee2993

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1375546&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190706-1375546/clearrange/checks=true/run_1
	test_runner.go:696: test timed out (6h30m0s)
	cluster.go:1724,clearrange.go:56,clearrange.go:35,test_runner.go:681: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1562393890-17-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190706 10:31:39.229952 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Jul 6, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 6, 2019
@andreimatei
Contributor

timed out importing

@nvanbenschoten nvanbenschoten assigned dt and ajkr and unassigned andreimatei Jul 9, 2019
@nvanbenschoten
Member

Same cause as #38772, but let's leave this open to avoid re-triaging.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1ca35fc4a0e2665e7f6efd945e65a0db97984fa7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1396096&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190719-1396096/clearrange/checks=true/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563517204-17-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190719 09:32:00.507966 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7dab0dcfd37c389af357c302c073b9611b5ada25

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1398203&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190721-1398203/clearrange/checks=true/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563689854-18-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190721 09:35:27.459458 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190722-1399000/clearrange/checks=true/run_1
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563776264-16-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190722 10:33:58.520253 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7111a67b2ea3a19c2f312f8d214b8823f431cac0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1400942&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190723-1400942/clearrange/checks=true/run_1
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563862417-17-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190723 10:05:47.932230 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/26edea51118a0e16b61748c08068bfa6f76543ca

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1404886&tab=buildLog

The test failed on branch=provisional_201907241708_v19.2.0-alpha.20190729, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190725-1404886/clearrange/checks=true/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564034590-18-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190725 09:49:47.076984 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ff04012ed2d2c0c8e30e4de106ca0a350bca8c3e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1404856&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190725-1404856/clearrange/checks=true/run_1
	cluster.go:2090,clearrange.go:187,clearrange.go:35,test_runner.go:691: unexpected node event: 1: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/cfdaadc3514e7e8660f6c009ba159fdfd604f0a8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1409070&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190727-1409070/clearrange/checks=true/run_1
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564208378-16-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190727 10:52:17.185973 1 ccl/workloadccl/fixture.go:316  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/65055d6c16bf9386d8c4f4f9cd23e0a848814dc9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1411157&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190730-1411157/clearrange/checks=true/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564466961-19-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190730 09:24:30.985998 1 ccl/workloadccl/fixture.go:316  starting import of 1 tables
		: signal: killed
	cluster.go:1314,test_runner.go:677,panic.go:406,test.go:257,test.go:242,cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: r13355 (/Table/53/1/49587984) is inconsistent: RANGE_INCONSISTENT stats: {ContainsEstimates:false LastUpdateNanos:1564500009571458437 IntentAge:0 GCBytesAge:0 LiveBytes:7518372 LiveCount:732 KeyBytes:15372 KeyCount:732 ValBytes:7503000 ValCount:732 IntentBytes:0 IntentCount:0 SysBytes:1046 SysCount:8}
		replica (n10,s10):1 is inconsistent: expected checksum 66ad059006cb17a1822d70caeef6ff5f7319c84c3aa2b7caac4128aa62d50cd79a9f4ca30897229c85096f91952073368cc84752b5ab80541b8b7741204b4627, got 9666dfabd4e07d4062286bb92b1ac42d910d282819d5afe67452ee194c32e0938b2549afd626af320022b9726f96772a4fb56cb55b643bdcf439572bf3b3ef12
		persisted stats: exp {ContainsEstimates:false LastUpdateNanos:1564500009571458437 IntentAge:0 GCBytesAge:0 LiveBytes:7518372 LiveCount:732 KeyBytes:15372 KeyCount:732 ValBytes:7503000 ValCount:732 IntentBytes:0 IntentCount:0 SysBytes:1046 SysCount:8}, got {ContainsEstimates:false LastUpdateNanos:1564500009571458437 IntentAge:0 GCBytesAge:0 LiveBytes:41125084 LiveCount:4004 KeyBytes:84084 KeyCount:4004 ValBytes:41041000 ValCount:4004 IntentBytes:0 IntentCount:0 SysBytes:1046 SysCount:8}
		replica (n2,s2):3 is inconsistent: expected checksum 66ad059006cb17a1822d70caeef6ff5f7319c84c3aa2b7caac4128aa62d50cd79a9f4ca30897229c85096f91952073368cc84752b5ab80541b8b7741204b4627, got 9666dfabd4e07d4062286bb92b1ac42d910d282819d5afe67452ee194c32e0938b2549afd626af320022b9726f96772a4fb56cb55b643bdcf439572bf3b3ef12
		persisted stats: exp {ContainsEstimates:false LastUpdateNanos:1564500009571458437 IntentAge:0 GCBytesAge:0 LiveBytes:7518372 LiveCount:732 KeyBytes:15372 KeyCount:732 ValBytes:7503000 ValCount:732 IntentBytes:0 IntentCount:0 SysBytes:1046 SysCount:8}, got {ContainsEstimates:false LastUpdateNanos:1564500009571458437 IntentAge:0 GCBytesAge:0 LiveBytes:41125084 LiveCount:4004 KeyBytes:84084 KeyCount:4004 ValBytes:41041000 ValCount:4004 IntentBytes:0 IntentCount:0 SysBytes:1046 SysCount:8}
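
A quick way to read the stats mismatch above is to compare per-key averages on both sides. The sketch below copies the numbers from the report; the `Stats` struct and `PerKey` helper are hypothetical names used only for this illustration and are not CockroachDB's actual types. Under that assumption, it suggests both sides describe rows of the same shape and differ only in how many keys they count.

```go
package main

import "fmt"

// Stats is a hypothetical struct holding only the fields quoted in the
// inconsistency report; it is not the real MVCC stats type.
type Stats struct {
	LiveBytes, LiveCount int64
	KeyBytes, KeyCount   int64
	ValBytes, ValCount   int64
}

// PerKey returns the average key bytes, value bytes, and live bytes per key.
func (s Stats) PerKey() (key, val, live int64) {
	return s.KeyBytes / s.KeyCount, s.ValBytes / s.ValCount, s.LiveBytes / s.LiveCount
}

func main() {
	// "exp" and "got" values as printed in the persisted stats lines.
	exp := Stats{LiveBytes: 7518372, LiveCount: 732, KeyBytes: 15372, KeyCount: 732, ValBytes: 7503000, ValCount: 732}
	got := Stats{LiveBytes: 41125084, LiveCount: 4004, KeyBytes: 84084, KeyCount: 4004, ValBytes: 41041000, ValCount: 4004}

	ek, ev, el := exp.PerKey()
	gk, gv, gl := got.PerKey()
	fmt.Printf("exp per key: key=%d val=%d live=%d\n", ek, ev, el)
	fmt.Printf("got per key: key=%d val=%d live=%d\n", gk, gv, gl)
	fmt.Printf("extra keys counted: %d\n", got.KeyCount-exp.KeyCount)
}
```

Both sides come out to 21 key bytes and 10250 value bytes per key, and the "got" side counts 3272 more keys. That is consistent with a difference in the number of rows counted rather than in the row contents, though the report alone does not establish the cause.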
		

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/da56c792e968574b8f1d9ef3fdb45d56a530221a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1415578&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190801-1415578/clearrange/checks=true/run_1
	test_runner.go:706: test timed out (6h30m0s)

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5bd37e8eb58ca66b9293c234bc572411057fec3a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1417287&tab=buildLog

The test failed on branch=provisional_201908012151_v19.2.0-alpha.20190729, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190802-1417287/clearrange/checks=true/run_1
	test_runner.go:706: test timed out (6h30m0s)

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/51a6fdedf0ce1d1329d40d801a7deaf8206b6b07

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1420118&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190803-1420118/clearrange/checks=true/run_1
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564812638-17-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190803 10:51:22.086459 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		I190803 11:33:15.889023 13 ccl/workloadccl/fixture.go:516  imported bank (41m54s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/51a6fdedf0ce1d1329d40d801a7deaf8206b6b07

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1436116&tab=buildLog

The test failed on branch=provisional_201908060405_v19.1.4, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190812-1436116/clearrange/checks=true/run_1
	cluster.go:1735,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1565651234-17-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190813 03:58:32.651015 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		I190813 07:04:20.383029 67 ccl/workloadccl/fixture.go:516  imported bank (3h5m48s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: internal error: uncaught error: IO error: While pread offset 4441250 len 30842: /mnt/data1/cockroach/cockroach-temp806854214/009598.sst: Input/output error
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/01ee0704865391599abef3bbc89f462117f8007a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1445527&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190820-1445527/clearrange/checks=true/run_1
	test_runner.go:688: test timed out (6h30m0s)

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/93860e69f96aa3a86bd8bb42f310fb2629d53f39

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1447036&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190821-1447036/clearrange/checks=true/run_1
	cluster.go:1735,clearrange.go:56,clearrange.go:35,test_runner.go:673: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1566368490-18-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190821 10:13:24.887732 1 ccl/workloadccl/fixture.go:316  starting import of 1 tables
		Error: importing fixture: importing table bank: dial tcp 127.0.0.1:26257: connect: connection refused
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9a982e902638e116ed6a76f4fa635a0a1445d88a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1447054&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190821-1447054/clearrange/checks=true/run_1
	cluster.go:1735,clearrange.go:56,clearrange.go:35,test_runner.go:673: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1566367544-17-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190821 10:32:46.591900 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		I190821 11:39:54.713457 57 ccl/workloadccl/fixture.go:516  imported bank (1h7m8s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7ca0a86b8595c097fd8f27581b1509c47f17e8a3

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1450654&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190823-1450654/clearrange/checks=true/run_1
	cluster.go:1735,clearrange.go:56,clearrange.go:35,test_runner.go:673: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1566541739-16-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190823 10:46:20.560400 1 ccl/workloadccl/fixture.go:316  starting import of 1 tables
		Error: importing fixture: importing table bank: dial tcp 127.0.0.1:26257: connect: connection refused
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/40f8f0eb00f4b3bf5bac11fb5ae132e33a492713

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1452154&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190824-1452154/clearrange/checks=true/run_1
	cluster.go:1735,clearrange.go:56,clearrange.go:35,test_runner.go:673: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1566627477-16-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190824 09:06:13.979256 1 ccl/workloadccl/fixture.go:316  starting import of 1 tables
		Error: importing fixture: importing table bank: dial tcp 127.0.0.1:26257: connect: connection refused
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/497167b1c596eda2b70bed91c51ebf39b4356c33

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1453099&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190825-1453099/clearrange/checks=true/run_1
	cluster.go:1735,clearrange.go:56,clearrange.go:35,test_runner.go:673: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1566714671-16-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190825 10:14:27.341521 1 ccl/workloadccl/fixture.go:316  starting import of 1 tables
		Error: importing fixture: importing table bank: dial tcp 127.0.0.1:26257: connect: connection reset by peer
		Error:  exit status 1
		: exit status 1

@bdarnell
Contributor

bdarnell commented Oct 7, 2019

This test is our only one that sets COCKROACH_CONSISTENCY_AGGRESSIVE=true and COCKROACH_FATAL_ON_STATS_MISMATCH=true. That makes it turn up errors that other tests don't, but it incorrectly suggests that the issue is related to clearrange (could be, but there's no reason to think that now). Maybe we should have a more neutral test that just runs tpcc/bank/kv with these checks enabled.

tbg added a commit to tbg/cockroach that referenced this issue Oct 8, 2019
The clearrange is the only test running with this option, and it fired.
Increase our coverage of stats mismatches to hopefully find a better
repro target.

See
cockroachdb#38720 (comment).

Release note: None
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/239513342a2d23f683bbc1d386f87ff59cc78d10

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1539668&tab=artifacts#/clearrange/checks=true

The test failed on branch=provisional_201910141814_v19.2.0-rc.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191015-1539668/clearrange/checks=true/run_1
	test_runner.go:704: test timed out (6h30m0s)

craig bot pushed a commit that referenced this issue Oct 16, 2019
41432: roachprod: fatal nodes on stats mismatch r=bdarnell a=tbg

The clearrange is the only test running with this option, and it fired.
Increase our coverage of stats mismatches to hopefully find a better
repro target.

See
#38720 (comment).

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Oct 22, 2019
In light of the stats inconsistency seen in [clearrange], we want to be
stricter about verifying the stats in nightly testing. This commit makes
sure `./cockroach debug check-store` is fast enough to do so:

On a ~71GB fully compacted store directory it reliably takes well below
two minutes (on GCE local SSD).

[clearrange]: cockroachdb#38720 (comment)

Release note (performance improvement): The `./cockroach debug check-store` command is now faster.
@tbg
Member

tbg commented Oct 23, 2019

We've got a repro of the stats inconsistency on #37815 (comment). I'm stressing that test overnight to get my hands on a data dir.

tbg added a commit to tbg/cockroach that referenced this issue Oct 25, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/262e6f2499e34eb4373d0450fa9f6a820a609b2c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1565222&tab=artifacts#/clearrange/checks=true

The test failed on branch=provisional_201910301435_v19.2.0-rc.3, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191030-1565222/clearrange/checks=true/run_1
	test_runner.go:712: test timed out (6h30m0s)

@solongordon solongordon mentioned this issue Oct 31, 2019
18 tasks
@tbg
Member

tbg commented Oct 31, 2019

In the clearrange logs, I'm seeing this message all over the place:

E191031 01:05:15.030119 889001 storage/replica_range_lease.go:917 [n1,s1,r1800/8:/Table/53/1/589{23508-31416}] lease repl=(n5,s5):6 seq=43 start=1572465046.159715450,0 epo=2 pro=1572465046.159720863,0 owned by replica (n5,s5):6 that no longer exists

This is from

// However, this is possible if the `cockroach debug
// unsafe-remove-dead-replicas` command has been used, so
// this is just a logged error instead of a fatal
// assertion.
log.Errorf(ctx, "lease %s owned by replica %+v that no longer exists",
	status.Lease, status.Lease.Replica)

and we're not supposed to be able to hit it in this test.

@ajwerner
Contributor

Here's what we see in the logs and the range status report:

19:50:07.164401476: n8 becomes leaseholder for range r1800 at seq 42
19:50:46.152919: [n8,s8,r1800/1:/Table/53/1/589{23508-31416}] proposing ENTER_JOINT ADD_REPLICA[(n1,s1):8VOTER_INCOMING], REMOVE_REPLICA[(n5,s5):6VOTER_OUTGOING]: after=[(n8,s8):1 (n9,s9):7 (n5,s5):6VOTER_OUTGOING (n1,s1):8VOTER_INCOMING] next=9
19:50:46.15971545: n5 becomes leaseholder at seq 43 (not clear whether the above proposal has committed)

At this point it seems like the replicate queue runs on n5, the new leaseholder, which detects the joint configuration and attempts to move out of it:

19:50:46.194579: [n5,replicate,s5,r1800/6:/Table/53/1/589{23508-31416}] change replicas (add [] remove []): existing descriptor r1800:/Table/53/1/589{23508-31416} [(n8,s8):1, (n9,s9):7, (n5,s5):6VOTER_OUTGOING, (n1,s1):8VOTER_INCOMING, next=9, gen=159]                                                                             
19:50:46.200054: [n5,s5,r1800/6:/Table/53/1/589{23508-31416}] proposing LEAVE_JOINT: after=[(n8,s8):1 (n9,s9):7 (n1,s1):8] next=9

We then immediately see

19:50:46.204074: [n5,s5,r1800/6:/Table/53/1/589{23508-31416}] removing replica r1800/6                               
19:50:46.213758: [n8,s8,r1800/1:/Table/53/1/589{23508-31416}] lease repl=(n5,s5):6 seq=43 start=1572465046.159715450,0 epo=2 pro=1572465046.159720863,0 owned by replica (n5,s5):6 that no longer exists

Which continues until the test times out.

Below is the lease history, followed by a translation of its wall times to timestamps:

"lease_history": [
    {
      "start": {
        "wall_time": 1572465007164401476
      },
      "replica": {
        "node_id": 8,
        "store_id": 8,
        "replica_id": 1
      },
      "proposed_ts": {
        "wall_time": 1572465007164404847
      },
      "epoch": 3,
      "sequence": 42
    },
    {
      "start": {
        "wall_time": 1572465046159715450
      },
      "replica": {
        "node_id": 5,
        "store_id": 5,
        "replica_id": 6
      },
      "proposed_ts": {
        "wall_time": 1572465046159720863
      },
      "epoch": 2,
      "sequence": 43
    }
  ],
{
    "42": {
        "proposed": "2019-10-30 19:50:07.164404847 +0000 UTC",
        "start": "2019-10-30 19:50:07.164401476 +0000 UTC"
    },
    "43": {
        "proposed": "2019-10-30 19:50:46.159720863 +0000 UTC",
        "start": "2019-10-30 19:50:46.15971545 +0000 UTC"
    }
}
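
The translation above can be reproduced with a few lines of Go. This is a minimal sketch that assumes `wall_time` is nanoseconds since the Unix epoch, which is consistent with the timestamps shown; the helper name `wallTimeToUTC` is made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// wallTimeToUTC converts a wall_time value, assumed to be nanoseconds since
// the Unix epoch, into a UTC time.Time. The helper name is hypothetical.
func wallTimeToUTC(wallNanos int64) time.Time {
	return time.Unix(0, wallNanos).UTC()
}

func main() {
	// Values copied from the lease_history entries above.
	leases := []struct {
		seq             int
		start, proposed int64
	}{
		{42, 1572465007164401476, 1572465007164404847},
		{43, 1572465046159715450, 1572465046159720863},
	}
	for _, l := range leases {
		fmt.Printf("seq=%d start=%s proposed=%s\n",
			l.seq, wallTimeToUTC(l.start), wallTimeToUTC(l.proposed))
	}
}
```

Go's default time formatting drops trailing zeros in the fractional seconds, which is why the start of sequence 43 shows as 19:50:46.15971545.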

@ajwerner
Contributor

In theory the check below should prevent the command that removes n5 from being applied, but it doesn't seem to be doing the trick:

// Ensure that we aren't trying to remove ourselves from the range without
// having previously given up our lease, since the range won't be able
// to make progress while the lease is owned by a removed replica (and
// leases can stay in such a state for a very long time when using epoch-
// based range leases). This shouldn't happen often, but has been seen
// before (#12591).
replID := p.command.ProposerReplica.ReplicaID
for _, rDesc := range crt.Removed() {
	if rDesc.ReplicaID == replID {
		msg := fmt.Sprintf("received invalid ChangeReplicasTrigger %s to remove self (leaseholder)", crt)
		log.Error(p.ctx, msg)
		return 0, roachpb.NewErrorf("%s: %s", r, msg)
	}
}

@ajwerner
Contributor

I'm pretty sure the bug is that a LEAVE_JOINT command always returns an empty set from Removed(). This means that if a replica is VOTER_OUTGOING when it becomes the leaseholder, it will happily remove itself.

See the logic exercised by this test:

crt.InternalRemovedReplicas = nil
crt.InternalAddedReplicas = nil
repl1.Type = ReplicaTypeVoterFull()
crt.Desc.SetReplicas(MakeReplicaDescriptors([]ReplicaDescriptor{repl1, learner}))
act = crt.String()
require.Empty(t, crt.Added())
require.Empty(t, crt.Removed())
exp = "LEAVE_JOINT: after=[(n1,s2):3 (n7,s8):9LEARNER] next=10"
require.Equal(t, exp, act)

Typing up a patch now.

@ajwerner
Contributor

Actually, I guess there are two ways we can prevent this specific problem from occurring. The first is easier but leaves me with more questions about how we get out of a bad scenario; the second has a less obvious implementation.

  1. Prevent the leaseholder which is VOTER_OUTGOING from removing itself
     • This is easy to do by augmenting the above check in propose()
     • How does the leaseholder ever get out of VOTER_OUTGOING? Do we need to add logic to make it transfer the lease away?
  2. Prevent a replica from receiving the lease while VOTER_OUTGOING in the first place
     • I don't see an obvious way to get this invariant.
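
Option 1 could look roughly like the following. This is only a sketch under simplifying assumptions: the types, the `leavingJoint` flag, and `checkNotRemovingSelf` are hypothetical and do not mirror the real propose() code. It also leaves open the question of how the leaseholder then gets out of VOTER_OUTGOING.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaType and ReplicaDescriptor are simplified, hypothetical stand-ins
// for the real types.
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	VoterIncoming
	VoterOutgoing
)

type ReplicaDescriptor struct {
	ReplicaID int
	Type      ReplicaType
}

var errRemoveSelf = errors.New("invalid change: leaseholder would remove itself")

// checkNotRemovingSelf rejects a change that would drop the proposer, either
// because it is listed as removed explicitly or because it is VOTER_OUTGOING
// in the current descriptor and the change leaves the joint configuration.
func checkNotRemovingSelf(
	proposerID int, explicitlyRemoved, current []ReplicaDescriptor, leavingJoint bool,
) error {
	for _, r := range explicitlyRemoved {
		if r.ReplicaID == proposerID {
			return errRemoveSelf
		}
	}
	if leavingJoint {
		for _, r := range current {
			if r.ReplicaID == proposerID && r.Type == VoterOutgoing {
				return errRemoveSelf
			}
		}
	}
	return nil
}

func main() {
	// r1800 in its joint configuration; replica 6 (n5) holds the lease.
	current := []ReplicaDescriptor{
		{1, VoterFull}, {7, VoterFull}, {6, VoterOutgoing}, {8, VoterIncoming},
	}
	fmt.Println("proposer 6:", checkNotRemovingSelf(6, nil, current, true))
	fmt.Println("proposer 1:", checkNotRemovingSelf(1, nil, current, true))
}
```

With the r1800 configuration, a LEAVE_JOINT proposed by replica 6 is rejected even though the explicit removed list is empty, while the same proposal from replica 1 passes.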

@tbg
Member

tbg commented Oct 31, 2019

I thought 2) was true:

} else if t := repDesc.GetType(); t != roachpb.VOTER_FULL {
	// NB: there's no harm in transferring the lease to a VOTER_INCOMING,
	// but we disallow it anyway. On the other hand, transferring to
	// VOTER_OUTGOING would be a pretty bad idea since those voters are
	// dropped when transitioning out of the joint config, which then
	// amounts to removing the leaseholder without any safety precautions.
	// This would either wedge the range or allow illegal reads to be
	// served.
	//
	// Since the leaseholder can't remove itself and is a VOTER_FULL, we
	// also know that in any configuration there's at least one VOTER_FULL.
	//
	// TODO(tbg): if this code path is hit during a lease transfer (we check
	// upstream of raft, but this check has false negatives) then we are in
	// a situation where the leaseholder is a node that has set its
	// minProposedTS and won't be using its lease any more. Either the setting
	// of minProposedTS needs to be "reversible" (tricky) or we make the
	// lease evaluation succeed, though with a lease that's "invalid" so that
	// a new lease can be requested right after.
	return errors.Errorf(`replica of type %s cannot hold lease`, t)
}

Is this code just broken because it looks up the replica for the store evaluating the command, rather than the transfer target? Seems like it...

@ajwerner
Contributor

That's pretty 🤦‍♂️ but is at least an easy fix.

@craig craig bot closed this as completed in cb127fd Oct 31, 2019
tbg added a commit to tbg/cockroach that referenced this issue Nov 4, 2019