Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpccbench/nodes=9/cpu=4/chaos/partition failed #39005

Closed
cockroach-teamcity opened this issue Jul 19, 2019 · 38 comments · Fixed by #40431 or #40981
Closed

roachtest: tpccbench/nodes=9/cpu=4/chaos/partition failed #39005

cockroach-teamcity opened this issue Jul 19, 2019 · 38 comments · Fixed by #40431 or #40981
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/1ca35fc4a0e2665e7f6efd945e65a0db97984fa7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1396096&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190719-1396096/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:706: test timed out (10h0m0s)
	cluster.go:2090,tpcc.go:842,tpcc.go:559,test_runner.go:691: context canceled

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Jul 19, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 19, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190722-1399000/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:706: test timed out (10h0m0s)
	cluster.go:2090,tpcc.go:842,tpcc.go:559,test_runner.go:691: context canceled

@tbg
Copy link
Member

tbg commented Jul 23, 2019

After a completed instance of the load, the next one never quite got to starting. The chaos is this:

if b.Chaos {
// Increase the load generation duration.
loadDur = 10 * time.Minute
// Kill one node at a time.
ch := Chaos{
Timer: Periodic{Period: 90 * time.Second, DownTime: 5 * time.Second},
Target: roachNodes.randNode,
Stopper: loadDone,
}

and there's a neverending stream of

teamcity-1563776264-09-n10cpu4: initializing cluster settings9: 11474

14:09:16 test.go:181: test status: running benchmark, warehouses=1230
14:09:16 cluster.go:315: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563776264-09-n10cpu4:10 -- ./workload run tpcc --warehouses=2000 --active-warehouses=1230 --tolerate-errors --ramp=5m0s --duration=10m0s --partitions=3 {pgurl:10} --histograms=perf/warehouses=1230/stats.json
Initializing 2460 connections...
Initializing 12300 workers and preparing statements...
I190722 14:10:21.519267 1 workload/cli/run.go:336  retrying after error while creating load: preparing 
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id: EOF
Initializing 2460 connections...
CHAOS: 14:10:46 chaos.go:90: killing :3
14:10:46 test.go:181: test status: stopping cluster
14:10:46 cluster.go:315: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1563776264-09-n10cpu4:3
teamcity-1563776264-09-n10cpu4: stopping and waiting.3: kill exited nonzero
3: dead

CHAOS: 14:10:53 chaos.go:119: restarting :3 after 5s of downtime
14:10:53 test.go:181: test status: starting cluster
14:10:53 cluster.go:315: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1563776264-09-n10cpu4:3
teamcity-1563776264-09-n10cpu4: starting..3: 10710

Initializing 12300 workers and preparing statements...
I190722 14:12:22.049897 1 workload/cli/run.go:336  retrying after error while creating load: preparing 
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id: EOF
Initializing 2460 connections...
CHAOS: 14:12:23 chaos.go:90: killing :1
14:12:23 test.go:181: test status: stopping cluster
14:12:23 cluster.go:315: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1563776264-09-n10cpu4:1
teamcity-1563776264-09-n10cpu4: stopping and waiting.1: kill exited nonzero
1: dead

CHAOS: 14:12:30 chaos.go:119: restarting :1 after 5s of downtime
14:12:30 test.go:181: test status: starting cluster
14:12:30 cluster.go:315: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1563776264-09-n10cpu4:1
teamcity-1563776264-09-n10cpu4: starting..
teamcity-1563776264-09-n10cpu4: initializing cluster settings1: 13221

CHAOS: 14:14:00 chaos.go:90: killing :4
14:14:00 test.go:181: test status: stopping cluster
14:14:00 cluster.go:315: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1563776264-09-n10cpu4:4
teamcity-1563776264-09-n10cpu4: stopping and waiting.4: kill exited nonzero
4: dead

CHAOS: 14:14:07 chaos.go:119: restarting :4 after 5s of downtime
14:14:07 test.go:181: test status: starting cluster
14:14:07 cluster.go:315: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1563776264-09-n10cpu4:4
teamcity-1563776264-09-n10cpu4: starting..
4: 12346
Initializing 12300 workers and preparing statements...
I190722 14:14:21.277299 1 workload/cli/run.go:336  retrying after error while creating load: preparing 
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id: EOF
Initializing 2460 connections...

until the test times out. I take this to mean that the prepare phase takes more than the time it takes for the gateway node to be killed. Nodes are killed randomly every 90s, so I assume somehow we got into a state where the PREPAREs never returned. I looked into some of the node logs and wasn't seeing anything out of the ordinary, but of course they're pretty messy thanks to the chaos.

The last failure mode of this test in its history prior to this issue was an assertion that has since been fixed, but it also has timed out before: https://teamcity.cockroachdb.com/viewLog.html?buildTypeId=Cockroach_Nightlies_WorkloadNightly&buildId=1367354

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/26edea51118a0e16b61748c08068bfa6f76543ca

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1404886&tab=buildLog

The test failed on branch=provisional_201907241708_v19.2.0-alpha.20190729, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190725-1404886/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:2090,tpcc.go:842,tpcc.go:559,test_runner.go:691: signal: interrupt
	cluster.go:1788,cluster.go:1807,cluster.go:1911,cluster.go:1294,context.go:135,cluster.go:1291,test_runner.go:677,panic.go:406,test.go:257,test.go:242,cluster.go:2090,tpcc.go:842,tpcc.go:559,test_runner.go:691: context canceled

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/cfdaadc3514e7e8660f6c009ba159fdfd604f0a8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1409070&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190727-1409070/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:706: test timed out (10h0m0s)
	cluster.go:2090,tpcc.go:842,tpcc.go:559,test_runner.go:691: context canceled

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5bd37e8eb58ca66b9293c234bc572411057fec3a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1417287&tab=buildLog

The test failed on branch=provisional_201908012151_v19.2.0-alpha.20190729, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190802-1417287/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	tpcc.go:693,tpcc.go:559,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod install teamcity-1564706298-09-n10cpu4:10 -- haproxy returned:
		stderr:
		
		stdout:
		iverse Translation-en [8,440 B]
		Get:38 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/main Sources [4,848 B]
		Get:39 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/universe Sources [6,740 B]
		Get:40 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [7,280 B]
		Get:41 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/main Translation-en [4,456 B]
		Get:42 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [7,804 B]
		Get:43 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-backports/universe Translation-en [4,184 B]
		Fetched 26.3 MB in 5s (5,069 kB/s)
		Reading package lists...
		E: Failed to fetch http://us-central1.gce.archive.ubuntu.com/ubuntu/dists/xenial/universe/binary-amd64/by-hash/SHA256/c8bc8c1425fef0712430d3991cf15ea96892aa5e13130b36c126fa90887ca756  Hash Sum mismatch
		E: Some index files failed to download. They have been ignored, or old ones used instead.
		Error:  exit status 100
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/e8faca611a902766154ed82581d6d3a7483ad231

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1460982&tab=buildLog

The test failed on branch=provisional_201908291837_v19.2.0-beta.20190903, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190829-1460982/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:2114,tpcc.go:868,tpcc.go:579,test_runner.go:673: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567109858-75-n10cpu4:10 -- ./workload fixtures load tpcc --warehouses=2000 --scatter --checks=false --partitions=3 {pgurl:1} returned:
		stderr:
		
		stdout:
		ndex entries, 7.4 MiB)
		I190830 01:02:14.623231 72 ccl/workloadccl/fixture.go:544  loaded 8.5 GiB table history in 8m9.150207014s (60000000 rows, 120000000 index entries, 18 MiB)
		I190830 01:07:50.753241 71 ccl/workloadccl/fixture.go:544  loaded 34 GiB table customer in 13m45.280133153s (60000000 rows, 60000000 index entries, 43 MiB)
		I190830 01:16:51.101983 76 ccl/workloadccl/fixture.go:544  loaded 63 GiB table stock in 22m45.628716396s (200000000 rows, 200000000 index entries, 47 MiB)
		I190830 01:17:52.367471 77 ccl/workloadccl/fixture.go:544  loaded 46 GiB table order_line in 23m46.89445485s (600008449 rows, 600008449 index entries, 33 MiB)
		I190830 01:17:52.713404 1 ccl/workloadccl/cliccl/fixtures.go:290  restored 155 GiB bytes in 9 tables (took 23m47.254263441s, 110.96 MiB/s)
		Error: Could not postload: could not partition tables: Couldn't exec "ALTER PARTITION p1_0 OF TABLE order_line CONFIGURE ZONE USING constraints = '[+rack=0]'": pq: partition "p1_0" does not exist on index "primary"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/66bd279c9aa682c2b7adcec87ec0c639b8039a33

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1461635&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190830-1461635/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:2114,tpcc.go:868,tpcc.go:579,test_runner.go:673: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567146353-64-n10cpu4:10 -- ./workload fixtures load tpcc --warehouses=2000 --scatter --checks=false --partitions=3 {pgurl:1} returned:
		stderr:
		
		stdout:
		 index entries, 7.2 MiB)
		I190830 10:49:44.856615 88 ccl/workloadccl/fixture.go:544  loaded 8.5 GiB table history in 13m8.040541331s (60000000 rows, 120000000 index entries, 11 MiB)
		I190830 10:55:01.313260 87 ccl/workloadccl/fixture.go:544  loaded 34 GiB table customer in 18m24.496935208s (60000000 rows, 60000000 index entries, 32 MiB)
		I190830 11:00:07.631060 93 ccl/workloadccl/fixture.go:544  loaded 46 GiB table order_line in 23m30.814999048s (600008449 rows, 600008449 index entries, 33 MiB)
		I190830 11:05:45.284294 92 ccl/workloadccl/fixture.go:544  loaded 63 GiB table stock in 29m8.46788165s (200000000 rows, 200000000 index entries, 37 MiB)
		I190830 11:05:45.742027 1 ccl/workloadccl/cliccl/fixtures.go:290  restored 155 GiB bytes in 9 tables (took 29m8.938567126s, 90.55 MiB/s)
		Error: Could not postload: could not partition tables: Couldn't exec "ALTER PARTITION p1_0 OF TABLE order_line CONFIGURE ZONE USING constraints = '[+rack=0]'": pq: partition "p1_0" does not exist on index "primary"
		Error:  exit status 1
		: exit status 1

@solongordon solongordon self-assigned this Aug 30, 2019
@solongordon
Copy link
Contributor

Looks like this test needs an update after the recent partitioning changes.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/e8faca611a902766154ed82581d6d3a7483ad231

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1462518&tab=buildLog

The test failed on branch=provisional_201908291837_v19.2.0-beta.20190903, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190830-1462518/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:2114,tpcc.go:868,tpcc.go:579,test_runner.go:673: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567188851-74-n10cpu4:10 -- ./workload fixtures load tpcc --warehouses=2000 --scatter --checks=false --partitions=3 {pgurl:1} returned:
		stderr:
		
		stdout:
		 entries, 4.8 MiB)
		I190830 23:07:52.061347 88 ccl/workloadccl/fixture.go:544  loaded 8.5 GiB table history in 13m7.418185085s (60000000 rows, 120000000 index entries, 11 MiB)
		I190830 23:54:57.465902 87 ccl/workloadccl/fixture.go:544  loaded 34 GiB table customer in 1h0m12.822433769s (60000000 rows, 60000000 index entries, 9.8 MiB)
		I190830 23:58:50.742437 92 ccl/workloadccl/fixture.go:544  loaded 63 GiB table stock in 1h4m6.099055598s (200000000 rows, 200000000 index entries, 17 MiB)
		I190830 23:59:58.057121 93 ccl/workloadccl/fixture.go:544  loaded 46 GiB table order_line in 1h5m13.413953506s (600008449 rows, 600008449 index entries, 12 MiB)
		I190830 23:59:58.778726 1 ccl/workloadccl/cliccl/fixtures.go:290  restored 155 GiB bytes in 9 tables (took 1h5m14.14875638s, 40.46 MiB/s)
		Error: Could not postload: could not partition tables: Couldn't exec "ALTER PARTITION p1_0 OF TABLE order_line CONFIGURE ZONE USING constraints = '[+rack=0]'": pq: partition "p1_0" does not exist on index "primary"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d51fa78ff90a113c9009d263dfaf58d3672670a6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1463583&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190902-1463583/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:2114,tpcc.go:868,tpcc.go:579,test_runner.go:673: unexpected node event: 2: dead

@tbg
Copy link
Member

tbg commented Sep 3, 2019

^- this seems pretty bad,

F190902 10:41:24.184057 143 storage/store.go:2172  [n2,s2,r145/5:/Table/5{6/1/300/…-7}] replica descriptor of local store not found in right hand side of split
goroutine 143 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000448301, 0xc000448360, 0x0, 0xc007aaca60)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1016 +0xb1
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x7baba00, 0xc000000004, 0x7379447, 0x10, 0x87c, 0xc0091acfc0, 0x6e)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x93e
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x4e29ec0, 0xc003f1eff0, 0x4, 0x2, 0x454c25b, 0x47, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2cc
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x4e29ec0, 0xc003f1eff0, 0x1, 0xc000000004, 0x454c25b, 0x47, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:180
github.com/cockroachdb/cockroach/pkg/storage.splitPostApply(0x4e29ec0, 0xc003f1eff0, 0x0, 0x15c098fc191eee30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2172 +0xa65
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleSplitResult(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_application_result.go:233
github.com/cockroachdb/cockroach/pkg/storage.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult(0xc006cc68c0, 0x4e29ec0, 0xc003f1eff0, 0x0, 0x0, 0xc006084840, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_application_state_machine.go:943 +0x852
github.com/cockroachdb/cockroach/pkg/storage.(*replicaStateMachine).ApplySideEffects(0xc006cc68c0, 0x4e77dc0, 0xc0034c6008, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_application_state_machine.go:856 +0x72d
github.com/cockroachdb/cockroach/pkg/storage/apply.mapCheckedCmdIter(0x7f11ebc78500, 0xc006cc6ad8, 0xc0072473d8, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/apply/cmd.go:182 +0x11b
github.com/cockroachdb/cockroach/pkg/storage/apply.(*Task).applyOneBatch(0xc007247800, 0x4e29ec0, 0xc003f1eff0, 0x4e77e80, 0xc006cc6a78, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/apply/task.go:276 +0x228
github.com/cockroachdb/cockroach/pkg/storage/apply.(*Task).ApplyCommittedEntries(0xc007247800, 0x4e29ec0, 0xc003f1eff0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/apply/task.go:242 +0xcf
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc006cc6800, 0x4e29ec0, 0xc003f1eff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:759 +0xd87
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x4e29ec0, 0xc003f1eff0, 0xc006cc6800, 0x4e29ec0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3615 +0x131
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc000b18000, 0x4e29ec0, 0xc003f1eff0, 0xc003d59a00, 0xc007247e98, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3368 +0x150
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc000b18000, 0x4e29ec0, 0xc00065c690, 0x91)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3604 +0x1f4
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc0008e6780, 0x4e29ec0, 0xc00065c690)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:238 +0x214
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x4e29ec0, 0xc00065c690)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:161 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc000292e50, 0xc000671ef0, 0xc000292e20)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0xfb
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xa8

@tbg
Copy link
Member

tbg commented Sep 3, 2019

Oh wait, that's the same thing I'm already fixing in

#39796

@tbg tbg self-assigned this Sep 3, 2019
craig bot pushed a commit that referenced this issue Sep 3, 2019
40248: opt: calculate number of rows processed when costing joins r=rytaft a=rytaft

This PR updates the costing of joins to take into account the number of
rows processed by the operator. This number may be larger than the
number of output rows if an additional filter is applied as part of the
ON condition that is not used to determine equality
columns for the join.

For example, consider the query
  `SELECT * FROM abc JOIN def ON a = e AND b = 3;`

Assuming there is no index on b, if a lookup join is used to execute this
query, the number of rows processed is actually the same as the query
  `SELECT * FROM abc JOIN def ON a = e;`

The difference is that the filter b=3 must also be applied to every row in
the first query. The coster now takes this into account when determining
the cost of joins.

Fixes #34810

Release note: None

40431: workload: fix partition commands in tpcc import r=solongordon a=solongordon

The commands for partitioning indexes in the TPCC import were erroring
out due to a syntax change introduced in #39332. I updated them to use
`ALTER PARTITION ... OF INDEX` rather than `ALTER PARTITION ... OF
TABLE`.

Fixes #39005
Fixes #40360
Fixes #40416

Release note: None

Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
@craig craig bot closed this as completed in f56a83d Sep 3, 2019
@solongordon
Copy link
Contributor

I fixed a related issue but looks like there is still an outstanding one being handled by @tbg.

@solongordon solongordon reopened this Sep 3, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/4784fe3c51545db5fb5d411937ec1db2ef2b9761

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1472753&tab=buildLog

The test failed on branch=provisional_201909060000_v19.2.0-beta.20190910, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190906-1472753/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:2114,tpcc.go:868,tpcc.go:579,test_runner.go:688: unexpected node event: 9: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/201b400c5a59d42d436f417a284a129cff3ed7b3

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1471316&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190906-1471316/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:703: test timed out (10h0m0s)

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/47bb2a58c87fc1259291ec9dde78de3e54bd8a3d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1475396&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190909-1475396/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:703: test timed out (10h0m0s)

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/bdf41f7d03f0dafeeaf3bc6aac40f502ab069a6a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1478770&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190911-1478770/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:1664,tpcc.go:752,search.go:43,search.go:173,tpcc.go:747,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --encrypt=false --racks=3 teamcity-1568183721-17-n10cpu4:1-9 returned:
		stderr:
		
		stdout:
		file: /mnt/data1/cockroach/cockroach-temp423259631/TEMP_DIR.LOCK: Resource temporarily unavailable
		*
		Error: could not cleanup temporary directories from record file: could not lock temporary directory /mnt/data1/cockroach/cockroach-temp423259631, may still be in use: IO error: While lock file: /mnt/data1/cockroach/cockroach-temp423259631/TEMP_DIR.LOCK: Resource temporarily unavailable
		Failed running "start"
		E190911 11:58:14.950135 1 cli/error.go:233  exit status 1
		Error: exit status 1
		Failed running "start"
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:256
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1535
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1337: 
		I190911 11:58:17.574765 1 cluster_synced.go:1617  command failed
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/66832694652037f18cd4b29e1471cd237009ef98

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1478788&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190911-1478788/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:703: test timed out (10h0m0s)

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/991282eacbbe1315fde694be9785ad8f6fa929d3

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1481778&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190912-1481778/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:2114,tpcc.go:868,tpcc.go:579,test_runner.go:688: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1568268581-66-n10cpu4:10 -- ./workload fixtures load tpcc --warehouses=2000 --scatter --checks=false --partitions=3 {pgurl:1} returned:
		stderr:
		
		stdout:
		 entries, 4.8 MiB)
		I190912 10:11:39.966240 75 ccl/workloadccl/fixture.go:553  loaded 8.5 GiB table history in 12m18.849426866s (60000000 rows, 120000000 index entries, 12 MiB)
		I190912 10:17:08.271193 74 ccl/workloadccl/fixture.go:553  loaded 34 GiB table customer in 17m47.154136974s (60000000 rows, 60000000 index entries, 33 MiB)
		I190912 10:34:08.441739 80 ccl/workloadccl/fixture.go:553  loaded 46 GiB table order_line in 34m47.324967681s (600008449 rows, 600008449 index entries, 22 MiB)
		I190912 10:34:26.512157 79 ccl/workloadccl/fixture.go:553  loaded 63 GiB table stock in 35m5.395087093s (200000000 rows, 200000000 index entries, 30 MiB)
		I190912 10:34:27.040308 1 ccl/workloadccl/cliccl/fixtures.go:290  restored 155 GiB bytes in 9 tables (took 35m5.92414155s, 75.20 MiB/s)
		Error: Could not postload: could not partition tables: Couldn't exec "ALTER PARTITION p1_0 OF INDEX order_line@order_line_stock_fk_idx CONFIGURE ZONE USING constraints = '[+rack=0]'": pq: syntax error at or near "index"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/42d307e191ff6787a45e058be164fa452c47f368

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1487519&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190914-1487519/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:2114,tpcc.go:868,tpcc.go:579,test_runner.go:688: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1568441097-16-n10cpu4:10 -- ./workload fixtures load tpcc --warehouses=2000 --scatter --checks=false --partitions=3 {pgurl:1} returned:
		stderr:
		
		stdout:
		ries, 6.4 MiB)
		I190914 09:56:08.196039 65 ccl/workloadccl/fixture.go:553  loaded 8.5 GiB table history in 14m49.038691165s (60000000 rows, 120000000 index entries, 9.8 MiB)
		I190914 10:03:43.646802 85 ccl/workloadccl/fixture.go:553  loaded 63 GiB table stock in 22m24.489145095s (200000000 rows, 200000000 index entries, 48 MiB)
		I190914 10:04:53.592355 64 ccl/workloadccl/fixture.go:553  loaded 34 GiB table customer in 23m34.434663049s (60000000 rows, 60000000 index entries, 25 MiB)
		I190914 10:15:42.344484 86 ccl/workloadccl/fixture.go:553  loaded 46 GiB table order_line in 34m23.186009977s (600008449 rows, 600008449 index entries, 23 MiB)
		I190914 10:15:42.882510 1 ccl/workloadccl/cliccl/fixtures.go:290  restored 155 GiB bytes in 9 tables (took 34m23.725239226s, 76.74 MiB/s)
		Error: Could not postload: could not partition tables: Couldn't exec "ALTER PARTITION p1_0 OF INDEX order_line@order_line_stock_fk_idx CONFIGURE ZONE USING constraints = '[+rack=0]'": pq: syntax error at or near "index"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/62b1678f652461bbc1aaf6bc2c0dd03105ce0ebe

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1488785&tab=buildLog

The test failed on branch=40765, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190914-1488785/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:703: test timed out (10h0m0s)

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/62b1678f652461bbc1aaf6bc2c0dd03105ce0ebe

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1489712&tab=buildLog

The test failed on branch=40765, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190915-1489712/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:703: test timed out (10h0m0s)

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/169729d6a3d1c18d5d652fc40a87bbf5a3bb8a00

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1491709&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190917-1491709/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:703: test timed out (10h0m0s)

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d6a7e59e653596b8baca946b6be714956a0e4c2c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1496394&tab=artifacts#/tpccbench/nodes=9/cpu=4/chaos/partition

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190919-1496394/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:704: test timed out (10h0m0s)

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/073999b81ddfed3bbc8409d534912fea12b6d500

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1498740&tab=artifacts#/tpccbench/nodes=9/cpu=4/chaos/partition

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190920-1498740/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:704: test timed out (10h0m0s)

@jordanlewis
Copy link
Member

@nvanbenschoten this doesn't seem like a real problem? Is the test configured to run for longer than the default timeout of 10 hours by accident?

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d6a7e59e653596b8baca946b6be714956a0e4c2c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1499672&tab=artifacts#/tpccbench/nodes=9/cpu=4/chaos/partition

The test failed on branch=provisional_201909201312_v19.1.5, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190920-1499672/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:704: test timed out (10h0m0s)

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a92c7d01d3076eabafbd536d8a344511ec9081c6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1500206&tab=artifacts#/tpccbench/nodes=9/cpu=4/chaos/partition

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190921-1500206/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:1664,tpcc.go:762,search.go:43,search.go:173,tpcc.go:757,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --encrypt=false --racks=3 teamcity-1569042580-17-n10cpu4:1-9 returned:
		stderr:
		
		stdout:
		teamcity-1569042580-17-n10cpu4: starting.: signal: killed

@jordanlewis
Copy link
Member

Same as above I think? I think this test is passing just needs to be taught not to kill itself at the end. Who owns this fella? @tbg @nvanbenschoten

@tbg
Copy link
Member

tbg commented Sep 23, 2019

Let's see

@tbg
Copy link
Member

tbg commented Sep 23, 2019

Doesn't seem as simple as that. The test is here

registerTPCCBenchSpec(r, tpccBenchSpec{
Nodes: 9,
CPUs: 4,
Chaos: true,
LoadConfig: singlePartitionedLoadgen,
LoadWarehouses: 2000,
EstimatedMax: 600,
})

and basically starts at 600 warehouses, line searching to figure out the max warehouse count that passes. From the log I see that it passes 600, 610, 630, 670 and then something goes wrong. It never missed the tpmc goal, so IMO it can't have been ready to quit (is that right @nvanbenschoten?)

Lots of random stuff happens in the logs so I'm not quite sure what's going on, but I think it's two things: first "something" goes wrong that stalls the test. Then, at the 12h mark, the machines get nuked and we see

ssh: connect to host 35.184.118.128 port 22: Connection refused

for various IPs (presumably the nodes in the cluster) and then the test terminating.

I'm going to have to take a closer look at what the "something" above is.

@tbg
Copy link
Member

tbg commented Sep 23, 2019

I think what's happening is that the chaos runner is restarting a node, and while that is happening, the test harness is stopping the cluster. I see in CHAOS.log:

CHAOS: 11:25:43 chaos.go:100: restarting :1 (chaos is done)
CHAOS: 11:25:48 chaos.go:62: chaos stopping: could not restart node :1: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1569042580-17-n10cpu4:1 returned:
stderr:

which interleaves with

11:25:44 test.go:181: test status: stopping cluster
11:25:44 cluster.go:315: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1569042580-17-n10cpu4:1-9

in the main log.

This does seem to check out with the code, where we use the outer monitor to launch the chaos agent, but then restart the cluster repeatedly while it's running:

// Restart the cluster before each iteration to help eliminate
// inter-trial interactions.
m.ExpectDeaths(int32(len(roachNodes)))
c.Stop(ctx, roachNodes)
c.Start(ctx, t, append(b.startOpts(), roachNodes)...)
time.Sleep(restartWait)
// Set up the load generation configuration.
rampDur := 5 * time.Minute
loadDur := 10 * time.Minute
loadDone := make(chan time.Time, numLoadGroups)
// If we're running chaos in this configuration, modify this config.
if b.Chaos {
// Increase the load generation duration.
loadDur = 10 * time.Minute
// Kill one node at a time.
ch := Chaos{
Timer: Periodic{Period: 90 * time.Second, DownTime: 5 * time.Second},
Target: roachNodes.randNode,
Stopper: loadDone,
}
m.Go(ch.Runner(c, m))

I started a run with today's master a while ago. This problem wouldn't be guaranteed to occur, but it seems like it's been occurring pretty frequently. I'll also whip up a fix (don't share the chaos agent across individual tpcc runs) and run that a few times.

solongordon added a commit to solongordon/cockroach that referenced this issue Sep 23, 2019
This reverts commit f56a83d. That fix
broke backward compatibility with v2.1, and it is no longer necessary
due to 65c5e37.

Refers cockroachdb#39005

Release note: None

Release justification: Non-production code change
@solongordon
Copy link
Contributor

Separate from the issue @tbg described, this was also failing against v2.1 because I used some partitioning syntax that wasn't supported there. Should be fixed by #40984.

craig bot pushed a commit that referenced this issue Sep 23, 2019
40984: workload: revert partition fix for tpcc import r=solongordon a=solongordon

This reverts commit f56a83d. That fix
broke backward compatibility with v2.1, and it is no longer necessary
due to 65c5e37.

Refers #39005

Release note: None

Release justification: Non-production code change

Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/6b14c0aa3ed1b4ba6d5f937e9352c5383afe1c37

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1501050&tab=artifacts#/tpccbench/nodes=9/cpu=4/chaos/partition

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190923-1501050/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	test_runner.go:704: test timed out (10h0m0s)

craig bot pushed a commit that referenced this issue Sep 24, 2019
40981: roachtest: deflake tpccbench chaos tests r=nvanbenschoten a=tbg

These tests were running a chaos agent across cluster restarts. Whenever
a cluster restart would overlap with the chaos agent restarting a node,
one of the two operations would fail and jam the test.

Fixes #39005.

I suggest https://github.com/cockroachdb/cockroach/pull/40981/files?w=1 for reviewing (i.e. ignore whitespace).

Release justification: de-flakes a roachtest without changes to the
release binary.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig craig bot closed this as completed in 8abea21 Sep 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
5 participants