roachtest: import/tpcc/warehouses=1000/nodes=32 failed #39072

Closed

cockroach-teamcity opened this issue Jul 24, 2019 · 31 comments
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

@cockroach-teamcity (Member)

SHA: https://github.com/cockroachdb/cockroach/commits/86eab2ff0a1a4c2d9b5f7e7a45deda74c98c6c37

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1402541&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190724-1402541/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1726,import.go:44,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563949032-23-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		 table (1000 rows, 0 index entries, took 44.694184085s, 0.00 MiB/s)
		I190724 11:30:12.695244 25 ccl/workloadccl/fixture.go:396  imported 1006 KiB in district table (10000 rows, 0 index entries, took 48.135399593s, 0.02 MiB/s)
		I190724 11:30:21.339813 30 ccl/workloadccl/fixture.go:396  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 56.778004479s, 0.14 MiB/s)
		I190724 11:31:56.028329 29 ccl/workloadccl/fixture.go:396  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 2m31.466814741s, 0.83 MiB/s)
		I190724 11:32:50.758743 27 ccl/workloadccl/fixture.go:396  imported 4.3 GiB in history table (30000000 rows, 60000000 index entries, took 3m26.198521036s, 21.23 MiB/s)
		I190724 11:32:51.256559 28 ccl/workloadccl/fixture.go:396  imported 1.3 GiB in order table (30000000 rows, 30000000 index entries, took 3m26.694688835s, 6.55 MiB/s)
		Error: importing fixture: importing table order_line: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,import.go:47,import.go:83,test_runner.go:691: Goexit() was called

cockroach-teamcity added this to the 19.2 milestone Jul 24, 2019
cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 24, 2019
@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/26edea51118a0e16b61748c08068bfa6f76543ca

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1404886&tab=buildLog

The test failed on branch=provisional_201907241708_v19.2.0-alpha.20190729, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190725-1404886/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1726,import.go:44,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564034590-24-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		x entries, took 43.300709149s, 0.18 MiB/s)
		I190725 10:50:24.138033 58 ccl/workloadccl/fixture.go:437  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 1m59.677267799s, 1.05 MiB/s)
		I190725 10:52:22.810949 56 ccl/workloadccl/fixture.go:437  imported 4.3 GiB in history table (30000000 rows, 60000000 index entries, took 3m58.350897267s, 18.36 MiB/s)
		I190725 10:53:03.780603 55 ccl/workloadccl/fixture.go:437  imported 17 GiB in customer table (30000000 rows, 30000000 index entries, took 4m39.320727724s, 63.21 MiB/s)
		I190725 10:54:18.941997 57 ccl/workloadccl/fixture.go:437  imported 1.3 GiB in order table (30000000 rows, 30000000 index entries, took 5m54.481433124s, 3.82 MiB/s)
		I190725 10:56:12.810728 60 ccl/workloadccl/fixture.go:437  imported 31 GiB in stock table (100000000 rows, 100000000 index entries, took 7m48.349636666s, 68.49 MiB/s)
		Error: importing fixture: importing table order_line: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,import.go:47,import.go:83,test_runner.go:691: Goexit() was called

@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/65055d6c16bf9386d8c4f4f9cd23e0a848814dc9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1411157&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190730-1411157/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1726,import.go:44,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564466961-27-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		0 rows, 0 index entries, took 54.00361011s, 0.02 MiB/s)
		I190730 11:53:55.753827 12 ccl/workloadccl/fixture.go:432  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 56.090317665s, 0.00 MiB/s)
		I190730 11:55:10.394012 22 ccl/workloadccl/fixture.go:432  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 2m10.727704007s, 0.96 MiB/s)
		I190730 11:57:08.219376 15 ccl/workloadccl/fixture.go:432  imported 4.3 GiB in history table (30000000 rows, 60000000 index entries, took 4m8.555192444s, 17.61 MiB/s)
		I190730 11:57:42.626085 14 ccl/workloadccl/fixture.go:432  imported 17 GiB in customer table (30000000 rows, 30000000 index entries, took 4m42.962273858s, 62.40 MiB/s)
		I190730 11:57:56.294668 21 ccl/workloadccl/fixture.go:432  imported 1.3 GiB in order table (30000000 rows, 30000000 index entries, took 4m56.629412478s, 4.57 MiB/s)
		Error: importing fixture: importing table order_line: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,import.go:47,import.go:83,test_runner.go:691: Goexit() was called

@nvanbenschoten (Member)

Similar symptoms to those in #39071. @dt, assigning to you for triage.

@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/40f8f0eb00f4b3bf5bac11fb5ae132e33a492713

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1452154&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190824-1452154/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1735,import.go:44,cluster.go:2081,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1566627477-23-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190824 10:31:51.983968 1 ccl/workloadccl/fixture.go:316  starting import of 9 tables
		I190824 10:31:55.773969 11 ccl/workloadccl/fixture.go:432  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 3.776206635s, 0.01 MiB/s)
		I190824 10:31:56.789404 66 ccl/workloadccl/fixture.go:432  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 4.790003703s, 1.62 MiB/s)
		: signal: killed

@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/d51fa78ff90a113c9009d263dfaf58d3672670a6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1463583&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190902-1463583/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1735,import.go:44,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567405952-75-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190902 11:04:58.410696 1 ccl/workloadccl/fixture.go:316  starting import of 9 tables
		I190902 11:05:01.232602 38 ccl/workloadccl/fixture.go:432  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 2.809280405s, 0.02 MiB/s)
		I190902 11:05:01.737323 39 ccl/workloadccl/fixture.go:432  imported 1006 KiB in district table (10000 rows, 0 index entries, took 3.313903014s, 0.30 MiB/s)
		I190902 11:05:27.429719 44 ccl/workloadccl/fixture.go:432  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 29.004769843s, 0.27 MiB/s)
		: signal: killed

@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/c3f447650742706c52354f629b30bb64e5a6959c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1468445&tab=buildLog

The test failed on branch=provisional_201909041806_v19.2.0-beta.20190910, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190905-1468445/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1735,import.go:44,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567652906-38-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190905 04:34:48.992160 1 ccl/workloadccl/fixture.go:316  starting import of 9 tables
		I190905 04:34:54.979431 24 ccl/workloadccl/fixture.go:432  imported 1006 KiB in district table (10000 rows, 0 index entries, took 5.974582141s, 0.16 MiB/s)
		I190905 04:34:55.488419 29 ccl/workloadccl/fixture.go:432  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 6.479141462s, 1.20 MiB/s)
		I190905 04:35:12.859279 23 ccl/workloadccl/fixture.go:432  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 23.854670279s, 0.00 MiB/s)
		: signal: killed

@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/4e1deb21a3e28387c45e6a87a3cefe6bb49f39ac

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1473326&tab=buildLog

The test failed on branch=provisional_201909070026_v19.2.0-beta.20190911, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190907-1473326/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1735,import.go:44,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567827321-35-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		xture.go:433  imported 1.3 GiB in order table (30000000 rows, 30000000 index entries, took 3m57.980516323s, 5.69 MiB/s)
		I190907 05:04:57.890527 41 ccl/workloadccl/fixture.go:433  imported 17 GiB in customer table (30000000 rows, 30000000 index entries, took 4m20.833337161s, 67.69 MiB/s)
		I190907 05:07:05.412153 16 ccl/workloadccl/fixture.go:433  imported 23 GiB in order_line table (300011520 rows, 300011520 index entries, took 6m28.351265105s, 60.17 MiB/s)
		Error: importing fixture: importing table stock: pq: descriptor changed: [expected] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239] != [actual] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239]: unexpected value: raw_bytes:"\363P\244T\003\010\203\t\022\002\301\212\032\001\302\"\006\010\020\020\020\030\001\"\006\010\010\020\010\030\005\"\006\010\036\020\036\030\013(\r0\357\001:\n\010\234\366\273\371\232\322\204\341\025@\001" timestamp:<wall_time:1567832452567360492 logical:1 > 
		Error:  exit status 1
		: exit status 1

@jordanlewis (Member)

		Error: importing fixture: importing table stock: pq: descriptor changed: [expected] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239] != [actual] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239]: unexpected value: raw_bytes:"\363P\244T\003\010\203\t\022\002\301\212\032\001\302\"\006\010\020\020\020\030\001\"\006\010\010\020\010\030\005\"\006\010\036\020\036\030\013(\r0\357\001:\n\010\234\366\273\371\232\322\204\341\025@\001" timestamp:<wall_time:1567832452567360492 logical:1 > 

Could this be a release blocker? Isn't this an import into an empty cluster failing?

cc @dt

@andy-kimball (Contributor)

I wonder whether it's the same thing: this issue and #39071 seem to time out, whereas this newest failure is the descriptor changed error.

@dt (Member) commented Sep 7, 2019

The descriptor changed error looks new and seems to come from core (I see it mentioned in learner tests?).

dt assigned tbg and unassigned dt Sep 7, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 10, 2019
In cockroachdb#39072 we see:

```
Error: importing fixture: importing table stock: pq: descriptor changed: [expected] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239] != [actual] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239]: unexpected value: raw_bytes:"\363P\244T\003\010\203\t\022\002\301\212\032\001\302\"\006\010\020\020\020\030\001\"\006\010\010\020\010\030\005\"\006\010\036\020\036\030\013(\r0\357\001:\n\010\234\366\273\371\232\322\204\341\025@\001" timestamp:<wall_time:1567832452567360492 logical:1 >
```

This is frustrating because those stringified values look identical to me.
Hopefully this debugging will reveal something in any future repros.

Release note: None
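
One plausible way two descriptors can print identically yet still fail an equality check is if String() omits whichever field actually differs (such as a sticky-bit expiration). A self-contained, hypothetical Go illustration of that effect; none of these types are CockroachDB's:

```go
// Hypothetical types for illustration only; not CockroachDB's descriptors.
package main

import "fmt"

type rangeDesc struct {
	RangeID    int
	Generation int
	StickyBit  int64 // deliberately omitted from String()
}

func (d rangeDesc) String() string {
	return fmt.Sprintf("r%d gen=%d", d.RangeID, d.Generation)
}

func main() {
	expected := rangeDesc{RangeID: 1155, Generation: 239}
	actual := rangeDesc{RangeID: 1155, Generation: 239, StickyBit: 1567832452}

	fmt.Println(expected.String() == actual.String()) // true: the strings match
	fmt.Println(expected == actual)                   // false: the structs differ
}
```
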
@ajwerner (Contributor)

Is this possibly #37499, with the InternalReplicas not matching? That seems like it could be a problem. @danhhz, what do you think?

The line in question is:

desc.RangeID == actualDesc.RangeID && !desc.Equal(actualDesc) {
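
For context, a minimal, generic sketch of the shape of that check, using made-up types rather than the actual roachpb/storage code: a conditional update of the range descriptor fails, and the error distinguishes "same range, but the descriptor changed underneath us" from "range no longer exists":

```go
// Hypothetical sketch of a descriptor-changed check; not CockroachDB code.
package main

import (
	"errors"
	"fmt"
	"reflect"
)

type rangeDesc struct {
	RangeID  int
	Replicas []int
	Gen      int
}

// checkDescriptorChanged mirrors the shape of the quoted condition: same
// RangeID but unequal contents means another actor (split, merge, replica
// change) updated the descriptor concurrently.
func checkDescriptorChanged(expected, actual rangeDesc) error {
	if expected.RangeID == actual.RangeID && !reflect.DeepEqual(expected, actual) {
		return fmt.Errorf("descriptor changed: [expected] %+v != [actual] %+v", expected, actual)
	}
	if expected.RangeID != actual.RangeID {
		return errors.New("descriptor changed: range no longer exists with this ID")
	}
	return nil
}

func main() {
	exp := rangeDesc{RangeID: 618, Replicas: []int{14, 9, 5}, Gen: 146}
	act := rangeDesc{RangeID: 618, Replicas: []int{14, 9, 5, 1}, Gen: 147}
	fmt.Println(checkDescriptorChanged(exp, act))
}
```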

@ajwerner (Contributor)

Err, more likely this is a sticky bit thing.

@ajwerner (Contributor)

There are two classes of failures here. One is too many retries in AdminSplit; some of that comes from the setting of the sticky bit (I think). The others (where the error is signal: killed) are consistency checker failures, which I suspect may be related to #38772 and which reproduce readily.

@ajwerner (Contributor)

After looking a little deeper into the killed failures above, those seem like #40470.

The new failures I'm seeing seem to involve joint consensus together with splits and merges.

Both appear to involve the error caused by the workaround to #40333 in #40363:

ajwerner-1568140936-02-n32cpu4-0006> E190910 18:45:47.364461 13319 storage/replica_raft.go:304  [n6,s6,r980/2:/Table/58/{1/906/3…-2}] received invalid ChangeReplicasTrigger ENTER_JOINT ADD_REPLICA[(n14,s14):5VOTER_INCOMING], REMOVE_REPLICA[(n6,s6):2VOTER_OUTGOING]: after=[(n23,s23):1 (n6,s6):2VOTER_OUTGOING (n8,s8):4 (n14,s14):5VOTER_INCOMING] next=6 to remove self (leaseholder)
ajwerner-1568140936-02-n32cpu4-0023> I190910 18:45:47.372336 9132 storage/replica_command.go:1066  [n23,s23,r980/1:/Table/58/{1/906/3…-2}] could not promote [n14,s14] to voter, rolling back: change replicas of r980 failed: storage/replica_raft.go:305: [n6,s6,r980/2:/Table/58/{1/906/3…-2}]: received invalid ChangeReplicasTrigger ENTER_JOINT ADD_REPLICA[(n14,s14):5VOTER_INCOMING], REMOVE_REPLICA[(n6,s6):2VOTER_OUTGOING]: after=[(n23,s23):1 (n6,s6):2VOTER_OUTGOING (n8,s8):4 (n14,s14):5VOTER_INCOMING] next=6 to remove self (leaseholder)   

consistency_failure.txt

ajwerner-1568140936-03-n32cpu4-0023> E190910 18:48:33.085188 14838 storage/replica_raft.go:304  [n23,s23,r2850/1:/Table/59/2/1{56/3/"…-87/6/"…}] received invalid ChangeReplicasTrigger ENTER_JOINT ADD_REPLICA[(n3,s3):4VOTER_INCOMING], REMOVE_REPLICA[(n23,s23):1VOTER_OUTGOING]: after=[(n23,s23):1VOTER_OUTGOING (n9,s9):2 (n5,s5):3 (n3,s3):4VOTER_INCOMING] next=5 to remove self (leaseholder)
ajwerner-1568140936-03-n32cpu4-0023> I190910 18:48:33.109304 14838 storage/replica_command.go:1066  [n23,s23,r2850/1:/Table/59/2/1{56/3/"…-87/6/"…}] could not promote [n3,s3] to voter, rolling back: change replicas of r2850 failed: storage/replica_raft.go:305: [n23,s23,r2850/1:/Table/59/2/1{56/3/"…-87/6/"…}]: received invalid ChangeReplicasTrigger ENTER_JOINT ADD_REPLICA[(n3,s3):4VOTER_INCOMING], REMOVE_REPLICA[(n23,s23):1VOTER_OUTGOING]: after=[(n23,s23):1VOTER_OUTGOING (n9,s9):2 (n5,s5):3 (n3,s3):4VOTER_INCOMING] next=5 to remove self (leaseholder) 

consistency_failure2.txt

I'm trying to get some more repros now.

@nvanbenschoten (Member)

It's possible that the sticky bit issue is resolved by a533375, which was merged yesterday. I think this was broken by one of the atomic replica change refactors before being resolved in that commit.

@ajwerner (Contributor)

One thing to note is that these were produced on a branch designed to fail this test due to the race. It does not internally retry AdminSplit as much. It's possible that the error from AdminSplit is actually what introduces the hazard. The build was this diff off of 3ad8bda.

https://gist.github.com/ajwerner/9e98a6ed5285408b57275de60faba872

@ajwerner (Contributor)

It's possible that the sticky bit issue is resolved by a533375, which was merged yesterday. I think this was broken by one of the atomic replica change refactors before being resolved in that commit.

Does that change matter in clusters that are entirely on 19.2? I think it's just that each processor sets the sticky bit to a different time:

expirationTime = cp.flowCtx.Cfg.DB.Clock().Now().Add(time.Hour.Nanoseconds(), 0)

Also, there's a ton of concurrent data movement going on, and nowadays, with learners and atomic changes, we update the replica descriptor a lot, so it doesn't seem that crazy to me that we'd need to retry 10 times.

-               expirationTime = cp.flowCtx.Cfg.DB.Clock().Now().Add(time.Hour.Nanoseconds(), 0)
+               expirationTime = hlc.Timestamp{WallTime: cp.spec.WalltimeNanos + time.Hour.Nanoseconds()}

I hope that will make it somewhat better. I'm also open to increasing that 10 to something higher.
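
To make the reasoning concrete, here is a self-contained sketch (hypothetical function names, not the import processor code) of why deriving the expiration from a shared spec walltime helps: expirations read from each caller's own clock all differ, so every caller rewrites the sticky bit, while expirations derived from a shared walltime are identical:

```go
// Hypothetical illustration of per-caller vs. shared sticky-bit expirations.
package main

import (
	"fmt"
	"time"
)

func perCallerExpiration() int64 {
	// Each caller reads its own clock: values differ from call to call.
	return time.Now().UnixNano() + time.Hour.Nanoseconds()
}

func sharedExpiration(specWalltimeNanos int64) int64 {
	// Every caller derives the same value from the shared spec walltime.
	return specWalltimeNanos + time.Hour.Nanoseconds()
}

func main() {
	spec := time.Now().UnixNano()

	a, b := perCallerExpiration(), perCallerExpiration()
	fmt.Println("per-caller expirations equal:", a == b) // almost always false

	c, d := sharedExpiration(spec), sharedExpiration(spec)
	fmt.Println("shared expirations equal:", c == d) // true
}
```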

@ajwerner (Contributor)

It's not clear that this repros as readily on vanilla master. There are two interesting things in that diff: the scatter retries on condition failed and the number of retries in the split command. I'm trying to isolate which is the culprit. My suspicion is that the retries are masking a correctness bug.

@ajwerner (Contributor)

I did hit:

                Error: importing fixture: importing table history: pq: descriptor changed: [expected] r618:/Table/5{3/3-4} [(n14,s14):5, (n9,s9):22, (n5,s5):3, (n1,s1):23LEARNER, next=24, gen=146] != [actual] r618:/Table/5{3/3-4} [(n14,s14):5, (n9,s9):22VOTER_OUTGOING, (n5,s5):3, (n1,s1):23VOTER_INCOMING, next=24, gen=147]: unexpected value: raw_bytes:"7\211T:\003\010\352\004\022\002\275\213\032\001\276\"\006\010\016\020\016\030\005\"\010\010\t\020\t\030\026 \003\"\006\010\005\020\005\030\003\"\010\010\001\020\001\030\027 \002(\0300\223\001:\n\010\201\355\202\244\362\350\313\341\025@\001" timestamp:<wall_time:1568145490187302019 >

on master

@ajwerner (Contributor) commented Sep 10, 2019

Oof, so the consistency failure repro'd with the following patch:

diff --git a/pkg/storage/replica_command.go b/pkg/storage/replica_command.go
index 64311f1ede..24cece7291 100644
--- a/pkg/storage/replica_command.go
+++ b/pkg/storage/replica_command.go
@@ -516,7 +516,7 @@ func (r *Replica) executeAdminCommandWithDescriptor(
        ctx context.Context, updateDesc func(*roachpb.RangeDescriptor) error,
 ) *roachpb.Error {
        retryOpts := base.DefaultRetryOptions()
-       retryOpts.MaxRetries = 10
+       retryOpts.MaxRetries = 2
        var lastErr error
        for retryable := retry.StartWithCtx(ctx, retryOpts); retryable.Next(); {
                // The replica may have been destroyed since the start of the retry loop.

off of 3ad8bda

The cluster is still alive: ajwerner-1568147181-02-n32cpu4

EDIT: that one was with MaxRetries = 2; I rebuilt with 1, thinking it wasn't going to repro.
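
For readers following along, a generic, self-contained sketch of the retry pattern that MaxRetries bounds in the patch above; this illustrates the compare-and-swap-with-retries idea, not the actual executeAdminCommandWithDescriptor code:

```go
// Generic sketch of retry-on-conflict with a bounded retry count.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errDescriptorChanged = errors.New("descriptor changed")

// store simulates a value that other actors may modify concurrently.
var store atomic.Int64

// compareAndSwap models a conditional put that fails if the value moved.
func compareAndSwap(expected, next int64) error {
	if !store.CompareAndSwap(expected, next) {
		return errDescriptorChanged
	}
	return nil
}

func adminCommandWithRetries(maxRetries int, update func(cur int64) int64) error {
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		cur := store.Load()
		lastErr = compareAndSwap(cur, update(cur))
		if lastErr == nil {
			return nil
		}
		// Someone else (split, merge, replica change) updated it; retry.
	}
	return fmt.Errorf("too many retries: %w", lastErr)
}

func main() {
	store.Store(1)
	err := adminCommandWithRetries(10, func(cur int64) int64 { return cur + 1 })
	fmt.Println("result:", err, "value:", store.Load())
}
```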

@ajwerner (Contributor)

Internal data link for the test failure above: https://drive.google.com/file/d/1i9YsjvWXtc0b6D627-tRvLwkkKygSLtJ/view?usp=sharing

@ajwerner (Contributor)

Okay, 5 out of 5 of these seem to follow soon after:

ajwerner-1568147181-02-n32cpu4-0025> I190910 20:30:23.410751 10465 storage/replica_command.go:1051  [n25,s25,r1670/3:/Table/54/2/{875/1/…-906/3/…}] could not promote [n19,s19] to voter, rolling back: change replicas of r1670 failed: storage/replica_raft.go:305: [n25,s25,r1670/3:/Table/54/2/{875/1/…-906/3/…}]: received invalid ChangeReplicasTrigger ENTER_JOINT ADD_REPLICA[(n19,s19):4VOTER_INCOMING], REMOVE_REPLICA[(n25,s25):3VOTER_OUTGOING]: after=[(n31,s31):1 (n12,s12):2 (n25,s25):3VOTER_OUTGOING (n19,s19):4VOTER_INCOMING] next=5 to remove self (leaseholder)    

@ajwerner (Contributor)

And now, of course, repro #6 doesn't. Attached are the files from another repro. Given how easy it is to repro, I'm going to destroy these clusters. It's pretty scary that a retry loop makes this okay.

Internal: https://drive.google.com/file/d/16aDQF21j37m3YRyjwCiiauPHBug8Ob5a/view?usp=sharing

@ajwerner (Contributor)

@andreimatei, I'm passing this one over, at least for the consistency failures. We can talk tomorrow.

@nvanbenschoten (Member)

I have been able to reproduce the replica inconsistency crash without atomic replica reconfiguration enabled, so it doesn't look like this is related to that change.

@nvanbenschoten (Member)

On a hunch, I tried reverting #39709 to see if we were failing to fully replicate a range during snapshots. That did not fix the issue.

@thoszhang (Contributor)

@pbardea has #40676 out, which will get rid of the repeated splits during IMPORT. I'm going to try to reproduce the issue and verify that #40676 does fix it.

@thoszhang (Contributor)

So far I've run this test with #40676 (to fix the extra splits during IMPORT) 5 times with retryOpts.MaxRetries = 2, and it has failed every time with descriptor changed. I did try running the test on master with retryOpts.MaxRetries = 2 once and got 3/7 imports failing, instead of just 1/7 with #40676, which suggests that #40676 is helping, but it's hard to say.

I can try running with max retries set to 10, but do we have a principled way of setting the max retries? Is it fine to just have a very high probability of not failing due to this issue, or do we expect to never get it at all?

Internal link to files for one of the runs: https://drive.google.com/open?id=1xQ6ER_LUzbSdqAcKz3ZA9pbaxoTBrgNv

(cc @pbardea)

@nvanbenschoten (Member)

Is it fine to just have a very high probability of not failing due to this issue, or do we expect to never get it at all?

I think a very high probability of not failing is ok. @ajwerner and I talked last week about introducing some kind of best-effort semaphore on leaseholders to reduce contention on the descriptor, but I think fundamentally we'll never be able to place a hard upper bound on the number of retries that this operation requires.
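
A minimal sketch of the best-effort semaphore idea, assuming nothing about the eventual design: callers try to grab a slot before touching the descriptor, but proceed anyway if none is free, so correctness still rests on the retry loop:

```go
// Hypothetical best-effort (non-blocking) semaphore; not a CockroachDB API.
package main

import "fmt"

type bestEffortSemaphore chan struct{}

func newBestEffortSemaphore(slots int) bestEffortSemaphore {
	return make(bestEffortSemaphore, slots)
}

// tryAcquire never blocks; it reports whether a slot was obtained.
func (s bestEffortSemaphore) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

func (s bestEffortSemaphore) release() { <-s }

func main() {
	sem := newBestEffortSemaphore(1)
	if sem.tryAcquire() {
		defer sem.release()
		fmt.Println("got a slot; descriptor contention should be low")
	} else {
		fmt.Println("no slot; proceeding anyway and relying on retries")
	}
}
```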

@thoszhang (Contributor)

Okay, I ran the roachtest 20 times with #40676 (and retries set to 10) and didn't get any failures. I think that's sufficient evidence that #40676 helps and we should merge it.

craig bot pushed a commit that referenced this issue Sep 17, 2019
40676: sql: stop presplitting spans on every import proc r=pbardea a=pbardea

We are currently presplitting spans (on every table and index boundary)
for every processor. We should only be doing this once, so this PR moves
it to the planning stage.

Addresses #39072.

Release note: None

40838: makefile: increase test timeout r=andreimatei a=andreimatei

This patch increases the test timeout for every package from 12m to 20m
(and under stress from 25m to 30m).
This is motivated by the SQL TestLogic suite, which has been inching towards
the 12m limit recently (according to TeamCity history) and is sometimes timing
out. Hopefully it's because we've been adding more tests...

Fixes #40572

Release note: None

Release justification: fix timeout flakes

Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
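
As background on the 40676 description above, a hypothetical sketch (not the actual PR) of the difference between issuing the splits in every processor and issuing them once at planning time:

```go
// Hypothetical planner/processor shapes; not the actual distsql import code.
package main

import "fmt"

// adminSplit stands in for issuing a split at a table/index boundary.
func adminSplit(key string) { fmt.Println("admin split at", key) }

// before: every processor re-issues the same splits, multiplying the
// descriptor traffic by the number of processors.
func runProcessorsBefore(numProcs int, boundaries []string) {
	for p := 0; p < numProcs; p++ {
		for _, b := range boundaries {
			adminSplit(b) // numProcs * len(boundaries) split requests
		}
	}
}

// after: the planner splits once, then processors just import their data.
func planAndRunAfter(numProcs int, boundaries []string) {
	for _, b := range boundaries {
		adminSplit(b) // len(boundaries) split requests in total
	}
	for p := 0; p < numProcs; p++ {
		// processors import without touching the range descriptors
	}
}

func main() {
	boundaries := []string{"/Table/57/1", "/Table/57/2", "/Table/58/1"}
	runProcessorsBefore(2, boundaries)
	fmt.Println("---")
	planAndRunAfter(2, boundaries)
}
```
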
@pbardea (Contributor) commented Sep 18, 2019

Closed by #40676.

pbardea closed this as completed Sep 18, 2019