roachtest: import/tpcc/warehouses=1000/nodes=32 failed #39072

Closed

cockroach-teamcity opened this issue Jul 24, 2019 · 31 comments
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

@cockroach-teamcity (Member)

SHA: https://github.com/cockroachdb/cockroach/commits/86eab2ff0a1a4c2d9b5f7e7a45deda74c98c6c37

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1402541&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190724-1402541/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1726,import.go:44,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563949032-23-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		 table (1000 rows, 0 index entries, took 44.694184085s, 0.00 MiB/s)
		I190724 11:30:12.695244 25 ccl/workloadccl/fixture.go:396  imported 1006 KiB in district table (10000 rows, 0 index entries, took 48.135399593s, 0.02 MiB/s)
		I190724 11:30:21.339813 30 ccl/workloadccl/fixture.go:396  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 56.778004479s, 0.14 MiB/s)
		I190724 11:31:56.028329 29 ccl/workloadccl/fixture.go:396  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 2m31.466814741s, 0.83 MiB/s)
		I190724 11:32:50.758743 27 ccl/workloadccl/fixture.go:396  imported 4.3 GiB in history table (30000000 rows, 60000000 index entries, took 3m26.198521036s, 21.23 MiB/s)
		I190724 11:32:51.256559 28 ccl/workloadccl/fixture.go:396  imported 1.3 GiB in order table (30000000 rows, 30000000 index entries, took 3m26.694688835s, 6.55 MiB/s)
		Error: importing fixture: importing table order_line: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,import.go:47,import.go:83,test_runner.go:691: Goexit() was called

cockroach-teamcity added this to the 19.2 milestone Jul 24, 2019
cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 24, 2019
@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/26edea51118a0e16b61748c08068bfa6f76543ca

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1404886&tab=buildLog

The test failed on branch=provisional_201907241708_v19.2.0-alpha.20190729, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190725-1404886/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1726,import.go:44,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564034590-24-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		x entries, took 43.300709149s, 0.18 MiB/s)
		I190725 10:50:24.138033 58 ccl/workloadccl/fixture.go:437  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 1m59.677267799s, 1.05 MiB/s)
		I190725 10:52:22.810949 56 ccl/workloadccl/fixture.go:437  imported 4.3 GiB in history table (30000000 rows, 60000000 index entries, took 3m58.350897267s, 18.36 MiB/s)
		I190725 10:53:03.780603 55 ccl/workloadccl/fixture.go:437  imported 17 GiB in customer table (30000000 rows, 30000000 index entries, took 4m39.320727724s, 63.21 MiB/s)
		I190725 10:54:18.941997 57 ccl/workloadccl/fixture.go:437  imported 1.3 GiB in order table (30000000 rows, 30000000 index entries, took 5m54.481433124s, 3.82 MiB/s)
		I190725 10:56:12.810728 60 ccl/workloadccl/fixture.go:437  imported 31 GiB in stock table (100000000 rows, 100000000 index entries, took 7m48.349636666s, 68.49 MiB/s)
		Error: importing fixture: importing table order_line: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,import.go:47,import.go:83,test_runner.go:691: Goexit() was called

@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/65055d6c16bf9386d8c4f4f9cd23e0a848814dc9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1411157&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190730-1411157/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1726,import.go:44,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564466961-27-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		0 rows, 0 index entries, took 54.00361011s, 0.02 MiB/s)
		I190730 11:53:55.753827 12 ccl/workloadccl/fixture.go:432  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 56.090317665s, 0.00 MiB/s)
		I190730 11:55:10.394012 22 ccl/workloadccl/fixture.go:432  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 2m10.727704007s, 0.96 MiB/s)
		I190730 11:57:08.219376 15 ccl/workloadccl/fixture.go:432  imported 4.3 GiB in history table (30000000 rows, 60000000 index entries, took 4m8.555192444s, 17.61 MiB/s)
		I190730 11:57:42.626085 14 ccl/workloadccl/fixture.go:432  imported 17 GiB in customer table (30000000 rows, 30000000 index entries, took 4m42.962273858s, 62.40 MiB/s)
		I190730 11:57:56.294668 21 ccl/workloadccl/fixture.go:432  imported 1.3 GiB in order table (30000000 rows, 30000000 index entries, took 4m56.629412478s, 4.57 MiB/s)
		Error: importing fixture: importing table order_line: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,import.go:47,import.go:83,test_runner.go:691: Goexit() was called

@nvanbenschoten (Member)

Similar symptoms to those in #39071. @dt, assigning to you for triage.

@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/40f8f0eb00f4b3bf5bac11fb5ae132e33a492713

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1452154&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190824-1452154/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1735,import.go:44,cluster.go:2081,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1566627477-23-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190824 10:31:51.983968 1 ccl/workloadccl/fixture.go:316  starting import of 9 tables
		I190824 10:31:55.773969 11 ccl/workloadccl/fixture.go:432  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 3.776206635s, 0.01 MiB/s)
		I190824 10:31:56.789404 66 ccl/workloadccl/fixture.go:432  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 4.790003703s, 1.62 MiB/s)
		: signal: killed

@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/d51fa78ff90a113c9009d263dfaf58d3672670a6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1463583&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190902-1463583/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1735,import.go:44,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567405952-75-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190902 11:04:58.410696 1 ccl/workloadccl/fixture.go:316  starting import of 9 tables
		I190902 11:05:01.232602 38 ccl/workloadccl/fixture.go:432  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 2.809280405s, 0.02 MiB/s)
		I190902 11:05:01.737323 39 ccl/workloadccl/fixture.go:432  imported 1006 KiB in district table (10000 rows, 0 index entries, took 3.313903014s, 0.30 MiB/s)
		I190902 11:05:27.429719 44 ccl/workloadccl/fixture.go:432  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 29.004769843s, 0.27 MiB/s)
		: signal: killed

@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/c3f447650742706c52354f629b30bb64e5a6959c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1468445&tab=buildLog

The test failed on branch=provisional_201909041806_v19.2.0-beta.20190910, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190905-1468445/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1735,import.go:44,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567652906-38-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190905 04:34:48.992160 1 ccl/workloadccl/fixture.go:316  starting import of 9 tables
		I190905 04:34:54.979431 24 ccl/workloadccl/fixture.go:432  imported 1006 KiB in district table (10000 rows, 0 index entries, took 5.974582141s, 0.16 MiB/s)
		I190905 04:34:55.488419 29 ccl/workloadccl/fixture.go:432  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 6.479141462s, 1.20 MiB/s)
		I190905 04:35:12.859279 23 ccl/workloadccl/fixture.go:432  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 23.854670279s, 0.00 MiB/s)
		: signal: killed

@cockroach-teamcity (Member, Author)

SHA: https://github.com/cockroachdb/cockroach/commits/4e1deb21a3e28387c45e6a87a3cefe6bb49f39ac

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1473326&tab=buildLog

The test failed on branch=provisional_201909070026_v19.2.0-beta.20190911, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190907-1473326/import/tpcc/warehouses=1000/nodes=32/run_1
	cluster.go:1735,import.go:44,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567827321-35-n32cpu4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		xture.go:433  imported 1.3 GiB in order table (30000000 rows, 30000000 index entries, took 3m57.980516323s, 5.69 MiB/s)
		I190907 05:04:57.890527 41 ccl/workloadccl/fixture.go:433  imported 17 GiB in customer table (30000000 rows, 30000000 index entries, took 4m20.833337161s, 67.69 MiB/s)
		I190907 05:07:05.412153 16 ccl/workloadccl/fixture.go:433  imported 23 GiB in order_line table (300011520 rows, 300011520 index entries, took 6m28.351265105s, 60.17 MiB/s)
		Error: importing fixture: importing table stock: pq: descriptor changed: [expected] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239] != [actual] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239]: unexpected value: raw_bytes:"\363P\244T\003\010\203\t\022\002\301\212\032\001\302\"\006\010\020\020\020\030\001\"\006\010\010\020\010\030\005\"\006\010\036\020\036\030\013(\r0\357\001:\n\010\234\366\273\371\232\322\204\341\025@\001" timestamp:<wall_time:1567832452567360492 logical:1 > 
		Error:  exit status 1
		: exit status 1

@jordanlewis (Member)

		Error: importing fixture: importing table stock: pq: descriptor changed: [expected] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239] != [actual] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239]: unexpected value: raw_bytes:"\363P\244T\003\010\203\t\022\002\301\212\032\001\302\"\006\010\020\020\020\030\001\"\006\010\010\020\010\030\005\"\006\010\036\020\036\030\013(\r0\357\001:\n\010\234\366\273\371\232\322\204\341\025@\001" timestamp:<wall_time:1567832452567360492 logical:1 > 

Could this be a release blocker? Isn't this an import into an empty cluster failing?

cc @dt

@andy-kimball (Contributor)

I wonder whether it's the same thing: this issue and #39071 seem to time out, whereas this newest failure is the descriptor changed error.

@dt (Member) commented Sep 7, 2019

The descriptor changed error looks new and seems to come from core (I see it mentioned in learner tests?).

dt assigned tbg and unassigned dt Sep 7, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 10, 2019
In cockroachdb#39072 we see:

```
Error: importing fixture: importing table stock: pq: descriptor changed: [expected] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239] != [actual] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239]: unexpected value: raw_bytes:"\363P\244T\003\010\203\t\022\002\301\212\032\001\302\"\006\010\020\020\020\030\001\"\006\010\010\020\010\030\005\"\006\010\036\020\036\030\013(\r0\357\001:\n\010\234\366\273\371\232\322\204\341\025@\001" timestamp:<wall_time:1567832452567360492 logical:1 >
```

This is frustrating because those stringified values look identical to me.
Hopefully this debugging will reveal something in any future repros.

Release note: None
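
One plausible way two descriptors can print identically yet still fail an equality check is if String() omits whichever field actually differs (such as a sticky-bit expiration). A self-contained, hypothetical Go illustration of that effect; none of these types are CockroachDB's:

```go
// Hypothetical types for illustration only; not CockroachDB's descriptors.
package main

import "fmt"

type rangeDesc struct {
	RangeID    int
	Generation int
	StickyBit  int64 // deliberately omitted from String()
}

func (d rangeDesc) String() string {
	return fmt.Sprintf("r%d gen=%d", d.RangeID, d.Generation)
}

func main() {
	expected := rangeDesc{RangeID: 1155, Generation: 239}
	actual := rangeDesc{RangeID: 1155, Generation: 239, StickyBit: 1567832452}

	fmt.Println(expected.String() == actual.String()) // true: the strings match
	fmt.Println(expected == actual)                   // false: the structs differ
}
```
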
@ajwerner (Contributor)

Is this possibly #37499, with the InternalReplicas not matching? That seems like it could be a problem. @danhhz, what do you think?

The line in question is:

desc.RangeID == actualDesc.RangeID && !desc.Equal(actualDesc) {
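
For context, a minimal, generic sketch of the shape of that check, using made-up types rather than the actual roachpb/storage code: a conditional update of the range descriptor fails, and the error distinguishes "same range, but the descriptor changed underneath us" from "range no longer exists":

```go
// Hypothetical sketch of a descriptor-changed check; not CockroachDB code.
package main

import (
	"errors"
	"fmt"
	"reflect"
)

type rangeDesc struct {
	RangeID  int
	Replicas []int
	Gen      int
}

// checkDescriptorChanged mirrors the shape of the quoted condition: same
// RangeID but unequal contents means another actor (split, merge, replica
// change) updated the descriptor concurrently.
func checkDescriptorChanged(expected, actual rangeDesc) error {
	if expected.RangeID == actual.RangeID && !reflect.DeepEqual(expected, actual) {
		return fmt.Errorf("descriptor changed: [expected] %+v != [actual] %+v", expected, actual)
	}
	if expected.RangeID != actual.RangeID {
		return errors.New("descriptor changed: range no longer exists with this ID")
	}
	return nil
}

func main() {
	exp := rangeDesc{RangeID: 618, Replicas: []int{14, 9, 5}, Gen: 146}
	act := rangeDesc{RangeID: 618, Replicas: []int{14, 9, 5, 1}, Gen: 147}
	fmt.Println(checkDescriptorChanged(exp, act))
}
```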

@ajwerner (Contributor)

Err, more likely this is a sticky bit thing.

@ajwerner (Contributor)

There are two classes of failures here. One is too many retries in AdminSplit; some of that comes from the setting of the sticky bit (I think). The others (where the error is signal: killed) are consistency checker failures, which I suspect may be related to #38772 and which reproduce readily.

@ajwerner (Contributor)

After looking a little deeper into the killed failures above, those seem like #40470.

The new failures I'm seeing seem to involve joint consensus together with splits and merges.

Both appear to involve the error caused by the workaround to #40333 in #40363:

ajwerner-1568140936-02-n32cpu4-0006> E190910 18:45:47.364461 13319 storage/replica_raft.go:304  [n6,s6,r980/2:/Table/58/{1/906/3…-2}] received invalid ChangeReplicasTrigger ENTER_JOINT ADD_REPLICA[(n14,s14):5VOTER_INCOMING], REMOVE_REPLICA[(n6,s6):2VOTER_OUTGOING]: after=[(n23,s23):1 (n6,s6):2VOTER_OUTGOING (n8,s8):4 (n14,s14):5VOTER_INCOMING] next=6 to remove self (leaseholder)
ajwerner-1568140936-02-n32cpu4-0023> I190910 18:45:47.372336 9132 storage/replica_command.go:1066  [n23,s23,r980/1:/Table/58/{1/906/3…-2}] could not promote [n14,s14] to voter, rolling back: change replicas of r980 failed: storage/replica_raft.go:305: [n6,s6,r980/2:/Table/58/{1/906/3…-2}]: received invalid ChangeReplicasTrigger ENTER_JOINT ADD_REPLICA[(n14,s14):5VOTER_INCOMING], REMOVE_REPLICA[(n6,s6):2VOTER_OUTGOING]: after=[(n23,s23):1 (n6,s6):2VOTER_OUTGOING (n8,s8):4 (n14,s14):5VOTER_INCOMING] next=6 to remove self (leaseholder)   

consistency_failure.txt

ajwerner-1568140936-03-n32cpu4-0023> E190910 18:48:33.085188 14838 storage/replica_raft.go:304  [n23,s23,r2850/1:/Table/59/2/1{56/3/"…-87/6/"…}] received invalid ChangeReplicasTrigger ENTER_JOINT ADD_REPLICA[(n3,s3):4VOTER_INCOMING], REMOVE_REPLICA[(n23,s23):1VOTER_OUTGOING]: after=[(n23,s23):1VOTER_OUTGOING (n9,s9):2 (n5,s5):3 (n3,s3):4VOTER_INCOMING] next=5 to remove self (leaseholder)
ajwerner-1568140936-03-n32cpu4-0023> I190910 18:48:33.109304 14838 storage/replica_command.go:1066  [n23,s23,r2850/1:/Table/59/2/1{56/3/"…-87/6/"…}] could not promote [n3,s3] to voter, rolling back: change replicas of r2850 failed: storage/replica_raft.go:305: [n23,s23,r2850/1:/Table/59/2/1{56/3/"…-87/6/"…}]: received invalid ChangeReplicasTrigger ENTER_JOINT ADD_REPLICA[(n3,s3):4VOTER_INCOMING], REMOVE_REPLICA[(n23,s23):1VOTER_OUTGOING]: after=[(n23,s23):1VOTER_OUTGOING (n9,s9):2 (n5,s5):3 (n3,s3):4VOTER_INCOMING] next=5 to remove self (leaseholder) 

consistency_failure2.txt

I'm trying to get some more repros now.

@nvanbenschoten (Member)

It's possible that the sticky bit issue is resolved by a533375, which was merged yesterday. I think this was broken by one of the atomic replica change refactors before being resolved in that commit.

@ajwerner (Contributor)

One thing to note is that these were produced on a branch designed to fail this test due to the race. It does not internally retry AdminSplit as much. It's possible that the error from AdminSplit is actually what introduces the hazard. The build was this diff off of 3ad8bda.

https://gist.github.com/ajwerner/9e98a6ed5285408b57275de60faba872

@ajwerner (Contributor)

It's possible that the sticky bit issue is resolved by a533375, which was merged yesterday. I think this was broken by one of the atomic replica change refactors before being resolved in that commit.

Does that change matter in clusters that are entirely on 19.2? I think it's just that each processor sets the sticky bit to a different time:

expirationTime = cp.flowCtx.Cfg.DB.Clock().Now().Add(time.Hour.Nanoseconds(), 0)

Also, there's a ton of concurrent data movement going on, and nowadays, with learners and atomic changes, we update the replica descriptor a lot, so it doesn't seem that crazy to me that we'd need to retry 10 times.

-               expirationTime = cp.flowCtx.Cfg.DB.Clock().Now().Add(time.Hour.Nanoseconds(), 0)
+               expirationTime = hlc.Timestamp{WallTime: cp.spec.WalltimeNanos + time.Hour.Nanoseconds()}

I hope that will make it somewhat better. I'm also open to increasing that 10 to something higher.
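
To make the reasoning concrete, here is a self-contained sketch (hypothetical function names, not the import processor code) of why deriving the expiration from a shared spec walltime helps: expirations read from each caller's own clock all differ, so every caller rewrites the sticky bit, while expirations derived from a shared walltime are identical:

```go
// Hypothetical illustration of per-caller vs. shared sticky-bit expirations.
package main

import (
	"fmt"
	"time"
)

func perCallerExpiration() int64 {
	// Each caller reads its own clock: values differ from call to call.
	return time.Now().UnixNano() + time.Hour.Nanoseconds()
}

func sharedExpiration(specWalltimeNanos int64) int64 {
	// Every caller derives the same value from the shared spec walltime.
	return specWalltimeNanos + time.Hour.Nanoseconds()
}

func main() {
	spec := time.Now().UnixNano()

	a, b := perCallerExpiration(), perCallerExpiration()
	fmt.Println("per-caller expirations equal:", a == b) // almost always false

	c, d := sharedExpiration(spec), sharedExpiration(spec)
	fmt.Println("shared expirations equal:", c == d) // true
}
```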

@ajwerner (Contributor)

It's not clear that this repros as readily on vanilla master. There are two interesting things in that diff: the scatter retries on condition failed and the number of retries in the split command. I'm trying to isolate which is the culprit. My suspicion is that the retries are masking a correctness bug.

@ajwerner (Contributor)

I did hit:

                Error: importing fixture: importing table history: pq: descriptor changed: [expected] r618:/Table/5{3/3-4} [(n14,s14):5, (n9,s9):22, (n5,s5):3, (n1,s1):23LEARNER, next=24, gen=146] != [actual] r618:/Table/5{3/3-4} [(n14,s14):5, (n9,s9):22VOTER_OUTGOING, (n5,s5):3, (n1,s1):23VOTER_INCOMING, next=24, gen=147]: unexpected value: raw_bytes:"7\211T:\003\010\352\004\022\002\275\213\032\001\276\"\006\010\016\020\016\030\005\"\010\010\t\020\t\030\026 \003\"\006\010\005\020\005\030\003\"\010\010\001\020\001\030\027 \002(\0300\223\001:\n\010\201\355\202\244\362\350\313\341\025@\001" timestamp:<wall_time:1568145490187302019 >

on master

@ajwerner (Contributor) commented Sep 10, 2019

Oof, so the consistency failure repro'd with the following patch:

diff --git a/pkg/storage/replica_command.go b/pkg/storage/replica_command.go
index 64311f1ede..24cece7291 100644
--- a/pkg/storage/replica_command.go
+++ b/pkg/storage/replica_command.go
@@ -516,7 +516,7 @@ func (r *Replica) executeAdminCommandWithDescriptor(
        ctx context.Context, updateDesc func(*roachpb.RangeDescriptor) error,
 ) *roachpb.Error {
        retryOpts := base.DefaultRetryOptions()
-       retryOpts.MaxRetries = 10
+       retryOpts.MaxRetries = 2
        var lastErr error
        for retryable := retry.StartWithCtx(ctx, retryOpts); retryable.Next(); {
                // The replica may have been destroyed since the start of the retry loop.

off of 3ad8bda

The cluster is still alive: ajwerner-1568147181-02-n32cpu4

EDIT: that one was with MaxRetries = 2; I rebuilt with 1, thinking it wasn't going to repro.
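
For readers following along, a generic, self-contained sketch of the retry pattern that MaxRetries bounds in the patch above; this illustrates the compare-and-swap-with-retries idea, not the actual executeAdminCommandWithDescriptor code:

```go
// Generic sketch of retry-on-conflict with a bounded retry count.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errDescriptorChanged = errors.New("descriptor changed")

// store simulates a value that other actors may modify concurrently.
var store atomic.Int64

// compareAndSwap models a conditional put that fails if the value moved.
func compareAndSwap(expected, next int64) error {
	if !store.CompareAndSwap(expected, next) {
		return errDescriptorChanged
	}
	return nil
}

func adminCommandWithRetries(maxRetries int, update func(cur int64) int64) error {
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		cur := store.Load()
		lastErr = compareAndSwap(cur, update(cur))
		if lastErr == nil {
			return nil
		}
		// Someone else (split, merge, replica change) updated it; retry.
	}
	return fmt.Errorf("too many retries: %w", lastErr)
}

func main() {
	store.Store(1)
	err := adminCommandWithRetries(10, func(cur int64) int64 { return cur + 1 })
	fmt.Println("result:", err, "value:", store.Load())
}
```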

@ajwerner (Contributor)

Internal data link for the test failure above: https://drive.google.com/file/d/1i9YsjvWXtc0b6D627-tRvLwkkKygSLtJ/view?usp=sharing

@ajwerner (Contributor)

Okay, 5 out of 5 of these seem to follow soon after:

ajwerner-1568147181-02-n32cpu4-0025> I190910 20:30:23.410751 10465 storage/replica_command.go:1051  [n25,s25,r1670/3:/Table/54/2/{875/1/…-906/3/…}] could not promote [n19,s19] to voter, rolling back: change replicas of r1670 failed: storage/replica_raft.go:305: [n25,s25,r1670/3:/Table/54/2/{875/1/…-906/3/…}]: received invalid ChangeReplicasTrigger ENTER_JOINT ADD_REPLICA[(n19,s19):4VOTER_INCOMING], REMOVE_REPLICA[(n25,s25):3VOTER_OUTGOING]: after=[(n31,s31):1 (n12,s12):2 (n25,s25):3VOTER_OUTGOING (n19,s19):4VOTER_INCOMING] next=5 to remove self (leaseholder)    

@ajwerner (Contributor)

And now, of course, repro #6 doesn't. Attached are the files from another repro. Given how easy it is to repro, I'm going to destroy these clusters. It's pretty scary that a retry loop makes this okay.

Internal: https://drive.google.com/file/d/16aDQF21j37m3YRyjwCiiauPHBug8Ob5a/view?usp=sharing

@ajwerner (Contributor)

@andreimatei, I'm passing this one over, at least for the consistency failures. We can talk tomorrow.

@nvanbenschoten (Member)

I have been able to reproduce the replica inconsistency crash without atomic replica reconfiguration enabled, so it doesn't look like this is related to that change.

@nvanbenschoten (Member)

On a hunch, I tried reverting #39709 to see if we were failing to fully replicate a range during snapshots. That did not fix the issue.

@thoszhang (Contributor)

@pbardea has #40676 out, which will get rid of the repeated splits during IMPORT. I'm going to try to reproduce the issue and verify that #40676 does fix it.

@thoszhang (Contributor)

So far I've run this test with #40676 (to fix the extra splits during IMPORT) 5 times with retryOpts.MaxRetries = 2, and it has failed every time with descriptor changed. I did try running the test on master with retryOpts.MaxRetries = 2 once and got 3/7 imports failing, instead of just 1/7 with #40676, which suggests that #40676 is helping, but it's hard to say.

I can try running with max retries set to 10, but do we have a principled way of setting the max retries? Is it fine to just have a very high probability of not failing due to this issue, or do we expect to never get it at all?

Internal link to files for one of the runs: https://drive.google.com/open?id=1xQ6ER_LUzbSdqAcKz3ZA9pbaxoTBrgNv

(cc @pbardea)

@nvanbenschoten (Member)

Is it fine to just have a very high probability of not failing due to this issue, or do we expect to never get it at all?

I think a very high probability of not failing is ok. @ajwerner and I talked last week about introducing some kind of best-effort semaphore on leaseholders to reduce contention on the descriptor, but I think fundamentally we'll never be able to place a hard upper bound on the number of retries that this operation requires.
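
A minimal sketch of the best-effort semaphore idea, assuming nothing about the eventual design: callers try to grab a slot before touching the descriptor, but proceed anyway if none is free, so correctness still rests on the retry loop:

```go
// Hypothetical best-effort (non-blocking) semaphore; not a CockroachDB API.
package main

import "fmt"

type bestEffortSemaphore chan struct{}

func newBestEffortSemaphore(slots int) bestEffortSemaphore {
	return make(bestEffortSemaphore, slots)
}

// tryAcquire never blocks; it reports whether a slot was obtained.
func (s bestEffortSemaphore) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

func (s bestEffortSemaphore) release() { <-s }

func main() {
	sem := newBestEffortSemaphore(1)
	if sem.tryAcquire() {
		defer sem.release()
		fmt.Println("got a slot; descriptor contention should be low")
	} else {
		fmt.Println("no slot; proceeding anyway and relying on retries")
	}
}
```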

@thoszhang (Contributor)

Okay, I ran the roachtest 20 times with #40676 (and retries set to 10) and didn't get any failures. I think that's sufficient evidence that #40676 helps and we should merge it.

craig bot pushed a commit that referenced this issue Sep 17, 2019
40676: sql: stop presplitting spans on every import proc r=pbardea a=pbardea

We are currently presplitting spans (on every table and index boundary)
for every processor. We should only be doing this once, so this PR moves
it to the planning stage.

Addresses #39072.

Release note: None

40838: makefile: increase test timeout r=andreimatei a=andreimatei

This patch increases the test timeout for every package from 12m to 20m
(and under stress from 25m to 30m).
This is motivated by the SQL TestLogic suite, which has been inching towards
the 12m limit recently (according to TeamCity history) and is sometimes timing
out. Hopefully it's because we've been adding more tests...

Fixes #40572

Release note: None

Release justification: fix timeout flakes

Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
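
As background on the 40676 description above, a hypothetical sketch (not the actual PR) of the difference between issuing the splits in every processor and issuing them once at planning time:

```go
// Hypothetical planner/processor shapes; not the actual distsql import code.
package main

import "fmt"

// adminSplit stands in for issuing a split at a table/index boundary.
func adminSplit(key string) { fmt.Println("admin split at", key) }

// before: every processor re-issues the same splits, multiplying the
// descriptor traffic by the number of processors.
func runProcessorsBefore(numProcs int, boundaries []string) {
	for p := 0; p < numProcs; p++ {
		for _, b := range boundaries {
			adminSplit(b) // numProcs * len(boundaries) split requests
		}
	}
}

// after: the planner splits once, then processors just import their data.
func planAndRunAfter(numProcs int, boundaries []string) {
	for _, b := range boundaries {
		adminSplit(b) // len(boundaries) split requests in total
	}
	for p := 0; p < numProcs; p++ {
		// processors import without touching the range descriptors
	}
}

func main() {
	boundaries := []string{"/Table/57/1", "/Table/57/2", "/Table/58/1"}
	runProcessorsBefore(2, boundaries)
	fmt.Println("---")
	planAndRunAfter(2, boundaries)
}
```
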
@pbardea (Contributor) commented Sep 18, 2019

Closed by #40676.

pbardea closed this as completed Sep 18, 2019