roachtest: scrub/all-checks/tpcc/w=1000 failed #35986

Closed
cockroach-teamcity opened this issue Mar 20, 2019 · 63 comments · Fixed by #36942

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/3a7ea2d8c9d4a3e0d97f8f106fcf95b3f03765ec

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog

The test failed on master:
	cluster.go:1267,tpcc.go:130,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1187480-scrub-all-checks-tpcc-w-1000:5 -- ./workload run tpcc --warehouses=1000 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		9.2 103079.2 newOrder
		   59m6s    26598            0.0            2.1      0.0      0.0      0.0      0.0 orderStatus
		   59m6s    26598            0.0           20.0      0.0      0.0      0.0      0.0 payment
		   59m6s    26598            0.0            2.1      0.0      0.0      0.0      0.0 stockLevel
		E190320 15:17:10.469844 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.62:26257: connect: connection refused
		   59m7s    71358            0.0            2.2      0.0      0.0      0.0      0.0 delivery
		   59m7s    71358            1.0           21.1 103079.2 103079.2 103079.2 103079.2 newOrder
		   59m7s    71358            0.0            2.1      0.0      0.0      0.0      0.0 orderStatus
		   59m7s    71358            0.0           20.0      0.0      0.0      0.0      0.0 payment
		   59m7s    71358            0.0            2.1      0.0      0.0      0.0      0.0 stockLevel
		E190320 15:17:11.469848 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.62:26257: connect: connection refused
		: signal: killed
	cluster.go:1626,tpcc.go:140,scrub.go:58,test.go:1214: unexpected node event: 3: dead

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone Mar 20, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Mar 20, 2019
@thoszhang thoszhang self-assigned this Mar 21, 2019
@thoszhang
Contributor

This was another OOM. The heap profile that got written looks extremely similar to the last one (#35790):

(pprof) top
Showing nodes accounting for 2.72GB, 83.45% of 3.26GB total
Dropped 822 nodes (cum <= 0.02GB)
Showing top 10 nodes out of 204
      flat  flat%   sum%        cum   cum%
    1.50GB 46.14% 46.14%     1.50GB 46.14%  github.com/cockroachdb/cockroach/pkg/roachpb.(*ScanResponse).Unmarshal
    0.62GB 19.11% 65.25%     0.62GB 19.11%  github.com/cockroachdb/cockroach/pkg/storage/engine.copyFromSliceVector
    0.18GB  5.49% 70.74%     0.20GB  6.03%  github.com/cockroachdb/cockroach/pkg/sql.newOptTable
    0.10GB  3.01% 73.74%     0.10GB  3.01%  github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgwirebase.(*ReadBuffer).reset
    0.09GB  2.88% 76.62%     0.09GB  2.88%  github.com/cockroachdb/cockroach/vendor/github.com/andy-kimball/arenaskl.NewArena
    0.08GB  2.55% 79.17%     0.09GB  2.67%  github.com/cockroachdb/cockroach/pkg/sql/parser.(*sqlParserImpl).Parse
    0.04GB  1.35% 80.53%     0.04GB  1.35%  github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*parser).recvMsg
    0.04GB  1.33% 81.86%     0.04GB  1.33%  github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*DatumAlloc).NewDInt
    0.03GB  0.85% 82.71%     0.03GB  1.03%  github.com/cockroachdb/cockroach/pkg/sql/parser.(*scanner).scanNumber
    0.02GB  0.74% 83.45%     0.02GB  0.74%  github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*EncDatumRowAlloc).AllocRow

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9399d559ae196e5cf2ad122195048ff9115ab56a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194326&tab=buildLog

The test failed on release-19.1:
	scrub.go:83,cluster.go:1605,errgroup.go:57: read tcp 172.17.0.2:60850->35.237.203.215:26257: read: connection reset by peer
	cluster.go:1267,tpcc.go:130,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1194326-scrub-all-checks-tpcc-w-1000:5 -- ./workload run tpcc --warehouses=1000 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		.0            0.0      0.0      0.0      0.0      0.0 newOrder
		   4m36s    22663            0.0            0.0      0.0      0.0      0.0      0.0 orderStatus
		   4m36s    22663            0.0            0.0      0.0      0.0      0.0      0.0 payment
		   4m36s    22663            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		E190323 14:51:44.563428 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.20:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   4m37s    86325            0.0            0.0      0.0      0.0      0.0      0.0 delivery
		   4m37s    86325            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
		   4m37s    86325            0.0            0.0      0.0      0.0      0.0      0.0 orderStatus
		   4m37s    86325            0.0            0.0      0.0      0.0      0.0      0.0 payment
		   4m37s    86325            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1626,tpcc.go:140,scrub.go:58,test.go:1214: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a039a93a5cc6eb3f395ceb6f7dc8030985dccc29

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212269&tab=buildLog

The test failed on release-19.1:
	cluster.go:1293,tpcc.go:130,cluster.go:1631,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1212269-scrub-all-checks-tpcc-w-1000:5 -- ./workload run tpcc --warehouses=1000 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		0            0.0      0.0      0.0      0.0      0.0 newOrder
		    5m0s    74395            0.0            0.0      0.0      0.0      0.0      0.0 orderStatus
		    5m0s    74395            0.0            0.0      0.0      0.0      0.0      0.0 payment
		    5m0s    74395            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		E190401 14:40:54.130813 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.139:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		    5m1s    87634            0.0            0.0      0.0      0.0      0.0      0.0 delivery
		    5m1s    87634            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
		    5m1s    87634            0.0            0.0      0.0      0.0      0.0      0.0 orderStatus
		    5m1s    87634            0.0            0.0      0.0      0.0      0.0      0.0 payment
		    5m1s    87634            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1652,tpcc.go:140,scrub.go:58,test.go:1223: unexpected node event: 2: dead
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:774,test.go:760,cluster.go:1652,tpcc.go:140,scrub.go:58,test.go:1223: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1212269-scrub-all-checks-tpcc-w-1000 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: dead
		1: 3931
		4: 3311
		3: 3253
		Error:  2: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c6df752eefe4609b8a5bbada0955f79a2cfb790e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217763&tab=buildLog

The test failed on master:
	scrub.go:83,cluster.go:1667,errgroup.go:57: dial tcp 35.237.106.233:26257: connect: connection refused
	cluster.go:1329,tpcc.go:130,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1217763-scrub-all-checks-tpcc-w-1000:5 -- ./workload run tpcc --warehouses=1000 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		.0      0.0 newOrder
		  21m42s    34983            0.0            1.0      0.0      0.0      0.0      0.0 orderStatus
		  21m42s    34983            0.0            8.6      0.0      0.0      0.0      0.0 payment
		  21m42s    34983            0.0            0.9      0.0      0.0      0.0      0.0 stockLevel
		E190403 14:28:39.898324 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.62:26257: connect: connection refused
		  21m43s    72479            0.0            1.1      0.0      0.0      0.0      0.0 delivery
		  21m43s    72479            0.0            9.5      0.0      0.0      0.0      0.0 newOrder
		  21m43s    72479            0.0            1.0      0.0      0.0      0.0      0.0 orderStatus
		  21m43s    72479            0.0            8.6      0.0      0.0      0.0      0.0 payment
		  21m43s    72479            0.0            0.9      0.0      0.0      0.0      0.0 stockLevel
		E190403 14:28:40.898314 1 workload/cli/run.go:420  error in newOrder: dial tcp 10.142.0.62:26257: connect: connection refused
		: signal: killed
	cluster.go:1688,tpcc.go:140,scrub.go:58,test.go:1228: unexpected node event: 1: dead
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:776,test.go:762,cluster.go:1688,tpcc.go:140,scrub.go:58,test.go:1228: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1217763-scrub-all-checks-tpcc-w-1000 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		1: dead
		3: 4074
		4: 3907
		2: 3701
		Error:  1: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5267932f6fec0405b31328c1ad43711b0bb013e5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1220238&tab=buildLog

The test failed on master:
	cluster.go:1329,tpcc.go:142,scrub.go:58,test.go:1228: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1220238-scrub-all-checks-tpcc-w-1000:5 -- ./workload check tpcc --warehouses=1000 {pgurl:1} returned:
		stderr:
		
		stdout:
		I190404 17:10:52.905025 1 workload/tpcc/tpcc.go:292  check 3.3.2.1 took 405.83348ms
		I190404 17:11:17.096606 1 workload/tpcc/tpcc.go:292  check 3.3.2.2 took 24.191499294s
		I190404 17:11:24.410089 1 workload/tpcc/tpcc.go:292  check 3.3.2.3 took 7.313397246s
		I190404 17:15:08.651204 1 workload/tpcc/tpcc.go:292  check 3.3.2.4 took 3m44.241034081s
		Error: check failed: 3.3.2.4: length of order.sum(o_ol_cnt) != order_line.count(*)
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/b88a6ce86bfe507e14e1e80fdaefd219b5f0b046

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1222804&tab=buildLog

The test failed on provisional_201904050025_v19.1.0-rc.2:
	cluster.go:1329,tpcc.go:130,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1222804-scrub-all-checks-tpcc-w-1000:5 -- ./workload run tpcc --warehouses=1000 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		0      0.0 newOrder
		   7m26s    45579            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		   7m26s    45579            0.0            0.0      0.0      0.0      0.0      0.0 payment
		   7m26s    45579            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		E190405 14:00:53.004702 1 workload/cli/run.go:420  error in newOrder: dial tcp 10.128.0.10:26257: connect: connection refused
		   7m27s   117069            0.0            0.0      0.0      0.0      0.0      0.0 delivery
		   7m27s   117069            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
		   7m27s   117069            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		   7m27s   117069            0.0            0.0      0.0      0.0      0.0      0.0 payment
		   7m27s   117069            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		E190405 14:00:54.004722 1 workload/cli/run.go:420  error in newOrder: dial tcp 10.128.0.10:26257: connect: connection refused
		: signal: killed
	cluster.go:1688,tpcc.go:140,scrub.go:58,test.go:1228: unexpected node event: 4: dead
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:776,test.go:762,cluster.go:1688,tpcc.go:140,scrub.go:58,test.go:1228: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1222804-scrub-all-checks-tpcc-w-1000 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		4: dead
		1: 3644
		3: 3288
		2: 3411
		Error:  4: dead

@tbg
Member

tbg commented Apr 5, 2019

@andreimatei is dusting off (well, hasn't really had much time to gather dust) the roachtest stresser.

It looks like we need to treat this as a release blocker until we have a reason to believe that 19.1 is unaffected. I'm adding it to the list. cc @bdarnell

@bdarnell
Contributor

bdarnell commented Apr 5, 2019

if leftRows.Next() || rightRows.Next() {
return errors.Errorf("length of order.sum(o_ol_cnt) != order_line.count(*)")
}

This check (and a few others in this file) is imprecise: if the server crashes while results are being streamed back (or otherwise returns an error), the check will report this as a data inconsistency instead of the error that occurred. We should be sure to report leftRows.Err() and rightRows.Err() in case they are not nil.

There were no server crashes during this run, and I don't see a way that these read-only check statements could return a late error, but when we do manage to reproduce this we should make sure it's not a false positive.
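For illustration, a minimal sketch of the kind of error reporting being suggested (variable and column names are taken from the quoted snippet; the row-comparison loop is elided):

// Sketch only: surface a row-iteration error (e.g. the server dropping the
// connection mid-stream) as such, instead of letting it show up as a bogus
// length mismatch.
if err := leftRows.Err(); err != nil {
    return errors.Wrap(err, "error reading order.sum(o_ol_cnt) rows")
}
if err := rightRows.Err(); err != nil {
    return errors.Wrap(err, "error reading order_line.count(*) rows")
}
if leftRows.Next() || rightRows.Next() {
    return errors.Errorf("length of order.sum(o_ol_cnt) != order_line.count(*)")
}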

@andreimatei
Contributor

Actually, I'm not a good owner for this. I haven't gotten to testing anything today, I don't know if I'll do anything over the weekend, and then on Tuesday I'm off for a few days. Another victim might be better :)

@bdarnell
Contributor

bdarnell commented Apr 6, 2019

More sketchiness in the check: it runs two separate read-only transactions without syncing them using AS OF SYSTEM TIME (AOST). So if it's run on a non-idle cluster, it could produce false positives. I think there's some reason to believe that might be at least a possibility here (although I'm not convinced).
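For illustration, a rough sketch of what syncing the two reads could look like (the queries are simplified and the names assumed; this is not the actual check code):

// Sketch only: capture one timestamp and run both sides of the check at it,
// so that a non-idle cluster cannot make the two reads observe different data.
var ts string
if err := db.QueryRow(`SELECT cluster_logical_timestamp()`).Scan(&ts); err != nil {
    return err
}
leftRows, err := db.Query(
    `SELECT sum(o_ol_cnt) FROM "order" AS OF SYSTEM TIME ` + ts + ` GROUP BY o_w_id, o_d_id`)
if err != nil {
    return err
}
rightRows, err := db.Query(
    `SELECT count(*) FROM order_line AS OF SYSTEM TIME ` + ts + ` GROUP BY ol_w_id, ol_d_id`)
if err != nil {
    return err
}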

The test runs tpcc for 2 hours, and during that time it runs SCRUB three times, and expects them to complete in less than that time. A comment indicates that SCRUB is expected to take 35m. In this instance, the three SCRUBs took 33m, 45m, and 40m, adding up to more than 2h (with a 10m sleep at startup). The last scrub finishes after tpcc has stopped:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 7200.0s     1568.4  12.2%  16507.8   5100.3  90194.3 103079.2 103079.2 103079.2
17:10:51 scrub.go:80: SCRUB check 3 took 40m39.359757107s

"Stopping" TPCC in this case means cancelling all the workers. If anything is in a cancellation-insensitive retry loop on the server side, it could keep going.

The check that failed started at 17:11:24, 33s after the last SCRUB finished. But p90 latency was 90s, indicating that the scrub test is either badly overloaded or experiencing high contention. So if there was enough stuff queued up at the end of the test that something went on to get its timestamp pushed by 30+ seconds and still commit, we could explain the failure.

The scrub test runs tpcc-1000 on a 4-node, 4 vcpu/node cluster. That's not enough to keep up with the workload. Is it bad enough to queue up enough work (perhaps exacerbated by long-running scrub queries pushing timestamps) that something could still be running by the time the check starts? It seems unlikely, but maybe?

@bdarnell
Contributor

bdarnell commented Apr 6, 2019

Aha, tpcc specifically detaches its transactions from the context that gets cancelled:

// Run transactions with a background context because we don't want to
// cancel them when the context expires. Instead, let them finish normally
// but don't account for them in the histogram.
start := timeutil.Now()
if _, err := tx.run(context.Background(), warehouseID); err != nil {
return errors.Wrapf(err, "error in %s", txInfo.name)
}

That considerably increases the odds that the last round of contending transactions fights it out for 30+ seconds.
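If we want transactions to finish normally but not run arbitrarily long past the workload's end, a hypothetical tweak (not the actual code; the limit is made up) would be to give them their own timeout instead of a bare background context:

// Hypothetical sketch: detach from the workload's context, but cap how long
// a straggling transaction can keep fighting it out after cancellation.
txCtx, cancel := context.WithTimeout(context.Background(), 2*time.Minute) // assumed limit
defer cancel()
start := timeutil.Now()
if _, err := tx.run(txCtx, warehouseID); err != nil {
    return errors.Wrapf(err, "error in %s", txInfo.name)
}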

@tbg
Member

tbg commented Apr 6, 2019

Interesting. So the theory is that the check is buggy and the trigger for that isn't rare? It is true that the scrub tests are vastly underprovisioned. There seems to be enough evidence that this might've been the problem.

While we scrutinize and fix the checks, we should also improve them so that they at least log the numbers they're comparing. In an ideal world, they would also run a verbose query that would return the inconsistency found, though if there is one we're going to want to save the data directory anyway.

We could add an S3 bucket to which we can upload the store directories in such a case. This would also be used by the replica consistency checks in roachtests. cc @bobvawter

@bdarnell
Contributor

bdarnell commented Apr 6, 2019

Interesting. So the theory is that the check is buggy and the trigger for that isn't rare?

The checks are clearly buggy and should fail easily if run concurrently with a live workload (at least with --wait=false, which the scrub tests also use; true TPC-C spends enough time idling that it might be unlikely to hit the inconsistency).

The speculative part is that maybe the overload condition is leaving some activity running even after the TPCC run has reached its end. The context leak in #35986 (comment) is less of a smoking gun than it initially appeared, since the checks are run from a separate workload check process that only starts after the original workload run process has terminated. So unless there's yet another issue in the way these steps are sequenced, there is no more client activity by the time the checks run.

So for this to be the problem, we need some transactions to be queued up in such a way that in the absence of client activity, they will push their timestamps out by a significant amount but still be able to have server-side retries and commit. That seems implausible to me on the time scales we're talking about.

There's a ton of this in the logs, all at the same time when the test ends:

I190404 17:08:44.121859 25964942 internal/client/txn.go:619  [n1] async rollback failed: context deadline exceeded

So we can see that there were a lot of pending transactions that apparently got canceled when the client went away, but because we weren't able to process all the cancellations in time maybe something kept going. Normally I'd say this is a sign that we need to replace RunAsyncTask with RunLimitedAsyncTask to provide backpressure, but there's nothing to backpressure in this case. It could still be a win if it delays the start of each async rollback's timeout, though.
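As a generic illustration of the bounded-concurrency pattern being alluded to (this is not the stopper API; the limit and names are made up):

// Generic sketch: a semaphore that limits how many async rollbacks are in
// flight, so a flood of cancellations doesn't start all of their timeouts
// (and their goroutines) at once.
sem := make(chan struct{}, 64) // assumed limit
runLimited := func(ctx context.Context, task func(context.Context)) {
    sem <- struct{}{} // blocks, i.e. backpressures, once 64 tasks are in flight
    go func() {
        defer func() { <-sem }()
        task(ctx)
    }()
}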

The goroutine count on n1 was 11k before the async rollback failed spam, and 225 after. Other nodes had somewhat lower "before" counts (mostly around 8k), but all dropped to around 200 afterwards, which I think rules out any server-side retry loops. We're just left with whatever's queued up in the raft log, which means that no new timestamps are getting assigned. I can believe that we'd still be working through that raft log by the time the checks start, but I don't see that it would make the checks inconsistent.

@bdarnell
Contributor

bdarnell commented Apr 6, 2019

The check queries run for over three minutes. This makes them eligible for follower reads for the latter half of their run. The overload conditions in this test may stress certain aspects of follower reads, such as how they work when nodes (both leader and follower) are far behind on their raft logs, or intents resolving with their timestamps pushed significantly ahead of their original timestamps.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/6da68d7fe2c9a29b85e2ec0c7e545a0d6bdc4c5c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1226521&tab=buildLog

The test failed on release-19.1:
	scrub.go:83,cluster.go:1667,errgroup.go:57: read tcp 172.17.0.2:39498->34.73.1.69:26257: read: connection reset by peer
	cluster.go:1329,tpcc.go:169,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1226521-scrub-all-checks-tpcc-w-1000:5 -- ./workload run tpcc --warehouses=1000 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		      0.0 newOrder
		  1h1m2s    25858            1.0            2.2   4563.4   4563.4   4563.4   4563.4 orderStatus
		  1h1m2s    25858            2.0           20.4  47244.6 103079.2 103079.2 103079.2 payment
		  1h1m2s    25858            0.0            2.1      0.0      0.0      0.0      0.0 stockLevel
		E190407 15:10:28.426806 1 workload/cli/run.go:420  error in delivery: dial tcp 10.142.0.152:26257: connect: connection refused
		  1h1m3s    78871            0.0            2.2      0.0      0.0      0.0      0.0 delivery
		  1h1m3s    78871            0.0           21.6      0.0      0.0      0.0      0.0 newOrder
		  1h1m3s    78871            0.0            2.2      0.0      0.0      0.0      0.0 orderStatus
		  1h1m3s    78871            0.0           20.4      0.0      0.0      0.0      0.0 payment
		  1h1m3s    78871            0.0            2.1      0.0      0.0      0.0      0.0 stockLevel
		E190407 15:10:29.426873 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.152:26257: connect: connection refused
		: signal: killed
	cluster.go:1688,tpcc.go:179,scrub.go:58,test.go:1237: Goexit() was called
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1688,tpcc.go:179,scrub.go:58,test.go:1237: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1226521-scrub-all-checks-tpcc-w-1000 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		1: dead
		2: 3588
		4: 3560
		3: 3699
		Error:  1: dead

@bdarnell
Contributor

bdarnell commented Apr 8, 2019

I haven't reproduced the check failure yet, but after watching some normal runs we don't seem to be getting very large raft logs. I've just spot-checked, but every data range I've looked at has a last log index of no more than 30. I also haven't seen any mismatched applied indexes; all the followers are up to date with the leader on the ranges I've looked at.

What we see instead is a lot of invalid leases. During the test run, one or more nodes misses a heartbeat, invalidating a lot of leases. These leases are reestablished very slowly while the checks are running, and then at the end of the checks all the remaining ranges get leases very quickly. I'm not sure whether that's a sign that there's a bottleneck somewhere preventing leases from being acquired quickly or if there's something about the access patterns of the checks that touches a lot of formerly-idle ranges at the end.

So during the check that failed, our read RPCs are likely to hit a lot of ranges without a valid lease. For fresh reads, that would force a lease acquisition. For follower reads, it depends on how the closed timestamp has advanced.

My current working hypothesis:

  • During the test, a follower's heartbeat failed and it missed receiving a log entry.
  • The range quiesced anyway because the node was not live.
  • Since the node was actually up, it learned about this quiescence on the next heartbeat, without also learning that there were log entries it didn't have.
  • A quiescent range advances its closed timestamp without getting new lease applied indexes. Until something wakes the range back up, this follower will be able to serve follower reads even though it's not up to date.

@bdarnell
Contributor

bdarnell commented Apr 8, 2019

Since the node was actually up, it learned about this quiescence on the next heartbeat, without also learning that there were log entries it didn't have.

No, that's not how it works. We check for this.

Maybe there were two separate quiescence events:

  • The first quiescence worked normally and all followers were up to date.
  • One follower misses a heartbeat and becomes non-live.
  • While the follower is unavailable, the range wakes up, commits a change, then re-quiesces. The follower misses all of this.
  • When the follower becomes live again, it still has its quiesced range.

@ajwerner
Contributor

ajwerner commented Apr 8, 2019

The check queries run for over three minutes. This makes them eligible for follower reads for the latter half of their run. The overload conditions in this test may stress certain aspects of follower reads, such as how they work when nodes (both leader and follower) are far behind on their raft logs, or intents resolving with their timestamps pushed significantly ahead of their original timestamps.

What makes you think that this has to do with the check queries and not queries in the TPC-C workload? It's worth noting that a large number of TPC-C queries take well over 45 seconds (see the p95 values during the test). My intuition is that a stale read in one of the order-updating queries is a more likely culprit than anything else. Do we anticipate that the scrub would interfere with the check?

  • A quiescent range advances its closed timestamp without getting new lease applied indexes. Until something wakes the range back up, this follower will be able to serve follower reads even though it's not up to date.

I'd like to understand how this happens. The replica which has failed to heartbeat its liveness record should still be able to serve reads below the old closed timestamp, but in theory shouldn't those reads be safe? Also, and this is now pushing on things I don't fully understand, when the node becomes live again, shouldn't the epoch have changed?

If we're generally worried about correctness of follower reads in read-write transactions we could take a more dramatic step to prevent follower reads in all read-write transactions. We could make clients explicitly opt in to them by setting a header flag and enforcing that we only set such a header for read only transactions in the conn executor.

@bdarnell
Contributor

bdarnell commented Apr 8, 2019

What makes you think that this has to do with the check queries and not queries in the TPC-C workload?

A fair question. When/if we manage to reproduce this we can easily distinguish between a problem in the persisted data and a problem in the check. But in the meantime, I believe it's more likely to be a problem in the check.

The failing check verifies that the o_ol_cnt field is consistent with a count() in the order_line table (the former is effectively a denormalized view of the latter). The order lines are all inserted at once so this invariant does not depend on a read-modify-write within the new_order transaction, and I don't see an easy way for a stale read to result in (this particular kind of) garbage data.

Do we anticipate that the scrub would interfere with the check?

I can't think of any reason it would.

The replica which has failed to heartbeat its liveness record should still be able to serve reads below the old closed timestamp, but in theory shouldn't those reads be safe?

Yes. The (potential) problem is that there is special logic for advancing the closed timestamp of a quiesced range. (there was a lot of discussion about this on the RFC but now I'm just paging all this stuff back in and trying to figure out where we left things).

Also, and this now is pushing on things I don't fully understand, when the node becomes live again, shouldn't the epoch have changed?

Not necessarily. The epoch is bumped when one node steals another's leases. A node can go from live to dead and back again and reuse its old epoch. (also, in this scenario it's the follower that becomes non-live, but we usually only care about epochs on leaders/leaseholders)

If we're generally worried about correctness of follower reads in read-write transactions we could take a more dramatic step to prevent follower reads in all read-write transactions.

Didn't we already do that? #35969

@bdarnell
Contributor

bdarnell commented Apr 8, 2019

Yes. The (potential) problem is that there is special logic for advancing the closed timestamp of a quiesced range. (there was a lot of discussion about this on the RFC but now I'm just paging all this stuff back in and trying to figure out where we left things).

Quiescence looks like a red herring. An early version of the original follower reads RFC had special treatment for quiescence, but this was dropped in #26362.

@ajwerner
Contributor

ajwerner commented Apr 8, 2019

Quiescence looks like a red herring. An early version of the original follower reads RFC had special treatment for quiescence, but this was dropped in #26362.

Cool, I was about to remark that I don't see much in the way of special logic around quiesced ranges. We always seem to read the LeaseAppliedIndex when publishing to the closed timestamp subsystem.

Didn't we already do that? #35969

Ack.


If I understand the state of the discussion, we're back to not really having a plausible theory for how this might have happened. It seems to me that either (a) the data was actually inconsistent at check time, or (b) the data was not inconsistent at check time but the check transaction failed to observe a consistent snapshot.

The argument against (a) is that:

The order lines are all inserted at once so this invariant does not depend on a read-modify-write within the new_order transaction, and I don't see an easy way for a stale read to result in (this particular kind of) garbage data.

(b) also seems somewhat unlikely given that we think that all writing transactions from workload run had concluded before the workload check began. If we were to observe a stale read from a follower during the check transaction, it seems like it would have to be quite stale.

The other thing going on here is that there's a concurrent SCRUB but as I understand it that scrub should not be writing in a way that might interfere with reads.

Is there anything I'm missing?

@bdarnell
Contributor

bdarnell commented Apr 8, 2019

If I understand the state of the discussion, we're back to not really having a plausible theory for how this might have happened. It seems to me that either (a) the data was actually inconsistent at check time, or (b) the data was not inconsistent at check time but the check transaction failed to observe a consistent snapshot.

Yup, except that there is no "check transaction". The check runs two non-transactional statements at a time while the database is believed to be idle, and expects them to both see consistent data.

The argument against (a) is that: The order lines are all inserted at once

That's an argument against a follower read causing us to insert inconsistent data, not a complete argument against (a). It could also be caused by something like the issue fixed in #32166, for example.

If we were to observe a stale read from a follower during the check transaction, it seems like it would have to be quite stale.

Right. That's why I was looking at quiescence issues, since a quiescent range could be arbitrarily far behind.

The other thing going on here is that there's a concurrent SCRUB but as I understand it that scrub should not be writing in a way that might interfere with reads.

Note that the scrub is concurrent with the main tpcc workload, not with the checks. By the time the checks start they're supposed to be the only thing running. But scrub is just a normal read and I don't see any reason to suspect it. Instead of the scrub, I believe the reason this test saw an issue that none of our other TPCC tests did is that it uses a very under-provisioned cluster.

@bdarnell
Contributor

Continuing to iterate on the theory from #35986 (comment)

Incrementing the epoch causes leases to be non-equivalent, which makes it impossible for a new epoch's lease to make it into leasePostApply without setting leaseChangingHands. So what if we take out the epoch increment?

I190411 04:11:51.380438 207 storage/node_liveness.go:775  [n4,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context done during DistSender.
Send: context deadline exceeded)
W190411 04:11:51.380471 207 storage/node_liveness.go:518  [n4,hb] slow heartbeat took 4.5s
W190411 04:11:51.380490 207 storage/node_liveness.go:458  [n4,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
I190411 04:11:54.382208 141352 internal/client/txn.go:619  [n4] async rollback failed: aborted during DistSender.Send: context deadline exceeded
W190411 04:11:55.880906 207 storage/node_liveness.go:518  [n4,hb] slow heartbeat took 4.5s
I190411 04:11:55.880939 207 storage/node_liveness.go:448  [n4,hb] heartbeat failed on epoch increment; retrying
...
W190411 04:12:00.110867 207 storage/node_liveness.go:518  [n4,hb] slow heartbeat took 4.2s

At 04:11:51, 04:11:55, and 04:12:00, we see three separate heartbeats that take over 4 seconds. The first one times out (at 4.5s), so when the second one starts the first is still pending in the raft log. The second one has a CPut failure, which gets logged as "heartbeat failed on epoch increment" even though that's not necessarily what happens.

// If the actual liveness is different than expected, but is
// considered live, treat the heartbeat as a success. This can
// happen when the periodic heartbeater races with a concurrent
// lease acquisition.
if actual.IsLive(nl.clock.Now(), nl.clock.MaxOffset()) && !incrementEpoch {
return errNodeAlreadyLive
}
// Otherwise, return error.
return ErrEpochIncremented

A CPut failure is turned into either errNodeAlreadyLive or ErrEpochIncremented, but there is also a third case: a previous heartbeat succeeded, invalidating our CPut's expected value, but the node is no longer live by the time we apply.

All of n4's leases, including the one for r5542, were considered expired and many of them entered this Heartbeat call:

if err = p.repl.store.cfg.NodeLiveness.Heartbeat(ctx, status.Liveness); err != nil {

When the third heartbeat succeeds at 04:12:00, all the leases are unblocked.

W190411 04:12:00.136248 150047 storage/node_liveness.go:518  [n4,s4,r5542/1:/Table/59/1/{622-780/0}] slow heartbeat took 3.5s
W190411 04:12:00.136336 150025 storage/node_liveness.go:518  [n4,s4,r4521/1:/Table/55/1/87{2/831…-3/122…}] slow heartbeat took 3.5s
W190411 04:12:00.136533 150093 storage/node_liveness.go:518  [n4,s4,r3894/1:/Table/54/1/94{1/6/2…-3/6/9…}] slow heartbeat took 3.5s

(note that despite the log spam, there is some batching; we're not hammering the liveness range with redundant heartbeats here)

At this point, we've successfully heartbeated our existing epoch, so our old "expired" lease has been revived! The code in requestLeaseAsync doesn't consider that possibility, though. It goes ahead and proposes a new lease anyway.

The new proposed lease is identical to the old one except for the Start and ProposedTS fields. ProposedTS is ignored by Lease.Equivalent. Start is modified by RequestLease. So I think we had multiple redundant leases proposed and applied, which didn't trigger the leaseChangingHands part of leasePostApply. But that should be OK, because in order to have the same Sequence, they had to have the same Start, so there shouldn't be a need to update the timestamp cache. So I don't see how to get from here to an inconsistency.

@tbg
Member

tbg commented Apr 15, 2019

I poked around a bit too but can't find anything going wrong here either. In the end I looked into what happens with closed timestamps when the epoch becomes stale (and later non-stale). The answer is: nothing interesting:

next, epoch, err := p.cfg.Clock(p.cfg.NodeID)
next.WallTime -= int64(targetDuration)
if err != nil {
if p.everyClockLog.ShouldLog() {
log.Warningf(ctx, "unable to move closed timestamp forward: %s", err)
}
// Broadcast even if nothing new was queued, so that the subscribers
// loop to check their client's context.
p.mu.Broadcast()
} else {

We just don't close out a timestamp and leave the tracker alone. When we become live again, one is emitted, as you'd expect.

Time for a new repro.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9e6ae3cc37e7691147bb6f5d1a156ebe4c5cf7f9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1245443&tab=buildLog

The test failed on master:
	cluster.go:1329,tpcc.go:168,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1245443-scrub-all-checks-tpcc-w-1000:5 -- ./workload run tpcc --warehouses=1000 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		n.go:428  error in newOrder: dial tcp 10.142.0.111:26257: connect: connection refused
		   9m11s    20272            0.0            0.0      0.0      0.0      0.0      0.0 delivery
		   9m11s    20272            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
		   9m11s    20272            0.0            0.4      0.0      0.0      0.0      0.0 orderStatus
		   9m11s    20272            0.0            0.0      0.0      0.0      0.0      0.0 payment
		   9m11s    20272            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		   9m12s    20272            0.0            0.0      0.0      0.0      0.0      0.0 delivery
		   9m12s    20272            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
		   9m12s    20272            0.0            0.4      0.0      0.0      0.0      0.0 orderStatus
		   9m12s    20272            0.0            0.0      0.0      0.0      0.0      0.0 payment
		   9m12s    20272            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1688,tpcc.go:178,scrub.go:58,test.go:1237: unexpected node event: 4: dead
	cluster.go:953,context.go:89,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1688,tpcc.go:178,scrub.go:58,test.go:1237: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1245443-scrub-all-checks-tpcc-w-1000 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		4: dead
		2: 3395
		1: 3226
		3: 3295
		Error:  4: dead

@bdarnell
Contributor

I don't have another repro, but I have a theory. The key is the fact that the node lost and renewed its liveness several times, combined with weak input validation in cmd_lease_request.go. Follower reads aren't necessary, and it appears that this bug has been with us a very long time. New timeline:

  1. n4's liveness expires at time t1. It tries to heartbeat but its liveness heartbeat gets stuck in a queue somewhere because the disk is overloaded.
  2. Some other node (i'll call it n1) picks a timestamp t2 and checks the lease status. It sees that the leaseholder n4 has an expired lease, so it tries to increment n4's epoch. This also goes in a queue somewhere.
  3. n4's heartbeat goes through, updating its liveness. It is able to serve reads until the liveness expires again at t3.
  4. n1's epoch increment goes through. n4 is no longer live, so the increment is allowed, even though it is a separate non-liveness event from the one that prompted this message to be sent.
  5. n1 proceeds to request a lease, using t2 as its start timestamp. This timestamp has been carried forward from the original leaseStatus call.
  6. Input validation in cmd_lease_request.go is weak. It checks that the start time does not move backwards for extensions, or for new expiration-based leases, but it doesn't perform this check for epoch-based leases (because the necessary state is not local; it's held on the liveness range). The lease is allowed.
  7. The new leaseholder resets its timestamp cache to t2, even though n4 may have served reads between t2 and t3.

I think the fix is that whenever we perform a liveness action in the course of requesting a lease, we have to adjust our start time to a value after the last known change to the liveness record.
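A minimal sketch of that idea (names are assumed; this is not the actual lease-request code):

// Sketch only: if we heartbeated or incremented an epoch on the way to
// requesting a lease, don't let the new lease start before the last known
// change to the liveness record, since the old leaseholder may have served
// reads up to that point.
func adjustedLeaseStart(reqStart, livenessChangedAt hlc.Timestamp) hlc.Timestamp {
    if reqStart.Less(livenessChangedAt) {
        return livenessChangedAt
    }
    return reqStart
}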

@tbg
Member

tbg commented Apr 18, 2019

In 1-3, shouldn't the call to IncrementEpoch and n4's liveness heartbeat serialize and force one of the two to fail? We call into IncrementEpoch here:

} else if err = p.repl.store.cfg.NodeLiveness.IncrementEpoch(ctx, status.Liveness); err != nil {
log.Error(ctx, err)
}

and the status got passed down all the way from here:

return r.requestLeaseLocked(ctx, status), nil

This means, or rather, should mean, that the Liveness in the status is both the one proven to have expired and the base in the increment's CPut. Thus IncrementEpoch would only succeed if the persisted liveness hadn't been changed (by the heartbeat, which changes the expiration) by the time the increment's CPut evaluates. Conversely, if we do manage to increment the epoch, n4 wouldn't be able to "reactivate" the old epoch via its heartbeat.

I also looked at IncrementEpoch for a way in which it could observe the changed Expiration and increment the epoch anyway, but I don't see it. We seem to require the passed-in liveness:

update := livenessUpdate{Liveness: *liveness}
update.Epoch++
if err := nl.updateLiveness(ctx, update, liveness, func(actual storagepb.Liveness) error {
defer nl.maybeUpdate(actual)
if actual.Epoch > liveness.Epoch {
return errEpochAlreadyIncremented
} else if actual.Epoch < liveness.Epoch {
return errors.Errorf("unexpected liveness epoch %d; expected >= %d", actual.Epoch, liveness.Epoch)
}
return errors.Errorf("mismatch incrementing epoch for %+v; actual is %+v", *liveness, actual)
}); err != nil {
if err == errEpochAlreadyIncremented {
return nil
}
return err
}
log.Infof(ctx, "incremented n%d liveness epoch to %d", update.NodeID, update.Epoch)
nl.maybeUpdate(update.Liveness)
nl.metrics.EpochIncrements.Inc(1)
return nil

I didn't see anything in updateLivenessAttempt either.

I generally agree that a bug in this area still makes sense and hope that I'm missing something in the above "refutation".

@bdarnell
Contributor

Thus IncrementEpoch would only succeed if the persisted liveness hadn't been changed (by the heartbeat, which changes the expiration) by the time the increment's CPut evaluates.

There's this:

if err == errEpochAlreadyIncremented {
return nil
}

If our CPut fails because of an expiration mismatch, we then check the Epoch field of the new value. If it's already at our desired (post-increment) value because someone else incremented it, we consider the increment a success. In other words, when two increments race, they can both succeed, and go on to race their RequestLeases. The one that lost the increment race may have a Start time that is too low, so if it wins the RequestLease race it won't have the right timestamp cache.

So we need to add a third actor to this drama, replacing step 4 from #35986 (comment) :

4.1: After n4's second missed heartbeat, another node (call it n2) sends another epoch increment at t4. This increment succeeds.
4.2: n1's epoch increment goes through. The cput fails, but the method succeeds. n1 thinks it incremented the epoch at time t2
4.3: n1 and n2 both submit RequestLease rpcs, with start times t2 and t4 respectively. But n4 was able to serve reads between t2 and t3, so if the lease at t2 wins the timestamp cache is missing entries.
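The fix that eventually landed (#36942, quoted in the commit messages below) moves in this direction; a minimal sketch of the shape of the change, with the condition-failed closure abbreviated as a hypothetical handleCondFailed:

// Sketch only, not the actual diff: stop treating "someone else already
// incremented the epoch" as our own success. A caller may only trust its own
// start timestamp if its own increment took effect.
if err := nl.updateLiveness(ctx, update, liveness, handleCondFailed); err != nil {
    // Previously errEpochAlreadyIncremented was swallowed here (return nil),
    // letting a racing caller believe the node was confirmed dead as of its
    // own, possibly much older, timestamp.
    return err
}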

@bdarnell
Contributor

I think this is a bug. Is it the bug that happens here? Parts of it sound far-fetched, especially the fact that n1's IncrementEpoch starts at step 2 but doesn't complete until step 4.2, after a heartbeat from n4, a second liveness expiration, and an IncrementEpoch from n2.

However, there are factors in the node liveness implementation that make this more plausible. A semaphore in the liveness module ensures that only one IncrementEpoch is in flight at a time. Since these are performed with a background context, there is no timeout that will fire while they're in the queue (there is a timeout that starts once the RPC begins). So when a node fails liveness, many different ranges may independently reach the IncrementEpoch path and queue up their own redundant increments. Many of these will hit the fast path at the start of updateLivenessAttempt, but we could still queue up a bunch of them while things are bogged down (as we can see from the log spam when the jam is cleared). So I think with some things backlogged on the liveness semaphore, and others waiting on the KV machinery on the liveness range's leaseholder, we could see extreme reorderings like that.

The real question is how we can test this hypothesis. I don't see a good way to induce this kind of timing in a roachtest (maybe by hacking a sleep into IncrementEpoch?). I could probably manually reproduce this timing in a unit test, but it would be very fiddly and wouldn't tell us whether it's what's really happening.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 18, 2019
IncrementEpoch was failing to distinguish between the request that
caused the increment and another caller making the same increment.
This is incorrect since a successful increment tells you *when* the
node was confirmed dead, while relying on another node's increment
leaves this uncertain.

In rare cases involving a badly overloaded cluster, this could result
in inconsistencies (non-serializable transactions) due to incorrect
timestamp cache management.

Fixes cockroachdb#35986

Release note (bug fix): Fix a rare inconsistency that could occur on
badly overloaded clusters.
@cockroach-teamcity

This comment has been minimized.

@tbg
Member

tbg commented Apr 19, 2019

^- no actionable artifacts because #36562 hasn't landed. @andreimatei do you have any cycles to help me push that over the finish line?

craig bot pushed a commit that referenced this issue Apr 19, 2019
36942: storage: Don't swallow ErrEpochAlreadyIncremented r=tbg a=bdarnell

IncrementEpoch was failing to distinguish between the request that
caused the increment and another caller making the same increment.
This is incorrect since a successful increment tells you *when* the
node was confirmed dead, while relying on another node's increment
leaves this uncertain.

In rare cases involving a badly overloaded cluster, this could result
in inconsistencies (non-serializable transactions) due to incorrect
timestamp cache management.

Fixes #35986

Release note (bug fix): Fix a rare inconsistency that could occur on
badly overloaded clusters.

Co-authored-by: Ben Darnell <ben@bendarnell.com>
@craig craig bot closed this as completed in #36942 Apr 19, 2019
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 21, 2019
I had to study this code for cockroachdb#35986, so I took the opportunity to
update the documentation while it's fresh in my mind.

Release note: None
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 23, 2019
The scrub roachtest was previously running tpcc-1000 on a cluster of
12 total vcpus, which is not enough (it needs ~double that). This
exposed a lot of interesting issues like cockroachdb#35986, but it's only
incidental to the main purpose of this test (and it's also flaky due
to uninteresting problems associated with overloading).

Switch the test to tpcc-100 so it can be stable; we'll reintroduce a
test dedicated to overload conditions in the future (when we can make
it stable).

Fixes cockroachdb#35985
Fixes cockroachdb#37017

Release note: None
craig bot pushed a commit that referenced this issue Apr 23, 2019
37046: roachtest: Shrink scrub workloads r=lucy-zhang a=bdarnell

The scrub roachtest was previously running tpcc-1000 on a cluster of
12 total vcpus, which is not enough (it needs ~double that). This
exposed a lot of interesting issues like #35986, but it's only
incidental to the main purpose of this test (and it's also flaky due
to uninteresting problems associated with overloading).

Switch the test to tpcc-100 so it can be stable; we'll reintroduce a
test dedicated to overload conditions in the future (when we can make
it stable).

Fixes #35985
Fixes #37017

Release note: None

Co-authored-by: Ben Darnell <ben@bendarnell.com>
craig bot pushed a commit that referenced this issue Apr 24, 2019
36984: storage: Add comments about leases and liveness r=tbg a=bdarnell

I had to study this code for #35986, so I took the opportunity to
update the documentation while it's fresh in my mind.

Release note: None

37100: roachtest: Hopefully deflake election-after-restart r=tbg a=bdarnell

Logs suggest that we're seeing failures when nodes aren't caught up on
all the split processing before the restart, so give this a chance to
finish.

Updates #35047

Release note: None

Co-authored-by: Ben Darnell <ben@bendarnell.com>