Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: kv/contention/nodes=4 failed #40786

Closed
cockroach-teamcity opened this issue Sep 15, 2019 · 52 comments · Fixed by #45603
Closed

roachtest: kv/contention/nodes=4 failed #40786

cockroach-teamcity opened this issue Sep 15, 2019 · 52 comments · Fixed by #45603
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/62b1678f652461bbc1aaf6bc2c0dd03105ce0ebe

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1489712&tab=buildLog

The test failed on branch=40765, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190915-1489712/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2091,errgroup.go:57: spent 7.017544% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Sep 15, 2019
@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Sep 15, 2019
@nvanbenschoten
Copy link
Member

We do see a few two minute long periods of 0 qps, which is exactly what the test sets COCKROACH_TXN_LIVENESS_HEARTBEAT_MULTIPLIER to. This indicates a situation where transactions are waiting for another transaction's liveness to expire.

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3221.0s        0            0.0          278.8      0.0      0.0      0.0      0.0 write
 3222.0s        0            0.0          278.7      0.0      0.0      0.0      0.0 write
 3223.0s        0            0.0          278.6      0.0      0.0      0.0      0.0 write
 3224.0s        0            0.0          278.5      0.0      0.0      0.0      0.0 write
 3225.0s        0            0.0          278.4      0.0      0.0      0.0      0.0 write
 3226.0s        0            0.0          278.3      0.0      0.0      0.0      0.0 write
 3227.0s        0            0.0          278.2      0.0      0.0      0.0      0.0 write
 3228.0s        0            0.0          278.1      0.0      0.0      0.0      0.0 write
 3229.0s        0            0.0          278.1      0.0      0.0      0.0      0.0 write
 3230.0s        0            0.0          278.0      0.0      0.0      0.0      0.0 write
 3231.0s        0            0.0          277.9      0.0      0.0      0.0      0.0 write
 3232.0s        0            0.0          277.8      0.0      0.0      0.0      0.0 write
 3233.0s        0            0.0          277.7      0.0      0.0      0.0      0.0 write
 3234.0s        0            0.0          277.6      0.0      0.0      0.0      0.0 write
 3235.0s        0            0.0          277.5      0.0      0.0      0.0      0.0 write
 3236.0s        0            0.0          277.5      0.0      0.0      0.0      0.0 write
 3237.0s        0            0.0          277.4      0.0      0.0      0.0      0.0 write
 3238.0s        0            0.0          277.3      0.0      0.0      0.0      0.0 write
 3239.0s        0            0.0          277.2      0.0      0.0      0.0      0.0 write
 3240.0s        0            0.0          277.1      0.0      0.0      0.0      0.0 write

This was on branch 40765, but something that failed there should also fail on master. @irfansharif do you have any interest in trying to reproduce this or seeing if you can piece anything together from the logs?

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c6342c90a7fa4ceb1b674faa47a95e1726d05e79

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1496387&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190919-1496387/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2091,errgroup.go:57: spent 17.543860% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@irfansharif
Copy link
Contributor

Taking a look at this now.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c0cab6082c2383e77f0a806c10a1506b6aa7b6bf

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1525981&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191008-1525981/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2120,errgroup.go:57: spent 14.035088% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/06106c6209709b9ef3f1d194a70e71ddf812c4ce

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1542489&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191017-1542489/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2125,errgroup.go:57: spent 29.824561% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86d6a7e0f571e2e2c637fc3d73318f2e2336451b

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1547645&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191020-1547645/kv/contention/nodes=4/run_1
	ts_util.go:116,kv.go:239,cluster.go:2138,errgroup.go:57: average txns per second 75.183555 was under target 100.000000

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/b7719563a787dec1da419a7a08bcc1e51df3fe2c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1554121&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=provisional_201910212329_v19.2.0-rc.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191023-1554121/kv/contention/nodes=4/run_1
	ts_util.go:116,kv.go:239,cluster.go:2138,errgroup.go:57: average txns per second 47.831668 was under target 100.000000

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/0f473848083559c8a98be032949df9428068c223

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1565997&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191031-1565997/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2142,errgroup.go:57: spent 31.578947% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/239513342a2d23f683bbc1d386f87ff59cc78d10

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1575479&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=provisional_201910141814_v19.2.0-rc.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191106-1575479/kv/contention/nodes=4/run_1
	cluster.go:1712,kv.go:204,test_runner.go:697: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --env=COCKROACH_TXN_LIVENESS_HEARTBEAT_MULTIPLIER=120 teamcity-1575479-1573004318-50-n5cpu4:1-4 returned:
		stderr:
		
		stdout:
		teamcity-1575479-1573004318-50-n5cpu4: starting...........
		0: exit status 255
		~ ./cockroach version
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.getCockroachVersion
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:88
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:149
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1553
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1337: 
		I191106 04:49:07.944723 1 cluster_synced.go:1635  command failed
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ff13356a005e446f3f11bd37cc9772f568d8f41f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1578961&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=release-19.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191107-1578961/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2142,errgroup.go:57: spent 5.172414% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/33b96613ae532b25a1b6b716453bece9b60ba2d6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1583742&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191109-1583742/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2142,errgroup.go:57: spent 15.789474% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c019acf1ed7afc16e04dfe324d1b98b84d39a489

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1583760&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=release-19.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191109-1583760/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2142,errgroup.go:57: spent 28.070175% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/0e9dd73f803247cdcfd06f51ce6b23396af1b9f5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1587121&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191112-1587121/kv/contention/nodes=4/run_1
	cluster.go:1712,kv.go:204,test_runner.go:697: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --env=COCKROACH_TXN_LIVENESS_HEARTBEAT_MULTIPLIER=120 teamcity-1587121-1573544175-44-n5cpu4:1-4 returned:
		stderr:
		
		stdout:
		teamcity-1587121-1573544175-44-n5cpu4: starting.....
		2: exit status 255
		~ ./cockroach version
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.getCockroachVersion
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:88
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:149
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1553
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1337: 
		I191112 10:43:21.642237 1 cluster_synced.go:1635  command failed
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/64162f27783bd59799531ff977f0fd1d0fd5ae86

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1587139&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=release-19.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191112-1587139/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2142,errgroup.go:57: spent 5.263158% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@irfansharif
Copy link
Contributor

irfansharif commented Nov 12, 2019

Update: I still have this on my backburner, and I'm thinking about it. A lot of the failures above are benign, they fail at ~5.5% of time below target throughput (the threshold is set at 5%). But a few others are not, and need more thought.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/e26c2f26ca95796f84e7396b832b80f5d53605ae

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1589114&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=release-19.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191113-1589114/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2142,errgroup.go:57: spent 5.263158% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/647db0fecd18d8de760443131c6edb33828a619a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1599851&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=release-19.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191120-1599851/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2142,errgroup.go:57: spent 40.350877% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/205330426e50b966ede8a4644d77570d8265fe66

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1602072&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191121-1602072/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2142,errgroup.go:57: spent 35.087719% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/581a2bff2f50d5626f68bc1d51af0cf05b455dfc

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1618924&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191130-1618924/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2142,errgroup.go:57: spent 7.017544% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1da69d917105a0280aad10e86a7ee8eb2059cc92

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1623285&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=provisional_201912031738_v20.1.0-alpha20191209, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191203-1623285/kv/contention/nodes=4/run_1
	ts_util.go:116,kv.go:239,cluster.go:2142,errgroup.go:57: average txns per second 44.719369 was under target 100.000000

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ed717cbaf741e3a32c76db25b16a59dc2a8221d7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1624103&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191204-1624103/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2142,errgroup.go:57: spent 7.017544% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/e1c1a9ae765730782f3adc2298a76d6a07ad055d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1626298&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191205-1626298/kv/contention/nodes=4/run_1
	ts_util.go:116,kv.go:239,cluster.go:2142,errgroup.go:57: average txns per second 71.079843 was under target 100.000000

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c85d435c2caa7a3066b1affdcc34b7ed6a6f3653

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1633173&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=release-19.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191208-1633173/kv/contention/nodes=4/run_1
	ts_util.go:116,kv.go:239,cluster.go:2142,errgroup.go:57: average txns per second 57.680178 was under target 100.000000

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9d697080f227324220c09d1b63da7bbf969133ac

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1633532&tab=artifacts#/kv/contention/nodes=4

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191209-1633532/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2142,errgroup.go:57: spent 7.017544% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@cockroach-teamcity
Copy link
Member Author

(roachtest).kv/contention/nodes=4 failed on master@3bb183403b4bc75c25fbcfc69d7d35de76d2b984:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200119-1699769/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2180,errgroup.go:57: spent 35.087719% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

Repro

Artifacts: /kv/contention/nodes=4
roachdash

powered by pkg/cmd/internal/issues

@tbg tbg added the branch-master Failures and bugs on the master branch. label Jan 22, 2020
@cockroach-teamcity
Copy link
Member Author

(roachtest).kv/contention/nodes=4 failed on master@3c408a47707a15d4b5c92031c4eafcfa643d307c:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200124-1708438/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:239,cluster.go:2294,errgroup.go:57: spent 12.280702% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

More

Artifacts: /kv/contention/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).kv/contention/nodes=4 failed on master@b67d7d14e22f227caa4a2bc3727e3baf815fb25e:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/kv/contention/nodes=4/run_1
	ts_util.go:130,kv.go:241,cluster.go:2344,errgroup.go:57: spent 61.403509% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

	cluster.go:2368,kv.go:244,test_runner.go:741: error with attached stack trace:
		    main.(*monitor).WaitE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2356
		    main.(*monitor).Wait
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2364
		    main.registerKVContention.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/kv.go:244
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:741
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - monitor failure:
		  - error with attached stack trace:
		    main.(*monitor).wait.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2412
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - monitor task failed:
		  - error with attached stack trace:
		    main.init
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2309
		    runtime.doInit
		    	/usr/local/go/src/runtime/proc.go:5222
		    runtime.main
		    	/usr/local/go/src/runtime/proc.go:190
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - Goexit() was called

More

Artifacts: /kv/contention/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).kv/contention/nodes=4 failed on master@2739821b911d777fa2a927295d699b559360a802:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/kv/contention/nodes=4/run_1
	ts_util.go:48,ts_util.go:102,kv.go:241,cluster.go:2344,errgroup.go:57: Post http://35.238.41.231:26258/ts/query: read tcp 172.17.0.2:53100->35.238.41.231:26258: read: connection reset by peer

	cluster.go:2368,kv.go:244,test_runner.go:741: error with attached stack trace:
		    main.(*monitor).WaitE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2356
		    main.(*monitor).Wait
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2364
		    main.registerKVContention.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/kv.go:244
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:741
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - monitor failure:
		  - error with attached stack trace:
		    main.(*monitor).wait.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2412
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - monitor task failed:
		  - error with attached stack trace:
		    main.init
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2309
		    runtime.doInit
		    	/usr/local/go/src/runtime/proc.go:5222
		    runtime.main
		    	/usr/local/go/src/runtime/proc.go:190
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - Goexit() was called

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1747374-1581581035-61-n5cpu4 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: 3981
		4: 4367
		3: 4051
		1: dead
		Error:  1: dead

More

Artifacts: /kv/contention/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@nvanbenschoten
Copy link
Member

Uh oh, @andreimatei. Looks like this is from da5dc5b.

F200213 11:29:00.000718 2116669 kv/txn_interceptor_span_refresher.go:231  [n1,txn=7a6443fb] unexpected WriteTooOld request. ba: [txn: 7a6443fb], EndTxn(commit:false tsflex:false) [/Min]  (txn: "sql txn" meta={id=7a6443fb key=/Table/53/1/-4819943175058341713/0 pri=0.02601487 epo=0 ts=1581593337.993660978,1 min=1581593336.484904973,0 seq=13} rw=true stat=PENDING rts=1581593336.484904973,0 wto=true max=1581593336.984904973,0)
goroutine 2116669 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x7408001, 0xed5d729fc, 0x0, 0x34881e71f01)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x7404f80, 0xc000000004, 0x699faa6, 0x24, 0xe7, 0xc0063794a0, 0x160)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:211 +0xa0c
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x4b19080, 0xc008729950, 0x4, 0x2, 0x43227aa, 0x30, 0xc008a14f20, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x4b19080, 0xc008729950, 0x1, 0x4, 0x43227aa, 0x30, 0xc008a14f20, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:155
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc0051c7538, 0x4b19080, 0xc008729950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:231 +0x3d0
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).SendLocked(0xc0051c7538, 0x4b19080, 0xc008729950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:179 +0x158
github.com/cockroachdb/cockroach/pkg/kv.(*txnPipeliner).SendLocked(0xc0051c7488, 0x4b19080, 0xc008729950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_pipeliner.go:208 +0x159
github.com/cockroachdb/cockroach/pkg/kv.(*txnSeqNumAllocator).SendLocked(0xc0051c7468, 0x4b19080, 0xc008729950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_seq_num_allocator.go:105 +0x20c
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeater).SendLocked(0xc0051c73c8, 0x4b19080, 0xc008729950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeater.go:168 +0x1a9
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc0051c7200, 0x4b19080, 0xc008729950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:493 +0x406
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).sendUsingSender(0xc000113b80, 0x4b19040, 0xc0087e2d20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:754 +0x122
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send(0xc0063821b0, 0x4b19040, 0xc0087e2d20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:870 +0x11e
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).rollback.func1.1(0x4b19040, 0xc0087e2d20, 0xb2d05e00, 0x4b19040)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:689 +0x98
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout(0x4b19040, 0xc0087e2d20, 0x42ad0fa, 0x12, 0xb2d05e00, 0xc003420e00, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:135 +0xbf
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).rollback.func1(0x4b18fc0, 0xc006860240)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:688 +0x174
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0003a6510, 0x4b18fc0, 0xc006860240, 0xc006b1cce0, 0x16, 0x0, 0x0, 0xc006659940)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:322 +0x152
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:317 +0x131

@andreimatei andreimatei self-assigned this Feb 13, 2020
@andreimatei
Copy link
Contributor

looking

@irfansharif irfansharif removed their assignment Feb 13, 2020
andreimatei added a commit to andreimatei/cockroach that referenced this issue Feb 13, 2020
The span refresher interceptor is supposed to terminate the
txn.WriteTooOld flag. For that, it asserts that the request doesn't not
have that flag set on it; only responses should have it. There was,
however, a case where terminating the flag was not happening as
intended: when a non-retriable error was coming back with the
WriteTooOld flag set, the flag made its way into the client's txn,
slipping past the interceptor. This was causing further requests to fail
the aforementioned assertion. (Since errors are at play here, the only
further request should be a Rollback).

This patch fixes this by having servers never return errors with this
flag set. Setting this flag on an error does not make sense anyway.

Touches cockroachdb#40786

Release note: None
@cockroach-teamcity
Copy link
Member Author

(roachtest).kv/contention/nodes=4 failed on master@e671a4ef97cbc6cf5d22f8f322fd45733d302094:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/kv/contention/nodes=4/run_1
	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1749385-1581667748-53-n5cpu4 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		1: 4205
		3: 4259
		2: 4404
		4: dead
		Error:  4: dead

More

Artifacts: /kv/contention/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

andreimatei added a commit to andreimatei/cockroach that referenced this issue Feb 14, 2020
The span refresher interceptor is supposed to terminate the
txn.WriteTooOld flag. For that, it asserts that the request doesn't not
have that flag set on it; only responses should have it. There was,
however, a case where terminating the flag was not happening as
intended: when a non-retriable error was coming back with the
WriteTooOld flag set, the flag made its way into the client's txn,
slipping past the interceptor. This was causing further requests to fail
the aforementioned assertion. (Since errors are at play here, the only
further request should be a Rollback).

This patch fixes this by having servers never return errors with this
flag set. Setting this flag on an error does not make sense anyway.

Touches cockroachdb#40786

Release note: None
craig bot pushed a commit that referenced this issue Feb 14, 2020
45102: storage,kv: don't return errors with the WriteTooOld flag set r=andreimatei a=andreimatei

The span refresher interceptor is supposed to terminate the
txn.WriteTooOld flag. For that, it asserts that the request doesn't not
have that flag set on it; only responses should have it. There was,
however, a case where terminating the flag was not happening as
intended: when a non-retriable error was coming back with the
WriteTooOld flag set, the flag made its way into the client's txn,
slipping past the interceptor. This was causing further requests to fail
the aforementioned assertion. (Since errors are at play here, the only
further request should be a Rollback).

This patch fixes this by having servers never return errors with this
flag set. Setting this flag on an error does not make sense anyway.

Touches #40786

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@cockroach-teamcity
Copy link
Member Author

(roachtest).kv/contention/nodes=4 failed on master@5f9a71adb995837bcff27b9456188018434be4b8:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200303-1780216/kv/contention/nodes=4/run_1
	ts_util.go:116,kv.go:259,cluster.go:2344,errgroup.go:57: average txns per second 1.911392 was under target 100.000000

	cluster.go:2368,kv.go:262,test_runner.go:741: error with attached stack trace:
		    main.(*monitor).WaitE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2356
		    main.(*monitor).Wait
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2364
		    main.registerKVContention.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/kv.go:262
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:741
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - monitor failure:
		  - error with attached stack trace:
		    main.(*monitor).wait.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2412
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - monitor task failed:
		  - error with attached stack trace:
		    main.init
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2309
		    runtime.doInit
		    	/usr/local/go/src/runtime/proc.go:5222
		    runtime.main
		    	/usr/local/go/src/runtime/proc.go:190
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - Goexit() was called

More

Artifacts: /kv/contention/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@irfansharif
Copy link
Contributor

irfansharif commented Mar 3, 2020

@nvanbenschoten: This looks related to #45568, is it?

@nvanbenschoten
Copy link
Member

Hopefully - #45568 made the test more antagonistic and #45603 is fixing the deadlock that doing so exacerbates.

craig bot pushed a commit that referenced this issue Mar 3, 2020
45595: sql: make cleanup jobs spawned by alter primary key not cancelable r=spaskob a=rohany

Fixes #45500.

This PR makes the job spawned by ALTER PRIMARY KEY that cleans
up indexes be uncancelable.

This PR additionally fixes a related bug where ALTER PRIMARY KEY
would spawn a job when it didn't actually enqueue any mutations
on the table descriptor, causing future schema changes on the
table to hang.

Release note (sql change): This PR makes the cleanup job spawned
by ALTER PRIMARY KEY in some cases uncancelable.

45603: storage/txnwait: terminate push when pusher aborted at lower epoch r=nvanbenschoten a=nvanbenschoten

Closes #40786.
Closes #44336.

This commit resolves a bug in distributed deadlock detection that would
allow a deadlock between transactions to go undetected, stalling the
workload indefinitely.

The issue materialized as follows:
1. two transactions would deadlock and each enter a txnwait queue
2. they would poll their pushees record along with their own
3. deadlock detection would eventually pick this up and abort one of the txns
   using the pusher's copy of the txn's proto
4. however, the aborted txn has since restarted and bumped it epoch
5. the aborted txn continued to query its record, but failed to ingest any
   updates from it because the record was at a lower epoch than its own
   copy of its txn proto. So it never noticed that it was ABORTED
6. all other txns in the system including the original contending txn
   piled up behind the aborted txn in the contention queue, waiting for
   it to notice it was aborted and exit the queue
7. deadlock!

I'm optimistically closing the two `kv/contention/nodes=4` issues both
because I hope this is the cause of their recent troubles and also because
I've been spending a lot of time with the test recently in light of #45482
and plan to stabilize it fully.

I plan to backport this to release-19.2. This doesn't need to go all the
way back to release-19.1 because this was introduces in aed892a.

Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in 0a1f251 Mar 3, 2020
koorosh added a commit to koorosh/cockroach that referenced this issue Mar 6, 2020
commit feb3d45
Author: David Hartunian <davidh@cockroachlabs.com>
Date:   Mon Mar 2 22:53:52 2020 -0500

    sql: add API for accessing statement diagnostics

    Added 3 new RPC endpoints under `_status` for manipulating Statement Diagnostics
    Requests:

    1. `CreateStatementDiagnosticsRequest`
    Creates a new request for a given statement fingerprint

    2. `StatementDiagnosticsRequests`
    Fetches all statement diagnostics requests

    3. `StatementDiagnostics`
    Fetches a specific statement diagnostics request that has been completed and
    includes the JSON payload of the trace

    Basic implementations for these endpoints have been added to the
    `apiReducers.tsx` and `api.ts` files for use by the Admin UI.

    A minor bug where completed requests did not have `completedAt` timestamp set
    was fixed.

commit 9d684ea
Merge: 2e621c7 62c8667
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Wed Mar 4 02:09:24 2020 +0000

    Merge cockroachdb#45575

    45575: cli: reveal SQL error details/hints in CLI commands r=ajwerner a=knz

    Found while investigating cockroachdb#43114. First commit from that PR.

    Previously, when a CLI command that uses SQL under the hood
    encountered an error, only the main error message was displayed.

    This patch changes it to reveal the other additional user-visible
    fields.

    Release note (cli change): Certain CLI command now provide more
    details and/or a hint when they encounter an error.

    Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>

commit 2e621c7
Merge: 7b64b0b 3daeced
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Wed Mar 4 00:26:33 2020 +0000

    Merge cockroachdb#45520

    45520: opt: FK checks for Upsert r=RaduBerinde a=RaduBerinde

    These commit add support for opt-driven FK checks for UPSERT and INSERT ON CONFLICT DO UPDATE. This is a simplified and cleaned up reimplementation of cockroachdb#45095.

    There aren't a lot of logictests around FKs and upsert. I plan to try to build a randomized testing facility for FK checks, where we generate randomized mutations and cross-check FK violations or lack there of against a separate instance of the database without FK relationships.

    #### Revert "opt: add first half of upsert FK checks"

    This reverts commit 96447c8.

    We are reconsidering the direction of the upsert FK checks: instead of filtering
    inserted vs updated rows, we can use the return columns to get the "new" values.
    This change also made some things harder to understand: we used to build the FK
    check input using just the columns corresponding to the FK instead of all
    columns.

    Release note: None

    #### opt: refactor optbuilder FK check code

    The optbuilder FK check code has become unnecessarily complicated:
     - there are two ways of creating a WithScan (`makeFKInputScan` and
       `projectOrdinals`) which are similar but take different kinds of information
       as input.
     - there are various slices of column ordinals or column IDs that are threaded
       through long parts of code, making it hard to follow.

    This change cleans this up by creating a fkCheckHelper which contains the
    metadata related to FK table ordinals and is capable of creating a WithScan with
    either the "new" or the "fetched" values.

    The new code should generate the same effective plans; the differences are:
     - we now always create the WithScan before the other table Scan so some column IDs in the plan changed;
     - we no longer have an unnecessary projection for Update (it was used to
       renumber the columns coming out of WithScan).

    Release note: None

    #### opt: add outbound FK checks for upsert

    This is a re-implementation of Justin's change 3d1dd0f, except that we now
    perform the insertion check for all new rows instead of just the inserted rows.
    We do this by utilizing the same column that would be used for a RETURNING
    clause, which in some cases is of the form
    `CASE WHEN canary IS NULL col1 ELSE col2`.

    Release note: None

    #### opt: add inbound FK checks for upsert

    This change adds inbound FK checks for upsert and switches execution over to the
    new style FK checks for upsert.

    Similar to UPDATE, the inbound FK checks run on the set difference between "old"
    values for the FK columns and "new" values.

    Release note (performance improvement): improved execution plans of foreign key
    checks for UPSERT and INSERT ON CONFLICT in some cases (in particular
    multi-region).

    Co-authored-by: Radu Berinde <radu@cockroachlabs.com>

commit 3daeced
Author: Radu Berinde <radu@cockroachlabs.com>
Date:   Tue Mar 3 16:23:50 2020 -0800

    opt: add inbound FK checks for upsert

    This change adds inbound FK checks for upsert and switches execution over to the
    new style FK checks for upsert.

    Similar to UPDATE, the inbound FK checks run on the set difference between "old"
    values for the FK columns and "new" values.

    Release note (performance improvement): improved execution plans of foreign key
    checks for UPSERT and INSERT ON CONFLICT in some cases (in particular
    multi-region).

commit b8ebc3a
Author: Radu Berinde <radu@cockroachlabs.com>
Date:   Tue Mar 3 16:23:50 2020 -0800

    opt: add outbound FK checks for upsert

    This is a re-implementation of Justin's change 3d1dd0f, except that we now
    perform the insertion check for all new rows instead of just the inserted rows.
    We do this by utilizing the same column that would be used for a RETURNING
    clause, which in some cases is of the form
    `CASE WHEN canary IS NULL col1 ELSE col2`.

    Release note: None

commit 75e6009
Author: Radu Berinde <radu@cockroachlabs.com>
Date:   Tue Mar 3 16:23:50 2020 -0800

    opt: refactor optbuilder FK check code

    The optbuilder FK check code has become unnecessarily complicated:
     - there are two ways of creating a WithScan (`makeFKInputScan` and
       `projectOrdinals`) which are similar but take different kinds of information
       as input.
     - there are various slices of column ordinals or column IDs that are threaded
       through long parts of code, making it hard to follow.

    This change cleans this up by creating a fkCheckHelper which contains the
    metadata related to FK table ordinals and is capable of creating a WithScan with
    either the "new" or the "fetched" values.

    The new code should generate the same effective plans; the differences are:
     - we now always create the WithScan before the other table Scan so some column IDs in the plan changed;
     - we no longer have an unnecessary projection for Update (it was used to
       renumber the columns coming out of WithScan).

    Release note: None

commit aa2295d
Author: Radu Berinde <radu@cockroachlabs.com>
Date:   Tue Mar 3 16:23:50 2020 -0800

    Revert "opt: add first half of upsert FK checks"

    This reverts commit 96447c8.

    We are reconsidering the direction of the upsert FK checks: instead of filtering
    inserted vs updated rows, we can use the return columns to get the "new" values.
    This change also made some things harder to understand: we used to build the FK
    check input using just the columns corresponding to the FK instead of all
    columns.

    Release note: None

commit 7b64b0b
Merge: ede97bc 5b1598d
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 23:33:09 2020 +0000

    Merge cockroachdb#45513

    45513: sql: disable primary key changes when a schema change is in progress r=lucy-zhang a=rohany

    Fixes cockroachdb#45362.

    Release note (sql change): This PR disables primary key changes
    when a concurrent schema change is executing on the same table,
    or if a schema change has been started on the same table in
    the current transaction.

    Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>

commit ede97bc
Merge: bbbb26b 7383e9d
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 22:44:11 2020 +0000

    Merge cockroachdb#45597

    45597: opt: extend WithUses and improve NeededMutationCols r=RaduBerinde a=RaduBerinde

    This change extends the WithUses rule prop to keep track of the columns actually
    used by `WithScan`s.

    This set is then used by NeededMutationCols to make sure we never prune mutation
    input columns that are used by FK checks. Currently this never happens, but
    because of fairly subtle reasons: FKs happen to require indexes on both sides,
    and those indexes force the mutation operator to require values for those
    columns. This won't be the case anymore with upsert FK checks, for which this
    fix is required.

    The new information will also be used (in a future change) to prune unused
    columns of `With` itself.

    Release note: None

    Co-authored-by: Radu Berinde <radu@cockroachlabs.com>

commit bbbb26b
Merge: e1e1f2c d80411f
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 21:58:08 2020 +0000

    Merge cockroachdb#45397

    45397: sql: disallow other schema changes during an in progress primary key change r=lucy-zhang a=rohany

    Fixes cockroachdb#45363.

    Release note (sql change): This commit disables the use of schema
    changes when a primary key change is in progress. This includes
    the following:
    * running a primary key change in a transaction, and then starting
      another schema change in the same transaction.
    * starting a primary key change on one connection, and then starting
      a schema change on the same table on another connection while
      the initial primary key change is currently executing.

    Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>

commit e1e1f2c
Merge: 064c4ea f38087b 66d502f 2b7aa46 35095d9
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 20:36:07 2020 +0000

    Merge cockroachdb#45472 cockroachdb#45607 cockroachdb#45609 cockroachdb#45613

    45472: storage/engine: Teeing engine fixes r=itsbilal a=itsbilal

    This change introduces some misc teeing engine fixes, such as proper
    cache initialization, copying of SSTs before ingestions (now that
    pebble also deletes SSTs after an Ingest), and better null msg
    handling in GetProto. The cache part fixes a segfault in
    GetStats.

    Fixes cockroachdb#42654 .

    Release note: None.

    45607: sql: fix usages of ActiveVersionOrEmpty in {create, alter}_table.go r=irfansharif a=rohany

    Fixes cockroachdb#45519.

    This PR removes an incorrect usage of ActiveVersionOrEmpty in
    alter_table.go, and updates a comment in create_table.go detailing its
    usage.

    Release note: None

    45609: sql: support star-expanding label-less tuples + numeric tuple indexing r=RaduBerinde a=knz

    Requested by @RaduBerinde  / @ajwerner .

    Previously, it was impossible to use `(t).*` to expand a single
    tuple-typed column into multiple projections if the tuple was not
    labeled to start with.

    This limitation was caused by another limitation, that it was not
    possible to refer to a single column in a tuple if the tuple was not
    already labeled.

    This patch addresses both limitations.

    Release note (sql change): CockroachDB now supports expanding all
    columns of a tuple using the `.*` notation, for example `SELECT (t).*
    FROM (SELECT (1,'b',2.3) AS t)`. This is a CockroachDB-specific extension.

    Release note (sql change): CockroachDB now supports accessing the Nth
    column in a column with tuple type using the syntax `(...).@N`, for
    example `SELECT (t).@2 FROM (SELECT (1,'b',2.3) AS t)`. This is a
    CockroachDB-specific extension.

    45613: pgwire: fix decimal decoding with trailing zeroes r=jordanlewis a=otan

    Resolves cockroachdb#25460

    In addition to the release note, I have also modified the docker runfile
    to support elixir tests, and also updated the elixir tests to be run.

    Release note (bug fix): In previous PRs, drivers that do not truncate
    trailing zeroes for decimals in the binary format end up having
    inaccuracies of up to 10^4 during the decode step. In this PR, we fix
    the error by truncating the trailing zeroes as appropriate. This fixes
    known incorrect decoding cases with Postgrex in Elixir.

    Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
    Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
    Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
    Co-authored-by: Oliver Tan <otan@cockroachlabs.com>

commit 35095d9
Author: Oliver Tan <otan@cockroachlabs.com>
Date:   Mon Mar 2 15:40:45 2020 -0800

    pgwire: fix decimal decoding with trailing zeroes

    In addition to the release note, I have also modified the docker runfile
    to support elixir tests, and also updated the elixir tests to be run.

    Release note (bug fix): In previous PRs, drivers that do not truncate
    trailing zeroes for decimals in the binary format end up having
    inaccuracies of up to 10^4 during the decode step. In this PR, we fix
    the error by truncating the trailing zeroes as appropriate. This fixes
    known incorrect decoding cases with Postgrex in Elixir.

commit 064c4ea
Merge: ff2b605 6653127
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 19:08:46 2020 +0000

    Merge cockroachdb#45630

    45630: storageccl: rework TestRandomKeyAndTimestampExport to be shorter r=pbardea a=ajwerner

    The test was taking a very long time due to the tiny page sizes. This commit
    changes the test to scale the total number of keys based on the page size.

    Before:

    ```
    --- PASS: TestRandomKeyAndTimestampExport (19.06s)
    ```

    After:
    ```
    --- PASS: TestRandomKeyAndTimestampExport (2.30s)
    ```

    Release note: None

    Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>

commit f38087b
Author: Bilal Akhtar <bilal@cockroachlabs.com>
Date:   Wed Feb 26 15:45:56 2020 -0500

    storage/engine: Teeing engine fixes

    This change introduces some misc teeing engine fixes, such as proper
    cache initialization, copying of SSTs before ingestions (now that
    pebble also deletes SSTs after an Ingest), and better null msg
    handling in GetProto. The cache part fixes a segfault in
    GetStats.

    It also unifies file handling between in-memory and on-disk
    teeing engines, by ensuring we write to files in each engine's
    aux directory if we're writing to one engine's aux directory.

    Fixes cockroachdb#42654 .

    Release note: None.

commit 5b1598d
Author: Rohan Yadav <rohany@alumni.cmu.edu>
Date:   Thu Feb 27 12:42:24 2020 -0500

    sql: disable primary key changes when a schema change is in progress

    Fixes cockroachdb#45362.

    Release note (sql change): This PR disables primary key changes
    when a concurrent schema change is executing on the same table,
    or if a schema change has been started on the same table in
    the current transaction.

commit d80411f
Author: Rohan Yadav <rohany@alumni.cmu.edu>
Date:   Tue Feb 25 10:59:17 2020 -0500

    sql: disallow other schema changes an in progress primary key change

    Fixes cockroachdb#45363.

    Release note (sql change): This commit disables the use of schema
    changes when a primary key change is in progress. This includes
    the following:
    * running a primary key change in a transaction, and then starting
      another schema change in the same transaction.
    * starting a primary key change on one connection, and then starting
      a schema change on the same table on another connection while
      the initial primary key change is currently executing.

commit ff2b605
Merge: aeb41dc 9dcab01 cee9355
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 16:59:33 2020 +0000

    Merge cockroachdb#45347 cockroachdb#45502

    45347: sql: make secondary indexes not write empty k/v's + bugfixes for primary key changes r=jordanlewis a=rohany

    Fixes cockroachdb#45223.

    Depends on cockroachdb#45226 to land first.

    This PR fixes many bugs around secondary index encodings
    and CRUD operations k/v reads and writes.

    * Fixes a problem secondary indexes would write empty
      k/v's if it contained a family that had all null values.
    * Fix multiple bugs where updates to a table during an online
      primary key change could result an inconsistent
      new primary key.
    * Fix an assumption in the updater that assumed indexes
      always had the same number of k/v's. The logic has been
      updated to perform a sort of merge operation to decide
      what k/v's to insert/delete during the update operation.
    * Increased testing around secondary indexes k/vs and
      schema change operations.

    The release note is None because these are all bugs
    introduced in 20.1.

    Release note: None

    45502: sql: allow rename database for sequences without a db name reference r=rohany a=otan

    Resolves immediate concern from cockroachdb#45411
    Refs: cockroachdb#34416

    See release note for description. This PR should be included ahead of
    the more "general" fix of changing the DefaultExpr with the new database
    as it unblocks people using
    `experimental_serial_normalization=sql_sequence` from using the database
    rename feature.

    Release note (sql change): Previously, when we renamed a database, any
    table referencing a sequence would be blocked from being able to rename
    the table. This is to block cases where if the table's reference to the
    sequence contains the database name, and the database name changes, we
    have no way of overwriting the table's reference to the sequence in the
    new database. However, if no database name is included in the sequence
    reference, we should continue to allow the database to rename, as is
    implemented with this change.

    Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
    Co-authored-by: Oliver Tan <otan@cockroachlabs.com>

commit 6653127
Author: Andrew Werner <ajwerner@cockroachlabs.com>
Date:   Tue Mar 3 00:33:17 2020 -0500

    storageccl: rework TestRandomKeyAndTimestampExport to be shorter

    The test was taking a very long time due to the tiny page sizes. This commit
    changes the test to scale the total number of keys based on the page size.

    Before:

    ```
    --- PASS: TestRandomKeyAndTimestampExport (19.06s)
    ```

    After:
    ```
    --- PASS: TestRandomKeyAndTimestampExport (2.30s)
    ```

    Release note: None

commit 66d502f
Author: Rohan Yadav <rohany@alumni.cmu.edu>
Date:   Mon Mar 2 18:01:59 2020 -0500

    sql: fix usages of ActiveVersionOrEmpty in {create, alter}_table.go

    Fixes cockroachdb#45519.

    This PR removes an incorrect usage of ActiveVersionOrEmpty in
    alter_table.go, and updates a comment in create_table.go detailing its
    usage.

    Release note: None

commit aeb41dc
Merge: ada086e 3d7aeb3 0a1f251
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 16:10:07 2020 +0000

    Merge cockroachdb#45595 cockroachdb#45603

    45595: sql: make cleanup jobs spawned by alter primary key not cancelable r=spaskob a=rohany

    Fixes cockroachdb#45500.

    This PR makes the job spawned by ALTER PRIMARY KEY that cleans
    up indexes be uncancelable.

    This PR additionally fixes a related bug where ALTER PRIMARY KEY
    would spawn a job when it didn't actually enqueue any mutations
    on the table descriptor, causing future schema changes on the
    table to hang.

    Release note (sql change): This PR makes the cleanup job spawned
    by ALTER PRIMARY KEY in some cases uncancelable.

    45603: storage/txnwait: terminate push when pusher aborted at lower epoch r=nvanbenschoten a=nvanbenschoten

    Closes cockroachdb#40786.
    Closes cockroachdb#44336.

    This commit resolves a bug in distributed deadlock detection that would
    allow a deadlock between transactions to go undetected, stalling the
    workload indefinitely.

    The issue materialized as follows:
    1. two transactions would deadlock and each enter a txnwait queue
    2. they would poll their pushees record along with their own
    3. deadlock detection would eventually pick this up and abort one of the txns
       using the pusher's copy of the txn's proto
    4. however, the aborted txn has since restarted and bumped it epoch
    5. the aborted txn continued to query its record, but failed to ingest any
       updates from it because the record was at a lower epoch than its own
       copy of its txn proto. So it never noticed that it was ABORTED
    6. all other txns in the system including the original contending txn
       piled up behind the aborted txn in the contention queue, waiting for
       it to notice it was aborted and exit the queue
    7. deadlock!

    I'm optimistically closing the two `kv/contention/nodes=4` issues both
    because I hope this is the cause of their recent troubles and also because
    I've been spending a lot of time with the test recently in light of cockroachdb#45482
    and plan to stabilize it fully.

    I plan to backport this to release-19.2. This doesn't need to go all the
    way back to release-19.1 because this was introduces in aed892a.

    Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
    Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

commit ada086e
Merge: b0be21a 80894c3 4caee85 7e0ba7c
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 15:26:58 2020 +0000

    Merge cockroachdb#44130 cockroachdb#45589 cockroachdb#45604

    44130: sql: postgresql dollar-quoted string support r=knz a=damienhollis

    Added support for postgresql dollar-quoted strings in the scanner. A
    dollar-quoted string constant consists of a dollar sign ($), an
    optional "tag" of zero or more characters, another dollar sign, an
    arbitrary sequence of characters that makes up the string content, a
    dollar sign, the same tag that began this dollar quote, and a final
    dollar sign.

    The scanner uses the existing token type of SCONST for dollar-quoted
    strings. As a result, when the AST is formatted as a string, there is
    no knowledge that the original input was dollar-quoted and it is
    therefore formatted as either a plain string or an escaped
    string (depending on the content).

    Fixes cockroachdb#41777.

    Release Note (sql change): CockroachDB now supports string and byte
    array literals using the dollar-quoted notation, as documented here:
    https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-DOLLAR-QUOTING

    45589: sql: update error message for primary key change on an interleave parent r=otan a=rohany

    Fixes cockroachdb#45537.

    This PR updates the error message when trying to perform a primary
    key change on an interleaved parent to include the name of the
    table as well as the names of the interleaved children.

    Release note: None

    45604: opt: factor limit hints into scan and lookup join costs r=rytaft a=rytaft

    This PR is a continuation of cockroachdb#43415. The first commit is copied directly
    from that PR, and the second commit includes a minor fix as well as a
    number of test cases.

    Fixes cockroachdb#34811; the example query in this issue now chooses a lookup join
    as desired. The coster now takes limit hints into account when costing
    scans and lookup joins, and propagates limit hints through lookup joins.

    Release note (sql change): The optimizer now considers the likely number
    of rows an operator will need to provide, and might choose query plans
    based on this. In particular, the optimizer might prefer lookup joins
    over alternatives in some situations where all rows of the join will
    probably not be needed.

    Co-authored-by: damien.hollis <damien.hollis@unimarket.com>
    Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
    Co-authored-by: Céline O'Neil <celineloneil@gmail.com>
    Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>

commit 7e0ba7c
Author: Rebecca Taft <becca@cockroachlabs.com>
Date:   Mon Mar 2 15:35:36 2020 -0600

    opt: make lookup join limit hint consistent with coster, add tests

    This commit adds a new helper function called lookupJoinInputLimitHint, which
    is called by both the coster and the physical props code for limit hint
    calculation. This helper function ensures that both places take into account
    the batch size of 100, and require that the calculated limit hint is a
    multiple of this batch size.

    This commit also adds more tests related to costing of limit hints for
    scans and lookup joins.

    Release note: None

commit 9dcab01
Author: Rohan Yadav <rohany@alumni.cmu.edu>
Date:   Thu Feb 20 15:37:29 2020 -0500

    sql: make secondary indexes not write empty k/v's + bugfixes for primary
    key changes

    Fixes cockroachdb#45223.

    Depends on cockroachdb#45226 to land first.

    This PR fixes many bugs around secondary index encodings
    and CRUD operations k/v reads and writes.

    * Fixes a problem secondary indexes would write empty
      k/v's if it contained a family that had all null values.
    * Fix multiple bugs where updates to a table during an online
      primary key change could result an inconsistent
      new primary key.
    * Fix an assumption in the updater that assumed indexes
      always had the same number of k/v's. The logic has been
      updated to perform a sort of merge operation to decide
      what k/v's to insert/delete during the update operation.
    * Increased testing around secondary indexes k/vs and
      schema change operations.

    The release note is None because these are all bugs
    introduced in 20.1.

    Release note: None

commit 80894c3
Author: damien.hollis <damien.hollis@unimarket.com>
Date:   Sun Jan 19 21:16:13 2020 +1300

    sql: support postgresql dollar-quoted strings

    Added support for postgresql dollar-quoted strings in the scanner. A
    dollar-quoted string constant consists of a dollar sign ($), an
    optional "tag" of zero or more characters, another dollar sign, an
    arbitrary sequence of characters that makes up the string content, a
    dollar sign, the same tag that began this dollar quote, and a final
    dollar sign.

    The scanner uses the existing token type of SCONST for dollar-quoted
    strings. As a result, when the AST is formatted as a string, there is
    no knowledge that the original input was dollar-quoted and it is
    therefore formatted as either a plain string or an escaped
    string (depending on the content).

    Fixes cockroachdb#41777.

    Release Note (sql change): CockroachDB now supports string and byte
    array literals using the dollar-quoted notation, as documented here:
    https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-DOLLAR-QUOTING

commit 3d7aeb3
Author: Rohan Yadav <rohany@alumni.cmu.edu>
Date:   Mon Mar 2 13:46:35 2020 -0500

    sql: make cleanup jobs spawned by alter primary key not cancelable

    Fixes cockroachdb#45500.

    This PR makes the job spawned by ALTER PRIMARY KEY that cleans
    up indexes be uncancelable.

    This PR additionally fixes a related bug where ALTER PRIMARY KEY
    would spawn a job when it didn't actually enqueue any mutations
    on the table descriptor, causing future schema changes on the
    table to hang.

    Release note (sql change): This PR makes the cleanup job spawned
    by ALTER PRIMARY KEY in some cases uncancelable.

commit b0be21a
Merge: 8ffade8 bbac108
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 14:48:15 2020 +0000

    Merge cockroachdb#45621

    45621: sql: allow inverted indexes on mixed-case cols r=jordanlewis a=jordanlewis

    Closes cockroachdb#42944.

    Previously, a bug prevented creation of inverted indexes on columns with
    mixed-case identifiers. Now, the bug is fixed.

    Release note (bug fix): it is now possible to create inverted indexes on
    columns whose names are mixed-case.

    Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>

commit 2b7aa46
Author: Raphael 'kena' Poss <knz@thaumogen.net>
Date:   Tue Mar 3 00:38:55 2020 +0100

    sql: support numeric tuple indexing

    Previously, it was impossible to use `(t).*` to expand a single
    tuple-typed column into multiple projections if the tuple was not
    labeled to start with.

    This limitation was caused by another limitation, that it was not
    possible to refer to a single column in a tuple if the tuple was not
    already labeled.

    This patch addresses both limitations.

    Release note (sql change): CockroachDB now supports expanding all
    columns of a tuple using the `.*` notation, for example `SELECT (t).*
    FROM (SELECT (1,'b',2.3) AS t)`. This is a CockroachDB-specific extension.

    Release note (sql change): CockroachDB now supports accessing the Nth
    column in a column with tuple type using the syntax `(...).@N`, for
    example `SELECT (t).@2 FROM (SELECT (1,'b',2.3) AS t)`. This is a
    CockroachDB-specific extension.

commit 8ffade8
Merge: 120d53f 79ca338
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 14:08:12 2020 +0000

    Merge cockroachdb#45139

    45139: intervalccl: optimize OverlapCoveringMerge r=dt a=C0rWin

    Change OverlapCoveringMerge instead of iterating over coverings for each
    input range to flatten coverings, sort results and produce ranges with
    one pass. Reducing total complexity from O(nm) to O(n*log(n)).

    Signed-off-by: Artem Barger <bartem@il.ibm.com>

    Release note: none

    Co-authored-by: Artem Barger <bartem@il.ibm.com>

commit 120d53f
Merge: 0414b69 e2da8bd
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 11:44:30 2020 +0000

    Merge cockroachdb#45587

    45587: log: secondary logger fixes r=ajwerner a=knz

    Needed by / will be exercised by  cockroachdb#45193

    Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>

commit 0414b69
Merge: 47259c2 3433fbd
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Tue Mar 3 09:35:26 2020 +0000

    Merge cockroachdb#45549

    45549: changefeedccl: pick up TargetBytes for backfill ScanRequest r=ajwerner a=ajwerner

    Now that cockroachdb#44925 has merged, we can use a byte limit rather than a row limit.

    Release note: None

    Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>

commit 0a1f251
Author: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Date:   Mon Mar 2 16:26:24 2020 -0500

    storage/txnwait: terminate push when pusher aborted at lower epoch

    Closes cockroachdb#40786.
    Closes cockroachdb#44336.

    This commit resolves a bug in distributed deadlock detection that would
    allow a deadlock between transactions to go undetected, stalling the
    workload indefinitely.

    The issue materialized as follows:
    1. two transactions would deadlock and each enter a txnwait queue
    2. they would poll their pushees record along with their own
    3. deadlock detection would eventually pick this up and abort one of the txns
       using the pusher's copy of the txn's proto
    4. however, the aborted txn has since restarted and bumped it epoch
    5. the aborted txn continued to query its record, but failed to ingest any
       updates from it because the record was at a lower epoch than its own
       copy of its txn proto. So it never noticed that it was ABORTED
    6. all other txns in the system including the original contending txn
       piled up behind the aborted txn in the contention queue, waiting for
       it to notice it was aborted and exit the queue
    7. deadlock!

    I'm optimistically closing the two `kv/contention/nodes=4` issues both
    because I hope this is the cause of their recent troubles and also because
    I've been spending a lot of time with the test recently in light of cockroachdb#45482
    and plan to stabilize it fully.

    I plan to backport this to release-19.2. This doesn't need to go all the
    way back to release-19.1 because this was introduces in aed892a.

commit bbac108
Author: Jordan Lewis <jordanthelewis@gmail.com>
Date:   Mon Mar 2 23:16:33 2020 -0500

    sql: allow inverted indexes on mixed-case cols

    Previously, a bug prevented creation of inverted indexes on columns with
    mixed-case identifiers. Now, the bug is fixed.

    Release note (bug fix): it is now possible to create inverted indexes on
    columns whose names are mixed-case.

commit cee9355
Author: Oliver Tan <otan@cockroachlabs.com>
Date:   Thu Feb 27 10:11:31 2020 -0800

    sql: allow rename database for sequences without a db name reference

    See release note for description. This PR should be included ahead of
    the more "general" fix of changing the DefaultExpr with the new database
    as it unblocks people using
    `experimental_serial_normalization=sql_sequence` from using the database
    rename feature.

    Release note (sql change): Previously, when we renamed a database, any
    table referencing a sequence would be blocked from being able to rename
    the table. This is to block cases where if the table's reference to the
    sequence contains the database name, and the database name changes, we
    have no way of overwriting the table's reference to the sequence in the
    new database. However, if no database name is included in the sequence
    reference, we should continue to allow the database to rename, as is
    implemented with this change.

commit 7383e9d
Author: Radu Berinde <radu@cockroachlabs.com>
Date:   Mon Mar 2 11:15:12 2020 -0800

    opt: extend WithUses and improve NeededMutationCols

    This change extends the WithUses rule prop to keep track of the columns actually
    used by `WithScan`s.

    This set is then used by NeededMutationCols to make sure we never prune mutation
    input columns that are used by FK checks. Currently this never happens, but
    because of fairly subtle reasons: FKs happen to require indexes on both sides,
    and those indexes force the mutation operator to require values for those
    columns. This won't be the case anymore with upsert FK checks, for which this
    fix is required.

    The new information will also be used (in a future change) to prune unused
    columns of `With` itself.

    Release note: None

commit e2da8bd
Author: Raphael 'kena' Poss <knz@thaumogen.net>
Date:   Mon Mar 2 17:04:52 2020 +0100

    log: ensure that the test log scopes works with secondary loggers

    This patch ensures that the file redirection performed by log.Scope /
    log.ScopeWithoutShowLogs is effective with secondary loggers.

    Release note: None

commit 3433fbd
Author: Andrew Werner <ajwerner@cockroachlabs.com>
Date:   Sat Feb 29 13:39:17 2020 -0500

    changefeedccl: pick up TargetBytes for backfill ScanRequest

    Now that cockroachdb#44925 has merged, we can use a byte limit rather than a row limit.

    Release note: None

commit 589cfa1
Author: Raphael 'kena' Poss <knz@thaumogen.net>
Date:   Mon Mar 2 17:03:18 2020 +0100

    log: ensure secondary loggers get cleaned up

    Previously, if two consecutive tests in a package were using secondary
    loggers, the second test would see the secondary loggers of the first
    test in its logger registry.

    This was problematic because it really made the log files/messages of
    the first test available to the logging file retrieval API.

    This patch cleans this up by de-registering secondary loggers when
    their context is cancelled, which maps to stoppers going down.

    Release note: None

commit 4caee85
Author: Rohan Yadav <rohany@alumni.cmu.edu>
Date:   Mon Mar 2 11:46:04 2020 -0500

    sql: update error message for primary key change on an interleave parent

    Fixes cockroachdb#45537.

    This PR updates the error message when trying to perform a primary
    key change on an interleaved parent to include the name of the
    table as well as the names of the interleaved children.

    Release note: None

commit 9ac5dbc
Author: Céline O'Neil <celineloneil@gmail.com>
Date:   Thu Dec 19 14:02:53 2019 -0500

    opt: factor limit hints into scan and lookup join costs

    Fixes cockroachdb#34811; the example query in this issue now chooses a lookup join
    as desired. The coster now takes limit hints into account when costing
    scans and lookup joins, and propagates limit hints through lookup joins.

    Release note (sql change): The optimizer now considers the likely number
    of rows an operator will need to provide, and might choose query plans
    based on this. In particular, the optimizer might prefer lookup joins
    over alternatives in some situations where all rows of the join will
    probably not be needed.

commit 62c8667
Author: Raphael 'kena' Poss <knz@thaumogen.net>
Date:   Mon Mar 2 14:35:30 2020 +0100

    cli: reveal SQL error details/hints in CLI commands

    Previously, when a CLI command that uses SQL under the hood
    encountered an error, only the main error message was displayed.

    This patch changes it to reveal the other additional user-visible
    fields.

    Release note (cli change): Certain CLI command now provide more
    details and/or a hint when they encounter an error.

commit f978076
Author: Raphael 'kena' Poss <knz@thaumogen.net>
Date:   Wed Dec 11 16:59:31 2019 +0100

    *: allow periods (.) in usernames

    Requested by a user:

    > Currently, there is a restriction for the database username which
    > will limit the certificate-based authentication. It's very common to
    > include .local (e.g.: internal-service2.local) in the CN (Common Name)
    > of a certificate.  The AWS Certificate Manager (ACM) won't even issue
    > a certificate if the "dot" (.) is not present.

    Release note (sql change): Usernames can now contain periods, for
    compatibility with certificate managers that require domain names to
    be used as usernames.

    Release note (security update): The validation rule for principal
    names was extended to support periods and thus allow domainname-like
    principals. For reference, the validation rule is currently the regular
    expression `^[\p{Ll}0-9_][---\p{Ll}0-9_.]+$` and a size limit of 63
    UTF-8 bytes (larger usernames are rejected with an error); for
    comparison, PostgreSQL allows many more characters and truncates at 63
    chacters silently.

commit 79ca338
Author: Artem Barger <bartem@il.ibm.com>
Date:   Sun Feb 16 17:28:58 2020 +0200

    intervalccl: optimize OverlapCoveringMerge

    Change OverlapCoveringMerge instead of iterating over coverings for each
    input range to flatten coverings, sort results and produce ranges with
    one pass. Reducing total complexity from O(nm) to O(n*log(n)).

    Signed-off-by: Artem Barger <bartem@il.ibm.com>

    Release note: none

Release note (<category, see below>): <what> <show> <why>
@nvanbenschoten nvanbenschoten mentioned this issue Jun 23, 2020
22 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants