roachtest: clearrange/zfs/checks=true failed #68303
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ eef03a46f2e43ff70485dadf7d9ad445db05cab4:
Reproduce
See: roachtest README
See: CI job to stress roachtests
For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 6b8d59327add74cf1342345fb3eaffc3a3e765d2:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 50ef2fc205baa65c5a740c2d614fe1de279367e9:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ cab185ff71f0924953d987fe6ffd14efdd32a3a0:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 847514dab6354d4cc4ccf7b2857487b32119fb37:
These are failing sporadically during the "import" workload. Looking into it.
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 90809c048d05f923a67ce9b89597b2779fc73e32:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 0880e83e30ee5eb9aab7bb2297324e098d028225:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 7897f24246bef3cb94f9f4bfaed474ecaa9fdee6:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 11e0a4da82124e70e772a009011ca7a4007bff85:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ d18da6c092bf1522e7a6478fe3973817e318c247:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 61bd543ba7288c8f0eed6cddded7b219c9d1fcd4:
roachtest.clearrange/checks=true failed with artifacts on master @ 8cae60f603ccc4d83137167b3b31cab09be9d41a:
Same failure on other branches
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 8cae60f603ccc4d83137167b3b31cab09be9d41a:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 44ea1fa0eba8fc78544700ef4afded62ab98a021:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 0b57dc40deda1206d9a1c215ffdb219bbf182a39:
roachtest.clearrange/checks=true failed with artifacts on master @ c1ef81f5f435b3cc5bdf8b218532e0779f03a6bf:
Same failure on other branches
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 15b773c71f92d643795e34c922717fde0447f9cd:
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 42e5f9492d0d8d93638241303bca984fe78baae3:
Nice job extracting all of that, Nick! That goroutine is interesting because it is not stuck, just slow.

Could this slowness be explained by unexpectedly high consistency checker concurrency? All consistency checks on a single node will share the same rate limiter.

@tbg, would any of the replication circuit breaker work have led to the consistency checker queue detaching its context from an ongoing consistency check and moving on without the consistency check being canceled? If so, could this explain why we have more consistency checks running concurrently than the individual queue would allow?
Should have also mentioned that the stacks from above were taken from the latest build.

Here's a look at what we're calling the "bad" sha (i.e. 6664d0c). Same problem, just much more pronounced.

LSM state (on worst node):

Store 6:
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 1 6.5 M - 220 M - - - - 221 M - - - 1.0
0 0 0 B 0.00 215 M 685 M 137 0 B 0 52 M 72 0 B 0 0.2
1 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
2 7 25 M 0.57 39 M 0 B 0 0 B 0 96 M 28 97 M 1 2.4
3 21 103 M 0.33 252 M 1.2 G 96 31 M 12 309 M 113 329 M 1 1.2
4 355 4.3 G 1.00 1.2 G 21 G 1.7 K 638 M 100 1.6 G 318 1.6 G 1 1.4
5 2171 29 G 1.00 7.2 G 108 G 7.1 K 11 G 1.0 K 11 G 1.0 K 11 G 1 1.5
6 2803 98 G - 97 G 1.6 G 125 36 M 8 169 G 5.9 K 169 G 1 1.7
total 5357 131 G - 132 G 132 G 9.1 K 12 G 1.1 K 314 G 7.4 K 182 G 5 2.4
flush 36
compact 4326 66 G 4.6 M 1 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
ctype 3159 5 29 1129 4 (default, delete, elision, move, read)
memtbl 1 64 M
zmemtbl 13 280 M
ztbl 6294 105 G
bcache 220 K 3.2 G 20.3% (score == hit-rate)
tcache 10 K 6.5 M 97.8% (score == hit-rate)
titers 2643
filter - - 98.2% (score == utility)

530 goroutines computing the consistency checks:

0 | quotapool | quotapool.go:281 | (*AbstractPool).Acquire(#1309, {#4, *}, {#134, *})
1 | quotapool | int_rate.go:59 | (*RateLimiter).WaitN(#307, {#4, *}, *)
2 | kvserver | replica_consistency.go:581 | (*Replica).sha512.func1({{*, *, *}, {*, 0, 0}}, {*, *, *})
3 | storage | mvcc.go:3902 | ComputeStatsForRange({#147, *}, {*, *, *}, {*, *, *}, 0, {*, ...})
4 | kvserver | replica_consistency.go:636 | (*Replica).sha512(*, {#4, *}, {*, {*, *, 8}, {*, *, 8}, ...}, …)
5 | kvserver | replica_proposal.go:247 | (*Replica).computeChecksumPostApply.func1.1({#156, *}, {{*, *, *, *, *, *, *, *, ...}, ...}, …)
6 | kvserver | replica_proposal.go:253 | (*Replica).computeChecksumPostApply.func1({#4, *})
7 | stop | stopper.go:488 | (*Stopper).RunAsyncTaskEx.func2()
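To make concrete why a single shared limiter plus many in-flight checks produces stacks like the one above, here's a toy sketch of my own, using `golang.org/x/time/rate` instead of CockroachDB's `quotapool` and with made-up sizes and rates: N checks splitting one fixed byte budget each take roughly N times as long, so slowness and concurrency reinforce each other.

```go
// Toy model of the pile-up, NOT CockroachDB code: many checksum computations
// on one node draining a single shared rate limiter. Sizes and rates are
// made up for illustration.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

const (
	bytesPerCheck = 1 << 20  // pretend each consistency check scans 1 MiB
	chunkSize     = 64 << 10 // paced in 64 KiB chunks, mirroring the WaitN frame above
)

// runCheck stands in for a single checksum computation: it scans the range,
// waiting on the shared node-wide limiter for every chunk.
func runCheck(ctx context.Context, lim *rate.Limiter) {
	for scanned := 0; scanned < bytesPerCheck; scanned += chunkSize {
		if err := lim.WaitN(ctx, chunkSize); err != nil {
			return // e.g. ctx canceled
		}
	}
}

func main() {
	lim := rate.NewLimiter(8<<20, chunkSize) // shared budget: 8 MiB/s, 64 KiB burst

	for _, concurrent := range []int{1, 8} {
		start := time.Now()
		var wg sync.WaitGroup
		for i := 0; i < concurrent; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				runCheck(context.Background(), lim)
			}()
		}
		wg.Wait()
		fmt.Printf("%d concurrent check(s) finished in %.2fs\n",
			concurrent, time.Since(start).Seconds())
	}
}
```

Under those made-up numbers, a lone check finishes in a fraction of a second while eight concurrent ones each take roughly eight times as long, which is the shape of the slowdown being discussed here.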
On the leaseholder, `ctx` passed to `computeChecksumPostApply` is that of the proposal. As of #71806, this context is canceled right after the corresponding proposal is signaled (and the client goroutine returns from `sendWithRangeID`). This effectively prevents most consistency checks from succeeding (they previously were not affected by higher-level cancellation because the consistency check is triggered from a local queue that talks directly to the replica, i.e. had something like a minutes-long timeout).

This caused disastrous behavior in the `clearrange` suite of roachtests. That test imports a large table. After the import, most ranges have estimates (due to the ctx cancellation preventing the consistency checks, which as a byproduct trigger stats adjustments) and their stats claim that they are very small. Before recent PR #74674, `ClearRange` on such ranges would use individual point deletions instead of the much more efficient pebble range deletions, effectively writing a lot of data and running the nodes out of disk.

Failures of `clearrange` with #74674 were also observed, but they did not involve out-of-disk situations, so are possibly an alternative failure mode (that may still be related to the newly introduced presence of context cancellation).

Touches #68303.

Release note: None
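To illustrate the `ClearRange` consequence the commit message describes, here's a hedged sketch; the type, function, and threshold names are hypothetical stand-ins rather than the real code path:

```go
// Hedged sketch of the ClearRange trade-off described in the commit message.
// The names, types, and threshold below are hypothetical, not the actual
// CockroachDB code paths.
package main

import "fmt"

// rangeStats mimics MVCC stats that may be mere estimates, and badly wrong
// once the consistency checks that would normally recompute them are canceled.
type rangeStats struct {
	LiveBytes int64
}

const pointDeleteThreshold = 512 << 10 // hypothetical cutoff

// clearStrategy picks between per-key tombstones and a single range tombstone
// based on the (possibly estimated) stats.
func clearStrategy(stats rangeStats) string {
	if stats.LiveBytes <= pointDeleteThreshold {
		// If the estimates claim the range is tiny, point deletes look cheap.
		// For a huge range with underestimated stats, this branch writes one
		// tombstone per key and can run the node out of disk.
		return "point deletions"
	}
	return "pebble range deletion"
}

func main() {
	fmt.Println(clearStrategy(rangeStats{LiveBytes: 100 << 10})) // point deletions
	fmt.Println(clearStrategy(rangeStats{LiveBytes: 64 << 30}))  // pebble range deletion
}
```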
Last update for the evening. Spent the remainder of today looking less at the
Sampling some commits, I'm noticing the following "good" vs. "bad" behavior:
* Good: fair allocation of replicas across all nodes.
* Bad: some nodes run out of disk and stall the import.

I started a bisect, but it was taking some time. I'll pick this up again tomorrow.
I think I see what the problem is. I had actually thought about it before, but erroneously convinced myself that it wasn't an issue.

Here's what it looks like on the leaseholder on the bad sha (i.e. ctx cancels): the checksum computation is torn down as soon as the proposal's ctx is canceled, so no concurrency builds up there. But the computation also happens on each follower, and there it will not have a cancelable context associated with it, so it keeps running.
So basically the problem is that if a consistency check fails fast on the leader, this doesn't cancel the in-flight computation on the follower. Since each node is a follower for lots of ranges, we had tons of consistency checks running on each node.

What's curious is that when I ran this experiment I should've seen lots of snapshots open, but I didn't; maybe my instrumentation was wrong, or the test never got to the point where it exhibited this problem (the graceful shutdowns I introduced after the import hung, I think).

With the cancellation fix, we're close to the previous behavior. The only difference is that previously, the computation on the leaseholder was canceled when the consistency checker queue gave up. But, like before, this wouldn't affect the followers if they still had the computation ongoing.

I think this might put a pin in the high ztbl count, right? Thanks for all of the work getting us here @nicktrav!
How are you bisecting, btw? Are you going through all 319 commits in cd1093d...8eaf8d2? It sounds as though the strategy for each bisection attempt would be to cherry-pick 71f0b34 on top, but are we even sure this is "good" for any of the commits in that range?
This makes a lot of sense. I'll still suggest that we should think carefully about whether the client ctx cancellation is the root of the problem, or whether it's actually d064059. The ability for a client ctx cancellation to propagate to Raft log application on the proposer replica seems like a serious problem to me. It breaks determinism, the cornerstone of the whole "replicated state machine" idea. I'm actually surprised this hasn't caused worse issues, like a short-circuited split on the proposer. We must just not currently check for context cancellation in many places below Raft.
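As a toy illustration of the determinism hazard (my own sketch, not how command application is actually structured): if the apply step consults a context that only the proposer sees as canceled, the replicas' state machines diverge.

```go
// Toy illustration of the determinism hazard, not how command application is
// actually structured: the same command applied on two replicas diverges if
// only one of them observes a canceled client context.
package main

import (
	"context"
	"fmt"
)

// applyCommand stands in for a state machine transition during Raft log
// application. If it consults the context, replicas can disagree.
func applyCommand(ctx context.Context, state map[string]string, k, v string) {
	if ctx.Err() != nil {
		return // skipped on this replica only
	}
	state[k] = v
}

func main() {
	proposerCtx, cancel := context.WithCancel(context.Background())
	cancel() // the client gave up; only the proposer sees this cancellation

	proposer := map[string]string{}
	follower := map[string]string{}

	applyCommand(proposerCtx, proposer, "k", "v")
	applyCommand(context.Background(), follower, "k", "v")

	fmt.Println(len(proposer), len(follower)) // 0 1: the replicas no longer agree
}
```

The real apply loop has to produce the same result on every replica for every entry, which is why cancellation checks below Raft are so dangerous.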
I fully agree with Nathan here. That commit was motivated by propagating tracing information through command application, but it should not propagate cancellation signals.
roachtest.clearrange/checks=true failed with artifacts on master @ 71becf337d9d2731298dc092f3ce9cf0f0eedb2c:
Same failure on other branches
Is there a good reason why we check for cancellation anywhere below Raft?
There may not be, but it seems brittle to pass a cancelable context into a subsystem that must not check cancellation. It's both more robust and also, in my view, more appropriate to execute state machine transitions under a context that does not inherit the wholly unrelated client cancellation. I think we should massage cockroach/pkg/kv/kvserver/replica_application_decoder.go (lines 139 to 171 in 852a80c) such that it avoids deriving from the client's context.
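A minimal sketch of that idea in generic Go (not the actual change to `replica_application_decoder.go`): wrap the parent so tracing values still flow through `Value`, while `Done`/`Err` never report the client's cancellation. Newer Go versions ship `context.WithoutCancel` for exactly this.

```go
// Minimal sketch (generic Go, not the actual replica_application_decoder.go
// change) of a context that keeps the parent's values (e.g. tracing
// information) but never reports the parent's cancellation.
// Go 1.21+ offers context.WithoutCancel for the same purpose.
package main

import (
	"context"
	"fmt"
	"time"
)

// valueOnlyCtx forwards Value lookups to the parent but has no deadline and
// never signals Done.
type valueOnlyCtx struct{ parent context.Context }

func (c valueOnlyCtx) Deadline() (time.Time, bool) { return time.Time{}, false }
func (c valueOnlyCtx) Done() <-chan struct{}       { return nil }
func (c valueOnlyCtx) Err() error                  { return nil }
func (c valueOnlyCtx) Value(key any) any           { return c.parent.Value(key) }

func main() {
	type traceKey struct{}
	parent, cancel := context.WithCancel(
		context.WithValue(context.Background(), traceKey{}, "span-1234"))
	detached := valueOnlyCtx{parent: parent}

	cancel() // simulate the proposal being signaled and its ctx canceled

	fmt.Println(parent.Err())               // context.Canceled
	fmt.Println(detached.Err())             // <nil>: application keeps going
	fmt.Println(detached.Value(traceKey{})) // span-1234: tracing values still flow
}
```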
Filed #75656.
I kinda see it the other way around. The subsystem should be robust against any context passed into it. Depending on where you draw the boundary of the subsystem in question, you can say that Raft application can be made robust by switching to a non-cancelable context itself. But still, if there's code that only ever runs below Raft, I think we should take out all the cancellation checks (at the very least, for clarity).
Yeah, taking this approach. There are only 8-ish steps in a full bisect, but it's a little bit of extra work to cherry-pick, etc., so it's a little slower going.
I don't think we are, based on what came in while I was offline. That said, if I treat a "good" signal for this bisect as whether the replicas are balanced, I seem to be zeroing in.
Right, I think this is what we're saying, and what is proposed in #75656.
Trying to make this guarantee is the part that seems brittle. Even if we carefully audit and ensure that we don't perform context cancellation checks directly in Raft code, it's hard to guarantee that no lower-level logic or library that Raft code calls into will perform such checks. For instance, I broke this guarantee in #73279 while touching distant code, which Erik fixed in #73484. There are also proposals like golang/go#20280 to add context cancellation awareness to the filesystem operations provided by the standard library. If we don't want a subsystem to respect context cancellation, it's best not to give it a cancellable context.
Maybe we could even remove
Let's continue discussing this on #75656; this comment thread is already pretty unwieldy.
I had some luck with the bisect on the replica imbalance issue. I've narrowed it down to e12c9e6. On master with this commit included, the import shows the imbalanced behavior; on a branch with just this commit excluded, the replicas are far more balanced and the import is able to succeed.

I don't have enough context to say whether this commit is problematic outside the context of just this test. Once the import succeeds, we're into the (well documented) realm of #75656.

In terms of debugging this specific test failure, I think we've found the two separate issues we theorized.
Great work, and much appreciated that you keep going the extra mile. I think we should close this issue, file a separate issue about what you have found, and then link it to the new clearrange failure issue once the next nightly creates it.
Well that's pretty interesting / mildly surprising, since this is an ordered ingestion import, where we didn't expect this bulk-sent split size to do much other than aggravate the merge queue an hour later.
Ack. Will do. @dt - I'll move discussion over there 👍. If I close this, I think it will just re-open when it fails again (at least until the replica imbalance issue is addressed). I'll just make it known to the rest of the Storage folks that we can probably safely leave this one alone. Thanks all.
roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 701b177d8f4b81d8654dfb4090a2cd3cf82e63a7: