Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: jepsen-batch3/multi-register/strobe-skews failed [likely #36431] #37394

Closed
cockroach-teamcity opened this issue May 8, 2019 · 26 comments · Fixed by #40600
Closed
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/8abb47a1c9795c1463183bc44e776b054bece682

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1279683&tab=buildLog

The test failed on branch=master, cloud=gce:
	jepsen.go:249,jepsen.go:312,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1279683-jepsen-batch3:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.142.0.88 -n 10.142.0.57 -n 10.142.0.66 -n 10.142.0.64 -n 10.142.0.63 \
		   --test multi-register --nemesis strobe-skews \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Connection to 34.73.29.82 closed by remote host.
		Error:  exit status 255
		: exit status 1

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone May 8, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels May 8, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/048bdc163fcb470d4e749fcad482cf2671c29fb1

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1281656&tab=buildLog

The test failed on branch=master, cloud=gce:
	jepsen.go:249,jepsen.go:312,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1281656-jepsen-batch3:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.142.0.141 -n 10.142.0.135 -n 10.142.0.136 -n 10.142.0.86 -n 10.142.0.134 \
		   --test multi-register --nemesis strobe-skews \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  exit status 1
		: exit status 1

@andreimatei
Copy link
Contributor

The 2nd failure here is a real table-flipping "Analysis invalid".
failure logs:
https://drive.google.com/open?id=1-apgEwOrCJryn1ofRdkD1YPo2u_-hcOy

@nvanbenschoten this test is your baby, right?

@andreimatei andreimatei added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label May 9, 2019
@nvanbenschoten
Copy link
Member

@andreimatei and I took a look at this together and haven't been able to find any indication that this is not #36431. For context, the multi-register test was introduced specifically to catch that speculated issue. Here's what we see:

We see in results.edn that timeline 83 detected a linearizability violation. We see in independent/83/results.edn that this timeline failed with the following invalid sequence of operations:

:final-paths
  ([{:op
     {:process 23,
      :type :ok,
      :f :txn,
      :value [[:write 2 1] [:write 4 6]],
      :index 4876,
      :time 27710611954},
     :model {1 9, 3 3, 0 7, 2 1, 4 6}}
    {:op
     {:process 22,
      :type :ok,
      :f :txn,
      :value [[:read 4 9]],
      :index 4885,
      :time 27741394387},
     :model {:msg "6≠9"}}]
   [{:op
     {:process 23,
      :type :ok,
      :f :txn,
      :value [[:write 2 1] [:write 4 6]],
      :index 4876,
      :time 27710611954},
     :model {1 9, 3 9, 0 9, 2 1, 4 6}}
    {:op
     {:process 22,
      :type :ok,
      :f :txn,
      :value [[:read 4 9]],
      :index 4885,
      :time 27741394387},
     :model {:msg "6≠9"}}]),

This shows that register 4 was initially given a value of 9 by Txn A. Txn B then writes to the register with a value of 6. Finally, Txn C reads from the register and gets a value of 9. None of these txns are concurrent so this is a classic case of a "stale read". We can see this more easily in independent/83/timeline.html:

violate

The biggest piece of evidence that this is in-fact #36431 is that the write to register 4 was the second write in Txn B. This means that if Txn B refreshed before committing then it may never have gone back to the range with register 4 to bump its clock because its intent on register 4 is not local to the transaction's record.

The other piece of evidence that points to this being #36431 is that the test is running the strobe-skews nemesis. This is the nemesis we hoped would trigger this issue. It could explain both how process 21's read of register 2 resulted in Txn B refreshing and how the gateway that executed Txn C had a lower timestamp than the commit timestamp of Txn B.

Because of all of that, I'm going to downgrade the severity of this to an S-2. The next steps will be to fix #36431 now that we have a probable reproduction. We can then try to reproduce. I'm also going to add a scattering step to the test to increase the change that the ranges fall on different nodes. @andreimatei also mentioned formally verifying that the issue exists and verifying that the proposed fix will resolve it.

@nvanbenschoten nvanbenschoten added S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors and removed S-1 High impact: many users impacted, serious risk of high unavailability or data loss S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. labels May 9, 2019
@nvanbenschoten
Copy link
Member

cockroachdb/jepsen#21 should help us hit this a little more frequently.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dcd4cc5e37ebbcebbbf0f01670811b176a58bf89

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1288281&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	jepsen.go:252,jepsen.go:322,test.go:1251: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/699f675c73f8420802f92e46f65e6dce52abc12f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1306268&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	jepsen.go:262,jepsen.go:324,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1306268-jepsen-batch3:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.142.0.125 -n 10.142.0.103 -n 10.142.0.101 -n 10.142.0.122 -n 10.142.0.148 \
		   --test multi-register --nemesis strobe-skews \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.229.118.16_2019-05-24T10:43:33Z: exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f49f211f8fb2c2aa51182054192ebfcb9c0355f0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1315180&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	jepsen.go:254,jepsen.go:324,test.go:1251: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5a88de2233e1405c0553f2d5380fd24218fac3d2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1324169&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	jepsen.go:253,jepsen.go:323,test.go:1248: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/b83798cadfee6447d565688b58657843741f8a45

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1328389&tab=buildLog

The test failed on branch=master, cloud=gce:
	jepsen.go:253,jepsen.go:323,test.go:1248: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/33f9a3c2c5a0d2b756a8827daa74285b1aece401

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1330334&tab=buildLog

The test failed on branch=master, cloud=gce:
	jepsen.go:253,jepsen.go:323,test.go:1248: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/e77445ae3dba64442ed36ca2d7abbbd983e17677

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1332337&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	jepsen.go:253,jepsen.go:323,test.go:1248: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/254f7a8aab8d2f32315c879e0cc946429883e9e6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1337166&tab=buildLog

The test failed on branch=master, cloud=gce:
	jepsen.go:253,jepsen.go:323,test.go:1248: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/db6d4425d65bdb027624972ccb19d7aad0bc57cc

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1339372&tab=buildLog

The test failed on branch=master, cloud=gce:
	jepsen.go:253,jepsen.go:323,test.go:1248: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/35b45366a97b6209edb8ffc343a930c242ccf87e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1349769&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	jepsen.go:253,jepsen.go:323,test.go:1251: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f6d75a54f2cdf21723b924e143963f75816272df

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1349751&tab=buildLog

The test failed on branch=master, cloud=gce:
	jepsen.go:253,jepsen.go:323,test.go:1251: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5f358ed804af05f8c4b404efc4d8a282d8e0916c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1360435&tab=buildLog

The test failed on branch=master, cloud=gce:
	jepsen.go:259,jepsen.go:321,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1360435-jepsen-batch3:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.142.0.53 -n 10.142.0.52 -n 10.142.0.41 -n 10.142.0.28 -n 10.142.0.40 \
		   --test multi-register --nemesis strobe-skews \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.227.31.90_2019-06-26T11:03:34Z: exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f1c9693da739fa5fc2c94d4d978fadd6710d17da

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1371441&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190703-1371441/jepsen/multi-register/strobe-skews/run_1
	jepsen.go:256,jepsen.go:316,test_runner.go:680: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/4878cb3e960dc26f0946e527e1836f27b3304c97

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1373334&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190704-1373334/jepsen/register/strobe-skews/run_1
	cluster.go:1724,jepsen.go:98,jepsen.go:138,jepsen.go:316,test_runner.go:681: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1562220659-29-n6cpu4:1-6 -- tar --transform s,^,cockroach/, -c -z -f cockroach.tgz cockroach returned:
		stderr:
		
		stdout:
		teamcity-1562220659-29-n6cpu4: tar --transform s,^,cockroa.........
		   1: 
		   2: tar: cockroach: Cannot stat: No such file or directory
		tar: Exiting with failure status due to previous errors
		ssh verbose log retained in /root/.roachprod/debug/ssh_35.227.82.203_2019-07-04T14:07:42Z: exit status 2
		   3: 
		   4: 
		   5: tar: cockroach: Cannot stat: No such file or directory
		tar: Exiting with failure status due to previous errors
		ssh verbose log retained in /root/.roachprod/debug/ssh_34.73.100.118_2019-07-04T14:07:42Z: exit status 2
		   6: tar: cockroach: Cannot stat: No such file or directory
		tar: Exiting with failure status due to previous errors
		ssh verbose log retained in /root/.roachprod/debug/ssh_35.196.208.190_2019-07-04T14:07:42Z: exit status 2
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.227.82.203_2019-07-04T14:07:42Z: exit status 2
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9322e07476de447799c5d3011eb2874930ee2993

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1375546&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190706-1375546/jepsen/multi-register/strobe-skews/run_1
	jepsen.go:256,jepsen.go:316,test_runner.go:681: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/8c6fdc64908a13291e4ddc5d233bbbaa379e71a2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/multi-register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1378458&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190709-1378458/jepsen/multi-register/strobe-skews/run_1
	jepsen.go:256,jepsen.go:316,test_runner.go:670: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190722-1399000/jepsen/register/strobe-skews/run_1
	jepsen.go:256,jepsen.go:325,test_runner.go:691: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/008680701c3aaf3f5f621dca781819096178cff4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1411175&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190730-1411175/jepsen/register/strobe-skews/run_1
	jepsen.go:256,jepsen.go:325,test_runner.go:691: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/da56c792e968574b8f1d9ef3fdb45d56a530221a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/strobe-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1415578&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190801-1415578/jepsen/register/strobe-skews/run_1
	jepsen.go:264,jepsen.go:325,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564640260-63-n6cpu4:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.128.0.98 -n 10.128.0.99 -n 10.128.0.93 -n 10.128.0.89 -n 10.128.0.97 \
		   --test register --nemesis strobe-skews \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  exit status 255
		: exit status 1

@nvanbenschoten
Copy link
Member

The previous failure was a duplicate of #39218, which is fixed by cockroachdb/jepsen#23.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 9, 2019
…tervals

This commit fixes the most common failure case in all of the following Jepsen
failures. I'm not closing them here though, because they also should be failing
due to cockroachdb#36431.
Would fix cockroachdb#37394.
Would fix cockroachdb#37545.
Would fix cockroachdb#37930.
Would fix cockroachdb#37932.
Would fix cockroachdb#37956.
Would fix cockroachdb#38126.
Would fix cockroachdb#38763.

Before this fix, we were not considering intents in a scan's uncertainty
interval to be uncertain. This had the potential to cause stale reads because
an unresolved intent doesn't indicate that its transaction hasn’t been committed
and isn’t a causal ancestor of the scan. This was causing the `jepsen/multi-register`
tests to fail, which I had previously incorrectly attributed entirely to cockroachdb#36431.

This commit fixes this by returning `WriteIntentError`s for intents when they
are above the read timestamp of a scan but below the max timestamp of a scan.
This could have also been fixed by returning `ReadWithinUncertaintyIntervalError`s
in this situation. Both would eventually have the same effect, but it seems
preferable to kick off concurrency control immediately in this situation and
only fall back to uncertainty handling for committed values. If the intent
ends up being aborted, this could allow the read to avoid moving its timestamp.

This commit will need to be backported all the way back to v2.0.

Release note (bug fix): Consider intents in a read's uncertainty
interval to be uncertain just as if they were committed values. This
removes the potential for stale reads when a causally dependent
transaction runs into the not-yet resolved intents from a causal
ancestor.
craig bot pushed a commit that referenced this issue Sep 9, 2019
40600: storage/engine: return WriteIntentError for intents in uncertainty intervals r=petermattis a=nvanbenschoten

This commit fixes the most common failure case in all of the following Jepsen failures. I'm not closing them here though, because they also should be failing due to #36431.
Would fix #37394.
Would fix #37545.
Would fix #37930.
Would fix #37932.
Would fix #37956.
Would fix #38126.
Would fix #38763.

Before this fix, we were not considering intents in a scan's uncertainty interval to be uncertain. This had the potential to cause stale reads because an unresolved intent doesn't indicate that its transaction hasn’t been committed and isn’t a causal ancestor of the scan. This was causing the `jepsen/multi-register` tests to fail, which I had previously incorrectly attributed entirely to #36431.

This commit fixes this by returning `WriteIntentError`s for intents when they are above the read timestamp of a scan but below the max timestamp of a scan. This could have also been fixed by returning `ReadWithinUncertaintyIntervalError`s in this situation. Both would eventually have the same effect, but it seems preferable to kick off concurrency control immediately in this situation and only fall back to uncertainty handling for committed values. If the intent ends up being aborted, this could allow the read to avoid moving its timestamp.

This commit will need to be backported all the way back to v2.0.

Release note (bug fix): Consider intents in a read's uncertainty interval to be uncertain just as if they were committed values. This removes the potential for stale reads when a causally dependent transaction runs into the not-yet resolved intents from a causal ancestor.

40603: make: pass TESTFLAGS to roachprod-stress, not GOFLAGS r=petermattis a=nvanbenschoten

Passing the testflags through the GOFLAGS env var was causing the following
error:
```
stringer -output=pkg/sql/opt/rule_name_string.go -type=RuleName pkg/sql/opt/rule_name.go pkg/sql/opt/rule_name.og.go
stringer: go [list -f {{context.GOARCH}} {{context.Compiler}} -tags= -- unsafe]: exit status 1: go: parsing $GOFLAGS: non-flag "storage.test"
Makefile:1496: recipe for target 'pkg/sql/opt/rule_name_string.go' failed
make: *** [pkg/sql/opt/rule_name_string.go] Error 1
make: *** Waiting for unfinished jobs....
```

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in d20419d Sep 9, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 9, 2019
…tervals

This commit fixes the most common failure case in all of the following Jepsen
failures. I'm not closing them here though, because they also should be failing
due to cockroachdb#36431.
Would fix cockroachdb#37394.
Would fix cockroachdb#37545.
Would fix cockroachdb#37930.
Would fix cockroachdb#37932.
Would fix cockroachdb#37956.
Would fix cockroachdb#38126.
Would fix cockroachdb#38763.

Before this fix, we were not considering intents in a scan's uncertainty
interval to be uncertain. This had the potential to cause stale reads because
an unresolved intent doesn't indicate that its transaction hasn’t been committed
and isn’t a causal ancestor of the scan. This was causing the `jepsen/multi-register`
tests to fail, which I had previously incorrectly attributed entirely to cockroachdb#36431.

This commit fixes this by returning `WriteIntentError`s for intents when they
are above the read timestamp of a scan but below the max timestamp of a scan.
This could have also been fixed by returning `ReadWithinUncertaintyIntervalError`s
in this situation. Both would eventually have the same effect, but it seems
preferable to kick off concurrency control immediately in this situation and
only fall back to uncertainty handling for committed values. If the intent
ends up being aborted, this could allow the read to avoid moving its timestamp.

This commit will need to be backported all the way back to v2.0.

Release note (bug fix): Consider intents in a read's uncertainty
interval to be uncertain just as if they were committed values. This
removes the potential for stale reads when a causally dependent
transaction runs into the not-yet resolved intents from a causal
ancestor.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 10, 2019
…tervals

This commit fixes the most common failure case in all of the following Jepsen
failures. I'm not closing them here though, because they also should be failing
due to cockroachdb#36431.
Would fix cockroachdb#37394.
Would fix cockroachdb#37545.
Would fix cockroachdb#37930.
Would fix cockroachdb#37932.
Would fix cockroachdb#37956.
Would fix cockroachdb#38126.
Would fix cockroachdb#38763.

Before this fix, we were not considering intents in a scan's uncertainty
interval to be uncertain. This had the potential to cause stale reads because
an unresolved intent doesn't indicate that its transaction hasn’t been committed
and isn’t a causal ancestor of the scan. This was causing the `jepsen/multi-register`
tests to fail, which I had previously incorrectly attributed entirely to cockroachdb#36431.

This commit fixes this by returning `WriteIntentError`s for intents when they
are above the read timestamp of a scan but below the max timestamp of a scan.
This could have also been fixed by returning `ReadWithinUncertaintyIntervalError`s
in this situation. Both would eventually have the same effect, but it seems
preferable to kick off concurrency control immediately in this situation and
only fall back to uncertainty handling for committed values. If the intent
ends up being aborted, this could allow the read to avoid moving its timestamp.

This commit will need to be backported all the way back to v2.0.

Release note (bug fix): Consider intents in a read's uncertainty
interval to be uncertain just as if they were committed values. This
removes the potential for stale reads when a causally dependent
transaction runs into the not-yet resolved intents from a causal
ancestor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors
Projects
None yet
4 participants