Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: jepsen-batch3/multi-register/split failed [hopefully #36431] #37930

Closed
cockroach-teamcity opened this issue May 30, 2019 · 12 comments · Fixed by #40600
Closed

roachtest: jepsen-batch3/multi-register/split failed [hopefully #36431] #37930

cockroach-teamcity opened this issue May 30, 2019 · 12 comments · Fixed by #40600
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/8f42e0d9948256af8b3e1994d514314ba1718c48

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1315162&tab=buildLog

The test failed on branch=master, cloud=gce:
	jepsen.go:262,jepsen.go:324,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1315162-jepsen-batch3:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.142.0.26 -n 10.142.0.39 -n 10.142.0.55 -n 10.142.0.54 -n 10.142.0.24 \
		   --test multi-register --nemesis split \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.196.20.37_2019-05-30T10:48:33Z: exit status 1
		: exit status 1

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone May 30, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels May 30, 2019
@tbg
Copy link
Member

tbg commented Jun 4, 2019

    :valid? true}},
  :failures [2287]},
 :valid? false}


Analysis invalid!

@tbg tbg assigned nvanbenschoten and unassigned petermattis Jun 4, 2019
@tbg tbg changed the title roachtest: jepsen-batch3/multi-register/split failed roachtest: jepsen-batch3/multi-register/split failed [hopefully #36431] Jun 4, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/3b63648f905715e6c0b055fe9acac9c5b8206196

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1333520&tab=buildLog

The test failed on branch=master, cloud=gce:
	jepsen.go:253,jepsen.go:323,test.go:1248: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f6d75a54f2cdf21723b924e143963f75816272df

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/multi-register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1349751&tab=buildLog

The test failed on branch=master, cloud=gce:
	jepsen.go:253,jepsen.go:323,test.go:1251: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/4878cb3e960dc26f0946e527e1836f27b3304c97

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/multi-register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1373334&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190704-1373334/jepsen/multi-register/split/run_1
	cluster.go:1724,jepsen.go:98,jepsen.go:138,jepsen.go:316,test_runner.go:681: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1562220659-29-n6cpu4:1-6 -- tar --transform s,^,cockroach/, -c -z -f cockroach.tgz cockroach returned:
		stderr:
		
		stdout:
		teamcity-1562220659-29-n6cpu4: tar --transform s,^,cockroa.........
		   1: 
		   2: tar: cockroach: Cannot stat: No such file or directory
		tar: Exiting with failure status due to previous errors
		ssh verbose log retained in /root/.roachprod/debug/ssh_35.227.82.203_2019-07-04T14:02:06Z: exit status 2
		   3: 
		   4: 
		   5: tar: cockroach: Cannot stat: No such file or directory
		tar: Exiting with failure status due to previous errors
		ssh verbose log retained in /root/.roachprod/debug/ssh_34.73.100.118_2019-07-04T14:02:06Z: exit status 2
		   6: tar: cockroach: Cannot stat: No such file or directory
		tar: Exiting with failure status due to previous errors
		ssh verbose log retained in /root/.roachprod/debug/ssh_35.196.208.190_2019-07-04T14:02:06Z: exit status 2
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.227.82.203_2019-07-04T14:02:06Z: exit status 2
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/4878cb3e960dc26f0946e527e1836f27b3304c97

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1373334&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190704-1373334/jepsen/register/split/run_1
	cluster.go:1724,jepsen.go:98,jepsen.go:138,jepsen.go:316,test_runner.go:681: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1562220659-29-n6cpu4:1-6 -- tar --transform s,^,cockroach/, -c -z -f cockroach.tgz cockroach returned:
		stderr:
		
		stdout:
		teamcity-1562220659-29-n6cpu4: tar --transform s,^,cockroa.........
		   1: 
		   2: tar: cockroach: Cannot stat: No such file or directory
		tar: Exiting with failure status due to previous errors
		ssh verbose log retained in /root/.roachprod/debug/ssh_35.227.82.203_2019-07-04T14:05:46Z: exit status 2
		   3: 
		   4: 
		   5: tar: cockroach: Cannot stat: No such file or directory
		tar: Exiting with failure status due to previous errors
		ssh verbose log retained in /root/.roachprod/debug/ssh_34.73.100.118_2019-07-04T14:05:46Z: exit status 2
		   6: tar: cockroach: Cannot stat: No such file or directory
		tar: Exiting with failure status due to previous errors
		ssh verbose log retained in /root/.roachprod/debug/ssh_35.196.208.190_2019-07-04T14:05:46Z: exit status 2
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.227.82.203_2019-07-04T14:05:46Z: exit status 2
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9322e07476de447799c5d3011eb2874930ee2993

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/multi-register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1375546&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190706-1375546/jepsen/multi-register/split/run_1
	jepsen.go:264,jepsen.go:316,test_runner.go:681: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1562393890-27-n6cpu4:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.142.0.66 -n 10.142.0.42 -n 10.142.0.30 -n 10.142.0.24 -n 10.142.0.101 \
		   --test multi-register --nemesis split \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.196.250.21_2019-07-06T14:56:36Z: exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1ca35fc4a0e2665e7f6efd945e65a0db97984fa7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1396096&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190719-1396096/jepsen/register/split/run_1
	jepsen.go:256,jepsen.go:325,test_runner.go:691: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/da56c792e968574b8f1d9ef3fdb45d56a530221a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1415578&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190801-1415578/jepsen/register/split/run_1
	jepsen.go:264,jepsen.go:325,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564640260-60-n6cpu4:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.128.0.72 -n 10.128.0.78 -n 10.128.0.75 -n 10.128.0.69 -n 10.128.0.70 \
		   --test register --nemesis split \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  exit status 255
		: exit status 1

@nvanbenschoten
Copy link
Member

The previous failure was a duplicate of #39218, which is fixed by cockroachdb/jepsen#23.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5bd37e8eb58ca66b9293c234bc572411057fec3a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1417287&tab=buildLog

The test failed on branch=provisional_201908012151_v19.2.0-alpha.20190729, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190802-1417287/jepsen/register/split/run_1
	jepsen.go:256,jepsen.go:325,test_runner.go:691: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/3b9a95bd7eb2cfa6d544fe7217852a85ec3b76f4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1422703&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190805-1422703/jepsen/register/split/run_1
	jepsen.go:256,jepsen.go:325,test_runner.go:691: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9a982e902638e116ed6a76f4fa635a0a1445d88a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1447054&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190821-1447054/jepsen/register/split/run_1
	jepsen.go:256,jepsen.go:325,test_runner.go:673: exit status 1

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 9, 2019
…tervals

This commit fixes the most common failure case in all of the following Jepsen
failures. I'm not closing them here though, because they also should be failing
due to cockroachdb#36431.
Would fix cockroachdb#37394.
Would fix cockroachdb#37545.
Would fix cockroachdb#37930.
Would fix cockroachdb#37932.
Would fix cockroachdb#37956.
Would fix cockroachdb#38126.
Would fix cockroachdb#38763.

Before this fix, we were not considering intents in a scan's uncertainty
interval to be uncertain. This had the potential to cause stale reads because
an unresolved intent doesn't indicate that its transaction hasn’t been committed
and isn’t a causal ancestor of the scan. This was causing the `jepsen/multi-register`
tests to fail, which I had previously incorrectly attributed entirely to cockroachdb#36431.

This commit fixes this by returning `WriteIntentError`s for intents when they
are above the read timestamp of a scan but below the max timestamp of a scan.
This could have also been fixed by returning `ReadWithinUncertaintyIntervalError`s
in this situation. Both would eventually have the same effect, but it seems
preferable to kick off concurrency control immediately in this situation and
only fall back to uncertainty handling for committed values. If the intent
ends up being aborted, this could allow the read to avoid moving its timestamp.

This commit will need to be backported all the way back to v2.0.

Release note (bug fix): Consider intents in a read's uncertainty
interval to be uncertain just as if they were committed values. This
removes the potential for stale reads when a causally dependent
transaction runs into the not-yet resolved intents from a causal
ancestor.
craig bot pushed a commit that referenced this issue Sep 9, 2019
40600: storage/engine: return WriteIntentError for intents in uncertainty intervals r=petermattis a=nvanbenschoten

This commit fixes the most common failure case in all of the following Jepsen failures. I'm not closing them here though, because they also should be failing due to #36431.
Would fix #37394.
Would fix #37545.
Would fix #37930.
Would fix #37932.
Would fix #37956.
Would fix #38126.
Would fix #38763.

Before this fix, we were not considering intents in a scan's uncertainty interval to be uncertain. This had the potential to cause stale reads because an unresolved intent doesn't indicate that its transaction hasn’t been committed and isn’t a causal ancestor of the scan. This was causing the `jepsen/multi-register` tests to fail, which I had previously incorrectly attributed entirely to #36431.

This commit fixes this by returning `WriteIntentError`s for intents when they are above the read timestamp of a scan but below the max timestamp of a scan. This could have also been fixed by returning `ReadWithinUncertaintyIntervalError`s in this situation. Both would eventually have the same effect, but it seems preferable to kick off concurrency control immediately in this situation and only fall back to uncertainty handling for committed values. If the intent ends up being aborted, this could allow the read to avoid moving its timestamp.

This commit will need to be backported all the way back to v2.0.

Release note (bug fix): Consider intents in a read's uncertainty interval to be uncertain just as if they were committed values. This removes the potential for stale reads when a causally dependent transaction runs into the not-yet resolved intents from a causal ancestor.

40603: make: pass TESTFLAGS to roachprod-stress, not GOFLAGS r=petermattis a=nvanbenschoten

Passing the testflags through the GOFLAGS env var was causing the following
error:
```
stringer -output=pkg/sql/opt/rule_name_string.go -type=RuleName pkg/sql/opt/rule_name.go pkg/sql/opt/rule_name.og.go
stringer: go [list -f {{context.GOARCH}} {{context.Compiler}} -tags= -- unsafe]: exit status 1: go: parsing $GOFLAGS: non-flag "storage.test"
Makefile:1496: recipe for target 'pkg/sql/opt/rule_name_string.go' failed
make: *** [pkg/sql/opt/rule_name_string.go] Error 1
make: *** Waiting for unfinished jobs....
```

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in d20419d Sep 9, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 9, 2019
…tervals

This commit fixes the most common failure case in all of the following Jepsen
failures. I'm not closing them here though, because they also should be failing
due to cockroachdb#36431.
Would fix cockroachdb#37394.
Would fix cockroachdb#37545.
Would fix cockroachdb#37930.
Would fix cockroachdb#37932.
Would fix cockroachdb#37956.
Would fix cockroachdb#38126.
Would fix cockroachdb#38763.

Before this fix, we were not considering intents in a scan's uncertainty
interval to be uncertain. This had the potential to cause stale reads because
an unresolved intent doesn't indicate that its transaction hasn’t been committed
and isn’t a causal ancestor of the scan. This was causing the `jepsen/multi-register`
tests to fail, which I had previously incorrectly attributed entirely to cockroachdb#36431.

This commit fixes this by returning `WriteIntentError`s for intents when they
are above the read timestamp of a scan but below the max timestamp of a scan.
This could have also been fixed by returning `ReadWithinUncertaintyIntervalError`s
in this situation. Both would eventually have the same effect, but it seems
preferable to kick off concurrency control immediately in this situation and
only fall back to uncertainty handling for committed values. If the intent
ends up being aborted, this could allow the read to avoid moving its timestamp.

This commit will need to be backported all the way back to v2.0.

Release note (bug fix): Consider intents in a read's uncertainty
interval to be uncertain just as if they were committed values. This
removes the potential for stale reads when a causally dependent
transaction runs into the not-yet resolved intents from a causal
ancestor.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 10, 2019
…tervals

This commit fixes the most common failure case in all of the following Jepsen
failures. I'm not closing them here though, because they also should be failing
due to cockroachdb#36431.
Would fix cockroachdb#37394.
Would fix cockroachdb#37545.
Would fix cockroachdb#37930.
Would fix cockroachdb#37932.
Would fix cockroachdb#37956.
Would fix cockroachdb#38126.
Would fix cockroachdb#38763.

Before this fix, we were not considering intents in a scan's uncertainty
interval to be uncertain. This had the potential to cause stale reads because
an unresolved intent doesn't indicate that its transaction hasn’t been committed
and isn’t a causal ancestor of the scan. This was causing the `jepsen/multi-register`
tests to fail, which I had previously incorrectly attributed entirely to cockroachdb#36431.

This commit fixes this by returning `WriteIntentError`s for intents when they
are above the read timestamp of a scan but below the max timestamp of a scan.
This could have also been fixed by returning `ReadWithinUncertaintyIntervalError`s
in this situation. Both would eventually have the same effect, but it seems
preferable to kick off concurrency control immediately in this situation and
only fall back to uncertainty handling for committed values. If the intent
ends up being aborted, this could allow the read to avoid moving its timestamp.

This commit will need to be backported all the way back to v2.0.

Release note (bug fix): Consider intents in a read's uncertainty
interval to be uncertain just as if they were committed values. This
removes the potential for stale reads when a causally dependent
transaction runs into the not-yet resolved intents from a causal
ancestor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
4 participants