roachtest: jepsen/multi-register/majority-ring-start-kill-2 failed [hopefully #36431] #38763

cockroach-teamcity · 2019-07-09T14:12:02Z

SHA: https://github.com/cockroachdb/cockroach/commits/8c6fdc64908a13291e4ddc5d233bbbaa379e71a2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/multi-register/majority-ring-start-kill-2 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1378458&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190709-1378458/jepsen/multi-register/majority-ring-start-kill-2/run_1
	jepsen.go:256,jepsen.go:316,test_runner.go:670: exit status 1


cockroach-teamcity · 2019-07-11T13:52:09Z

SHA: https://github.com/cockroachdb/cockroach/commits/07607a73daedbf57f47e42e2e3d6cfd529ab65a5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/majority-ring-start-kill-2 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1382906&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190711-1382906/jepsen/register/majority-ring-start-kill-2/run_1
	jepsen.go:256,jepsen.go:325,test_runner.go:678: exit status 1

cockroach-teamcity · 2019-08-01T17:30:05Z

SHA: https://github.com/cockroachdb/cockroach/commits/da56c792e968574b8f1d9ef3fdb45d56a530221a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen/register/majority-ring-start-kill-2 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1415578&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190801-1415578/jepsen/register/majority-ring-start-kill-2/run_1
	jepsen.go:264,jepsen.go:325,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564640260-58-n6cpu4:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.128.0.42 -n 10.128.0.40 -n 10.128.0.43 -n 10.128.0.47 -n 10.128.0.49 \
		   --test register --nemesis majority-ring --nemesis2 start-kill-2 \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  exit status 255
		: exit status 1

nvanbenschoten · 2019-08-01T21:31:26Z

The previous failure was a duplicate of #39218, which is fixed by cockroachdb/jepsen#23.

…tervals

This commit fixes the most common failure case in all of the following Jepsen failures. I'm not closing them here though, because they also should be failing due to cockroachdb#36431.

Would fix cockroachdb#37394.
Would fix cockroachdb#37545.
Would fix cockroachdb#37930.
Would fix cockroachdb#37932.
Would fix cockroachdb#37956.
Would fix cockroachdb#38126.
Would fix cockroachdb#38763.

Before this fix, we were not considering intents in a scan's uncertainty interval to be uncertain. This had the potential to cause stale reads because an unresolved intent doesn't indicate that its transaction hasn’t been committed and isn’t a causal ancestor of the scan. This was causing the `jepsen/multi-register` tests to fail, which I had previously incorrectly attributed entirely to cockroachdb#36431.

This commit fixes this by returning `WriteIntentError`s for intents when they are above the read timestamp of a scan but below the max timestamp of a scan. This could have also been fixed by returning `ReadWithinUncertaintyIntervalError`s in this situation. Both would eventually have the same effect, but it seems preferable to kick off concurrency control immediately in this situation and only fall back to uncertainty handling for committed values. If the intent ends up being aborted, this could allow the read to avoid moving its timestamp.

This commit will need to be backported all the way back to v2.0.

Release note (bug fix): Consider intents in a read's uncertainty interval to be uncertain just as if they were committed values. This removes the potential for stale reads when a causally dependent transaction runs into the not-yet resolved intents from a causal ancestor.

40600: storage/engine: return WriteIntentError for intents in uncertainty intervals r=petermattis a=nvanbenschoten

This commit fixes the most common failure case in all of the following Jepsen failures. I'm not closing them here though, because they also should be failing due to #36431.

Would fix #37394.
Would fix #37545.
Would fix #37930.
Would fix #37932.
Would fix #37956.
Would fix #38126.
Would fix #38763.

Before this fix, we were not considering intents in a scan's uncertainty interval to be uncertain. This had the potential to cause stale reads because an unresolved intent doesn't indicate that its transaction hasn’t been committed and isn’t a causal ancestor of the scan. This was causing the `jepsen/multi-register` tests to fail, which I had previously incorrectly attributed entirely to #36431.

This commit fixes this by returning `WriteIntentError`s for intents when they are above the read timestamp of a scan but below the max timestamp of a scan. This could have also been fixed by returning `ReadWithinUncertaintyIntervalError`s in this situation. Both would eventually have the same effect, but it seems preferable to kick off concurrency control immediately in this situation and only fall back to uncertainty handling for committed values. If the intent ends up being aborted, this could allow the read to avoid moving its timestamp.

This commit will need to be backported all the way back to v2.0.

Release note (bug fix): Consider intents in a read's uncertainty interval to be uncertain just as if they were committed values. This removes the potential for stale reads when a causally dependent transaction runs into the not-yet resolved intents from a causal ancestor.

40603: make: pass TESTFLAGS to roachprod-stress, not GOFLAGS r=petermattis a=nvanbenschoten

Passing the testflags through the GOFLAGS env var was causing the following error:

```
stringer -output=pkg/sql/opt/rule_name_string.go -type=RuleName pkg/sql/opt/rule_name.go pkg/sql/opt/rule_name.og.go
stringer: go [list -f {{context.GOARCH}} {{context.Compiler}} -tags= -- unsafe]: exit status 1: go: parsing $GOFLAGS: non-flag "storage.test"
Makefile:1496: recipe for target 'pkg/sql/opt/rule_name_string.go' failed
make: *** [pkg/sql/opt/rule_name_string.go] Error 1
make: *** Waiting for unfinished jobs....
```

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>


cockroach-teamcity added this to the 19.2 milestone Jul 9, 2019

cockroach-teamcity assigned andreimatei Jul 9, 2019

cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 9, 2019

nvanbenschoten assigned nvanbenschoten and unassigned andreimatei Jul 9, 2019

nvanbenschoten changed the title ~~roachtest: jepsen/multi-register/majority-ring-start-kill-2 failed~~ roachtest: jepsen/multi-register/majority-ring-start-kill-2 failed [hopefully #36431] Jul 9, 2019

nvanbenschoten mentioned this issue Sep 9, 2019

storage/engine: return WriteIntentError for intents in uncertainty intervals #40600

Merged

craig bot closed this as completed in d20419d Sep 9, 2019
