
roachtest: tpcc/w=100/nodes=3/chaos=true failed #34293

Closed
cockroach-teamcity opened this issue Jan 28, 2019 · 3 comments · Fixed by #34548
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/9f084ad576e85756c5c5a7e41335d9aa2d3eee30

Parameters:

To repro, try:

# Don't forget to check out a clean, suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# try using stress instead of stressrace, or pass the '-p' stress flag, which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/w=100/nodes=3/chaos=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1112101&tab=buildLog

The test failed on master:
	test.go:743,cluster.go:1195,tpcc.go:130,tpcc.go:182: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1112101-tpcc-w-100-nodes-3-chaos-true:4 -- ./workload check tpcc --warehouses=100 {pgurl:1} returned:
		stderr:
		
		stdout:
		I190128 15:31:29.230660 1 workload/tpcc/tpcc.go:288  check 3.3.2.1 took 29.119545ms
		I190128 15:31:32.440868 1 workload/tpcc/tpcc.go:288  check 3.3.2.2 took 3.21016191s
		Error: check failed: 3.3.2.2: inequality at idx 435: order: 3010.000000, newOrder: 3010.000000, district-1: 3009.000000
		Error:  exit status 1
		: exit status 1
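
For context, this is TPC-C consistency condition 3.3.2.2: within every district, `d_next_o_id - 1` must equal both the maximum `o_id` in the `order` table and the maximum `no_o_id` in `new_order`. In the failure above, `order` and `new_order` agree at 3010 while the district's next-order counter stops at 3009, i.e. one of the three views of "the last order id issued" has regressed. Below is a minimal sketch of that check, assuming the standard TPC-C schema and a placeholder connection string; it is not the `workload check` implementation.

```go
// Hedged sketch of TPC-C consistency condition 3.3.2.2, not the actual
// `workload check` code. Table and column names follow the TPC-C schema;
// the connection string is a placeholder.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // any Postgres-wire driver works against CockroachDB
)

func check3322(db *sql.DB) error {
	rows, err := db.Query(`
		SELECT d_w_id, d_id, d_next_o_id - 1,
		       (SELECT max(o_id) FROM "order" WHERE o_w_id = d_w_id AND o_d_id = d_id),
		       (SELECT max(no_o_id) FROM new_order WHERE no_w_id = d_w_id AND no_d_id = d_id)
		FROM district`)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var wID, dID, fromDistrict int64
		var fromOrder, fromNewOrder sql.NullInt64
		if err := rows.Scan(&wID, &dID, &fromDistrict, &fromOrder, &fromNewOrder); err != nil {
			return err
		}
		if !fromOrder.Valid || !fromNewOrder.Valid {
			continue // districts with no orders are skipped here for brevity
		}
		// All three views of "the last order id issued" must agree.
		if fromDistrict != fromOrder.Int64 || fromOrder.Int64 != fromNewOrder.Int64 {
			return fmt.Errorf("3.3.2.2: inequality at w=%d d=%d: order: %d, newOrder: %d, district-1: %d",
				wID, dID, fromOrder.Int64, fromNewOrder.Int64, fromDistrict)
		}
	}
	return rows.Err()
}

func main() {
	// Placeholder URL; point it at any node of the cluster under test.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	if err := check3322(db); err != nil {
		log.Fatal(err)
	}
	fmt.Println("check 3.3.2.2 passed")
}
```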

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Jan 28, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 28, 2019
@tbg tbg assigned nvanbenschoten and unassigned tbg and nvanbenschoten Jan 28, 2019
@nvanbenschoten

Duplicate of #34025. Keeping open for now since it has a different symptom.

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/395d842feb97c5bd8cad2b32b71a5156c03061eb

Parameters:

To repro, try:

# Don't forget to check out a clean, suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# try using stress instead of stressrace, or pass the '-p' stress flag, which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/w=100/nodes=3/chaos=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1115923&tab=buildLog

The test failed on master:
	test.go:743,cluster.go:1226,tpcc.go:118,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1115923-tpcc-w-100-nodes-3-chaos-true:4 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=30m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		    0.0      0.0 orderStatus
		   4m36s  4638542            0.0           41.6      0.0      0.0      0.0      0.0 payment
		   4m36s  4638542            0.0            4.6      0.0      0.0      0.0      0.0 stockLevel
		E190130 15:28:41.709374 1 workload/cli/run.go:402  error in newOrder: dial tcp 10.128.0.38:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   4m37s  4656783            0.0            4.5      0.0      0.0      0.0      0.0 delivery
		   4m37s  4656783            0.0           43.6      0.0      0.0      0.0      0.0 newOrder
		   4m37s  4656783            0.0            4.6      0.0      0.0      0.0      0.0 orderStatus
		   4m37s  4656783            0.0           41.5      0.0      0.0      0.0      0.0 payment
		   4m37s  4656783            0.0            4.5      0.0      0.0      0.0      0.0 stockLevel
		E190130 15:28:42.709447 1 workload/cli/run.go:402  error in orderStatus: dial tcp 10.128.0.38:26257: connect: connection refused
		: signal: killed
	test.go:743,cluster.go:1585,tpcc.go:128,tpcc.go:182: unexpected node event: 1: dead

@nvanbenschoten

The previous failure was:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x190242d]

goroutine 719576 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc0001c5830, 0x3898dc0, 0xc00f811050)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:183 +0x11f
panic(0x2d24dc0, 0x546ff60)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/cockroachdb/cockroach/pkg/storage.(*truncateDecision).raftSnapshotsForIndex(0xc0060625f0, 0x0, 0xc0054edb00)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_log_queue.go:226 +0x3d

Fixed by #34399.
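
For readers less used to Go crash reports: the `addr=0x50` in the SIGSEGV line is the usual signature of reading a field through a nil pointer, where the fault address is simply the field's offset from address zero. A contrived, self-contained illustration of that failure mode follows; the types below are stand-ins, not the code from `raft_log_queue.go` or the actual fix in #34399.

```go
// Contrived illustration of a nil-pointer field dereference; these types are
// stand-ins and are NOT the real truncateDecision from raft_log_queue.go.
package main

type raftStatus struct {
	padding [10]uint64 // pushes the next field to offset 0x50, purely for illustration
	applied uint64     // reading this through a nil *raftStatus faults at that offset
}

type truncateDecision struct {
	status *raftStatus // left nil by the (hypothetical) buggy caller
}

func (d *truncateDecision) raftSnapshotsForIndex(index uint64) int {
	// d itself is non-nil, but d.status is nil, so d.status.applied panics with
	// "runtime error: invalid memory address or nil pointer dereference" and a
	// small fault address equal to the field offset.
	if index <= d.status.applied {
		return 1
	}
	return 0
}

func main() {
	d := &truncateDecision{} // status never set
	_ = d.raftSnapshotsForIndex(10)
}
```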

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 5, 2019
Fixes cockroachdb#34025.
Fixes cockroachdb#33624.
Fixes cockroachdb#33335.
Fixes cockroachdb#33151.
Fixes cockroachdb#33149.
Fixes cockroachdb#34159.
Fixes cockroachdb#34293.
Fixes cockroachdb#32813.
Fixes cockroachdb#30886.
Fixes cockroachdb#34228.
Fixes cockroachdb#34321.

It is rare but possible for a replica to become a leaseholder but not
learn about this until it applies a snapshot. Immediately upon the
snapshot application's `ReplicaState` update, the replica will begin
operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger
in-memory side-effects to be performed. This could result in a regression
in the new leaseholder's timestamp cache compared to the previous
leaseholder, allowing write-skew like we saw in cockroachdb#34025. This could
presumably result in other anomalies as well, because all of the
steps in `leasePostApply` were skipped.

This PR fixes this bug by detecting lease updates when applying
snapshots and making sure to react correctly to them. It also likely
fixes the referenced issue. The new test demonstrated that without
this fix, the serializable violation speculated about in the issue
was possible.

Release note (bug fix): Fix bug where lease transfers passed through
Snapshots could forget to update in-memory state on the new leaseholder,
allowing write-skew between read-modify-write operations.
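
Below is a minimal, self-contained sketch of the idea described in the commit message, using simplified stand-in types rather than the real `storage.Replica`/`ReplicaState` code; the actual change is in #34548. The essential point is that the lease comparison also runs on the snapshot path, so a lease that first becomes visible via a snapshot still triggers the leaseholder side effects.

```go
// Sketch only: stand-in types showing the shape of "apply lease change
// side-effects on snapshot recipients". Not the code from cockroach#34548.
package main

import "fmt"

// Lease is a stand-in for roachpb.Lease: who holds the range lease.
type Lease struct {
	Holder   int // replica ID of the leaseholder
	Sequence int // increments on every lease change
}

// ReplicaState is a stand-in for the state installed by a snapshot.
type ReplicaState struct {
	Lease Lease
}

// Replica is a stand-in holding one replica's in-memory view of its range.
type Replica struct {
	state ReplicaState
}

// leasePostApply stands in for the in-memory side effects of a lease change
// (e.g. bumping the timestamp cache low-water mark on the new leaseholder).
func (r *Replica) leasePostApply(prev, next Lease) {
	fmt.Printf("lease changed %+v -> %+v: applying leaseholder side effects\n", prev, next)
}

// applySnapshot installs the snapshot's ReplicaState. The fix is the explicit
// lease comparison: if the snapshot carries a newer lease, react to it exactly
// as if the lease had arrived through the normal Raft command path.
func (r *Replica) applySnapshot(snapState ReplicaState) {
	prev := r.state.Lease
	r.state = snapState
	if next := snapState.Lease; next.Sequence != prev.Sequence {
		r.leasePostApply(prev, next)
	}
}

func main() {
	r := &Replica{state: ReplicaState{Lease: Lease{Holder: 2, Sequence: 5}}}
	// A snapshot arrives that also makes this replica the leaseholder.
	r.applySnapshot(ReplicaState{Lease: Lease{Holder: 1, Sequence: 6}})
}
```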
craig bot pushed a commit that referenced this issue Feb 5, 2019
34548: storage: apply lease change side-effects on snapshot recipients r=nvanbenschoten a=nvanbenschoten

Fixes #34025.
Fixes #33624.
Fixes #33335.
Fixes #33151.
Fixes #33149.
Fixes #34159.
Fixes #34293.
Fixes #32813.
Fixes #30886.
Fixes #34228.
Fixes #34321.

It is rare but possible for a replica to become a leaseholder but not learn about this until it applies a snapshot. Immediately upon the snapshot application's `ReplicaState` update, the replica will begin operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger in-memory side-effects to be performed. This could result in a regression in the new leaseholder's timestamp cache compared to the previous leaseholder's cache, allowing write-skew like we saw in #34025. This could presumably result in other anomalies as well, because all of the steps in `leasePostApply` were skipped (as theorized by #34025 (comment)).

This PR fixes this bug by detecting lease updates when applying snapshots and making sure to react correctly to them. It also likely fixes the referenced issue. The new test demonstrates that without this fix, the serializable violation speculated about in the issue was possible.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in #34548 Feb 5, 2019