
roachtest: cdc/tpcc-1000 failed #32813

Closed
cockroach-teamcity opened this issue Dec 4, 2018 · 15 comments · Fixed by #34548
Labels: A-cdc Change Data Capture, C-test-failure Broken test (automatically or manually discovered), O-roachtest, O-robot Originated from a bot.

@cockroach-teamcity (Member)

SHA: https://github.com/cockroachdb/cockroach/commits/e6cb0c5c329617b560eee37527248171b5e06382

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1038478&tab=buildLog

The test failed on master:
	test.go:630,cluster.go:1141,cdc.go:531,cdc.go:103,cluster.go:1467,errgroup.go:58: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1038478-cdc-tpcc-1000:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		: signal: killed
	test.go:630,cluster.go:1488,cdc.go:179,cdc.go:319: pq: AS OF SYSTEM TIME: cannot specify timestamp in the future

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Dec 4, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Dec 4, 2018
@tbg tbg added the A-cdc Change Data Capture label Dec 4, 2018
@danhhz (Contributor)

danhhz commented Dec 4, 2018

Huh, it passes timeutil.Now().UnixNano(). I've never seen this before, so maybe it's something very rare, but there's no reason we couldn't subtract a second from that time as insurance.

danhhz added a commit to danhhz/cockroach that referenced this issue Dec 10, 2018
We got an `AS OF SYSTEM TIME: cannot specify timestamp in the future`
error. I can't imagine how this would have happened besides the clocks
being out of sync, so just subtract a second.

Closes cockroachdb#32813

Release note: None
@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/ea4c00f5d7c33ece947a080cf56e63a33826565b

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1064508&tab=buildLog

The test failed on master:
	test.go:703,cluster.go:1137,cdc.go:451,cdc.go:83,cdc.go:320: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1064508-cdc-tpcc-1000:4 -- yes | sudo apt-get -q install default-jre returned:
		stderr:
		
		stdout:
		 [69.3 kB]
		Get:80 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial/main amd64 default-jre amd64 2:1.8-56ubuntu2 [980 B]
		Get:81 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial/main amd64 fonts-dejavu-extra all 2.35-1 [1,749 kB]
		Get:82 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-updates/main amd64 hicolor-icon-theme all 0.15-0ubuntu1.1 [7,698 B]
		Get:83 http://us-central1.gce.archive.ubuntu.com/ubuntu xenial-updates/main amd64 libgtk2.0-bin amd64 2.24.30-1ubuntu1.16.04.2 [9,834 B]
		Fetched 43.7 MB in 1min 44s (417 kB/s)
		E: Failed to fetch http://us-central1.gce.archive.ubuntu.com/ubuntu/pool/main/l/llvm-toolchain-6.0/libllvm6.0_6.0-1ubuntu2~16.04.1_amd64.deb  504  Gateway Time-out [IP: 35.225.153.130 80]
		
		E: Failed to fetch http://us-central1.gce.archive.ubuntu.com/ubuntu/pool/main/p/pango1.0/libpangocairo-1.0-0_1.38.1-1_amd64.deb  504  Gateway Time-out [IP: 35.225.153.130 80]
		
		E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
		Error:  exit status 100
		: exit status 1

@petermattis (Collaborator)

Latest failure is an apt-get issue.

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/98ef7abf32784b8e837d18d10173ef083010ad45

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1067238&tab=buildLog

The test failed on master:
	test.go:703,cluster.go:1137,cdc.go:552,cdc.go:107,cluster.go:1463,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1067238-cdc-tpcc-1000:4 -- ./workload run tpcc --warehouses=1000 --duration=120m  {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		: signal: killed
	test.go:703,cluster.go:1484,cdc.go:183,cdc.go:323: pq: AS OF SYSTEM TIME: timestamp before 1970-01-01T00:00:00Z is invalid

@danhhz (Contributor)

danhhz commented Dec 26, 2018

Note that this AS OF SYSTEM TIME error is coming out of workload run tpcc not any of the changefeed stuff. Perhaps this is the same underlying cause as #33317. I don't immediately see any use of AS OF SYSTEM TIME in the tpcc workload or in workload itself.

danhhz added a commit to danhhz/cockroach that referenced this issue Dec 26, 2018
The timestamp for the initial scan was previously generated on the test
runner, which is usually someone's laptop or a teamcity machine. Avoid
this by using the time interval AS OF SYSTEM TIME notation.

This is a better fix for cockroachdb#32813

Release note: None
craig bot pushed a commit that referenced this issue Dec 26, 2018
33363: roachtest: deflake cdc/initial-scan r=mrtracy a=danhhz

The timestamp for the initial scan was previously generated on the test
runner, which is usually someone's laptop or a teamcity machine. Avoid
this by using the time interval AS OF SYSTEM TIME notation.

This is a better fix for #32813

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/5cb2c3803d0ea3342415ab1a72ed86d356510e0b

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1086777&tab=buildLog

The test failed on master:
	test.go:696,cluster.go:1164,cdc.go:540,cdc.go:100,cdc.go:323: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1086777-cdc-tpcc-1000:4 -- ./workload fixtures load tpcc --warehouses=1000 --checks=false {pgurl:2} returned:
		stderr:
		Error: failed to create google cloud client (You may need to setup the GCS application default credentials: 'gcloud auth application-default login --project=cockroach-shared'): dialing: google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
		Error:  exit status 1
		
		stdout:
		: exit status 1

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/d0e2f78183d2ce0a6b803127ee80143571c9cd4f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1086794&tab=buildLog

The test failed on release-2.1:
	test.go:696,cluster.go:1164,cdc.go:540,cdc.go:100,cdc.go:323: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1086794-cdc-tpcc-1000:4 -- ./workload fixtures load tpcc --warehouses=1000 --checks=false {pgurl:3} returned:
		stderr:
		Error: failed to create google cloud client (You may need to setup the GCS application default credentials: 'gcloud auth application-default login --project=cockroach-shared'): dialing: google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
		Error:  exit status 1
		
		stdout:
		: exit status 1

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/fe6fbbb99f51f414804daaeb704635ee0ff17b28

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1091924&tab=buildLog

The test failed on master:
	test.go:696,cluster.go:1164,cdc.go:552,cdc.go:107,cluster.go:1490,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1091924-cdc-tpcc-1000:4 -- ./workload run tpcc --warehouses=1000 --duration=120m  {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		4831.8 delivery
		     29s        0          396.6          151.4   5368.7  10737.4  11274.3  11274.3 newOrder
		     29s        0           24.0           24.1     19.9    536.9    671.1    671.1 orderStatus
		     29s        0          513.5          235.4   4563.4  10737.4  10737.4  10737.4 payment
		     29s        0           31.0           26.0    260.0   1275.1   7784.6   7784.6 stockLevel
		     30s        0           11.0           25.6    226.5    637.5    805.3    805.3 delivery
		     30s        0          664.2          168.6   5637.1  12348.0  12348.0  12348.0 newOrder
		     30s        0           24.9           24.1     14.7     56.6    184.5    184.5 orderStatus
		     30s        0          555.6          246.2   2415.9   9663.7  10737.4  11811.2 payment
		     30s        0           23.9           25.9    192.9   6979.3   8053.1   8053.1 stockLevel
		Error: error in newOrder: ERROR: duplicate key value (o_w_id,o_d_id,o_id)=(791,3,3001) violates unique constraint "primary" (SQLSTATE 23505)
		Error:  exit status 1
		: exit status 1
	test.go:696,cluster.go:1511,cdc.go:183,cdc.go:323: Goexit() was called

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/772c8001911d14119cd78078c0c1acbc15c59142

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1107318&tab=buildLog

The test failed on release-2.1:
	test.go:743,test.go:755: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1107318-cdc-tpcc-1000 -n 4 --gce-machine-type=n1-highcpu-16 --gce-zones=us-central1-b,us-west1-b,europe-west2-b --local-ssd-no-ext4-barrier returned:
		stderr:
		
		stdout:
		0.128.0.66	146.148.92.90
		  teamcity-1107318-cdc-tpcc-1000-0004	teamcity-1107318-cdc-tpcc-1000-0004.us-central1-b.cockroach-ephemeral	10.128.0.60	130.211.205.172
		Syncing...
		teamcity-1107318-cdc-tpcc-1000: waiting for nodes to start.................................................................
		generating ssh key.
		distributing ssh key...................................................................................................................................
		2: exit status 255
		~ tar xf -
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).SetupSSH.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:525
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1301
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333: 
		I190124 06:42:18.731941 1 cluster_synced.go:1383  command failed
		: exit status 1

@danhhz (Contributor)

danhhz commented Jan 24, 2019

This last one looks like it failed during test setup.

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/8cbeb534432b81c57564956ed7d645b854b426be

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1111300&tab=buildLog

The test failed on master:
	test.go:743,cluster.go:1195,cdc.go:582,cdc.go:116,cluster.go:1533,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1111300-cdc-tpcc-1000:4 -- ./workload run tpcc --warehouses=1000 --duration=120m  {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		 771.8 delivery
		     29s        0          496.1          155.4   5368.7  10737.4  11274.3  11274.3 newOrder
		     29s        0           23.0           25.1     27.3    260.0    260.0    260.0 orderStatus
		     29s        0          302.5          233.3   2684.4   9126.8  10200.5  11274.3 payment
		     29s        0           31.9           24.2    939.5   3087.0   5100.3   5100.3 stockLevel
		     30s        0           13.0           24.3    226.5    352.3    402.7    402.7 delivery
		     30s        0          720.1          174.3   2952.8  11811.2  12348.0  12348.0 newOrder
		     30s        0           19.9           25.0     16.8     62.9     67.1     67.1 orderStatus
		     30s        0          379.0          238.2   1476.4   9126.8  11274.3  11811.2 payment
		     30s        0           21.9           24.1    192.9   3221.2   5100.3   5100.3 stockLevel
		Error: error in newOrder: ERROR: duplicate key value (o_w_id,o_d_id,o_id)=(777,1,3001) violates unique constraint "primary" (SQLSTATE 23505)
		Error:  exit status 1
		: exit status 1
	test.go:743,cluster.go:1554,cdc.go:193,cdc.go:336: Goexit() was called

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/e10fb557b11b5ff1b8609aa963da23c37a1143c8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1113854&tab=buildLog

The test failed on master:
	test.go:743,cluster.go:1226,cdc.go:582,cdc.go:116,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1113854-cdc-tpcc-1000:4 -- ./workload run tpcc --warehouses=1000 --duration=120m  {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		7516.2 delivery
		     27s        0          523.2          150.1   8053.1   9126.8   9126.8   9126.8 newOrder
		     27s        0           13.0           25.0    469.8   1744.8   2415.9   2415.9 orderStatus
		     27s        0          473.1          240.0   4563.4   7516.2   8053.1   9126.8 payment
		     27s        0           36.1           26.7    671.1   6710.9   7784.6   7784.6 stockLevel
		     28s        0           17.0           27.1    151.0    385.9    402.7    402.7 delivery
		     28s        0          625.8          167.1   4026.5   9663.7  10200.5  10200.5 newOrder
		     28s        0           16.0           24.7     14.7     62.9     79.7     79.7 orderStatus
		     28s        0          390.3          245.4   1879.0   7516.2   9126.8   9663.7 payment
		     28s        0           22.0           26.6    104.9   5100.3   6442.5   6442.5 stockLevel
		Error: error in newOrder: ERROR: duplicate key value (o_w_id,o_d_id,o_id)=(979,7,3002) violates unique constraint "primary" (SQLSTATE 23505)
		Error:  exit status 1
		: exit status 1
	test.go:743,cluster.go:1585,cdc.go:193,cdc.go:336: Goexit() was called

@nvanbenschoten (Member)

Interesting that all of the duplicate key value failures happen right around 30s.

@danhhz (Contributor)

danhhz commented Jan 30, 2019

@nvanbenschoten Huh, that is interesting. I wonder if there are any interesting 30s constants. Closed timestamps? I don't see anything in the roachtest.

Note that these failures were run with rangefeed off; the test name was changed when I made that flip.

@nvanbenschoten (Member)

> I wonder if there are any interesting 30s constants. Closed timestamps? I don't see anything in the roachtest

That's an interesting idea. I don't immediately see how that could be related, but I'll keep it in mind.

Another very strange thing I'm seeing: these duplicate key value errors are almost always on order id 3001 (a few are nearby). The initial TPC-C dataset loads 3000 orders in each district, so the issue is almost always on the very first order touched. Suspicious. I'm looking forward to getting to the bottom of this.

Another interesting note is that the duplicate key value failure appears to be more recent (about a week) than the failure we're seeing in #34025. Perhaps they're not exactly the same issue.
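The "almost always order id 3001" observation can be made concrete with a toy model (not CockroachDB code): TPC-C's initial load seeds 3000 orders per district, and newOrder performs a read-modify-write on the district's next order id. If an anomaly lets two transactions observe the same snapshot, both try to insert o_id=3001 and the loser reports the duplicate-key error.

```go
package main

import "fmt"

func main() {
	// Next order id for a district immediately after the initial load
	// of 3000 orders.
	dNextOID := 3001

	// Write-skew scenario: both "transactions" read before either
	// commits its increment, so both pick the same order id.
	txn1 := dNextOID
	txn2 := dNextOID

	fmt.Printf("txn1 inserts o_id=%d\n", txn1)
	fmt.Printf("txn2 inserts o_id=%d (duplicate key)\n", txn2)
}
```

This matches the failures above, where the colliding key is the first order inserted after the fixture load.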

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 5, 2019
Fixes cockroachdb#34025.
Fixes cockroachdb#33624.
Fixes cockroachdb#33335.
Fixes cockroachdb#33151.
Fixes cockroachdb#33149.
Fixes cockroachdb#34159.
Fixes cockroachdb#34293.
Fixes cockroachdb#32813.
Fixes cockroachdb#30886.
Fixes cockroachdb#34228.
Fixes cockroachdb#34321.

It is rare but possible for a replica to become a leaseholder but not
learn about this until it applies a snapshot. Immediately upon the
snapshot application's `ReplicaState` update, the replica will begin
operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger
in-memory side-effects to be performed. This could result in a regression
in the new leaseholder's timestamp cache compared to the previous
leaseholder, allowing write-skew like we saw in cockroachdb#34025. This could
presumably result in other anomalies as well, because all of the
steps in `leasePostApply` were skipped.

This PR fixes this bug by detecting lease updates when applying
snapshots and making sure to react correctly to them. It also likely
fixes the referenced issue. The new test demonstrated that without
this fix, the serializable violation speculated about in the issue
was possible.

Release note (bug fix): Fix bug where lease transfers passed through
Snapshots could forget to update in-memory state on the new leaseholder,
allowing write-skew between read-modify-write operations.
craig bot pushed a commit that referenced this issue Feb 5, 2019
34548: storage: apply lease change side-effects on snapshot recipients r=nvanbenschoten a=nvanbenschoten

Fixes #34025.
Fixes #33624.
Fixes #33335.
Fixes #33151.
Fixes #33149.
Fixes #34159.
Fixes #34293.
Fixes #32813.
Fixes #30886.
Fixes #34228.
Fixes #34321.

It is rare but possible for a replica to become a leaseholder but not learn about this until it applies a snapshot. Immediately upon the snapshot application's `ReplicaState` update, the replica will begin operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger in-memory side-effects to be performed. This could result in a regression in the new leaseholder's timestamp cache compared to the previous leaseholder's cache, allowing write-skew like we saw in #34025. This could presumably result in other anomalies as well, because all of the steps in `leasePostApply` were skipped (as theorized by #34025 (comment)).

This PR fixes this bug by detecting lease updates when applying snapshots and making sure to react correctly to them. It also likely fixes the referenced issue. The new test demonstrates that without this fix, the serializable violation speculated about in the issue was possible.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in #34548 Feb 5, 2019