
jepsen: bank/majority-ring failed with wrong-total #34321

Closed
cockroach-teamcity opened this issue Jan 29, 2019 · 2 comments · Fixed by #34548
Assignees: bdarnell
Labels: O-roachtest, O-robot (Originated from a bot.)
Milestone: 2.2

Comments

@cockroach-teamcity (Member)

SHA: https://github.com/cockroachdb/cockroach/commits/e10fb557b11b5ff1b8609aa963da23c37a1143c8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch1/bank/majority-ring PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1113854&tab=buildLog

The test failed on master:
	test.go:743,jepsen.go:247,jepsen.go:308: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1113854-jepsen-batch1:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.128.0.70 -n 10.128.0.92 -n 10.128.0.78 -n 10.128.0.29 -n 10.128.0.88 \
		   --test bank --nemesis majority-ring \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity added this to the 2.2 milestone Jan 29, 2019
@cockroach-teamcity added the C-test-failure (Broken test, automatically or manually discovered) and O-robot (Originated from a bot) labels Jan 29, 2019
@tbg assigned bdarnell and unassigned petermattis Jan 29, 2019
@tbg removed the C-test-failure (Broken test, automatically or manually discovered) label Jan 29, 2019
@bdarnell changed the title from "roachtest: jepsen-batch1/bank/majority-ring failed" to "jepsen: bank/majority-ring failed with wrong-total" Jan 29, 2019
@bdarnell (Contributor)

This is a real failure:

INFO [2019-01-29 08:42:48,388] jepsen results - jepsen.store Wrote /mnt/data1/jepsen/cockroachdb/store/cockroachdb-bank:majring/20190129T083549.000Z/results.edn
INFO [2019-01-29 08:42:49,270] main - jepsen.core {:perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :timeline {:valid? true},
 :details
 {:valid? false,
  :bad-reads
  [{:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 452,
     :time 233308728429,
     :value [0 11 20 9 13],
     :index 12746}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 387,
     :time 233308992089,
     :value [0 11 20 9 13],
     :index 12748}}
...
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 431,
     :time 296967625647,
     :value [15 0 18 3 17],
     :index 14615}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 541,
     :time 296967653011,
     :value [15 0 18 3 17],
     :index 14616}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 500,
     :time 296967702596,
     :value [15 0 18 3 17],
     :index 14617}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 529,
     :time 296967709008,
     :value [25 7 6 3 12],
     :index 14618}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 670,
     :time 296967733680,
     :value [24 2 17 3 7],
     :index 14620}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 462,
     :time 296986725618,
     :value [25 2 9 3 14],
     :index 14632}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 542,
     :time 296986915503,
     :value [4 9 25 3 12],
     :index 14634}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 510,
     :time 296986999786,
     :value [2 7 27 3 14],
     :index 14635}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 547,
     :time 339949928998,
     :value [0 7 7 14 25],
     :index 14695}}]},
 :valid? false}


Analysis invalid! (ノಥ益ಥ)ノ ┻━┻

@bdarnell (Contributor)

Possibly has the same cause as the TPC-C errors @nvanbenschoten is looking at (#34228?)

This is a bank test with 5 accounts (and network partitions). At some point just prior to operation 12746, one of the transfers created $3 out of thin air, and all subsequent reads report this inconsistency. The nearest $3 transfers were at indexes 12723 and 12725, just before a nemesis activation. The client ultimately got a timeout error on these connections, but not until index 12898 (after the inconsistent reads, and long after the network partition healed at 12733).

{:type :invoke, :f :transfer, :value {:from 1, :to 0, :amount 1}, :process 426, :time 226181944237, :index 12721}
{:type :ok, :f :transfer, :value {:from 2, :to 3, :amount 4}, :process 298, :time 226194131568, :index 12722}
{:type :invoke, :f :transfer, :value {:from 4, :to 0, :amount 3}, :process 298, :time 226194266856, :index 12723}
{:type :ok, :f :read, :process 395, :time 226284989087, :value [4 7 17 9 13], :index 12724}
{:type :invoke, :f :transfer, :value {:from 1, :to 4, :amount 3}, :process 395, :time 226285168556, :index 12725}
{:type :info, :f ["majring" :start], :process :nemesis, :time 226399731031, :index 12726}
{:type :info, :f ["majring" :start], :process :nemesis, :time 226501818010, :value [:isolated {"10.128.0.29" #{"10.128.0.70" "10.128.0.78"}, "10.128.0.92" #{"10.128.0.88" "10.128.0.78"}, "10.128.0.70" #{"10.128.0.29" "10.128.0.88"}, "10.128.0.78" #{"10.128.0.29" "10.128.0.92"}, "10.128.0.88" #{"10.128.0.92" "10.128.0.70"}}], :index 12727}
{:type :info, :f :transfer, :value {:from 4, :to 0, :amount 2}, :process 379, :time 228551244509, :error :timeout, :index 12728}
{:type :info, :f :transfer, :value {:from 0, :to 2, :amount 4}, :process 279, :time 228551450787, :error :timeout, :index 12729}
{:type :invoke, :f :transfer, :value {:from 3, :to 1, :amount 2}, :process 409, :time 228553132511, :index 12730}
{:type :invoke, :f :transfer, :value {:from 3, :to 0, :amount 4}, :process 309, :time 228553133655, :index 12731}
{:type :info, :f ["majring" :stop], :process :nemesis, :time 231502289665, :index 12732}
{:type :info, :f ["majring" :stop], :process :nemesis, :time 231705236185, :value :network-healed, :index 12733}
{:type :info, :f :read, :process 342, :time 232676248887, :error [:psql-exception "ERROR: communication error: rpc error: code = Canceled desc = context canceled"], :index 12734}
{:type :info, :f :read, :process 357, :time 232676303850, :error [:psql-exception "ERROR: communication error: rpc error: code = Canceled desc = context canceled"], :index 12735}
{:type :info, :f :read, :process 422, :time 232677871811, :error [:psql-exception "ERROR: communication error: rpc error: code = Canceled desc = context canceled"], :index 12736}
{:type :invoke, :f :read, :process 387, :time 232677734349, :index 12737}
{:type :invoke, :f :transfer, :value {:from 1, :to 0, :amount 2}, :process 372, :time 232677783855, :index 12738}
{:type :invoke, :f :read, :process 452, :time 232679451919, :index 12739}
{:type :info, :f :read, :process 321, :time 233248572734, :error [:psql-exception "ERROR: communication error: rpc error: code = Canceled desc = context canceled"], :index 12740}
{:type :invoke, :f :transfer, :value {:from 3, :to 4, :amount 4}, :process 351, :time 233250075018, :index 12741}
{:type :ok, :f :transfer, :value {:from 3, :to 4, :amount 4}, :process 351, :time 233267410923, :index 12742}
{:type :invoke, :f :transfer, :value {:from 4, :to 0, :amount 5}, :process 351, :time 233267588090, :index 12743}
{:type :ok, :f :transfer, :value {:from 1, :to 0, :amount 4}, :process 435, :time 233301067125, :index 12744}
{:type :invoke, :f :read, :process 435, :time 233301231876, :index 12745}
{:type :ok, :f :read, :process 452, :time 233308728429, :value [0 11 20 9 13], :index 12746}
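
For context, the bank checker's invariant is simply conservation of money: transfers only move dollars between the five accounts, so every successful read of the balances must sum to the initial total of $50 (presumably $10 per account). The read at index 12746 above, [0 11 20 9 13], sums to 53, which is exactly the :wrong-total error in the results. A minimal Go sketch of that check (illustrative only; the actual checker is Jepsen's Clojure bank test, not this code):

package main

import "fmt"

// checkRead flags a read whose balances do not sum to the expected total,
// mirroring the :wrong-total condition reported by the Jepsen bank checker.
func checkRead(balances []int, expected int) error {
	found := 0
	for _, b := range balances {
		found += b
	}
	if found != expected {
		return fmt.Errorf("wrong-total: expected %d, found %d (balances %v)", expected, found, balances)
	}
	return nil
}

func main() {
	// The failing read at index 12746: [0 11 20 9 13] sums to 53, not 50.
	if err := checkRead([]int{0, 11, 20, 9, 13}, 50); err != nil {
		fmt.Println(err)
	}
}
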

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 5, 2019
Fixes cockroachdb#34025.
Fixes cockroachdb#33624.
Fixes cockroachdb#33335.
Fixes cockroachdb#33151.
Fixes cockroachdb#33149.
Fixes cockroachdb#34159.
Fixes cockroachdb#34293.
Fixes cockroachdb#32813.
Fixes cockroachdb#30886.
Fixes cockroachdb#34228.
Fixes cockroachdb#34321.

It is rare but possible for a replica to become a leaseholder but not
learn about this until it applies a snapshot. Immediately upon the
snapshot application's `ReplicaState` update, the replica will begin
operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger their
in-memory side effects. This could result in a regression in the new
leaseholder's timestamp cache compared to the previous leaseholder's,
allowing write-skew like we saw in cockroachdb#34025. It could presumably
result in other anomalies as well, because all of the steps in
`leasePostApply` were skipped.

This PR fixes this bug by detecting lease updates when applying
snapshots and making sure to react correctly to them. It also likely
fixes the referenced issue. The new test demonstrated that without
this fix, the serializable violation speculated about in the issue
was possible.

Release note (bug fix): Fix bug where lease transfers passed through
Snapshots could forget to update in-memory state on the new leaseholder,
allowing write-skew between read-modify-write operations.
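
The shape of that fix, in a simplified Go sketch: compare the lease carried in the snapshot's `ReplicaState` with the lease the replica currently knows about, and when they differ, run the same side effects a normal lease application would. The types and method bodies below are hypothetical stand-ins, not CockroachDB's actual storage code; only the name `leasePostApply` comes from the commit message.

package storagesketch

// Lease is a hypothetical stand-in for the lease record carried in ReplicaState.
type Lease struct {
	Sequence int64
	Holder   string
}

// Replica is a hypothetical stand-in holding only the in-memory lease.
type Replica struct {
	lease Lease
}

// leasePostApply stands in for the side-effect hook named in the commit
// message (timestamp cache update, lease bookkeeping, and so on).
func (r *Replica) leasePostApply(newLease Lease) {
	// ... bump the timestamp cache low-water mark, etc.
	r.lease = newLease
}

// applySnapshot shows the essence of the fix: detect a lease change in the
// incoming snapshot state and react to it, rather than installing it silently.
func (r *Replica) applySnapshot(snapLease Lease) {
	if snapLease != r.lease {
		// Before the fix, this reaction was effectively skipped for leases
		// that arrived via snapshot.
		r.leasePostApply(snapLease)
	}
}
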
craig bot pushed a commit that referenced this issue Feb 5, 2019
34548: storage: apply lease change side-effects on snapshot recipients r=nvanbenschoten a=nvanbenschoten

Fixes #34025.
Fixes #33624.
Fixes #33335.
Fixes #33151.
Fixes #33149.
Fixes #34159.
Fixes #34293.
Fixes #32813.
Fixes #30886.
Fixes #34228.
Fixes #34321.

It is rare but possible for a replica to become a leaseholder but not learn about this until it applies a snapshot. Immediately upon the snapshot application's `ReplicaState` update, the replica will begin operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger their in-memory side effects. This could result in a regression in the new leaseholder's timestamp cache compared to the previous leaseholder's cache, allowing write-skew like we saw in #34025. It could presumably result in other anomalies as well, because all of the steps in `leasePostApply` were skipped (as theorized in #34025 (comment)).

This PR fixes this bug by detecting lease updates when applying snapshots and making sure to react correctly to them. It also likely fixes the referenced issue. The new test demonstrates that without this fix, the serializable violation speculated about in the issue was possible.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig (bot) closed this as completed in #34548 Feb 5, 2019