storage: consistency check failed on cyan #29252

Closed
tbg opened this issue Aug 29, 2018 · 37 comments · Fixed by #29677

tbg commented Aug 29, 2018

https://sentry.io/cockroach-labs/cockroachdb/issues/666877375/

*log.safeError: replica_consistency.go:142: consistency check failed with %d inconsistent replicas | int
  File "github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go", line 151, in CheckConsistency
  File "github.com/cockroachdb/cockroach/pkg/storage/consistency_queue.go", line 114, in process
  File "github.com/cockroachdb/cockroach/pkg/storage/queue.go", line 744, in processReplica
  File "github.com/cockroachdb/cockroach/pkg/storage/queue.go", line 627, in 2
  File "github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go", line 324, in func1

replica_consistency.go:142: consistency check failed with %d inconsistent replicas | int
tbg changed the title from "*log.safeError: replica_consistency.go:142: consistency check failed with %d inconsistent replicas | int" to "storage: consistency check failed on cyan" on Aug 29, 2018

tbg commented Aug 29, 2018

1/cockroach.cockroach-cyan-01.root.2018-08-27T18_02_25Z.060057.log

E180828 09:35:08.646984 42109141 storage/replica_consistency.go:158  [consistencyChecker,n1,s1,r32/8:/Table/5{3-4}] consistency check failed with 1 inconsistent replicas; fetching details
I180828 09:35:09.431344 1777 server/status/runtime.go:433  [n1] runtime stats: 5.2 GiB RSS, 538 goroutines, 375 MiB/247 MiB/778 MiB GO alloc/idle/total, 3.5 GiB/4.5 GiB CGO alloc/total, 14926.76cgo/sec, 1.65/0.36 %(u/s)time, 0.00 %gc (3x)
E180828 09:35:10.657683 42109141 storage/replica_consistency.go:104  [n1,s1,r32/8:/Table/5{3-4}] replica (n5,s5):7 is inconsistent: expected checksum b8f12a1959232ebe9632ecbd87c8bad38b1b52307803eeb2840b713dda044b45d8997df9ea2dffb5146358649ad121aa5adafc3af343829077b9b874d0c87c1b, got fe9c23f49f08099229f1acf8f5c01cd69accaeedc75c953cf45d1eabba422888626c39fac4b241002a4af2e6bf468c43ff30e7e0b4dd71a836eefbb8e23cacc4
--- leaseholder
+++ follower
-0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9"
-    ts:<zero>
-    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x002\x85\x01i\xac\xeaK\x03\n:\n\x10s?\xb0\x94\xe9\xaaMg\x97LqHk\x98#\xb9\x1a\x14\x01k\x12\xbd\x89\x80\x90מR\xe7\x9b\x01D\x00\x01rdsc*\n\b\xd4\xca\xff\xc0\x95ˮ\xa7\x150\x92\x9538\x01\x12\x05merge*\n\b\xc1ϓ\xc1\x95ˮ\xa7\x152\n\b\xd4\xca\xff\xc0\x95ˮ\xa7\x15:\n\b\xd4\xca\xff\xc0\x95ˮ\xa7\x15B\x0e\b\x06\x12\n\b\xd4\xca\xff\xc0\x95ˮ\xa7\x15H\x01r\x00z\x00\x80\x01\x01"
-    raw mvcc_key/value: 016b12bd898090d79e52e79b0144000174786e2d733fb094e9aa4d67974c71486b9823b900 12040800100018002000280032850169acea4b030a3a0a10733fb094e9aa4d67974c71486b9823b91a14016b12bd898090d79e52e79b01440001726473632a0a08d4caffc095cbaea71530929533380112056d657267652a0a08c1cf93c195cbaea715320a08d4caffc095cbaea7153a0a08d4caffc095cbaea715420e0806120a08d4caffc095cbaea715480172007a00800101
E180828 09:35:10.817920 42109141 util/log/crash_reporting.go:477  [n1,s1,r32/8:/Table/5{3-4}] Reported as error 7a67fc8e1f9d436cab357453cd33ed2d


tbg commented Aug 29, 2018

$ ./cockroach debug decode-value 016b12bd898090d79e52e79b0144000174786e2d733fb094e9aa4d67974c71486b9823b900 12040800100018002000280032850169acea4b030a3a0a10733fb094e9aa4d67974c71486b9823b91a14016b12bd898090d79e52e79b01440001726473632a0a08d4caffc095cbaea71530929533380112056d657267652a0a08c1cf93c195cbaea715320a08d4caffc095cbaea7153a0a08d4caffc095cbaea715420e0806120a08d4caffc095cbaea715480172007a00800101
0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9": "merge" id=733fb094 key=/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor rw=true pri=0.03903592 iso=SERIALIZABLE stat=PENDING epo=0 ts=1535369415.873258836,0 orig=1535369415.873258836,0 max=1535369415.873258836,0 wto=false rop=false seq=1

so it looks like this time one of the followers has a pending merge transaction, whereas the other follower and the leader don't have it.


tbg commented Aug 29, 2018

wait a minute, it's the other way around. Presumably leader and one follower have this pending txn, but not another follower?


tbg commented Aug 29, 2018

There's a second, identical, consistency failure in the logs later (which makes sense - once inconsistent, twice inconsistent).

The transaction timestamp boils down to 180827 at 11:30:15, at which point we find this (note that the start key of r5911 is the anchor of the inconsistent txn record):

180827 11:30:14.986610 5269578 storage/replica_command.go:430  [merge,n6,s6,r5911/3:/Table/53/1/-{800975…-798923…}] initiating a merge of r3412:/Table/53/1/-798{9234154929088300-4682932150010947} [(n6,s6):5, (n5,s5):2, (n3,s3):4, next=6, gen=1] into this range

and that seems to go through, for we see all replicas disappear:

180827 11:30:15.022004 20152 storage/store.go:3748  [n1,s1,r3412/1:/Table/53/1/-798{9234…-4682…}] added to replica GC queue (peer suggestion)
180827 11:30:15.022514 20152 storage/store.go:3748  [n1,s1,r3412/1:/Table/53/1/-798{9234…-4682…}] added to replica GC queue (peer suggestion)
180827 11:30:15.098173 3274529 storage/store.go:2656  [replicaGC,n1,s1,r3412/1:/Table/53/1/-798{9234…-4682…}] removing replica r3412/1
180827 11:30:15.134214 3274529 storage/replica.go:880  [replicaGC,n1,s1,r3412/1:/Table/53/1/-798{9234…-4682…}] removed 11 (0+11) keys in 36ms [clear=1ms commit=35ms]
180827 11:30:16.493037 556 storage/store.go:2656  [n6,s6,r5911/3:/Table/53/1/-{800975…-798923…}] removing replica r3412/5
180827 11:30:16.496645 2253 storage/store.go:2656  [n5,s5,r5911/2:/Table/53/1/-{800975…-798923…}] removing replica r3412/2
180827 11:30:16.503067 338 storage/store.go:2656  [n3,s3,r5911/4:/Table/53/1/-{800975…-798923…}] removing replica r3412/4

shortly thereafter we see r5911 itself get eaten:

180827 11:31:49.022537 5787986 storage/replica_command.go:430  [merge,n4,s4,r982/4:/Table/53/1/-80{48444…-09759…}] initiating a merge of r5911:/Table/53/1/-{8009759333344935612-7952945758738317278} [(n6,s6):3, (n5,s5):2, (n4,s4):5, next=6, gen=5] into this range
180827 11:31:50.518419 2168 storage/store.go:2656  [n4,s4,r982/4:/Table/53/1/-80{48444…-09759…}] removing replica r5911/5
180827 11:31:50.520475 559 storage/store.go:2656  [n6,s6,r982/5:/Table/53/1/-80{48444…-09759…}] removing replica r5911/3
180827 11:31:50.524089 2264 storage/store.go:2656  [n5,s5,r982/2:/Table/53/1/-80{48444…-09759…}] removing replica r5911/2

then r982 gets eaten (note r7900's extended bounds in the last line)...

180827 11:43:11.490413 2518453 storage/replica_command.go:430  [merge,n3,s3,r7900/2:/Table/53/1/-80{77894…-48444…}] initiating a merge of r982:/Table/53/1/-{8048444779259766956-7909958576638586199} [(n1,s1):9, (n5,s5):2, (n3,s3):6, next=10, gen=8] into this range
180827 11:43:11.629098 284 storage/store.go:2656  [n3,s3,r7900/2:/Table/53/1/-80{77894…-48444…}] removing replica r982/6
180827 11:43:11.632123 2262 storage/store.go:2656  [n5,s5,r7900/4:/Table/53/1/-80{77894…-48444…}] removing replica r982/2
180827 11:43:11.681154 988 storage/store.go:2656  [n1,s1,r7900/5:/Table/53/1/-80{77894…-48444…}] removing replica r982/9
180827 11:48:14.624472 2654023 storage/replica_command.go:430  [merge,n3,s3,r7900/2:/Table/53/1/-{807789…-790995…}] initiating a merge of r7349

we can chase this down all the way, but I think it's fair to say that the damage has already been done at this point (in fact, the damage likely gets done pretty much at 180827 11:30:16).


tbg commented Aug 29, 2018

With some newly developed options to debug keys:

$ crl-ssh cyan all './cockroach-tobias debug keys --values --enterprise-encryption=path=/mnt/data1,key=keys/active.key,old-key=keys/old.key --from hex:016b12bd898090d79e52e79b0144000174786e2d733fb094e9aa4d67974c71486b9823b900 /mnt/data1 | head -n 1
'
1: cockroach@cockroach-cyan-0001.crdb.io
0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9": "merge" id=733fb094 key=/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor rw=true pri=0.03903592 iso=SERIALIZABLE stat=PENDING epo=0 ts=1535369415.873258836,0 orig=1535369415.873258836,0 max=1535369415.873258836,0 wto=false rop=false seq=1

2: cockroach@cockroach-cyan-0002.crdb.io
0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9": "merge" id=733fb094 key=/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor rw=true pri=0.03903592 iso=SERIALIZABLE stat=PENDING epo=0 ts=1535369415.873258836,0 orig=1535369415.873258836,0 max=1535369415.873258836,0 wto=false rop=false seq=1

3: cockroach@cockroach-cyan-0003.crdb.io
0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9": "merge" id=733fb094 key=/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor rw=true pri=0.03903592 iso=SERIALIZABLE stat=PENDING epo=0 ts=1535369415.873258836,0 orig=1535369415.873258836,0 max=1535369415.873258836,0 wto=false rop=false seq=1

I verified that node i has nodeID i, so our mystery transaction record appears to be present on nodeIDs 1-3.

So my confusion above is confirmed: r32 is on 1, 3, and 5 at the time of the crash, and it seems that 1 and 3 actually have this txn; it's just "missing" on 5.

Obviously it's weird and a bit concerning to have that transaction open in the first place; after all, it's been more than a day and the corresponding merge doesn't seem to be active any more.

Time to look for a merge that would've been anchored on a replica on nodes 1-3. This is confusing because, as far as I can tell, none of the merges I've looked at so far lives on these nodes. Here are a few that do, but none of them really matches up with the timestamp (though there could've been retries; those aren't logged). I don't really think there's much point in looking into these transactions. The merge above has the matching start key and time, so it's more than likely the one that wrote the txn record, just not on n1-3.

$ rg -a --no-line-number --no-filename 'initiating a merge[^\d]' | cut -c '2-' | sort | grep n1 | grep n2 | grep n3 | grep -E 180827\ 11:30\|11:29
180827 11:29:36.496676 2259765 storage/replica_command.go:430  [merge,n3,s3,r2826/3:/Table/53/1/1{198203…-200426…}] initiating a merge of r10017:/Table/53/1/120{0426956837177347-2644061213040745} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=0] into this range
180827 11:29:37.717360 3255526 storage/replica_command.go:430  [merge,n1,s1,r5308/5:/Table/53/1/177{03845…-69966…}] initiating a merge of r10242:/Table/53/1/17{76996673402615308-96867429618878746} [(n3,s3):4, (n2,s2):2, (n1,s1):3, next=5, gen=3] into this range
180827 11:30:36.689365 3283728 storage/replica_command.go:430  [merge,n1,s1,r9113/1:/Table/53/1/5{577175…-615383…}] initiating a merge of r1333:/Table/53/1/56{15383261386172522-28833217815307096} [(n1,s1):1, (n3,s3):5, (n2,s2):4, next=6, gen=3] into this range
180827 11:30:46.770576 3288949 storage/replica_command.go:430  [merge,n1,s1,r10076/4:/Table/53/1/{198046…-200040…}] initiating a merge of r5286:/Table/53/1/20{00405052489571948-65584417125190863} [(n1,s1):6, (n2,s2):2, (n3,s3):5, next=7, gen=3] into this range
180827 11:30:48.147848 1867893 storage/replica_command.go:430  [merge,n5,s5,r3561/2:/Table/53/1/-67{52795…-18942…}] initiating a merge of r6439:/Table/53/1/-671{8942456141413932-6677564676282143} [(n2,s2):6, (n5,s5):2, (n3,s3):3, (n1,s1):5, next=7, gen=0] into this range


tbg commented Aug 29, 2018

I'm seeing this

180827 11:30:16.494360 5272116 storage/replica_command.go:812  [n6,s6,r5911/3:/Table/53/1/-{800975…-798468…}] change replicas (ADD_REPLICA (n4,s4):5): read existing descriptor r5911:/Table/53/1/-{8009759333344935612-7984682932150010947} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=4]
180827 11:30:16.496143 5700331 storage/queue.go:788  [merge,n4,s4,r982/4:/Table/53/1/-80{48444…-09759…}] while adding target n4,s4: change replicas of r5911 failed: descriptor changed: [expected] r5911:/Table/53/1/-{8009759333344935612-7989234154929088300} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=3] != [actual] r5911:/Table/53/1/-{8009759333344935612-7984682932150010947} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=4]: unexpected value: raw_bytes:"\317A\241B\003\010\227.\022\013\275\211\200\220\327\236R\347\233\001D\032\013\275\211\200\2210\265-\211\022\373\275\"\006\010\006\020\006\030\003\"\006\010\005\020\005\030\002\"\006\010\003\020\003\030\004(\0050\004" timestamp:<wall_time:1535369415873258836 >

shortly before 11:30:16, which tells me that most likely, during the problematic merge transaction at 11:30:14, r5911 had the descriptor [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=X].

But isn't something backwards here:

[expected] r5911:/Table/53/1/-{8009759333344935612-7989234154929088300} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=3] != [actual] r5911:/Table/53/1/-{8009759333344935612-7984682932150010947} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=4]:

Check out the end keys: the expected one is larger than the actual one, but the expected one has a smaller generation than the actual one:

7989234154929088300 # exp
7984682932150010947 # act

I'm not finding anything about any splits in these logs.


tbg commented Aug 29, 2018

Some of my analysis above might be muddled. I'm just realizing that on 180827, we didn't have #29079 yet (the SHA was e7d570e), though we did have #29067.

Either way, the problem here can't be pinned on the consistency checker. We have these keys in three places, but those aren't the three places where they should (or should not) be.


tbg commented Aug 29, 2018

@benesch I've got to run out for a while (and I've also stared at this for too long). Time for you to hopefully figure it out :-)

tbg added the S-0-corruption-or-data-loss (Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading) label on Aug 29, 2018
tbg added this to the 2.1 milestone on Aug 30, 2018
tbg added the A-kv-replication (Relating to Raft, consensus, and coordination) label on Aug 30, 2018
tbg added a commit to tbg/cockroach that referenced this issue Aug 30, 2018
This was used in cockroachdb#29252 and I imagine I'll want to use it again whenever
we see the consistency checker fail in the future.

Release note: None
craig bot pushed a commit that referenced this issue Aug 30, 2018
29328: cli: add hex option to debug keys r=a-robinson a=tschottdorf

This was used in #29252 and I imagine I'll want to use it again whenever
we see the consistency checker fail in the future.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Aug 30, 2018
craig bot pushed a commit that referenced this issue Aug 31, 2018
29390: backport-2.1: storage, roachtest, cli: assorted backports r=petermattis a=tschottdorf

storage: remove test-only data race

   The test was stopping the engine before the stopper, so the compactor was
   sometimes able to use the engine while it was being closed.

   Fixes #29302.

   roachtest: improve TestMonitor

   Add two more test cases about the exit status of the `monitor` invocation
   itself.

   roachtest: fix flake in TestMonitor

   Accidentally changed this in #29178.

   cli: add hex option to debug keys

   This was used in #29252 and I imagine I'll want to use it again whenever we
   see the consistency checker fail in the future.

   storage: skip TestClosedTimestampCanServe

   It won't be the first one I'm looking into.


Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added the C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior) label on Sep 4, 2018
tbg self-assigned this on Sep 4, 2018

tbg commented Sep 5, 2018

$ ./cockroach debug range-data --enterprise-encryption=path=/mnt/data1,key=keys/active.key,old-key=keys/old.key --replicated /mnt/data1/ 32 > r32dump.delme
Error: value type is not BYTES: UNKNOWN
Failed running "debug range-data"

Hmm, I assume this is some bug in our mostly untested debug tools. Looking.


tbg commented Sep 5, 2018

Oh, great. Something is broken unmarshaling keys of the following form. I doubt those are actual range descriptors; they probably just look enough like them.

$ ./cockroach-tobias debug range-data --enterprise-encryption=path=/mnt/data1,key=keys/active.key,old-key=keys/old.key --replicated /mnt/data1/ 32
Error: unmarshaling range descriptor at key {Key:/Local/Range/System/tsd/cr.node.sql.mem.sql.max-p75/4/10s/2018-06-18T21:00:00Z/RangeDescriptor/1535369258.738346692,0 Value:[]}: value type is not BYTES: UNKNOWN
Failed running "debug range-data"


tbg commented Sep 5, 2018

Working around this error, we can see the history of range descriptors for r32 in node 1's dump.

First of all, here's the offending txn again:

/Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9"
0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9": "merge" id=733fb094 key=/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor rw=true pri=0.03903592 iso=SERIALIZABLE stat=PENDING epo=0 ts=1535369415.873258836,0 orig=1535369415.873258836,0 max=1535369415.873258836,0 wto=false rop=false seq=1

The following range descriptors are newer than that transaction:

/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535369509.778111977,0
1535369509.778111977,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: ""

/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535369508.540458480,0
1535369508.540458480,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-7952945758738317278)
        Raw:r5911:/Table/53/1/-{8009759333344935612-7952945758738317278} [(n6,s6):3, (n5,s5):2, (n4,s4):5, next=6, gen=5]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535369507.891083783,0
1535369507.891083783,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-7952945758738317278)
        Raw:r5911:/Table/53/1/-{8009759333344935612-7952945758738317278} [(n6,s6):3, (n5,s5):2, (n3,s3):4, (n4,s4):5, next=6, gen=5]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535369467.773388533,0
1535369467.773388533,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-7952945758738317278)
        Raw:r5911:/Table/53/1/-{8009759333344935612-7952945758738317278} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=5]

The above basically says that r5911 (whose keyspace now falls within r32) moved from (n6,n5,n3) to (n6,n5,n4).

Now the interesting range descriptor, which must have been written by the offending transaction (since it's a merge and the timestamps match):

/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535369415.873258836,0
1535369415.873258836,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-7984682932150010947)
        Raw:r5911:/Table/53/1/-{8009759333344935612-7984682932150010947} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=4]

So while located on (n6,n5,n3), the range subsumed a neighbor. The fact that the pending txn record and this committed value exist is already - to me - impossible to explain.

The following descriptors are older. Looking at them in chronological order (i.e. reverse from what's here) the range first lived on (n2,n5,n6), moved to (n6,n5,n3), and then subsumed its right neighbor three times.

/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535368210.638844179,0
1535368210.638844179,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-7989234154929088300)
        Raw:r5911:/Table/53/1/-{8009759333344935612-7989234154929088300} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=3]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535356708.050223658,0
1535356708.050223658,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-8002918179127402730)
        Raw:r5911:/Table/53/1/-800{9759333344935612-2918179127402730} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=2]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535356402.383522155,0
1535356402.383522155,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-8005204013617227225)
        Raw:r5911:/Table/53/1/-800{9759333344935612-5204013617227225} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=1]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1534252593.780488949,0
1534252593.780488949,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-8007483484511996492)
        Raw:r5911:/Table/53/1/-800{9759333344935612-7483484511996492} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=0]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1534252590.540289140,0
1534252590.540289140,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-8007483484511996492)
        Raw:r5911:/Table/53/1/-800{9759333344935612-7483484511996492} [(n6,s6):3, (n5,s5):2, next=4, gen=0]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1530909478.560743969,0
1530909478.560743969,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-8007483484511996492)
        Raw:r5911:/Table/53/1/-800{9759333344935612-7483484511996492} [(n2,s2):1, (n5,s5):2, (n6,s6):3, next=4, gen=0]



tbg commented Sep 5, 2018

Oh, great. Something is broken unmarshaling keys of the following form

I bet these are just the range descriptors for ranges that have been merged away.
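
If that's right, the empty values in the descriptor history above (the RangeDescriptor: "" entry) are MVCC deletion tombstones left behind when a range is merged away, and the tooling would need to skip them rather than try to unmarshal them. A minimal standalone sketch of that skip, using stand-in types rather than the actual debug-tool code:

package main

import (
	"errors"
	"fmt"
)

// versionedValue is a stand-in for one MVCC version of a RangeDescriptor key.
type versionedValue struct {
	ts  string
	raw []byte // empty == deletion tombstone left by a merge
}

var errNotBytes = errors.New("value type is not BYTES: UNKNOWN")

// unmarshalDescriptor stands in for protobuf unmarshaling; like the debug
// tool, it chokes on an empty value.
func unmarshalDescriptor(raw []byte) (string, error) {
	if len(raw) == 0 {
		return "", errNotBytes
	}
	return string(raw), nil
}

func main() {
	history := []versionedValue{
		{ts: "1535369509.778111977,0", raw: nil}, // tombstone: the range was merged away
		{ts: "1535369508.540458480,0", raw: []byte("r5911 [(n6,s6):3, (n5,s5):2, (n4,s4):5, next=6, gen=5]")},
	}
	for _, v := range history {
		if len(v.raw) == 0 {
			fmt.Printf("%s: <deleted>\n", v.ts) // skip tombstones instead of unmarshaling them
			continue
		}
		desc, err := unmarshalDescriptor(v.raw)
		if err != nil {
			fmt.Printf("%s: error: %v\n", v.ts, err)
			continue
		}
		fmt.Printf("%s: %s\n", v.ts, desc)
	}
}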


tbg commented Sep 6, 2018

I don't see where the operation would wait for the RHS to have caught up. If the n9 replica of r10033 hasn't caught up to the merge yet, what in SubsumeRequest would do it? That request seems to only flush out any pending writes on the leaseholder, but it doesn't wait for the followers to be up to date (I thought it did - am I missing this)?


benesch commented Sep 6, 2018

The SubsumeRequest doesn't wait; that's done later in the merge transaction:

err := waitForApplication(ctx, r.store.cfg.NodeDialer, rightDesc, rhsSnapshotRes.LeaseAppliedIndex)
if err != nil {
    return errors.Wrap(err, "waiting for all right-hand replicas to catch up")
}
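
For readers following along, here is a minimal standalone sketch of what that wait amounts to: poll each right-hand replica until it has applied at least the target lease applied index. The names and the polling approach are illustrative assumptions, not the actual waitForApplication implementation:

package main

import (
	"context"
	"fmt"
	"time"
)

// leaseAppliedIndex stands in for an RPC asking a store how far its replica
// of the subsumed range has applied; here it just reads a map.
func leaseAppliedIndex(applied map[string]uint64, store string) uint64 {
	return applied[store]
}

// waitForApplication polls every right-hand replica until each one has
// applied at least the target lease applied index, or the context expires.
func waitForApplication(ctx context.Context, applied map[string]uint64, stores []string, target uint64) error {
	for _, s := range stores {
		for leaseAppliedIndex(applied, s) < target {
			select {
			case <-ctx.Done():
				return fmt.Errorf("store %s still below index %d: %w", s, target, ctx.Err())
			case <-time.After(10 * time.Millisecond):
			}
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	applied := map[string]uint64{"s3": 84, "s9": 84}
	fmt.Println(waitForApplication(ctx, applied, []string{"s3", "s9"}, 84))
}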


tbg commented Sep 6, 2018

Good point. I looked at the server-side impl of that request and one thing that stood out is that it's not synchronizing with the raft mu (just the replica mutex). Looking at how this is updated, that shouldn't make a difference, though - we write the data first, and then bump the lease applied index (both on regular application and snapshots).

I also looked into interleavings in which somehow the second merge could use a stale descriptor that simply wouldn't be checking n9, but found no way that this could happen. The second merge places intents on both range descriptors before doing anything critical, as it should. I still think this is somehow where the bug is, though perhaps I'm wrong.


benesch commented Sep 6, 2018

Extracted and interleaved logs from all nodes: https://gist.github.com/benesch/bbdb986899ca5e972ce5576149b88fbe


tbg commented Sep 6, 2018

The timing of these snapshots is instructive.

$ grep -E 'r9836|r10033' logs/cockroach.log | less

I180906 16:06:26.250154 162811 storage/replica_raftstorage.go:784  [n9,s9,r10033/?:{-}] applying preemptive snapshot at index 79 (id=63e9e1ae, encoded size=20413, 1 rocksdb batches, 56 log entries)
I180906 16:06:26.256535 162811 storage/replica_raftstorage.go:790  [n9,s9,r10033/?:/Table/53/1/316{27334-60422}] applied preemptive snapshot in 6ms [clear=0ms batch=0ms entries=0ms commit=5ms]
I180906 16:06:58.705516 163509 storage/replica_raftstorage.go:784  [n9,s9,r10033/?:/Table/53/1/316{27334-60422}] applying preemptive snapshot at index 84 (id=82de396e, encoded size=36673, 1 rocksdb batches, 61 log entries)
I180906 16:06:58.715247 163509 storage/replica_raftstorage.go:790  [n9,s9,r10033/?:/Table/53/1/31{660422-753605}] applied preemptive snapshot in 10ms [clear=1ms batch=0ms entries=0ms commit=8ms]
I180906 16:06:58.917969 223 storage/store.go:2717  [n9,s9,r9836/2:/Table/53/1/31{599388-627334}] removing replica r10033/5
E180906 16:06:58.928164 163478 storage/stores_server.go:68  consistency check failed on range r9836: expected checksum 14158912b68d5426c90eb8f9791794efdfb153d79a0fccb96f6981ad0bee30db87abe8cf45dcb2c41fa1ee38163cfde32066eb986d1e812354947dfc699deaf7, got b2de02b49291ab8fc38f572f8786ab5cc9a1d71ca0f9787faa14c23763d8ed9a0d9fb730b7896dd97d5dfffc3627642102137471bd4daa067ff592290920a3c4

We see a first preemptive snapshot for r10033 right when r10033 is merging its neighbor, which is probably a first attempt at RelocateRange, and then a second one right before the crash (which is also pretty precisely when the second merge happened).


benesch commented Sep 6, 2018

Well well well. n9 is getting a preemptive snapshot for the right start key... but it applies the snapshot with a totally wrong start key:

I180906 16:06:58.705516 163509 storage/replica_raftstorage.go:784  [n9,s9,r10033/?:/Table/53/1/316{27334-60422}] applying preemptive snapshot at index 84 (id=82de396e, encoded size=36673, 1 rocksdb batches, 61 log entries)
I180906 16:06:58.715247 163509 storage/replica_raftstorage.go:790  [n9,s9,r10033/?:/Table/53/1/31{660422-753605}] applied preemptive snapshot in 10ms [clear=1ms batch=0ms entries=0ms commit=8ms]


benesch commented Sep 6, 2018

Specifically it is somehow using its old end key (from the first snapshot) as the start key. o_O


tbg commented Sep 6, 2018

31660422 is the end key of r10033 before it ate its right neighbor; 31753605 is the end key afterwards.
31627334 is its real start key.

So somehow r10033 transformed itself into the right neighbor it subsumed via a snapshot.


tbg commented Sep 6, 2018

I'm honestly a little surprised that a consistency check is where this fails. Shouldn't some assertion fire?


benesch commented Sep 6, 2018

Replica.setDescWithoutProcessUpdate doesn't look at the descriptor you hand it at all. It just blindly accepts it.


benesch commented Sep 6, 2018

Store.AssertInvariants() would catch it, but that's only called in tests.
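
As an illustration of the kind of cheap store-wide invariant such a check can enforce, here is a sketch that flags overlapping replica spans on a single store. The types are stand-ins and the real Store.AssertInvariants checks more (and differently):

package main

import (
	"fmt"
	"sort"
)

// replicaSpan is a simplified stand-in for a replica's descriptor bounds.
type replicaSpan struct {
	rangeID    int
	start, end string
}

// assertNonOverlapping checks one store-level invariant: no two replicas on
// the same store may cover overlapping key spans.
func assertNonOverlapping(replicas []replicaSpan) error {
	sorted := append([]replicaSpan(nil), replicas...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start < sorted[j].start })
	for i := 1; i < len(sorted); i++ {
		if sorted[i].start < sorted[i-1].end {
			return fmt.Errorf("r%d [%s,%s) overlaps r%d [%s,%s)",
				sorted[i].rangeID, sorted[i].start, sorted[i].end,
				sorted[i-1].rangeID, sorted[i-1].start, sorted[i-1].end)
		}
	}
	return nil
}

func main() {
	// Hypothetical spans for illustration: the second replica's bounds
	// overlap the first, so the assertion reports an error.
	fmt.Println(assertNonOverlapping([]replicaSpan{
		{rangeID: 1, start: "/Table/53/1/100", end: "/Table/53/1/200"},
		{rangeID: 2, start: "/Table/53/1/150", end: "/Table/53/1/300"},
	}))
}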


tbg commented Sep 6, 2018

So is it likely that this snapshot

[n3,s3,r10033/1:/Table/53/1/31{627334-753605}] streamed snapshot to (n9,s9):?: kv pairs: 275, log entries: 61, rate-limit: 2.0 MiB/sec, 2ms

which seems to reflect the correct post-first-merge r10033, somehow addressed only the subsumed portion? Or is the error on the receiving end?

tbg added a commit to tbg/cockroach that referenced this issue Sep 6, 2018

benesch commented Sep 6, 2018

Ok, my mind is blown:

I180906 19:15:08.684945 407329 storage/store_snapshot.go:615  [n10,s10,r21820/3:/Table/53/1/620{38265-46570}] sending preemptive snapshot 69a0384f at applied index 40 with desc r21820:/Table/53/1/620{38265-46570} [(n5,s5):1, (n9,s9):4, (n10,s10):3, next=5, gen=1]
I180906 19:15:08.788794 407329 storage/store_snapshot.go:657  [n10,s10,r21820/3:/Table/53/1/620{38265-46570}] streamed snapshot to (n3,s3):?: kv pairs: 28, log entries: 25, rate-limit: 2.0 MiB/sec, 104ms
I180906 19:15:08.825122 196788 storage/store.go:3555  [n3,s3,r21820/2:/Table/53/1/620{38265-40026}] processing snapshot request (preemptive? true) with desc r21820:/Table/53/1/6204{0026-6570} [(n5,s5):1, (n9,s9):4, (n10,s10):3, next=5, gen=1]
I180906 19:15:08.825281 196788 storage/replica_raftstorage.go:787  [n3,s3,r21820/2:/Table/53/1/620{38265-40026}] applying preemptive snapshot at index 40 (id=69a0384f, encoded size=13907829, 1 rocksdb batches, 25 log entries)
I180906 19:15:08.985761 196788 storage/replica_raftstorage.go:793  [n3,s3,r21820/2:/Table/53/1/6204{0026-6570}] applied preemptive snapshot in 160ms [clear=0ms batch=0ms entries=152ms commit=8ms]

The outgoing snapshot looks correct and yet the incoming snapshot has the busted range descriptor! o_O


tbg commented Sep 6, 2018

Are you trusting this code?

https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/replica_raftstorage.go#L500-L535

In particular, there's snapData.RangeDescriptor vs state.mu.Desc. I wonder if those diverge.


benesch commented Sep 6, 2018

HA! @nvanbenschoten managed to spot the bug in like 30s of looking at this code. It's here:

desc.StartKey = existingDesc.EndKey

We're just mutating the header range descriptor directly instead of making a copy of it. 🤦‍♂️
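
To spell out the failure mode: the descriptor is reached through a pointer in the snapshot header, so assigning to its StartKey rewrites the very descriptor the snapshot is later applied with; copying the struct first avoids that. A minimal standalone sketch with stand-in types (not the actual Store.canApplySnapshotLocked code):

package main

import "fmt"

// RangeDescriptor and SnapshotHeader are simplified stand-ins for the real
// protobuf types; the header carries the descriptor by pointer.
type RangeDescriptor struct {
	StartKey, EndKey string
}

type SnapshotHeader struct {
	Desc *RangeDescriptor
}

// buggyCheck mimics the bug: to validate only the widened portion, it moves
// the start key forward, but it does so on the shared header descriptor.
func buggyCheck(h *SnapshotHeader, existingEndKey string) RangeDescriptor {
	desc := h.Desc
	desc.StartKey = existingEndKey // mutates the header; the snapshot is later applied with this
	return *desc
}

// fixedCheck copies the descriptor first, leaving the header untouched.
func fixedCheck(h *SnapshotHeader, existingEndKey string) RangeDescriptor {
	desc := *h.Desc // value copy
	desc.StartKey = existingEndKey
	return desc
}

func main() {
	h := &SnapshotHeader{Desc: &RangeDescriptor{StartKey: "/Table/53/1/31627334", EndKey: "/Table/53/1/31753605"}}
	_ = buggyCheck(h, "/Table/53/1/31660422")
	fmt.Println("after buggyCheck:", *h.Desc) // start key is now wrong

	h = &SnapshotHeader{Desc: &RangeDescriptor{StartKey: "/Table/53/1/31627334", EndKey: "/Table/53/1/31753605"}}
	_ = fixedCheck(h, "/Table/53/1/31660422")
	fmt.Println("after fixedCheck:", *h.Desc) // unchanged
}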


tbg commented Sep 6, 2018

Hmm, rsl.Load just clones the desc you pass in. So if they diverged, it'd be in a really weird way.


tbg commented Sep 6, 2018

We're just mutating the header range descriptor directly instead of making a copy of it. 🤦‍♂️

❤️ ❤️ ❤️

Glad this got done so quickly.

benesch added a commit to benesch/cockroach that referenced this issue Sep 6, 2018
storage: preserve consistency when applying widening preemptive snapshots

Merges can cause preemptive snapshots that widen existing replicas. For
example, consider the following sequence of events:

    1. A replica of range A is removed from store S, but is not garbage
       collected.
    2. Range A subsumes its right neighbor B.
    3. Range A is re-added to store S.

In step 3, S will receive a preemptive snapshot for A that requires
widening its existing replica, thanks to the intervening merge.

Problematically, the code to check whether this widening was possible,
in Store.canApplySnapshotLocked, was incorrectly mutating the range
descriptor in the snapshot header! Applying the snapshot would then fail
to clear all of the data from the old incarnation of the replica, since
the bounds on the range deletion tombstone were wrong. This often
resulted in replica inconsistency. Plus, the in-memory copy of the range
descriptor would be incorrect until the next descriptor update--though
this usually happened quickly, as the replica would apply the change
replicas command, which updates the descriptor, soon after applying the
preemptive snapshot.

To fix the problem, teach Store.canApplySnapshotLocked to make a copy of
the range descriptor before it mutates it.

To prevent regressions, add an assertion to the descriptor update path
that a range's start key never changes. With this assertion in place,
but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower
reliably fails.

Fixes cockroachdb#29252.

Release note: None
craig bot pushed a commit that referenced this issue Sep 7, 2018
29646: roachtest: better merge testing in clearrange r=benesch a=tschottdorf

Unfortunately, the method to determine the range count is quite slow since
crdb_internal.ranges internally sends an RPC for each range to determine
the leaseholder.

Anecdotally, I've seen ~25% of the merges completed after less than 15
minutes. I know that it's slowing down over time, but @benesch will fix
that.

Also throws in aggressive consistency checks so that when something goes
out of sync, we find out right there.

Release note: None

29677: storage: preserve consistency when applying widening preemptive snapshots r=benesch a=benesch

Merges can cause preemptive snapshots that widen existing replicas. For
example, consider the following sequence of events:

1. A replica of range A is removed from store S, but is not garbage
   collected.
2. Range A subsumes its right neighbor B.
3. Range A is re-added to store S.

In step 3, S will receive a preemptive snapshot for A that requires
widening its existing replica, thanks to the intervening merge.

Problematically, the code to check whether this widening was possible,
in Store.canApplySnapshotLocked, was incorrectly mutating the range
descriptor in the snapshot header! Applying the snapshot would then fail
to clear all of the data from the old incarnation of the replica, since
the bounds on the range deletion tombstone were wrong. This often
resulted in replica inconsistency. Plus, the in-memory copy of the range
descriptor would be incorrect until the next descriptor update--though
this usually happened quickly, as the replica would apply the change
replicas command, which updates the descriptor, soon after applying the
preemptive snapshot.

To fix the problem, teach Store.canApplySnapshotLocked to make a copy of
the range descriptor before it mutates it.

To prevent regressions, add an assertion to the descriptor update path
that a range's start key never changes. With this assertion in place,
but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower
reliably fails.

Fixes #29252.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
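
The regression guard mentioned in the commit message is an assertion in the descriptor update path. A standalone sketch of the idea, with simplified, hypothetical types and method names rather than the real CockroachDB code:

package main

import (
	"fmt"
	"log"
)

// RangeDescriptor and Replica are simplified stand-ins; the real assertion
// lives in CockroachDB's descriptor update path and looks different.
type RangeDescriptor struct {
	RangeID  int
	StartKey string
	EndKey   string
}

type Replica struct {
	desc *RangeDescriptor
}

// setDesc installs a new descriptor and asserts that an initialized
// replica's start key never moves; merges may extend the end key, but the
// start key should stay fixed for the lifetime of the replica.
func (r *Replica) setDesc(newDesc *RangeDescriptor) {
	if r.desc != nil && r.desc.StartKey != newDesc.StartKey {
		log.Fatalf("r%d: attempt to change start key from %q to %q",
			newDesc.RangeID, r.desc.StartKey, newDesc.StartKey)
	}
	r.desc = newDesc
}

func main() {
	r := &Replica{desc: &RangeDescriptor{RangeID: 10033, StartKey: "/Table/53/1/31627334", EndKey: "/Table/53/1/31660422"}}
	// Widening through a merge is fine: same start key, larger end key.
	r.setDesc(&RangeDescriptor{RangeID: 10033, StartKey: "/Table/53/1/31627334", EndKey: "/Table/53/1/31753605"})
	fmt.Println("ok:", *r.desc)
	// A buggy widening snapshot that moves the start key trips the assertion.
	r.setDesc(&RangeDescriptor{RangeID: 10033, StartKey: "/Table/53/1/31660422", EndKey: "/Table/53/1/31753605"})
}
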
craig bot closed this as completed in #29677 on Sep 7, 2018
benesch added a commit to benesch/cockroach that referenced this issue Sep 7, 2018
storage: preserve consistency when applying widening preemptive snapshots

andy-kimball pushed a commit to andy-kimball/cockroach that referenced this issue Sep 7, 2018
We saw a consistency failure in cockroachdb#29252 that would've been much more
useful had it occurred close to the time around which the inconsistency
must have been introduced. Instead of leaving it to chance, add a switch
that runs aggressive checks in (roach) tests that want them such as the
clearrange test.

Release note: None
andy-kimball pushed a commit to andy-kimball/cockroach that referenced this issue Sep 7, 2018
storage: preserve consistency when applying widening preemptive snapshots

tbg added a commit to tbg/cockroach that referenced this issue Sep 21, 2018