roachtest: hotspotsplits/nodes=4 failed [inconsistency] #61990
cc @aayushshah15 @nvanbenschoten for triage |
(roachtest).hotspotsplits/nodes=4 failed on master@8b137b4f068a0d590a3e86ae91fd60eb84f2750a:
Artifacts: /hotspotsplits/nodes=4 See this test on roachdash |
In #61990 we had this method catch a stats divergence on teardown in an otherwise successful test. The call to `t.Fatal` in that method unfortunately prevented the logs from being collected, which is not helpful. Release note: None
Kicking off a repro stress attempt with |
20 runs passed (in <900s each) |
Out of an additional 200 runs, I got one reproduction (i.e. 199 PASS). That's a 0.45% repro rate overall. Ouch!
|
Do you have the logs/debug zips collected somewhere? I was just going to say that I'd be really happy if we found a range tombstone bug. I'm out of obvious ideas for #62267. |
I have something even better, namely the cluster! Two of the nodes failed on the real consistency checker and got checkpointed and killed. Go check them out (I'm heading to bed, haven't looked yet):
Unfortunately the checkpoints are ~20GiB each. |
Kicking off 800 runs overnight on release-21.1 to see if I can get a repro there. |
@sumeerbhola, did we break
|
Ditto for range-keys, but that doesn't seem pebble related. #51151, works on a 20.1 binary (broken in 20.2 as well).
|
This must have been overlooked in all the changes to users of MVCCIterator such that they don't span from local to global keys. Where is the code that is constructing this MVCCIterator? The fix will typically be one of (a) set more accurate bounds if the iterator is only really trying to read one of the local or global keys, (b) use 2 iterators if both need to be read, (c) if one is trying to also read non-MVCC keys like lock table keys, use EngineIterator instead of MVCCIterator. |
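To make options (a) and (b) concrete, here is a rough sketch of what the bounded-iterator versions tend to look like. The `storage`/`keys` identifiers reflect my reading of the pkg/storage API at the time and should be treated as assumptions, not a drop-in patch:

```go
package example

import (
	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage"
)

// scanRangeSeparately reads the global (user) keyspace and the replicated
// range-ID local keyspace of a range with two separately bounded iterators,
// instead of letting a single MVCCIterator span from local to global keys.
func scanRangeSeparately(reader storage.Reader, desc *roachpb.RangeDescriptor) {
	// (a) Iterator confined to the range's global keyspace.
	globalIter := reader.NewMVCCIterator(storage.MVCCKeyAndIntentsIterKind, storage.IterOptions{
		LowerBound: desc.StartKey.AsRawKey(),
		UpperBound: desc.EndKey.AsRawKey(),
	})
	defer globalIter.Close()
	// ... iterate global keys ...

	// (b) A second iterator confined to the replicated range-ID local keys,
	// rather than spanning the local/global boundary with one iterator.
	localPrefix := keys.MakeRangeIDReplicatedPrefix(desc.RangeID)
	localIter := reader.NewMVCCIterator(storage.MVCCKeyIterKind, storage.IterOptions{
		LowerBound: localPrefix,
		UpperBound: localPrefix.PrefixEnd(),
	})
	defer localIter.Close()
	// ... iterate local keys ...
}
```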
The local range ID keys look fine, except the node in the majority (n2) is ahead by a commit. Is that fine?
3c3
< 0,0 /Local/RangeID/59/r/RangeAppliedState (0x0169c3727261736b00): raft_applied_index:2866 lease_applied_index:2704 range_stats:<last_update_nanos:1617046662819189590 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 sys_bytes:436 sys_count:5 >
---
> 0,0 /Local/RangeID/59/r/RangeAppliedState (0x0169c3727261736b00): raft_applied_index:2865 lease_applied_index:2703 range_stats:<last_update_nanos:1617046662819189590 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 sys_bytes:436 sys_count:5 >
6c6
< 0,0 /Local/RangeID/59/u/RaftHardState (0x0169c3757266746800): term:28 vote:2 commit:2866
---
> 0,0 /Local/RangeID/59/u/RaftHardState (0x0169c3757266746800): term:28 vote:2 commit:2865
@sumeerbhola, I think here: Lines 223 to 236 in d29a7be
cockroach start-single-node --insecure --logtostderr=INFO ) and use cockroach debug keys against the store directory.
|
|
cockroach.tobias-1617031837-01-n4cpu4-0002.ubuntu.2021-03-29T19_27_11Z.019618.log (n2,s2):2: checksum 335d894333f35cb2cbcf2844defb44a77b8f42b6f286dff07970e79b43e7e2bef3c46de8b4ec1fb06d28629d87e63a92bfb0578e84be5ef2ba9441a0a04fcc9f
stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
stats.Sub(recomputation): last_update_nanos:1617046662819189590
(n3,s3):3: checksum 335d894333f35cb2cbcf2844defb44a77b8f42b6f286dff07970e79b43e7e2bef3c46de8b4ec1fb06d28629d87e63a92bfb0578e84be5ef2ba9441a0a04fcc9f
stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
stats.Sub(recomputation): last_update_nanos:1617046662819189590
(n1,s1):1: checksum 4174178f69296d98d5d821d3998f0e58ba23e4665bda7e8464a6ef20098623918f3eb27f2293812128fbf4a96cea060fd3c9c8fbecdfec001ee79b9a91d8f9e1 [minority]
stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
stats.Sub(recomputation): last_update_nanos:1617046662819189590 live_bytes:-524356 live_count:-2 key_bytes:-50 key_count:-2 val_bytes:-524306 val_count:-2
consistency check failed; fetching details and shutting down minority (n1,s1):1
(n2,s2):2: checksum 2a0745549ed2f4a9b4f7d26f857f85447b3701975a51e74b15a51c1c3f5cb9dc6d8394fc7023ee9a4544e42d92136de092469488dd04c9a883649ef1336fbcc3
stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
stats.Sub(recomputation): last_update_nanos:1617046662819189590
(n3,s3):3: checksum 2a0745549ed2f4a9b4f7d26f857f85447b3701975a51e74b15a51c1c3f5cb9dc6d8394fc7023ee9a4544e42d92136de092469488dd04c9a883649ef1336fbcc3
stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
stats.Sub(recomputation): last_update_nanos:1617046662819189590
+====== diff(2a0745549ed2f4a9b4f7d26f857f85447b3701975a51e74b15a51c1c3f5cb9dc6d8394fc7023ee9a4544e42d92136de092469488dd04c9a883649ef1336fbcc3, [minority]) ======
--- leaseholder
+++ follower
-0,0 �/Local/RangeID/59/r/RangeAppliedState�
- ts:1970-01-01 00:00:00 +0000 UTC
- value:�raft_applied_index:2866 lease_applied_index:2704 range_stats:<last_update_nanos:1617046662819189590 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 sys_bytes:436 sys_count:5 > raft_closed_timestamp:<wall_time:1617047719942013857 >�
- raw mvcc_key/value: �0169c3727261736b00� �1204080010001800200028003240785941710308b2161090151a2709569391525fe770162094aca28d0128ea0830dadc0138ea0840bacfa08d0148ea0860b4036805220a08a187ff9fd78abab816�
+0,0 �/Local/RangeID/59/r/RangeAppliedState�
+ ts:1970-01-01 00:00:00 +0000 UTC
+ value:�raft_applied_index:2866 lease_applied_index:2704 range_stats:<last_update_nanos:1617046662819189590 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 sys_bytes:436 sys_count:5 > raft_closed_timestamp:<wall_time:1617047719942013857 >�
+ raw mvcc_key/value: �0169c3727261736b00� �1204080010001800200028003240ea302a2f0308b2161090151a2709569391525fe77016208cabe28c0128e60830f6db0138e6084096cfe08c0148e60860b4036805220a08a187ff9fd78abab816�
-1617046279.883407392,0 �/Table/53/1/5898728633597431726/0�
- ts:2021-03-29 19:31:19.883407392 +0000 UTC
- value:�"\x01\xaa\x98\x1f\n&\x90\x80\x00\u007fw\xf24\xdc\xd0m\x97\xd5~X\xe5\x98\x1f!\x9f\xf6\xb0\x89\xd3\r\xe5.\x03\x8f\xa2\x1d8q\x1ch=\x9e\xcb\x04\xc6\xe34ce\x85cV\x83\xfa\xf4\x12\xbb\xb4\x95\x1e\x97k}\x04\xe0\x9a\b\xb0\rO\xb3\xf3\xec+\xdf\n|gr-\xe2b\xa8\x1a\u007f#mx\xbf\xb5\xc7m\xf8K8\xcd\xf7\x11\xdb&\xa2ww\x98\xc7K\xa5O\x17\x86$s\x0e\xd4\xdf9\x17\x87I\x9c\xb3:\xad~\xec\xddg\x8cF\x85-\x83\xe7\xb6\xc1\x85u\xde\x1d3.f Ү\xf7\xa24\x12\xf38\xce\x1d\xb5\xe4$\x9f\xf8\v\x141\x94\xb0\x88\x1b\u007fg\xde\xe7�\x00\xb2\xfek\xc1\xfb\xb0\x83\x97\x95\x19\x88\xcf@HlO\x1b\xd2\xd6\xe1\xf0B\x87w\xc0\xe7w`V\x98\xda0&\xa9\xf6\xf6\xc0-\xff\xf2\xd9*\x9f\xcd\ty\x9e\xc9\x06l\xb7{\t\x19J\x85I\x85R\xb5,2\xa7\x93\xe7\x9eT\x8b\xf4r\xd3m\xa6s\xbf\x1a\x1aU\xbe\xcd\xe0)�mJ[[\x90\x88\xff<\xbf\x0fj\x14\x93\x89\xb1-\x92\xe6jb\xe3\xf2\"\xa0\xb5\xf7\x15\xdf\xf4>\x15t.\x18\xa9s\x90\x8d?\x8f\xc6\xect\x8b�\x9c\xeecN}\x9e\xb2ɺ\x8f\x8e\xed\xf5\xbfZ\b]\x18d\x04`^\xe4\x1dZ7�@\xd6o\r4\x8c\"\xab\xac\xa3�\xd3V\xcc$\x97\xa1\x8bwS\xf4\xc9\xec
- raw mvcc_key/value: �bd89fd51dc7e7080000fae88001670e70629d1402009� �01aa981f0a269080007f77f234dcd06d97d57e58e5981f219ff6b089d30de52e038fa21d38711c683d9ecb04c6e334636585635683faf412bbb4951e976b7d04e09a08b00d4fb3f3ec2bdf0a7c67722de262a81a7f236d78bfb5c76df84b38cdf711db26a2777798c74ba54f178624730ed4df391787499cb33aad7eecdd678c46852d83e7b6c18575de1d332e6620d2aef7a23412f338ce1db5e4249ff80b143194b0881b7f67dee7ca8300b2fe6bc1fbb08397951988cf40486c4f1bd2d6e1f0428777c0e777605698da3026a9f6f6c02dfff2d92a9fcd09799ec9066cb77b09194a85498552b52c32a793e79e548bf472d36da673bf1a1a55becde029e394a66d4a5b5b9088ff3cbf0f6a149389b12d92e66a62e3f222a0b5f715dff43e15742e18a973908d3f8fc6ec748bdf929cee634e7d9eb2c9ba8f8eedf5bf5a085d186404605ee41d5a37dc9c40d66f0d348c22abaca3c985d356cc2497a18b7753f4c9ec2b4404bf05c3882ce0a4046d456061160e679eccf09ae40b00d1fd316eff9c537c05a88c637f4f7331658429146b7b5fe22e339f626a203b2bac99fe60b62d4297f6369e4e9838cdd2de45afcbf3ab6431cec7bc70987108897a22487153c9f4a44db16cca7031e17a58daa099c8f83cdf358de6ba1
-1617046280.292398950,0 �/Table/53/1/5938528238689125289/0�
- ts:2021-03-29 19:31:20.29239895 +0000 UTC
- value:�"\xa0\xa3A1\n&\x90\x80\x00\x9c\x10\xc4\x04\x801\xb8&\x8f\xcaf\"\xe0.\u008b<\"q~]\xcc\xe3(K\x19\xf9\x81OC5\xfa\x89%,\xac\x10\x99\\b\xf0�\xd9\xf9Ծ\x83�\xbfw\x1a\x9ez\xae5\n]m\u007f\xf9{\xe2\x05\x03\x1c\xb68\x0e\x91\v\xfev\xb9~\xa5\xa8\xc4\x1b\n\xe3\xc7m\xd78\xd9x_\xe6]\x9aCqsj\xc7Q\xa4ը\xc1\x80\xf5&H\xb6l\xc2G\xc0\xb1\xcf\x1eB\v\x8d\xc0Ty\xaaL\x11\x90rL\xbe\x19\xfa\x17\xd7b\x8c]E\xaa\x04\xd5\x02\xdf#$\x9b<\x1d\xda]\xfcNB\xb0O\xbf�\x86\xb1UkWo^\xe5\xf9%\xeaGAdz\xad\xd1\xc5\x10\x93\xaeE\xfb\xab\xac\xc1\x13\xff\xba\xb7\xe4_\xbd\xfc4FF\xe9Hd\xb6\xca�\xec]p\xe1Hl\xad5\xc7\u007f\xf1\x8d\x1a\xbe\xe7z\rkJ\x00\x94-=p@\xd8UecXDl\xa5c\xac\xb0\xa7\xb8Iĺ\xa9\xa8\u007f,\xceh\xbf\vl�N \xf5z!\x1f\xd9#C\xa9\xf0!o\x12�SVLi\x8em\xa5\xab-Y\xccT@#\xc0\x8e\xf8a\x0e&q\x8d\x99\x94Z$,(\xc1*\x84m\xeaN\x90>\x94s9\x1b\xec\x14\xde\xca{\xf6\x94kH\xa4\x1d\x19T\x0f\xfa\x9c}\x16[\x03\x9e@\xdaM=P\xd9;:\xdb_*E0L\xa3\a]\x92\x9f\xb6\x8e}ӯP\fp<VT\xb3\\\xf9W\x80j\xd9{\xb2�:ͬ\x03\x13\xebM\xf7\xea\xba,\x8d\xa9\x16\xdb\xfd!\
- raw mvcc_key/value: �bd89fd5269e3f82bb103a988001670e7064231f76609� �a0a341310a269080009c10c4048031b8268fca6622e02ec28b3c22717e5dcce3284b19f9814f4335fa89252cac10995c62f0dd9bd9f9d4be83e98ebebf771a9e7aae350a5d6d7ff97be205031cb6380e910bfe76b97ea5a8c41b0ae3c76dd738d9785fe65d9a4371736ac751a4d5a8c180f52648b66cc247c0b1cf1e420b8dc05479aa4c1190724cbe19fa17d7628c5d45aa04d502df23249b3c1dda5dfc4e42b04fbfefa79086b1556b576f5ee5f925ea4741c7b3add1c51093ae45fbabacc113ffbab7e45fbdfc344646e94864b6cada9fec5d70e1486cad35c77ff18d1abee77a0d6b4a00942d3d7040d855656358446ca563acb0a7b849c4baa9a87f2cce68bf0b6ccf844e20f57a211fd92343a9f0216f12de8b53564c698e6da5ab2d59cc544023c08ef8610e26718d99945a242c28c12a846dea4e903e9473391bec14deca7bf6946b48a41d19540ffa9c7d165b039e40da4d3d50d93b3adb5f2a45304ca3075d929fb68e7dd3af500c703c5654b35cf957806ad97bb2d9883acdac0313eb4df7eaba2c8da916dbfd211b3c436e3de3f1ea5b3f88950fd9e0ff394538dc16ff9d247dbe34136111f23161b40152a776d1efa4418782d9a2916b0d4021565e146a856f5712fbd7cef4a74d8990a7c961ecea6b5acf5
- stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
- stats.Sub(recomputation): last_update_nanos:1617046662819189590 live_bytes:-524356 live_count:-2 key_bytes:-50 key_count:-2 val_bytes:-524306 val_count:-2
Looks like we've dropped two keys. |
on n1 (minority) and n2 (majority):
These two keys are not adjacent. The first key is on line 967 of
The workload here is
At first glance a plausible explanation for the bug could be that we're simply dropping a whole pebble batch on the floor, but that isn't it. We are definitely losing the kv write in tandem with the stats write, but we seem to be updating the raft applied index just fine, or we would see fatal errors, as we do assert that it goes up in increments of exactly one. A more likely explanation is that we are erroneously applying a command with a forced error when we shouldn't (only on the minority replica), but I don't see any particular reason why this would happen in this test and not others (and a cursory review of the code shows no source of nondeterminism).

The timestamp of the first lost key translates to Monday, March 29, 2021, 7:31:19 PM UTC. Looking through the combined logs at that time via
there are a few things going on:

- r59 is created via a split ~2 minutes before the lost writes (so that split is likely not to blame).
- Around 70s before the problematic write there's a slow heartbeat.
- 60s before the write the raft log queue truncates (due to size), cutting off n3 from the raft log. A snapshot is sent and applied by n3 at 19:31:10, i.e. ~8s before the write.
- 2s after the write, the raft log queue truncates again, causing another snapshot (but none is sent for a while; it only gets applied at 19:34:11, minutes after the write).
- We also see a split in the meantime, at 19:32:10, ~50s after the write.
- Finally, starting at 19:34:27 (minutes after the write) we see a flurry of failed reproposals due to closed timestamps on n2, and also a bunch of "have been waiting" messages from DistSender trying to talk to 459.

That's pretty much it: at 19:37:49 the inconsistency is detected. Nothing really stands out here. Since we have a repro cycle, as bad as it may be, I am going to instrument the binary to print, at each raft log position, the command and whether we applied it or not. That should provide valuable insight. |
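As a sanity check on that wall-clock translation (my own snippet, not from the thread; the nanosecond value is the first lost key's MVCC wall time from the diff above):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Wall time, in nanoseconds, of the first lost key's MVCC timestamp.
	const wallNanos = 1617046279883407392
	fmt.Println(time.Unix(0, wallNanos).UTC())
	// 2021-03-29 19:31:19.883407392 +0000 UTC, i.e. 7:31:19 PM UTC.
}
```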
#62791 now running with up to 800 iterations. 💸 💸 💸 It's also looking increasingly likely that release-21.1 isn't going to reproduce - I'm ~700 out of 800 runs in and the only failure that has cropped up was a clock offset fatal. |
Doing some light math here on the repro chances... I got 1 in 200 on master, and the failures are binomially distributed, so let's say the
Then with 0/800, it's basically ruled out that the failure will ever show up here (i.e. under the same distribution, I would've been ~guaranteed to see a failure). That tool also has facilities for giving you 95% confidence interval estimates for |
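A quick back-of-the-envelope check of that reasoning (my own sketch, not the tool referenced above): at the master rate of p = 1/200, the chance of 800 clean runs is about 1.8%, and the rule-of-three 95% upper bound on the true rate after 0/800 is roughly 0.38%.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const p = 1.0 / 200 // observed repro rate on master
	const n = 800       // clean runs on release-21.1

	// Probability of zero failures in n runs if the branch failed at rate p.
	fmt.Printf("P(0 failures in %d runs | p=%.3f) = %.3f\n", n, p, math.Pow(1-p, n))

	// Rule of three: approximate 95%-confidence upper bound on the true
	// failure rate after observing 0 failures in n runs.
	fmt.Printf("95%% upper bound on p: %.4f\n", 3.0/n)
}
```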
run 114, tobias-1617787468-46-n4cpu4:2
|
run 126, tobias-1617787468-24-n4cpu4:2
I'll leave this running but you get the idea. We catch the raft entry cache assertion quite a bit, and the term regression sometimes. |
These repros all have

```go
if truncate {
	truncIdx := ents[0].Index
	if add { // not taken
		// Some entries were already overwritten.
		truncIdx = ents[len(ents)-1].Index + 1
	}
	bytesRemoved, entriesRemoved = p.truncateFrom(truncIdx)
}
```

I have the Cache here - should be doable to figure out why this didn't do what was advertised. One last failure that occurred before I aborted the repro attempts: run 144,
|
I think I have something. The problem is indeed in the code from the last comment. When
For example, in the dump I'm looking at right now (run 71), we were inserting entries 1154..1253 when the assertion hit. The ringbuf (post-assertion, but I think it wasn't mutated at all) has entries 1213..1268. Note 1154 < 1213. Ok, time to fix that and to see if that indeed explains everything... but I feel it must. Will have a hopefully-not-preliminary victory lunch first. |
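To make the failure mode concrete, here's a toy model I put together (my own sketch, not the actual `raftentry` ring buffer): a `truncateFrom` that bails out whenever the truncation index is not currently cached silently leaves the higher, now-stale entries in place, which is exactly the 1154-vs-1213 situation above.

```go
package main

import "fmt"

// cache is a toy stand-in for the raft entry cache: a contiguous run of
// entries starting at index `first`.
type cache struct {
	first   uint64
	entries []uint64 // e.g. the term of each entry
}

// buggyTruncateFrom mimics the failure mode: if idx is not currently cached,
// it does nothing, so entries at indexes >= idx (all of them, here) survive.
func (c *cache) buggyTruncateFrom(idx uint64) {
	if idx < c.first || idx >= c.first+uint64(len(c.entries)) {
		return // BUG: should have cleared everything at or above idx
	}
	c.entries = c.entries[:idx-c.first]
}

// fixedTruncateFrom clears everything at or above idx, even when idx
// precedes the first cached index.
func (c *cache) fixedTruncateFrom(idx uint64) {
	if idx <= c.first {
		c.first, c.entries = 0, nil
		return
	}
	if off := idx - c.first; off < uint64(len(c.entries)) {
		c.entries = c.entries[:off]
	}
}

func main() {
	// Cache holds 1213..1268; we append entries starting at 1154 that do not
	// fit, so the only correct outcome is an empty cache.
	c := &cache{first: 1213, entries: make([]uint64, 1268-1213+1)}
	c.buggyTruncateFrom(1154)
	fmt.Println("buggy: entries left =", len(c.entries)) // 56 (stale!)
	c.fixedTruncateFrom(1154)
	fmt.Println("fixed: entries left =", len(c.entries)) // 0
}
```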
Spinning up the repro machine again with provisional fix 97bf81b |
Likely due to golang/gogo proto incompatibilities around our use of the `(gogoproto.nullable)` option, calling `.String()` on a `raftpb.Entry` panics. However, upstream at our currently vendored commit 578ffd578362a8962194c4c0c68c1a9824cf9ea3,

```go
func TestFoo(t *testing.T) {
	ent := pb.Entry{Term: 1, Index: 2}
	_ = ent.String()
}
```

does not panic. This commit adds a test documenting the undesired behavior and working around it in `./cockroach debug raft-log` (and the other debug commands that stringify log entries). Found while investigating cockroachdb#61990. Release note: None
62728: roachtest: don't call t.Fatal in FailOnReplicaDivergence r=erikgrinaker a=tbg

In #61990 we had this method catch a stats divergence on teardown in an otherwise successful test. The call to `t.Fatal` in that method unfortunately prevented the logs from being collected, which is not helpful. Release note: None

63073: roachtest: retire `bank` tests r=nvanbenschoten,andreimatei a=tbg

The `bank` roachtests are a basic form of Jepsen test. We were hoping that they could provide additional benefit due to being less complex than a full Jepsen test. In my opinion, this calculus has not worked out. We've spent many hours staring at this test deadlocked when it wasn't CockroachDB's fault at all. Besides, it's not adding anything meaningful on top of our Jepsen tests plus KVNemesis plus the TPCC invariant checks, so we are removing the `bank` family of tests here. Should we want to reintroduce these tests, we should do it at the level of TestCluster, or at least augment the `bank` workload to incorporate these invariant checks instead. Closes #62754. Closes #62749. Closes #53871. Release note: None

63214: kvserver: reduce ReplicaGCQueueInactivityThreshold to 12 hours r=ajwerner a=erikgrinaker

`ReplicaGCQueueInactivityThreshold` specifies the interval at which the GC queue checks whether a replica has been removed from the canonical range descriptor, and was set to 10 days. This is a fallback for when we fail to detect the removal and GC the replica immediately. However, this could occasionally cause stale replicas to linger for 10 days, which surprises users (e.g. by causing alerts if the stale replica thinks the range has become underreplicated). This patch reduces the threshold to 12 hours, which is a more reasonable timeframe for users to expect things to "sort themselves out". The operation to read the range descriptor is fairly cheap, so this is not likely to cause any problems, and the interval is therefore not jittered either. Resolves #63212, touches #60259. Release note (ops change): Replica garbage collection now checks replicas against the range descriptor every 12 hours (down from 10 days) to see if they should be removed. Replicas that fail to notice they have been removed from a range will therefore linger for at most 12 hours rather than 10 days.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
This productionizes a number of assertions that were added in cockroachdb#62791 while investigating cockroachdb#61990. We don't have a good story about whether assertions always need to be behind a build tag. I added the assertions that are dirt cheap without a build tag, and where I augmented an existing assertion I kept it behind the existing build tag gating.

If any of these assertions fires, there's a good chance that we are facing potential log corruption (the one in the entry cache is often benign, but once it leads to a term regression, it's very likely a bad time), so if anything, in the (relatively distant) future these assertions should fail "gracefully" at the replica level as opposed to taking down the node; we wouldn't want to disable them. Since we can't fail assertions at the replica level now, per-node is fine, and they will have plenty of time to bake before they make it into a release. Release note: None
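For what it's worth, build-tag gating of assertions in Go usually takes roughly the shape sketched below; the `with_asserts` tag and the `entrycache` package are made up for illustration and are not the gating actually used in the repo.

```go
// assert_on.go: compiled only when building with `-tags with_asserts`.
//go:build with_asserts

package entrycache

const assertionsEnabled = true
```

```go
// assert_off.go: the default build; assertionsEnabled is a compile-time
// false, so gated checks reduce to a branch the compiler can eliminate.
//go:build !with_asserts

package entrycache

const assertionsEnabled = false
```

```go
// cache.go: an example check a caller could run right after a truncation.
package entrycache

func assertTruncated(lastCachedIdx uint64, cacheEmpty bool, truncIdx uint64) {
	if assertionsEnabled && !cacheEmpty && lastCachedIdx >= truncIdx {
		panic("raft entry cache retained entries at or above the truncation index")
	}
}
```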
63302: raftentry: fix accidental no-op in truncateFrom r=nvanbenschoten,erikgrinaker a=tbg

When entries are appended to the raft log, they are also added to the raft entry cache. Appending indexes, say, `100, 101, 102` to the raft log has the (absolutely crucial for correctness) effect of removing any higher indexes (`103` and up) from the raft log, and the raft entry cache must follow the same semantics so that it is always a coherent subset of the raft log.

It turns out it didn't do that in one edge case, namely when appending a slice of entries that didn't fit into the cache, but whose first entry preceded the currently cached entries. For a concrete example, imagine the cache contains [101, 102, 103, 104] and we are attempting to add a slice of large entries at indexes `[50, 51, 52]` which cannot fit the cache at all (i.e. their aggregate size exceeds that of the cache). What *should* result from this addition is an empty cache, because the true raft log now ends at index 52. In reality, the cache accidentally turned this operation into a no-op, as it was using an iterator that is invalid when positioned at an index not currently cached (in this example, index `50`, whereas the first cached index is `101`).

It took us a while to identify this bug, as it manifested itself only through a very rare replica inconsistency failure (#61990) of the `hotspotsplits/nodes=4` roachtest (which was only caught thanks to #36241) that then took days to reproduce. When we finally managed to get a reproduction that contained the entire history of disk writes for the cluster, we found that the inconsistency had occurred due to entries from a past term being replicated to the divergent follower. Concretely, where the leader had something like

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |

the divergent follower would have this:

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |

This made it likely that the issue was connected to a raft leadership change in which the "old" leader for term 7 had proposed entries for indexes up to and including at least index 105, which the term 8 and 9 leaders had then discarded and replaced with entries for term 9. Of course, these entries should consequently never have been committed; in particular, the term can never regress in the raft log, even for uncommitted entries, at least when Raft is working correctly.

From the WAL writes, we were able to ascertain that the leader (which itself had the correct writes in its log) had in fact replicated the incorrect (term 7) entries to the follower. This meant that we were either looking at pebble flat-out serving stale reads (highly unlikely) or a bug in the raft entry cache. Being unable to find issues through direct inspection, we added invariants and tweaked the test to exacerbate what we considered the triggering conditions (relatively frequent term changes). A crucial assertion that allowed identifying this bug was an invariant check (which will be re-added in a future commit) verifying that the truncation had indeed removed subsequent entries. This highlighted that we were hitting the bug in at least ~5% of hotspotsplits runs, but possibly more frequently (the assertion did not catch all instances of it).

`hotspotsplits` brings the crucial ingredients here: it does concentrated 256KiB inserts which put the raft group into mild overload; many entries pile up and are processed in aggregate, which leads to constantly taking the path in which truncations are erroneously ignored. However, most of the time omitting the eviction does not yield replica inconsistency. For that to happen, the erroneously cached entries need to be picked up by raft for sending to a follower; they need to be the "incorrect" ones (which necessitates a discarded log and a term change at the right moment); and finally they must be in log positions at which the correct entries actually had an effect (which, to be fair, is mostly the case).

The history of this bug is actually interesting. The code that has the bug was introduced in #36360. However, that PR fixed a much more egregious version of this bug: prior to it, the raft entry cache simply didn't evict entries at higher positions at all, i.e. it exhibited the bug in many more scenarios, and this consequently led to more frequent inconsistencies (as referenced in the issue). So we went from a bad, moderately rare bug to a bad, very very rare bug and, in this PR, hopefully, to no bug.

A rough calculation suggests that we spent around 800,000 `n2-standard-4` instance minutes on reproductions related to this bug in this instance alone, which I think boils down to around $39,000. I like to think we got off cheap here.

Closes #61990.

Release note (bug fix): A rare issue that could cause replica divergence was fixed. These issues would be reported by the replica consistency checker, typically within 24 hours of occurrence, which would cause nodes to terminate.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
(roachtest).hotspotsplits/nodes=4 failed on master@e09b93fe62541c3a94f32a723778660b528a0792:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash