roachtest: hotspotsplits/nodes=4 failed [inconsistency] #61990
cc @aayushshah15 @nvanbenschoten for triage |
(roachtest).hotspotsplits/nodes=4 failed on master@8b137b4f068a0d590a3e86ae91fd60eb84f2750a:
Artifacts: /hotspotsplits/nodes=4 See this test on roachdash |
In #61990 we had this method catch a stats divergence on teardown in an otherwise successful test. The call to `t.Fatal` in that method unfortunately prevented the logs from being collected, which is not helpful. Release note: None
Kicking off a repro stress attempt with |
20 runs passed (in <900s each) |
Out of an additional 200 runs, I got one reproduction (i.e. 199 PASS). That's a 0.45% repro rate overall. Ouch!
|
Do you have the logs/debug zips collected somewhere? I was just going to say that I'd be really happy if we found a range tombstone bug. I'm out of obvious ideas for #62267. |
I have something even better, namely the cluster! Two of the nodes failed on the real consistency checker and got checkpointed and killed. Go check them out (I'm heading to bed, haven't looked yet):
Unfortunately the checkpoints are ~20GiB each. |
Kicking off 800 runs overnight on release-21.1 to see if I can get a repro there. |
@sumeerbhola, did we break
|
Ditto for range-keys, but that doesn't seem pebble related. #51151, works on a 20.1 binary (broken in 20.2 as well).
|
This must have been overlooked in all the changes to users of MVCCIterator such that they don't span from local to global keys. Where is the code that is constructing this MVCCIterator? The fix will typically be one of (a) set more accurate bounds if the iterator is only really trying to read one of the local or global keys, (b) use 2 iterators if both need to be read, (c) if one is trying to also read non-MVCC keys like lock table keys, use EngineIterator instead of MVCCIterator. |
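To make options (a) and (b) concrete, here is a rough sketch of what the bounded-iterator versions tend to look like. The `storage`/`keys` identifiers reflect my reading of the pkg/storage API at the time and should be treated as assumptions, not a drop-in patch:

```go
package example

import (
	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage"
)

// scanRangeSeparately reads the global (user) keyspace and the replicated
// range-ID local keyspace of a range with two separately bounded iterators,
// instead of letting a single MVCCIterator span from local to global keys.
func scanRangeSeparately(reader storage.Reader, desc *roachpb.RangeDescriptor) {
	// (a) Iterator confined to the range's global keyspace.
	globalIter := reader.NewMVCCIterator(storage.MVCCKeyAndIntentsIterKind, storage.IterOptions{
		LowerBound: desc.StartKey.AsRawKey(),
		UpperBound: desc.EndKey.AsRawKey(),
	})
	defer globalIter.Close()
	// ... iterate global keys ...

	// (b) A second iterator confined to the replicated range-ID local keys,
	// rather than spanning the local/global boundary with one iterator.
	localPrefix := keys.MakeRangeIDReplicatedPrefix(desc.RangeID)
	localIter := reader.NewMVCCIterator(storage.MVCCKeyIterKind, storage.IterOptions{
		LowerBound: localPrefix,
		UpperBound: localPrefix.PrefixEnd(),
	})
	defer localIter.Close()
	// ... iterate local keys ...
}
```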
The local range ID keys look fine, except the node in the majority (n2) is ahead by a commit. Is that fine?
3c3
< 0,0 /Local/RangeID/59/r/RangeAppliedState (0x0169c3727261736b00): raft_applied_index:2866 lease_applied_index:2704 range_stats:<last_update_nanos:1617046662819189590 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 sys_bytes:436 sys_count:5 >
---
> 0,0 /Local/RangeID/59/r/RangeAppliedState (0x0169c3727261736b00): raft_applied_index:2865 lease_applied_index:2703 range_stats:<last_update_nanos:1617046662819189590 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 sys_bytes:436 sys_count:5 >
6c6
< 0,0 /Local/RangeID/59/u/RaftHardState (0x0169c3757266746800): term:28 vote:2 commit:2866
---
> 0,0 /Local/RangeID/59/u/RaftHardState (0x0169c3757266746800): term:28 vote:2 commit:2865
@sumeerbhola, I think here: Lines 223 to 236 in d29a7be
cockroach start-single-node --insecure --logtostderr=INFO ) and use cockroach debug keys against the store directory.
|
|
cockroach.tobias-1617031837-01-n4cpu4-0002.ubuntu.2021-03-29T19_27_11Z.019618.log (n2,s2):2: checksum 335d894333f35cb2cbcf2844defb44a77b8f42b6f286dff07970e79b43e7e2bef3c46de8b4ec1fb06d28629d87e63a92bfb0578e84be5ef2ba9441a0a04fcc9f
stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
stats.Sub(recomputation): last_update_nanos:1617046662819189590
(n3,s3):3: checksum 335d894333f35cb2cbcf2844defb44a77b8f42b6f286dff07970e79b43e7e2bef3c46de8b4ec1fb06d28629d87e63a92bfb0578e84be5ef2ba9441a0a04fcc9f
stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
stats.Sub(recomputation): last_update_nanos:1617046662819189590
(n1,s1):1: checksum 4174178f69296d98d5d821d3998f0e58ba23e4665bda7e8464a6ef20098623918f3eb27f2293812128fbf4a96cea060fd3c9c8fbecdfec001ee79b9a91d8f9e1 [minority]
stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
stats.Sub(recomputation): last_update_nanos:1617046662819189590 live_bytes:-524356 live_count:-2 key_bytes:-50 key_count:-2 val_bytes:-524306 val_count:-2
consistency check failed; fetching details and shutting down minority (n1,s1):1
(n2,s2):2: checksum 2a0745549ed2f4a9b4f7d26f857f85447b3701975a51e74b15a51c1c3f5cb9dc6d8394fc7023ee9a4544e42d92136de092469488dd04c9a883649ef1336fbcc3
stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
stats.Sub(recomputation): last_update_nanos:1617046662819189590
(n3,s3):3: checksum 2a0745549ed2f4a9b4f7d26f857f85447b3701975a51e74b15a51c1c3f5cb9dc6d8394fc7023ee9a4544e42d92136de092469488dd04c9a883649ef1336fbcc3
stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
stats.Sub(recomputation): last_update_nanos:1617046662819189590
+====== diff(2a0745549ed2f4a9b4f7d26f857f85447b3701975a51e74b15a51c1c3f5cb9dc6d8394fc7023ee9a4544e42d92136de092469488dd04c9a883649ef1336fbcc3, [minority]) ======
--- leaseholder
+++ follower
-0,0 �/Local/RangeID/59/r/RangeAppliedState�
- ts:1970-01-01 00:00:00 +0000 UTC
- value:�raft_applied_index:2866 lease_applied_index:2704 range_stats:<last_update_nanos:1617046662819189590 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 sys_bytes:436 sys_count:5 > raft_closed_timestamp:<wall_time:1617047719942013857 >�
- raw mvcc_key/value: �0169c3727261736b00� �1204080010001800200028003240785941710308b2161090151a2709569391525fe770162094aca28d0128ea0830dadc0138ea0840bacfa08d0148ea0860b4036805220a08a187ff9fd78abab816�
+0,0 �/Local/RangeID/59/r/RangeAppliedState�
+ ts:1970-01-01 00:00:00 +0000 UTC
+ value:�raft_applied_index:2866 lease_applied_index:2704 range_stats:<last_update_nanos:1617046662819189590 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 sys_bytes:436 sys_count:5 > raft_closed_timestamp:<wall_time:1617047719942013857 >�
+ raw mvcc_key/value: �0169c3727261736b00� �1204080010001800200028003240ea302a2f0308b2161090151a2709569391525fe77016208cabe28c0128e60830f6db0138e6084096cfe08c0148e60860b4036805220a08a187ff9fd78abab816�
-1617046279.883407392,0 �/Table/53/1/5898728633597431726/0�
- ts:2021-03-29 19:31:19.883407392 +0000 UTC
- value:�"\x01\xaa\x98\x1f\n&\x90\x80\x00\u007fw\xf24\xdc\xd0m\x97\xd5~X\xe5\x98\x1f!\x9f\xf6\xb0\x89\xd3\r\xe5.\x03\x8f\xa2\x1d8q\x1ch=\x9e\xcb\x04\xc6\xe34ce\x85cV\x83\xfa\xf4\x12\xbb\xb4\x95\x1e\x97k}\x04\xe0\x9a\b\xb0\rO\xb3\xf3\xec+\xdf\n|gr-\xe2b\xa8\x1a\u007f#mx\xbf\xb5\xc7m\xf8K8\xcd\xf7\x11\xdb&\xa2ww\x98\xc7K\xa5O\x17\x86$s\x0e\xd4\xdf9\x17\x87I\x9c\xb3:\xad~\xec\xddg\x8cF\x85-\x83\xe7\xb6\xc1\x85u\xde\x1d3.f Ү\xf7\xa24\x12\xf38\xce\x1d\xb5\xe4$\x9f\xf8\v\x141\x94\xb0\x88\x1b\u007fg\xde\xe7�\x00\xb2\xfek\xc1\xfb\xb0\x83\x97\x95\x19\x88\xcf@HlO\x1b\xd2\xd6\xe1\xf0B\x87w\xc0\xe7w`V\x98\xda0&\xa9\xf6\xf6\xc0-\xff\xf2\xd9*\x9f\xcd\ty\x9e\xc9\x06l\xb7{\t\x19J\x85I\x85R\xb5,2\xa7\x93\xe7\x9eT\x8b\xf4r\xd3m\xa6s\xbf\x1a\x1aU\xbe\xcd\xe0)�mJ[[\x90\x88\xff<\xbf\x0fj\x14\x93\x89\xb1-\x92\xe6jb\xe3\xf2\"\xa0\xb5\xf7\x15\xdf\xf4>\x15t.\x18\xa9s\x90\x8d?\x8f\xc6\xect\x8b�\x9c\xeecN}\x9e\xb2ɺ\x8f\x8e\xed\xf5\xbfZ\b]\x18d\x04`^\xe4\x1dZ7�@\xd6o\r4\x8c\"\xab\xac\xa3�\xd3V\xcc$\x97\xa1\x8bwS\xf4\xc9\xec
- raw mvcc_key/value: �bd89fd51dc7e7080000fae88001670e70629d1402009� �01aa981f0a269080007f77f234dcd06d97d57e58e5981f219ff6b089d30de52e038fa21d38711c683d9ecb04c6e334636585635683faf412bbb4951e976b7d04e09a08b00d4fb3f3ec2bdf0a7c67722de262a81a7f236d78bfb5c76df84b38cdf711db26a2777798c74ba54f178624730ed4df391787499cb33aad7eecdd678c46852d83e7b6c18575de1d332e6620d2aef7a23412f338ce1db5e4249ff80b143194b0881b7f67dee7ca8300b2fe6bc1fbb08397951988cf40486c4f1bd2d6e1f0428777c0e777605698da3026a9f6f6c02dfff2d92a9fcd09799ec9066cb77b09194a85498552b52c32a793e79e548bf472d36da673bf1a1a55becde029e394a66d4a5b5b9088ff3cbf0f6a149389b12d92e66a62e3f222a0b5f715dff43e15742e18a973908d3f8fc6ec748bdf929cee634e7d9eb2c9ba8f8eedf5bf5a085d186404605ee41d5a37dc9c40d66f0d348c22abaca3c985d356cc2497a18b7753f4c9ec2b4404bf05c3882ce0a4046d456061160e679eccf09ae40b00d1fd316eff9c537c05a88c637f4f7331658429146b7b5fe22e339f626a203b2bac99fe60b62d4297f6369e4e9838cdd2de45afcbf3ab6431cec7bc70987108897a22487153c9f4a44db16cca7031e17a58daa099c8f83cdf358de6ba1
-1617046280.292398950,0 �/Table/53/1/5938528238689125289/0�
- ts:2021-03-29 19:31:20.29239895 +0000 UTC
- value:�"\xa0\xa3A1\n&\x90\x80\x00\x9c\x10\xc4\x04\x801\xb8&\x8f\xcaf\"\xe0.\u008b<\"q~]\xcc\xe3(K\x19\xf9\x81OC5\xfa\x89%,\xac\x10\x99\\b\xf0�\xd9\xf9Ծ\x83�\xbfw\x1a\x9ez\xae5\n]m\u007f\xf9{\xe2\x05\x03\x1c\xb68\x0e\x91\v\xfev\xb9~\xa5\xa8\xc4\x1b\n\xe3\xc7m\xd78\xd9x_\xe6]\x9aCqsj\xc7Q\xa4ը\xc1\x80\xf5&H\xb6l\xc2G\xc0\xb1\xcf\x1eB\v\x8d\xc0Ty\xaaL\x11\x90rL\xbe\x19\xfa\x17\xd7b\x8c]E\xaa\x04\xd5\x02\xdf#$\x9b<\x1d\xda]\xfcNB\xb0O\xbf�\x86\xb1UkWo^\xe5\xf9%\xeaGAdz\xad\xd1\xc5\x10\x93\xaeE\xfb\xab\xac\xc1\x13\xff\xba\xb7\xe4_\xbd\xfc4FF\xe9Hd\xb6\xca�\xec]p\xe1Hl\xad5\xc7\u007f\xf1\x8d\x1a\xbe\xe7z\rkJ\x00\x94-=p@\xd8UecXDl\xa5c\xac\xb0\xa7\xb8Iĺ\xa9\xa8\u007f,\xceh\xbf\vl�N \xf5z!\x1f\xd9#C\xa9\xf0!o\x12�SVLi\x8em\xa5\xab-Y\xccT@#\xc0\x8e\xf8a\x0e&q\x8d\x99\x94Z$,(\xc1*\x84m\xeaN\x90>\x94s9\x1b\xec\x14\xde\xca{\xf6\x94kH\xa4\x1d\x19T\x0f\xfa\x9c}\x16[\x03\x9e@\xdaM=P\xd9;:\xdb_*E0L\xa3\a]\x92\x9f\xb6\x8e}ӯP\fp<VT\xb3\\\xf9W\x80j\xd9{\xb2�:ͬ\x03\x13\xebM\xf7\xea\xba,\x8d\xa9\x16\xdb\xfd!\
- raw mvcc_key/value: �bd89fd5269e3f82bb103a988001670e7064231f76609� �a0a341310a269080009c10c4048031b8268fca6622e02ec28b3c22717e5dcce3284b19f9814f4335fa89252cac10995c62f0dd9bd9f9d4be83e98ebebf771a9e7aae350a5d6d7ff97be205031cb6380e910bfe76b97ea5a8c41b0ae3c76dd738d9785fe65d9a4371736ac751a4d5a8c180f52648b66cc247c0b1cf1e420b8dc05479aa4c1190724cbe19fa17d7628c5d45aa04d502df23249b3c1dda5dfc4e42b04fbfefa79086b1556b576f5ee5f925ea4741c7b3add1c51093ae45fbabacc113ffbab7e45fbdfc344646e94864b6cada9fec5d70e1486cad35c77ff18d1abee77a0d6b4a00942d3d7040d855656358446ca563acb0a7b849c4baa9a87f2cce68bf0b6ccf844e20f57a211fd92343a9f0216f12de8b53564c698e6da5ab2d59cc544023c08ef8610e26718d99945a242c28c12a846dea4e903e9473391bec14deca7bf6946b48a41d19540ffa9c7d165b039e40da4d3d50d93b3adb5f2a45304ca3075d929fb68e7dd3af500c703c5654b35cf957806ad97bb2d9883acdac0313eb4df7eaba2c8da916dbfd211b3c436e3de3f1ea5b3f88950fd9e0ff394538dc16ff9d247dbe34136111f23161b40152a776d1efa4418782d9a2916b0d4021565e146a856f5712fbd7cef4a74d8990a7c961ecea6b5acf5
- stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0
- stats.Sub(recomputation): last_update_nanos:1617046662819189590 live_bytes:-524356 live_count:-2 key_bytes:-50 key_count:-2 val_bytes:-524306 val_count:-2
Looks like we've dropped two keys. |
on n1 (minority) and n2 (majority):
These two keys are not adjacent. The first key is on line 967 of
The workload here is
At first glance a plausible explanation for the bug could be that we're simply dropping a whole pebble batch on the floor, but that isn't it. We are definitely losing the kv write in tandem with the stats write, but we seem to be updating the raft applied index just fine, or we would see fatal errors, as we do assert that it goes up in increments of exactly one. A more likely explanation is that we are erroneously applying a command with a forced error when we shouldn't (only on the minority replica), but I don't see any particular reason why this would happen in this test and not others (and a cursory review of the code shows no source of nondeterminism).

The timestamp of the first lost key translates to Monday, March 29, 2021, 7:31:19 PM UTC. Looking through the combined logs at that time via
there are a few things going on:

- r59 is created via a split ~2 minutes before the lost writes (so that split is likely not to blame).
- Around 70s before the problematic write there's a slow heartbeat.
- 60s before the write the raft log queue truncates (due to size), cutting off n3 from the raft log. A snapshot is sent and applied by n3 at 19:31:10, i.e. ~8s before the write.
- 2s after the write, the raft log queue truncates again, causing another snapshot (but none is sent for a while; it only gets applied at 19:34:11, minutes after the write).
- We also see a split in the meantime, at 19:32:10, ~50s after the write.
- Finally, starting at 19:34:27 (minutes after the write) we see a flurry of failed reproposals due to closed timestamps on n2, and also a bunch of "have been waiting" messages from DistSender trying to talk to 459.

That's pretty much it: at 19:37:49 the inconsistency is detected. Nothing really stands out here. Since we have a repro cycle, as bad as it may be, I am going to instrument the binary to print, at each raft log position, the command and whether we applied it or not. That should provide valuable insight. |
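As a sanity check on that wall-clock translation (my own snippet, not from the thread; the nanosecond value is the first lost key's MVCC wall time from the diff above):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Wall time, in nanoseconds, of the first lost key's MVCC timestamp.
	const wallNanos = 1617046279883407392
	fmt.Println(time.Unix(0, wallNanos).UTC())
	// 2021-03-29 19:31:19.883407392 +0000 UTC, i.e. 7:31:19 PM UTC.
}
```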
#62791 now running with up to 800 iterations. 💸 💸 💸 It's also looking increasingly likely that release-21.1 isn't going to reproduce - I'm ~700 out of 800 runs in and the only failure that has cropped up was a clock offset fatal. |
Doing some light math here on the repro chances... I got 1 in 200 on master, and the failures are binomially distributed, so let's say the
Then with 0/800, it's basically ruled out that the failure will ever show up here (i.e. under the same distribution, I would've been ~guaranteed to see a failure). That tool also has facilities for giving you 95% confidence interval estimates for |
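A quick back-of-the-envelope check of that reasoning (my own sketch, not the tool referenced above): at the master rate of p = 1/200, the chance of 800 clean runs is about 1.8%, and the rule-of-three 95% upper bound on the true rate after 0/800 is roughly 0.38%.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const p = 1.0 / 200 // observed repro rate on master
	const n = 800       // clean runs on release-21.1

	// Probability of zero failures in n runs if the branch failed at rate p.
	fmt.Printf("P(0 failures in %d runs | p=%.3f) = %.3f\n", n, p, math.Pow(1-p, n))

	// Rule of three: approximate 95%-confidence upper bound on the true
	// failure rate after observing 0 failures in n runs.
	fmt.Printf("95%% upper bound on p: %.4f\n", 3.0/n)
}
```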
run 114, tobias-1617787468-46-n4cpu4:2
|
run 126, tobias-1617787468-24-n4cpu4:2
I'll leave this running but you get the idea. We catch the raft entry cache assertion quite a bit, and the term regression sometimes. |
These repros all have

```go
if truncate {
	truncIdx := ents[0].Index
	if add { // not taken
		// Some entries were already overwritten.
		truncIdx = ents[len(ents)-1].Index + 1
	}
	bytesRemoved, entriesRemoved = p.truncateFrom(truncIdx)
}
```

I have the Cache here - should be doable to figure out why this didn't do what was advertised. One last failure that occurred before I aborted the repro attempts: run 144,
|
I think I have something. The problem is indeed in the code from the last comment. When
For example, in the dump I'm looking at right now (run 71), we were inserting entries 1154..1253 when the assertion hit. The ringbuf (post-assertion, but I think it wasn't mutated at all) has entries 1213..1268. Note 1154 < 1213. Ok, time to fix that and to see if that indeed explains everything... but I feel it must. Will have a hopefully-not-preliminary victory lunch first. |
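To make the failure mode concrete, here's a toy model I put together (my own sketch, not the actual `raftentry` ring buffer): a `truncateFrom` that bails out whenever the truncation index is not currently cached silently leaves the higher, now-stale entries in place, which is exactly the 1154-vs-1213 situation above.

```go
package main

import "fmt"

// cache is a toy stand-in for the raft entry cache: a contiguous run of
// entries starting at index `first`.
type cache struct {
	first   uint64
	entries []uint64 // e.g. the term of each entry
}

// buggyTruncateFrom mimics the failure mode: if idx is not currently cached,
// it does nothing, so entries at indexes >= idx (all of them, here) survive.
func (c *cache) buggyTruncateFrom(idx uint64) {
	if idx < c.first || idx >= c.first+uint64(len(c.entries)) {
		return // BUG: should have cleared everything at or above idx
	}
	c.entries = c.entries[:idx-c.first]
}

// fixedTruncateFrom clears everything at or above idx, even when idx
// precedes the first cached index.
func (c *cache) fixedTruncateFrom(idx uint64) {
	if idx <= c.first {
		c.first, c.entries = 0, nil
		return
	}
	if off := idx - c.first; off < uint64(len(c.entries)) {
		c.entries = c.entries[:off]
	}
}

func main() {
	// Cache holds 1213..1268; we append entries starting at 1154 that do not
	// fit, so the only correct outcome is an empty cache.
	c := &cache{first: 1213, entries: make([]uint64, 1268-1213+1)}
	c.buggyTruncateFrom(1154)
	fmt.Println("buggy: entries left =", len(c.entries)) // 56 (stale!)
	c.fixedTruncateFrom(1154)
	fmt.Println("fixed: entries left =", len(c.entries)) // 0
}
```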
Spinning up the repro machine again with provisional fix 97bf81b |
Likely due to golang/gogo proto incompatibilities around our use of the `(gogoproto.nullable)` option, calling `.String()` on a `raftpb.Entry` panics. However, upstream at our currently vendored commit 578ffd578362a8962194c4c0c68c1a9824cf9ea3,

```go
func TestFoo(t *testing.T) {
	ent := pb.Entry{Term: 1, Index: 2}
	_ = ent.String()
}
```

does not panic. This commit adds a test documenting the undesired behavior and working around it in `./cockroach debug raft-log` (and the other debug commands that stringify log entries). Found while investigating cockroachdb#61990. Release note: None
62728: roachtest: don't call t.Fatal in FailOnReplicaDivergence r=erikgrinaker a=tbg

In #61990 we had this method catch a stats divergence on teardown in an otherwise successful test. The call to `t.Fatal` in that method unfortunately prevented the logs from being collected, which is not helpful. Release note: None

63073: roachtest: retire `bank` tests r=nvanbenschoten,andreimatei a=tbg

The `bank` roachtests are a basic form of Jepsen test. We were hoping that they could provide additional benefit due to being less complex than a full Jepsen test. In my opinion, this calculus has not worked out. We've spent many hours staring at this test deadlocked when it wasn't CockroachDB's fault at all. Besides, it's not adding anything meaningful on top of our Jepsen tests plus KVNemesis plus the TPCC invariant checks, so we are removing the `bank` family of tests here. Should we want to reintroduce these tests, we should do it at the level of TestCluster, or at least augment the `bank` workload to incorporate these invariant checks instead. Closes #62754. Closes #62749. Closes #53871. Release note: None

63214: kvserver: reduce ReplicaGCQueueInactivityThreshold to 12 hours r=ajwerner a=erikgrinaker

`ReplicaGCQueueInactivityThreshold` specifies the interval at which the GC queue checks whether a replica has been removed from the canonical range descriptor, and was set to 10 days. This is a fallback for when we fail to detect the removal and GC the replica immediately. However, this could occasionally cause stale replicas to linger for 10 days, which surprises users (e.g. by causing alerts if the stale replica thinks the range has become underreplicated). This patch reduces the threshold to 12 hours, which is a more reasonable timeframe for users to expect things to "sort themselves out". The operation to read the range descriptor is fairly cheap, so this is not likely to cause any problems, and the interval is therefore not jittered either. Resolves #63212, touches #60259. Release note (ops change): Replica garbage collection now checks replicas against the range descriptor every 12 hours (down from 10 days) to see if they should be removed. Replicas that fail to notice they have been removed from a range will therefore linger for at most 12 hours rather than 10 days.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
This productionizes a number of assertions that were added in cockroachdb#62791 while investigating cockroachdb#61990. We don't have a good story about whether assertions always need to be behind a build tag. I added the assertions that are dirt cheap without a build tag, and where I augmented an existing assertion I kept it behind the existing build tag gating.

If any of these assertions fires, there's a good chance that we are facing potential log corruption (the one in the entry cache is often benign, but once it leads to a term regression, it's very likely a bad time), so if anything, in the (relatively distant) future these assertions should fail "gracefully" at the replica level as opposed to taking down the node; we wouldn't want to disable them. Since we can't fail assertions at the replica level now, per-node is fine, and they will have plenty of time to bake before they make it into a release. Release note: None
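For what it's worth, build-tag gating of assertions in Go usually takes roughly the shape sketched below; the `with_asserts` tag and the `entrycache` package are made up for illustration and are not the gating actually used in the repo.

```go
// assert_on.go: compiled only when building with `-tags with_asserts`.
//go:build with_asserts

package entrycache

const assertionsEnabled = true
```

```go
// assert_off.go: the default build; assertionsEnabled is a compile-time
// false, so gated checks reduce to a branch the compiler can eliminate.
//go:build !with_asserts

package entrycache

const assertionsEnabled = false
```

```go
// cache.go: an example check a caller could run right after a truncation.
package entrycache

func assertTruncated(lastCachedIdx uint64, cacheEmpty bool, truncIdx uint64) {
	if assertionsEnabled && !cacheEmpty && lastCachedIdx >= truncIdx {
		panic("raft entry cache retained entries at or above the truncation index")
	}
}
```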
63302: raftentry: fix accidental no-op in truncateFrom r=nvanbenschoten,erikgrinaker a=tbg

When entries are appended to the raft log, they are also added to the raft entry cache. Appending indexes, say, `100, 101, 102` to the raft log has the (absolutely crucial for correctness) effect of removing any higher indexes (`103` and up) from the raft log, and the raft entry cache must follow the same semantics so that it is always a coherent subset of the raft log.

It turns out it didn't do that in one edge case, namely when appending a slice of entries that didn't fit into the cache, but whose first entry preceded the currently cached entries. For a concrete example, imagine the cache contains [101, 102, 103, 104] and we are attempting to add a slice of large entries at indexes `[50, 51, 52]` which cannot fit the cache at all (i.e. their aggregate size exceeds that of the cache). What *should* result from this addition is an empty cache, because the true raft log now ends at index 52. In reality, the cache accidentally turned this operation into a no-op, as it was using an iterator that is invalid when positioned at an index not currently cached (in this example, index `50`, whereas the first cached index is `101`).

It took us a while to identify this bug, as it manifested itself only through a very rare replica inconsistency failure (#61990) of the `hotspotsplits/nodes=4` roachtest (which was only caught thanks to #36241) that then took days to reproduce. When we finally managed to get a reproduction that contained the entire history of disk writes for the cluster, we found that the inconsistency had occurred due to entries from a past term being replicated to the divergent follower. Concretely, where the leader had something like

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |

the divergent follower would have this:

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |

This made it likely that the issue was connected to a raft leadership change in which the "old" leader for term 7 had proposed entries for indexes up to and including at least index 105, which the term 8 and 9 leaders had then discarded and replaced with entries for term 9. Of course, these entries should consequently never have been committed; in particular, the term can never regress in the raft log, even for uncommitted entries, at least when Raft is working correctly.

From the WAL writes, we were able to ascertain that the leader (which itself had the correct writes in its log) had in fact replicated the incorrect (term 7) entries to the follower. This meant that we were either looking at pebble flat-out serving stale reads (highly unlikely) or a bug in the raft entry cache. Being unable to find issues through direct inspection, we added invariants and tweaked the test to exacerbate what we considered the triggering conditions (relatively frequent term changes). A crucial assertion that allowed identifying this bug was an invariant check (which will be re-added in a future commit) verifying that the truncation had indeed removed subsequent entries. This highlighted that we were hitting the bug in at least ~5% of hotspotsplits runs, but possibly more frequently (the assertion did not catch all instances of it).

`hotspotsplits` brings the crucial ingredients here: it does concentrated 256KiB inserts which put the raft group into mild overload; many entries pile up and are processed in aggregate, which leads to constantly taking the path in which truncations are erroneously ignored. However, most of the time omitting the eviction does not yield replica inconsistency. For that to happen, the erroneously cached entries need to be picked up by raft for sending to a follower; they need to be the "incorrect" ones (which necessitates a discarded log and a term change at the right moment); and finally they must be in log positions at which the correct entries actually had an effect (which, to be fair, is mostly the case).

The history of this bug is actually interesting. The code that has the bug was introduced in #36360. However, that PR fixed a much more egregious version of this bug: prior to it, the raft entry cache simply didn't evict entries at higher positions at all, i.e. it exhibited the bug in many more scenarios, and this consequently led to more frequent inconsistencies (as referenced in the issue). So we went from a bad, moderately rare bug to a bad, very very rare bug and, in this PR, hopefully, to no bug.

A rough calculation suggests that we spent around 800,000 `n2-standard-4` instance minutes on reproductions related to this bug in this instance alone, which I think boils down to around $39,000. I like to think we got off cheap here.

Closes #61990.

Release note (bug fix): A rare issue that could cause replica divergence was fixed. These issues would be reported by the replica consistency checker, typically within 24 hours of occurrence, which would cause nodes to terminate.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
(roachtest).hotspotsplits/nodes=4 failed on master@e09b93fe62541c3a94f32a723778660b528a0792:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash