
roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [replica inconsistency] #69414

Closed
cockroach-teamcity opened this issue Aug 26, 2021 · 22 comments · Fixed by #69923
Labels
branch-master: Failures and bugs on the master branch.
C-test-failure: Broken test (automatically or manually discovered).
O-roachtest
O-robot: Originated from a bot.
release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@cockroach-teamcity

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on master @ ab1fc343c9a1140191f96353995258e609a84d02:

		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | I210826 12:54:50.179613 1 (gostd) cluster_synced.go:1677  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *cluster.WithCommandDetails (2) *exec.ExitError

	cluster.go:1249,context.go:89,cluster.go:1237,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3359107-1629959017-35-n12cpu4-geo --oneshot --ignore-empty-nodes: exit status 1 1: 1287
		3: 1312
		2: 1418
		4: skipped
		7: dead (exit status 1)
		6: 1334
		5: 1311
		10: 1470
		11: 1440
		8: skipped
		9: 1435
		12: skipped
		Error: UNCLASSIFIED_PROBLEM: 7: dead (exit status 1)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1173
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:281
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2107
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 7: dead (exit status 1)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels Aug 26, 2021
@tbg

tbg commented Aug 26, 2021

On n7 (not n5 as indicated by the message - it's getting confused because of the gaps in the node list):

* ERROR: ERROR: startup forbidden by prior critical alert
* DETAIL: From /mnt/data1/cockroach/auxiliary/_CRITICAL_ALERT.txt:
*
ERROR: startup forbidden by prior critical alert
DETAIL: From /mnt/data1/cockroach/auxiliary/_CRITICAL_ALERT.txt:
Failed running "start"
cockroach exited with code 1: Thu Aug 26 12:54:31 UTC 2021

We've seen this before, in #67471.
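
A minimal sketch, assuming hypothetical names, of the "prevented startup" mechanism visible in the error above: a prior fatal alert writes a marker file under the store's auxiliary directory, and any later start attempt refuses to proceed while the file exists.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkPreventedStartup is an illustrative stand-in, not the actual
// CockroachDB code: it refuses startup if a prior critical alert left
// a marker file behind.
func checkPreventedStartup(auxDir string) error {
	p := filepath.Join(auxDir, "_CRITICAL_ALERT.txt")
	b, err := os.ReadFile(p)
	if os.IsNotExist(err) {
		return nil // no prior critical alert; safe to start
	}
	if err != nil {
		return err
	}
	return fmt.Errorf("startup forbidden by prior critical alert\nDETAIL: From %s:\n%s", p, b)
}

func main() {
	if err := checkPreventedStartup("/mnt/data1/cockroach/auxiliary"); err != nil {
		fmt.Fprintln(os.Stderr, "ERROR:", err)
		os.Exit(1)
	}
}
```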

@tbg changed the title from "roachtest: tpccbench/nodes=9/cpu=4/multi-region failed" to "roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [empty prevented startup file]" Aug 26, 2021
@tbg

tbg commented Aug 26, 2021

Uh-oh:

(n6,s6):3: checksum 7c32651636a3930317d12e4d006802fefd1cfac60fe9cfe32dac898b5ac75d76c899fb4b908b13395f92cb2f1b2897a7dbbdec995f455c2c8a32d1102b99bf1e [minority]
- stats: contains_estimates:0 last_update_nanos:1629982307391576058 intent_age:0 gc_bytes_age:69662266620 live_bytes:133029913 live_count:396189 key_bytes:9276126 key_count:396189 val_bytes:135241850 val_count:431486 intent_bytes:4923 intent_count:15 separated_intent_count:15 sys_bytes:13323 sys_count:138 abort_span_bytes:9701 
- stats.Sub(recomputation): last_update_nanos:1629982307391576058 intent_age:-2994 live_bytes:-84 val_bytes:-84 intent_bytes:-318 intent_count:-1 separated_intent_count:-1 
(n4,s4):8: checksum f44a7c890d7c01c844b4eae75bf9690d1c6072f2b6f9f3ad09a450b738315239afe21df3bb29f170778cf236d6db4b8d544dfdfb7835940869be19af034deb87
- stats: contains_estimates:0 last_update_nanos:1629982307391576058 intent_age:0 gc_bytes_age:69662266620 live_bytes:133029913 live_count:396189 key_bytes:9276126 key_count:396189 val_bytes:135241850 val_count:431486 intent_bytes:4923 intent_count:15 separated_intent_count:15 sys_bytes:13323 sys_count:138 abort_span_bytes:9701 
- stats.Sub(recomputation): last_update_nanos:1629982307391576058 
(n5,s5):7: checksum f44a7c890d7c01c844b4eae75bf9690d1c6072f2b6f9f3ad09a450b738315239afe21df3bb29f170778cf236d6db4b8d544dfdfb7835940869be19af034deb87
- stats: contains_estimates:0 last_update_nanos:1629982307391576058 intent_age:0 gc_bytes_age:69662266620 live_bytes:133029913 live_count:396189 key_bytes:9276126 key_count:396189 val_bytes:135241850 val_count:431486 intent_bytes:4923 intent_count:15 separated_intent_count:15 sys_bytes:13323 sys_count:138 abort_span_bytes:9701 
- stats.Sub(recomputation): last_update_nanos:1629982307391576058 
consistency check failed; fetching details and shutting down minority (n6,s6):3
t/data1/cockroach/auxiliary/checkpoints/r734_at_14189›

(n6,s6):3: checksum ad260ec0efec305ff8a3a2601ba19ea7276820ef270e602151cf9dcbb22cfe38d1e0407759c29938475d157b3e27776d7d0d0d3a7c7504da112fa886b9fe3497 [minority]
- stats: contains_estimates:0 last_update_nanos:1629982316710611497 intent_age:0 gc_bytes_age:69765807974 live_bytes:133028679 live_count:396189 key_bytes:9277338 key_count:396189 val_bytes:135272256 val_count:431587 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:13323 sys_count:138 abort_span_bytes:9701 
- stats.Sub(recomputation): last_update_nanos:1629982316710611497 intent_age:-3003 live_bytes:-84 val_bytes:-84 intent_bytes:-318 intent_count:-1 separated_intent_count:-1 
(n4,s4):8: checksum f8449b7052dcb7045aefe6309381393c6715d3408fb571e9dfd30d88ec7a7f07819b997ada537108433b50daaec786f8fb7dc22109e1ffc1267b189faeb85c54
- stats: contains_estimates:0 last_update_nanos:1629982316710611497 intent_age:0 gc_bytes_age:69765807974 live_bytes:133028679 live_count:396189 key_bytes:9277338 key_count:396189 val_bytes:135272256 val_count:431587 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:13323 sys_count:138 abort_span_bytes:9701 
- stats.Sub(recomputation): last_update_nanos:1629982316710611497 
(n5,s5):7: checksum f8449b7052dcb7045aefe6309381393c6715d3408fb571e9dfd30d88ec7a7f07819b997ada537108433b50daaec786f8fb7dc22109e1ffc1267b189faeb85c54
- stats: contains_estimates:0 last_update_nanos:1629982316710611497 intent_age:0 gc_bytes_age:69765807974 live_bytes:133028679 live_count:396189 key_bytes:9277338 key_count:396189 val_bytes:135272256 val_count:431587 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:13323 sys_count:138 abort_span_bytes:9701 
- stats.Sub(recomputation): last_update_nanos:1629982316710611497 
====== diff(f8449b7052dcb7045aefe6309381393c6715d3408fb571e9dfd30d88ec7a7f07819b997ada537108433b50daaec786f8fb7dc22109e1ffc1267b189faeb85c54, [minority]) ======
--- leaseholder
+++ follower
+0,0 ‹/Table/61/1/1686/4881/0›
+    ts:1970-01-01 00:00:00 +0000 UTC
+    value:‹1629979313.521491293,2 {Txn:id=a181b3d4 key=/Table/55/1/1686/2/0 pri=0.02746475 epo=4 ts=1629979313.521491293,2 min=1629979155.112560857,0 seq=5 Timestamp:1629979313.521491293,2 Deleted:false KeyBytes:12 ValBytes:306 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> TxnDidNotUpdateMeta:<nil>}›
+    raw mvcc_key/value: ‹c589f70696f713118800› ‹0a3d0a10a181b3d4ff05487885ec309606895b5e1a07bf89f706968a8820042a0c08ddfabbadcbb1b6cf16100230e9ff2338054a0a08d9919a9efdacb6cf16120c08ddfabbadcbb1b6cf1610021800200c28b202›
consistency check failed

I'm uploading the full artifacts to https://drive.google.com/drive/folders/1z_gpHX39QwKMC4x7aYWMriRxV6mV5cTC?usp=sharing (CRL only).

@tbg changed the title from "roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [empty prevented startup file]" to "roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [replica inconsistency]" Aug 26, 2021
@tbg added the GA-blocker label and removed the release-blocker label Aug 26, 2021
@tbg

tbg commented Aug 26, 2021

$ cockroach debug merge-logs --format crdb-v1 --program-filter '^cockroach$' --filter 'r734[^0-9]' logs/ | gh gist create

https://gist.github.com/4d458af340e47fa9c1f549297eb747f9

@tbg

tbg commented Aug 26, 2021

The timestamp of the separated intent missing on the leaseholder n6 is

Thursday, August 26, 2021 12:01:53 PM

The log timestamps at the time of the consistency check are close to an hour later, around 12:52. So whatever went wrong likely went wrong around 12:0[01] on r734.
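
For reference, the wall-clock time above can be derived from the intent timestamp's integer part, which is Unix nanoseconds; a quick Go check:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// The intent's timestamp is 1629979313.521491293,2 (seconds.nanos,logical).
	ts := time.Unix(0, 1629979313521491293).UTC()
	fmt.Println(ts) // 2021-08-26 12:01:53.521491293 +0000 UTC
}
```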

Unfortunately, there's radio silence around this time:

teamcity-3359107-1629959017-35-n12cpu4-geo-0007> I210826 09:42:11.211904 170 kv/kvserver/store_remove_replica.go:133 ⋮ [n6,s6,r734/3:‹/Table/61/1/168{4/7433-5/634…}›] 117349  removing replica r732/3
teamcity-3359107-1629959017-35-n12cpu4-geo-0006> W210826 11:53:31.483979 263 kv/kvserver/store_raft.go:525 ⋮ [n5,s5,r734/7:‹/Table/61/1/168{4/7433-8/3622}›] 2129  handle raft ready: 0.5s [applied=0, batches=0, state_assertions=0]
teamcity-3359107-1629959017-35-n12cpu4-geo-0006> W210826 11:54:53.437996 264 kv/kvserver/store_raft.go:525 ⋮ [n5,s5,r734/7:‹/Table/61/1/168{4/7433-8/3622}›] 4151  handle raft ready: 1.7s [applied=1, batches=1, state_assertions=0]
teamcity-3359107-1629959017-35-n12cpu4-geo-0006> W210826 11:55:56.079618 1368314 kv/kvserver/spanlatch/manager.go:526 ⋮ [n5,s5,r734/7:‹/Table/61/1/168{4/7433-8/3622}›] 5926  have been waiting 15s to acquire ‹read› latch ‹/Table/61/1/1686/15985/0@1629978933.679786651,0›, held by ‹write› latch ‹/Table/61/1/1686/15985/0@1629978810.646836154,0›
teamcity-3359107-1629959017-35-n12cpu4-geo-0006> W210826 12:03:30.615608 254 kv/kvserver/store_raft.go:525 ⋮ [n5,s5,r734/7:‹/Table/61/1/168{4/7433-8/3622}›] 32283  handle raft ready: 0.7s [applied=0, batches=0, state_assertions=0]
teamcity-3359107-1629959017-35-n12cpu4-geo-0006> W210826 12:04:09.733580 2575922 kv/kvserver/spanlatch/manager.go:526 ⋮ [n5,s5,r734/7:‹/Table/61/1/168{4/7433-8/3622}›] 33541  have been waiting 15s to acquire ‹write› latch ‹/Table/61/1/1685/25393/0@1629979312.161687797,0›, held by ‹read› latch ‹/Table/61/1/1685/25393/0@1629979379.140708674,0›
teamcity-3359107-1629959017-35-n12cpu4-geo-0005> W210826 12:04:26.445723 1749274 kv/kvclient/kvcoord/dist_sender.go:1537 ⋮ [n4] 6706  slow range RPC: have been waiting 87.18s (1 attempts) for RPC Get

I'm also not seeing anything obviously wrong around the 12:00 mark:

cockroach debug merge-logs --format crdb-v1 --program-filter '^cockroach$' --from '210826 11:58:00' --to '210826 12:03:00' logs/ > 1200.log
grep -vE 'have been waiting|finished waiting|handle raft|slow RPC|runtime_stats|health alerts|gossip.go' 1200.log  | less -S
[...]

@tbg

tbg commented Aug 26, 2021

The txnid also makes no appearance in the logs. Neither does the full pretty-printed key.

@sumeerbhola

If this is reproducible, then one thing to try would be to change the code to disable the SINGLEDEL optimization.

btw, there is a case where SINGLEDEL can have 2 SETs under it (a code sketch of this sequence follows the list):

  • range r1 on node n1 sees a write with a separated intent, and does a SET for a lock table key
  • range r1 is removed from node n1, and a RANGEDEL is placed.
  • range r1 is added back, and there is another SET for the same key (since the intent is not yet resolved).
  • intent is resolved and since the txn only wrote once we write SINGLEDEL.
    This is harmless since the RANGEDEL will take care of the original SET and there should never be a compaction involving the original SET and the SINGLEDEL that does not also contain the RANGEDEL and later SET.
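
A minimal sketch of this four-step sequence against Pebble's public API (github.com/cockroachdb/pebble); the keys, values, and directory name are illustrative only:

```go
package main

import "github.com/cockroachdb/pebble"

func main() {
	db, err := pebble.Open("demo-db", &pebble.Options{})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	lockKey := []byte("lock/r1/k1") // stand-in for a lock table key

	// 1. r1 on n1 sees a write with a separated intent: SET on the lock key.
	_ = db.Set(lockKey, []byte("intent-meta-1"), pebble.Sync)

	// 2. r1 is removed from n1: a RANGEDEL covers its lock table span.
	_ = db.DeleteRange([]byte("lock/r1/"), []byte("lock/r1/\xff"), pebble.Sync)

	// 3. r1 is added back; the still-unresolved intent is SET again.
	_ = db.Set(lockKey, []byte("intent-meta-2"), pebble.Sync)

	// 4. The intent is resolved; the txn wrote only once, so SINGLEDEL is used.
	// This is safe only because the RANGEDEL shadows the SET from step 1.
	_ = db.SingleDelete(lockKey, pebble.Sync)
}
```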

@tbg

tbg commented Aug 30, 2021

@sumeerbhola how ironclad is the testing of single deletions on the pebble side? I expect this to be very very hard to reproduce, so if we can do any work on the pebble side to make sure that single deletes are very well tested in the presence of concurrent compactions, ingestions, etc that would be helpful.

@tbg

tbg commented Aug 30, 2021

@AlexTalks will start trying to reproduce this tomorrow, with help from yours truly. I'll need to take another look at the test failure to get an idea of whether we can skip parts of the test or specialize it somehow, but we probably want to run with something like fd9aa5a and also more stringent consistency checks, perhaps even specialized checks on txn records we know have been single-del'ed.

@sumeerbhola

> @sumeerbhola how ironclad is the testing of single deletions on the pebble side? I expect this to be very very hard to reproduce, so if we can do any work on the pebble side to make sure that single deletes are very well tested in the presence of concurrent compactions, ingestions, etc that would be helpful.

It is quite thoroughly tested by the Pebble metamorphic test, which generates single deletes just like the other operations used in batches and file ingestion.
It does not test the particular case that I mentioned earlier, for which I was claiming correctness based on fundamental Pebble invariants. I'll look into adding testing for it.

@sumeerbhola

> It does not test the particular case that I mentioned earlier, for which I was claiming correctness based on fundamental Pebble invariants.

@nicktrav will start working on enhancing the Pebble metamorphic test for this case, and on making it randomly use Del instead of SingleDel for keys that are eligible for SingleDel.
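
A hedged sketch of that randomized weakening; the op types, field names, and fraction are illustrative, not Pebble's actual metamorphic package:

```go
package main

import (
	"fmt"
	"math/rand"
)

type opKind int

const (
	opDelete opKind = iota
	opSingleDelete
)

type op struct {
	kind opKind
	key  string
}

// weakenSingleDeletes rewrites a fraction of SINGLEDEL ops into plain
// DELETEs (always at least as safe) to diversify execution paths.
func weakenSingleDeletes(rng *rand.Rand, ops []op, fraction float64) {
	for i := range ops {
		if ops[i].kind == opSingleDelete && rng.Float64() < fraction {
			ops[i].kind = opDelete
		}
	}
}

func main() {
	rng := rand.New(rand.NewSource(1))
	ops := []op{{opSingleDelete, "foo"}, {opSingleDelete, "bar"}}
	weakenSingleDeletes(rng, ops, 0.5)
	fmt.Println(ops)
}
```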

@tbg

tbg commented Aug 31, 2021

The timeline in the test is:

- 8:13am test starts
- 8:37 starting to wait for rebalancing
- 9:42: here we see the last range event for r734. It does not split, merge, or really log anything interesting past this point
- 10:03: line search at 2000 warehouses...
- 10:24: pass, trying 2015 warehouses
- 10:44: pass, trying 2045
- 11:04: pass, trying 2105
- 11:25: pass, trying 2225
- 11:45: pass, trying 2465
- 11:53: we see log activity on r734; it's not terribly interesting (slow raft ready etc.), so this is just an artifact of the CRDB cluster being overloaded.
- 12:01: around this time, the intent should be written
- 12:26: failed due to efficiency, trying 2285
- 12:46 failed due to efficiency, trying 2255
- 12:52: failed due to n7 dead (this is going to be the consistency check failure, since the exit code is 7=FatalError)

Given all of this, I think we can make some assumptions:

  • Range splits and merges etc. don't have much to do with the problem at hand.
  • High load does. The intent write that later caused the inconsistency was laid down under an overload regime.

In the past, such inconsistencies would usually come down to a problem at the storage layer (i.e. the "singledelete didn't do its job" theory the storage team is looking into by adding more testing) or at the kv replication layer (for example, we once had a bug where cached raft entries at the leader were not reliably evicted when an uncommitted part of the raft log was discarded, so we'd sometimes send the wrong entry to a follower and it would thus diverge. #61990 (comment) is the most recent example of this).

The import phase of the test takes only around 30 minutes, and we're interested in overload. I would thus say we try a repro cycle where we stay multi-region (in case that exacerbated anything) and run the import but skip the rebalancing phase. When we then hit the cluster with 2000-warehouse TPCC, that will most likely overload the cluster (if that proves to be false, we start at a higher warehouse count like 2200). We run with an aggressive consistency check interval, too, as it is very valuable to stop the test close to where the issue happens.
One interesting thing to note about the line search is how it reacts to dead nodes: it doesn't fail the test! This is because, by design, this test verges into overload territory and nodes may crash (at least today). The reason the above test failed is that when an inconsistency is detected, we error out during the next attempt at starting the node, and that terminates the line search. So the line search behavior is already exactly what we want for the repro cycle we're after; we will get failures where the node refused to start (likely due to the inconsistency death rattle), but we don't have to worry about nodes OOMing and such, as this will be suppressed by the line search.

Since I expect that it will take a while to get a repro, we'll also put in the pebble ArchiveCleaner commit referenced above. We need to check that we're not running out of disk space; if that becomes an issue maybe we can hack around it by provisioning larger disks, or by catching disk-full conditions and ignoring the outcome.

@tbg

tbg commented Aug 31, 2021

Going to see how many of the clusters for this test we can put into andrei-jepsen

for i in $(seq 1 10); do GCE_PROJECT=andrei-jepsen roachprod create $USER-geo$i -n 12 --clouds=gce --local-ssd=true --gce-machine-type=n1-standard-4 --gce-zones=us-east1-b,us-west1-b,europe-west2-b --geo --lifetime=12h0m0s --os-volume-size=32 --local-ssd-no-ext4-barrier; done

@tbg

tbg commented Aug 31, 2021

^-- 10 clusters worked fine, so going to try making a few more. 💸

nicktrav added a commit to nicktrav/pebble that referenced this issue Aug 31, 2021
Currently, the metamorphic tests randomly generate a set of operations
to perform on the Pebble instance. Support for single deletes exists,
though the keys that have been deleted are not considered for reuse.

Track keys that have been singly deleted, and randomly reuse these keys
when generating subsequent operations to perform.

Prior to this patch, the generation operation log would resemble the
following:

```
db.Set("foo", "bar")
...
batch54.SingleDelete("foo")
...
// No more operations on key "foo".
```

With this patch, the following sequence of operations is permissible:

```
db.Set("foo", "bar")
...
db.SingleDelete("foo")
...
db.Set("foo", "baz")
...
db.Merge("foo", "bam")
...
db.Set("foo", "boom")
...
// Subsequent operations on key "foo" are permissible.
```

Related to cockroachdb/cockroach#69414.
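
A hedged sketch of the key-reuse tracking this commit message describes; the names are illustrative, not Pebble's actual generator internals:

```go
package metamorphic

import "math/rand"

type generator struct {
	rng        *rand.Rand
	singleDeld [][]byte // keys that have been singly deleted
}

func (g *generator) noteSingleDelete(key []byte) {
	g.singleDeld = append(g.singleDeld, key)
}

// pickKey returns either a fresh key or, half the time, a previously
// singly-deleted key, so later SETs/GETs/MERGEs exercise such keys.
func (g *generator) pickKey(fresh func() []byte) []byte {
	if n := len(g.singleDeld); n > 0 && g.rng.Intn(2) == 0 {
		return g.singleDeld[g.rng.Intn(n)]
	}
	return fresh()
}
```
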
nicktrav added a commit to nicktrav/pebble that referenced this issue Sep 1, 2021
Generated SINGLEDEL operations are eligible for transformation into less
restrictive DELETE operations.

Transform a fixed percentage of SINGLEDEL operations at generation time
into DELETEs to further exercise the delete execution paths.

Related to cockroachdb/cockroach#69414.
nicktrav added a commit to nicktrav/pebble that referenced this issue Sep 7, 2021
Currently, sequences of operations for metamorphic tests are generated
for certain specific cases (e.g. a SINGLEDEL follows a key that has been
SET once). Additional test sequences require maintaining the "state" of
the sequence in the `generator`, which clutters the struct with various
fields required for the state management.

Add a `sequenceGenerator` struct, which uses a "transition map" (a
mapping from a current state to next state, along with a corresponding
output for the transition) to model a state machine. The state machine
can be used to generate random sequences of operations that adhere to
certain rules (e.g. output a GET following a SET for a given key).

A `generator` can be constructed containing one or more
`sequenceGenerator`s that, when selected by the random number generator
(i.e. the "deck"), generate the next operation in the sequence governed
by the state machine and place the operation into the operation log.

Related to cockroachdb/cockroach#69414.
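
A hedged sketch of the transition-map idea; the states and transitions shown are illustrative, not the actual Pebble sequenceGenerator:

```go
package metamorphic

import "math/rand"

type state int

const (
	stateSet state = iota
	stateGet
	stateDelete
	stateSingleDelete
)

type transition struct {
	next state
	emit string // operation output on this transition
}

// transitionMap models sequences such as SET -> GET -> DELETE -> GET -> SINGLEDEL.
var transitionMap = map[state][]transition{
	stateSet:          {{stateGet, "GET"}},
	stateGet:          {{stateSet, "SET"}, {stateDelete, "DELETE"}, {stateSingleDelete, "SINGLEDEL"}},
	stateDelete:       {{stateGet, "GET"}},
	stateSingleDelete: {{stateGet, "GET"}},
}

// next advances the state machine, choosing randomly among the legal
// transitions, and returns the operation to append to the op log.
func next(rng *rand.Rand, cur state) (state, string) {
	ts := transitionMap[cur]
	t := ts[rng.Intn(len(ts))]
	return t.next, t.emit
}
```
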
nicktrav added a commit to nicktrav/pebble that referenced this issue Sep 7, 2021
Add a `sequenceGenerator` instance with a transition map that generates
the following sequence of operations, similar to the problematic
sequence generated for cockroachdb/cockroach#69414:

```
((SET -> GET)+ -> DELETE -> GET)+ -> SINGLEDEL -> (GET)+
```

See also cockroachdb/cockroach#69891.
@tbg added the release-blocker label and removed the GA-blocker label Sep 8, 2021
@tbg

tbg commented Sep 8, 2021

Updating this to release-blocker, as discussed in the release triage meeting yesterday. This means no beta will be released until this issue is addressed.

tbg added a commit to tbg/cockroach that referenced this issue Sep 8, 2021
I haven't looked at this code in a while. My assumption was that if a
ResolveIntent comes in at a newer epoch, we would keep the existing
intent and rewrite it to that epoch, but apparently we "just" remove it.
remove it.
This behavior might make sense in practice (now that our concurrency
control is cooperative), but it was not what I remembered, and plays
a prominent role in the correctness bug cockroachdb#69414.

Still need to figure out what to do with this. Comments definitely
refer to the behavior I remember. It's possible that we changed this
by accident. Will investigate more.

Release justification: testing improvement related to release blocker cockroachdb#69414.
Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Sep 8, 2021
Fixes cockroachdb#69414.

Only use SingleDel if

- the meta tracking allows it (i.e. the sole previous condition), and
- epoch zero, and
- no ignored txn seqno ranges.

These three together imply that the engine history of the metadata key
is a single `Set`, so we can safely use `SingleDel`.

This is a below-Raft change, but no mixed-version problems are expected.
This is because using a Del vs using a SingleDel (if it does not cause
an anomaly) is not visible at the replica state level. In particular,
a deleted key looks the same to the consistency checker as a
single-deleted key.

Release justification: fix for a release blocker
Release note: None
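
A minimal Go sketch of the narrowed check, assuming illustrative names (the actual change lives in CockroachDB's storage package):

```go
package storage

// canSingleDelIntent is a hedged sketch of the three conditions from the
// commit message; together they imply the engine history of the intent's
// metadata key is a single Set, so SingleDel is safe.
func canSingleDelIntent(txnDidNotUpdateMeta bool, epoch int32, numIgnoredSeqNumRanges int) bool {
	return txnDidNotUpdateMeta && // meta tracking allows it (the sole previous condition)
		epoch == 0 && // no transaction restarts rewrote the intent meta
		numIgnoredSeqNumRanges == 0 // no savepoint rollbacks rewrote it
}
```
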
nicktrav added a commit to nicktrav/pebble that referenced this issue Sep 8, 2021
Currently, the metamorphic tests use a MemFS-backed DB by default. The
`-disk` flag can be used to override all tests to use a disk-backed
filesystem. The behavior is "all or nothing".

To better simulate use of Pebble with a real, disk-backed filesystem,
randomly generate test configurations that use disk.

Tweak the flags to allow specifying an override filesystem type. In the
case that "mem" or "disk" are specified, all metamorphic tests will use
that FS type, ignoring the randomly generated FS type. The default is to
use the randomly generated FS types. Disk-backed filesystems are used
10% of the time for the randomly generated configurations.

Related to cockroachdb/cockroach#69414.
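
A hedged sketch of the 10% disk-backed choice described above, using Pebble's real vfs package; the helper name and wiring are illustrative:

```go
package metamorphic

import (
	"math/rand"

	"github.com/cockroachdb/pebble/vfs"
)

// randomFS picks the filesystem for a generated test configuration:
// disk-backed 10% of the time, in-memory otherwise.
func randomFS(rng *rand.Rand) vfs.FS {
	if rng.Float64() < 0.10 {
		return vfs.Default // disk-backed filesystem
	}
	return vfs.NewMem() // in-memory filesystem
}
```
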
craig bot pushed a commit that referenced this issue Sep 10, 2021
69806: kv/kvserver: use generalized engine keys in debug printing r=AlexTalks a=AlexTalks

Previously the CLI debug command only printed `MVCCKey`s, causing
the debug printer to error on key decoding whenever a `LockTableKey` was
encountered. By switching to the more generalized `EngineKey` (and
utilizing the existing MVCC key formatting whenever the key has an MVCC
version), we can increase visibility into our debug logs while investigating
issues.

Related to #69414

Release justification: Non-production bug fix
Release note: None

69944: stats: add a histogram version number r=rytaft a=rytaft

This commit adds a histogram version number to the `HistogramData`
proto. This will allow us to identify what logic was used to construct
a particular histogram and possibly debug future issues.

Release note: None

Release justification: Low risk, high benefit change to existing
functionality.

69963: sql: skip reset sql stats in TestRandomSyntaxFunctions r=maryliag,rafiss a=Azhng

Previously, crdb_internal.reset_sql_stats() caused timeouts
in TestRandomSyntaxFunctions. This is very unlikely to be due to
the implementation of the function, and is more likely caused
by contention.
This commit skips the tests for crdb_internal.reset_sql_stats()
to prevent nightly failures.

Related #69731

Release justification: Non-production code changes

Release note: None

69967: vendor: bump Pebble to 6c12d67b83e6 r=jbowens a=jbowens

```
6c12d67 internal/metamorphic: randomize FormatMajorVersion
e82fb10 db: randomize format major version in unit tests
535b8d6 db: add FormatUpgrade event to EventListener
53dda0f db: introduce format major version
8ec1a49 vfs/atomicfs: add ReadMarker
daf93f0 sstable: Free cache value when checksum type is corrupt
d89613d metamorphic: randomly use disk for tests
e3b6bec metamorphic: transform percentage of SINGLEDEL ops to DELETE ops
41239f8 db: add test demonstrating current SINGLEDEL behavior
```

Release note: none

Release justification: non-production code changes, and bug fix
necessary for a release blocker.

69974: backupccl: set sqlstats testing knobs for scheduled job test r=maryliag,miretskiy a=Azhng

Previously, backupccl tests did not set the sql stats AOST testing knob
to override the AOST behavior. This caused sql stats error
stack traces to show up in backupccl tests.
This commit adds sql stats testing knobs for the backupccl test
helpers to mitigate this.

Release justification: Non-production code changes

Release note: None

Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Co-authored-by: Azhng <archer.xn@gmail.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Sep 10, 2021
69923: storage: narrow down use of SingleDel to avoid anomalies r=sumeerbhola a=tbg

Fixes #69414.

Only use SingleDel if

- the meta tracking allows it (i.e. the sole previous condition), and
- epoch zero, and
- no ignored txn seqno ranges.

These three together imply that the engine history of the metadata key
is a single `Set`, so we can safely use `SingleDel`.

Release justification: fix for a release blocker
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@craig craig bot closed this as completed in 7971115 Sep 10, 2021
blathers-crl bot pushed a commit that referenced this issue Sep 13, 2021
Fixes #69414.

Only use SingleDel if

- the meta tracking allows it (i.e. the sole previous condition), and
- epoch zero, and
- no ignored txn seqno ranges.

These three together imply that the engine history of the metadata key
is a single `Set`, so we can safely use `SingleDel`.

Release justification: fix for a release blocker
Release note: None
craig bot pushed a commit that referenced this issue Sep 14, 2021
69658: spanconfig: disable infrastructure unless envvar is set r=irfansharif a=irfansharif

Cluster settings are too easy a switch to reach for to enable the new
span configs machinery. Let's gate it behind a necessary envvar as
well and use cluster settings to selectively toggle individual
components.

This commit also fixes a mighty silly bug introduced in #69047; for the
two methods we intended to gate using
`spanconfig.experimental_kvaccessor.enabled`, we were checking the
opposite condition or not checking it at all. Oops.

Release note: None
Release justification: non-production code changes

69809: kv/kvserver: use proper formatting when debug printing intents r=AlexTalks a=AlexTalks

This commit changes the formatting used when printing intents via the
CLI debug command from the default generated Protobuf formatter to our
custom `MVCCMetadata` formatter implementation.  Additionally, the
`MergeTimestamp` and `TxnDidNotUpdateMetadata` fields have been added to
the output.  This changes the debug formatting from the former
example:
```
0,0 /Local/RangeID/203/r/RangePriorReadSummary (0x0169f6cb727270727300): {Txn:<nil> Timestamp:0,0 Deleted:false KeyBytes:0 ValBytes:0 RawBytes:[230 123 85 161 3 10 12 10 10 8 146 229 195 204 139 135 186 208 22 18 12 10 10 8 146 229 195 204 139 135 186 208 22] IntentHistory:[] MergeTimestamp:<nil> TxnDidNotUpdateMeta:<nil>}
/Local/Lock/Intent/Table/56/1/1319/6/3055/0 0361fea07d3f0d40ba8f44dc4ee46cbdc2 (0x017a6b12c089f705278ef70bef880001000361fea07d3f0d40ba8f44dc4ee46cbdc212): 1630559399.176502568,0 {Txn:id=61fea07d key=/Table/57/1/1319/6/0 pri=0.01718258 epo=0 ts=1630559399.176502568,0 min=1630559399.176502568,0 seq=4 Timestamp:1630559399.176502568,0 Deleted:false KeyBytes:12 ValBytes:5 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> TxnDidNotUpdateMeta:0xc0016059b0}
```
to the following example:
```
0,0 /Local/RangeID/203/r/RangePriorReadSummary (0x0169f6cb727270727300): txn={<nil>} ts=0,0 del=false klen=0 vlen=0 raw=/BYTES/0x0a0c0a0a0892e5c3cc8b87bad016120c0a0a0892e5c3cc8b87bad016 mergeTs=<nil> txnDidNotUpdateMeta=false
/Local/Lock/Intent/Table/56/1/1319/6/3055/0 0361fea07d3f0d40ba8f44dc4ee46cbdc2 (0x017a6b12c089f705278ef70bef880001000361fea07d3f0d40ba8f44dc4ee46cbdc212): 1630559399.176502568,0 txn={id=61fea07d key=/Table/57/1/1319/6/0 pri=0.01718258 epo=0 ts=1630559399.176502568,0 min=1630559399.176502568,0 seq=4} ts=1630559399.176502568,0 del=false klen=12 vlen=5 mergeTs=<nil> txnDidNotUpdateMeta=true
```

Related to #69414

Release justification: Bug fix
Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>
aliher1911 pushed a commit that referenced this issue Sep 16, 2021
Fixes #69414.

Only use SingleDel if

- the meta tracking allows it (i.e. the sole previous condition), and
- epoch zero, and
- no ignored txn seqno ranges.

These three together imply that the engine history of the metadata key
is a single `Set`, so we can safely use `SingleDel`.

Release justification: fix for a release blocker
Release note: None