storage: add experimental MVCC range key primitives #77417

erikgrinaker · 2022-03-06T21:38:42Z

This patch adds initial experimental primitives for MVCC range keys,
which will be the foundation for MVCC range tombstones. They are based
on experimental Pebble range keys.

Data structures:
- MVCCRangeKey
- MVCCRangeKeyValue
Engine methods for mutating range keys:
- ExperimentalClearMVCCRangeKey()
- ExperimentalClearAllMVCCRangeKeys()
- ExperimentalPutMVCCRangeKey()
- SupportsRangeKeys()
SimpleMVCCIterator methods for accessing range keys:
- HasPointAndRange()
- RangeBounds()
- RangeKeys()

Range keys do not have a distinct identity, and should instead be
considered a key continuum: they will merge with abutting keys of the
same value, can be partially cleared, can split or merge along with
ranges, and so on. Bounded scans will truncate them to the scan bounds.

Only MVCC range tombstones are currently supported, with an empty value.
Attempts to write a non-tombstone value will error, since these would
need additional logic throughout the MVCC APIs, and their semantics are
still unclear.

Range key support is implemented in pebbleIterator and
intentInterleavingIter, but not in the rest of the MVCC or KV APIs.
They are not persisted to disk either. Subsequent pull requests will
extend their functionality and integrate them with other components.

Touches #70412.

Release note: None

cockroach-teamcity · 2022-03-06T21:38:50Z

This change is

erikgrinaker · 2022-03-07T13:05:38Z

@jbowens I'm getting an odd test failure here. When running TestEngineRangeKeysUnsupported after TestEngineMVCCRangeKeyMutations, I get a range keys unsupported panic when I'm trying to write a point key. This does not happen when I run TestEngineRangeKeysUnsupported by itself. Seems like the batch state leaks between tests or something, even though I'm setting up a new engine? TestEngineRangeKeysUnsupported uses format version 5, while TestEngineMVCCRangeKeyMutations uses format version 6.

Fails here:

cockroach/pkg/storage/engine_test.go

Line 1806 in fe16c4f

require.NoError(t, eng.PutMVCC(pointKey("a", 1), []byte("a1")))

$ dev test --stress -v pkg/storage -f RangeKey
[...]
=== RUN   TestEngineMVCCRangeKeyMutations/batch=false
=== RUN   TestEngineMVCCRangeKeyMutations/batch=true
=== CONT  TestEngineMVCCRangeKeyMutations
    engine_test.go:1791: -- test log scope end --
--- PASS: TestEngineMVCCRangeKeyMutations (0.01s)
    --- PASS: TestEngineMVCCRangeKeyMutations/batch=false (0.01s)
    --- PASS: TestEngineMVCCRangeKeyMutations/batch=true (0.00s)
=== RUN   TestEngineRangeKeysUnsupported
    test_log_scope.go:79: test logs captured to: /home/erik/.cache/bazel/_bazel_erik/58b8816bb117734f0a0eaaef7fcef95b/sandbox/linux-sandbox/17/execroot/cockroach/_tmp/c5eb8fc8b8e683c19f3c3e4238f64094/logTestEngineRangeKeysUnsupported2226316390
    test_log_scope.go:80: use -show-logs to present logs inline
    panic.go:1038: -- test log scope end --

ERROR: a panic has occurred!
Details cannot be printed yet because we are still unwinding.
Hopefully the test harness prints the panic below, otherwise check the test logs.

test logs left over in: /home/erik/.cache/bazel/_bazel_erik/58b8816bb117734f0a0eaaef7fcef95b/sandbox/linux-sandbox/17/execroot/cockroach/_tmp/c5eb8fc8b8e683c19f3c3e4238f64094/logTestEngineRangeKeysUnsupported2226316390
    panic.go:1038: pebble: range keys require at least format major version 6 (current: 5)
        goroutine 130 [running]:
        runtime/debug.Stack()
        	GOROOT/src/runtime/debug/stack.go:24 +0x65
        github.com/cockroachdb/cockroach/pkg/util/leaktest.AfterTest.func1()
        	github.com/cockroachdb/cockroach/pkg/util/leaktest/leaktest.go:110 +0x11f
        panic({0x17bf8e0, 0xc000111db0})
        	GOROOT/src/runtime/panic.go:1038 +0x215
        github.com/cockroachdb/pebble.(*DB).Apply(0xc000162000, 0xc0006b2c60, 0xb)
        	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:726 +0x33a
        github.com/cockroachdb/pebble.(*DB).Set(0xc000162000, {0xc0004a0f70, 0xb, 0xb}, {0xc0004a0f68, 0x2, 0x2}, 0xc0004a0f60)
        	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:559 +0xda
        github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).put(0xc000164360, {{0xc0004a0f60, 0x1, 0x8}, {0x1, 0x0, 0x0}}, {0xc0004a0f68, 0x2, 0x2})
        	github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1239 +0xd4
        github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).PutMVCC(0x29, {{0xc0004a0f60, 0x1, 0x8}, {0x1, 0x0, 0x0}}, {0xc0004a0f68, 0x2, 0x2})
        	github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1211 +0xac
        github.com/cockroachdb/cockroach/pkg/storage.TestEngineRangeKeysUnsupported(0xc0005901a0)
        	github.com/cockroachdb/cockroach/pkg/storage/engine_test.go:1806 +0x2cd
        testing.tRunner(0xc0005901a0, 0x1b50720)
        	GOROOT/src/testing/testing.go:1259 +0x102
        created by testing.(*T).Run
        	GOROOT/src/testing/testing.go:1306 +0x35a
--- FAIL: TestEngineRangeKeysUnsupported (0.01s)

jbowens · 2022-03-07T15:54:34Z

@erikgrinaker Thanks for catching that! cockroachdb/pebble#1563

erikgrinaker · 2022-03-11T15:49:21Z

@jbowens Would like to get your thoughts on a few IterKeyTypePointsAndRanges behaviors. These aren't necessarily bugs, but they don't seem entirely consistent either.

# Write [a-c)@5, b@7, and b@3
run ok
put_rangekey k=a end=c ts=5 v=
put k=b ts=3 v=b3
put k=b ts=7 v=b7
----
>> at end:
rangekey: {a-c}/5.000000000,0 -> /BYTES/
data: "b"/7.000000000,0 -> /BYTES/b7
data: "b"/3.000000000,0 -> /BYTES/b3

# Seek GE to b@8 yields [a-c)@5, next b@7.
# Seek GE to b@7 yields b@7.
# Should seek to b@7 also yield range key before point key?
run ok
iter_new types=pointsAndRanges
iter_seek_ge k=b ts=8
iter_next
iter_seek_ge k=b ts=7
----
iter_seek_ge: {a-c}=[5.000000000,0=/BYTES/]
iter_next: "b"/7.000000000,0=/BYTES/b7 {a-c}=[5.000000000,0=/BYTES/]
iter_seek_ge: "b"/7.000000000,0=/BYTES/b7 {a-c}=[5.000000000,0=/BYTES/]

# Seek GE to b@6 yields [a-c)@5. Prev twice yields b@7 then [a-c)@5 again.
# Should [a-c)@5 be emitted twice?
run ok
iter_new types=pointsAndRanges
iter_seek_ge k=b ts=6
iter_prev
iter_prev
----
iter_seek_ge: {a-c}=[5.000000000,0=/BYTES/]
iter_prev: "b"/7.000000000,0=/BYTES/b7 {a-c}=[5.000000000,0=/BYTES/]
iter_prev: {a-c}=[5.000000000,0=/BYTES/]

# Seek LT to b@5 yields b@7, prev [a-c)@5.
# Why doesn't this emit [a-c)@5 before b@7, when seek GE does that?
run ok
iter_new types=pointsAndRanges
iter_seek_lt k=b ts=5
iter_prev
iter_prev
----
iter_seek_lt: "b"/7.000000000,0=/BYTES/b7 {a-c}=[5.000000000,0=/BYTES/]
iter_prev: {a-c}=[5.000000000,0=/BYTES/]
iter_prev: .

aliher1911

Some questions about keys while I'm reading more involved bits.

pkg/storage/mvcc_key.go

jbowens · 2022-03-11T20:20:31Z

@erikgrinaker Yeah, I see how some of these mechanics can be confusing and unexpected. Let me try my best to explain the current behavior first, and then we can circle back to see whether it would be beneficial to change them.

We have a continuous infinite keyspace. The classic Pebble iterator moves along this 1-dimensional keyspace and pauses at point keys only. Point keys are by definition inclusive, and the Iterator's position within the keyspace is conveyed through Iterator.Key(). In the example, an iterator pauses at b@7 and b@3 since the both have point keys defined.

                             . b@7     . b@3
<____________________________________________________>
    a                    b                    c

With the introduction of range keys, we need to shoehorn continuous spans into our iteration over discrete points. To do that, we say that the Iterator pauses at point keys and at the beginning of new range keys. The beginning of range keys is inclusive, so it ensures that the state that we surface on the Iterator always applies to Iterator.Key(). In the below diagram * denotes where within the keyspace the Iterator pauses, only ever visiting a discrete location once.

    *                        *         *
    |-----------------------------------------) [a,c)@5
                             . b@7     . b@3
<____________________________________________________>
    a                    b                    c

If the start of a range key coincides with a point key, the Iterator still only visits the Iterator.Key() once.

The above scheme works for scanning across the entire keyspace, but what about seeks? If you SeekGE(b@8) where should you end up? The Pebble iterator will pause at the first location with a key in the searched keyspace [b@8,∞). In this case, the range key overlaps the search key b@8 so the first key defined over the keyspace is the range key. There's a problem though, because the range key's start is in the past. It would be bizarre to surface the range key at Key() == a since a < b@8. There might be additional point keys in the interval [a, b@8) which we don't care about and don't want to surface. So the iterator pauses at b@8, the first discrete key within the search keyspace that contains a key. At this location within the keyspace, there only exists a range key (whose bounds and values are surfaced through RangeBounds() and RangeKeys()).

                         b@8
    *                      * *         *
    |----------------------|------------------) [a,c)@5
                           | . b@7     . b@3
<__________________________|_________________________>
    a                    b                    c

If instead, we SeekGE(b@7), the iterator again pauses at the seek key. However this time, there's actually two keys defined at the seek key. There's the overlapping range key and the point key:

                            b@7
    *                        *         *
    |------------------------|----------------) [a,c)@5
                             . b@7     . b@3
<__________________________ _|_______________________>
    a                    b                    c

This SeekGE behavior can result in stopping at range-key only locations twice, like you observed in your second example. A SeekGE(b@6) pauses at b@6, where the first key exists in the keyspace [b@6,∞). But if we turn around and start going in the reverse direction, we'll eventually pause at the true start of the range key a.

In your last example, you're observing that the Iterator only pauses at range key starts, never end keys. This is because end keys are exclusive, and the state surfaced on an Iterator is interpreted to always apply to the Iterator.Key().

Some of these examples might make more sense with the Iterator.Key() in the test output:

batch
range-key-set a c @5 foo
set b@7 b@7
set b@3 b@3
----
wrote 3 keys

combined-iter
seek-ge b@8
next
seek-ge b@7
----
b@8: (., [a-c) @5=foo)
b@7: (b@7, [a-c) @5=foo)
b@7: (b@7, [a-c) @5=foo)

combined-iter
seek-ge b@6
prev
prev
----
b@6: (., [a-c) @5=foo)
b@7: (b@7, [a-c) @5=foo)
a: (., [a-c) @5=foo)

combined-iter
seek-lt b@5
prev
prev
----
b@7: (b@7, [a-c) @5=foo)
a: (., [a-c) @5=foo)
.

sumeerbhola

Reviewed 5 of 23 files at r4, 2 of 4 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)

pkg/storage/engine.go, line 84 at r5 (raw file):

	// key and/or a range key. If Valid() returns true, one of these will be true.
	// Range keys are only emitted when requested via IterOptions.KeyTypes.
	HasPointAndRange() (bool, bool)

I know I had said we should move towards the EngineIterator interface-style of SeekEngineKeyGE(key EngineKey) (valid bool, err error). I realize the refactor can be messy and one would rather do it later. However, now we have to do a sequence like

iter.Next()
valid, err := iter.Valid()
if !valid || err != nil {
   ...
}
hasP, hasR := iter.HasPointAndRange()
// Do the real work
...

Which looks annoying.
I would prefer one of two things:

HasPointAndRange() (bool, bool, error), so the new code does not need to call Valid.
Keeping Valid, for now, for the old code. And changing all the step and seek methods to return (bool, bool, error) (which the old code will ignore and the new code will use).

pkg/storage/engine.go, line 86 at r5 (raw file):

	HasPointAndRange() (bool, bool)
	// RangeBounds returns the range bounds for the current range key fragment, if
	// any. See RangeKeys() for more info on range key fragments.

if any?
So is it legal to call this method if HasPointAndRange said there was no range?

pkg/storage/engine.go, line 110 at r5 (raw file):

	// partially removed by GC, and may be truncated by iterator bounds.
	//
	// TODO(erikgrinaker): Write a tech note on range keys and link it here.

why do we need to defragment in CockroachDB, now that we have defragmentation built into Pebble?

pkg/storage/engine.go, line 358 at r5 (raw file):

	//
	// NB: range keys are only supported for use with MVCCIterators, but it is
	// legal to enable them for EngineIterators in order to derive cloned

where are we cloning engine iterators to use as the mvcc iterator inside intentInterleavingIter?

This comment does raise an issue: currently the pebble.Iterators that we reuse in pebbleReadOnly only change their bounds. It seems now that they could also switch in their IterKeyType. But the pebble.IterKeyType is currently fixed when creating the pebble.Iterator. We probably need to change that.

pkg/storage/engine.go, line 376 at r5 (raw file):

	// point keys are not surfaced.
	//
	// TODO(erikgrinaker): Consider moving this down into Pebble.

can this TODO be done now, since we have pebble.IterKeyType?

pkg/storage/engine.go, line 673 at r5 (raw file):

	// have severe limitations including being ignored by all KV and MVCC APIs and
	// only being stored in memory.
	ExperimentalClearMVCCRangeKeys(start, end roachpb.Key) error

These two method names differ in "s". Can we use something more forceful like DestroyMVCCRangeKeys, or ClearRangeKeys (since this isn't an MVCC compliant operation)?

pkg/storage/engine.go, line 677 at r5 (raw file):

	// ExperimentalPutMVCCRangeKey writes a value to an MVCC range key. It will
	// replace any existing keys, or any segments that it overlaps. This is
	// currently only used for range tombstones, which have a value of nil.

If this value is nil, I would prefer not passing the parameter at all.
Between future-proofing now and a goland-aided manual refactoring later, I think the latter is better since it avoids adding unnecessary invariant checks now.

pkg/storage/intent_interleaving_iter.go, line 365 at r5 (raw file):

	}
	// If we find an intent, but we land on a range key covering it, skip to
	// the point key.

This needs some justification, grounded in what semantics the user is desiring (ideally documented in engine.go).
And I don't understand why we are doing this for SeekIntentGE and not for SeekGE -- the former is just an optimization.

pkg/storage/mvcc.go, line 105 at r5 (raw file):

type MVCCRangeKeyValue struct {
	Key   MVCCRangeKey
	Value []byte

Is this trying to future proof the API, and Value will always be nil for now?
If yes, please add a comment stating that current invariant. Or better still, let's keep MVCCRangeKeyValue so we have a struct to add to, and get rid of the Value field and add a code comment on how this will change in the future.

pkg/storage/mvcc_key.go, line 349 at r5 (raw file):

// equal, or 1 if this is greater. Comparison is by start,timestamp,end, where
// larger timestamps sort before smaller ones except empty ones which sort first
// (like elsewhere in MVCC).

why do we need this method?

pkg/storage/mvcc_key.go, line 390 at r5 (raw file):

	// We allow equal start and end key, since we allow empty spans in many MVCC
	// APIs (e.g. scans).
	if k.StartKey.Compare(k.EndKey) > 0 {

This validation function is used on the write path. We should not allow writing of empty spans.

pkg/storage/mvcc_range_key_iterator.go, line 42 at r5 (raw file):

// SimpleMVCCIterator, range keys are defragmented into contiguous deterministic
// range keys. It does not support seeking or backtracking, see
// MVCCRangeKeyIterOptions for lower/upper bounds and other options.

why do we need to defragment in CockroachDB now that we have completely deterministic iteration semantics in Pebble?

pkg/storage/pebble.go, line 1873 at r5 (raw file):

	}

	reusable := opts.MinTimestampHint.IsEmpty() && opts.KeyTypes == IterKeyTypePointsOnly

I think we need general reuse. I am fine with a TODO for future PRs.

erikgrinaker · 2022-03-12T13:17:08Z

the Iterator pauses at point keys and at the beginning of new range keys.

Yeah, I think this is at the root of it. I understand why it's done this way, and it isn't necessarily a problem (I can't think of anything that it'd prevent us from doing).

Another possible model to consider would be to surface range keys when they're first encountered. This essentially implies that range keys exist over a range, rather than anchoring it at the start key. The current seekGE semantics already imply this, so there is already an inconsistency in saying that they exist at the start key. Let's say we iterate over this set of keys:

[b-f)@3, b@2, d@2, f@2

The current reverse iteration order is symmetrical, which is sort of nice in itself. However, we could also also imagine this reverse iteration order:

f@2, [b-f)@3, d@2, b@2

It follows that a seekLT to e would land at [b-f)@3 first, just like seekGE would. This would also eliminate the duplicate emission of the lone range key when changing directions after a seek. However, as you point out, it presents a problem wrt. Iterator.Key(), since it's unclear what it should return when positioned at [b-f)@3 during reverse iteration. It's not clear to me that it necessarily makes sense for a range key to have a point position though, but I understand that this is easier to retrofit onto the current Pebble iterators.

Just a thought. I don't think we really gain anything from this other than it being slightly more intuitive (at least to me), so it might be a lot of work for little benefit. But I'd be curious to hear what you think.

erikgrinaker

Dismissed @aliher1911 from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @1, @2, @aliher1911, @jbowens, and @sumeerbhola)

pkg/storage/engine.go, line 84 at r5 (raw file):

Previously, sumeerbhola wrote…

I know I had said we should move towards the EngineIterator interface-style of SeekEngineKeyGE(key EngineKey) (valid bool, err error). I realize the refactor can be messy and one would rather do it later. However, now we have to do a sequence like
iter.Next()
valid, err := iter.Valid()
if !valid || err != nil {
   ...
}
hasP, hasR := iter.HasPointAndRange()
// Do the real work
...
Which looks annoying.
I would prefer one of two things:

HasPointAndRange() (bool, bool, error), so the new code does not need to call Valid.

Keeping Valid, for now, for the old code. And changing all the step and seek methods to return (bool, bool, error) (which the old code will ignore and the new code will use).

I think I prefer the latter, because it also allows us to surface decoding errors in a natural way. I am a bit worried that it may cause a performance issue in certain cases due to unnecessary decoding. Let's say we're iterating over point keys and are only interested in range keys at some points -- we would still have to decode all range keys during iteration regardless of whether the caller is interested in them in order to surface decoding errors via Next (we could cache them until the range bounds change, but still). But let's discuss that separately.

I agree that we should do this as part of the range tombstone work. However, we will have to maintain a feature branch for 1-2 months, and a refactoring like this would be really painful to keep rebasing onto master. Also, we don't want to make large changes on master until 22.1 RCs are out, to ease backports. So I think we should do this after the feature branch lands in master.

pkg/storage/engine.go, line 86 at r5 (raw file):

Previously, sumeerbhola wrote…

if any?
So is it legal to call this method if HasPointAndRange said there was no range?

Yes, it returns (nil, nil).

pkg/storage/engine.go, line 110 at r5 (raw file):

Previously, sumeerbhola wrote…

why do we need to defragment in CockroachDB, now that we have defragmentation built into Pebble?

These are different kinds/levels of defragmentation. Pebble defragments across physical SST bounds, to enforce the invariant that range key bounds always fall on bare prefixes. CockroachDB defragmentation is to return logical range keys from fragment stacks.

For example, after writing [a-c)@1 and [b-d)@2, Pebble returns [a-b)@1, [b-c)@2, [b-c)@1, [c-d)@2. The CRDB defragmentation returns [a-c)@1 and [b-d)@2.

This will be used e.g. for MVCC statistics, as the above range keys will count as 2 range keys rather than 4. It is also a much more compact representation, which is useful in APIs or data structures (e.g. consistency checker diffs) and tests (e.g. data-driven MVCC history tests). And it can result in fewer write operations, since e.g. during GC we would make two RangeKeyUnset calls rather than four for the above.

pkg/storage/engine.go, line 358 at r5 (raw file):

where are we cloning engine iterators to use as the mvcc iterator inside intentInterleavingIter?

cockroach/pkg/storage/intent_interleaving_iter.go

Line 236 in a9e5753

iter = newMVCCIteratorByCloningEngineIter(intentIter, opts)

This comment does raise an issue: currently the pebble.Iterators that we reuse in pebbleReadOnly only change their bounds. It seems now that they could also switch in their IterKeyType. But the pebble.IterKeyType is currently fixed when creating the pebble.Iterator. We probably need to change that.

Yeah, so currently we avoid that by only reusing these iterators for IterKeyTypePointsOnly. New iterators will be created on each NewMVCCIterator() call with other key types. Once we port pebbleMVCCScanner to respect range tombstones we would likely change this to reuse iterators with IterKeyTypePointsAndRanges instead, since that will be the most common iterator key type.

It would be better if we could allow changing the key type on the fly. I don't know what that would entail in terms of labour and the performance penalty of setting up/tearing down the range key processing. Wdyt @jbowens?

pkg/storage/engine.go, line 376 at r5 (raw file):

Previously, sumeerbhola wrote…

can this TODO be done now, since we have pebble.IterKeyType?

No, Pebble does not offer IterKeyTypePointsWithRanges, only the other options. When we discussed this previously you had a preference for implementing this in the MVCC layer rather than in Pebble, but I'd obviously prefer to push it down to Pebble and avoid the complexity here. No practical difference though, either works for me.

pkg/storage/engine.go, line 673 at r5 (raw file):

Previously, sumeerbhola wrote…

These two method names differ in "s". Can we use something more forceful like DestroyMVCCRangeKeys, or ClearRangeKeys (since this isn't an MVCC compliant operation)?

Sure. I'd prefer ClearRangeKeys, since this operation isn't really more destructive than the other one (they both clear range keys, it's just a matter of which ones).

pkg/storage/engine.go, line 677 at r5 (raw file):

Previously, sumeerbhola wrote…

If this value is nil, I would prefer not passing the parameter at all.
Between future-proofing now and a goland-aided manual refactoring later, I think the latter is better since it avoids adding unnecessary invariant checks now.

Yeah, I've been a bit uncertain about this. I was planning to fully implement support for non-nil values in MVCC, e.g. stats, GC, consistency checks, etc (where non-nil is considered live data), since I think the marginal cost of doing it now is significantly lower than if we have to page all this stuff back in later. But there's definitely a risk that it'll be wasted work if we end up either not using it or having to change the semantics later.

I could go either way here. I'm partial to doing it just because I want to "finish the job properly", but it has no bearing on MVCC range tombstones themselves. If people feel that it's a waste of time then I'm fine with removing the value handling for now.

pkg/storage/intent_interleaving_iter.go, line 365 at r5 (raw file):

Previously, sumeerbhola wrote…

This needs some justification, grounded in what semantics the user is desiring (ideally documented in engine.go).
And I don't understand why we are doing this for SeekIntentGE and not for SeekGE -- the former is just an optimization.

SeekGE takes a versioned MVCCKey, but SeekIntentGE takes an unversioned roachpb.Key. If we don't do this, then SeekIntentGE would never be able to seek to an intent with an overlapping range key, it would always land on the range key first. I thought the desired semantics here were to land on an intent that we expect to exist?

pkg/storage/mvcc.go, line 105 at r5 (raw file):

Previously, sumeerbhola wrote…

Is this trying to future proof the API, and Value will always be nil for now?
If yes, please add a comment stating that current invariant. Or better still, let's keep MVCCRangeKeyValue so we have a struct to add to, and get rid of the Value field and add a code comment on how this will change in the future.

See separate discussion above. If we're dropping valued range keys then we should drop MVCCRangeKeyValue too for now and just return MVCCRangeKey instead.

pkg/storage/mvcc_key.go, line 349 at r5 (raw file):

Previously, sumeerbhola wrote…

why do we need this method?

This was pulled in from #76880, where it's needed to generate diffs during replica consistency checks. And I also feel like the ordering semantics of range keys need to be well-defined.

pkg/storage/mvcc_key.go, line 390 at r5 (raw file):

Previously, sumeerbhola wrote…

This validation function is used on the write path. We should not allow writing of empty spans.

Yeah, I think we'll need to differentiate read/write validation -- there are other conditions that come into play too.

pkg/storage/mvcc_range_key_iterator.go, line 42 at r5 (raw file):

Previously, sumeerbhola wrote…

why do we need to defragment in CockroachDB now that we have completely deterministic iteration semantics in Pebble?

This was answered in a different thread.

pkg/storage/pebble.go, line 1873 at r5 (raw file):

Previously, sumeerbhola wrote…

I think we need general reuse. I am fine with a TODO for future PRs.

I agree, we should consolidate this somehow.

erikgrinaker

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @jbowens, and @sumeerbhola)

pkg/storage/engine.go line 338 at r17 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Good call, done.

I realized that we no longer need this now that we have SetOptions(): we can always create the intent iterator with IterKeyTypePointsOnly, and reconfigure the cloned MVCC iterator to enable range keys. This also avoids any associated range key overhead for the intent iterator.

Updated the intentInterleavingIter with this, and also added checks in NewEngineIterator(). We could consider separating the options structs for EngineIterator and MVCCIterator to make this clearer.

sumeerbhola

Reviewed 1 of 7 files at r21.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)

pkg/storage/engine.go line 75 at r17 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I moved all of this to a top-level comment, but kept a note on NextKey() because there is an additional subtlety here that isn't obvious and doesn't follow from the usual iteration semantics.

Consider a@4 and [a-c)@3. NextKey() will first land on [a-c)@3 then a@4. It isn't given that this is the correct or desired behavior. We could instead skip over a@4 (because a was emitted by [a-c)@3), or we could step onto a@4 and emit them both together.

I feel like the latter option is more intuitive, and probably more useful. However, it would require an additional peek ahead to look for a point key at the start key, and a step back if we didn't find one. So I'm not sure if it's worth the cost. Wdyt?

I think NextKey() means the user wants the next roachpb.Key, so skipping over a@4 sounds right. Additional peeking seems too complicated, for what was designed mainly as an optimization. If that first key was only a range key, then the caller that wants to see a point can do Next() instead of NextKey(). Given the smallish set of NextKey() callers, I think we should be able to change their code to follow this semantics.

pkg/storage/engine.go line 46 at r21 (raw file):

// noted. SimpleMVCCIterator is a subset of the functionality offered by
// MVCCIterator.
//

nice comment

pkg/storage/intent_interleaving_iter.go line 946 at r17 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Done.

Still there

pkg/storage/intent_interleaving_iter.go line 1005 at r17 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

With the "new" semantics of intentCmp (relative position of iterators wrt. provisional value), this logic is correct. We need to call this before stepping off the intent, and intentCmp > 0 implies that we're currently on the intent. But see comment above, happy to explore alternatives.

By tracking i.hasPoint and i.hasRange, we won't need to speculate and call maybeSkipIntentRangeKey (which incurs a key comparison). We will know that we need to do i.Prev if i.intentCmp==0 and we are positioned at the intent.

pkg/storage/intent_interleaving_iter.go line 115 at r21 (raw file):

	//   is at an intent, and iter is at the corresponding provisional value, cmp
	//   will be 0. If iter is on a bare range key with the same key position
	//   (start key or SeekGE key) as intentIter then intentCmp is 1 if iterKey has

nit: s/1/+1/

pkg/storage/intent_interleaving_iter.go line 117 at r21 (raw file):

	//   (start key or SeekGE key) as intentIter then intentCmp is 1 if iterKey has
	//   no timestamp and -1 otherwise, to correctly position it for reverse
	//   iteration. See the longer struct-level comment for more on the relative

I worry about cheating here by claiming intentCmp=+1 when it is actually 0.
I think it is better to track hasPoint, hasRange as member variables and consider the two different cases corresponding to intentCmp==0 and dir=-1

hasPoint: positioned at a versioned value and Prev will step iter.
!hasPoint: hasRange must be true. Positioned at the intent. Prev will step both iter and intentIter.

I've added some other comments that follow this same logic.

However, this doesn't actually work because we can have !hasPoint && hasRange for synthetic keys that have a timestamp, due to SeekGE.
But we can still make this work:
iterAtNonTimestampedRange = iterKey.Timestamp.IsEmpty() && hasRange && !hasPoint
The previous logic for intentCmp==0 and dir=-1 is

!iterAtNonTimestampedRange: positioned at a versioned value (may be a range or point) and Prev will step iter.
iterAtNonTimestampedRange: Positioned at the intent. Prev will step both iter and intentIter.

I don't think we need any other special casing for the case where we have encountered a timestamped range key. SeekGE must have positioned the intentIter at a later intent. The only thing that seems to require care is if someone calls Prev immediately after SeekGE. I tried to construct a couple of examples using SeekGE(a@10), where the range key also starts at a, and I think they work fine.
Example1:
Intents at a, b, Versions at a@12, a@9.
After the Seek, the intentIter is at b and iter is at a@10. Since the interleaving iter is not positioned at the intent, the call to Prev will first step the intentIter back to a. And then proceed as if we were already doing backward iteration. Since !iterAtNonTimestampedRange, iter will be stepped back to a@12 and become visible to the user. If the user does Prev again, iter will land on a, at which point iterAtNonTimestampedRange becomes true and the intent will become visible.

Exampe 2:
Intents at a, b. Version at a@9.
After the Seek, the intentIter is at b and iter is at a@10. Since the interleaving iter is not positioned at the intent, a call to Prev will first step the intentIter back to a. And then proceed as if we were already doing backward iteration. Since !iterAtNonTimestampedRange, iter will be stepped back to a. So iterAtNonTimestampedRange becomes true and the intent will become visible to the user.

pkg/storage/intent_interleaving_iter.go line 792 at r21 (raw file):

	//   - intentCmp <= 0. Returns false.
	//   - intentCmp > 0. Returns true.
	return (i.dir > 0 && i.intentCmp <= 0) || (i.dir < 0 && i.intentCmp > 0)

this would become

return (i.dir > 0 && i.intentCmp <= 0) || (i.dir < 0 && (i.intentCmp > 0 || (i.intentCmp == 0 && !i.hasPoint))

sumeerbhola

Reviewed 1 of 21 files at r17, 2 of 7 files at r21.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)

pkg/storage/pebble_iterator.go line 396 at r21 (raw file):

		return
	}
	hasPoint, _ := p.HasPointAndRange()

if we keep the semantics to be that the user wants to step to the next roachpb.Key (its their problem to figure out when to use Next and when to use NextKey), then I think this method doesn't need any change.

erikgrinaker

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @jbowens, and @sumeerbhola)

pkg/storage/engine.go line 75 at r17 (raw file):

Previously, sumeerbhola wrote…

I think NextKey() means the user wants the next roachpb.Key, so skipping over a@4 sounds right. Additional peeking seems too complicated, for what was designed mainly as an optimization. If that first key was only a range key, then the caller that wants to see a point can do Next() instead of NextKey(). Given the smallish set of NextKey() callers, I think we should be able to change their code to follow this semantics.

Sure. -- that's logically consistent at least, and I think the callers can manage. Done.

pkg/storage/intent_interleaving_iter.go line 946 at r17 (raw file):

Previously, sumeerbhola wrote…

Still there

Ah, yeah, I think this was because it triggered the assertion below it -- if we're switching directions after a SeekGE call lands inside a bare range key, then the Prev() call here will land on the start key, which can have intentCmp == 0 if intentIter is colocated with it. We can drop it by calling computePos() instead of setting intentCmp = +1, so I've done that. The rest should come out correct.

pkg/storage/intent_interleaving_iter.go line 1005 at r17 (raw file):

Previously, sumeerbhola wrote…

By tracking i.hasPoint and i.hasRange, we won't need to speculate and call maybeSkipIntentRangeKey (which incurs a key comparison). We will know that we need to do i.Prev if i.intentCmp==0 and we are positioned at the intent.

Yeah, we now only need to do this at this single site in the reverse direction, with computePos() already called, so I've made maybeSkipIntentRangeKey() only work in the forward direction and inlined the Prev() check.

pkg/storage/intent_interleaving_iter.go line 115 at r21 (raw file):

Previously, sumeerbhola wrote…

nit: s/1/+1/

Reverted the changes to this comment.

pkg/storage/intent_interleaving_iter.go line 117 at r21 (raw file):

Previously, sumeerbhola wrote…

I worry about cheating here by claiming intentCmp=+1 when it is actually 0.
I think it is better to track hasPoint, hasRange as member variables and consider the two different cases corresponding to intentCmp==0 and dir=-1

hasPoint: positioned at a versioned value and Prev will step iter.

!hasPoint: hasRange must be true. Positioned at the intent. Prev will step both iter and intentIter.

I've added some other comments that follow this same logic.

However, this doesn't actually work because we can have !hasPoint && hasRange for synthetic keys that have a timestamp, due to SeekGE.
But we can still make this work:
iterAtNonTimestampedRange = iterKey.Timestamp.IsEmpty() && hasRange && !hasPoint
The previous logic for intentCmp==0 and dir=-1 is

!iterAtNonTimestampedRange: positioned at a versioned value (may be a range or point) and Prev will step iter.

iterAtNonTimestampedRange: Positioned at the intent. Prev will step both iter and intentIter.

I don't think we need any other special casing for the case where we have encountered a timestamped range key. SeekGE must have positioned the intentIter at a later intent. The only thing that seems to require care is if someone calls Prev immediately after SeekGE. I tried to construct a couple of examples using SeekGE(a@10), where the range key also starts at a, and I think they work fine.
Example1:
Intents at a, b, Versions at a@12, a@9.
After the Seek, the intentIter is at b and iter is at a@10. Since the interleaving iter is not positioned at the intent, the call to Prev will first step the intentIter back to a. And then proceed as if we were already doing backward iteration. Since !iterAtNonTimestampedRange, iter will be stepped back to a@12 and become visible to the user. If the user does Prev again, iter will land on a, at which point iterAtNonTimestampedRange becomes true and the intent will become visible.

Exampe 2:
Intents at a, b. Version at a@9.
After the Seek, the intentIter is at b and iter is at a@10. Since the interleaving iter is not positioned at the intent, a call to Prev will first step the intentIter back to a. And then proceed as if we were already doing backward iteration. Since !iterAtNonTimestampedRange, iter will be stepped back to a. So iterAtNonTimestampedRange becomes true and the intent will become visible to the user.

Yep, that'll do. Updated, but bikeshedded it as iterBareRangeAtIntent which assumes intentCmp == 0.

pkg/storage/intent_interleaving_iter.go line 792 at r21 (raw file):

Previously, sumeerbhola wrote…

this would become

return (i.dir > 0 && i.intentCmp <= 0) || (i.dir < 0 && (i.intentCmp > 0 || (i.intentCmp == 0 && !i.hasPoint))

Done.

pkg/storage/pebble_iterator.go line 396 at r21 (raw file):

Previously, sumeerbhola wrote…

if we keep the semantics to be that the user wants to step to the next roachpb.Key (its their problem to figure out when to use Next and when to use NextKey), then I think this method doesn't need any change.

It still needs to step off the bare range key after the seek below. Otherwise, with [a-f)@3 and b@2,b@1,c@2 then calling NextKey() at b@2 would land on [a-f)@3 above c@2 due to the seek, even though that key was already emitted at a. But we don't need the condition you're pointing to here.

sumeerbhola

Reviewed 2 of 4 files at r22.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)

pkg/storage/intent_interleaving_iter.go line 334 at r22 (raw file):

//
// In the reverse direction, this is only needed once during Prev(), and the
// caller can check i.iterBareRangeAtIntent instead.

This "reverse direction" comment is confusing since this method is not called for reverse iteration.
I would suggest moving it to a more appropriate location.
Also, can you add
// REQUIRES: i.dir > 0

pkg/storage/intent_interleaving_iter.go line 612 at r22 (raw file):

		if isCurAtIntent {
			// iter precedes the intentIter, so must be at the lowest version of the
			// preceding key or exhausted. So step it forward. It will now point to

... or the bare range at intent or exhausted.

pkg/storage/intent_interleaving_iter.go line 631 at r22 (raw file):

			}
			if util.RaceEnabled {
				cmp := i.intentKey.Compare(i.iterKey.Key)

can you also add hasRange && !hasPoint as an error condition.

pkg/storage/intent_interleaving_iter.go line 874 at r22 (raw file):

		//
		//   hasRange → i.iterValid
		//   i.isCurAtIntentIter() && i.dir < 0 → i.intentCmp > 0

=> i.intentCmp > 0 || (i.intentCmp==0 && i.iterBareRangeAtIntent)

pkg/storage/intent_interleaving_iter.go line 878 at r22 (raw file):

		// TODO(erikgrinaker): consider optimizing this comparison.
		if hasRange && i.dir < 0 {
			hasRange = i.iter.RangeBounds().EndKey.Compare(i.intentKey) > 0

hasRange = i.intentCmp == 0 || i.iter.RangeBounds().EndKey.Compare(i.intentKey) > 0
can occasionally avoid a key comparison.

pkg/storage/intent_interleaving_iter.go line 1035 at r22 (raw file):

		// The iterator is positioned at an intent in intentIter, and iter is
		// exhausted, positioned at a versioned value of a preceding key, or
		// positioned on an intent colocated with the start of a range key.

should this be saying the following since this part of the sentence is about iter:
...or positioned on the start of a range key colocated with the intent.

pkg/storage/intent_interleaving_iter.go line 1047 at r22 (raw file):

			return
		}
		if i.iterBareRangeAtIntent {

I think we need to hoist this if i.iterBareRangeAtIntent block before the preceding code since we need i.iter positioned properly before calling i.makeLowerLimitKey. The *WithLimit code is a bit subtle in terms of sometimes ignoring the limit, which may be why this didn't fail a test. Try something like the following in a debugger:
a@10, b@10, [b, c)@5 and intents at a, b, i.e.., and see if calling intentInterleavingIter.Prev when the iterator is positioned at the intent b, causes the current code to position iter at a@10 and the intentIter to be at IterAtLimit (instead of correctly being at a), which would make intentCmp = -1. Then the next call to intentInterleavingIter.Prev will exhaust iter so the if i.intentIterState == pebble.IterAtLimit && i.iterValid will be false and we will never expose the intent.

sumeerbhola

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)

pkg/storage/pebble_iterator.go line 406 at r22 (raw file):

		// either a point key or range key starting past the seek key.
		if hasPoint, hasRange := p.HasPointAndRange(); !hasPoint && hasRange {
			if p.RangeBounds().Key.Compare(p.keyBuf) <= 0 {

What if we had point keys a@10, a@5, b@10 and range keys [a,a\0)@5, [a\0,b)@8. And we seeked while being positioned at a, so SeekGE(a\0). That would land at [a\0,b)@8, which does not extend into a, so has not been emitted. And then by doing Next we will end up at b@10, so never show the range key.

erikgrinaker

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @5, @8, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)

pkg/storage/intent_interleaving_iter.go line 334 at r22 (raw file):

Previously, sumeerbhola wrote…

This "reverse direction" comment is confusing since this method is not called for reverse iteration.
I would suggest moving it to a more appropriate location.
Also, can you add
// REQUIRES: i.dir > 0

Ah, forgot to remove that paragraph. Done.

pkg/storage/intent_interleaving_iter.go line 612 at r22 (raw file):

Previously, sumeerbhola wrote…

... or the bare range at intent or exhausted.

Done.

pkg/storage/intent_interleaving_iter.go line 631 at r22 (raw file):

Previously, sumeerbhola wrote…

can you also add hasRange && !hasPoint as an error condition.

Done.

pkg/storage/intent_interleaving_iter.go line 874 at r22 (raw file):

Previously, sumeerbhola wrote…

=> i.intentCmp > 0 || (i.intentCmp==0 && i.iterBareRangeAtIntent)

Done.

pkg/storage/intent_interleaving_iter.go line 878 at r22 (raw file):

Previously, sumeerbhola wrote…

hasRange = i.intentCmp == 0 || i.iter.RangeBounds().EndKey.Compare(i.intentKey) > 0
can occasionally avoid a key comparison.

Good catch, thanks. Done.

pkg/storage/intent_interleaving_iter.go line 1035 at r22 (raw file):

Previously, sumeerbhola wrote…

should this be saying the following since this part of the sentence is about iter:
...or positioned on the start of a range key colocated with the intent.

Yes, it should, thanks.

pkg/storage/intent_interleaving_iter.go line 1047 at r22 (raw file):

Previously, sumeerbhola wrote…

I think we need to hoist this if i.iterBareRangeAtIntent block before the preceding code since we need i.iter positioned properly before calling i.makeLowerLimitKey. The *WithLimit code is a bit subtle in terms of sometimes ignoring the limit, which may be why this didn't fail a test. Try something like the following in a debugger:
a@10, b@10, [b, c)@5 and intents at a, b, i.e.., and see if calling intentInterleavingIter.Prev when the iterator is positioned at the intent b, causes the current code to position iter at a@10 and the intentIter to be at IterAtLimit (instead of correctly being at a), which would make intentCmp = -1. Then the next call to intentInterleavingIter.Prev will exhaust iter so the if i.intentIterState == pebble.IterAtLimit && i.iterValid will be false and we will never expose the intent.

Gah, didn't pay enough attention to the limits, thanks. Your example is almost there, but the a@10 key saves us since we reposition intentIter when we land on it. Intents on a,b are sufficient by themselves.

The existing tests didn't catch this because there needs to be a bare intent next to an iterator bound. Added a couple of regression tests for this.

pkg/storage/pebble_iterator.go line 406 at r22 (raw file):

Previously, sumeerbhola wrote…

What if we had point keys a@10, a@5, b@10 and range keys [a,a\0)@5, [a\0,b)@8. And we seeked while being positioned at a, so SeekGE(a\0). That would land at [a\0,b)@8, which does not extend into a, so has not been emitted. And then by doing Next we will end up at b@10, so never show the range key.

Great catch! Added a fix and regression test. I'm going to do another pass over everything with fresh eyes tomorrow, to check for any further stuff I might have missed.

erikgrinaker · 2022-06-03T09:42:48Z

Ran some more benchmarks, which are in line with the previous ones. We see a slight regression, but it isn't dramatic, and good enough that we can merge this to master to unblock other work and revisit performance optimizations later.

name                                                           old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24                4.78µs ± 1%    4.84µs ± 0%  +1.36%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=1/versions=10/valueSize=64-24               6.50µs ± 1%    6.59µs ± 1%  +1.28%  (p=0.002 n=10+10)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24              35.7µs ± 1%    35.7µs ± 1%    ~     (p=0.847 n=10+8)
MVCCScan_Pebble/rows=100/versions=10/valueSize=64-24              113µs ± 1%     112µs ± 1%  -1.05%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24            2.56ms ± 1%    2.58ms ± 1%    ~     (p=0.089 n=10+10)
MVCCScan_Pebble/rows=10000/versions=10/valueSize=64-24           9.65ms ± 2%    9.62ms ± 2%    ~     (p=0.579 n=10+10)
MVCCReverseScan_Pebble/rows=1/versions=1/valueSize=64-24         5.14µs ± 1%    5.25µs ± 1%  +2.09%  (p=0.000 n=10+10)
MVCCReverseScan_Pebble/rows=1/versions=10/valueSize=64-24        8.90µs ± 1%    9.09µs ± 1%  +2.20%  (p=0.000 n=10+10)
MVCCReverseScan_Pebble/rows=100/versions=1/valueSize=64-24       47.1µs ± 1%    48.4µs ± 1%  +2.71%  (p=0.000 n=9+10)
MVCCReverseScan_Pebble/rows=100/versions=10/valueSize=64-24       317µs ± 1%     325µs ± 1%  +2.34%  (p=0.000 n=10+9)
MVCCReverseScan_Pebble/rows=10000/versions=1/valueSize=64-24     3.75ms ± 1%    3.78ms ± 1%  +0.74%  (p=0.023 n=10+10)
MVCCReverseScan_Pebble/rows=10000/versions=10/valueSize=64-24    30.1ms ± 2%    30.9ms ± 2%  +2.70%  (p=0.001 n=10+9)
MVCCScanTransactionalData_Pebble-24                              3.24ms ± 1%    3.24ms ± 0%    ~     (p=1.000 n=10+9)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24             4.57µs ± 0%    4.57µs ± 2%    ~     (p=0.211 n=10+8)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24            5.53µs ± 1%    5.55µs ± 1%    ~     (p=0.183 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24           13.5µs ± 3%    13.8µs ± 1%  +1.77%  (p=0.019 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24              2.80µs ± 0%    2.87µs ± 1%  +2.50%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24             4.04µs ± 0%    4.10µs ± 1%  +1.48%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24            11.1µs ± 3%    11.1µs ± 5%    ~     (p=0.968 n=9+10)

erikgrinaker · 2022-06-03T10:37:24Z

Ran some more benchmarks, which are in line with the previous ones. We see a slight regression, but it isn't dramatic, and good enough that we can merge this to master to unblock other work and revisit performance optimizations later.

This seems to be caused by additional isCurAtIntentIter() checks that are now more expensive, especially in the reverse direction. I ran a benchmark where I made it always return false (since there are no intents in the dataset) and that closed the gap:

MVCCReverseScan_Pebble/rows=1/versions=1/valueSize=64-24         5.14µs ± 1%    5.14µs ± 0%    ~     (p=0.828 n=10+8)
MVCCReverseScan_Pebble/rows=1/versions=10/valueSize=64-24        8.90µs ± 1%    8.93µs ± 1%    ~     (p=0.089 n=10+10)
MVCCReverseScan_Pebble/rows=100/versions=1/valueSize=64-24       47.1µs ± 1%    47.4µs ± 1%    ~     (p=0.133 n=9+10)
MVCCReverseScan_Pebble/rows=100/versions=10/valueSize=64-24       317µs ± 1%     317µs ± 1%    ~     (p=0.971 n=10+10)
MVCCReverseScan_Pebble/rows=10000/versions=1/valueSize=64-24     3.75ms ± 1%    3.72ms ± 1%  -0.84%  (p=0.002 n=10+10)
MVCCReverseScan_Pebble/rows=10000/versions=10/valueSize=64-24    30.1ms ± 2%    30.2ms ± 4%    ~     (p=0.739 n=10+10)

For now, I've added specialized forward/reverse methods that are called in the Next/Prev hot paths. This improves things somewhat compared to the previous results:

name                                                           old time/op    new time/op    delta
MVCCReverseScan_Pebble/rows=1/versions=1/valueSize=64-24         5.14µs ± 1%    5.24µs ± 1%  +1.93%  (p=0.000 n=10+10)
MVCCReverseScan_Pebble/rows=1/versions=10/valueSize=64-24        8.90µs ± 1%    9.05µs ± 0%  +1.65%  (p=0.000 n=10+10)
MVCCReverseScan_Pebble/rows=100/versions=1/valueSize=64-24       47.1µs ± 1%    47.7µs ± 1%  +1.13%  (p=0.000 n=9+10)
MVCCReverseScan_Pebble/rows=100/versions=10/valueSize=64-24       317µs ± 1%     321µs ± 1%  +1.28%  (p=0.000 n=10+10)
MVCCReverseScan_Pebble/rows=10000/versions=1/valueSize=64-24     3.75ms ± 1%    3.74ms ± 1%    ~     (p=0.053 n=10+9)
MVCCReverseScan_Pebble/rows=10000/versions=10/valueSize=64-24    30.1ms ± 2%    30.8ms ± 3%  +2.27%  (p=0.005 n=10+10)

I also added a TODO to compute isCurAtIntentIter once every time the iterator is repositioned, which should address most of the regression here.

sumeerbhola

I don't see any changes in the testdata/intent_interleaving_iter directory. Can you add some test cases there, since we could benefit from focused testing of the intentInterleavingIter changes? Don't worry about the existing tests that also have interleaved intents -- its on my TODO list to cleanup.

Reviewed 1 of 4 files at r23, 3 of 11 files at r24.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @5, @8, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)

pkg/storage/intent_interleaving_iter.go line 584 at r24 (raw file):

	if i.dir < 0 {
		// Switching from reverse to forward iteration.
		isCurAtIntent := i.isCurAtIntentIter()

nit: isCurAtIntentIterReverse()
I think its worth narrowing down the calls to isCurAtIntentIter() only to shared codepaths of forward and reverse iteration.

pkg/storage/intent_interleaving_iter.go line 1019 at r24 (raw file):

			// Step it backward. It will now point to a key that is before the
			// intent key, or a range key whose start key is colocated with the
			// intent, or be exhausted.

how about
... or be exhausted (the exhausted case is an error case).

erikgrinaker

I don't see any changes in the testdata/intent_interleaving_iter directory. Can you add some test cases there, since we could benefit from focused testing of the intentInterleavingIter changes?

I had a look at these, and it isn't clear to me what benefit we get from these compared to the ones in range_key_iter which use the common TestMVCCHistories infrastructure. The only substantial difference I can see is the stats command, but it's easy enough to automatically dump iterator stats after each test that uses an iterator (or after each command), and that seems useful in any case so I'll add it in. Am I missing anything else?

We already have reasonable (but not exhaustive) test coverage of intents in range_key_iter, which also exercises other iterators, and I was thinking we'd cover the rest with metamorphic MVCC testing later since it's hard to manually construct an exhaustive set of test cases. I've run these under coverage analysis to make sure we're at least hitting all non-error code paths. I'm happy to add in more tests if there are particular areas you feel is lacking though. For example, I see several tests around traversal into the lock table, and around specific error conditions.

Reviewable status: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @5, @8, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)

pkg/storage/intent_interleaving_iter.go line 584 at r24 (raw file):

Previously, sumeerbhola wrote…

nit: isCurAtIntentIterReverse()
I think its worth narrowing down the calls to isCurAtIntentIter() only to shared codepaths of forward and reverse iteration.

Sure. There were only a couple, when switching directions, so not particularly hot -- I figured we could leave them until we optimized away isCurAtIntentIter(), but might as swap them I suppose.

pkg/storage/intent_interleaving_iter.go line 1019 at r24 (raw file):

Previously, sumeerbhola wrote…

how about
... or be exhausted (the exhausted case is an error case).

Is it an error case? It just results in an exhausted iterator, not an error, right? This would happen when calling Prev at an intent adjacent to the lower bound.

erikgrinaker

I added mvcc_histories/range_key_iter_intent_without_provisional_norace with tests for intents around range keys that lack provisional values, since error conditions won't be covered by future metamorphic tests. These revealed that we could, in some cases, emit the wrong range key for an intent without a provisional value. I added some cheap checks to prevent this, and more expensive checks under race.

I think that's sufficient for now. Lock table tests seem well-covered by the existing tests, I don't think range keys should change anything there. Planning to add more thorough testing once we've unblocked higher-level teams. Let me know if you feel like we need anything else.

The only substantial difference I can see is the stats command, but it's easy enough to automatically dump iterator stats after each test that uses an iterator (or after each command), and that seems useful in any case so I'll add it in.

Couldn't do this because TestMVCCHistory is metamorphic over all reader types, and the iterator stats will differ by reader type (possibly due to caching and such). I think this metamorphism is more important than the stats, so dropped it for now.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @1, @2, @5, @8, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)

This patch adds initial experimental primitives for MVCC range keys, which will be the foundation for MVCC range tombstones. They are based on experimental Pebble range keys. * Data structures: * `MVCCRangeKey` * `MVCCRangeKeyValue` * `Engine` methods for mutating range keys: * `ExperimentalClearMVCCRangeKey()` * `ExperimentalClearAllMVCCRangeKeys()` * `ExperimentalPutMVCCRangeKey()` * `SupportsRangeKeys()` * `SimpleMVCCIterator` methods for accessing range keys: * `HasPointAndRange()` * `RangeBounds()` * `RangeKeys()` Range keys do not have a distinct identity, and should instead be considered a key continuum: they will merge with abutting keys of the same value, can be partially cleared, can split or merge along with ranges, and so on. Bounded scans will truncate them to the scan bounds. Only MVCC range tombstones are currently supported, with an empty value. Attempts to write a non-tombstone value will error, since these would need additional logic throughout the MVCC APIs, and their semantics are still unclear. Range key support is implemented in `pebbleIterator` and `intentInterleavingIter`, but not in the rest of the MVCC or KV APIs. They are not persisted to disk either. Subsequent pull requests will extend their functionality and integrate them with other components. Release note: None

erikgrinaker · 2022-06-04T08:20:42Z

Let's do this. Thanks for the great reviews!

bors r=sumeerbhola,jbowens,nicktrav

craig · 2022-06-04T09:17:51Z

Build succeeded:

GitHub CI (Cockroach)

sumeerbhola

Reviewed 1 of 11 files at r24, 1 of 3 files at r25.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale)

pkg/storage/intent_interleaving_iter.go line 1019 at r24 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Is it an error case? It just results in an exhausted iterator, not an error, right? This would happen when calling Prev at an intent adjacent to the lower bound.

you are right.

erikgrinaker requested review from jbowens, aliher1911 and sumeerbhola March 6, 2022 21:38

erikgrinaker requested review from a team as code owners March 6, 2022 21:38

erikgrinaker self-assigned this Mar 6, 2022

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch 2 times, most recently from cf20b98 to fe16c4f Compare March 7, 2022 13:05

erikgrinaker force-pushed the mvcc-range-tombstones branch 5 times, most recently from 822ec79 to 18f9103 Compare March 8, 2022 19:10

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch from fe16c4f to bd45018 Compare March 8, 2022 19:48

erikgrinaker force-pushed the mvcc-range-tombstones branch 2 times, most recently from b1b32ee to a5b506b Compare March 11, 2022 15:27

erikgrinaker force-pushed the mvcc-range-tombstones branch from a5b506b to 1075682 Compare March 11, 2022 16:07

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch from bd45018 to a9e5753 Compare March 11, 2022 16:10

aliher1911 reviewed Mar 11, 2022

View reviewed changes

pkg/storage/mvcc_key.go Show resolved Hide resolved

pkg/storage/mvcc_key.go Outdated Show resolved Hide resolved

sumeerbhola requested changes Mar 11, 2022

View reviewed changes

erikgrinaker commented Mar 12, 2022

View reviewed changes

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch from a9e5753 to c381952 Compare March 12, 2022 17:00

erikgrinaker force-pushed the mvcc-range-tombstones branch 2 times, most recently from 3360f58 to 19a02e0 Compare March 14, 2022 10:29

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch 4 times, most recently from 575c741 to 61265c0 Compare May 30, 2022 15:35

erikgrinaker commented May 30, 2022

View reviewed changes

sumeerbhola requested changes Jun 1, 2022

View reviewed changes

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch from 61265c0 to 73e6083 Compare June 2, 2022 12:54

erikgrinaker commented Jun 2, 2022

View reviewed changes

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch from 73e6083 to b5b1939 Compare June 2, 2022 12:57

sumeerbhola requested changes Jun 2, 2022

View reviewed changes

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch from b5b1939 to 37e6663 Compare June 2, 2022 16:59

erikgrinaker commented Jun 2, 2022

View reviewed changes

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch from 37e6663 to fb576f2 Compare June 2, 2022 17:43

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch from fb576f2 to a6c8a5d Compare June 3, 2022 10:38

sumeerbhola approved these changes Jun 3, 2022

View reviewed changes

erikgrinaker commented Jun 3, 2022

View reviewed changes

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch from a6c8a5d to f32e850 Compare June 3, 2022 14:08

erikgrinaker commented Jun 3, 2022

View reviewed changes

erikgrinaker force-pushed the mvcc-range-tombstones-primitives branch from f32e850 to 44f1ca7 Compare June 3, 2022 18:30

craig bot merged commit 0fdac09 into cockroachdb:master Jun 4, 2022

erikgrinaker deleted the mvcc-range-tombstones-primitives branch June 4, 2022 16:04

sumeerbhola reviewed Jun 6, 2022

View reviewed changes

erikgrinaker mentioned this pull request Jun 8, 2022

storage: optimize intent checks in intentInterleavingIter #82588

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: add experimental MVCC range key primitives #77417

storage: add experimental MVCC range key primitives #77417

erikgrinaker commented Mar 6, 2022 •

edited

Loading

cockroach-teamcity commented Mar 6, 2022

erikgrinaker commented Mar 7, 2022 •

edited

Loading

jbowens commented Mar 7, 2022

erikgrinaker commented Mar 11, 2022 •

edited

Loading

aliher1911 left a comment

jbowens commented Mar 11, 2022 •

edited

Loading

sumeerbhola left a comment

erikgrinaker commented Mar 12, 2022

erikgrinaker left a comment

erikgrinaker left a comment

sumeerbhola left a comment

sumeerbhola left a comment

erikgrinaker left a comment

sumeerbhola left a comment

sumeerbhola left a comment

erikgrinaker left a comment

erikgrinaker commented Jun 3, 2022 •

edited

Loading

erikgrinaker commented Jun 3, 2022 •

edited

Loading

sumeerbhola left a comment

erikgrinaker left a comment

erikgrinaker left a comment

erikgrinaker commented Jun 4, 2022

craig bot commented Jun 4, 2022

sumeerbhola left a comment

storage: add experimental MVCC range key primitives #77417

storage: add experimental MVCC range key primitives #77417

Conversation

erikgrinaker commented Mar 6, 2022 • edited Loading

cockroach-teamcity commented Mar 6, 2022

erikgrinaker commented Mar 7, 2022 • edited Loading

jbowens commented Mar 7, 2022

erikgrinaker commented Mar 11, 2022 • edited Loading

aliher1911 left a comment

Choose a reason for hiding this comment

jbowens commented Mar 11, 2022 • edited Loading

sumeerbhola left a comment

Choose a reason for hiding this comment

erikgrinaker commented Mar 12, 2022

erikgrinaker left a comment

Choose a reason for hiding this comment

erikgrinaker left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

erikgrinaker left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

erikgrinaker left a comment

Choose a reason for hiding this comment

erikgrinaker commented Jun 3, 2022 • edited Loading

erikgrinaker commented Jun 3, 2022 • edited Loading

sumeerbhola left a comment

Choose a reason for hiding this comment

erikgrinaker left a comment

Choose a reason for hiding this comment

erikgrinaker left a comment

Choose a reason for hiding this comment

erikgrinaker commented Jun 4, 2022

craig bot commented Jun 4, 2022

sumeerbhola left a comment

Choose a reason for hiding this comment

erikgrinaker commented Mar 6, 2022 •

edited

Loading

erikgrinaker commented Mar 7, 2022 •

edited

Loading

erikgrinaker commented Mar 11, 2022 •

edited

Loading

jbowens commented Mar 11, 2022 •

edited

Loading

erikgrinaker commented Jun 3, 2022 •

edited

Loading

erikgrinaker commented Jun 3, 2022 •

edited

Loading