-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: add experimental MVCC range key primitives #77417
storage: add experimental MVCC range key primitives #77417
Conversation
cf20b98
to
fe16c4f
Compare
@jbowens I'm getting an odd test failure here. When running Fails here: cockroach/pkg/storage/engine_test.go Line 1806 in fe16c4f
|
@erikgrinaker Thanks for catching that! cockroachdb/pebble#1563 |
822ec79
to
18f9103
Compare
fe16c4f
to
bd45018
Compare
b1b32ee
to
a5b506b
Compare
@jbowens Would like to get your thoughts on a few
|
a5b506b
to
1075682
Compare
bd45018
to
a9e5753
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions about keys while I'm reading more involved bits.
@erikgrinaker Yeah, I see how some of these mechanics can be confusing and unexpected. Let me try my best to explain the current behavior first, and then we can circle back to see whether it would be beneficial to change them. We have a continuous infinite keyspace. The classic Pebble iterator moves along this 1-dimensional keyspace and pauses at point keys only. Point keys are by definition inclusive, and the Iterator's position within the keyspace is conveyed through
With the introduction of range keys, we need to shoehorn continuous spans into our iteration over discrete points. To do that, we say that the Iterator pauses at point keys and at the beginning of new range keys. The beginning of range keys is inclusive, so it ensures that the state that we surface on the
If the start of a range key coincides with a point key, the Iterator still only visits the The above scheme works for scanning across the entire keyspace, but what about seeks? If you
If instead, we
This In your last example, you're observing that the Iterator only pauses at range key starts, never end keys. This is because end keys are exclusive, and the state surfaced on an Some of these examples might make more sense with the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 23 files at r4, 2 of 4 files at r5.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)
pkg/storage/engine.go, line 84 at r5 (raw file):
// key and/or a range key. If Valid() returns true, one of these will be true. // Range keys are only emitted when requested via IterOptions.KeyTypes. HasPointAndRange() (bool, bool)
I know I had said we should move towards the EngineIterator
interface-style of SeekEngineKeyGE(key EngineKey) (valid bool, err error)
. I realize the refactor can be messy and one would rather do it later. However, now we have to do a sequence like
iter.Next()
valid, err := iter.Valid()
if !valid || err != nil {
...
}
hasP, hasR := iter.HasPointAndRange()
// Do the real work
...
Which looks annoying.
I would prefer one of two things:
- HasPointAndRange() (bool, bool, error), so the new code does not need to call Valid.
- Keeping Valid, for now, for the old code. And changing all the step and seek methods to return (bool, bool, error) (which the old code will ignore and the new code will use).
pkg/storage/engine.go, line 86 at r5 (raw file):
HasPointAndRange() (bool, bool) // RangeBounds returns the range bounds for the current range key fragment, if // any. See RangeKeys() for more info on range key fragments.
if any?
So is it legal to call this method if HasPointAndRange said there was no range?
pkg/storage/engine.go, line 110 at r5 (raw file):
// partially removed by GC, and may be truncated by iterator bounds. // // TODO(erikgrinaker): Write a tech note on range keys and link it here.
why do we need to defragment in CockroachDB, now that we have defragmentation built into Pebble?
pkg/storage/engine.go, line 358 at r5 (raw file):
// // NB: range keys are only supported for use with MVCCIterators, but it is // legal to enable them for EngineIterators in order to derive cloned
where are we cloning engine iterators to use as the mvcc iterator inside intentInterleavingIter?
This comment does raise an issue: currently the pebble.Iterators that we reuse in pebbleReadOnly only change their bounds. It seems now that they could also switch in their IterKeyType. But the pebble.IterKeyType is currently fixed when creating the pebble.Iterator. We probably need to change that.
pkg/storage/engine.go, line 376 at r5 (raw file):
// point keys are not surfaced. // // TODO(erikgrinaker): Consider moving this down into Pebble.
can this TODO be done now, since we have pebble.IterKeyType?
pkg/storage/engine.go, line 673 at r5 (raw file):
// have severe limitations including being ignored by all KV and MVCC APIs and // only being stored in memory. ExperimentalClearMVCCRangeKeys(start, end roachpb.Key) error
These two method names differ in "s". Can we use something more forceful like DestroyMVCCRangeKeys, or ClearRangeKeys (since this isn't an MVCC compliant operation)?
pkg/storage/engine.go, line 677 at r5 (raw file):
// ExperimentalPutMVCCRangeKey writes a value to an MVCC range key. It will // replace any existing keys, or any segments that it overlaps. This is // currently only used for range tombstones, which have a value of nil.
If this value is nil, I would prefer not passing the parameter at all.
Between future-proofing now and a goland-aided manual refactoring later, I think the latter is better since it avoids adding unnecessary invariant checks now.
pkg/storage/intent_interleaving_iter.go, line 365 at r5 (raw file):
} // If we find an intent, but we land on a range key covering it, skip to // the point key.
This needs some justification, grounded in what semantics the user is desiring (ideally documented in engine.go).
And I don't understand why we are doing this for SeekIntentGE and not for SeekGE -- the former is just an optimization.
pkg/storage/mvcc.go, line 105 at r5 (raw file):
type MVCCRangeKeyValue struct { Key MVCCRangeKey Value []byte
Is this trying to future proof the API, and Value will always be nil for now?
If yes, please add a comment stating that current invariant. Or better still, let's keep MVCCRangeKeyValue
so we have a struct to add to, and get rid of the Value
field and add a code comment on how this will change in the future.
pkg/storage/mvcc_key.go, line 349 at r5 (raw file):
// equal, or 1 if this is greater. Comparison is by start,timestamp,end, where // larger timestamps sort before smaller ones except empty ones which sort first // (like elsewhere in MVCC).
why do we need this method?
pkg/storage/mvcc_key.go, line 390 at r5 (raw file):
// We allow equal start and end key, since we allow empty spans in many MVCC // APIs (e.g. scans). if k.StartKey.Compare(k.EndKey) > 0 {
This validation function is used on the write path. We should not allow writing of empty spans.
pkg/storage/mvcc_range_key_iterator.go, line 42 at r5 (raw file):
// SimpleMVCCIterator, range keys are defragmented into contiguous deterministic // range keys. It does not support seeking or backtracking, see // MVCCRangeKeyIterOptions for lower/upper bounds and other options.
why do we need to defragment in CockroachDB now that we have completely deterministic iteration semantics in Pebble?
pkg/storage/pebble.go, line 1873 at r5 (raw file):
} reusable := opts.MinTimestampHint.IsEmpty() && opts.KeyTypes == IterKeyTypePointsOnly
I think we need general reuse. I am fine with a TODO for future PRs.
Yeah, I think this is at the root of it. I understand why it's done this way, and it isn't necessarily a problem (I can't think of anything that it'd prevent us from doing). Another possible model to consider would be to surface range keys when they're first encountered. This essentially implies that range keys exist over a range, rather than anchoring it at the start key. The current seekGE semantics already imply this, so there is already an inconsistency in saying that they exist at the start key. Let's say we iterate over this set of keys: [b-f)@3, b@2, d@2, f@2 The current reverse iteration order is symmetrical, which is sort of nice in itself. However, we could also also imagine this reverse iteration order: f@2, [b-f)@3, d@2, b@2 It follows that a seekLT to e would land at [b-f)@3 first, just like seekGE would. This would also eliminate the duplicate emission of the lone range key when changing directions after a seek. However, as you point out, it presents a problem wrt. Just a thought. I don't think we really gain anything from this other than it being slightly more intuitive (at least to me), so it might be a lot of work for little benefit. But I'd be curious to hear what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed @aliher1911 from a discussion.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @1, @2, @aliher1911, @jbowens, and @sumeerbhola)
pkg/storage/engine.go, line 84 at r5 (raw file):
Previously, sumeerbhola wrote…
I know I had said we should move towards the
EngineIterator
interface-style ofSeekEngineKeyGE(key EngineKey) (valid bool, err error)
. I realize the refactor can be messy and one would rather do it later. However, now we have to do a sequence likeiter.Next() valid, err := iter.Valid() if !valid || err != nil { ... } hasP, hasR := iter.HasPointAndRange() // Do the real work ...
Which looks annoying.
I would prefer one of two things:
- HasPointAndRange() (bool, bool, error), so the new code does not need to call Valid.
- Keeping Valid, for now, for the old code. And changing all the step and seek methods to return (bool, bool, error) (which the old code will ignore and the new code will use).
I think I prefer the latter, because it also allows us to surface decoding errors in a natural way. I am a bit worried that it may cause a performance issue in certain cases due to unnecessary decoding. Let's say we're iterating over point keys and are only interested in range keys at some points -- we would still have to decode all range keys during iteration regardless of whether the caller is interested in them in order to surface decoding errors via Next
(we could cache them until the range bounds change, but still). But let's discuss that separately.
I agree that we should do this as part of the range tombstone work. However, we will have to maintain a feature branch for 1-2 months, and a refactoring like this would be really painful to keep rebasing onto master
. Also, we don't want to make large changes on master
until 22.1 RCs are out, to ease backports. So I think we should do this after the feature branch lands in master
.
pkg/storage/engine.go, line 86 at r5 (raw file):
Previously, sumeerbhola wrote…
if any?
So is it legal to call this method if HasPointAndRange said there was no range?
Yes, it returns (nil, nil)
.
pkg/storage/engine.go, line 110 at r5 (raw file):
Previously, sumeerbhola wrote…
why do we need to defragment in CockroachDB, now that we have defragmentation built into Pebble?
These are different kinds/levels of defragmentation. Pebble defragments across physical SST bounds, to enforce the invariant that range key bounds always fall on bare prefixes. CockroachDB defragmentation is to return logical range keys from fragment stacks.
For example, after writing [a-c)@1 and [b-d)@2, Pebble returns [a-b)@1, [b-c)@2, [b-c)@1, [c-d)@2. The CRDB defragmentation returns [a-c)@1 and [b-d)@2.
This will be used e.g. for MVCC statistics, as the above range keys will count as 2 range keys rather than 4. It is also a much more compact representation, which is useful in APIs or data structures (e.g. consistency checker diffs) and tests (e.g. data-driven MVCC history tests). And it can result in fewer write operations, since e.g. during GC we would make two RangeKeyUnset calls rather than four for the above.
pkg/storage/engine.go, line 358 at r5 (raw file):
where are we cloning engine iterators to use as the mvcc iterator inside intentInterleavingIter?
iter = newMVCCIteratorByCloningEngineIter(intentIter, opts) |
This comment does raise an issue: currently the pebble.Iterators that we reuse in pebbleReadOnly only change their bounds. It seems now that they could also switch in their IterKeyType. But the pebble.IterKeyType is currently fixed when creating the pebble.Iterator. We probably need to change that.
Yeah, so currently we avoid that by only reusing these iterators for IterKeyTypePointsOnly
. New iterators will be created on each NewMVCCIterator()
call with other key types. Once we port pebbleMVCCScanner
to respect range tombstones we would likely change this to reuse iterators with IterKeyTypePointsAndRanges
instead, since that will be the most common iterator key type.
It would be better if we could allow changing the key type on the fly. I don't know what that would entail in terms of labour and the performance penalty of setting up/tearing down the range key processing. Wdyt @jbowens?
pkg/storage/engine.go, line 376 at r5 (raw file):
Previously, sumeerbhola wrote…
can this TODO be done now, since we have pebble.IterKeyType?
No, Pebble does not offer IterKeyTypePointsWithRanges
, only the other options. When we discussed this previously you had a preference for implementing this in the MVCC layer rather than in Pebble, but I'd obviously prefer to push it down to Pebble and avoid the complexity here. No practical difference though, either works for me.
pkg/storage/engine.go, line 673 at r5 (raw file):
Previously, sumeerbhola wrote…
These two method names differ in "s". Can we use something more forceful like DestroyMVCCRangeKeys, or ClearRangeKeys (since this isn't an MVCC compliant operation)?
Sure. I'd prefer ClearRangeKeys
, since this operation isn't really more destructive than the other one (they both clear range keys, it's just a matter of which ones).
pkg/storage/engine.go, line 677 at r5 (raw file):
Previously, sumeerbhola wrote…
If this value is nil, I would prefer not passing the parameter at all.
Between future-proofing now and a goland-aided manual refactoring later, I think the latter is better since it avoids adding unnecessary invariant checks now.
Yeah, I've been a bit uncertain about this. I was planning to fully implement support for non-nil values in MVCC, e.g. stats, GC, consistency checks, etc (where non-nil is considered live data), since I think the marginal cost of doing it now is significantly lower than if we have to page all this stuff back in later. But there's definitely a risk that it'll be wasted work if we end up either not using it or having to change the semantics later.
I could go either way here. I'm partial to doing it just because I want to "finish the job properly", but it has no bearing on MVCC range tombstones themselves. If people feel that it's a waste of time then I'm fine with removing the value handling for now.
pkg/storage/intent_interleaving_iter.go, line 365 at r5 (raw file):
Previously, sumeerbhola wrote…
This needs some justification, grounded in what semantics the user is desiring (ideally documented in engine.go).
And I don't understand why we are doing this for SeekIntentGE and not for SeekGE -- the former is just an optimization.
SeekGE
takes a versioned MVCCKey
, but SeekIntentGE
takes an unversioned roachpb.Key
. If we don't do this, then SeekIntentGE
would never be able to seek to an intent with an overlapping range key, it would always land on the range key first. I thought the desired semantics here were to land on an intent that we expect to exist?
pkg/storage/mvcc.go, line 105 at r5 (raw file):
Previously, sumeerbhola wrote…
Is this trying to future proof the API, and Value will always be nil for now?
If yes, please add a comment stating that current invariant. Or better still, let's keepMVCCRangeKeyValue
so we have a struct to add to, and get rid of theValue
field and add a code comment on how this will change in the future.
See separate discussion above. If we're dropping valued range keys then we should drop MVCCRangeKeyValue
too for now and just return MVCCRangeKey
instead.
pkg/storage/mvcc_key.go, line 349 at r5 (raw file):
Previously, sumeerbhola wrote…
why do we need this method?
This was pulled in from #76880, where it's needed to generate diffs during replica consistency checks. And I also feel like the ordering semantics of range keys need to be well-defined.
pkg/storage/mvcc_key.go, line 390 at r5 (raw file):
Previously, sumeerbhola wrote…
This validation function is used on the write path. We should not allow writing of empty spans.
Yeah, I think we'll need to differentiate read/write validation -- there are other conditions that come into play too.
pkg/storage/mvcc_range_key_iterator.go, line 42 at r5 (raw file):
Previously, sumeerbhola wrote…
why do we need to defragment in CockroachDB now that we have completely deterministic iteration semantics in Pebble?
This was answered in a different thread.
pkg/storage/pebble.go, line 1873 at r5 (raw file):
Previously, sumeerbhola wrote…
I think we need general reuse. I am fine with a TODO for future PRs.
I agree, we should consolidate this somehow.
a9e5753
to
c381952
Compare
3360f58
to
19a02e0
Compare
575c741
to
61265c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @jbowens, and @sumeerbhola)
pkg/storage/engine.go
line 338 at r17 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Good call, done.
I realized that we no longer need this now that we have SetOptions()
: we can always create the intent iterator with IterKeyTypePointsOnly
, and reconfigure the cloned MVCC iterator to enable range keys. This also avoids any associated range key overhead for the intent iterator.
Updated the intentInterleavingIter
with this, and also added checks in NewEngineIterator()
. We could consider separating the options structs for EngineIterator and MVCCIterator to make this clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 7 files at r21.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)
pkg/storage/engine.go
line 75 at r17 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I moved all of this to a top-level comment, but kept a note on
NextKey()
because there is an additional subtlety here that isn't obvious and doesn't follow from the usual iteration semantics.Consider
a@4
and[a-c)@3
.NextKey()
will first land on[a-c)@3
thena@4
. It isn't given that this is the correct or desired behavior. We could instead skip overa@4
(becausea
was emitted by[a-c)@3
), or we could step ontoa@4
and emit them both together.I feel like the latter option is more intuitive, and probably more useful. However, it would require an additional peek ahead to look for a point key at the start key, and a step back if we didn't find one. So I'm not sure if it's worth the cost. Wdyt?
I think NextKey()
means the user wants the next roachpb.Key, so skipping over a@4
sounds right. Additional peeking seems too complicated, for what was designed mainly as an optimization. If that first key was only a range key, then the caller that wants to see a point can do Next()
instead of NextKey()
. Given the smallish set of NextKey()
callers, I think we should be able to change their code to follow this semantics.
pkg/storage/engine.go
line 46 at r21 (raw file):
// noted. SimpleMVCCIterator is a subset of the functionality offered by // MVCCIterator. //
nice comment
pkg/storage/intent_interleaving_iter.go
line 946 at r17 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Done.
Still there
pkg/storage/intent_interleaving_iter.go
line 1005 at r17 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
With the "new" semantics of intentCmp (relative position of iterators wrt. provisional value), this logic is correct. We need to call this before stepping off the intent, and intentCmp > 0 implies that we're currently on the intent. But see comment above, happy to explore alternatives.
By tracking i.hasPoint and i.hasRange, we won't need to speculate and call maybeSkipIntentRangeKey
(which incurs a key comparison). We will know that we need to do i.Prev
if i.intentCmp==0
and we are positioned at the intent.
pkg/storage/intent_interleaving_iter.go
line 115 at r21 (raw file):
// is at an intent, and iter is at the corresponding provisional value, cmp // will be 0. If iter is on a bare range key with the same key position // (start key or SeekGE key) as intentIter then intentCmp is 1 if iterKey has
nit: s/1/+1/
pkg/storage/intent_interleaving_iter.go
line 117 at r21 (raw file):
// (start key or SeekGE key) as intentIter then intentCmp is 1 if iterKey has // no timestamp and -1 otherwise, to correctly position it for reverse // iteration. See the longer struct-level comment for more on the relative
I worry about cheating here by claiming intentCmp=+1
when it is actually 0.
I think it is better to track hasPoint
, hasRange
as member variables and consider the two different cases corresponding to intentCmp==0
and dir=-1
hasPoint
: positioned at a versioned value and Prev will step iter.!hasPoint
: hasRange must be true. Positioned at the intent. Prev will step both iter and intentIter.
I've added some other comments that follow this same logic.
However, this doesn't actually work because we can have !hasPoint && hasRange for synthetic keys that have a timestamp, due to SeekGE.
But we can still make this work:
iterAtNonTimestampedRange = iterKey.Timestamp.IsEmpty() && hasRange && !hasPoint
The previous logic for intentCmp==0
and dir=-1
is
!iterAtNonTimestampedRange
: positioned at a versioned value (may be a range or point) and Prev will step iter.iterAtNonTimestampedRange
: Positioned at the intent. Prev will step both iter and intentIter.
I don't think we need any other special casing for the case where we have encountered a timestamped range key. SeekGE must have positioned the intentIter at a later intent. The only thing that seems to require care is if someone calls Prev immediately after SeekGE. I tried to construct a couple of examples using SeekGE(a@10), where the range key also starts at a, and I think they work fine.
Example1:
Intents at a, b, Versions at a@12, a@9.
After the Seek, the intentIter is at b and iter is at a@10. Since the interleaving iter is not positioned at the intent, the call to Prev will first step the intentIter back to a. And then proceed as if we were already doing backward iteration. Since !iterAtNonTimestampedRange, iter will be stepped back to a@12 and become visible to the user. If the user does Prev again, iter will land on a, at which point iterAtNonTimestampedRange becomes true and the intent will become visible.
Exampe 2:
Intents at a, b. Version at a@9.
After the Seek, the intentIter is at b and iter is at a@10. Since the interleaving iter is not positioned at the intent, a call to Prev will first step the intentIter back to a. And then proceed as if we were already doing backward iteration. Since !iterAtNonTimestampedRange, iter will be stepped back to a. So iterAtNonTimestampedRange becomes true and the intent will become visible to the user.
pkg/storage/intent_interleaving_iter.go
line 792 at r21 (raw file):
// - intentCmp <= 0. Returns false. // - intentCmp > 0. Returns true. return (i.dir > 0 && i.intentCmp <= 0) || (i.dir < 0 && i.intentCmp > 0)
this would become
return (i.dir > 0 && i.intentCmp <= 0) || (i.dir < 0 && (i.intentCmp > 0 || (i.intentCmp == 0 && !i.hasPoint))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 21 files at r17, 2 of 7 files at r21.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)
pkg/storage/pebble_iterator.go
line 396 at r21 (raw file):
return } hasPoint, _ := p.HasPointAndRange()
if we keep the semantics to be that the user wants to step to the next roachpb.Key (its their problem to figure out when to use Next and when to use NextKey), then I think this method doesn't need any change.
61265c0
to
73e6083
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @jbowens, and @sumeerbhola)
pkg/storage/engine.go
line 75 at r17 (raw file):
Previously, sumeerbhola wrote…
I think
NextKey()
means the user wants the next roachpb.Key, so skipping overa@4
sounds right. Additional peeking seems too complicated, for what was designed mainly as an optimization. If that first key was only a range key, then the caller that wants to see a point can doNext()
instead ofNextKey()
. Given the smallish set ofNextKey()
callers, I think we should be able to change their code to follow this semantics.
Sure. -- that's logically consistent at least, and I think the callers can manage. Done.
pkg/storage/intent_interleaving_iter.go
line 946 at r17 (raw file):
Previously, sumeerbhola wrote…
Still there
Ah, yeah, I think this was because it triggered the assertion below it -- if we're switching directions after a SeekGE call lands inside a bare range key, then the Prev() call here will land on the start key, which can have intentCmp == 0 if intentIter is colocated with it. We can drop it by calling computePos()
instead of setting intentCmp = +1
, so I've done that. The rest should come out correct.
pkg/storage/intent_interleaving_iter.go
line 1005 at r17 (raw file):
Previously, sumeerbhola wrote…
By tracking i.hasPoint and i.hasRange, we won't need to speculate and call
maybeSkipIntentRangeKey
(which incurs a key comparison). We will know that we need to doi.Prev
ifi.intentCmp==0
and we are positioned at the intent.
Yeah, we now only need to do this at this single site in the reverse direction, with computePos() already called, so I've made maybeSkipIntentRangeKey() only work in the forward direction and inlined the Prev() check.
pkg/storage/intent_interleaving_iter.go
line 115 at r21 (raw file):
Previously, sumeerbhola wrote…
nit: s/1/+1/
Reverted the changes to this comment.
pkg/storage/intent_interleaving_iter.go
line 117 at r21 (raw file):
Previously, sumeerbhola wrote…
I worry about cheating here by claiming
intentCmp=+1
when it is actually 0.
I think it is better to trackhasPoint
,hasRange
as member variables and consider the two different cases corresponding tointentCmp==0
anddir=-1
hasPoint
: positioned at a versioned value and Prev will step iter.!hasPoint
: hasRange must be true. Positioned at the intent. Prev will step both iter and intentIter.I've added some other comments that follow this same logic.
However, this doesn't actually work because we can have !hasPoint && hasRange for synthetic keys that have a timestamp, due to SeekGE.
But we can still make this work:
iterAtNonTimestampedRange = iterKey.Timestamp.IsEmpty() && hasRange && !hasPoint
The previous logic forintentCmp==0
anddir=-1
is
!iterAtNonTimestampedRange
: positioned at a versioned value (may be a range or point) and Prev will step iter.iterAtNonTimestampedRange
: Positioned at the intent. Prev will step both iter and intentIter.I don't think we need any other special casing for the case where we have encountered a timestamped range key. SeekGE must have positioned the intentIter at a later intent. The only thing that seems to require care is if someone calls Prev immediately after SeekGE. I tried to construct a couple of examples using SeekGE(a@10), where the range key also starts at a, and I think they work fine.
Example1:
Intents at a, b, Versions at a@12, a@9.
After the Seek, the intentIter is at b and iter is at a@10. Since the interleaving iter is not positioned at the intent, the call to Prev will first step the intentIter back to a. And then proceed as if we were already doing backward iteration. Since !iterAtNonTimestampedRange, iter will be stepped back to a@12 and become visible to the user. If the user does Prev again, iter will land on a, at which point iterAtNonTimestampedRange becomes true and the intent will become visible.Exampe 2:
Intents at a, b. Version at a@9.
After the Seek, the intentIter is at b and iter is at a@10. Since the interleaving iter is not positioned at the intent, a call to Prev will first step the intentIter back to a. And then proceed as if we were already doing backward iteration. Since !iterAtNonTimestampedRange, iter will be stepped back to a. So iterAtNonTimestampedRange becomes true and the intent will become visible to the user.
Yep, that'll do. Updated, but bikeshedded it as iterBareRangeAtIntent
which assumes intentCmp == 0
.
pkg/storage/intent_interleaving_iter.go
line 792 at r21 (raw file):
Previously, sumeerbhola wrote…
this would become
return (i.dir > 0 && i.intentCmp <= 0) || (i.dir < 0 && (i.intentCmp > 0 || (i.intentCmp == 0 && !i.hasPoint))
Done.
pkg/storage/pebble_iterator.go
line 396 at r21 (raw file):
Previously, sumeerbhola wrote…
if we keep the semantics to be that the user wants to step to the next roachpb.Key (its their problem to figure out when to use Next and when to use NextKey), then I think this method doesn't need any change.
It still needs to step off the bare range key after the seek below. Otherwise, with [a-f)@3
and b@2
,b@1
,c@2
then calling NextKey()
at b@2
would land on [a-f)@3
above c@2
due to the seek, even though that key was already emitted at a
. But we don't need the condition you're pointing to here.
73e6083
to
b5b1939
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 4 files at r22.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)
pkg/storage/intent_interleaving_iter.go
line 334 at r22 (raw file):
// // In the reverse direction, this is only needed once during Prev(), and the // caller can check i.iterBareRangeAtIntent instead.
This "reverse direction" comment is confusing since this method is not called for reverse iteration.
I would suggest moving it to a more appropriate location.
Also, can you add
// REQUIRES: i.dir > 0
pkg/storage/intent_interleaving_iter.go
line 612 at r22 (raw file):
if isCurAtIntent { // iter precedes the intentIter, so must be at the lowest version of the // preceding key or exhausted. So step it forward. It will now point to
... or the bare range at intent or exhausted.
pkg/storage/intent_interleaving_iter.go
line 631 at r22 (raw file):
} if util.RaceEnabled { cmp := i.intentKey.Compare(i.iterKey.Key)
can you also add hasRange && !hasPoint
as an error condition.
pkg/storage/intent_interleaving_iter.go
line 874 at r22 (raw file):
// // hasRange → i.iterValid // i.isCurAtIntentIter() && i.dir < 0 → i.intentCmp > 0
=> i.intentCmp > 0 || (i.intentCmp==0 && i.iterBareRangeAtIntent)
pkg/storage/intent_interleaving_iter.go
line 878 at r22 (raw file):
// TODO(erikgrinaker): consider optimizing this comparison. if hasRange && i.dir < 0 { hasRange = i.iter.RangeBounds().EndKey.Compare(i.intentKey) > 0
hasRange = i.intentCmp == 0 || i.iter.RangeBounds().EndKey.Compare(i.intentKey) > 0
can occasionally avoid a key comparison.
pkg/storage/intent_interleaving_iter.go
line 1035 at r22 (raw file):
// The iterator is positioned at an intent in intentIter, and iter is // exhausted, positioned at a versioned value of a preceding key, or // positioned on an intent colocated with the start of a range key.
should this be saying the following since this part of the sentence is about iter:
...or positioned on the start of a range key colocated with the intent.
pkg/storage/intent_interleaving_iter.go
line 1047 at r22 (raw file):
return } if i.iterBareRangeAtIntent {
I think we need to hoist this if i.iterBareRangeAtIntent
block before the preceding code since we need i.iter positioned properly before calling i.makeLowerLimitKey
. The *WithLimit
code is a bit subtle in terms of sometimes ignoring the limit, which may be why this didn't fail a test. Try something like the following in a debugger:
a@10, b@10, [b, c)@5 and intents at a, b, i.e.., and see if calling intentInterleavingIter.Prev
when the iterator is positioned at the intent b, causes the current code to position iter at a@10 and the intentIter to be at IterAtLimit (instead of correctly being at a), which would make intentCmp = -1. Then the next call to intentInterleavingIter.Prev
will exhaust iter so the if i.intentIterState == pebble.IterAtLimit && i.iterValid
will be false and we will never expose the intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)
pkg/storage/pebble_iterator.go
line 406 at r22 (raw file):
// either a point key or range key starting past the seek key. if hasPoint, hasRange := p.HasPointAndRange(); !hasPoint && hasRange { if p.RangeBounds().Key.Compare(p.keyBuf) <= 0 {
What if we had point keys a@10, a@5, b@10 and range keys [a,a\0)@5, [a\0,b)@8. And we seeked while being positioned at a, so SeekGE(a\0). That would land at [a\0,b)@8, which does not extend into a, so has not been emitted. And then by doing Next we will end up at b@10, so never show the range key.
b5b1939
to
37e6663
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @5, @8, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)
pkg/storage/intent_interleaving_iter.go
line 334 at r22 (raw file):
Previously, sumeerbhola wrote…
This "reverse direction" comment is confusing since this method is not called for reverse iteration.
I would suggest moving it to a more appropriate location.
Also, can you add
// REQUIRES: i.dir > 0
Ah, forgot to remove that paragraph. Done.
pkg/storage/intent_interleaving_iter.go
line 612 at r22 (raw file):
Previously, sumeerbhola wrote…
... or the bare range at intent or exhausted.
Done.
pkg/storage/intent_interleaving_iter.go
line 631 at r22 (raw file):
Previously, sumeerbhola wrote…
can you also add
hasRange && !hasPoint
as an error condition.
Done.
pkg/storage/intent_interleaving_iter.go
line 874 at r22 (raw file):
Previously, sumeerbhola wrote…
=> i.intentCmp > 0 || (i.intentCmp==0 && i.iterBareRangeAtIntent)
Done.
pkg/storage/intent_interleaving_iter.go
line 878 at r22 (raw file):
Previously, sumeerbhola wrote…
hasRange = i.intentCmp == 0 || i.iter.RangeBounds().EndKey.Compare(i.intentKey) > 0
can occasionally avoid a key comparison.
Good catch, thanks. Done.
pkg/storage/intent_interleaving_iter.go
line 1035 at r22 (raw file):
Previously, sumeerbhola wrote…
should this be saying the following since this part of the sentence is about iter:
...or positioned on the start of a range key colocated with the intent.
Yes, it should, thanks.
pkg/storage/intent_interleaving_iter.go
line 1047 at r22 (raw file):
Previously, sumeerbhola wrote…
I think we need to hoist this
if i.iterBareRangeAtIntent
block before the preceding code since we need i.iter positioned properly before callingi.makeLowerLimitKey
. The*WithLimit
code is a bit subtle in terms of sometimes ignoring the limit, which may be why this didn't fail a test. Try something like the following in a debugger:
a@10, b@10, [b, c)@5 and intents at a, b, i.e.., and see if callingintentInterleavingIter.Prev
when the iterator is positioned at the intent b, causes the current code to position iter at a@10 and the intentIter to be at IterAtLimit (instead of correctly being at a), which would make intentCmp = -1. Then the next call tointentInterleavingIter.Prev
will exhaust iter so theif i.intentIterState == pebble.IterAtLimit && i.iterValid
will be false and we will never expose the intent.
Gah, didn't pay enough attention to the limits, thanks. Your example is almost there, but the a@10
key saves us since we reposition intentIter when we land on it. Intents on a,b are sufficient by themselves.
The existing tests didn't catch this because there needs to be a bare intent next to an iterator bound. Added a couple of regression tests for this.
pkg/storage/pebble_iterator.go
line 406 at r22 (raw file):
Previously, sumeerbhola wrote…
What if we had point keys a@10, a@5, b@10 and range keys [a,a\0)@5, [a\0,b)@8. And we seeked while being positioned at a, so SeekGE(a\0). That would land at [a\0,b)@8, which does not extend into a, so has not been emitted. And then by doing Next we will end up at b@10, so never show the range key.
Great catch! Added a fix and regression test. I'm going to do another pass over everything with fresh eyes tomorrow, to check for any further stuff I might have missed.
37e6663
to
fb576f2
Compare
Ran some more benchmarks, which are in line with the previous ones. We see a slight regression, but it isn't dramatic, and good enough that we can merge this to
|
This seems to be caused by additional
For now, I've added specialized forward/reverse methods that are called in the Next/Prev hot paths. This improves things somewhat compared to the previous results:
I also added a |
fb576f2
to
a6c8a5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any changes in the testdata/intent_interleaving_iter directory. Can you add some test cases there, since we could benefit from focused testing of the intentInterleavingIter changes? Don't worry about the existing tests that also have interleaved intents -- its on my TODO list to cleanup.
Reviewed 1 of 4 files at r23, 3 of 11 files at r24.
Reviewable status:complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @5, @8, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)
pkg/storage/intent_interleaving_iter.go
line 584 at r24 (raw file):
if i.dir < 0 { // Switching from reverse to forward iteration. isCurAtIntent := i.isCurAtIntentIter()
nit: isCurAtIntentIterReverse()
I think its worth narrowing down the calls to isCurAtIntentIter() only to shared codepaths of forward and reverse iteration.
pkg/storage/intent_interleaving_iter.go
line 1019 at r24 (raw file):
// Step it backward. It will now point to a key that is before the // intent key, or a range key whose start key is colocated with the // intent, or be exhausted.
how about
... or be exhausted (the exhausted case is an error case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any changes in the testdata/intent_interleaving_iter directory. Can you add some test cases there, since we could benefit from focused testing of the intentInterleavingIter changes?
I had a look at these, and it isn't clear to me what benefit we get from these compared to the ones in range_key_iter
which use the common TestMVCCHistories
infrastructure. The only substantial difference I can see is the stats
command, but it's easy enough to automatically dump iterator stats after each test that uses an iterator (or after each command), and that seems useful in any case so I'll add it in. Am I missing anything else?
We already have reasonable (but not exhaustive) test coverage of intents in range_key_iter
, which also exercises other iterators, and I was thinking we'd cover the rest with metamorphic MVCC testing later since it's hard to manually construct an exhaustive set of test cases. I've run these under coverage analysis to make sure we're at least hitting all non-error code paths. I'm happy to add in more tests if there are particular areas you feel is lacking though. For example, I see several tests around traversal into the lock table, and around specific error conditions.
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @1, @2, @5, @8, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)
pkg/storage/intent_interleaving_iter.go
line 584 at r24 (raw file):
Previously, sumeerbhola wrote…
nit: isCurAtIntentIterReverse()
I think its worth narrowing down the calls to isCurAtIntentIter() only to shared codepaths of forward and reverse iteration.
Sure. There were only a couple, when switching directions, so not particularly hot -- I figured we could leave them until we optimized away isCurAtIntentIter()
, but might as swap them I suppose.
pkg/storage/intent_interleaving_iter.go
line 1019 at r24 (raw file):
Previously, sumeerbhola wrote…
how about
... or be exhausted (the exhausted case is an error case).
Is it an error case? It just results in an exhausted iterator, not an error, right? This would happen when calling Prev at an intent adjacent to the lower bound.
a6c8a5d
to
f32e850
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added mvcc_histories/range_key_iter_intent_without_provisional_norace
with tests for intents around range keys that lack provisional values, since error conditions won't be covered by future metamorphic tests. These revealed that we could, in some cases, emit the wrong range key for an intent without a provisional value. I added some cheap checks to prevent this, and more expensive checks under race.
I think that's sufficient for now. Lock table tests seem well-covered by the existing tests, I don't think range keys should change anything there. Planning to add more thorough testing once we've unblocked higher-level teams. Let me know if you feel like we need anything else.
The only substantial difference I can see is the stats command, but it's easy enough to automatically dump iterator stats after each test that uses an iterator (or after each command), and that seems useful in any case so I'll add it in.
Couldn't do this because TestMVCCHistory
is metamorphic over all reader types, and the iterator stats will differ by reader type (possibly due to caching and such). I think this metamorphism is more important than the stats, so dropped it for now.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @1, @2, @5, @8, @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)
This patch adds initial experimental primitives for MVCC range keys, which will be the foundation for MVCC range tombstones. They are based on experimental Pebble range keys. * Data structures: * `MVCCRangeKey` * `MVCCRangeKeyValue` * `Engine` methods for mutating range keys: * `ExperimentalClearMVCCRangeKey()` * `ExperimentalClearAllMVCCRangeKeys()` * `ExperimentalPutMVCCRangeKey()` * `SupportsRangeKeys()` * `SimpleMVCCIterator` methods for accessing range keys: * `HasPointAndRange()` * `RangeBounds()` * `RangeKeys()` Range keys do not have a distinct identity, and should instead be considered a key continuum: they will merge with abutting keys of the same value, can be partially cleared, can split or merge along with ranges, and so on. Bounded scans will truncate them to the scan bounds. Only MVCC range tombstones are currently supported, with an empty value. Attempts to write a non-tombstone value will error, since these would need additional logic throughout the MVCC APIs, and their semantics are still unclear. Range key support is implemented in `pebbleIterator` and `intentInterleavingIter`, but not in the rest of the MVCC or KV APIs. They are not persisted to disk either. Subsequent pull requests will extend their functionality and integrate them with other components. Release note: None
f32e850
to
44f1ca7
Compare
Let's do this. Thanks for the great reviews! bors r=sumeerbhola,jbowens,nicktrav |
Build succeeded: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 11 files at r24, 1 of 3 files at r25.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 3 stale)
pkg/storage/intent_interleaving_iter.go
line 1019 at r24 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Is it an error case? It just results in an exhausted iterator, not an error, right? This would happen when calling Prev at an intent adjacent to the lower bound.
you are right.
This patch adds initial experimental primitives for MVCC range keys,
which will be the foundation for MVCC range tombstones. They are based
on experimental Pebble range keys.
Data structures:
MVCCRangeKey
MVCCRangeKeyValue
Engine
methods for mutating range keys:ExperimentalClearMVCCRangeKey()
ExperimentalClearAllMVCCRangeKeys()
ExperimentalPutMVCCRangeKey()
SupportsRangeKeys()
SimpleMVCCIterator
methods for accessing range keys:HasPointAndRange()
RangeBounds()
RangeKeys()
Range keys do not have a distinct identity, and should instead be
considered a key continuum: they will merge with abutting keys of the
same value, can be partially cleared, can split or merge along with
ranges, and so on. Bounded scans will truncate them to the scan bounds.
Only MVCC range tombstones are currently supported, with an empty value.
Attempts to write a non-tombstone value will error, since these would
need additional logic throughout the MVCC APIs, and their semantics are
still unclear.
Range key support is implemented in
pebbleIterator
andintentInterleavingIter
, but not in the rest of the MVCC or KV APIs.They are not persisted to disk either. Subsequent pull requests will
extend their functionality and integrate them with other components.
Touches #70412.
Release note: None