
perf: separate latest and old MVCC versions for faster reads #1170

Closed
sumeerbhola opened this issue Jun 23, 2021 · 16 comments

sumeerbhola commented Jun 23, 2021

This issue is different from #847 since this one is specifically about how Pebble is used in CockroachDB to store MVCC data. When there are many MVCC versions for a key, reads slow down because (a) the block cache is less effective, since most of each cached block is data that is not needed, and (b) scans must seek from one versioned key to the next to avoid iterating over all the versions.

A prerequisite to doing anything here is for Pebble to gain some understanding of MVCC timestamps. The same prerequisite applies to compaction-time GC (cockroachdb/cockroach#57260). One can also argue that compaction-time GC would make separating the latest and old MVCC versions less important.

If we ignore provisional values, a rough approach would be to put “live” and “dead” versions of data into separate files. This should not be hard when the multiple versions are in the same sstable -- when generating that sstable, instead of writing one sstable we would write a pair, the “live” and the “dead” sstable. These will have overlapping keys and will participate in compactions as a pair. A read at the “current time” can ignore the “dead” sstable. But what is “current time”, given that we don’t have TrueTime like Spanner? One possibility is to track the newest timestamp in the “live” sstable -- if the txn timestamp is newer than this newest timestamp, there is no possibility that the txn will need anything from the “dead” sstable. This should work for sstables in the lower levels, which hold most of the data.
This approach will need adjustment to handle provisional values and intents.
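
As a minimal sketch of that read-time check (names are illustrative, not Pebble APIs), assuming each “live” sstable tracks the newest MVCC timestamp it contains:

// canSkipDeadSSTable reports whether a read at txnTS can ignore the "dead"
// sstable of a live/dead pair. maxLiveTS is the newest MVCC timestamp in
// the paired "live" sstable (a hypothetical per-sstable property).
func canSkipDeadSSTable(txnTS, maxLiveTS uint64) bool {
    // If the txn reads at a timestamp newer than every key in the live
    // sstable, any version it could observe is the newest version of its
    // key, which by construction lives in the "live" sstable.
    return txnTS > maxLiveTS
}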

sumeerbhola commented

There is some historical discussion in cockroachdb/cockroach#17229 (comment) of an approach that explicitly changes the key for older versions. That approach has a more explicit performance tradeoff, since it must rewrite the existing key-value pair, which makes it less tenable.


jbowens commented Oct 29, 2021

@sumeerbhola — I had a shower thought regarding how to differentiate between provisional and finalized keys. It's complicated, and hopefully it can be simplified. Food for thought:

At the beginning of a compaction, Pebble needs to decide which keys are final and which might be provisional. It takes the smallest-largest key range of the compaction and queries a CockroachDB-provided callback ClosedTimestamp([start, end]) for the current closed timestamp ct for that key range. This value ct guarantees that every key within the range [start, end] with timestamp < ct is either committed or shadowed by a tombstone somewhere in the LSM.

But those tombstones may be somewhere above the compaction, meaning the compaction may still contain provisional keys at timestamps <ct. To differentiate those keys, we introduce coordination from all batches that remove or revise provisional values.

history-rewrite ranges

When a batch is going to alter history by deleting or modifying provisional values, the caller flips a bit on the batch indicating so. The min and max MVCC timestamps written by the batch form the batch's 'history-rewrite range'. Every sstable and memtable has a 'history-rewrite range' that encodes a start and end MVCC timestamp. When the batch commits, its 'history-rewrite range' is merged with the memtable's, widening the range to encompass the existing range and the new batch's range.
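
A sketch of the widening merge, treating MVCC timestamps as plain integers (types and names are hypothetical):

// rewriteRange is a closed interval [min, max] of MVCC timestamps touched
// by history-rewriting batches.
type rewriteRange struct {
    min, max uint64
}

// widen returns the smallest interval covering both r and other. On batch
// commit, the memtable's range becomes memtableRange.widen(batchRange).
func (r rewriteRange) widen(other rewriteRange) rewriteRange {
    if other.min < r.min {
        r.min = other.min
    }
    if other.max > r.max {
        r.max = other.max
    }
    return r
}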

When a memtable is flushed, the outputted sstables inherit the memtable's history-rewrite range. Some of the sstables may not contain any of the history-rewriting keys, but they still inherit the whole range. [Maybe this could be relaxed to be the intersection of the sstable's MVCC key range and the memtable's history-rewrite range?]

When sstables are merged in a compaction, the compaction decides an output's history-rewrite range by merging the history-rewrite ranges of the compaction inputs that overlap with it. This isn't necessarily all the inputs to the compaction, just the inputs that might have contributed keys to the individual output file. The ranges are merged by widening the range to encompass all the inputted history-rewrite ranges. When a compaction outputs a table to L6, it elides the history-rewrite range property.

compaction closed ranges

Now, when a compaction starts it needs to decide which keys are final and which are provisional. It takes the input sstable boundaries of all its inputs and fragments them. For each fragment's bounds, it calls ClosedTimestamp([start, end]). It associates with the fragment the range [0, ClosedTimestamp], representing the assumption that all input keys that fall within the fragment and have timestamps in [0, ClosedTimestamp] are final:

    [   input sstable k   [a-c]     ]
               [              input sstable l   [b-d]              ]

        i0              i1                          i2
    [          |                     |                             ]
    |  [0,t6]  |      [0,t9]         |          [0,t9]             |
    a          b                     c                             d

Now, it ascends the LSM level-by-level looking at files and memtables that overlap the compaction's keyspace. It uses these overlapping files' ranges to reduce which timestamp ranges are considered closed. For every overlapping higher-levelled file or memtable, it subtracts out the file or memtable's history-rewrite range from the overlapping input ranges. This yields a range, or sequence of ranges, in which keys are known to be finalized within each of these intervals of the compaction.

L0 table o [                                            ] (rewrite [t10,t12])
L1 table n [                ]                             (rewrite [t8,t10])
L2 table m                                  [ ]           (rewrite [t5,t5])

L3 compaction:
    [   input sstable k   [a-c]     ]
               [              input sstable l   [b-d]              ]

        i0              i1                          i2
    [          |                     |                             ]
    |  [0,t6]  |      [0,t7]         |        [0,t4],[t6,t9]       |
    a          b                     c                             d

The compaction can consider final any keys that fall within interval i0 and have timestamps within [0, t6], and so on.
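
A sketch of the subtraction step, treating timestamps as discrete integers so that subtracting [t5,t5] from [0,t9] yields [0,t4] and [t6,t9] (types are hypothetical):

// closedRange is an interval [lo, hi] of timestamps whose keys are known
// to be final within some compaction input interval.
type closedRange struct {
    lo, hi uint64
}

// subtractRewrite removes the history-rewrite interval [rmin, rmax] from
// each closed range, splitting a range in two when the rewrite interval
// falls strictly inside it.
func subtractRewrite(closed []closedRange, rmin, rmax uint64) []closedRange {
    var out []closedRange
    for _, c := range closed {
        if rmax < c.lo || rmin > c.hi {
            out = append(out, c) // no overlap; keep as-is
            continue
        }
        if rmin > c.lo {
            out = append(out, closedRange{c.lo, rmin - 1})
        }
        if rmax < c.hi {
            out = append(out, closedRange{rmax + 1, c.hi})
        }
    }
    return out
}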

minimum closed timestamps

A long-running transaction in one part of the keyspace doesn't prevent keys from being considered closed in another part... until the long-running transaction gets aborted. When the history-rewriting batch is written to the memtable with the range [t_long_ago, t_recent], it prevents any keys overlapping the memtable and within that wide time range from being considered closed, even if they were considered closed in earlier compactions.

In the above example, a history-rewriting batch in a disjoint part of the keyspace could commit a rewrite batch with a rewrite range of [t1,t9]. If i0, i1, or i2 overlap the memtable, they're forced to only consider keys within [0,t1) to be closed.

To avoid this, every sstable also has a 'minimum closed timestamp' property. This represents a floor on the file's closed timestamp. It guarantees any of the file's keys with timestamps less than the minimum closed timestamp are closed.

The value is derived from the compaction input intervals. When an sstable is outputted, its key range overlaps some set of the fragmented compaction input intervals. The minimum closed timestamp holds the minimum upper bound t such that [0, t] is contained within all the overlapping input intervals' closed timestamp ranges. In the above example, an output that spans i0 and i1 would have a minimum closed timestamp of min(t6, t7) = t6.

When computing closed ranges for a compaction, each compaction input interval is assigned a minimum closed timestamp equal to the minimum of all the overlapping input sstables' minimum closed timestamps.

    [   input sstable k   [a-c]     ]                                mct t3
               [              input sstable l   [b-d]              ] mct t7

        i0              i1                          i2
    [          |                     |                             ]
    | mct = t3 |     mct = t3        |             mct = t7        |
    a          b                     c                             d

When traversing the LSM and subtracting out overlapping files' history-rewrite ranges, the interval's minimum closed timestamp prevents the removal of the range below the minimum closed timestamp:

L0 table o [                                            ] (rewrite [t10,t12])
L1 table n [                ]                             (rewrite [t8,t10])
L2 table m                                  [ ]           (rewrite [t5,t5])

L3 compaction:
    [   input sstable k   [a-c]     ]                                mct t3
               [              input sstable l   [b-d]              ] mct t7

        i0              i1                          i2
    [          |                     |                             ]
    | mct = t3 |     mct = t3        |             mct = t7        |
    |  [0,t6]  |      [0,t7]         |              [0,t9]         |
    a          b                     c                             d

In the above example, interval i2's minimum closed timestamp of t7 prevents the L2 table m's rewrite range of [t5,t5] from reducing the size of its closed timestamp range.

The minimum closed timestamp serves as a compact way of carrying forward knowledge of keys' finality from compaction to compaction.
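
Continuing the sketch above, the floor can be applied by clamping each history-rewrite range before subtracting it:

// subtractWithFloor applies a history-rewrite range but never removes any
// part of [0, mct), since keys below the minimum closed timestamp are
// guaranteed closed. E.g. table m's rewrite [t5,t5] is a no-op for an
// interval with mct = t7.
func subtractWithFloor(closed []closedRange, rmin, rmax, mct uint64) []closedRange {
    if rmax < mct {
        return closed // the entire rewrite range is below the floor
    }
    if rmin < mct {
        rmin = mct // clamp so [0, mct) stays closed
    }
    return subtractRewrite(closed, rmin, rmax)
}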

sumeerbhola commented

Thanks for the writeup @jbowens

Another idea that came up in the discussion in cockroachdb/cockroach#72121 (comment):

Consider sstables with separate sections for keys and values. The simplest way to do this is to write:

'0' + k0 => 0
'0' + k1 => 1
...
'1' + 0 => v0
'1' + 1 => v1
...

which ensures that the normal sstable sorting takes care of the ordering. It does require buffering the values in memory (as compressed ssblocks) since they can't be written out immediately. Alternatively, one could reverse the order and put values before keys, in case the buffering required for keys is smaller.

The downside of this scheme is that reading a particular ki=>vi requires loading 2 ssblocks. So instead of doing this for all keys, say we did this for only older versions of keys in the sstable. Something like:

'0' + k0@10 => v0
'0' + k0@9 => index 0
'0' + k0@8 => index 1
'0' + k1@50 => v1
'0' + k1@40 => index 2
...
'1' + 0 => <value>
'1' + 1 => <value>
'1' + 2 => <value>

All the keys are visible as usual and the values can be lazily loaded if there is indirection. If we get unlucky and k0@10 has been deleted at a higher level, say because the transaction was aborted, there is no correctness issue since we can load the value for k0@9. Assuming such aborts (or pushes causing rewrites) are rare, we will mostly not need to do this. Scans will typically want to step from k0@10 to k1@50, and won't need to pay the expense of reading the older values of k0. And even if the key lengths are not significantly smaller than value lengths, we will get prefix compression on the different versions of the same key.
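
A sketch of the read-side indirection, with an invented one-byte tag distinguishing inline values from indexes into the value section (the tags, encoding, and loadFromValueSection accessor are assumptions for illustration, not an actual format):

import "encoding/binary"

const (
    tagInline   byte = 0x00 // hypothetical: value stored inline after the tag
    tagIndirect byte = 0x01 // hypothetical: value lives in the value section
)

// lazyValue returns the value for a decoded entry, touching the value
// section only when the entry is an indirection. loadFromValueSection is a
// hypothetical accessor that reads (and decompresses) the referenced value
// block on demand.
func lazyValue(entry []byte, loadFromValueSection func(index uint64) ([]byte, error)) ([]byte, error) {
    if entry[0] == tagInline {
        return entry[1:], nil
    }
    index, _ := binary.Uvarint(entry[1:]) // hypothetical varint-encoded index
    return loadFromValueSection(index)
}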


jbowens commented Nov 2, 2021

@sumeerbhola Nice, that's cool.

I'd briefly entertained storing values out of band when thinking about range keys because of the duplication incurred from fragmentation. I didn't entertain it for long since range keys are rare. It makes a lot of sense in this case.

And even if the key lengths are not significantly smaller than value lengths, we will get prefix compression on the different versions of the same key.

Now I'm wondering whether you could make use of the prefix compression to avoid the extra key comparisons too. Something like: the caller calls NextPrefix(prefixLength int), passing len(mvccKey.Key) if it read a non-provisional key at Iterator.Key(), and the iterator at the top of the heap skips over any keys for which shared > prefixLength.
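
A sketch of that skip within a block's restart interval; blockEntry is a simplified stand-in for a prefix-compressed sstable entry, where shared counts the bytes carried over from the previous key:

type blockEntry struct {
    shared   int    // bytes shared with the previous key
    unshared []byte // remaining key suffix
}

// skipSamePrefix returns the index of the first entry that may start a new
// key prefix. Entries sharing more than prefixLength bytes with their
// predecessor must have the same MVCC key prefix and can be skipped
// without any key comparison.
func skipSamePrefix(entries []blockEntry, prefixLength int) int {
    i := 0
    for ; i < len(entries); i++ {
        if entries[i].shared <= prefixLength {
            break // possibly a new prefix; fall back to a real comparison
        }
    }
    return i
}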

sumeerbhola commented

@jbowens nice idea for avoiding the key comparison -- it seems very doable.

Adding an offline comment by @nvanbenschoten on how queue-like workloads could benefit from this:

It’s interesting to consider what an unoptimized queue-like workload would look like in this scheme. Because the latest version for most keys would be a deletion tombstone, the “key” section would consist almost exclusively of keys -> tombstone and shadowed keys -> indirect_value_ptr. So we would never need to load the values of the deleted queue entries, and the cost of a scan would be independent of the value size of the deleted queue entries. I don’t have a mental model for the cost of a scan of tombstones (one that collects no live versions), but I imagine that this would provide a sizable speedup, especially for a workload that generates large queue entries.

sumeerbhola commented

@jbowens pointed out that this organization may also help during GC of older versions.

We do two scans for GC:

  • Using gcIterator, where we look at both the keys and values in order to identify what can be GC'd. Reading the values from the value section will be slower in this case. However, it seems we don't typically need the value -- the only place we actually use it is when it is an intent. We should be able to change the code to not load MVCC values.
  • Performing the GC: MVCCGarbageCollect only needs the key and the size of the value (for updating MVCCStats). So the key section should store the value size when there is an indirection, so that we can avoid loading the value.

A different issue: not all keys with a non-empty suffix are true CockroachDB MVCC keys. Lock table keys are the exception, where the ordering on the suffix is random (contains the txn UUID). One easy way to accommodate this would be for the Pebble DB Options to allow specification of key span exclusion(s) for where this separation of versions is not applicable -- we would use it to exclude [LockTableSingleKeyStart, LockTableSingleKeyEnd).
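
A sketch of what such an exclusion option might look like as a Pebble Options field (all names here are hypothetical):

// KeySpan is a half-open key span [Start, End).
type KeySpan struct {
    Start, End []byte
}

type Options struct {
    // ...
    // VersionSeparationExcludedSpans lists key spans whose older versions
    // are never separated; CockroachDB would pass
    // [LockTableSingleKeyStart, LockTableSingleKeyEnd).
    VersionSeparationExcludedSpans []KeySpan
}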


jbowens commented Nov 3, 2021

For the case where there's a lot of garbage like the naive queue implementation, could we use a block property collector, my favorite new Pebble feature?

We could write a block property collector that records the prefix cardinality of a block/index-block/sstable. If a block/index-block/sstable has cardinality 1, a NextPrefix call could skip the rest of the block/index-block/sstable at the top of the mergingIter heap.
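
A sketch of such a collector, using a simplified stand-in for Pebble's block property collector interface (the real interface differs; names here are illustrative):

import (
    "bytes"
    "encoding/binary"
)

// prefixCardinalityCollector counts distinct key prefixes per data block.
type prefixCardinalityCollector struct {
    count      uint64
    lastPrefix []byte
}

// Add is called for each key; split returns the length of the key's
// prefix, mirroring the Comparer's Split function.
func (c *prefixCardinalityCollector) Add(key []byte, split func([]byte) int) {
    prefix := key[:split(key)]
    if !bytes.Equal(prefix, c.lastPrefix) {
        c.count++
        c.lastPrefix = append(c.lastPrefix[:0], prefix...)
    }
}

// FinishDataBlock encodes the block's prefix cardinality and resets state.
// A reader-side filter that sees cardinality 1 can answer NextPrefix by
// skipping the rest of the block.
func (c *prefixCardinalityCollector) FinishDataBlock() []byte {
    prop := binary.AppendUvarint(nil, c.count)
    c.count, c.lastPrefix = 0, c.lastPrefix[:0]
    return prop
}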


jbowens commented Nov 19, 2021

If we did compaction-time GC, does the encoding of a key still need to be stable once it's committed? I'm imagining modifying committed MVCC keys to set a bit indicating that they are committed.

This could be done through a new special range key that associates a key span (always of whole prefixes) with both a closed timestamp and a GC threshold timestamp. These range keys descend the LSM and respect the LSM invariant with respect to keys older than their closed timestamp. In a compaction with one of these range keys, a user-defined merger is invoked for every point key within the range key's span.

type Options struct {
    // ...
    RangeKeyMerger func() RangeKeyMerger
}

type RangeKeyMerger interface {
    InitRangeKey(rangeValue []byte) error
    MergePointKey(key, value []byte) (newKey, newValue []byte, drop bool)
}
  • If the point key's timestamp is greater than the range key's closed timestamp, the point key is output unmodified.
  • If the point key's timestamp is less than the range key's GC threshold timestamp and the merger has already seen another key with the same prefix, it returns (nil, nil, true /* drop */) to drop the value.
  • If the point key's timestamp is less than the range key's closed timestamp, it returns the key with the closed bit set (or the key unchanged if the bit is already set). A sketch follows this list.
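
A sketch of a merger implementing the three rules above; mvccTimestamp, mvccPrefix, and setClosedBit are hypothetical helpers over the engine's key encoding:

import "bytes"

type mvccGCMerger struct {
    closedTS, gcThresholdTS uint64
    lastPrefix              []byte
}

func (m *mvccGCMerger) MergePointKey(key, value []byte) (newKey, newValue []byte, drop bool) {
    ts := mvccTimestamp(key)  // hypothetical: decode the key's MVCC timestamp
    prefix := mvccPrefix(key) // hypothetical: the key minus its version suffix
    samePrefix := bytes.Equal(prefix, m.lastPrefix)
    m.lastPrefix = append(m.lastPrefix[:0], prefix...)

    switch {
    case ts > m.closedTS:
        return key, value, false // rule 1: possibly provisional; pass through
    case ts < m.gcThresholdTS && samePrefix:
        return nil, nil, true // rule 2: shadowed, GC-able garbage
    default:
        return setClosedBit(key), value, false // rule 3: mark committed
    }
}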

MVCCStats would be updated when these range keys are written. The range keys would be elided out of L6.

With this scheme:

  • An 'mvcc garbage' block property collector can examine keys and determine when a block is all committed garbage. It can set the block's annotation to the highest timestamp of any of its keys. A read at a timestamp t can use a block property filter that skips over any blocks with garbage timestamps < t.
  • Garbage compaction happens at compaction-time, through these range keys.

One obstacle that immediately comes to mind is that the above scheme changes a key's encoding once it's merged with one of these range keys. That would be a problem with today's GC scheme, but performing compaction-time GC sidesteps that. I'm probably unaware of other problems with unstable mvcc keys.

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 6, 2022
When WriterOptions.ValueBlocksAreEnabled is set to true, older
versions of a key are written to a sequence of value blocks, and
the key contains a valueHandle which is a tuple
(valueLen, blockNum, offsetInBlock). The assumption here is
that most reads only need the value of the latest version, and
many reads that care about an older version only need the value
length.

Value blocks are a simple sequence of (varint encoded length,
value bytes) tuples such that given the uncompressed value
block, the valueHandle can cheaply read the value. The value
blocks index connects the blockNum in the valueHandle to the
location of the value block. It uses fixed width encoding to
avoid the expense of a general purpose key-value block.

See the comment at the top of value_block.go for details.

The following are preliminary results from a read benchmark,
after some performance tuning. The old numbers are master.
The needValue=false cases are the ones where value blocks are
expected to help.
- The versions=1 runs have no values in value blocks, and the slowdown
  is the extra call to valueBlockReader, which needs to subslice
  to remove the single byte prefix.
- The hasCache=false cases correspond to a cold cache, where there
  will be additional wasted decompression of values that we don't
  need (when needValue=false). As expected, when there is an
  improvement, it is larger with hasCache=false: for example, the
  -97.83% below (almost 50x faster) compared with -79.89%.
- The needValue=true cases are where the code can be up to 2x slower.
  The higher slowdowns occur when the value size is smaller: in
  such cases more inline values can be packed into an ssblock, and
  the code overhead of decoding the valueHandle and the value
  length in the value block (all of these are varints) becomes
  a significant component.

This is a prototype in that there are no changes to the
InternalIterator interface, and the read path only works for
singleLevelIterator.

name                                                                        old time/op    new time/op    delta
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=false-16        25.5ns ± 3%    25.9ns ± 2%   +1.50%  (p=0.028 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=true-16         15.6ns ± 1%    15.5ns ± 2%     ~     (p=0.268 n=9+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=false-16         27.3ns ± 3%    29.5ns ± 3%   +8.11%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=true-16          17.1ns ± 2%    19.2ns ± 2%  +12.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=false-16       26.7ns ± 2%    29.4ns ± 2%  +10.46%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=true-16        15.9ns ± 2%    15.2ns ± 3%   -4.63%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=false-16        26.7ns ± 2%    53.0ns ± 4%  +98.79%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=true-16         16.6ns ± 1%    26.7ns ± 2%  +61.05%  (p=0.000 n=9+9)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=false-16      28.3ns ± 4%    25.3ns ± 5%  -10.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=true-16       15.8ns ± 2%    14.9ns ± 1%   -5.66%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=false-16       29.4ns ± 4%    47.8ns ± 3%  +62.46%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=true-16        16.7ns ± 4%    26.1ns ± 3%  +56.04%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=false-16        123ns ± 4%     125ns ± 7%     ~     (p=0.735 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=true-16        23.0ns ± 5%    22.9ns ± 5%     ~     (p=0.684 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=false-16         124ns ± 6%     131ns ± 7%   +5.76%  (p=0.008 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=true-16         24.3ns ± 4%    26.4ns ± 3%   +8.26%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=false-16       130ns ± 8%      27ns ± 4%  -79.10%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=true-16       23.8ns ± 4%    16.6ns ± 2%  -30.00%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=false-16        128ns ± 9%     164ns ±12%  +27.94%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=true-16        25.0ns ± 4%    33.0ns ± 2%  +32.22%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=false-16      123ns ± 9%      28ns ± 3%  -76.89%  (p=0.000 n=9+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=true-16      23.0ns ± 2%    15.3ns ± 5%  -33.36%  (p=0.000 n=10+9)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=false-16       132ns ± 2%     171ns ± 5%  +29.24%  (p=0.000 n=8+10)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=true-16       24.3ns ± 3%    32.6ns ± 3%  +33.98%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=false-16      1.45µs ± 8%    1.35µs ±10%   -6.41%  (p=0.015 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=true-16       75.5ns ± 2%    76.7ns ± 5%     ~     (p=0.218 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=false-16       1.34µs ± 3%    1.46µs ±16%   +9.03%  (p=0.022 n=9+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=true-16        77.0ns ± 3%    79.9ns ± 3%   +3.80%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=false-16     1.46µs ± 6%    0.13µs ± 3%  -91.15%  (p=0.000 n=9+9)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=true-16      76.4ns ± 3%    21.4ns ± 2%  -72.06%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=false-16      1.47µs ± 8%    1.56µs ± 7%   +5.72%  (p=0.013 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=true-16       78.1ns ± 4%    76.1ns ± 2%   -2.52%  (p=0.009 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=false-16    1.34µs ± 5%    0.03µs ± 2%  -97.83%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=true-16     77.0ns ± 2%    15.5ns ± 2%  -79.89%  (p=0.000 n=8+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=false-16     1.42µs ± 9%    1.49µs ± 2%   +5.28%  (p=0.007 n=10+9)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=true-16      78.5ns ± 4%    73.0ns ± 4%   -7.01%  (p=0.000 n=10+9)

Informs cockroachdb#1170
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 7, 2022
(Same commit message as the Jan 6 commit above, with the following addition.)

To put these numbers in context of what fraction of bytes in an sstable
are in regular data blocks (the rest are in value blocks):

value-size:  100,versions:  1,data-block-bytes:  1185521,total-bytes:  1186984,fraction-in-data-blocks:1.00
value-size:  100,versions: 10,data-block-bytes:   174822,total-bytes:  1085218,fraction-in-data-blocks:0.16
value-size:  100,versions:100,data-block-bytes:    82118,total-bytes:  1083393,fraction-in-data-blocks:0.08
value-size: 1000,versions:  1,data-block-bytes: 10197788,total-bytes: 10203513,fraction-in-data-blocks:1.00
value-size: 1000,versions: 10,data-block-bytes:  1199435,total-bytes: 10222244,fraction-in-data-blocks:0.12
value-size: 1000,versions:100,data-block-bytes:   169931,total-bytes: 10094442,fraction-in-data-blocks:0.02
value-size:10000,versions:  1,data-block-bytes:100246534,total-bytes:100292713,fraction-in-data-blocks:1.00
value-size:10000,versions: 10,data-block-bytes: 10204053,total-bytes:100256921,fraction-in-data-blocks:0.10
value-size:10000,versions:100,data-block-bytes:  1198846,total-bytes:100252245,fraction-in-data-blocks:0.01

Informs cockroachdb#1170

jbowens commented Jun 29, 2022

I've been thinking about this issue in the context of MVCCDeleteRange. A span that includes an MVCCDeleteRange will suffer a performance cliff, because an iterator seeking into the deleted span needs to construct a range-key iterator stack and perform extra key comparisons during interleaving.

If an MVCCDeleteRange falls into L6, we could output it and the point keys it covers into a garbage-only L7. Iterators reading at recent timestamps could skip the entire file. This particular case is easier than the general garbage case, since there are no provisional MVCCDeleteRanges.


erikgrinaker commented Jun 30, 2022

there are no provisional MVCCDeleteRanges

It's possible that these will exist in the future (along with ranged intents), if SQL wants to start making use of them. But there are no short-term plans for this.

Also, Pebble can't know whether a range key is an MVCCDeleteRange or some other ranged key type, so CRDB will have to inform Pebble about these semantics. Although I suppose any kind of MVCC range key would imply that points below it are garbage.

We discussed something related over in cockroachdb/cockroach#57260 (comment), where CRDB can lay down a new kind of range key during GC that it can use in compaction filters to do MVCC GC during Pebble compactions.


jbowens commented Jun 30, 2022

Also, Pebble can't know whether a range key is an MVCCDeleteRange or some other ranged key type, so CRDB will have to inform Pebble about these semantics.

Yeah, I think we should do this regardless when we introduce any other ranged key type, so that we can explicitly avoid applying range-key masking using non-MVCCDeleteRange range keys.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Oct 6, 2022
The GC scan to identify garbage does not need the value
of MVCC data, and only needs the length and whether the
value is a tombstone. This change switches it to use
ReplicaMVCCDataIterator.MVCCValueLenAndIsTombstone which
passes through to the underlying MVCCIterator.

Informs cockroachdb/pebble#1170

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Oct 10, 2022
89522: gc: switch gc to use MVCCValueLenAndIsTombstone r=nvanbenschoten,erikgrinaker a=sumeerbhola

The GC scan to identify garbage does not need the value of MVCC data, and only needs the length and whether the value is a tombstone. This change switches it to use ReplicaMVCCDataIterator.MVCCValueLenAndIsTombstone which passes through to the underlying MVCCIterator.

Informs cockroachdb/pebble#1170

Release note: None


jbowens commented Jan 18, 2023

@sumeerbhola — are you treating this issue as the primary issue tracking value blocks? I put this under In Progress and assigned you, but feel free to unassign and move it if not.

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 24, 2023
BenchmarkIteratorScanManyVersions, which iterates over all the versions,
now also runs with readValue=true. Similarly, for
BenchmarkIteratorScanNextPrefix, which iterates over the most recent
versions.

Informs cockroachdb#1170
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 24, 2023
BenchmarkIteratorScanNextPrefix now exercises value blocks. Also, the
benchmark now does an iterator step per benchmark iteration instead of a
full scan -- this makes the benchmark faster and allows us to compare
performance metrics across benchmark parameters (since the metrics do not
vary based on the total number of different key prefixes in the LSM).

There is a tiny tweak to the field ordering in LazyFetcher and additional
code comments about performance and future optimization.

Informs cockroachdb#1170
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 24, 2023
BenchmarkIteratorScanManyVersions, which iterates over all the versions,
now also runs with readValue=true. Similarly, for
BenchmarkIteratorScanNextPrefix, which iterates over the most recent
versions.

Informs cockroachdb#1170
sumeerbhola added a commit that referenced this issue Jan 24, 2023
BenchmarkIteratorScanManyVersions, which iterates over all the versions,
now also runs with readValue=true. Similarly, for
BenchmarkIteratorScanNextPrefix, which iterates over the most recent
versions.

Informs #1170
sumeerbhola added a commit that referenced this issue Jan 24, 2023
BenchmarkIteratorScanNextPrefix now exercises value blocks. Also, the
benchmark now does an iterator step per benchmark iteration instead of a
full scan -- this makes the benchmark faster and allows us to compare
performance metrics across benchmark parameters (since the metrics do not
vary based on the total number of different key prefixes in the LSM).

There is a tiny tweak to the field ordering in LazyFetcher and additional
code comments about performance and future optimization.

Informs #1170
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 29, 2023
sumeerbhola added a commit that referenced this issue Jan 31, 2023
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 15, 2023
execinfrapb.ScanStats and execstats.ScanStats have these new fields.
The intention here is for traces to show these stats and not to
complete the plumbing to expose these in the fingerprint stats table.

Informs cockroachdb/pebble#1170

Epic: CRDB-20378

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Feb 15, 2023

97182: storage,sql: add separated value iteration stats to ScanStats r=ericharmeling a=sumeerbhola

execinfrapb.ScanStats and execstats.ScanStats have these new fields. The intention here is for traces to show these stats and not to complete the plumbing to expose these in the fingerprint stats table.

Informs cockroachdb/pebble#1170

Epic: CRDB-20378

Release note: None

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Feb 15, 2023
sumeerbhola added a commit that referenced this issue Feb 15, 2023
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Feb 20, 2023
sumeerbhola commented

The remaining work is to produce a tech note describing the design.


jbowens commented May 16, 2023

Something that didn't click for me until I was talking with @itsbilal just now: the setHasSameKeyPrefix bit can also be used in the progressive restore case when restoring from backup sstables with revision history. During those restores, we need to ignore all the keys in the backup sstable that are not the first key with their prefix. When transforming backup sstables' blocks for sstables created with TableFormatv3, we can just skip any keys with the bit set.
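
A sketch of that transform-time skip, over hypothetical iterator and writer interfaces (hasSameKeyPrefix stands in for reading the bit from the block encoding):

type blockIter interface {
    Next() bool
    hasSameKeyPrefix() bool // hypothetical: the per-key bit described above
    Key() []byte
    Value() []byte
    Error() error
}

type sstWriter interface {
    Set(key, value []byte) error
}

// writeFirstVersionsOnly copies only the first key of each prefix from a
// backup sstable into the restored sstable, dropping the older versions.
func writeFirstVersionsOnly(in blockIter, out sstWriter) error {
    for in.Next() {
        if in.hasSameKeyPrefix() {
            continue // not the first key with this prefix; skip it
        }
        if err := out.Set(in.Key(), in.Value()); err != nil {
            return err
        }
    }
    return in.Error()
}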


sumeerbhola commented May 16, 2023

During those restores, we need to ignore all the keys in the backup sstable that are not the first key with the prefix.

Are we restoring to a point in time, so it could be an arbitrary version, and not necessarily the most recent one? The optimization does seem worthwhile.


jbowens commented May 17, 2023

Yeah, I think it's possible to restore to a particular point-in-time that might be behind the most recent version of a key. I suppose at transform time, we could use the block properties (do we write these to backups today?) to determine whether the restore timestamp is higher than all the keys within the block.

I think it'll be relatively uncommon to need to restore to a non-most-recent version, because only the full backup will be restored "lazily" and require transformation.
