
perf: separate latest and old MVCC versions for faster reads #1170

Closed
sumeerbhola opened this issue Jun 23, 2021 · 16 comments

sumeerbhola commented Jun 23, 2021

This issue is different from #847 since this one is specifically about how Pebble is used in CockroachDB to store MVCC data. When there are many MVCC versions for a key, reads slow down because (a) the block cache is less effective, since most of each cached block is data that is not needed, and (b) scans must seek from one versioned key to the next to avoid iterating over all the versions.

A prerequisite to doing anything here is for Pebble to gain some understanding of MVCC timestamps. The same prerequisite applies to compaction-time GC (cockroachdb/cockroach#57260). One can also argue that compaction-time GC would make separating the latest and old MVCC versions less important.

If we ignore provisional values, a rough approach would be to put “live” and “dead” versions of data into separate files. This should not be hard when the multiple versions are in the same sstable -- when generating that sstable, instead of writing one sstable we would write a pair, the “live” and the “dead” sstable. These will have overlapping keys and will participate in compactions as a pair. A read at the “current time” can ignore the “dead” sstable. But what is “current time”, given that we don’t have TrueTime like Spanner? One possibility is to track the newest timestamp in the “live” sstable -- if the txn timestamp is newer than this newest timestamp, there is no possibility that the txn will need anything from the “dead” sstable. This should work for sstables in the lower levels, which hold most of the data.
This approach will need adjustment to handle provisional values and intents.
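
As a minimal sketch of that read-time check (names are illustrative, not Pebble APIs), assuming each “live” sstable tracks the newest MVCC timestamp it contains:

// canSkipDeadSSTable reports whether a read at txnTS can ignore the "dead"
// sstable of a live/dead pair. maxLiveTS is the newest MVCC timestamp in
// the paired "live" sstable (a hypothetical per-sstable property).
func canSkipDeadSSTable(txnTS, maxLiveTS uint64) bool {
    // If the txn reads at a timestamp newer than every key in the live
    // sstable, any version it could observe is the newest version of its
    // key, which by construction lives in the "live" sstable.
    return txnTS > maxLiveTS
}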

sumeerbhola commented

There is some historical discussion in cockroachdb/cockroach#17229 (comment) of an approach that explicitly changes the key for older versions. That approach has a more explicit performance tradeoff, since it must rewrite the existing key-value pair, which makes it less tenable.


jbowens commented Oct 29, 2021

@sumeerbhola — I had a shower thought regarding how to differentiate between provisional and finalized keys. It's complicated, and hopefully it can be simplified. Food for thought:

At the beginning of a compaction, Pebble needs to decide which keys are final and which might be provisional. It takes the smallest-largest key range of the compaction and queries a CockroachDB-provided callback ClosedTimestamp([start, end]) for the current closed timestamp ct for that key range. This value ct guarantees that every key within the range [start, end] with timestamp < ct is either committed or shadowed by a tombstone somewhere in the LSM.

But those tombstones may be somewhere above the compaction, meaning the compaction may still contain provisional keys at timestamps <ct. To differentiate those keys, we introduce coordination from all batches that remove or revise provisional values.

history-rewrite ranges

When a batch is going to alter history by deleting or modifying provisional values, the caller flips a bit on the batch indicating so. The min and max MVCC timestamps written by the batch form the batch's 'history-rewrite range'. Every sstable and memtable has a 'history-rewrite range' that encodes a start and end MVCC timestamp. When the batch commits, its 'history-rewrite range' is merged with the memtable's, widening the range to encompass the existing range and the new batch's range.
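
A sketch of the widening merge, treating MVCC timestamps as plain integers (types and names are hypothetical):

// rewriteRange is a closed interval [min, max] of MVCC timestamps touched
// by history-rewriting batches.
type rewriteRange struct {
    min, max uint64
}

// widen returns the smallest interval covering both r and other. On batch
// commit, the memtable's range becomes memtableRange.widen(batchRange).
func (r rewriteRange) widen(other rewriteRange) rewriteRange {
    if other.min < r.min {
        r.min = other.min
    }
    if other.max > r.max {
        r.max = other.max
    }
    return r
}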

When a memtable is flushed, the outputted sstables inherit the memtable's history-rewrite range. Some of the sstables may not contain any of the history-rewriting keys, but they still inherit the whole range. [Maybe this could be relaxed to be the intersection of the sstable's MVCC key range and the memtable's history-rewrite range?]

When sstables are merged in a compaction, the compaction decides an output's history-rewrite range by merging the history-rewrite ranges of the compaction inputs that overlap with it. This isn't necessarily all the inputs to the compaction, just the inputs that might have contributed keys to the individual output file. The ranges are merged by widening the range to encompass all the inputted history-rewrite ranges. When a compaction outputs a table to L6, it elides the history-rewrite range property.

compaction closed ranges

Now, when a compaction starts it needs to decide which keys are final and which are provisional. It takes the input sstable boundaries of all its inputs and fragments them. For each fragment's bounds, it calls ClosedTimestamp([start, end]). It associates with the fragment the range [0, ClosedTimestamp], representing the assumption that all input keys that fall within the fragment and have timestamps in [0, ClosedTimestamp] are final:

    [   input sstable k   [a-c]     ]
               [              input sstable l   [b-d]              ]

        i0              i1                          i2
    [          |                     |                             ]
    |  [0,t6]  |      [0,t9]         |          [0,t9]             |
    a          b                     c                             d

Now, it ascends the LSM level-by-level looking at files and memtables that overlap the compaction's keyspace. It uses these overlapping files' ranges to reduce which timestamp ranges are considered closed. For every overlapping higher-levelled file or memtable, it subtracts out the file or memtable's history-rewrite range from the overlapping input ranges. This yields a range, or sequence of ranges, in which keys are known to be finalized within each of these intervals of the compaction.

L0 table o [                                            ] (rewrite [t10,t12])
L1 table n [                ]                             (rewrite [t8,t10])
L2 table m                                  [ ]           (rewrite [t5,t5])

L3 compaction:
    [   input sstable k   [a-c]     ]
               [              input sstable l   [b-d]              ]

        i0              i1                          i2
    [          |                     |                             ]
    |  [0,t6]  |      [0,t7]         |        [0,t4],[t6,t9]       |
    a          b                     c                             d

The compaction can consider final any keys that fall within interval i0 and have timestamps within [0, t6], and so on.
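
A sketch of the subtraction step, treating timestamps as discrete integers so that subtracting [t5,t5] from [0,t9] yields [0,t4] and [t6,t9] (types are hypothetical):

// closedRange is an interval [lo, hi] of timestamps whose keys are known
// to be final within some compaction input interval.
type closedRange struct {
    lo, hi uint64
}

// subtractRewrite removes the history-rewrite interval [rmin, rmax] from
// each closed range, splitting a range in two when the rewrite interval
// falls strictly inside it.
func subtractRewrite(closed []closedRange, rmin, rmax uint64) []closedRange {
    var out []closedRange
    for _, c := range closed {
        if rmax < c.lo || rmin > c.hi {
            out = append(out, c) // no overlap; keep as-is
            continue
        }
        if rmin > c.lo {
            out = append(out, closedRange{c.lo, rmin - 1})
        }
        if rmax < c.hi {
            out = append(out, closedRange{rmax + 1, c.hi})
        }
    }
    return out
}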

minimum closed timestamps

A long-running transaction in one part of the keyspace doesn't prevent keys from being considered closed in another part... until the long-running transaction gets aborted. When the history-rewriting batch is written to the memtable with the range [t_long_ago, t_recent], it prevents any keys overlapping the memtable and within that wide time range from being considered closed, even if they were considered closed in earlier compactions.

In the above example, a history-rewriting batch in a disjoint part of the keyspace could commit a rewrite batch with a rewrite range of [t1,t9]. If i0, i1, or i2 overlap the memtable, they're forced to only consider keys within [0,t1) to be closed.

To avoid this, every sstable also has a 'minimum closed timestamp' property. This represents a floor on the file's closed timestamp. It guarantees any of the file's keys with timestamps less than the minimum closed timestamp are closed.

The value is derived from the compaction input intervals. When an sstable is outputted, its key range overlaps some set of the fragmented compaction input intervals. The minimum closed timestamp holds the minimum upper bound t such that [0, t] is contained within all the overlapping input intervals' closed timestamp ranges. In the above example, an output that spans i0 and i1 would have a minimum closed timestamp of min(t6, t7) = t6.

When computing closed ranges for a compaction, each compaction input interval is assigned a minimum closed timestamp equal to the minimum of all the overlapping input sstables' minimum closed timestamps.

    [   input sstable k   [a-c]     ]                                mct t3
               [              input sstable l   [b-d]              ] mct t7

        i0              i1                          i2
    [          |                     |                             ]
    | mct = t3 |     mct = t3        |             mct = t7        |
    a          b                     c                             d

When traversing the LSM and subtracting out overlapping files' history-rewrite ranges, the interval's minimum closed timestamp prevents the removal of the range below the minimum closed timestamp:

L0 table o [                                            ] (rewrite [t10,t12])
L1 table n [                ]                             (rewrite [t8,t10])
L2 table m                                  [ ]           (rewrite [t5,t5])

L3 compaction:
    [   input sstable k   [a-c]     ]                                mct t3
               [              input sstable l   [b-d]              ] mct t7

        i0              i1                          i2
    [          |                     |                             ]
    | mct = t3 |     mct = t3        |             mct = t7        |
    |  [0,t6]  |      [0,t7]         |              [0,t9]         |
    a          b                     c                             d

In the above example, interval i2's minimum closed timestamp of t7 prevents the L2 table m's rewrite range of [t5,t5] from reducing the size of its closed timestamp range.

The minimum closed timestamp serves as a compact way of carrying forward knowledge of keys' finality from compaction to compaction.
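
Continuing the sketch above, the floor can be applied by clamping each history-rewrite range before subtracting it:

// subtractWithFloor applies a history-rewrite range but never removes any
// part of [0, mct), since keys below the minimum closed timestamp are
// guaranteed closed. E.g. table m's rewrite [t5,t5] is a no-op for an
// interval with mct = t7.
func subtractWithFloor(closed []closedRange, rmin, rmax, mct uint64) []closedRange {
    if rmax < mct {
        return closed // the entire rewrite range is below the floor
    }
    if rmin < mct {
        rmin = mct // clamp so [0, mct) stays closed
    }
    return subtractRewrite(closed, rmin, rmax)
}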

sumeerbhola commented

Thanks for the writeup @jbowens

Another idea that came up in the discussion in cockroachdb/cockroach#72121 (comment):

Consider sstables with separate sections for keys and values. The simplest way to do this is to write:

'0' + k0 => 0
'0' + k1 => 1
...
'1' + 0 => v0
'1' + 1 => v1
...

which ensures that the normal sstable sorting takes care of the ordering. It does require buffering the values in memory (as compressed ssblocks) since they can't be written out immediately. Alternatively, one could reverse the order and put values before keys, in case the buffering required for keys is smaller.

The downside of this scheme is that reading a particular ki=>vi requires loading 2 ssblocks. So instead of doing this for all keys, say we did this for only older versions of keys in the sstable. Something like:

'0' + k0@10 => v0
'0' + k0@9 => index 0
'0' + k0@8 => index 1
'0' + k1@50 => v1
'0' + k1@40 => index 2
...
'1' + 0 => <value>
'1' + 1 => <value>
'1' + 2 => <value>

All the keys are visible as usual and the values can be lazily loaded if there is indirection. If we get unlucky and k0@10 has been deleted at a higher level, say because the transaction was aborted, there is no correctness issue since we can load the value for k0@9. Assuming such aborts (or pushes causing rewrites) are rare, we will mostly not need to do this. Scans will typically want to step from k0@10 to k1@50, and won't need to pay the expense of reading the older values of k0. And even if the key lengths are not significantly smaller than value lengths, we will get prefix compression on the different versions of the same key.
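
A sketch of the read-side indirection, with an invented one-byte tag distinguishing inline values from indexes into the value section (the tags, encoding, and loadFromValueSection accessor are assumptions for illustration, not an actual format):

import "encoding/binary"

const (
    tagInline   byte = 0x00 // hypothetical: value stored inline after the tag
    tagIndirect byte = 0x01 // hypothetical: value lives in the value section
)

// lazyValue returns the value for a decoded entry, touching the value
// section only when the entry is an indirection. loadFromValueSection is a
// hypothetical accessor that reads (and decompresses) the referenced value
// block on demand.
func lazyValue(entry []byte, loadFromValueSection func(index uint64) ([]byte, error)) ([]byte, error) {
    if entry[0] == tagInline {
        return entry[1:], nil
    }
    index, _ := binary.Uvarint(entry[1:]) // hypothetical varint-encoded index
    return loadFromValueSection(index)
}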


jbowens commented Nov 2, 2021

@sumeerbhola Nice, that's cool.

I'd briefly entertained storing values out of band when thinking about range keys because of the duplication incurred from fragmentation. I didn't entertain it for long since range keys are rare. It makes a lot of sense in this case.

And even if the key lengths are not significantly smaller than value lengths, we will get prefix compression on the different versions of the same key.

Now I'm wondering whether you could make use of the prefix compression to avoid the extra key comparisons too. Something like: the caller calls NextPrefix(prefixLength int), passing len(mvccKey.Key) if it read a non-provisional key at Iterator.Key(), and the iterator at the top of the heap skips over any keys for which shared > prefixLength.
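
A sketch of that skip within a block's restart interval; blockEntry is a simplified stand-in for a prefix-compressed sstable entry, where shared counts the bytes carried over from the previous key:

type blockEntry struct {
    shared   int    // bytes shared with the previous key
    unshared []byte // remaining key suffix
}

// skipSamePrefix returns the index of the first entry that may start a new
// key prefix. Entries sharing more than prefixLength bytes with their
// predecessor must have the same MVCC key prefix and can be skipped
// without any key comparison.
func skipSamePrefix(entries []blockEntry, prefixLength int) int {
    i := 0
    for ; i < len(entries); i++ {
        if entries[i].shared <= prefixLength {
            break // possibly a new prefix; fall back to a real comparison
        }
    }
    return i
}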

sumeerbhola commented

@jbowens nice idea for avoiding the key comparison -- it seems very doable.

Adding an offline comment by @nvanbenschoten on how queue-like workloads could benefit from this:

It’s interesting to consider what an unoptimized queue-like workload would look like in this scheme. Because the latest version for most keys would be a deletion tombstone, the “key” section would consist almost exclusively of keys -> tombstone and shadowed keys -> indirect_value_ptr. So we would never need to load the values of the deleted queue entries, and the cost of a scan would be independent of the value size of the deleted queue entries. I don’t have a mental model for the cost of a scan of tombstones (one that collects no live versions), but I imagine that this would provide a sizable speedup, especially for a workload that generates large queue entries.

sumeerbhola commented

@jbowens pointed out that this organization may also help during GC of older versions.

We do two scans for GC:

  • Using gcIterator, where we look at both the keys and values in order to identify what can be GC'd. Reading the values from the value section will be slower in this case. However, it seems we don't typically need the value -- the only place we actually use it is when it is an intent. We should be able to change the code to not load MVCC values.
  • Performing the GC: MVCCGarbageCollect only needs the key and the size of the value (for updating MVCCStats). So the key section should store the value size when there is an indirection, so that we can avoid loading the value.

A different issue: not all keys with a non-empty suffix are true CockroachDB MVCC keys. Lock table keys are the exception, where the ordering on the suffix is random (contains the txn UUID). One easy way to accommodate this would be for the Pebble DB Options to allow specification of key span exclusion(s) for where this separation of versions is not applicable -- we would use it to exclude [LockTableSingleKeyStart, LockTableSingleKeyEnd).
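
A sketch of what such an exclusion option might look like as a Pebble Options field (all names here are hypothetical):

// KeySpan is a half-open key span [Start, End).
type KeySpan struct {
    Start, End []byte
}

type Options struct {
    // ...
    // VersionSeparationExcludedSpans lists key spans whose older versions
    // are never separated; CockroachDB would pass
    // [LockTableSingleKeyStart, LockTableSingleKeyEnd).
    VersionSeparationExcludedSpans []KeySpan
}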


jbowens commented Nov 3, 2021

For the case where there's a lot of garbage like the naive queue implementation, could we use a block property collector, my favorite new Pebble feature?

We could write a block property collector that records the prefix cardinality of a block/index-block/sstable. If a block/index-block/sstable has cardinality 1, a NextPrefix call could skip the rest of the block/index-block/sstable at the top of the mergingIter heap.
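
A sketch of such a collector, using a simplified stand-in for Pebble's block property collector interface (the real interface differs; names here are illustrative):

import (
    "bytes"
    "encoding/binary"
)

// prefixCardinalityCollector counts distinct key prefixes per data block.
type prefixCardinalityCollector struct {
    count      uint64
    lastPrefix []byte
}

// Add is called for each key; split returns the length of the key's
// prefix, mirroring the Comparer's Split function.
func (c *prefixCardinalityCollector) Add(key []byte, split func([]byte) int) {
    prefix := key[:split(key)]
    if !bytes.Equal(prefix, c.lastPrefix) {
        c.count++
        c.lastPrefix = append(c.lastPrefix[:0], prefix...)
    }
}

// FinishDataBlock encodes the block's prefix cardinality and resets state.
// A reader-side filter that sees cardinality 1 can answer NextPrefix by
// skipping the rest of the block.
func (c *prefixCardinalityCollector) FinishDataBlock() []byte {
    prop := binary.AppendUvarint(nil, c.count)
    c.count, c.lastPrefix = 0, c.lastPrefix[:0]
    return prop
}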


jbowens commented Nov 19, 2021

If we did compaction-time GC, does the encoding of a key still need to be stable once it's committed? I'm imagining modifying committed MVCC keys to set a bit indicating that they are committed.

This could be done through a new special range key that associates a key span (always of whole prefixes) with both a closed timestamp and a GC threshold timestamp. These range keys descend the LSM and respect the LSM invariant with respect to keys older than their closed timestamp. In a compaction with one of these range keys, a user-defined merger is invoked for every point key within the range key's span.

type Options struct {
    // ...
    RangeKeyMerger func() RangeKeyMerger
}

type RangeKeyMerger interface {
    InitRangeKey(rangeValue []byte) error
    MergePointKey(key, value []byte) (newKey, newValue []byte, drop bool)
}
  • If the point key's timestamp is greater than the range key's closed timestamp, the point key is output unmodified.
  • If the point key's timestamp is less than the range key's GC threshold timestamp and the merger has already seen another key with the same prefix, it returns (nil, nil, true /* drop */) to drop the value.
  • If the point key's timestamp is less than the range key's closed timestamp, it returns the key with the closed bit set (or the key unchanged if the bit is already set). A sketch follows this list.
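
A sketch of a merger implementing the three rules above; mvccTimestamp, mvccPrefix, and setClosedBit are hypothetical helpers over the engine's key encoding:

import "bytes"

type mvccGCMerger struct {
    closedTS, gcThresholdTS uint64
    lastPrefix              []byte
}

func (m *mvccGCMerger) MergePointKey(key, value []byte) (newKey, newValue []byte, drop bool) {
    ts := mvccTimestamp(key)  // hypothetical: decode the key's MVCC timestamp
    prefix := mvccPrefix(key) // hypothetical: the key minus its version suffix
    samePrefix := bytes.Equal(prefix, m.lastPrefix)
    m.lastPrefix = append(m.lastPrefix[:0], prefix...)

    switch {
    case ts > m.closedTS:
        return key, value, false // rule 1: possibly provisional; pass through
    case ts < m.gcThresholdTS && samePrefix:
        return nil, nil, true // rule 2: shadowed, GC-able garbage
    default:
        return setClosedBit(key), value, false // rule 3: mark committed
    }
}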

MVCCStats would be updated when these range keys are written. The range keys would be elided out of L6.

With this scheme:

  • An 'mvcc garbage' block property collector can examine keys and determine when a block is all committed garbage. It can set the block's annotation to the highest timestamp of any of its keys. A read at a timestamp t can use a block property filter that skips over any blocks with garbage timestamps < t.
  • Garbage compaction happens at compaction-time, through these range keys.

One obstacle that immediately comes to mind is that the above scheme changes a key's encoding once it's merged with one of these range keys. That would be a problem with today's GC scheme, but performing compaction-time GC sidesteps that. I'm probably unaware of other problems with unstable mvcc keys.

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 6, 2022
When WriterOptions.ValueBlocksAreEnabled is set to true, older
versions of a key are written to a sequence of value blocks, and
the key contains a valueHandle which is a tuple
(valueLen, blockNum, offsetInBlock). The assumption here is
that most reads only need the value of the latest version, and
many reads that care about an older version only need the value
length.

Value blocks are a simple sequence of (varint encoded length,
value bytes) tuples such that given the uncompressed value
block, the valueHandle can cheaply read the value. The value
blocks index connects the blockNum in the valueHandle to the
location of the value block. It uses fixed width encoding to
avoid the expense of a general purpose key-value block.

See the comment at the top of value_block.go for details.

The following are preliminary results from a read benchmark,
after some performance tuning. The old numbers are master.
The needValue=false cases are the ones where value blocks are
expected to help.
- The versions=1 runs have no values in value blocks, and the slowdown
  is the extra call to valueBlockReader, which needs to subslice
  to remove the single byte prefix.
- The hasCache=false cases correspond to a cold cache, where there
  will be additional wasted decompression of values that we don't
  need (when needValue=false). As expected, when there is an
  improvement, it is larger with hasCache=false: for example, the
  -97.83% below (almost 50x faster) compared with -79.89%.
- The needValue=true cases are where the code can be up to 2x slower.
  The higher slowdowns occur when the value size is smaller: in
  such cases more inline values can be packed into an ssblock, and
  the code overhead of decoding the valueHandle and the value
  length in the value block (all of these are varints) becomes
  a significant component.

This is a prototype in that there are no changes to the
InternalIterator interface, and the read path only works for
singleLevelIterator.

name                                                                        old time/op    new time/op    delta
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=false-16        25.5ns ± 3%    25.9ns ± 2%   +1.50%  (p=0.028 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=true-16         15.6ns ± 1%    15.5ns ± 2%     ~     (p=0.268 n=9+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=false-16         27.3ns ± 3%    29.5ns ± 3%   +8.11%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=true-16          17.1ns ± 2%    19.2ns ± 2%  +12.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=false-16       26.7ns ± 2%    29.4ns ± 2%  +10.46%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=true-16        15.9ns ± 2%    15.2ns ± 3%   -4.63%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=false-16        26.7ns ± 2%    53.0ns ± 4%  +98.79%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=true-16         16.6ns ± 1%    26.7ns ± 2%  +61.05%  (p=0.000 n=9+9)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=false-16      28.3ns ± 4%    25.3ns ± 5%  -10.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=true-16       15.8ns ± 2%    14.9ns ± 1%   -5.66%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=false-16       29.4ns ± 4%    47.8ns ± 3%  +62.46%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=true-16        16.7ns ± 4%    26.1ns ± 3%  +56.04%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=false-16        123ns ± 4%     125ns ± 7%     ~     (p=0.735 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=true-16        23.0ns ± 5%    22.9ns ± 5%     ~     (p=0.684 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=false-16         124ns ± 6%     131ns ± 7%   +5.76%  (p=0.008 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=true-16         24.3ns ± 4%    26.4ns ± 3%   +8.26%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=false-16       130ns ± 8%      27ns ± 4%  -79.10%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=true-16       23.8ns ± 4%    16.6ns ± 2%  -30.00%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=false-16        128ns ± 9%     164ns ±12%  +27.94%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=true-16        25.0ns ± 4%    33.0ns ± 2%  +32.22%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=false-16      123ns ± 9%      28ns ± 3%  -76.89%  (p=0.000 n=9+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=true-16      23.0ns ± 2%    15.3ns ± 5%  -33.36%  (p=0.000 n=10+9)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=false-16       132ns ± 2%     171ns ± 5%  +29.24%  (p=0.000 n=8+10)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=true-16       24.3ns ± 3%    32.6ns ± 3%  +33.98%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=false-16      1.45µs ± 8%    1.35µs ±10%   -6.41%  (p=0.015 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=true-16       75.5ns ± 2%    76.7ns ± 5%     ~     (p=0.218 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=false-16       1.34µs ± 3%    1.46µs ±16%   +9.03%  (p=0.022 n=9+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=true-16        77.0ns ± 3%    79.9ns ± 3%   +3.80%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=false-16     1.46µs ± 6%    0.13µs ± 3%  -91.15%  (p=0.000 n=9+9)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=true-16      76.4ns ± 3%    21.4ns ± 2%  -72.06%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=false-16      1.47µs ± 8%    1.56µs ± 7%   +5.72%  (p=0.013 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=true-16       78.1ns ± 4%    76.1ns ± 2%   -2.52%  (p=0.009 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=false-16    1.34µs ± 5%    0.03µs ± 2%  -97.83%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=true-16     77.0ns ± 2%    15.5ns ± 2%  -79.89%  (p=0.000 n=8+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=false-16     1.42µs ± 9%    1.49µs ± 2%   +5.28%  (p=0.007 n=10+9)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=true-16      78.5ns ± 4%    73.0ns ± 4%   -7.01%  (p=0.000 n=10+9)

Informs cockroachdb#1170
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 7, 2022
(Same commit message as the Jan 6 commit above, with the following addition.)

To put these numbers in context of what fraction of bytes in an sstable
are in regular data blocks (the rest are in value blocks):

value-size:  100,versions:  1,data-block-bytes:  1185521,total-bytes:  1186984,fraction-in-data-blocks:1.00
value-size:  100,versions: 10,data-block-bytes:   174822,total-bytes:  1085218,fraction-in-data-blocks:0.16
value-size:  100,versions:100,data-block-bytes:    82118,total-bytes:  1083393,fraction-in-data-blocks:0.08
value-size: 1000,versions:  1,data-block-bytes: 10197788,total-bytes: 10203513,fraction-in-data-blocks:1.00
value-size: 1000,versions: 10,data-block-bytes:  1199435,total-bytes: 10222244,fraction-in-data-blocks:0.12
value-size: 1000,versions:100,data-block-bytes:   169931,total-bytes: 10094442,fraction-in-data-blocks:0.02
value-size:10000,versions:  1,data-block-bytes:100246534,total-bytes:100292713,fraction-in-data-blocks:1.00
value-size:10000,versions: 10,data-block-bytes: 10204053,total-bytes:100256921,fraction-in-data-blocks:0.10
value-size:10000,versions:100,data-block-bytes:  1198846,total-bytes:100252245,fraction-in-data-blocks:0.01

Informs cockroachdb#1170

jbowens commented Jun 29, 2022

I've been thinking about this issue in the context of MVCCDeleteRange. A span that includes an MVCCDeleteRange will suffer a performance cliff, because an iterator seeking into the deleted span needs to construct a range-key iterator stack and perform extra key comparisons during interleaving.

If an MVCCDeleteRange falls into L6, we could output it and the point keys it covers into a garbage-only L7. Iterators reading at recent timestamps could skip the entire file. This particular case is easier than the general garbage case, since there are no provisional MVCCDeleteRanges.


erikgrinaker commented Jun 30, 2022

there are no provisional MVCCDeleteRanges

It's possible that these will exist in the future (along with ranged intents), if SQL wants to start making use of them. But there are no short-term plans for this.

Also, Pebble can't know whether a range key is an MVCCDeleteRange or some other ranged key type, so CRDB will have to inform Pebble about these semantics. Although I suppose any kind of MVCC range key would imply that points below it are garbage.

We discussed something related over in cockroachdb/cockroach#57260 (comment), where CRDB can lay down a new kind of range key during GC that it can use in compaction filters to do MVCC GC during Pebble compactions.


jbowens commented Jun 30, 2022

Also, Pebble can't know whether a range key is an MVCCDeleteRange or some other ranged key type, so CRDB will have to inform Pebble about these semantics.

Yeah, I think we should do this regardless when we introduce any other ranged key type, so that we can explicitly avoid applying range-key masking using non-MVCCDeleteRange range keys.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Oct 6, 2022
The GC scan to identify garbage does not need the value
of MVCC data, and only needs the length and whether the
value is a tombstone. This change switches it to use
ReplicaMVCCDataIterator.MVCCValueLenAndIsTombstone which
passes through to the underlying MVCCIterator.

Informs cockroachdb/pebble#1170

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Oct 10, 2022
89522: gc: switch gc to use MVCCValueLenAndIsTombstone r=nvanbenschoten,erikgrinaker a=sumeerbhola

The GC scan to identify garbage does not need the value of MVCC data, and only needs the length and whether the value is a tombstone. This change switches it to use ReplicaMVCCDataIterator.MVCCValueLenAndIsTombstone which passes through to the underlying MVCCIterator.

Informs cockroachdb/pebble#1170

Release note: None


jbowens commented Jan 18, 2023

@sumeerbhola — are you treating this issue as the primary issue tracking value blocks? I put this under In Progress and assigned you, but feel free to unassign and move it if not.

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 24, 2023
BenchmarkIteratorScanManyVersions, which iterates over all the versions,
now also runs with readValue=true. Similarly, for
BenchmarkIteratorScanNextPrefix, which iterates over the most recent
versions.

Informs cockroachdb#1170
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 24, 2023
BenchmarkIteratorScanNextPrefix now exercises value blocks. Also, the
benchmark now does an iterator step per benchmark iteration instead of a
full scan -- this makes the benchmark faster and allows us to compare
performance metrics across benchmark parameters (since the metrics do not
vary based on the total number of different key prefixes in the LSM).

There is a tiny tweak to the field ordering in LazyFetcher and additional
code comments about performance and future optimization.

Informs cockroachdb#1170
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 24, 2023
BenchmarkIteratorScanManyVersions, which iterates over all the versions,
now also runs with readValue=true. Similarly, for
BenchmarkIteratorScanNextPrefix, which iterates over the most recent
versions.

Informs cockroachdb#1170
sumeerbhola added a commit that referenced this issue Jan 24, 2023
BenchmarkIteratorScanManyVersions, which iterates over all the versions,
now also runs with readValue=true. Similarly, for
BenchmarkIteratorScanNextPrefix, which iterates over the most recent
versions.

Informs #1170
sumeerbhola added a commit that referenced this issue Jan 24, 2023
BenchmarkIteratorScanNextPrefix now exercises value blocks. Also, the
benchmark now does an iterator step per benchmark iteration instead of a
full scan -- this makes the benchmark faster and allows us to compare
performance metrics across benchmark parameters (since the metrics do not
vary based on the total number of different key prefixes in the LSM).

There is a tiny tweak to the field ordering in LazyFetcher and additional
code comments about performance and future optimization.

Informs #1170
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 29, 2023
sumeerbhola added a commit that referenced this issue Jan 31, 2023
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 15, 2023
execinfrapb.ScanStats and execstats.ScanStats have these new fields.
The intention here is for traces to show these stats and not to
complete the plumbing to expose these in the fingerprint stats table.

Informs cockroachdb/pebble#1170

Epic: CRDB-20378

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Feb 15, 2023

97182: storage,sql: add separated value iteration stats to ScanStats r=ericharmeling a=sumeerbhola

execinfrapb.ScanStats and execstats.ScanStats have these new fields. The intention here is for traces to show these stats and not to complete the plumbing to expose these in the fingerprint stats table.

Informs cockroachdb/pebble#1170

Epic: CRDB-20378

Release note: None

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Feb 15, 2023
sumeerbhola added a commit that referenced this issue Feb 15, 2023
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Feb 20, 2023
sumeerbhola commented

The remaining work is to produce a tech note describing the design.


jbowens commented May 16, 2023

Something that didn't click for me until I was talking with @itsbilal just now: the setHasSameKeyPrefix bit can also be used in the progressive restore case when restoring from backup sstables with revision history. During those restores, we need to ignore all the keys in the backup sstable that are not the first key with their prefix. When transforming backup sstables' blocks for sstables created with TableFormatv3, we can just skip any keys with the bit set.
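
A sketch of that transform-time skip, over hypothetical iterator and writer interfaces (hasSameKeyPrefix stands in for reading the bit from the block encoding):

type blockIter interface {
    Next() bool
    hasSameKeyPrefix() bool // hypothetical: the per-key bit described above
    Key() []byte
    Value() []byte
    Error() error
}

type sstWriter interface {
    Set(key, value []byte) error
}

// writeFirstVersionsOnly copies only the first key of each prefix from a
// backup sstable into the restored sstable, dropping the older versions.
func writeFirstVersionsOnly(in blockIter, out sstWriter) error {
    for in.Next() {
        if in.hasSameKeyPrefix() {
            continue // not the first key with this prefix; skip it
        }
        if err := out.Set(in.Key(), in.Value()); err != nil {
            return err
        }
    }
    return in.Error()
}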


sumeerbhola commented May 16, 2023

During those restores, we need to ignore all the keys in the backup sstable that are not the first key with the prefix.

Are we restoring to a point in time, so it could be an arbitrary version, and not necessarily the most recent one? The optimization does seem worthwhile.


jbowens commented May 17, 2023

Yeah, I think it's possible to restore to a particular point-in-time that might be behind the most recent version of a key. I suppose at transform time, we could use the block properties (do we write these to backups today?) to determine whether the restore timestamp is higher than all the keys within the block.

I think it'll be relatively uncommon to need to restore to a non-most-recent version, because only the full backup will be restored "lazily" and require transformation.
