sstable: reduce block cache fragmentation #3508
Conversation
Force-pushed from 7beb636 to 70be36b.
The internal fragmentation reduction is mostly observable for small key-value pairs, where data is packed efficiently near the 32KiB threshold. In these cases the 32B of metadata for a block cache entry can push the allocation past 32KiB, causing jemalloc to allocate from the 40KiB size class instead of the 32KiB class. For example, a 32750B block plus 32B of cache metadata is a 32782B allocation, which no longer fits the 32KiB (32768B) class and lands in the 40KiB (40960B) class, wasting 8178B. We expect that these changes will not reduce fragmentation for large key-value pairs; however, they should not introduce any memory regressions. The following tests involved running kv-50 workloads on a single cockroach node with 4 vCPUs and 16GB of RAM, with varying key-value sizes. To illustrate the fragmentation improvements and shorten the test duration, the block cache size was reduced to 400MB.

TL;DR

Note: Fragmentation was computed as the difference between the size of each allocation's size class and the requested allocation size.

Test 1 (KV Block Size: 1-16B)
- Previous System: 14393295 allocations
  - Note: For small key-value pairs a large number of allocations fall into the 40KiB size class, and we are wasting ~8MiB of space on these allocations.
- New Heuristics: 30157396 allocations

Test 2 (KV Block Size: 1-1024B)
- Previous System: 20823143 allocations
- New Heuristics: 20823143 allocations

Test 3 (KV Block Size: 1KiB-16KiB)
- Previous System: 5883656 allocations
- New Heuristics: 7904419 allocations
  - Note: Even with the new heuristics many allocations fall into the 40KiB size class, due to the minimum size threshold (29.4KiB, i.e. 90% of the 32KiB target) required to flush an sstable block to disk. However, the size of the KV blocks also helps reduce internal fragmentation in the larger size classes.
Force-pushed from afe5e62 to c72080f.
My recollection is that the block flushing heuristics allow the target block size to be exceeded if the block isn't too full. Specifically, the `sizeThreshold` for a block is computed as 99% of the target block size. If the block is smaller than that threshold, a new key-value pair is always allowed to be added. 99% of 32768 is 32440, so a new key-value pair larger than 328 bytes will almost always bump us past the target block size. This is independent of the value metadata size. If we know the target block size and the jemalloc size classes, I think we could enhance `shouldFlush` to make a more informed decision that optimizes for reducing internal fragmentation (i.e. make whichever choice of flushing or not flushing minimizes internal fragmentation).
Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @CheranMahalingam)
internal/cache/value_invariants.go
line 20 at r1 (raw file):
```go
// NewValueMetadataSize returns the number of bytes of metadata allocated for
// a cache entry.
func NewValueMetadataSize() int {
```
Nit: I think this can be a `const ValueMetadataSize = 0`, with a little effort of sprinkling the other definitions in files that have appropriate build constraints.
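A minimal sketch of the build-constraint approach being suggested (the file names and tag sets here are illustrative, not the actual Pebble layout; the 32B/0B values come from later in this thread):

```go
// value_metadata_cgo.go
//go:build cgo

package cache

// ValueMetadataSize is the number of bytes of metadata allocated for a cache
// entry when the manually managed allocator is in use.
const ValueMetadataSize = 32
```

```go
// value_metadata_nocgo.go
//go:build !cgo

package cache

// Without cgo the Go runtime manages Value allocations, so no extra metadata
// is prepended to the entry.
const ValueMetadataSize = 0
```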
Force-pushed from c176db1 to 825ca11.
During my experiments I noticed that at runtime the `sizeThreshold` is now 90% of the target block size (29492B); we seem to be using the default threshold (https://github.com/cockroachdb/pebble/blob/master/internal/base/options.go#L11). In this case there is about 3KiB of leeway; however, in cases where the key-value pair is larger, such as 4KiB, this threshold gets in the way and contributes to internal fragmentation. For key-value pairs of 10-11KiB this isn't problematic, since they will push the block over the 32KiB target but land just under the 40KiB size class.
The simplest solution I can think of, when the block size is below the threshold, is a check to see whether adding the next key-value pair would decrease fragmentation (i.e. `32768 - blockSize > 40960 - futureBlockSize`). However, I am not sure what the behaviour should be for large key-value pairs that push the block size beyond 40KiB, and there is an argument that blocks above 32KiB could have less internal fragmentation if we waited until the size was closer to 40KiB. I am wondering whether we still want to pursue a size-class aware heuristic.
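For concreteness, a worked instance of that check with hypothetical numbers: suppose `blockSize` is 30000B and the next key-value pair adds 9000B, so `futureBlockSize` is 39000B. Flushing now wastes 32768 − 30000 = 2768B in the 32KiB class, while flushing after the addition wastes 40960 − 39000 = 1960B in the 40KiB class; since 2768 > 1960, the check says to keep filling the block.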
Reviewable status: 0 of 4 files reviewed, all discussions resolved (waiting on @CheranMahalingam)
Ah, you're correct that the default target block size threshold is 90%. I looked at the code really quickly and misread it. I also misread your testing: I thought "KV Block" referred to the target block size (which was confusing given the numbers you provided), but it really refers to the size of a key+value.
A size-class aware heuristic shouldn't be difficult. Do you think it wouldn't be worthwhile? The jemalloc size classes can either be hardcoded or injected via an Option (the Option approach would probably be better). When deciding whether to flush, we check whether the new KV would bump us up to the next size class from the one the block currently fits in. If it would, we look to see which is smaller: `nextSizeClass(currentBlockSize) - currentBlockSize` or `nextSizeClass(futureBlockSize) - futureBlockSize`.
When the target block size is exceeded we can also keep adding KVs until the next size class would be exceeded (we don't have to flush the block immediately once it exceeds the target).
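A minimal sketch of that comparison, assuming a hypothetical `nextSizeClass` helper over the injected size classes (this is illustrative, not the committed implementation):

```go
// nextSizeClass returns the smallest size class >= n from a sorted,
// ascending slice of size classes (e.g. ..., 32768, 40960, ...).
func nextSizeClass(n int, sizeClasses []int) (int, bool) {
	for _, c := range sizeClasses {
		if c >= n {
			return c, true
		}
	}
	return 0, false
}

// shouldFlushNow reports whether flushing the current block wastes less of
// its size class than flushing after the next KV is appended.
func shouldFlushNow(currentBlockSize, futureBlockSize int, sizeClasses []int) bool {
	cur, ok1 := nextSizeClass(currentBlockSize, sizeClasses)
	fut, ok2 := nextSizeClass(futureBlockSize, sizeClasses)
	if !ok1 || !ok2 {
		return true // no applicable size class; fall back to flushing
	}
	// Flush now only if the current block wastes no more of its size class
	// than the future block would waste of its (possibly larger) class.
	return cur-currentBlockSize <= fut-futureBlockSize
}
```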
PS: About the new internal/cache code: did you mean to revert the changes to `options.go`?
Reviewable status: 0 of 4 files reviewed, all discussions resolved (waiting on @CheranMahalingam)
Force-pushed from 825ca11 to 458a389.
I can try a size-class aware heuristic and see how it performs with key-value pair sizes that introduce a lot of fragmentation. Just for clarification, should the `shouldFlush` logic look something like the sketch below? I am not sure how we want to prioritize reducing internal fragmentation vs. keeping the sstable block size close to the 32KiB target, or how performant this code path has to be.
```go
if len(sizeClassHints) > 0 {
	if blockSize > targetBlockSize {
		nextSizeClass := sizeClassFunc(blockSize)
		// Bound the future block size to the first size class above the
		// target (40KiB) to prevent a rare scenario where blockSize keeps
		// increasing without flushing because fragmentation keeps decreasing.
		if nextSizeClass-blockSize > nextSizeClass-futureBlockSize &&
			futureBlockSize <= nextSizeClass &&
			nextSizeClass == sizeClassFunc(targetBlockSize+1) {
			// Wait, because we can definitely reduce fragmentation in the future.
			return false
		}
		return true
	}
	// Below the target: remove the sizeThreshold heuristic.
	if futureBlockSize > targetBlockSize {
		if sizeClassFunc(futureBlockSize)-futureBlockSize < sizeClassFunc(blockSize)-blockSize {
			// Option that reduces fragmentation, with a small risk of
			// flushing a very large sstable block.
			return false
		}
		return true
	}
	return false
}
// ... existing logic ...
```
Sorry, I reverted my changes while experimenting with having the Cockroach layer adjust the block size instead, due to the significant number of test failures this change causes in Pebble; however, I found some workarounds in the latest revision.
Reviewable status: 0 of 10 files reviewed, all discussions resolved (waiting on @CheranMahalingam)
Yes, something like those heuristics is what I'm imagining. We do call `shouldFlush` on every KV pair, but I think you can add fast-path checks to make this very fast.
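A hedged sketch of what such a fast path might look like (names mirror the pseudocode above; this is illustrative, not the committed implementation):

```go
// Fast path: if even the block with the next KV appended stays at or below
// the target, we never flush, so skip the size-class lookups entirely.
if futureBlockSize <= targetBlockSize {
	return false
}
// Slow path: consult the size classes as in the sketch above.
```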
Reviewable status: 0 of 10 files reviewed, all discussions resolved (waiting on @CheranMahalingam)
Force-pushed from 458a389 to 1b1901e.
The size-class aware heuristics show a significant improvement in fragmentation for general workloads where the key-value pair size has a large range. However, these improvements do not directly correspond to a reduction in memory usage. The tests were run on a machine with 16 vCPUs and 64GB of RAM, where the block cache size was set to 16GB (we also ran the following on the previous setup, with similar results). We ran a kv-50 workload with key-value pair sizes ranging from 1B to 16KiB and enough concurrency/batching to saturate the cache.

TL;DR
| Version | Fragmentation (B/Alloc) | Space Wasted Per Allocation (%) | Jemalloc Allocated Bytes (GiB) | Jemalloc Total Bytes (GiB) |
|---|---|---|---|---|
| Current | 3755 | 11.26 | 16.621 | 19.096 |
| New Heuristics | 1730 | 6.21 | 15.785 | 18.726 |
| Improvement | | +5.05% | +5.03% | +1.94% |
Results
Current
- Allocations: 16187686
- Average Allocation Size (B): 33358

Jemalloc Stats
- Average Allocated Bytes (GiB): 16.621
- Average Resident Bytes (GiB): 19.096

| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
|---|---|---|
| 64 | 695 | 5024 |
| 56 | 4181 | 4816 |
| 48 | 1590425 | 6409 |
| 40 | 8343194 | 4658 |
| 32 | 5284819 | 1554 |
| 28 | 18039 | 2153 |
| 24 | 22626 | 2257 |
| 20 | 17984 | 2031 |
| 16 | 11573 | 920 |
New Heuristics
- Allocations: 16207665
- Average Allocation Size (B): 27867

Jemalloc Stats
- Average Allocated Bytes (GiB): 15.785
- Average Resident Bytes (GiB): 18.726

| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
|---|---|---|
| 64 | 986 | 5173 |
| 56 | 6166 | 4151 |
| 48 | 101524 | 5263 |
| 40 | 2285861 | 1492 |
| 32 | 5855568 | 1818 |
| 28 | 3766744 | 1689 |
| 24 | 2492485 | 1776 |
| 20 | 740086 | 1346 |
| 16 | 13756 | 953 |
Reviewable status: 0 of 13 files reviewed, all discussions resolved (waiting on @CheranMahalingam)
I didn't scrutinize the implementation, but the numbers here look compelling. I'm going to leave it to the Storage folks to do a detailed code review.
I'm surprised that Jemalloc Total Bytes didn't fall by the same amount as Jemalloc Allocated Bytes.
Reviewable status: 0 of 13 files reviewed, all discussions resolved (waiting on @CheranMahalingam)
Force-pushed from 1b1901e to f874ba5.
Reviewed 2 of 4 files at r2, 1 of 8 files at r3, 2 of 5 files at r4, all commit messages.
Reviewable status: 5 of 13 files reviewed, 1 unresolved discussion (waiting on @CheranMahalingam and @jbowens)
sstable/writer.go
line 1656 at r5 (raw file):
```go
newSize += 4                            // varint for shared prefix length
newSize += uvarintLen(uint32(keyLen))   // varint for unshared key bytes
newSize += uvarintLen(uint32(valueLen)) // varint for value size
```
Worth adding a code comment that `newSize` is also an upper bound, even though it is tighter than `newEstimatedSize`, since it does not account for key prefix compression.
For very long keys with a long shared prefix and tiny values, such as secondary indexes, this over-estimation could be a problem in that it could result in tiny blocks. I am not sure what to do about this, given we don't want to incur the expense of computing the shared prefix (which we do in `blockWriter.storeWithOptionalValuePrefix`). We may want to use the `sizeThreshold` after all, and reduce it to something smaller, like 60% (and return false if `estimatedBlockSize <= sizeThreshold`).
I was expecting to see `cache.ValueMetadataSize` accounted for here in adjusting `estimatedBlockSize`, instead of in options. It will naturally bump up the effective `BlockSize` if someone sets a very tiny value, since we will add at least one entry; and after that entry is added the block will be bigger than `BlockSize`, and all subsequent attempts to add will fall through to the size-class based comparisons. Was there a reason to do it in options.go? The logic there seems not quite right, in that if someone sets a block size of 2 bytes, they should be getting a tiny block, and not a block size of `DefaultBlockSize`.
Force-pushed from f874ba5 to 852a5db.
Reviewable status: 4 of 14 files reviewed, 2 unresolved discussions (waiting on @jbowens and @sumeerbhola)
sstable/writer.go
line 1656 at r5 (raw file):
Previously, sumeerbhola wrote…

> Worth adding a code comment that `newSize` is also an upper bound, even though it is tighter than `newEstimatedSize`, since it does not account for key prefix compression.
>
> For very long keys with a long shared prefix and tiny values, such as secondary indexes, this over-estimation could be a problem in that it could result in tiny blocks. I am not sure what to do about this, given we don't want to incur the expense of computing the shared prefix (which we do in `blockWriter.storeWithOptionalValuePrefix`). We may want to use the `sizeThreshold` after all, and reduce it to something smaller, like 60% (and return false if `estimatedBlockSize <= sizeThreshold`).
>
> I was expecting to see `cache.ValueMetadataSize` accounted for here in adjusting `estimatedBlockSize`, instead of in options. It will naturally bump up the effective `BlockSize` if someone sets a very tiny value, since we will add at least one entry; and after that entry is added the block will be bigger than `BlockSize`, and all subsequent attempts to add will fall through to the size-class based comparisons. Was there a reason to do it in options.go? The logic there seems not quite right, in that if someone sets a block size of 2 bytes, they should be getting a tiny block, and not a block size of `DefaultBlockSize`.
Nice, I had overlooked how the shared prefix is computed later. Do you have a preference for what that size threshold should be? I have left it at 60% (~19KiB), but we can increase it if we want to prioritize having block sizes closer to 32KiB, or if we want to shift the bin size distribution up (i.e. larger blocks -> fewer allocations in a saturated cache -> less overhead from allocator metadata).
You're right, pushing the metadata accounting down to the flush heuristics also avoids breaking existing tests.
sstable/writer.go
line 1649 at r6 (raw file):
```go
// us at risk of overestimating the size and flushing small blocks. We
// mitigate this by imposing a minimum size restriction.
sizeClassAwareThreshold := (targetBlockSize*base.SizeClassAwareBlockSizeThreshold + 99) / 100
```
This change is awkward because we already store size thresholds in the `Writer` and pass that value down to `shouldFlushWithHints`. We could make those thresholds dynamic by checking whether the size class hints are empty; however, if we are then forced to fall back to the old heuristics, the original 90% threshold is lost. Wondering whether there is a cleaner way to express this.
Can you rerun the last experiment with these changes to validate that the results are still good?
Reviewed 2 of 10 files at r6.
Reviewable status: 6 of 14 files reviewed, 4 unresolved discussions (waiting on @CheranMahalingam and @jbowens)
sstable/writer.go
line 1638 at r6 (raw file):
```go
// loaded into the block cache we also allocate space for the cache entry
// metadata. The new allocation of size ~1052B may now only fit within a
// 2048B class, which increases fragmentation.
```
nit: increases internal fragmentation.
sstable/writer.go
line 1649 at r6 (raw file):
Previously, CheranMahalingam wrote…

> This change is awkward because we already store size thresholds in the `Writer` and pass that value down to `shouldFlushWithHints`. We could make those thresholds dynamic by checking whether the size class hints are empty; however, if we are then forced to fall back to the old heuristics, the original 90% threshold is lost. Wondering whether there is a cleaner way to express this.
Let's just compute this once in `Writer`, wrap the thresholds in a `flushDecisionOptions` struct with three fields (`blockSize`, `blockSizeThreshold`, `sizeClassAwareThreshold`), and pass that options struct down.
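A minimal sketch of the suggested struct (the field names come from this thread; the types and comments are assumptions):

```go
// flushDecisionOptions bundles the precomputed flush thresholds so they can
// be computed once in Writer and passed down to shouldFlushWithHints.
type flushDecisionOptions struct {
	blockSize               int
	blockSizeThreshold      int // e.g. 90% of blockSize, for the legacy heuristic
	sizeClassAwareThreshold int // e.g. 60% of blockSize, the size-class-aware minimum
}
```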
sstable/writer.go
line 1674 at r6 (raw file):
```go
if blockSizeWithMetadata < targetBlockSize {
	newSizeClass, ok := blockSizeClass(newSize, sizeClassHints)
	if !ok || newSizeClass-newSize >= sizeClass-blockSizeWithMetadata {
```
nit: shouldn't `!ok` result in returning false? We should then continue adding to this block, and will eventually fall into one of the earlier cases that returns true. I realize `!ok` won't happen in a properly configured system.
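A hedged sketch of the suggested tweak, using the identifiers from the snippet above:

```go
newSizeClass, ok := blockSizeClass(newSize, sizeClassHints)
if !ok {
	// No size-class hint covers newSize: keep adding to this block; one of
	// the earlier cases will eventually return true.
	return false
}
```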
Force-pushed from d30fb06 to 6511784.
I can confirm that there are no significant regressions following these changes. There is a slight bump in fragmentation for the 40KiB size class, which can be attributed to the 60% size threshold. There is also a slight increase in bytes allocated for the new heuristics, and in total bytes for the old heuristics, as reported by jemalloc. However, note that I ran this experiment almost 2x longer than the previous one to make sure that the computed averages were stable.
| Version | Fragmentation (B/Alloc) | Space Wasted Per Allocation (%) | Jemalloc Allocated Bytes (GiB) | Jemalloc Total Bytes (GiB) |
|---|---|---|---|---|
| Current | 3760 | 11.26 | 16.66 | 19.43 |
| New Heuristics | 1778 | 6.22 | 15.97 | 18.80 |
| Improvement | | +5.04% | +4.16% | +3.22% |
Raw Results

[Per-bin tables for Current and Heuristics v3 were attached as collapsed details in the original view and are not recoverable here.]
Reviewable status: 3 of 14 files reviewed, 1 unresolved discussion (waiting on @jbowens and @sumeerbhola)
I am guessing all the numbers above are based on an instrumented version of the commit in this PR. Can you drop a link to that commit in a PR comment, along with the workload you ran, so that we can easily reproduce these results if we need to in the future?
Reviewed 1 of 10 files at r6, 3 of 4 files at r7, 1 of 1 files at r9, all commit messages.
Reviewable status: 8 of 14 files reviewed, 3 unresolved discussions (waiting on @CheranMahalingam and @jbowens)
sstable/writer_test.go
line 675 at r9 (raw file):
```go
td.ScanArgs(t, "key-size", &keySize)
td.ScanArgs(t, "val-size", &valSize)
td.ScanArgs(t, "block-size-with-metadata", &blockSize)
```
nit: this is not the block size with metadata, yes?
sstable/writer_test.go
line 703 at r9 (raw file):
```go
blockSize:               targetSize,
blockSizeThreshold:      sizeThreshold,
sizeClassAwareThreshold: sizeThreshold,
```
Can you add some test cases where `sizeClassAwareThreshold` is different from `blockSizeThreshold`?
Force-pushed from 6c053d7 to aa02ea5.
Currently, the sstable writer contains heuristics to flush sstable blocks once the size reaches a specified threshold. In CRDB this is defined as 32KiB. However, when these blocks are loaded into memory, additional metadata is allocated, sometimes pushing the total allocation beyond the 32KiB threshold. Since CRDB uses jemalloc, these allocations fall into a 40KiB size class, which leads to significant internal fragmentation. In addition, since the system is unaware of these size classes, we cannot design heuristics that prioritize reducing memory fragmentation. Reducing internal fragmentation can help reduce CRDB's memory footprint. This commit decrements the target block size to prevent internal fragmentation for small key-value pairs, and adds support for optionally specifying size classes, enabling a new set of heuristics that reduce internal fragmentation for workloads with larger key-value pairs. Fixes: cockroachdb#999.
Force-pushed from aa02ea5 to 96d6f61.
Pebble commit: CheranMahalingam@43a74ee
Cockroach commit: CheranMahalingam/cockroach@7ce83ae
Reproduction Steps
```sh
# Spin up a single-node cluster with sufficient memory.
$ roachprod create --gce-machine-type=n2-standard-16 $CLUSTER -n 1
$ roachprod put $CLUSTER stats.sh
$ roachprod start $CLUSTER

# Run the kv-50 workload.
$ roachprod run $CLUSTER:1 -- './cockroach workload init kv {pgurl:1}'
$ roachprod run $CLUSTER:1 -- './cockroach workload run kv --read-percent 50 --tolerate-errors --max-block-bytes 16384 --min-block-bytes 1 --concurrency 256 --batch 10 --insert-count 100000 {pgurl:1}'

# SSH into the node to check progress.
$ roachprod ssh $CLUSTER:1

# Wait until the block cache has saturated in the DB console and sufficient
# time has passed; I used 30000000 allocations as a stopping point.
$ cat logs/cockroach.stdout.log | grep "ALLOC" | wc -l
$ cat logs/cockroach.stdout.log > flush

# The script makes some assumptions (documented inside it) to get an
# accurate average for the jemalloc stats.
$ ./stats.sh
```
Reviewable status: 3 of 15 files reviewed, 3 unresolved discussions (waiting on @jbowens and @sumeerbhola)
sstable/writer_test.go
line 675 at r9 (raw file):
Previously, sumeerbhola wrote…

> nit: this is not the block size with metadata, yes?
This is a hacky way of dealing with the fact that with certain builds the metadata size is 0 (cgo disabled, or a set of flags defined in `value_cgo.go`), and we want `blockSizeWithMetadata` to be consistent. Then I decrement by `cache.ValueMetadataSize` (0 or 32) to compute the block size for the flush call.
I am realizing that this has been a huge pain for testing, and to avoid future friction I have set `cache.ValueMetadataSize` to 32 for all builds. If cgo were ever disabled in production, our heuristics would break anyway.
Note that there is still a large point of friction preventing extending the tests, because the value of `cache.ValueMetadataSize` is reduced for 32-bit builds.
sstable/writer_test.go
line 703 at r9 (raw file):
Previously, sumeerbhola wrote…

> Can you add some test cases where `sizeClassAwareThreshold` is different from `blockSizeThreshold`?
Done.
Reviewed 1 of 4 files at r7, 2 of 6 files at r10, 1 of 2 files at r11.
Reviewable status: 7 of 15 files reviewed, 3 unresolved discussions (waiting on @CheranMahalingam and @jbowens)
sstable/writer_test.go
line 675 at r9 (raw file):
Previously, CheranMahalingam wrote…

> I am realizing that this has been a huge pain for testing, and to avoid future friction I have set `cache.ValueMetadataSize` to 32 for all builds. If cgo were ever disabled in production, our heuristics would break anyway.
This is fine.
This patch makes use of the pebble size-class aware flush heuristics introduced in cockroachdb/pebble#3508. The experimentation there showed improvements in internal fragmentation. Fixes cockroachdb#123090. Release note: None
125793: storage: use jemalloc size classes when initializing pebble r=RaduBerinde a=aadityasondhi This patch makes use of the pebble size-class aware flush heuristics introduced in cockroachdb/pebble#3508. The experimentation there showed improvements in internal fragmentation. Fixes #123090. Release note: None Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com>