
sstable: reduce block cache fragmentation #3508

Merged

Conversation

@CheranMahalingam (Contributor) commented Apr 12, 2024

Currently, the sstable writer contains heuristics to flush sstable blocks once their size reaches a specified threshold, which in CRDB is 32KiB. However, when these blocks are loaded into memory, additional metadata is allocated, sometimes pushing the total allocation past the 32KiB threshold. Since CRDB uses jemalloc, these allocations then fall into a 40KiB size class, which leads to significant internal fragmentation. In addition, since the system is unaware of these size classes, we cannot design heuristics that prioritize reducing memory fragmentation. Reducing internal fragmentation can help reduce CRDB's memory footprint. This commit reduces the target block size to prevent internal fragmentation for small key-value pairs and adds support for optionally specifying size classes, enabling a new set of heuristics that reduce internal fragmentation for workloads with larger key-value pairs.

Fixes: #999.
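For illustration, a minimal sketch of how a caller might supply allocator size classes through the writer options. The option name (AllocatorSizeClasses) and the class values shown are assumptions for this example, not necessarily the exact API that landed:

```go
package main

import "github.com/cockroachdb/pebble"

// exampleOptions sketches configuring jemalloc-style size classes so the
// sstable writer's flush heuristics can account for allocator rounding.
func exampleOptions() *pebble.Options {
	opts := &pebble.Options{}
	// Hypothetical option: sorted byte sizes of the allocator's size
	// classes around the default 32KiB block target.
	opts.AllocatorSizeClasses = []int{
		16 << 10, 20 << 10, 24 << 10, 28 << 10,
		32 << 10, 40 << 10, 48 << 10, 56 << 10, 64 << 10,
	}
	return opts
}
```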

@CheranMahalingam CheranMahalingam self-assigned this Apr 12, 2024
@cockroach-teamcity (Member)

This change is Reviewable

@CheranMahalingam (Contributor Author) commented Apr 12, 2024

The internal fragmentation reduction is mostly observable for small key-value pairs, where data is packed efficiently near the 32KiB threshold. In these cases, the 32B of metadata for a block cache entry can cause jemalloc to allocate from the 40KiB size class instead of the 32KiB class. We expect that these changes will not reduce fragmentation for large key-value pairs; however, they should not introduce any memory regressions either.

The following tests involved running kv-50 workloads on a single cockroach node with 4 vCPUs and 16GB of RAM, with varying key-value ("KV block") sizes. To illustrate the fragmentation improvements and shorten the test duration, the block cache size was reduced to 400MB.

TLDR

| KV Block Size (B) | Old Average Fragmentation (B/Alloc) | New Average Fragmentation (B/Alloc) | Fragmentation Reduction (%) | Old Jemalloc Total Bytes (MiB) | New Jemalloc Total Bytes (MiB) | Jemalloc Memory Usage Reduction (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 1-16 | 3630 | 1221 | +63.4 | 650 | 510 | +21.5 |
| 1-1024 | 778 | 361 | +53.6 | 570 | 556 | +2.5 |
| 1024-16384 | 3793 | 3765 | +0.7 | 524 | 532 | -1.5 |

Note

Fragmentation was computed as the difference between the size of each allocation made by Pebble and the smallest jemalloc size class that fits it (see jemalloc's documented size classes).
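As a concrete sketch of that computation (the helper and the subset of size classes below are illustrative, not code from this PR):

```go
// fragmentation returns the bytes wasted by an allocation of n bytes:
// the gap between n and the smallest size class that fits it.
func fragmentation(n int, classes []int) int {
	for _, c := range classes {
		if n <= c {
			return c - n
		}
	}
	return 0 // larger than the listed classes; irrelevant for block-sized allocs
}

// Example: a 32768B block plus 32B of cache-entry metadata lands in the
// 40960B class, wasting 40960 - 32800 = 8160 bytes.
var wasted = fragmentation(32768+32, []int{20480, 24576, 28672, 32768, 40960, 49152})
```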

Test 1 (KV Block Size: 1-16B)

Previous System

Allocations: 14393295
Average Allocation Size (B): 29612

| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
| --- | --- | --- |
| 48 | 535 | 4320 |
| 40 | 4239577 | 8178 |
| 32 | 3433477 | 15 |
| 28 | 6640636 | 2603 |
| 24 | 12007 | 1672 |
| 20 | 10329 | 2033 |

Note

For small key-value pairs, a large number of allocations fall into the 40KiB size class, and we are wasting ~8MiB of space on these allocations.

Block Cache Usage: ~252MiB
Jemalloc Allocated Bytes: ~420MiB
Jemalloc Resident Bytes: ~650MiB

New Heuristics

Allocations: 30157396
Average Allocation Size (B): 29561

| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
| --- | --- | --- |
| 48 | 928 | 4432 |
| 40 | 3844 | 6006 |
| 32 | 15779481 | 30 |
| 28 | 14238856 | 2539 |
| 24 | 19798 | 1536 |
| 20 | 16900 | 1960 |
Block Cache Usage: ~252MiB
Jemalloc Allocated Bytes: ~386MiB
Jemalloc Resident Bytes: ~510MiB

Test 2 (KV Block Size: 1-1024B)

Previous System

Allocations: 20823143
Average Allocation Size (B): 32217

| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
| --- | --- | --- |
| 48 | 1738 | 3698 |
| 40 | 1136895 | 8169 |
| 32 | 19454511 | 343 |
| 28 | 16336 | 2414 |
| 24 | 21417 | 1645 |
| 20 | 16546 | 2065 |
Block Cache Usage: ~252MiB
Jemalloc Allocated Bytes: ~393MiB
Jemalloc Resident Bytes: ~570MiB

New Heuristics

Allocations: 20823143
Average Allocation Size (B): 32217

| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
| --- | --- | --- |
| 48 | 1496 | 3832 |
| 40 | 3095 | 6685 |
| 32 | 14105612 | 355 |
| 28 | 9812 | 2575 |
| 24 | 14556 | 1682 |
| 20 | 11255 | 2179 |
Block Cache Usage: ~252MiB
Jemalloc Allocated Bytes: ~389MiB
Jemalloc Resident Bytes: ~556MiB

Test 3 (KV Block Size: 1KiB-16KiB)

Previous System

Allocations: 5883656
Average Allocation Size (B): 33775

| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
| --- | --- | --- |
| 48 | 621285 | 6390 |
| 40 | 3085923 | 4635 |
| 32 | 1868052 | 1559 |
| 28 | 6660 | 2049 |
| 24 | 10379 | 2182 |
| 20 | 5098 | 2182 |
Block Cache Usage: ~252MiB
Jemalloc Allocated Bytes: ~413MiB
Jemalloc Resident Bytes: ~524MiB

New Heuristics

Allocations: 7904419
Average Allocation Size (B): 33871

| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
| --- | --- | --- |
| 48 | 816899 | 6407 |
| 40 | 4132956 | 4626 |
| 32 | 2578651 | 1566 |
| 28 | 7995 | 2034 |
| 24 | 12272 | 2146 |
| 20 | 6689 | 2178 |

Note

Even with the new heuristics, many allocations still fall into the 40KiB size class due to the minimum size threshold (29492B, i.e. 90% of the target) that a block must reach before being flushed to disk. However, the size of the KV pairs also helps reduce internal fragmentation in the larger size classes.

Block Cache Usage: ~252MiB
Jemalloc Allocated Bytes: ~414MiB
Jemalloc Resident Bytes: ~532MiB

@CheranMahalingam force-pushed the jemalloc-fragmentation branch 2 times, most recently from afe5e62 to c72080f on April 15, 2024 at 14:23
@petermattis (Collaborator) left a comment

My recollection is that the block flushing heuristics allow the target block size to be exceeded if the block isn't too full. Specifically, the sizeThreshold for a block is computed as 99% of the target block size. If the block is smaller than that, a new key-value is always allowed to be added. 99% of 32768 is 32440, so a new key-value pair larger than 328 bytes will almost always bump us past the target block size. This is independent of the value metadata size. If we know the target block size and the jemalloc size classes, I think we could enhance shouldFlush to make a more informed decision, i.e. choose whether or not to flush so as to minimize internal fragmentation.

Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @CheranMahalingam)


internal/cache/value_invariants.go line 20 at r1 (raw file):

```go
// NewValueMetadataSize returns the number of bytes of metadata allocated for
// a cache entry.
func NewValueMetadataSize() int {
```

Nit: I think this can be a const ValueMetadataSize = 0 with a little effort of sprinkling the other definitions in files that have appropriate build constraints.

@CheranMahalingam force-pushed the jemalloc-fragmentation branch 3 times, most recently from c176db1 to 825ca11 on April 15, 2024 at 18:06
@CheranMahalingam (Contributor Author) left a comment

During my experiments I noticed that at runtime the sizeThreshold is actually 90% of the target block size (29492B); we seem to be using the default threshold (https://github.com/cockroachdb/pebble/blob/master/internal/base/options.go#L11). In this case there is about 3KiB of leeway; however, for larger key-value pairs, such as 4KiB ones, this threshold gets in the way and contributes to internal fragmentation. Key-value pairs of 10-11KiB aren't problematic since they push the block over the 32KiB target but land just under the 40KiB size class.

The simplest solution I can think of, when the block size is below the threshold, is to check whether adding the next key-value pair would decrease fragmentation (i.e. 32768 - blockSize > 40960 - futureBlockSize). However, I am not sure what the behaviour should be for large key-value pairs that push the block size beyond 40KiB; there is an argument that blocks above 32KiB could have less internal fragmentation if we waited until the size was closer to 40KiB. I am wondering whether we still want to pursue a size-class-aware heuristic.

Reviewable status: 0 of 4 files reviewed, all discussions resolved (waiting on @CheranMahalingam)

@petermattis (Collaborator) left a comment

Ah, you're correct that the default target block size threshold is 90%; I looked at the code too quickly and misread it. I also misread your testing: I thought "KV Block" referred to the target block size (which was confusing given the numbers you provided), but it actually refers to the size of a key+value.

A size-class-aware heuristic shouldn't be difficult. Do you think it wouldn't be worthwhile? The jemalloc size classes can either be hardcoded or injected via an Option (the Option approach would probably be better). When deciding whether to flush, we check whether the new KV would bump the block up to the next size class from the one it currently fits in. If it would, we look to see which is smaller: nextSizeClass(currentBlockSize) - currentBlockSize or nextSizeClass(futureBlockSize) - futureBlockSize.

When the target block size is exceeded we can also keep adding KVs until the next size class would be exceeded (we don't have to flush the block immediately once it exceeds the target).
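A minimal sketch of that comparison (names are hypothetical; nextSizeClass rounds a size up to the smallest class that fits it):

```go
// shouldFlushForSizeClass reports whether flushing now wastes no more
// space in the current block's size class than growing the block by one
// KV and flushing later would waste in the future block's class.
func shouldFlushForSizeClass(currentBlockSize, futureBlockSize int, nextSizeClass func(int) int) bool {
	currentWaste := nextSizeClass(currentBlockSize) - currentBlockSize
	futureWaste := nextSizeClass(futureBlockSize) - futureBlockSize
	return currentWaste <= futureWaste
}
```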

PS The new internal/cache code :lgtm:. Did you mean to revert the changes to options.go?

Reviewable status: 0 of 4 files reviewed, all discussions resolved (waiting on @CheranMahalingam)

@CheranMahalingam (Contributor Author) left a comment

I can try a size-class-aware heuristic and see how it performs with key-value pair sizes that introduce a lot of fragmentation. For clarification, should the shouldFlush logic look something like the sketch below? I am not sure how we want to prioritize reducing internal fragmentation vs. keeping the sstable block size close to the 32KiB target, or how performant this code path has to be.

```go
if len(sizeClassHints) > 0 { // size-class options are set
	if blockSize > targetBlockSize {
		nextSizeClass := sizeClassFunc(blockSize)
		// Wait if adding the next KV would definitely reduce fragmentation,
		// bounding the future block size to the 40KiB class to prevent the
		// rare scenario where blockSize continually increases without
		// flushing because fragmentation keeps decreasing.
		if nextSizeClass-futureBlockSize < nextSizeClass-blockSize &&
			futureBlockSize <= nextSizeClass &&
			nextSizeClass == sizeClassFunc(targetBlockSize+1) {
			return false
		}
		return true
	}
	// Below the target: remove the sizeThreshold heuristic.
	if futureBlockSize > targetBlockSize {
		if sizeClassFunc(futureBlockSize)-futureBlockSize < sizeClassFunc(blockSize)-blockSize {
			// Reduces fragmentation, with a small risk of flushing a very
			// large sstable block.
			return false
		}
		return true
	}
	return false
}
// existing logic…
```

Sorry, I reverted my changes while experimenting with having the Cockroach layer adjust the block size, due to the significant number of test failures the change causes in Pebble; however, I found some workarounds in the latest revision.

Reviewable status: 0 of 10 files reviewed, all discussions resolved (waiting on @CheranMahalingam)

@petermattis (Collaborator) left a comment

Yes, something like those heuristics is what I'm imagining. We do call shouldFlush on every KV pair, but I think you can have fast-path checks to make this very fast.

Reviewable status: 0 of 10 files reviewed, all discussions resolved (waiting on @CheranMahalingam)

@CheranMahalingam (Contributor Author) left a comment

The size-class-aware heuristics show a significant improvement in fragmentation for general workloads where the key-value pair size has a large range. However, these improvements do not directly correspond to an equal reduction in memory usage. The tests were run on a machine with 16 vCPUs and 64GB of RAM, with the block cache size set to 16GB (I also ran the following on the previous setup with similar results). We ran a kv-50 workload with key-value pair sizes ranging from 1B to 16KiB and enough concurrency/batching to saturate the cache.

TLDR

| Version | Fragmentation (B/Alloc) | Space Wasted Per Allocation (%) | Jemalloc Allocated Bytes (GiB) | Jemalloc Total Bytes (GiB) |
| --- | --- | --- | --- | --- |
| Current | 3755 | 11.26 | 16.621 | 19.096 |
| New Heuristics | 1730 | 6.21 | 15.785 | 18.726 |
| Improvement | | +5.05% | +5.03% | +1.94% |

Results

Current

Allocations: 16187686
Average Allocation Size (B): 33358

Jemalloc Stats
Average Allocated Bytes (GiB): 16.621
Average Resident Bytes (GiB): 19.096
| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
| --- | --- | --- |
| 64 | 695 | 5024 |
| 56 | 4181 | 4816 |
| 48 | 1590425 | 6409 |
| 40 | 8343194 | 4658 |
| 32 | 5284819 | 1554 |
| 28 | 18039 | 2153 |
| 24 | 22626 | 2257 |
| 20 | 17984 | 2031 |
| 16 | 11573 | 920 |

New Heuristics

Allocations: 16207665
Average Allocation Size (B): 27867

Jemalloc Stats
Average Allocated Bytes (GiB): 15.785
Average Resident Bytes (GiB): 18.726
| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
| --- | --- | --- |
| 64 | 986 | 5173 |
| 56 | 6166 | 4151 |
| 48 | 101524 | 5263 |
| 40 | 2285861 | 1492 |
| 32 | 5855568 | 1818 |
| 28 | 3766744 | 1689 |
| 24 | 2492485 | 1776 |
| 20 | 740086 | 1346 |
| 16 | 13756 | 953 |

Reviewable status: 0 of 13 files reviewed, all discussions resolved (waiting on @CheranMahalingam)

@petermattis (Collaborator) left a comment

I didn't scrutinize the implementation, but the numbers here look compelling. I'm going to leave it to the Storage folks to do a detailed code review.

I'm surprised that Jemalloc Total Bytes didn't fall by the same amount as Jemalloc Allocated Bytes.

Reviewable status: 0 of 13 files reviewed, all discussions resolved (waiting on @CheranMahalingam)

@CheranMahalingam CheranMahalingam marked this pull request as ready for review April 19, 2024 15:43
@sumeerbhola (Collaborator) left a comment

Reviewed 2 of 4 files at r2, 1 of 8 files at r3, 2 of 5 files at r4, all commit messages.
Reviewable status: 5 of 13 files reviewed, 1 unresolved discussion (waiting on @CheranMahalingam and @jbowens)


sstable/writer.go line 1656 at r5 (raw file):

```go
	newSize += 4                            // varint for shared prefix length
	newSize += uvarintLen(uint32(keyLen))   // varint for unshared key bytes
	newSize += uvarintLen(uint32(valueLen)) // varint for value size
```

Worth adding a code comment that newSize is also an upper-bound, even though it is tighter than newEstimatedSize since it does not account for key prefix compression.
For very long keys with a long shared prefix, and tiny values, such as secondary indexes, this over-estimation could be a problem in that it could result in tiny blocks. I am not sure what to do about this given we don't want to incur the expense of computing the shared prefix (that we do in blockWriter.storeWithOptionalValuePrefix). We may want to use the sizeThreshold after all, and reduce it to something smaller, like 60% (and return false if estimatedBlockSize <= sizeThreshold).

I was expecting to see cache.ValueMetadataSize accounted for here in adjusting estimatedBlockSize, instead of in options. It will naturally bump up the effective BlockSize if someone sets a very tiny value, since we will add at least one entry. And after that entry is added the block will be bigger than BlockSize and all subsequent attempts to add will fall through to the size class based comparisons. Was there a reason to do it in options.go? The logic there seems not quite right in that if someone sets a block size of 2 bytes, they should be getting a tiny block, and not a block size of DefaultBlockSize.
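A sketch of that alternative, assuming the flush decision can see both the estimated block size and the cache's per-entry metadata size (the function and parameter names are illustrative):

```go
// exceedsTarget folds the block cache's per-entry metadata into the size
// compared against the target, instead of adjusting BlockSize in options.
// A tiny user-configured BlockSize then still yields tiny blocks.
func exceedsTarget(estimatedBlockSize, valueMetadataSize, targetBlockSize int) bool {
	return estimatedBlockSize+valueMetadataSize >= targetBlockSize
}
```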

@CheranMahalingam (Contributor Author) left a comment

Reviewable status: 4 of 14 files reviewed, 2 unresolved discussions (waiting on @jbowens and @sumeerbhola)


sstable/writer.go line 1656 at r5 (raw file):

Previously, sumeerbhola wrote… (quoted in full above)

Nice, I overlooked how the shared prefix is computed later. Do you have a preference for what that size threshold should be? I have left it at 60% (~19KiB), but we can increase it if we want to prioritize block sizes closer to 32KiB, or if we want to shift the bin-size distribution up (i.e. larger blocks -> fewer allocations in a saturated cache -> less overhead from allocator metadata).

You're right, pushing the metadata accounting down to the flush heuristics also avoids breaking existing tests.


sstable/writer.go line 1649 at r6 (raw file):

```go
	// us at risk of overestimating the size and flushing small blocks. We
	// mitigate this by imposing a minimum size restriction.
	sizeClassAwareThreshold := (targetBlockSize*base.SizeClassAwareBlockSizeThreshold + 99) / 100
```

This change is awkward because we already store size thresholds in the Writer and pass that value down to shouldFlushWithHints. We could make those thresholds dynamic by checking whether the size class hints are empty; however, if we are then forced to fall back to the old heuristics, the original 90% threshold is lost. I am wondering whether there is a cleaner way to express this.

@sumeerbhola (Collaborator) left a comment

Can you rerun the last experiment with these changes to validate that the results are still good?

Reviewed 2 of 10 files at r6.
Reviewable status: 6 of 14 files reviewed, 4 unresolved discussions (waiting on @CheranMahalingam and @jbowens)


sstable/writer.go line 1638 at r6 (raw file):

```go
	// loaded into the block cache we also allocate space for the cache entry
	// metadata. The new allocation of size ~1052B may now only fit within a
	// 2048B class, which increases fragmentation.
```

nit: increases internal fragmentation.


sstable/writer.go line 1649 at r6 (raw file):

Previously, CheranMahalingam wrote… (quoted above)

Let's just compute this once in the Writer and wrap the values in a flushDecisionOptions struct with three fields (blockSize, blockSizeThreshold, sizeClassAwareThreshold), and pass that struct down.
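For reference, a sketch of the suggested struct (field names taken from this comment; types assumed):

```go
// flushDecisionOptions carries the thresholds computed once in the
// Writer down to shouldFlushWithHints.
type flushDecisionOptions struct {
	blockSize int
	// blockSizeThreshold is the original 90% threshold, used when no
	// size class hints are configured.
	blockSizeThreshold int
	// sizeClassAwareThreshold is the lower minimum (e.g. 60%) applied
	// when size class hints are available.
	sizeClassAwareThreshold int
}
```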


sstable/writer.go line 1674 at r6 (raw file):

```go
	if blockSizeWithMetadata < targetBlockSize {
		newSizeClass, ok := blockSizeClass(newSize, sizeClassHints)
		if !ok || newSizeClass-newSize >= sizeClass-blockSizeWithMetadata {
```

Nit: shouldn't !ok result in returning false? Then we would continue adding to this block and eventually fall into one of the earlier cases that returns true. I realize !ok won't happen in a properly configured system.

@CheranMahalingam force-pushed the jemalloc-fragmentation branch 3 times, most recently from d30fb06 to 6511784 on April 25, 2024 at 13:21
@CheranMahalingam (Contributor Author) left a comment

I can confirm that there are no significant regressions following these changes. There is a slight bump in fragmentation for the 40KiB size class, which can be attributed to the 60% size threshold. There is also a slight increase in bytes allocated under the new heuristics, and in total bytes under the old heuristics, as reported by jemalloc. Note, however, that I ran this experiment almost 2x longer than the previous one to make sure the computed averages were stable.

| Version | Fragmentation (B/Alloc) | Space Wasted Per Allocation (%) | Jemalloc Allocated Bytes (GiB) | Jemalloc Total Bytes (GiB) |
| --- | --- | --- | --- | --- |
| Current | 3760 | 11.26 | 16.66 | 19.43 |
| New Heuristics | 1778 | 6.22 | 15.97 | 18.80 |
| Improvement | | +5.04% | +4.16% | +3.22% |

Raw Results

Current Heuristics v3
Allocations: 30252804
Average Allocation Size (B): 33387

Jemalloc Stats
Average Allocated Bytes (GiB): 16.6588
Average Resident Bytes (GiB): 19.4298
| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
| --- | --- | --- |
| 64 | 2136 | 3848 |
| 56 | 8812 | 4723 |
| 48 | 3025767 | 6402 |
| 40 | 15593149 | 4656 |
| 32 | 9793786 | 1546 |
| 28 | 44441 | 2137 |
| 24 | 47351 | 2206 |
| 20 | 34000 | 2047 |
| 16 | 21698 | 920 |
New Heuristics

Allocations: 30246591
Average Allocation Size (B): 28582

Jemalloc Stats
Average Allocated Bytes (GiB): 15.9657
Average Resident Bytes (GiB): 18.8035
| Bin Size (KiB) | Count | Fragmentation (B/Alloc) |
| --- | --- | --- |
| 64 | 1411 | 4191 |
| 56 | 8549 | 3880 |
| 48 | 101385 | 1879 |
| 40 | 5315069 | 2002 |
| 32 | 11033603 | 1794 |
| 28 | 7004610 | 1683 |
| 24 | 4602729 | 1774 |
| 20 | 536362 | 522 |
| 16 | 21589 | 986 |

Reviewable status: 3 of 14 files reviewed, 1 unresolved discussion (waiting on @jbowens and @sumeerbhola)

@sumeerbhola (Collaborator) left a comment

:lgtm:

I am guessing all the numbers above are based on an instrumented version of the commit in this PR. Can you drop a link to that commit in a PR comment, along with the workload you ran, so that we can easily reproduce these results if we need to in the future?

Reviewed 1 of 10 files at r6, 3 of 4 files at r7, 1 of 1 files at r9, all commit messages.
Reviewable status: 8 of 14 files reviewed, 3 unresolved discussions (waiting on @CheranMahalingam and @jbowens)


sstable/writer_test.go line 675 at r9 (raw file):

			td.ScanArgs(t, "key-size", &keySize)
			td.ScanArgs(t, "val-size", &valSize)
			td.ScanArgs(t, "block-size-with-metadata", &blockSize)

nit: this is not the block size with metadata, yes?


sstable/writer_test.go line 703 at r9 (raw file):

```go
					blockSize:               targetSize,
					blockSizeThreshold:      sizeThreshold,
					sizeClassAwareThreshold: sizeThreshold,
```

Can you add some test cases where sizeClassAwareThreshold differs from blockSizeThreshold?

@CheranMahalingam force-pushed the jemalloc-fragmentation branch 2 times, most recently from 6c053d7 to aa02ea5 on April 25, 2024 at 17:44
@CheranMahalingam (Contributor Author) left a comment

Pebble commit: CheranMahalingam@43a74ee
Cockroach commit: CheranMahalingam/cockroach@7ce83ae

Reproduction Steps

```sh
# Spin up a single-node cluster with sufficient memory
$ roachprod create --gce-machine-type=n2-standard-16 $CLUSTER -n 1

$ roachprod put $CLUSTER stats.sh
$ roachprod start $CLUSTER

# Run the kv-50 workload
$ roachprod run $CLUSTER:1 -- './cockroach workload init kv {pgurl:1}'
$ roachprod run $CLUSTER:1 -- './cockroach workload run kv --read-percent 50 --tolerate-errors --max-block-bytes 16384 --min-block-bytes 1 --concurrency 256 --batch 10 --insert-count 100000 {pgurl:1}'

# SSH into the node to check progress
$ roachprod ssh $CLUSTER:1
# Wait until the block cache has saturated in dbconsole and sufficient time has
# passed; I used 30000000 allocations as a stopping point.
$ cat logs/cockroach.stdout.log | grep "ALLOC" | wc -l
$ cat logs/cockroach.stdout.log > flush
# Note that the script makes some assumptions (documented inside it) to get an
# accurate average for the jemalloc stats.
$ ./stats.sh
```

Reviewable status: 3 of 15 files reviewed, 3 unresolved discussions (waiting on @jbowens and @sumeerbhola)


sstable/writer_test.go line 675 at r9 (raw file):

Previously, sumeerbhola wrote…

nit: this is not the block size with metadata, yes?

This is a hacky way of dealing with the fact that in certain builds the metadata size is 0 (cgo disabled, or a set of flags defined in value_cgo.go), and we want blockSizeWithMetadata to be consistent. I then decrement by cache.ValueMetadataSize (0 or 32) to compute the block size for the flush call.

I am realizing that this has been a huge pain for testing, so to avoid future friction I have set cache.ValueMetadataSize to 32 for all builds. If cgo were ever disabled in production, our heuristics would break anyway.

Note that there is still a large point of friction preventing the tests from being extended, because the value of cache.ValueMetadataSize is reduced for 32-bit builds.


sstable/writer_test.go line 703 at r9 (raw file):

Previously, sumeerbhola wrote…

can you add some test cases where this sizeClassAwareThreshold is different from the blockSizeThreshold.

Done.

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewed 1 of 4 files at r7, 2 of 6 files at r10, 1 of 2 files at r11.
Reviewable status: 7 of 15 files reviewed, 3 unresolved discussions (waiting on @CheranMahalingam and @jbowens)


sstable/writer_test.go line 675 at r9 (raw file):

I am realizing that this has been a huge pain for testing, so to avoid future friction I have set cache.ValueMetadataSize to 32 for all builds. If cgo were ever disabled in production, our heuristics would break anyway.

This is fine.

@CheranMahalingam CheranMahalingam merged commit 1b234bd into cockroachdb:master Apr 26, 2024
11 checks passed
@CheranMahalingam CheranMahalingam deleted the jemalloc-fragmentation branch April 26, 2024 13:33
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this pull request Jun 13, 2024
This patch makes use of the pebble size class aware flush heuristics
introduced in cockroachdb/pebble#3508. The
experimentation there showed improvements in internal fragmentation.

Fixes cockroachdb#123090.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this pull request Jun 17, 2024
This patch makes use of the pebble size class aware flush heuristics
introduced in cockroachdb/pebble#3508. The
experimentation there showed improvements in internal fragmentation.

Fixes cockroachdb#123090.

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Jun 17, 2024
125793: storage: use jemalloc size classes when initializing pebble r=RaduBerinde a=aadityasondhi

This patch makes use of the pebble size class aware flush heuristics introduced in cockroachdb/pebble#3508. The experimentation there showed improvements in internal fragmentation.

Fixes #123090.

Release note: None

Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com>
Successfully merging this pull request may close these issues: sstable: revisit writer block flushing heuristics.