sstable: revisit writer block flushing heuristics #999

petermattis · 2020-11-17T13:08:01Z

sstable.shouldFlush contains the heuristics for deciding whether a block should be flushed or not during sstable construction. The intention is to flush a block before it reaches the configured blockSize. The block size controls how large the block will be in memory (on disk the block is compressed and may be significantly smaller). For CRDB, a block size of 32 KiB is used. What is interesting is how this block size interacts with jemalloc (the memory allocator linked into CRDB for C memory allocations). Jemalloc has many size classes and an allocation is always performed in the smallest size class that will hold it. Of interest to the block cache are the size classes 24 KiB, 28 KiB, 32 KiB, 40 KiB. If a block is just a tiny bit larger than 32 KiB it will be allocated from the 40 KiB size class which will waste ~25% of the space. Much better for a block to be just a bit smaller than 32 KiB.
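To make the size-class arithmetic concrete, here is a minimal sketch in Go. It assumes only the four size classes named above (a real jemalloc build has many more), and `classFor` is a hypothetical helper, not part of Pebble or jemalloc.

```go
package main

import "fmt"

// sizeClasses lists only the jemalloc size classes named in the discussion
// above, in bytes. This is an illustrative subset, not the full jemalloc table.
var sizeClasses = []int{24 << 10, 28 << 10, 32 << 10, 40 << 10}

// classFor returns the smallest listed size class that can hold n bytes, or
// -1 if n is larger than every listed class. (Hypothetical helper.)
func classFor(n int) int {
	for _, c := range sizeClasses {
		if n <= c {
			return c
		}
	}
	return -1
}

func main() {
	// A block just under, exactly at, and just over 32 KiB.
	for _, n := range []int{32<<10 - 100, 32 << 10, 32<<10 + 100} {
		c := classFor(n)
		fmt.Printf("block=%d class=%d wasted=%d (%.1f%% of block size)\n",
			n, c, c-n, 100*float64(c-n)/float64(n))
	}
}
```

A block one byte over 32 KiB lands in the 40 KiB class, so nearly 8 KiB of the allocation goes unused.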

The shouldFlush heuristic attempts to flush a block just before it grows larger than the target block size. But there is a second heuristic that says "don't flush a block if it is smaller than 99% of the target block size". (Note this heuristic was inherited from RocksDB). 99% of 32 KiB is 31.68 KiB, a difference of only 328 bytes. So if we have a key/value pair that is larger than 328 bytes we'll flush the block when it is just a little bit larger than 32 KiB. If we want to minimize internal fragmentation in the block cache, we should instead allow the block to be flushed earlier. If the block was flushed at just over 28 KiB internal fragmentation would be 14%.
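The interaction between the two heuristics can be sketched with a simplified model. This is not Pebble's actual `shouldFlush`: it ignores per-entry encoding overhead (restart points, varint lengths), and the function names, parameters, and fixed entry size are assumptions made for illustration.

```go
package main

import "fmt"

// shouldFlush is a simplified model of the two heuristics described above.
// blockSize is the current block size, entrySize the size of the entry about
// to be added, and thresholdPct the "don't flush below this" percentage.
func shouldFlush(blockSize, entrySize, targetSize, thresholdPct int) bool {
	if blockSize == 0 {
		return false // never flush an empty block
	}
	if blockSize >= targetSize {
		return true // already at or over the target
	}
	if blockSize <= targetSize*thresholdPct/100 {
		return false // too small to consider flushing
	}
	// Flush if adding the next entry would exceed the target.
	return blockSize+entrySize > targetSize
}

// fillBlock adds fixed-size entries (entrySize must be positive) until
// shouldFlush fires, and returns the block size at flush time.
func fillBlock(entrySize, targetSize, thresholdPct int) int {
	size := 0
	for !shouldFlush(size, entrySize, targetSize, thresholdPct) {
		size += entrySize
	}
	return size
}

func main() {
	const target = 32 << 10
	for _, pct := range []int{99, 90} {
		fmt.Printf("threshold=%d%% flushed at %d bytes\n", pct, fillBlock(400, target, pct))
	}
}
```

In this model, 400-byte entries with a 99% threshold produce a block of 32800 bytes (just over the target), while a 90% threshold lets the block flush at 32400 bytes.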

Making shouldFlush aware of the jemalloc size classes could significantly reduce this memory wastage. Does this matter in practice? Maybe. CRDB is frequently run with multi-gigabyte block cache sizes. The internal fragmentation is not accounted for in the block cache size, which makes memory usage higher than expected. Smarter block sizing heuristics could bring a tighter bound on memory usage, which we could use to reduce the CRDB memory footprint, or we could increase the block cache size in order to improve read performance. With a multi-gigabyte block cache we're talking about hundreds of megabytes of memory.

We'd want to make the allocator size class knowledge configurable so that we don't hard code something specific to jemalloc.

petermattis · 2020-11-25T13:40:40Z

Also related to internal fragmentation is this TODO in internal/cache/value_normal.go:

func newValue(n int) *Value {
	if n == 0 {
		return nil
	}
	// When we're not performing leak detection, the lifetime of the returned
	// Value is exactly the lifetime of the backing buffer and we can manually
	// allocate both.
	//
	// TODO(peter): It may be better to separate the allocation of the value and
	// the buffer in order to reduce internal fragmentation in malloc. If the
	// buffer is right at a power of 2, adding valueSize might push the
	// allocation over into the next larger size.
	b := manual.New(valueSize + n)
	v := (*Value)(unsafe.Pointer(&b[0]))
	v.buf = b[valueSize:]
	v.ref.init(1)
	return v
}
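The concern in the TODO can be illustrated with a small sketch comparing one combined allocation against two separate ones. The 32-byte header size and the short size-class list are assumptions for illustration; the real `valueSize` and the full jemalloc class table may differ.

```go
package main

import "fmt"

// sizeClasses is an illustrative subset of allocator size classes in bytes.
var sizeClasses = []int{32, 28 << 10, 32 << 10, 40 << 10}

// hypotheticalValueSize stands in for Pebble's valueSize constant; the real
// value is not given in this thread.
const hypotheticalValueSize = 32

// classFor returns the smallest listed class that holds n bytes, or -1.
func classFor(n int) int {
	for _, c := range sizeClasses {
		if n <= c {
			return c
		}
	}
	return -1
}

// combinedFootprint models one allocation holding the header and the buffer.
func combinedFootprint(n int) int {
	return classFor(hypotheticalValueSize + n)
}

// separateFootprint models two allocations: one for the header, one for the buffer.
func separateFootprint(n int) int {
	return classFor(hypotheticalValueSize) + classFor(n)
}

func main() {
	n := 32 << 10 // buffer sitting exactly at a power of 2
	fmt.Printf("combined: %d bytes, separate: %d bytes\n",
		combinedFootprint(n), separateFootprint(n))
}
```

Under these assumptions, a buffer of exactly 32 KiB occupies a 40 KiB class when the header is allocated with it, but only 32 KiB plus a small header allocation when the two are separated.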

petermattis · 2020-12-11T16:31:37Z

I did a bit of analysis on an imported TPCC-100 dataset. The following table shows the uncompressed data block sizes bucketed by the jemalloc class sizes. count is the number of blocks that fall in that class size. wasted is the average bytes wasted per block due to the actual block size being smaller than the class size. total space = class size * count. wasted space = wasted * count.

class size	count	wasted	total space	wasted space
28 KB	49	2121.0	1.3 MB	0.1 MB
32 KB	294530	145.3	9204.1 MB	40.8 MB
40 KB	0	0	0	0

This looks great. The wasted space is quite small. But if we include the size of the Value struct which is allocated contiguously with the memory for the block, a different picture emerges:

class size	count	wasted	total space	wasted space
28 KB	49	2089.0	1.3 MB	0.1 MB
32 KB	221486	154.5	6921.4 MB	32.6 MB
40 KB	73044	8180.6	2853.3 MB	569.9 MB

Wasted space is >10x higher in this scenario. It looks fairly straightforward to reclaim this wasted space.
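A bucketing pass like the one behind these tables could look roughly like the following sketch. It is not the tool used for the analysis above; the `bucket` type, the `bucketize` function, the sample block sizes, and the 32-byte overhead are all made up for illustration.

```go
package main

import "fmt"

// bucket accumulates statistics for one size class.
type bucket struct {
	classSize   int // size class in bytes
	count       int // number of blocks that fall in this class
	wastedBytes int // total bytes wasted across those blocks
}

// bucketize assigns each block (plus a fixed per-block overhead) to the
// smallest class that holds it. Blocks larger than every class are skipped.
func bucketize(blockSizes []int, overhead int, classes []int) []bucket {
	buckets := make([]bucket, len(classes))
	for i, c := range classes {
		buckets[i].classSize = c
	}
	for _, s := range blockSizes {
		n := s + overhead
		for i, c := range classes {
			if n <= c {
				buckets[i].count++
				buckets[i].wastedBytes += c - n
				break
			}
		}
	}
	return buckets
}

func main() {
	classes := []int{28 << 10, 32 << 10, 40 << 10}
	// Made-up block sizes for illustration; not the TPCC-100 data.
	sizes := []int{28000, 32700, 32750, 32768}
	for _, overhead := range []int{0, 32} {
		fmt.Printf("overhead=%d\n", overhead)
		for _, b := range bucketize(sizes, overhead, classes) {
			avg := 0.0
			if b.count > 0 {
				avg = float64(b.wastedBytes) / float64(b.count)
			}
			fmt.Printf("  class=%d count=%d avg wasted=%.1f total=%d wasted=%d\n",
				b.classSize, b.count, avg, b.classSize*b.count, b.wastedBytes)
		}
	}
}
```

Running the pass once with zero overhead and once with a per-block header shows the same shift as in the tables: blocks that sat just under 32 KiB move into the 40 KiB class once the header is included.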

github-actions · 2022-06-06T11:01:17Z

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it
in 10 days to keep the issue queue tidy. Thank you for your
contribution to Pebble!

sumeerbhola · 2023-11-27T21:24:17Z

Internal fragmentation (the topic of this issue) is not tracked in jemalloc stats, but we can see it when varying the size of the value in a kv50 workload. There are three runs below: (1) uses 4096-byte values that are not compressible, (2) uses 4096-byte values with --target-compression-ratio=3, (3) uses 1024-byte values with --target-compression-ratio=3. Note that rocksdb.block.cache.usage stabilizes to the same value in all three runs, but the allocated bytes in runs 1 and 2 are much higher. Looking at the detailed jemalloc stats, most of the allocated bytes are in size class 40960 in runs 1 and 2, while in run 3 most of the allocated bytes are in size class 32768. The difference between allocbytes and totalbytes (external fragmentation) is similar in all three runs.

petermattis · 2023-11-28T14:12:54Z

Does the compressibility of the data actually matter here given that the block cache stores uncompressed blocks?

My recollection of the TPC-C analysis I did above was that I dumped out the block sizes for all of the sstables using the pebble sstable tool (possibly with some custom tweaks, I can't recall). Looking at the Pebble shouldFlush code, I have a suspicion that this code may be problematic:

	// The block is currently smaller than the target size.
	if estimatedBlockSize <= sizeThreshold {
		// The block is smaller than the threshold size at which we'll consider
		// flushing it.
		return false
	}

In CRDB, the block size is 32 KiB and the configured size threshold is left at the default (90%). So we won't consider a block for flushing if its estimated size is smaller than 29492 bytes. With 4096-byte values we're guaranteed to have blocks slightly larger than 32 KiB. I suspect we can do something better here. If we knew the jemalloc size classes, we could make a better decision about which reduces internal fragmentation more: flushing the block now, or adding another entry and then flushing it. It is somewhat awkward to have Pebble know about the jemalloc size classes since using jemalloc isn't required. Perhaps this could be a config option which CRDB can specify.
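One way to picture the size-class-aware decision suggested here is the sketch below. It is only an illustration of the idea, not the heuristic that was eventually merged: `flushNow`, `fragmentation`, and the four-entry class list are assumptions, and a real implementation would also need to account for per-block metadata.

```go
package main

import "fmt"

// sizeClasses is an illustrative subset of jemalloc size classes in bytes.
var sizeClasses = []int{24 << 10, 28 << 10, 32 << 10, 40 << 10}

// classFor returns the smallest listed class that holds n bytes.
func classFor(n int) (int, bool) {
	for _, c := range sizeClasses {
		if n <= c {
			return c, true
		}
	}
	return 0, false
}

// fragmentation returns the fraction of the size class left unused by an
// allocation of n bytes. ok is false if n exceeds every listed class.
func fragmentation(n int) (frac float64, ok bool) {
	c, ok := classFor(n)
	if !ok {
		return 0, false
	}
	return float64(c-n) / float64(c), true
}

// flushNow reports whether flushing at curSize wastes no more of its size
// class than flushing after adding one more entry of entrySize bytes.
// Sizes beyond the largest listed class are treated as "flush now".
func flushNow(curSize, entrySize int) bool {
	nowFrag, ok := fragmentation(curSize)
	if !ok {
		return true
	}
	nextFrag, ok := fragmentation(curSize + entrySize)
	if !ok {
		return true
	}
	return nowFrag <= nextFrag
}

func main() {
	// 29000 bytes is below the 90% threshold mentioned above, yet flushing
	// now avoids spilling into the 40 KiB class.
	fmt.Println(flushNow(29000, 4096))
	// At 20000 bytes, one more entry fills the 24 KiB class more tightly.
	fmt.Println(flushNow(20000, 4096))
}
```

The class list is passed in as data rather than hard coded to jemalloc, which matches the suggestion that the embedding application supply it through a config option.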

sumeerbhola · 2023-11-28T17:59:03Z

> Does the compressibility of the data actually matter here given that the block cache stores uncompressed blocks?

No, it doesn't -- I was just playing around.

Previously, the sstable writer contained heuristics to flush sstable blocks when the size reached a certain threshold. In CRDB this is defined as 32KiB. However, when these blocks are loaded into memory additional metadata is allocated with the block causing the allocation to go beyond this threshold. Since CRDB uses jemalloc, these allocations use a 40KiB size class which leads to internal fragmentation and higher memory usage. This commit decrements the block size threshold to reduce internal memory fragmentation. Informs: cockroachdb#999.

Currently, the sstable writer contains heuristics to flush sstable blocks once the size reaches a specified threshold. In CRDB this is defined as 32KiB. However, when these blocks are loaded into memory additional metadata is allocated sometimes exceeding the 32KiB threshold. Since CRDB uses jemalloc, these allocations use a 40KiB size class which leads to significant internal fragmentation. In addition, since the system is unaware of these size classes we cannot design heuristics that prioritize reducing memory fragmentation. Reducing internal fragmentation can help reduce CRDB's memory footprint. This commit decrements the target block size to prevent internal fragmentation for small key-value pairs and adds support for optionally specifying size classes to enable a new set of heuristics that will reduce internal fragmentation for workloads with larger key-value pairs. Fixes: cockroachdb#999.

github-actions bot added the no-issue-activity label Jun 6, 2022

jbowens removed the no-issue-activity label Jun 6, 2022

jbowens mentioned this issue Jun 15, 2023

deps: pebble crashes when using zstd #1706

Closed

CheranMahalingam mentioned this issue Apr 19, 2024

sstable: reduce block cache fragmentation #3508

Merged

CheranMahalingam closed this as completed in #3508 Apr 26, 2024

jbowens added this to Storage Jun 4, 2024

jbowens moved this to Done in Storage Jun 4, 2024
