sstable: revisit writer block flushing heuristics #999
Comments
Also related to internal fragmentation is this TODO in newValue:
func newValue(n int) *Value {
	if n == 0 {
		return nil
	}
	// When we're not performing leak detection, the lifetime of the returned
	// Value is exactly the lifetime of the backing buffer and we can manually
	// allocate both.
	//
	// TODO(peter): It may be better to separate the allocation of the value and
	// the buffer in order to reduce internal fragmentation in malloc. If the
	// buffer is right at a power of 2, adding valueSize might push the
	// allocation over into the next larger size.
	b := manual.New(valueSize + n)
	v := (*Value)(unsafe.Pointer(&b[0]))
	v.buf = b[valueSize:]
	v.ref.init(1)
	return v
}
I did a bit of analysis on an imported TPCC-100 dataset. The following table shows the uncompressed data block sizes bucketed by the jemalloc class sizes.
This looks great. The
Wasted space is >10x higher in this scenario. It looks fairly straightforward to reclaim this wasted space.
Does the compressibility of the data actually matter here given that the block cache stores uncompressed blocks? My recollection of the TPC-C analysis I did above was that I dumped out the block sizes for all of the sstables using the pebble sstable tool (possibly with some custom tweaks, I can't recall). Looking at the Pebble
In CRDB, the block size is 32 KiB and the configured size threshold is left at the default (90%). So we won't consider a block for flushing if its estimated size is smaller than 29492 bytes. With 4096 byte values we're guaranteed to have blocks slightly larger than 32 KiB. I suspect we can do something better here. If we knew the jemalloc size classes, we could make a better decision of whether it reduces internal fragmentation more to flush the block, or to add another entry and then flush the block. Somewhat awkward to have Pebble know about the jemalloc size classes since using jemalloc isn't required. Perhaps a config option which CRDB can specify.
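A minimal sketch of such a decision, assuming the caller supplies the size classes as a sorted slice. The function names, the "wasted fraction" metric, and the four-class list are all hypothetical and are not Pebble's API; this only shows the shape of the comparison between flushing now and adding one more entry first.

```go
package main

import (
	"fmt"
	"sort"
)

// jemallocClasses holds only the four size classes discussed in this issue.
// A real configuration would list many more; this is an illustrative subset.
var jemallocClasses = []int{24 << 10, 28 << 10, 32 << 10, 40 << 10}

// wasteFrac returns the fraction of the chosen size class left unused when
// n bytes are allocated. Sizes beyond the largest known class report 0,
// because this sketch has no information about them.
func wasteFrac(n int, classes []int) float64 {
	i := sort.SearchInts(classes, n) // first class >= n
	if i == len(classes) {
		return 0
	}
	return float64(classes[i]-n) / float64(classes[i])
}

// flushBeforeAdd (hypothetical) reports whether flushing the current block
// wastes no more space than adding the next entry and flushing afterwards.
// It only considers flushing once the next entry would exceed the target.
func flushBeforeAdd(cur, entry, target int, classes []int) bool {
	if cur == 0 || cur+entry <= target {
		return false
	}
	return wasteFrac(cur, classes) <= wasteFrac(cur+entry, classes)
}

func main() {
	// Seven entries of roughly 4 KiB each, with an eighth about to be added.
	fmt.Println(flushBeforeAdd(28770, 4110, 32<<10, jemallocClasses))
}
```

With roughly 4 KiB entries, adding the eighth entry would push the block into the 40 KiB class, so the sketch prefers flushing first. A production heuristic would also need to bound how small a flushed block may be, which this sketch leaves out.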
No, it doesn't -- I was just playing around.
Previously, the sstable writer contained heuristics to flush sstable blocks when the size reached a certain threshold. In CRDB this is defined as 32 KiB. However, when these blocks are loaded into memory, additional metadata is allocated with the block, pushing the allocation beyond this threshold. Since CRDB uses jemalloc, these allocations land in the 40 KiB size class, which leads to internal fragmentation and higher memory usage. This commit lowers the block size threshold to reduce internal memory fragmentation. Informs: cockroachdb#999.
Currently, the sstable writer contains heuristics to flush sstable blocks once the size reaches a specified threshold. In CRDB this is defined as 32 KiB. However, when these blocks are loaded into memory, additional metadata is allocated, sometimes exceeding the 32 KiB threshold. Since CRDB uses jemalloc, these allocations land in the 40 KiB size class, which leads to significant internal fragmentation. In addition, since the system is unaware of these size classes, we cannot design heuristics that prioritize reducing memory fragmentation. Reducing internal fragmentation can help reduce CRDB's memory footprint. This commit lowers the target block size to prevent internal fragmentation for small key-value pairs and adds support for optionally specifying size classes to enable a new set of heuristics that will reduce internal fragmentation for workloads with larger key-value pairs. Fixes: cockroachdb#999.
sstable.shouldFlush contains the heuristics for deciding whether a block should be flushed during sstable construction. The intention is to flush a block before it reaches the configured blockSize. The block size controls how large the block will be in memory (on disk the block is compressed and may be significantly smaller). For CRDB, a block size of 32 KiB is used. What is interesting is how this block size interacts with jemalloc (the memory allocator linked into CRDB for C memory allocations). Jemalloc has many size classes and an allocation is always performed in the smallest size class that will hold it. Of interest to the block cache are the size classes 24 KiB, 28 KiB, 32 KiB, 40 KiB. If a block is just a tiny bit larger than 32 KiB it will be allocated from the 40 KiB size class, which will waste ~25% of the space. Much better for a block to be just a bit smaller than 32 KiB.
The shouldFlush heuristic attempts to flush a block just before it grows larger than the target block size. But there is a second heuristic that says "don't flush a block if it is smaller than 99% of the target block size". (Note this heuristic was inherited from RocksDB.) 99% of 32 KiB is 31.68 KiB, a difference of only 328 bytes. So if we have a key/value pair that is larger than 328 bytes we'll flush the block when it is just a little bit larger than 32 KiB. If we want to minimize internal fragmentation in the block cache, we should instead allow the block to be flushed earlier. If the block was flushed at just over 28 KiB, internal fragmentation would be 14%.
Making shouldFlush aware of the jemalloc size classes could significantly reduce this memory wastage. Does this matter in practice? Maybe. CRDB is frequently run with multi-gigabyte block cache sizes. The internal fragmentation is not accounted for in the block cache size, which makes memory usage higher than expected. Smarter block sizing heuristics could bring a tighter bound on memory usage, which we could use to reduce the CRDB memory footprint, or we could increase the block cache size in order to improve read performance. With a multi-gigabyte block cache we're talking about hundreds of megabytes of memory.
We'd want to make the allocator size class knowledge configurable so that we don't hard code something specific to jemalloc.