db: do not cache compaction block reads #2699
Conversation
Force-pushed from d41de74 to 9017c6d.
Reviewed 12 of 15 files at r1, 2 of 4 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @bananabrick and @jbowens)
sstable/buffer_pool.go
line 77 at r3 (raw file):
    *p = BufferPool{
        cache: cache,
        pool:  pool,

[nit] should we do `pool[:0]` here so there's no possibility of misuse?
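For illustration, a minimal sketch of what this nit suggests, using simplified stand-in types (the field names are assumptions based on the excerpts in this review): truncating the reused slice with `pool[:0]` keeps the backing array while guaranteeing no stale entries from a previous use of the pool remain reachable.

```go
package sstable

// Stand-ins for the types quoted above; the fields are assumptions.
type allocedBuffer struct {
	v []byte // the underlying allocation; never nil for a pool entry
	b []byte // nil when the buffer is free
}

type BufferPool struct {
	pool []allocedBuffer
}

// Init prepares the pool for (re)use. Reslicing with pool[:0] reuses
// the backing array but makes any stale allocedBuffer entries from an
// earlier lifecycle unobservable, which is the misuse being guarded
// against.
func (p *BufferPool) Init(initialSize int) {
	pool := p.pool
	if cap(pool) < initialSize {
		pool = make([]allocedBuffer, 0, initialSize)
	}
	*p = BufferPool{pool: pool[:0]}
}
```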
Reviewed 7 of 15 files at r1, 2 of 4 files at r2, 2 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @bananabrick and @jbowens)
compaction.go
line 2756 at r2 (raw file):
    // Compactions use a pool of buffers to read blocks, avoiding polluting the
    // block cache with blocks that will not be read again.

Can you add a comment on why 5 is a good value for the initial size?
In the worst case with two-level compactions, do we have up to 2 data blocks, 2 value blocks, 2 rangedel blocks, and 2 rangekey blocks allocated?
sstable/block.go
line 405 at r2 (raw file):
    cached      []blockEntry
    cachedBuf   []byte
    cacheHandle cache.Handle

Is `cacheHandle` unused, since we have the `cache.Handle` inside `bufferHandle`?
sstable/buffer_pool.go
line 49 at r2 (raw file):
    // pool contains all the buffers held by the pool, including buffers that
    // are in-use.
    pool []allocedBuffer

IIUC, the invariant is that for every `pool[i]`, the `pool[i].v` value is non-nil. If so, can you add a comment.
sstable/buffer_pool.go
line 66 at r2 (raw file):
    func (p *BufferPool) Init(cache *cache.Cache, initialSize int) {
        *p = BufferPool{
            cache: cache,

I don't see why `cache.Alloc` and `cache.Free` are methods on the `Cache` struct. They don't seem to reference anything in the cache. Could they be functions? Then we don't need `cache` as a member.
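A hedged sketch of the refactor being suggested, with simplified stand-in types: a method that reads no receiver state can be lifted to a package-level function, after which callers like `BufferPool` no longer need to retain a `*Cache` member just to allocate and free buffers.

```go
package cache

// Value is a simplified stand-in for the cache's buffer type.
type Value struct{ buf []byte }

func (v *Value) Buf() []byte { return v.buf }

// Previously (the shape under discussion), Alloc and Free were methods:
//
//	func (c *Cache) Alloc(n int) *Value
//	func (c *Cache) Free(v *Value)
//
// Since neither touches Cache state, they can be package-level
// functions instead.
func Alloc(n int) *Value { return &Value{buf: make([]byte, n)} }

func Free(v *Value) {
	// In this sketch Go's GC reclaims the memory; the real allocator
	// manages these buffers manually.
	v.buf = nil
}
```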
sstable/buffer_pool.go
line 139 at r2 (raw file):
    // is no longer in use and a future call to BufferPool.Alloc may reuse this
    // buffer.
    b.p.pool[b.i].b = nil

This code is not setting `b.p` to nil, which makes me worry that if `Release` is accidentally called again, it may free a buffer that no longer belongs to `Buf`. I suppose we can't do better because `Release` is a value receiver. I am guessing making it a pointer receiver is not going to cause `Buf` to be heap allocated, in which case we could fix this.
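A sketch of the fix being discussed, with simplified types: with a pointer receiver, `Release` can zero the `Buf` itself, turning an accidental double release into a detectable error instead of a free of a buffer that may now belong to someone else. A pointer receiver on a stack-allocated struct does not by itself force a heap allocation.

```go
package sstable

type allocedBuffer struct {
	v []byte
	b []byte // nil when the buffer is free
}

type BufferPool struct{ pool []allocedBuffer }

// Buf identifies a single checked-out buffer within a BufferPool.
type Buf struct {
	p *BufferPool
	i int // index of the owning entry in p.pool
}

// Release returns the buffer to the pool. Zeroing *b means a second,
// accidental Release hits the nil check below rather than clearing an
// entry that may have been handed out to a new Buf in the meantime.
func (b *Buf) Release() {
	if b.p == nil {
		panic("pebble: buffer released twice")
	}
	b.p.pool[b.i].b = nil
	*b = Buf{}
}
```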
sstable/reader.go
line 3429 at r2 (raw file):
        decompressed = cacheValueOrBuf{buf: bufferPool.Alloc(decodedLen)}
    } else {
        decompressed = cacheValueOrBuf{

So we need two bufs for each block, which means my previous comment may have undercounted by 2x.
sstable/value_block.go
line 753 at r2 (raw file):
    ) (bufferHandle, error) {
        ctx = objiotracing.WithBlockType(ctx, objiotracing.ValueBlock)
        return bpwc.r.readBlock(ctx, h, nil, nil, stats, nil /* buffer pool */)

This case should be rare enough, but if we needed to use the buffer pool, we could, since the `blockProviderWhenClosed` is not allowed to outlive the iterator tree, which means it cannot outlive the buffer pool.
Alloc and Free do not actually use or record any state on the cache.Cache type. This commit lifts them up to the package level.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @bananabrick and @sumeerbhola)
compaction.go
line 2756 at r2 (raw file):
Previously, sumeerbhola wrote…
Can you add a comment on why 5 is a good value for the initial size?
In the worst case with two-level compactions, do we have up to 2 data blocks, 2 value blocks, 2 rangedel blocks, and 2 rangekey blocks allocated?
In the worst case for a two-level compaction, we have up to 2 index blocks, 2 data blocks, 2 value blocks, n rangedel blocks, and n rangekey blocks, plus 1 ephemeral block used to hold the intermediary compressed data block during a block load. I've bumped this up to 12 and written up a justification.
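That justification could read roughly like the following sketch (the constant name is an assumption; the counts come from the breakdown above):

```go
// initialBufferPoolSize is the initial number of buffers in a
// compaction's BufferPool. In the worst case, a two-level compaction
// holds at the same time:
//
//	2 index blocks  (one per level)
//	2 data blocks
//	2 value blocks
//	n rangedel blocks
//	n rangekey blocks
//	1 ephemeral block for the compressed copy during a block load
//
// Sizing the pool at 12 covers the fixed buffers plus a handful of
// range blocks without needing to grow.
const initialBufferPoolSize = 12
```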
sstable/block.go
line 405 at r2 (raw file):
Previously, sumeerbhola wrote…
Is `cacheHandle` unused, since we have the `cache.Handle` inside `bufferHandle`?
Good catch
sstable/buffer_pool.go
line 49 at r2 (raw file):
Previously, sumeerbhola wrote…
IIUC, the invariant is that for every `pool[i]`, the `pool[i].v` value is non-nil. If so, can you add a comment.
Done.
sstable/buffer_pool.go
line 66 at r2 (raw file):
Previously, sumeerbhola wrote…
I don't see why `cache.Alloc` and `cache.Free` are methods on the `Cache` struct. They don't seem to reference anything in the cache. Could they be functions? Then we don't need `cache` as a member.
Very good point. This also lets us remove `cache` as a member from `cacheValueOrBuf`.
sstable/buffer_pool.go
line 139 at r2 (raw file):
Previously, sumeerbhola wrote…
This code is not setting `b.p` to nil, which makes me worry that if `Release` is accidentally called again, it may free a buffer that no longer belongs to `Buf`. I suppose we can't do better because `Release` is a value receiver. I am guessing making it a pointer receiver is not going to cause `Buf` to be heap allocated, in which case we could fix this.
Done.
sstable/buffer_pool.go
line 77 at r3 (raw file):
Previously, RaduBerinde wrote…
[nit] should we do `pool[:0]` here so there's no possibility of misuse?
Done.
sstable/reader.go
line 3429 at r2 (raw file):
Previously, sumeerbhola wrote…
So we need two bufs for each block, which means my previous comment may have undercounted by 2x.
It should be a constant +1 buf, since only one of the buffers survives to the end.
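A sketch of why the overhead is a constant +1 buffer rather than 2x, using simplified stand-in types: the compressed and decompressed buffers coexist only while decompressing, and the compressed one is released before the next block is loaded.

```go
package sstable

// Minimal stand-ins for the pool types discussed in this review.
type BufferPool struct{ free [][]byte }

type Buf struct {
	p *BufferPool
	b []byte
}

func (p *BufferPool) Alloc(n int) Buf { return Buf{p: p, b: make([]byte, n)} }

func (b *Buf) Release() {
	b.p.free = append(b.p.free, b.b)
	*b = Buf{}
}

// readAndDecompress sketches the block-load path: two buffers exist
// only while decompressing; the compressed one is then released, so
// across a whole compaction the pool needs one extra buffer, not
// double the block count.
func readAndDecompress(p *BufferPool, compressedLen, decodedLen int) Buf {
	compressed := p.Alloc(compressedLen)
	// ... read the physical block into compressed.b ...
	decompressed := p.Alloc(decodedLen)
	// ... decompress compressed.b into decompressed.b ...
	compressed.Release() // transient; freed before the next block load
	return decompressed
}
```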
sstable/value_block.go
line 753 at r2 (raw file):
Previously, sumeerbhola wrote…
This case should be rare enough, but if we needed to use the buffer pool, we could, since the `blockProviderWhenClosed` is not allowed to outlive the iterator tree, which means it cannot outlive the buffer pool.
Left a TODO here.
they may be freed by a different thread, inducing excessive TLB shootdowns
I never thought about this.
Why is it necessary that a free happens on a different thread to observe a TLB shootdown? If more than one thread is working with some memory and any one of those threads calls free, we'd have the same problem, right? It doesn't necessarily have to be a different thread than the one that allocated the memory.
A compaction only requires a relatively small working set of buffers during its scan across input sstables.
Do we really need manual memory allocation for these buffers?
Reviewed 6 of 15 files at r1, 1 of 5 files at r3, 6 of 25 files at r4, 14 of 17 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens and @sumeerbhola)
sstable/buffer_pool.go
line 99 at r6 (raw file):
    for i := 0; i < len(p.pool); i++ {
        if p.pool[i].b == nil {
            if len(p.pool[i].v.Buf()) >= n {

nit: I think there's an invariant that `v.buf` is never resized, so `cap(v.Buf()) == len(v.Buf())` when a `Value` is used by the BufferPool. Could we mention that in `allocedBuffer`?
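A simplified sketch of the loop in question: because a pooled buffer's backing slice is allocated once and never resliced while the pool owns it, `len == cap`, which is what makes the `len` comparison below equivalent to comparing capacities.

```go
package sstable

type allocedBuffer struct {
	// v is allocated once and never resized while owned by the pool,
	// so len(v) == cap(v) (the invariant discussed in this thread).
	v []byte
	b []byte // non-nil while the buffer is checked out
}

type BufferPool struct{ pool []allocedBuffer }

// Alloc returns a buffer of length n, reusing a free pooled allocation
// when one is large enough. The len comparison relies on the invariant
// documented on allocedBuffer.v.
func (p *BufferPool) Alloc(n int) []byte {
	for i := 0; i < len(p.pool); i++ {
		if p.pool[i].b == nil && len(p.pool[i].v) >= n {
			p.pool[i].b = p.pool[i].v[:n]
			return p.pool[i].b
		}
	}
	// No free buffer was large enough: grow the pool.
	p.pool = append(p.pool, allocedBuffer{v: make([]byte, n)})
	entry := &p.pool[len(p.pool)-1]
	entry.b = entry.v
	return entry.b
}
```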
Why is it necessary that a free happens on a different thread to observe a TLB shootdown? If more than one thread is working with some memory and any one of those threads calls free, we'd have the same problem, right? It doesn't necessarily have to be a different thread than the one that allocated the memory.
I believe that's true, but:
- For these blocks read during compactions, the expectation is that only the one thread running the compaction goroutine is reading this memory, so only a free from another thread causes any TLB shootdown.
- jemalloc keeps a thread-local 'tcache' of cached objects. Beyond the TLB shootdown, memory being freed from another thread defeats the utility of the thread-local tcache, making it less likely that subsequent allocations can be served by the tcache.
Do we really need manual memory allocation for these buffers?
Not strictly, but we already need to use a manual accounting of their lifecycles, so keeping them off the Go heap and out of GC's view is just a freebie.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @bananabrick and @sumeerbhola)
sstable/buffer_pool.go
line 99 at r6 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
nit: I think there's an invariant that `v.buf` is never resized, so `cap(v.Buf()) == len(v.Buf())` when a `Value` is used by the BufferPool. Could we mention that in `allocedBuffer`?
while true, it's not an invariant we use in any way
the expectation is that only the one thread running the compaction goroutine is reading this memory, so only a free from another thread causes any TLB shootdown.
If the compaction thread is still reading the memory, then another thread wouldn't call free on it, because the refcount of the Value isn't 0, right?
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens and @sumeerbhola)
sstable/buffer_pool.go
line 99 at r6 (raw file):
Previously, jbowens (Jackson Owens) wrote…
while true, it's not an invariant we use in any way
I figured we're relying on it here: `len(p.pool[i].v.Buf()) >= n`. We use `len` instead of `cap`, and it's correct to use `len` instead of `cap` because `len == cap`.
Reviewed 3 of 25 files at r4, 9 of 17 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens)
If the compaction thread is still reading the memory, then another thread wouldn't call free on it, because the refcount of the Value isn't 0, right?
That's right, but even if the compaction thread is done with the block and has moved on to a subsequent block, the block may still be in its TLB. I believe even in this scenario, the processor suffers the cost of a 'TLB shootdown.'
TFTRs!
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @sumeerbhola)
db: do not cache compaction block reads
During compactions, avoid populating the block cache with input files' blocks.
These files will soon be removed from the LSM, making it less likely any
iterator will need to read these blocks. While Pebble uses a scan-resistant
block cache algorithm (ClockPRO), the act of inserting the blocks into the
cache increases contention on the block cache mutexes (#1997). This contention
has been observed to significantly contribute to tail latencies, both for reads
and for writes during memtable reservation. Additionally, although these blocks
may soon be replaced with more useful blocks due to ClockPRO's scan resistance,
they may be freed by a different thread, inducing excessive TLB shootdowns
(#2693).
A compaction only requires a relatively small working set of buffers during its
scan across input sstables. In this commit, we introduce a per-compaction
BufferPool that is used to allocate buffers during cache misses. Buffers are
reused throughout the compaction and only freed to the memory allocator when
they're too small or the compaction is finished. This reduces pressure on the
memory allocator and the block cache.
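As a usage-level sketch of the lifecycle described above (names assumed; the actual change threads the pool through the sstable readers during compaction):

```go
package pebble

// BufferPool stands in for the per-compaction pool introduced by this
// change; the methods here are assumptions for illustration.
type BufferPool struct{ /* ... */ }

func (p *BufferPool) Init(initialSize int) { /* allocate initial buffers */ }
func (p *BufferPool) Release()             { /* free all buffers to the allocator */ }

func runCompaction() {
	var bufferPool BufferPool
	bufferPool.Init(12)        // sized for the compaction's working set
	defer bufferPool.Release() // buffers return to the allocator once, at the end

	// Iterators over the input sstables allocate block buffers from
	// bufferPool on cache misses, instead of inserting blocks into the
	// block cache; buffers are reused block after block across the scan.
}
```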
sstable: avoid caching meta blocks
When opening an sstable for the first time, we read two 'meta' blocks: one
describes the layout of the sstable, encoding block handles for the index block,
properties block, etc. The other contains table properties. These blocks are
decoded, the necessary state is copied onto the heap, and then the blocks are
released. In typical configurations with sufficiently large table caches, these
blocks are never read again, or only much later after the table has been evicted
from the table cache.
This commit uses a BufferPool to hold these temporary blocks, and refrains from
adding these blocks to the block cache. This reduces contention on the block
cache mutexes (#1997).