colexec: make external sorter respect memory limit better #60593
Conversation
The microbenchmarks don't show a noticeable difference, but in some cases performance can take a significant hit because we are respecting the memory limits much better (#60248 exacerbates the problem, but I think we'll address it soon). Anyway, in the scenario described in #59851 (comment), I think that we now report the correct memory usage and stay very close to the limit (we report 149MiB of RAM usage - with the current implementation, ideally we would have something like 2 x 64MiB because of #60022), and I don't know of any place where the memory accounting is missing, so I'm quite happy with the current state. One thing I'm concerned about is that we seem to be under-reporting the disk usage.
Would this fix have release notes added? It looks like this could result in improved RAS (reliability, availability, serviceability).
@robert-s-lee good point about the release notes, I added one to the first commit, which I think we should backport. I'm less certain that we will backport the second commit (the issue that the first commit fixes is that we didn't account for some RAM usage, whereas the second is about staying as close to the memory limit as possible).
To be honest, I find the details of the second commit a bit hard to follow since it's been a while since I touched this code. I'm relying on the correctness of the current tests, so make sure to add any test cases that should be added.
Reviewed 6 of 6 files at r1, 3 of 3 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/colexec/external_sort.go, line 266 at r2 (raw file):
	partitionedDiskQueueSemaphore = nil
}
// We give another half of the available RAM to the merge operation.
nit: make these consts grouped together so a reader can understand the relation between the two easily.
This commit improves how the external sorter manages its available RAM. There are two different main usages that overlap because we are keeping the references to both at all times:
1. during the spilling/sorting phase, we use a single in-memory sorter
2. during the merging phase, we use the ordered synchronizer that reads one batch from each of the partitions and also allocates an output batch.
Previously, we would give the whole memory limit to the in-memory sorter in 1., which resulted in the external sorter using at least 2x of its memory limit. This is now fixed by giving only a half to the in-memory sorter.
The handling of 2. was even worse - we didn't have any logic that would limit the number of active partitions based on the memory footprint. If the batches are large (say 1GB in size), during the merge phase we would be using on the order of 16GB of RAM (the number 16 being determined by the number of file descriptors). Additionally, we would give the whole memory limit to the output batch too.
This misbehavior is also now fixed by tracking the maximum size of a single batch in each active partition and computing the actual maximum number of partitions using those sizes.
Release note: None
I think we have good coverage in terms of correctness, plus these commits don't change anything fundamental other than adjusting the limits and reporting the memory usage. The only somewhat fundamental change is that we can now limit the max number of partitions based on the batch memory sizes, which will make us perform repeated merging sooner and more often than previously. Still, I think these changes are safe, but they could take a performance hit given that we're now using less RAM.
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto)
pkg/sql/colexec/external_sort.go, line 266 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
nit: make these consts grouped together so a reader can understand the relation between the two easily.
Done.
Build failed (retrying...)
Build succeeded
colexec: register memory used by dequeued batches from partitioned queue
Previously, we forgot to perform the memory accounting of the batches
that are dequeued from the partitions in the external sort (which could
be substantial when we're merging multiple partitions at once and the
tuples are wide) and in the hash based partitioner. This is now fixed.
Additionally, this commit retains references to some internal operators
in the external sort in order to reuse the memory under the dequeued
batches (this will be beneficial if we perform repeated merging).
Also, this commit fixes an issue with repeated re-initializing of the
disk-backed operators in the disk spiller if the latter has been reset
(the problem would lead to redundant allocations and not reusing of the
available memory).
A slight complication with the accounting was that we were using the same
allocator for all usages. This would be quite wrong because in the merge
phase we have two distinct memory usages with different lifecycles - the
memory under the dequeued batches is kept (and reused later) whereas the
memory under the output batch of the ordered synchronizer is released. We
now correctly handle these lifecycles by using separate allocators.
Release note (bug fix): CockroachDB previously didn't account for some
RAM used when disk-spilling operations (like sorts and hash joins) were
using the temporary storage in the vectorized execution engine. This
could result in OOM crashes, especially when the rows are large in size.
colexec: make external sorter respect memory limit better
This commit improves how the external sorter manages its available
RAM. There are two different main usages that overlap because we are
keeping the references to both at all times:
1. during the spilling/sorting phase, we use a single in-memory sorter
2. during the merging phase, we use the ordered synchronizer that reads
one batch from each of the partitions and also allocates an output
batch.
Previously, we would give the whole memory limit to the in-memory sorter
in 1. which resulted in the external sorter using at least 2x of its
memory limit. This is now fixed by giving only a half to the in-memory
sorter.
The handling of 2. was even worse - we didn't have any logic that would
limit the number of active partitions based on the memory footprint. If
the batches are large (say 1GB in size), during the merge phase we would
be using on the order of 16GB of RAM (the number 16 being determined
by the number of file descriptors). Additionally, we would give
the whole memory limit to the output batch too.
This misbehavior is also now fixed by tracking the maximum size of
a single batch in each active partition and computing the actual maximum
number of partitions to merge using those sizes.
Fixes: #60017.
Release note: None