
colexec: make external sorter respect memory limit better #60593

Merged: 2 commits on Feb 17, 2021

Conversation

@yuzefovich (Member) commented Feb 15, 2021

colexec: register memory used by dequeued batches from partitioned queue

Previously, we forgot to perform memory accounting of the batches
dequeued from the partitions in the external sort (which can be
substantial when we're merging multiple partitions at once and the
tuples are wide) and in the hash-based partitioner. This is now fixed.

Additionally, this commit retains references to some internal operators
in the external sort in order to reuse the memory under the dequeued
batches (this will be beneficial if we perform repeated merging).

Also, this commit fixes an issue with repeatedly re-initializing the
disk-backed operators in the disk spiller when the latter has been reset
(the problem led to redundant allocations and failure to reuse the
available memory).

A slight complication with the accounting came from the fact that we
were using the same allocator for all usages. That would be quite wrong
because in the merge phase we have two distinct memory usages with
different lifecycles: the memory under the dequeued batches is kept
(and reused later), whereas the memory under the output batch of the
ordered synchronizer is released. We now handle these lifecycles
correctly by using separate allocators.

Release note (bug fix): CockroachDB previously didn't account for some
RAM used when disk-spilling operations (like sorts and hash joins) were
using the temporary storage in the vectorized execution engine. This
could result in OOM crashes, especially when the rows are large.
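
To make the lifecycle distinction concrete, here is a minimal, self-contained Go sketch of the separate-allocator idea. The `Allocator` type and all names are hypothetical stand-ins for illustration, not the actual colmem allocator API.

```go
package main

import "fmt"

// Allocator is a hypothetical stand-in for a memory-accounting allocator:
// it only tracks a running byte count.
type Allocator struct {
	name string
	used int64
}

func (a *Allocator) Grow(bytes int64)   { a.used += bytes }
func (a *Allocator) Shrink(bytes int64) { a.used -= bytes }
func (a *Allocator) Used() int64        { return a.used }

func main() {
	// Separate allocators because the two usages have different lifecycles.
	dequeuedBatchesAlloc := &Allocator{name: "dequeued-batches"} // memory kept and reused across merges
	outputBatchAlloc := &Allocator{name: "output-batch"}         // memory released once the batch is emitted

	// Merge phase: account for one batch dequeued from each of 4 partitions.
	const batchSize = 1 << 20 // 1MiB per dequeued batch, for illustration
	for i := 0; i < 4; i++ {
		dequeuedBatchesAlloc.Grow(batchSize)
	}

	// The ordered synchronizer's output batch is accounted separately and
	// released when it is handed off downstream.
	outputBatchAlloc.Grow(batchSize)
	outputBatchAlloc.Shrink(batchSize)

	fmt.Println("retained:", dequeuedBatchesAlloc.Used(), "released:", outputBatchAlloc.Used())
}
```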

colexec: make external sorter respect memory limit better

This commit improves how the external sorter manages its available
RAM. There are two main usages that overlap because we keep
references to both at all times:

  1. during the spilling/sorting phase, we use a single in-memory sorter
  2. during the merging phase, we use the ordered synchronizer that reads
    one batch from each of the partitions and also allocates an output
    batch.

Previously, we would give the whole memory limit to the in-memory sorter
in 1., which resulted in the external sorter using at least 2x its
memory limit. This is now fixed by giving only half to the in-memory
sorter.

The handling of 2. was even worse: we didn't have any logic that would
limit the number of active partitions based on the memory footprint. If
the batches are large (say 1GB in size), during the merge phase we would
be using on the order of 16GB of RAM (the number 16 being determined by
the number of available file descriptors). Additionally, we would give
the whole memory limit to the output batch too.

This misbehavior is now also fixed by tracking the maximum size of
a single batch in each active partition and using those sizes to compute
the actual maximum number of partitions to merge at once.

Fixes: #60017.

Release note: None
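
As an illustration of the second fix, the following is a minimal Go sketch of how a merge budget (half of the memory limit) and the per-partition maximum batch sizes could bound the number of partitions merged at once. The function name, the sizes, and the exact budget split are hypothetical and only mirror the idea described above.

```go
package main

import "fmt"

// maxPartitionsToMerge is a hypothetical helper: given the merge phase's
// memory budget and the maximum batch size observed in each active partition,
// it returns how many partitions can be merged at once without exceeding the
// budget. At least two partitions are always merged, or no progress is made.
func maxPartitionsToMerge(mergeBudget int64, maxBatchSizes []int64) int {
	var used int64
	n := 0
	for _, size := range maxBatchSizes {
		if n >= 2 && used+size > mergeBudget {
			break
		}
		used += size
		n++
	}
	return n
}

func main() {
	const memoryLimit = 64 << 20 // 64MiB workmem, for illustration
	// Half of the limit goes to the in-memory sorter, the other half to merging.
	mergeBudget := int64(memoryLimit / 2)

	// Maximum batch size seen in each spilled partition.
	maxBatchSizes := []int64{8 << 20, 8 << 20, 12 << 20, 16 << 20, 4 << 20}
	fmt.Println("partitions merged at once:", maxPartitionsToMerge(mergeBudget, maxBatchSizes))
}
```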

@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich (Member, Author) commented Feb 15, 2021

The microbenchmarks don't show a noticeable difference, but in some cases performance can take a significant hit because we now respect the memory limits much better (#60248 exacerbates the problem, but I think we'll address it soon).

Anyway, in the scenario described in #59851 (comment), I think we now report the correct memory usage and stay very close to the limit (we report 149MiB of RAM usage; with the current implementation, ideally we would have something like 2 x 64MiB because of #60022), and I don't know of any place still missing memory accounting, so I'm quite happy with the current state.

One thing I'm concerned about is that we seem to be under-reporting the disk usage:
[screenshot taken Feb 15, 2021 showing the reported disk usage]
Here, I'd expect us to use about 1GB of disk because we're performing a general external sort. Oh, it's because of the compression: we get something like 90% compression, nice!

@yuzefovich yuzefovich requested review from asubiotto and a team February 15, 2021 22:14
@robert-s-lee (Contributor)

Would this fix have release notes added? It looks like this could result in improved RAS (reliability, availability, serviceability).

@yuzefovich (Member, Author)
@robert-s-lee good point about the release notes; I added one to the first commit, which I think we should backport.

I'm less certain that we will backport the second commit (the issue that the first commit fixes is that we didn't account for some RAM usage, whereas the second is about staying as close to the workmem limit as possible while still correctly reporting the memory usage).

@asubiotto (Contributor) left a comment


:lgtm:, but to be honest I find the details of the second commit a bit hard to follow since it's been a while since I touched this code. I'm relying on the correctness of the current tests; make sure to add anything that should be added to them.

Reviewed 6 of 6 files at r1, 3 of 3 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/colexec/external_sort.go, line 266 at r2 (raw file):

		partitionedDiskQueueSemaphore = nil
	}
	// We give another half of the available RAM to the merge operation.

nit: group these consts together so a reader can easily understand the relation between the two.
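
A minimal sketch, assuming hypothetical constant names, of what the suggested grouping could look like (the actual constants in pkg/sql/colexec/external_sort.go may be named and computed differently):

```go
package externalsortsketch

// The two memory-fraction constants are declared together so a reader sees at
// a glance that they split the sorter's memory limit between the two phases.
const (
	// inMemSorterMemFraction is the portion of the memory limit given to the
	// in-memory sorter during the spilling/sorting phase.
	inMemSorterMemFraction = 0.5
	// mergeMemFraction is the portion given to the merge operation (the
	// ordered synchronizer plus the dequeued batches).
	mergeMemFraction = 1 - inMemSorterMemFraction
)
```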

@yuzefovich (Member, Author) left a comment


I think we have good coverage in terms of correctness, plus these commits don't change anything fundamental other than adjusting the limits and reporting the memory usage. The only somewhat fundamental change is that we can now limit the maximum number of partitions based on the batch memory sizes, which will make us perform repeated merging sooner and more often than previously. Still, I think these changes are safe, but they could take a performance hit given that we're now using less RAM.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto)


pkg/sql/colexec/external_sort.go, line 266 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

nit: make these consts grouped together so a reader can understand the relation between the two easily.

Done.

craig bot (Contributor) commented Feb 17, 2021

Build failed (retrying...):

craig bot (Contributor) commented Feb 17, 2021

Build succeeded:

Successfully merging this pull request may close these issues: colexec: external sort doesn't respect the memory limit well