
colexec: make external sorter respect memory limit better #60593

Merged: 2 commits on Feb 17, 2021

Conversation

@yuzefovich (Member) commented Feb 15, 2021

colexec: register memory used by dequeued batches from partitioned queue

Previously, we forgot to perform memory accounting of the batches
dequeued from the partitions in the external sort (which can be
substantial when we're merging multiple partitions at once and the
tuples are wide) and in the hash-based partitioner. This is now fixed.

Additionally, this commit retains references to some internal operators
in the external sort in order to reuse the memory under the dequeued
batches (this will be beneficial if we perform repeated merging).

Also, this commit fixes an issue with repeatedly re-initializing the
disk-backed operators in the disk spiller when the latter has been reset
(the problem led to redundant allocations and failure to reuse the
available memory).

A slight complication with the accounting came from the fact that we
were using the same allocator for all usages. That would be quite wrong
because in the merge phase we have two distinct memory usages with
different lifecycles: the memory under the dequeued batches is kept
(and reused later), whereas the memory under the output batch of the
ordered synchronizer is released. We now handle these lifecycles
correctly by using separate allocators.

Release note (bug fix): CockroachDB previously didn't account for some
RAM used when disk-spilling operations (like sorts and hash joins) were
using the temporary storage in the vectorized execution engine. This
could result in OOM crashes, especially when the rows are large.
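
To make the lifecycle distinction concrete, here is a minimal, self-contained Go sketch of the separate-allocator idea. The `Allocator` type and all names are hypothetical stand-ins for illustration, not the actual colmem allocator API.

```go
package main

import "fmt"

// Allocator is a hypothetical stand-in for a memory-accounting allocator:
// it only tracks a running byte count.
type Allocator struct {
	name string
	used int64
}

func (a *Allocator) Grow(bytes int64)   { a.used += bytes }
func (a *Allocator) Shrink(bytes int64) { a.used -= bytes }
func (a *Allocator) Used() int64        { return a.used }

func main() {
	// Separate allocators because the two usages have different lifecycles.
	dequeuedBatchesAlloc := &Allocator{name: "dequeued-batches"} // memory kept and reused across merges
	outputBatchAlloc := &Allocator{name: "output-batch"}         // memory released once the batch is emitted

	// Merge phase: account for one batch dequeued from each of 4 partitions.
	const batchSize = 1 << 20 // 1MiB per dequeued batch, for illustration
	for i := 0; i < 4; i++ {
		dequeuedBatchesAlloc.Grow(batchSize)
	}

	// The ordered synchronizer's output batch is accounted separately and
	// released when it is handed off downstream.
	outputBatchAlloc.Grow(batchSize)
	outputBatchAlloc.Shrink(batchSize)

	fmt.Println("retained:", dequeuedBatchesAlloc.Used(), "released:", outputBatchAlloc.Used())
}
```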

colexec: make external sorter respect memory limit better

This commit improves how the external sorter manages its available
RAM. There are two main usages that overlap because we keep
references to both at all times:

  1. during the spilling/sorting phase, we use a single in-memory sorter
  2. during the merging phase, we use the ordered synchronizer that reads
    one batch from each of the partitions and also allocates an output
    batch.

Previously, we would give the whole memory limit to the in-memory sorter
in 1., which resulted in the external sorter using at least 2x its
memory limit. This is now fixed by giving only half to the in-memory
sorter.

The handling of 2. was even worse: we didn't have any logic that would
limit the number of active partitions based on the memory footprint. If
the batches are large (say 1GB in size), during the merge phase we would
be using on the order of 16GB of RAM (the number 16 being determined by
the number of available file descriptors). Additionally, we would give
the whole memory limit to the output batch too.

This misbehavior is now also fixed by tracking the maximum size of
a single batch in each active partition and using those sizes to compute
the actual maximum number of partitions to merge at once.

Fixes: #60017.

Release note: None
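
As an illustration of the second fix, the following is a minimal Go sketch of how a merge budget (half of the memory limit) and the per-partition maximum batch sizes could bound the number of partitions merged at once. The function name, the sizes, and the exact budget split are hypothetical and only mirror the idea described above.

```go
package main

import "fmt"

// maxPartitionsToMerge is a hypothetical helper: given the merge phase's
// memory budget and the maximum batch size observed in each active partition,
// it returns how many partitions can be merged at once without exceeding the
// budget. At least two partitions are always merged, or no progress is made.
func maxPartitionsToMerge(mergeBudget int64, maxBatchSizes []int64) int {
	var used int64
	n := 0
	for _, size := range maxBatchSizes {
		if n >= 2 && used+size > mergeBudget {
			break
		}
		used += size
		n++
	}
	return n
}

func main() {
	const memoryLimit = 64 << 20 // 64MiB workmem, for illustration
	// Half of the limit goes to the in-memory sorter, the other half to merging.
	mergeBudget := int64(memoryLimit / 2)

	// Maximum batch size seen in each spilled partition.
	maxBatchSizes := []int64{8 << 20, 8 << 20, 12 << 20, 16 << 20, 4 << 20}
	fmt.Println("partitions merged at once:", maxPartitionsToMerge(mergeBudget, maxBatchSizes))
}
```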

@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich (Member, Author) commented Feb 15, 2021

The microbenchmarks don't show a noticeable difference, but in some cases performance can take a significant hit because we now respect the memory limits much better (#60248 exacerbates the problem, but I think we'll address it soon).

Anyway, in the scenario described in #59851 (comment), I think we now report the correct memory usage and stay very close to the limit (we report 149MiB of RAM usage; with the current implementation, ideally we would have something like 2 x 64MiB because of #60022), and I don't know of any place still missing memory accounting, so I'm quite happy with the current state.

One thing I'm concerned about is that we seem to be under-reporting the disk usage:
[screenshot taken Feb 15, 2021 showing the reported disk usage]
Here, I'd expect us to use about 1GB of disk because we're performing a general external sort. Oh, it's because of the compression: we get something like 90% compression, nice!

@yuzefovich yuzefovich requested review from asubiotto and a team February 15, 2021 22:14
@robert-s-lee (Contributor)

Would this fix have release notes added? It looks like this could result in improved RAS (reliability, availability, serviceability).

@yuzefovich (Member, Author)
@robert-s-lee good point about the release notes; I added one to the first commit, which I think we should backport.

I'm less certain that we will backport the second commit (the issue that the first commit fixes is that we didn't account for some RAM usage, whereas the second is about staying as close to the workmem limit as possible while still correctly reporting the memory usage).

@asubiotto (Contributor) left a comment


:lgtm:, but to be honest I find the details of the second commit a bit hard to follow since it's been a while since I touched this code. I'm relying on the correctness of the current tests; make sure to add anything that should be added to them.

Reviewed 6 of 6 files at r1, 3 of 3 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/colexec/external_sort.go, line 266 at r2 (raw file):

		partitionedDiskQueueSemaphore = nil
	}
	// We give another half of the available RAM to the merge operation.

nit: group these consts together so a reader can easily understand the relation between the two.
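
A minimal sketch, assuming hypothetical constant names, of what the suggested grouping could look like (the actual constants in pkg/sql/colexec/external_sort.go may be named and computed differently):

```go
package externalsortsketch

// The two memory-fraction constants are declared together so a reader sees at
// a glance that they split the sorter's memory limit between the two phases.
const (
	// inMemSorterMemFraction is the portion of the memory limit given to the
	// in-memory sorter during the spilling/sorting phase.
	inMemSorterMemFraction = 0.5
	// mergeMemFraction is the portion given to the merge operation (the
	// ordered synchronizer plus the dequeued batches).
	mergeMemFraction = 1 - inMemSorterMemFraction
)
```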

@yuzefovich (Member, Author) left a comment


I think we have good coverage in terms of correctness, plus these commits don't change anything fundamental other than adjusting the limits and reporting the memory usage. The only somewhat fundamental change is that we can now limit the maximum number of partitions based on the batch memory sizes, which will make us perform repeated merging sooner and more often than previously. Still, I think these changes are safe, but they could take a performance hit given that we're now using less RAM.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto)


pkg/sql/colexec/external_sort.go, line 266 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

nit: make these consts grouped together so a reader can understand the relation between the two easily.

Done.

craig bot (Contributor) commented Feb 17, 2021

Build failed (retrying...):

craig bot (Contributor) commented Feb 17, 2021

Build succeeded:

Successfully merging this pull request may close these issues: colexec: external sort doesn't respect the memory limit well