
colmem: limit batches of dynamic size by workmem in memory footprint #59851

Merged · 1 commit · Feb 14, 2021

Conversation

@yuzefovich (Member) commented Feb 5, 2021

This commit adds a minor improvement to ResetMaybeReallocate method so
that it reuses the old batch (without doubling the capacity and
allocating a new batch) when the old batch has enough capacity and
reached the provided limit in size. This change effectively allows us to
limit all batches flowing through the vectorized engine by workmem
setting in size since the main batch-producers (cfetcher and
columnarizer) use this method.
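
Roughly, the new rule looks like this (a simplified sketch with stand-in types; the real `ResetMaybeReallocate` in `pkg/sql/colmem/allocator.go` has a different signature and does actual memory accounting):

```go
package main

import "fmt"

// batch stands in for coldata.Batch; only its capacity and an approximate
// memory footprint matter for this sketch.
type batch struct {
	capacity int
	memSize  int64
}

// maxBatchSize stands in for coldata.BatchSize().
const maxBatchSize = 1024

// resetMaybeReallocate keeps the old batch when it is already at the maximum
// capacity, or when it has reached the provided memory limit while still
// satisfying the minimum capacity; otherwise it allocates a new batch with
// double the capacity.
func resetMaybeReallocate(old *batch, minCapacity int, maxBatchMemSize int64) (*batch, bool) {
	if old == nil {
		return &batch{capacity: minCapacity}, true
	}
	useOldBatch := old.capacity == maxBatchSize ||
		(old.memSize >= maxBatchMemSize && old.capacity >= minCapacity)
	if useOldBatch {
		return old, false
	}
	newCapacity := 2 * old.capacity
	if newCapacity > maxBatchSize {
		newCapacity = maxBatchSize
	}
	return &batch{capacity: newCapacity}, true
}

func main() {
	// An 80MB batch of 8 rows against a 64MiB limit is reused, not doubled.
	b, reallocated := resetMaybeReallocate(&batch{capacity: 8, memSize: 80 << 20}, 1, 64<<20)
	fmt.Println(b.capacity, reallocated) // 8 false
}
```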

Addresses: https://github.com/cockroachlabs/support/issues/808.

Release note (sql change): Most batches of data flowing through the
vectorized execution engine will now be limited in size by
sql.distsql.temp_storage.workmem (64MiB by default) which should
improve the stability of CockroachDB clusters.

@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich (Member, Author)

I think this is quite an easy solution to the problem, one that we could even backport. Interested to hear your thoughts on whether a cluster setting is necessary for this memory limit.

I also confirmed in a toy scenario that it helps: a table with 100 rows, each 10MB in size, where the query performs a full table scan followed by a top 10 sort.

Before: [screenshot]

After: [screenshot]

Note that the number of batches emitted by the cfetcher increased from 7 to 15. I think in the "after" case we have the following batch lengths: 1, 2, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 5. When we first reach the length of 8, the batch becomes 80MB in memory size, so we won't allocate any larger batches.
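
For concreteness, here is the growth arithmetic with these numbers (a toy loop using the example's 10MB rows and the 64MiB default, not engine code):

```go
package main

import "fmt"

func main() {
	const rowSize = 10 << 20 // ~10MB per row, as in the example above
	const workmem = 64 << 20 // the 64MiB default limit
	for capacity := 1; ; capacity *= 2 {
		fmt.Printf("capacity %d -> %dMB\n", capacity, (capacity*rowSize)>>20)
		if int64(capacity)*rowSize >= workmem {
			// 8 * 10MB = 80MB >= 64MiB: the capacity stops doubling here,
			// so all subsequent batches stay at length 8.
			break
		}
	}
}
```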

@asubiotto (Contributor) left a comment


Nice change. I think this is high-impact enough that we should probably backport it, especially given that this would improve the situation for some of our customers. Is there also a performance improvement that I'm seeing here?

Is there some testing we can do, or do existing tests give us enough confidence in the correctness?

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)


pkg/sql/colmem/allocator.go, line 174 at r1 (raw file):

```go
	} else {
		// If old batch is already of the largest capacity, we will reuse it.
		useOldBatch := oldBatch.Capacity() == coldata.BatchSize()
```

Maybe condense this line and the block below into:

```go
useOldBatch := oldBatch.Capacity() == coldata.BatchSize() || (GetBatchMemSize(oldBatch) >= maxBatchMemSize && oldBatch.Capacity() >= minCapacity)
```

@yuzefovich force-pushed the batch-mem-size branch 6 times, most recently from dfd748a to e9b7147 on February 9, 2021 00:40
@yuzefovich changed the title from colmem: limit batches of dynamic size by 64MiB in memory footprint to colmem: limit batches of dynamic size by workmem in memory footprint on Feb 9, 2021
@yuzefovich force-pushed the batch-mem-size branch 3 times, most recently from 6a5b96d to b1933e4 on February 9, 2021 01:43
@yuzefovich (Member, Author) left a comment


Yeah, I agree that we should backport it. There does appear to be a performance improvement in this diagram, likely because we're reusing the same memory a few times (without the limit in place we would be allocating a new huge batch for every batch emitted by the cfetcher), but I wouldn't read into this too much since the improvement is likely to become smaller on larger queries. Still, this will be a very good change for cluster stability (at least from the SQL execution perspective).

I'm very confident in the correctness of this change since we are effectively modifying the coldata.BatchSize() value based on a memory limit in some cases, and we have been extensively testing with that value randomized for a long time now.

I did add a simple unit test though.


I also added another commit that makes the external sorter respect the memory limit much better (though only partially). I think we should backport this commit too. I filed #60017 to track the remaining issue, which I want to address in a separate PR (that also might be worth backporting, depending on the complexity).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)


pkg/sql/colmem/allocator.go, line 174 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Maybe condense this line and the block below into:

```go
useOldBatch := oldBatch.Capacity() == coldata.BatchSize() || (GetBatchMemSize(oldBatch) >= maxBatchMemSize && oldBatch.Capacity() >= minCapacity)
```

I don't think that quite works because we need to calculate oldBatchMemSize anyway; with this proposed change we would have to calculate it twice if useOldBatch becomes false.
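
Roughly the shape I mean, as a self-contained sketch (with stand-ins for coldata and for the allocator's accounting):

```go
package main

import "fmt"

type batch struct{ capacity int }

const (
	maxBatchSize    = 1024     // stand-in for coldata.BatchSize()
	maxBatchMemSize = 64 << 20 // the provided memory limit
	minCapacity     = 1
)

// getBatchMemSize stands in for GetBatchMemSize; assume it walks the whole
// batch, so it is not free to call twice.
func getBatchMemSize(b *batch) int64 {
	return int64(b.capacity) * (10 << 20)
}

func main() {
	oldBatch := &batch{capacity: 8}
	// The size is computed exactly once and reused on both branches.
	oldBatchMemSize := getBatchMemSize(oldBatch)
	useOldBatch := oldBatch.capacity == maxBatchSize ||
		(oldBatchMemSize >= maxBatchMemSize && oldBatch.capacity >= minCapacity)
	if useOldBatch {
		fmt.Println("reuse the old batch; footprint stays at", oldBatchMemSize)
	} else {
		// oldBatchMemSize is needed again on this branch (e.g. to release
		// the old batch's footprint before allocating a doubled one), so
		// folding the getBatchMemSize call into the condition above would
		// force computing it a second time here.
		fmt.Println("release", oldBatchMemSize, "and reallocate")
	}
}
```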

@yuzefovich force-pushed the batch-mem-size branch 4 times, most recently from 2a0a029 to dbc0912 on February 9, 2021 07:07
@yuzefovich (Member, Author)

I've just deleted some of my comments because I confused myself, so feel free to disregard the previous messages. I also realized that the fix for the double-counting is actually very simple, and I couldn't wait until tomorrow to implement it.

Additionally, I found a few other issues with memory accounting, but I'll keep those for a separate PR. Time to go to bed.

The setup I used to debug the external sorter and its memory accounting is the following: a table with 100k rows, each row 10KB in size; the query runs a top-K sort with K = 10k, which ends up creating 100MB of output and requires spilling to disk.
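
For reference, the sizes involved in this setup (just the arithmetic):

```go
package main

import "fmt"

func main() {
	const rowSize = 10 << 10 // ~10KB per row
	const numRows = 100_000
	const k = 10_000
	fmt.Printf("scanned: ~%dMB\n", (numRows*rowSize)>>20) // ~976MB table
	fmt.Printf("output: ~%dMB\n", (k*rowSize)>>20)        // ~97MB, i.e. the ~100MB of output
}
```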

The state of this PR:

  • the first commit is the main change that limits the size of batches throughout the engine based on the workmem setting; the reported memory usage is 254MiB (no change from master since the limiting in question doesn't kick in; this is expected)
  • the second commit reduces the amount of double-counting; the reported usage is 205MiB
  • the third commit limits the size of the output batch; the reported usage is the same, but the number of batches increases from 19 to 162 (the former is somewhat unexpected to me at the moment, but the latter is expected)
  • the fourth commit removes the double-counting, and we now report 137MiB usage. This is precisely what I'd expect given the current code:

@asubiotto (Contributor) left a comment


This is starting to be a lot of code to backport, so I think we should think critically about doing this. Let's discuss during standup. My opinion is that we should backport the first commit, but all these external sort improvements have enough risk associated with them that I don't think it's a good move to backport them.

Reviewed 37 of 37 files at r2, 4 of 4 files at r3, 4 of 4 files at r4, 2 of 2 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @yuzefovich)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):

```go
	case columnarizerBufferingMode:
		c.batch, reallocated = c.allocator.ResetMaybeReallocate(
			c.typs, c.batch, 1 /* minCapacity */, c.maxBatchMemSize,
```

Is there a better way to do this than passing the memory limit in to ResetMaybeReallocate? It might be nice to keep the memory-related code out of each operator, so my suggestion is to let the allocator determine the max memory limit in NewAllocator using a settings object (or pass the memory limit directly). This would be the default for most use cases. In cases where we don't want to limit batches by memory, we could add an OverrideMemLimit method on the allocator that would be called before passing the allocator to whatever operator needs this behavior (a rough sketch follows below).
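
Something like this hypothetical sketch (none of these names exist on colmem.Allocator today; the bound account and settings plumbing are elided):

```go
package main

import "math"

// Allocator here carries the batch memory limit itself, so operators don't
// have to. This is the proposal, not the current colmem.Allocator.
type Allocator struct {
	maxBatchMemSize int64
}

// NewAllocator derives the limit from a settings object or takes it
// directly; most operators would never see it.
func NewAllocator(maxBatchMemSize int64) *Allocator {
	return &Allocator{maxBatchMemSize: maxBatchMemSize}
}

// OverrideMemLimit is for the rare component that wants a different limit
// (or none); it would be called before handing the allocator to that
// component.
func (a *Allocator) OverrideMemLimit(limit int64) {
	a.maxBatchMemSize = limit
}

func main() {
	a := NewAllocator(64 << 20)       // default: limit batches by workmem
	a.OverrideMemLimit(math.MaxInt64) // effectively "don't limit by memory"
	_ = a
}
```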


pkg/sql/colexec/external_sort.go, line 119 at r4 (raw file):

```go
	unlimitedAllocator *colmem.Allocator
	// memoryLimit is the amount of RAM available to the external sorter (the
```

Can we add a test similar to the spilling queue that asserts the expected memory usage during each stage of a simple and artificial disk spilling example? I think this would be useful for this and the last commit.

@yuzefovich (Member, Author) left a comment


I agree that it's a lot of code changes now, but all commits (except maybe for the third one) are worthy of backporting and have very low risk associated with them.

Anyway, I'll extract all of them in a separate PR so that we can proceed with the main change here (the first and now only commit).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Is there a better way to do this than passing the memory limit in to ResetMaybeReallocate? It might be nice to keep the memory-related code out of each operator, so my suggestion is to let the allocator determine the max memory limit in NewAllocator using a settings object (or pass the memory limit directly). This would be the default for most use cases. In cases where we don't want to limit batches by memory, we could add an OverrideMemLimit method on the allocator that would be called before passing the allocator to whatever operator needs this behavior.

I don't think that it is a good idea, for several reasons:

  • currently, all logic about the dynamic nature of batches is encapsulated in the ResetMaybeReallocate method, and I'd like to keep it this way
  • if an allocator object derives some memory limit from a cluster setting in NewAllocator, then the semantics of that memory limit become confusing: which methods respect the limit? Is it OK to ignore it in all existing methods except for ResetMaybeReallocate?
  • although not very common currently, the allocator can be shared between multiple components, and what should we do if different components want to use different memory limits? We would have to call OverrideMemLimit every time before ResetMaybeReallocate

pkg/sql/colexec/external_sort.go, line 119 at r4 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Can we add a test similar to the spilling queue that asserts the expected memory usage during each stage of a simple and artificial disk spilling example? I think this would be useful for this and the last commit.

Yeah, I too realized the need for a unit test while debugging (it was a bit annoying to recompile the binary and use EXPLAIN ANALYZE); I will add one in the separate PR.

@asubiotto (Contributor) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @yuzefovich)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I don't think that it is a good idea, for several reasons:

  • currently, all logic about the dynamic nature of batches is encapsulated in the ResetMaybeReallocate method, and I'd like to keep it this way
  • if an allocator object derives some memory limit from a cluster setting in NewAllocator, then the semantics of that memory limit become confusing: which methods respect the limit? Is it OK to ignore it in all existing methods except for ResetMaybeReallocate?
  • although not very common currently, the allocator can be shared between multiple components, and what should we do if different components want to use different memory limits? We would have to call OverrideMemLimit every time before ResetMaybeReallocate
  1. The memory limit is independent of dynamic batches, and I think that the encapsulation of this memory limit in ResetMaybeReallocate comes at the cost of removing the encapsulation of memory limiting from the planning code and having to add fields to a lot of structs, which I'm arguing is not a good tradeoff.
  2. If we take a step back, we should probably respect the memory limit in the other allocator methods (at least I don't see a reason not to), but this is probably not important right now and can be left for later.
  3. Could you share an example of this last point? I agree it's important, but maybe it can be solved by passing a new allocator to the other component (OverrideMemLimit should only be called once on the allocator; this logic could also just be passing math.MaxInt64 to the NewAllocator constructor, which is probably better), since allocators are basically just wrappers over a bound account.

@yuzefovich (Member, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
  1. The memory limit is independent of dynamic batches, and I think that the encapsulation of this memory limit in ResetMaybeReallocate comes at the cost of removing the encapsulation of memory limiting from the planning code and having to add fields to a lot of structs, which I'm arguing is not a good tradeoff.
  2. If we take a step back, we should probably respect the memory limit in the other allocator methods (at least I don't see a reason not to), but this is probably not important right now and can be left for later.
  3. Could you share an example of this last point? I agree it's important, but maybe it can be solved by passing a new allocator to the other component (OverrideMemLimit should only be called once on the allocator; this logic could also just be passing math.MaxInt64 to the NewAllocator constructor, which is probably better), since allocators are basically just wrappers over a bound account.

I still disagree about the semantics. If we put the memory limit into the Allocator object, what does it mean? Is the allocator only allowed to use that amount of memory in total? No, so what is the memory limit then? We could call it something like singleBatchMemoryLimit, and it would be used only by ResetMaybeReallocate.

I think I see where you're coming from: in the future, we would like the allocator to limit the size of all batches, not only the ones returned by ResetMaybeReallocate, but we're very far from that point. At least the way things are implemented now, it is up to each operator (the user of the allocator object) to stop adding tuples to a batch if that batch becomes too big, and I don't think we will change that in the 21.1 cycle.

Changing the signature of NewAllocator will be a lot more annoying too (66 callsites vs 15), so the current change will be easier to backport.

@yuzefovich (Member, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I still disagree about the semantics. If we put the memory limit into the Allocator object, what does it mean? Is the allocator only allowed to use that amount of memory in total? No, so what is the memory limit then? We could call it something like singleBatchMemoryLimit, and it would be used only by ResetMaybeReallocate.

I think I see where you're coming from: in the future, we would like the allocator to limit the size of all batches, not only the ones returned by ResetMaybeReallocate, but we're very far from that point. At least the way things are implemented now, it is up to each operator (the user of the allocator object) to stop adding tuples to a batch if that batch becomes too big, and I don't think we will change that in the 21.1 cycle.

Changing the signature of NewAllocator will be a lot more annoying too (66 callsites vs 15), so the current change will be easier to backport.

re 2. Currently, the memory limits are not respected for variable-length data types; when the allocator creates a new vector or a new batch, it can only use an estimate, which is always within the limits. I believe it should still be up to the caller to make sure to stay within the limit (possibly we could add another method on the allocator for that). I agree that it is not urgent though.

re 3. I think most external operators use the same unlimited allocator for both the main and the fallback strategy. So this example is probably not applicable, since each involved operator tracks its memory usage on its own.

@asubiotto (Contributor) left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @yuzefovich)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):
If you're talking about the naming semantics, you can call it maxBatchMemSize

I think I see where you're coming from: in the future, we would like for the allocator to limit the size of all batches, not only the ones returned by ResetMaybeReallocate, but we're very from that point. At least the way things are implemented now, it is up to each operator (the user of the allocator object) to stop adding tuples into a batch if that batch becomes too big, and I don't think we will change it in 21.1 cycle.

Why not bring it closer to that point? At some point the allocator should only return fixed size batches from a shared memory pool so we'll have to change operators again to be agnostic to how much memory they want.

Changing the signature of NewAllocator will be a lot more annoying too (66 callsites vs 15), so the current change will be easier to backport.

Fair.

I still don't fully agree but I also don't want to block this PR on this point so go ahead.

@yuzefovich (Member, Author) left a comment


TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto, @jordanlewis, and @yuzefovich)

@craig (bot, Contributor) commented Feb 14, 2021

Build succeeded:
