
colmem: limit batches of dynamic size by workmem in memory footprint #59851

Merged · 1 commit · Feb 14, 2021

Conversation

@yuzefovich (Member) commented Feb 5, 2021

This commit adds a minor improvement to ResetMaybeReallocate method so
that it reuses the old batch (without doubling the capacity and
allocating a new batch) when the old batch has enough capacity and
reached the provided limit in size. This change effectively allows us to
limit all batches flowing through the vectorized engine by workmem
setting in size since the main batch-producers (cfetcher and
columnarizer) use this method.
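
Roughly, the new rule looks like this (a simplified sketch with stand-in types; the real `ResetMaybeReallocate` in `pkg/sql/colmem/allocator.go` has a different signature and does actual memory accounting):

```go
package main

import "fmt"

// batch stands in for coldata.Batch; only its capacity and an approximate
// memory footprint matter for this sketch.
type batch struct {
	capacity int
	memSize  int64
}

// maxBatchSize stands in for coldata.BatchSize().
const maxBatchSize = 1024

// resetMaybeReallocate keeps the old batch when it is already at the maximum
// capacity, or when it has reached the provided memory limit while still
// satisfying the minimum capacity; otherwise it allocates a new batch with
// double the capacity.
func resetMaybeReallocate(old *batch, minCapacity int, maxBatchMemSize int64) (*batch, bool) {
	if old == nil {
		return &batch{capacity: minCapacity}, true
	}
	useOldBatch := old.capacity == maxBatchSize ||
		(old.memSize >= maxBatchMemSize && old.capacity >= minCapacity)
	if useOldBatch {
		return old, false
	}
	newCapacity := 2 * old.capacity
	if newCapacity > maxBatchSize {
		newCapacity = maxBatchSize
	}
	return &batch{capacity: newCapacity}, true
}

func main() {
	// An 80MB batch of 8 rows against a 64MiB limit is reused, not doubled.
	b, reallocated := resetMaybeReallocate(&batch{capacity: 8, memSize: 80 << 20}, 1, 64<<20)
	fmt.Println(b.capacity, reallocated) // 8 false
}
```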

Addresses: https://github.com/cockroachlabs/support/issues/808.

Release note (sql change): Most batches of data flowing through the
vectorized execution engine will now be limited in size by
sql.distsql.temp_storage.workmem (64MiB by default) which should
improve the stability of CockroachDB clusters.

@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich (Member, Author)

I think this is quite an easy solution to the problem, one that we could even backport. Interested to hear your thoughts on whether a cluster setting is necessary for this memory limit.

I also confirmed in a toy scenario that it helps: a table with 100 rows, each 10MB in size, where the query performs a full table scan followed by a top 10 sort.

Before: [screenshot]

After: [screenshot]

Note that the number of batches emitted by the cfetcher increased from 7 to 15. I think in the "after" case we have the following batch lengths: 1, 2, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 5. When we first reach the length of 8, the batch becomes 80MB in memory size, so we won't allocate any larger batches.
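
For concreteness, here is the growth arithmetic with these numbers (a toy loop using the example's 10MB rows and the 64MiB default, not engine code):

```go
package main

import "fmt"

func main() {
	const rowSize = 10 << 20 // ~10MB per row, as in the example above
	const workmem = 64 << 20 // the 64MiB default limit
	for capacity := 1; ; capacity *= 2 {
		fmt.Printf("capacity %d -> %dMB\n", capacity, (capacity*rowSize)>>20)
		if int64(capacity)*rowSize >= workmem {
			// 8 * 10MB = 80MB >= 64MiB: the capacity stops doubling here,
			// so all subsequent batches stay at length 8.
			break
		}
	}
}
```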

@asubiotto (Contributor) left a comment


Nice change. I think this is high-impact enough that we should probably backport it, especially given that this would improve the situation for some of our customers. Is there also a performance improvement that I'm seeing here?

Is there some testing we can do, or do existing tests give us enough confidence in the correctness?

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)


pkg/sql/colmem/allocator.go, line 174 at r1 (raw file):

```go
	} else {
		// If old batch is already of the largest capacity, we will reuse it.
		useOldBatch := oldBatch.Capacity() == coldata.BatchSize()
```

Maybe condense this line and the block below into:

```go
useOldBatch := oldBatch.Capacity() == coldata.BatchSize() || (GetBatchMemSize(oldBatch) >= maxBatchMemSize && oldBatch.Capacity() >= minCapacity)
```

@yuzefovich force-pushed the batch-mem-size branch 6 times, most recently from dfd748a to e9b7147 on February 9, 2021 00:40
@yuzefovich changed the title from colmem: limit batches of dynamic size by 64MiB in memory footprint to colmem: limit batches of dynamic size by workmem in memory footprint on Feb 9, 2021
@yuzefovich force-pushed the batch-mem-size branch 3 times, most recently from 6a5b96d to b1933e4 on February 9, 2021 01:43
@yuzefovich (Member, Author) left a comment


Yeah, I agree that we should backport it. There does appear to be a performance improvement in this diagram, likely because we're reusing the same memory a few times (without the limit in place we would be allocating a new huge batch for every batch emitted by the cfetcher), but I wouldn't read into this too much since the improvement is likely to become smaller on larger queries. Still, this will be a very good change for cluster stability (at least from the SQL execution perspective).

I'm very confident in the correctness of this change since we are effectively modifying the coldata.BatchSize() value based on a memory limit in some cases, and we have been extensively testing with that value randomized for a long time now.

I did add a simple unit test though.


I also added another commit that makes the external sorter respect the memory limit much better (though only partially). I think we should backport this commit too. I filed #60017 to track the remaining issue, which I want to address in a separate PR (that also might be worth backporting, depending on the complexity).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)


pkg/sql/colmem/allocator.go, line 174 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Maybe condense this line and the block below into:

```go
useOldBatch := oldBatch.Capacity() == coldata.BatchSize() || (GetBatchMemSize(oldBatch) >= maxBatchMemSize && oldBatch.Capacity() >= minCapacity)
```

I don't think that quite works because we need to calculate oldBatchMemSize anyway; with this proposed change we would have to calculate it twice if useOldBatch becomes false.
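
Roughly the shape I mean, as a self-contained sketch (with stand-ins for coldata and for the allocator's accounting):

```go
package main

import "fmt"

type batch struct{ capacity int }

const (
	maxBatchSize    = 1024     // stand-in for coldata.BatchSize()
	maxBatchMemSize = 64 << 20 // the provided memory limit
	minCapacity     = 1
)

// getBatchMemSize stands in for GetBatchMemSize; assume it walks the whole
// batch, so it is not free to call twice.
func getBatchMemSize(b *batch) int64 {
	return int64(b.capacity) * (10 << 20)
}

func main() {
	oldBatch := &batch{capacity: 8}
	// The size is computed exactly once and reused on both branches.
	oldBatchMemSize := getBatchMemSize(oldBatch)
	useOldBatch := oldBatch.capacity == maxBatchSize ||
		(oldBatchMemSize >= maxBatchMemSize && oldBatch.capacity >= minCapacity)
	if useOldBatch {
		fmt.Println("reuse the old batch; footprint stays at", oldBatchMemSize)
	} else {
		// oldBatchMemSize is needed again on this branch (e.g. to release
		// the old batch's footprint before allocating a doubled one), so
		// folding the getBatchMemSize call into the condition above would
		// force computing it a second time here.
		fmt.Println("release", oldBatchMemSize, "and reallocate")
	}
}
```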

@yuzefovich force-pushed the batch-mem-size branch 4 times, most recently from 2a0a029 to dbc0912 on February 9, 2021 07:07
@yuzefovich (Member, Author)

I've just deleted some of my comments because I confused myself, so feel free to disregard the previous messages. I also realized that the fix for the double-counting is actually very simple, and I couldn't wait until tomorrow to implement it.

Additionally, I found a few other issues with memory accounting, but I'll keep those for a separate PR. Time to go to bed.

The setup I used to debug the external sorter and its memory accounting is the following: a table with 100k rows, each row 10KB in size; the query runs a top-K sort with K = 10k, which ends up creating 100MB of output and requires spilling to disk.
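
For reference, the sizes involved in this setup (just the arithmetic):

```go
package main

import "fmt"

func main() {
	const rowSize = 10 << 10 // ~10KB per row
	const numRows = 100_000
	const k = 10_000
	fmt.Printf("scanned: ~%dMB\n", (numRows*rowSize)>>20) // ~976MB table
	fmt.Printf("output: ~%dMB\n", (k*rowSize)>>20)        // ~97MB, i.e. the ~100MB of output
}
```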

The state of this PR:

  • the first commit is the main change that limits the size of batches throughout the engine based on the workmem setting; the reported memory usage is 254MiB (no change from master since the limiting in question doesn't kick in; this is expected)
  • the second commit reduces the amount of double-counting; the reported usage is 205MiB
  • the third commit limits the size of the output batch; the reported usage is the same, but the number of batches increases from 19 to 162 (the former is somewhat unexpected to me at the moment, but the latter is expected)
  • the fourth commit removes the double-counting, and we now report 137MiB usage. This is precisely what I'd expect given the current code:

@asubiotto (Contributor) left a comment


This is starting to be a lot of code to backport, so I think we should think critically about doing this. Let's discuss during standup. My opinion is that we should backport the first commit, but all these external sort improvements have enough risk associated with them that I don't think it's a good move to backport them.

Reviewed 37 of 37 files at r2, 4 of 4 files at r3, 4 of 4 files at r4, 2 of 2 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @yuzefovich)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):

```go
	case columnarizerBufferingMode:
		c.batch, reallocated = c.allocator.ResetMaybeReallocate(
			c.typs, c.batch, 1 /* minCapacity */, c.maxBatchMemSize,
```

Is there a better way to do this than passing the memory limit in to ResetMaybeReallocate? It might be nice to keep the memory-related code out of each operator, so my suggestion is to let the allocator determine the max memory limit in NewAllocator using a settings object (or pass the memory limit directly). This would be the default for most use cases. In cases where we don't want to limit batches by memory, we could add an OverrideMemLimit method on the allocator that would be called before passing the allocator to whatever operator needs this behavior (a rough sketch follows below).
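
Something like this hypothetical sketch (none of these names exist on colmem.Allocator today; the bound account and settings plumbing are elided):

```go
package main

import "math"

// Allocator here carries the batch memory limit itself, so operators don't
// have to. This is the proposal, not the current colmem.Allocator.
type Allocator struct {
	maxBatchMemSize int64
}

// NewAllocator derives the limit from a settings object or takes it
// directly; most operators would never see it.
func NewAllocator(maxBatchMemSize int64) *Allocator {
	return &Allocator{maxBatchMemSize: maxBatchMemSize}
}

// OverrideMemLimit is for the rare component that wants a different limit
// (or none); it would be called before handing the allocator to that
// component.
func (a *Allocator) OverrideMemLimit(limit int64) {
	a.maxBatchMemSize = limit
}

func main() {
	a := NewAllocator(64 << 20)       // default: limit batches by workmem
	a.OverrideMemLimit(math.MaxInt64) // effectively "don't limit by memory"
	_ = a
}
```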


pkg/sql/colexec/external_sort.go, line 119 at r4 (raw file):

```go
	unlimitedAllocator *colmem.Allocator
	// memoryLimit is the amount of RAM available to the external sorter (the
```

Can we add a test similar to the spilling queue that asserts the expected memory usage during each stage of a simple and artificial disk spilling example? I think this would be useful for this and the last commit.

@yuzefovich (Member, Author) left a comment


I agree that it's a lot of code changes now, but all commits (except maybe for the third one) are worthy of backporting and have very low risk associated with them.

Anyway, I'll extract all of them in a separate PR so that we can proceed with the main change here (the first and now only commit).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Is there a better way to do this than passing the memory limit in to ResetMaybeReallocate? It might be nice to keep the memory-related code out of each operator, so my suggestion is to let the allocator determine the max memory limit in NewAllocator using a settings object (or pass the memory limit directly). This would be the default for most use cases. In cases where we don't want to limit batches by memory, we could add an OverrideMemLimit method on the allocator that would be called before passing the allocator to whatever operator needs this behavior.

I don't think that it is a good idea, for several reasons:

  • currently, all logic about the dynamic nature of batches is encapsulated in the ResetMaybeReallocate method, and I'd like to keep it this way
  • if an allocator object derives some memory limit from a cluster setting in NewAllocator, then the semantics of that memory limit become confusing: which methods respect the limit? Is it OK to ignore it in all existing methods except for ResetMaybeReallocate?
  • although not very common currently, the allocator can be shared between multiple components, and what should we do if different components want to use different memory limits? We would have to call OverrideMemLimit every time before ResetMaybeReallocate

pkg/sql/colexec/external_sort.go, line 119 at r4 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Can we add a test similar to the spilling queue that asserts the expected memory usage during each stage of a simple and artificial disk spilling example? I think this would be useful for this and the last commit.

Yeah, I too realized the need for a unit test while debugging (it was a bit annoying to recompile the binary and use EXPLAIN ANALYZE); I will add one in the separate PR.

@asubiotto (Contributor) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @yuzefovich)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I don't think that it is a good idea, for several reasons:

  • currently, all logic about the dynamic nature of batches is encapsulated in the ResetMaybeReallocate method, and I'd like to keep it this way
  • if an allocator object derives some memory limit from a cluster setting in NewAllocator, then the semantics of that memory limit become confusing: which methods respect the limit? Is it OK to ignore it in all existing methods except for ResetMaybeReallocate?
  • although not very common currently, the allocator can be shared between multiple components, and what should we do if different components want to use different memory limits? We would have to call OverrideMemLimit every time before ResetMaybeReallocate
  1. The memory limit is independent of dynamic batches, and I think that the encapsulation of this memory limit in ResetMaybeReallocate comes at the cost of removing the encapsulation of memory limiting from the planning code and having to add fields to a lot of structs, which I'm arguing is not a good tradeoff.
  2. If we take a step back, we should probably respect the memory limit in the other allocator methods (at least I don't see a reason not to), but this is probably not important right now and can be left for later.
  3. Could you share an example of this last point? I agree it's important, but maybe it can be solved by passing a new allocator to the other component (OverrideMemLimit should only be called once on the allocator; this logic could also just be passing math.MaxInt64 to the NewAllocator constructor, which is probably better), since allocators are basically just wrappers over a bound account.

@yuzefovich (Member, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
  1. The memory limit is independent of dynamic batches, and I think that the encapsulation of this memory limit in ResetMaybeReallocate comes at the cost of removing the encapsulation of memory limiting from the planning code and having to add fields to a lot of structs, which I'm arguing is not a good tradeoff.
  2. If we take a step back, we should probably respect the memory limit in the other allocator methods (at least I don't see a reason not to), but this is probably not important right now and can be left for later.
  3. Could you share an example of this last point? I agree it's important, but maybe it can be solved by passing a new allocator to the other component (OverrideMemLimit should only be called once on the allocator; this logic could also just be passing math.MaxInt64 to the NewAllocator constructor, which is probably better), since allocators are basically just wrappers over a bound account.

I still disagree about the semantics. If we put the memory limit into the Allocator object, what does it mean? Is the allocator only allowed to use that amount of memory in total? No, so what is the memory limit then? We could call it something like singleBatchMemoryLimit, and it would be used only by ResetMaybeReallocate.

I think I see where you're coming from: in the future, we would like the allocator to limit the size of all batches, not only the ones returned by ResetMaybeReallocate, but we're very far from that point. At least the way things are implemented now, it is up to each operator (the user of the allocator object) to stop adding tuples to a batch if that batch becomes too big, and I don't think we will change that in the 21.1 cycle.

Changing the signature of NewAllocator will be a lot more annoying too (66 callsites vs 15), so the current change will be easier to backport.

@yuzefovich (Member, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I still disagree about the semantics. If we put the memory limit into the Allocator object, what does it mean? Is the allocator only allowed to use that amount of memory in total? No, so what is the memory limit then? We could call it something like singleBatchMemoryLimit, and it would be used only by ResetMaybeReallocate.

I think I see where you're coming from: in the future, we would like the allocator to limit the size of all batches, not only the ones returned by ResetMaybeReallocate, but we're very far from that point. At least the way things are implemented now, it is up to each operator (the user of the allocator object) to stop adding tuples to a batch if that batch becomes too big, and I don't think we will change that in the 21.1 cycle.

Changing the signature of NewAllocator will be a lot more annoying too (66 callsites vs 15), so the current change will be easier to backport.

re 2. Currently, the memory limits are not respected for variable-length data types; when the allocator creates a new vector or a new batch, it can only use an estimate, which is always within the limits. I believe it should still be up to the caller to make sure to stay within the limit (possibly we could add another method on the allocator for that). I agree that it is not urgent though.

re 3. I think most external operators use the same unlimited allocator for both the main and the fallback strategy. So this example is probably not applicable, since each involved operator tracks its memory usage on its own.

@asubiotto (Contributor) left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @yuzefovich)


pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):
If you're talking about the naming semantics, you can call it maxBatchMemSize

I think I see where you're coming from: in the future, we would like for the allocator to limit the size of all batches, not only the ones returned by ResetMaybeReallocate, but we're very from that point. At least the way things are implemented now, it is up to each operator (the user of the allocator object) to stop adding tuples into a batch if that batch becomes too big, and I don't think we will change it in 21.1 cycle.

Why not bring it closer to that point? At some point the allocator should only return fixed size batches from a shared memory pool so we'll have to change operators again to be agnostic to how much memory they want.

Changing the signature of NewAllocator will be a lot more annoying too (66 callsites vs 15), so the current change will be easier to backport.

Fair.

I still don't fully agree but I also don't want to block this PR on this point so go ahead.

@yuzefovich (Member, Author) left a comment


TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto, @jordanlewis, and @yuzefovich)

@craig (bot, Contributor) commented Feb 14, 2021

Build succeeded:
