colmem: limit batches of dynamic size by workmem in memory footprint #59851
Conversation
I think this is quite an easy solution to the problem that we could even backport. Interested to hear your thoughts whether a cluster setting is necessary for this memory limit. I also confirmed in a toy scenario that it helps - I have a table with 100 rows each of which is 10MB in size, the query performs a full table scan followed by a top 10 sort. Note that the number of batches emitted by the cfetcher increased from 7 to 15. I think in the "after" case we have the following batch lengths: 1, 2, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 5. When we first reach the length of 8, the batch becomes 80MB in memory size, so we won't allocate any larger batches.
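The growth pattern described above can be sketched as a small simulation. This is an illustrative model, not the actual colmem code: `batchLengths` and its parameters are hypothetical stand-ins for `coldata.BatchSize()`, the 10MB rows of the toy scenario, and the 64MiB `workmem` limit.

```go
package main

import "fmt"

// batchLengths models the capacity-doubling policy sketched in the PR:
// batch capacity doubles on each reallocation, but stops growing once a
// full batch's memory footprint reaches the memory limit (or the maximum
// batch size). All names and numbers here are illustrative.
func batchLengths(totalRows, rowSize, maxBatchMemSize, maxBatchSize int) []int {
	capacity := 1
	emitted := 0
	var lengths []int
	for emitted < totalRows {
		n := capacity
		if emitted+n > totalRows {
			n = totalRows - emitted
		}
		lengths = append(lengths, n)
		emitted += n
		// Double the capacity only while a full batch would stay under
		// both the memory limit and the maximum batch size.
		if capacity < maxBatchSize && capacity*rowSize < maxBatchMemSize {
			capacity *= 2
		}
	}
	return lengths
}

func main() {
	// 100 rows of 10MB each, 64MiB workmem: capacity grows 1, 2, 4, 8 and
	// then stops, because an 8-row batch is 80MB, above the 64MiB limit.
	fmt.Println(batchLengths(100, 10<<20, 64<<20, 1024))
}
```

Running this reproduces the 15 batch lengths quoted in the comment above (1, 2, 4, then eleven batches of 8, then a final 5).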
Nice change. I think this is high-impact enough that we should probably backport it, especially given that this would improve the situation for some of our customers. Is that a performance improvement I'm seeing as well?
Is there some testing we can do or do existing tests give us enough confidence of the correctness?
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/colmem/allocator.go, line 174 at r1 (raw file):
```go
} else {
	// If old batch is already of the largest capacity, we will reuse it.
	useOldBatch := oldBatch.Capacity() == coldata.BatchSize()
```

Maybe condense this line and the block below into:

```go
useOldBatch := oldBatch.Capacity() == coldata.BatchSize() ||
	(GetBatchMemSize(oldBatch) >= maxBatchMemSize && oldBatch.Capacity() >= minCapacity)
```
Force-pushed dfd748a to e9b7147
Force-pushed 6a5b96d to b1933e4
Yeah, I agree that we should backport it. There does appear to be a performance improvement in this diagram, likely due to the fact that we're reusing the same memory a few times (without the limit in place we would be allocating new huge batches for every batch emitted by the cfetcher), but I wouldn't read into this too much since the improvement is likely to become smaller on larger queries. Still, this will be a very good change for cluster stability (at least from the SQL execution perspective).
I'm very confident in the correctness of this change since we are effectively modifying the coldata.BatchSize() value based on a memory limit in some cases, and we have been testing extensively with randomization of that value for a long time now.
I did add a simple unit test though.
I also added another commit that partially makes the external sorter respect the memory limit much better. I think we should backport this commit too. I filed #60017 to track the remaining issue that I want to address in a separate PR (which also might be worth backporting depending on the complexity).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)
pkg/sql/colmem/allocator.go, line 174 at r1 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Maybe condense this line and the block below into:
```go
useOldBatch := oldBatch.Capacity() == coldata.BatchSize() ||
	(GetBatchMemSize(oldBatch) >= maxBatchMemSize && oldBatch.Capacity() >= minCapacity)
```
I don't think that it quite works because we need to calculate oldBatchMemSize; with this proposed change we would have to calculate it twice if useOldBatch becomes false.
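The compute-once concern can be sketched with stand-in types. None of this is the real colmem API; `batch`, `memSize`, and `shouldReuse` are hypothetical stand-ins for `coldata.Batch`, `GetBatchMemSize` (which walks all of a batch's vectors), and the condition under discussion.

```go
package main

import "fmt"

// batch is a stand-in for coldata.Batch.
type batch struct {
	capacity int
	bytes    int64
}

// memSize stands in for GetBatchMemSize; imagine it is expensive, since it
// would walk every vector in the batch.
func (b *batch) memSize() int64 { return b.bytes }

const batchSize = 1024 // stand-in for coldata.BatchSize()

// shouldReuse evaluates the reuse condition while computing the batch's
// memory size exactly once and returning it, so the caller does not need a
// second memSize call on the path where reuse is false. That is the point
// raised above about the condensed one-liner.
func shouldReuse(old *batch, minCapacity int, maxBatchMemSize int64) (reuse bool, oldSize int64) {
	oldSize = old.memSize() // computed once, reused below and by the caller
	reuse = old.capacity == batchSize ||
		(oldSize >= maxBatchMemSize && old.capacity >= minCapacity)
	return reuse, oldSize
}

func main() {
	// An 8-row, 80MB batch against a 64MiB limit: already over the limit,
	// so it should be reused rather than doubled.
	old := &batch{capacity: 8, bytes: 80 << 20}
	reuse, size := shouldReuse(old, 1, 64<<20)
	fmt.Println(reuse, size)
}
```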
Force-pushed 2a0a029 to dbc0912
I've just deleted some of the comments because I confused myself, so feel free to disregard previous messages. I also realized the fix for double-counting is actually very simple, and I couldn't wait for tomorrow to implement it. Additionally, I found a few other issues with memory accounting, but I'll keep those for a separate PR. Time to go to bed. The setup I used to debug the external sorter and its memory accounting is the following: a table with 100k rows with each row 10KB in size, the query runs top K sort where K = 10k which ends up creating 100MB output and requires spilling to disk. The state of this PR:
This is starting to be a lot of code to backport, so I think we should think critically about doing this. Let's discuss during standup. My opinion is that we should backport the first commit, but all these external sort improvements have enough risk associated with them that I don't think it's a good move to backport them.
Reviewed 37 of 37 files at r2, 4 of 4 files at r3, 4 of 4 files at r4, 2 of 2 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @yuzefovich)
pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):
```go
case columnarizerBufferingMode:
	c.batch, reallocated = c.allocator.ResetMaybeReallocate(
		c.typs, c.batch, 1 /* minCapacity */, c.maxBatchMemSize,
```
Is there a better way to do this than passing in the memory limit to ResetMaybeReallocate? It might be nice to keep the memory-related code out of each operator, so my suggestion is to let the allocator determine the max mem limit in NewAllocator using a settings object (or pass the memory limit directly). This will be the default for most use cases. In cases where we don't want to limit batches by memory, we could add a method to the allocator, OverrideMemLimit, that will be called before passing the allocator in to whatever operator needs this to operate.
pkg/sql/colexec/external_sort.go, line 119 at r4 (raw file):
```go
unlimitedAllocator *colmem.Allocator
// memoryLimit is the amount of RAM available to the external sorter (the
```
Can we add a test similar to the spilling queue that asserts the expected memory usage during each stage of a simple and artificial disk spilling example? I think this would be useful for this and the last commit.
Force-pushed dbc0912 to 7ad72ef
I agree that it's a lot of code changes now, but all commits (except maybe for the third one) are worthy of backporting and have very low risk associated with them.
Anyway, I'll extract all of them in a separate PR so that we can proceed with the main change here (the first and now only commit).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)
pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Is there a better way to do this than passing in the memory limit to ResetMaybeReallocate? It might be nice to keep the memory-related code out of each operator so my suggestion is to let the allocator determine the max mem limit in NewAllocator using a settings object (or pass the memory limit directly). This will be the default for most use cases. In cases where we don't want to limit batches by memory, we could add a method to the allocator OverrideMemLimit that will be called before passing the allocator in to whatever operator needs this to operate.
I don't think that it is a good idea for several reasons:
- currently, all logic about the dynamic nature of batches is encapsulated in the ResetMaybeReallocate method, and I'd like to keep it this way
- if an allocator object derives some memory limit based on a cluster setting in NewAllocator, then the semantics of that memory limit become confusing: what methods respect the limit? is it ok to ignore it in all existing methods except for ResetMaybeReallocate?
- (although not very frequently used currently) the allocator can be shared between multiple components, and what should we do if different components want to use different memory limits? we would have to call OverrideMemLimit every time before ResetMaybeReallocate
pkg/sql/colexec/external_sort.go, line 119 at r4 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Can we add a test similar to the spilling queue that asserts the expected memory usage during each stage of a simple and artificial disk spilling example? I think this would be useful for this and the last commit.
Yeah, I too realized the need for a unit test while debugging (it was a bit annoying to recompile the binary and use EXPLAIN ANALYZE), will add in the separate PR.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @yuzefovich)
pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I don't think that it is a good idea for several reasons:
- currently, all logic about the dynamic nature of batches is encapsulated in the ResetMaybeReallocate method, and I'd like to keep it this way
- if an allocator object derives some memory limit based on a cluster setting in NewAllocator, then the semantics of that memory limit become confusing: what methods respect the limit? is it ok to ignore it in all existing methods except for ResetMaybeReallocate?
- (although not very frequently used currently) the allocator can be shared between multiple components, and what should we do if different components want to use different memory limits? we would have to call OverrideMemLimit every time before ResetMaybeReallocate
- The memory limit is independent of dynamic batches, and I think that the encapsulation of this memory limit in ResetMaybeReallocate comes at the cost of removing the encapsulation of memory limiting in the planning code and having to add fields to a lot of structs, which I'm arguing is not a good tradeoff.
- If we take a step back, we should probably respect the memory limit in the other allocator methods (at least I don't see a reason not to), but this is probably not important right now and can be left for later.
- Could you share an example of this last point? I agree it's important, but maybe it can be solved by passing a new allocator to the other component (OverrideMemLimit should only be called once on the allocator; this logic could also be just passing math.MaxInt64 to the NewAllocator constructor, which is probably better), since allocators are basically just wrappers over a bound account.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)
pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
- The memory limit is independent of dynamic batches, and I think that the encapsulation of this memory limit in ResetMaybeReallocate comes at the cost of removing the encapsulation of memory limiting in the planning code and having to add fields to a lot of structs, which I'm arguing is not a good tradeoff.
- If we take a step back, we should probably respect the memory limit in the other allocator methods (at least I don't see a reason not to), but this is probably not important right now and can be left for later.
- Could you share an example of this last point? I agree it's important, but maybe it can be solved by passing a new allocator to the other component (OverrideMemLimit should only be called once on the allocator; this logic could also be just passing math.MaxInt64 to the NewAllocator constructor, which is probably better), since allocators are basically just wrappers over a bound account.
I still disagree about the semantics. If we put the memory limit into the Allocator object, what does it mean? Is the allocator only allowed to use that amount of memory in total? No, so what is the memory limit then? We could call it something like singleBatchMemoryLimit, and it would be used only by ResetMaybeReallocate.
I think I see where you're coming from: in the future, we would like the allocator to limit the size of all batches, not only the ones returned by ResetMaybeReallocate, but we're very far from that point. At least the way things are implemented now, it is up to each operator (the user of the allocator object) to stop adding tuples into a batch if that batch becomes too big, and I don't think we will change that in the 21.1 cycle.
Changing the signature of NewAllocator would also be a lot more annoying (66 callsites vs 15), so the current change will be easier to backport.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)
pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I still disagree about the semantics. If we put the memory limit into the Allocator object, what does it mean? Is the allocator only allowed to use that amount of memory in total? No, so what is the memory limit then? We can call it something like singleBatchMemoryLimit, and it will be used only by ResetMaybeReallocate.
I think I see where you're coming from: in the future, we would like for the allocator to limit the size of all batches, not only the ones returned by ResetMaybeReallocate, but we're very far from that point. At least the way things are implemented now, it is up to each operator (the user of the allocator object) to stop adding tuples into a batch if that batch becomes too big, and I don't think we will change it in the 21.1 cycle.
Changing the signature of NewAllocator will be a lot more annoying too (66 callsites vs 15), so the current change will be easier to backport.
re 2. Currently, the memory limits are not respected by the variable-length data types, and when the allocator creates a new vector or a new batch, it can only use an estimate which is always within limits. I believe it should still be up to the caller to make sure to stay within the limit (possibly we could add another method on the allocator for that). I agree that it is not urgent though.
re 3. I think most external operators use the same unlimited allocator for both the main and the fallback strategy. This example is probably not applicable since each involved operator is tracking the memory usage on its own.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @yuzefovich)
pkg/sql/colexec/columnarizer.go, line 146 at r2 (raw file):
If you're talking about the naming semantics, you can call it maxBatchMemSize.
I think I see where you're coming from: in the future, we would like for the allocator to limit the size of all batches, not only the ones returned by ResetMaybeReallocate, but we're very far from that point. At least the way things are implemented now, it is up to each operator (the user of the allocator object) to stop adding tuples into a batch if that batch becomes too big, and I don't think we will change it in the 21.1 cycle.
Why not bring it closer to that point? At some point the allocator should only return fixed size batches from a shared memory pool so we'll have to change operators again to be agnostic to how much memory they want.
Changing the signature of NewAllocator will be a lot more annoying too (66 callsites vs 15), so the current change will be easier to backport.
Fair.
I still don't fully agree but I also don't want to block this PR on this point so go ahead.
This commit adds a minor improvement to `ResetMaybeReallocate` method so that it reuses the old batch (without doubling the capacity and allocating a new batch) when the old batch has enough capacity and reached the provided limit in size. This change effectively allows us to limit all batches flowing through the vectorized engine by `workmem` setting in size since the main batch-producers (cfetcher and columnarizer) use this method. Release note (sql change): Most batches of data flowing through the vectorized execution engine will now be limited in size by `sql.distsql.temp_storage.workmem` (64MiB by default) which should improve the stability of CockroachDB clusters.
Force-pushed 7ad72ef to 9bbbce4
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto, @jordanlewis, and @yuzefovich)
Build succeeded.
This commit adds a minor improvement to `ResetMaybeReallocate` method so that it reuses the old batch (without doubling the capacity and allocating a new batch) when the old batch has enough capacity and reached the provided limit in size. This change effectively allows us to limit all batches flowing through the vectorized engine by `workmem` setting in size since the main batch-producers (cfetcher and columnarizer) use this method.
Addresses: https://github.com/cockroachlabs/support/issues/808.
Release note (sql change): Most batches of data flowing through the vectorized execution engine will now be limited in size by `sql.distsql.temp_storage.workmem` (64MiB by default) which should improve the stability of CockroachDB clusters.