sql,storage: add support for COL_BATCH_RESPONSE scan format #94438
Conversation
Force-pushed 186d49a to a284dbb
Force-pushed 07f8ff0 to f2fc46d
Force-pushed d24e48c to 26940e1
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @sumeerbhola)
pkg/roachpb/api.proto
line 673 at r7 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
[nit] `rowin` -> `row in`, and same above.
Done.
pkg/sql/colfetcher/cfetcher.go
line 812 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
I think I still want to push on this a bit more, since it would be nice to decouple things. What if we added a method to the `NextKVer` interface, `SetFirstKVOfRow`, that would be called by the `cFetcher` when it reaches `stateDecodeFirstKVOfRow`? I think that should give `mvccScanFetchAdapter` all the information it needs to implement my earlier suggestion, right?
Thanks for pushing on this. I implemented your suggestion in the first WIP commit, but I didn't like one aspect of it - having to call `SetFirstKeyOfRow` on every row seen by the `cFetcher`, even when not using the direct columnar scans. I think it could have a noticeable performance impact, so I added another WIP commit which eliminates this function call on every SQL row and still decouples things as you suggested.
Overall, I probably like this change. My only hesitation is that now the trimming logic is split into two places - we rely on synchronizing `singleResults` returning the appropriate first key of the trimmed row and the `cFetcher` actually removing that row. However, the interfaces do seem cleaner, so I'm curious to hear what everyone thinks.
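(For illustration, a rough sketch of the two approaches being weighed here; the names below are hypothetical simplifications, not the actual CockroachDB code.)

```go
// Package sketch contrasts the two WIP approaches discussed above.
package sketch

// Approach 1: the cFetcher pushes the first key of every SQL row into the
// storage layer, even when direct columnar scans are not in use. This adds
// a call per row on the hot path.
type firstKeySetter interface {
	SetFirstKeyOfRow(key []byte)
}

// Approach 2: the cFetcher only hands out a getter; the storage layer
// invokes it lazily, and only when it actually needs to trim a partial row,
// so there is no per-row overhead in the common case.
type firstKeyOfRowGetter func() []byte

// fetcher stands in for the cFetcher state machine.
type fetcher struct {
	// Remembered when the state machine reaches stateDecodeFirstKVOfRow.
	firstKeyOfRow []byte
}

// getFirstKeyOfRow is the callback handed to the storage layer in
// approach 2.
func (f *fetcher) getFirstKeyOfRow() []byte { return f.firstKeyOfRow }
```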
pkg/storage/col_mvcc.go
line 110 at r7 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
[nit] `stable` now just depends on the implementation and is always true or false, right? Consider either making it a separate method on the interface or making it a field on the `cFetcher` itself.
Good point, extracted it into another method on the `NextKVer` interface.
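(A rough sketch of what the resulting interface shape might look like; names and signatures are illustrative only, not the real `NextKVer` definition.)

```go
// Package sketch illustrates splitting the stability flag out of NextKV.
package sketch

import "context"

// KV is a simplified key/value pair stand-in.
type KV struct {
	Key, Value []byte
}

// NextKVer is a pull-style source of KVs, loosely modeled on the interface
// discussed in this thread.
type NextKVer interface {
	// Stable returns whether the KVs returned by NextKV remain valid across
	// subsequent NextKV calls. Since the answer depends only on the
	// implementation, it is a separate method rather than a per-KV flag.
	Stable() bool
	// NextKV returns the next KV, with ok=false once the source is
	// exhausted. partialRow indicates that the last SQL row was cut short
	// and its KVs should be discarded by the caller.
	NextKV(ctx context.Context) (ok bool, partialRow bool, kv KV, err error)
}
```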
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @sumeerbhola, and @yuzefovich)
pkg/sql/colfetcher/cfetcher.go
line 812 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Thanks for pushing on this. I implemented your suggestion in the first WIP commit, but I didn't like one aspect of it - having to call `SetFirstKeyOfRow` on every row seen by the `cFetcher`, even when not using the direct columnar scans. I think it could have a noticeable performance impact, so I added another WIP commit which eliminates this function call on every SQL row and still decouples things as you suggested.
Overall, I probably like this change. My only hesitation is that now the trimming logic is split into two places - we rely on synchronizing `singleResults` returning the appropriate first key of the trimmed row and the `cFetcher` actually removing that row. However, the interfaces do seem cleaner, so I'm curious to hear what everyone thinks.
One last option - we could make the first key of the row an argument to `NextKV`.
Anyway, I think we've been pretty thorough at this point. I think I'd be happy with either of the WIPs you added. I'll do a final pass once you've finalized this bit but LGTM for now.
minor nits about comments. I am happy with the storage interfaces and code, so feel free to merge when @DrewKimball is satisfied.
Reviewed 1 of 3 files at r10.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @yuzefovich)
pkg/storage/col_mvcc.go
line 246 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
This part was very confusing - although the KV is unstable, in some cases the `cFetcher` doesn't need to hold onto it after the following `NextKV` call, so the copying is not needed (if there is only one column family in the table). The complication was that a copy needed to be performed for a very different reason in the old code path (in the `row.KVFetcher`). I refactored the code to clean things up, and now the implementation here should make more sense. In particular, we now just say that the returned KV is unstable, and the `cFetcher` itself decides whether to deep copy it or not.
Looks much cleaner now.
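(To make that decision concrete, a simplified, hypothetical sketch follows; the real logic in the `cFetcher` is more involved.)

```go
// Package sketch shows the idea of copying an unstable KV only when the
// table has multiple column families.
package sketch

type kv struct {
	key, value []byte
}

// maybeCopy returns a KV that is safe for the caller to keep. With a single
// column family the KV is fully decoded before the next NextKV call, so the
// unstable memory can be used directly; with multiple families the row
// spans several KVs, so the data must survive subsequent NextKV calls and
// a deep copy is made.
func maybeCopy(unstable kv, maxKeysPerRow int) kv {
	if maxKeysPerRow == 1 {
		return unstable
	}
	return kv{
		key:   append([]byte(nil), unstable.key...),
		value: append([]byte(nil), unstable.value...),
	}
}
```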
pkg/storage/col_mvcc.go
line 124 at r10 (raw file):
// When (ok=false,partialRow=true) is returned, the caller is expected to
// discard all KVs that were part of the last SQL row that was incomplete.
// The scan will be resumed from the last key provided by SetFirstKeyOfRow
was this supposed to say `FirstKeyOfRowGetter`?
pkg/storage/pebble_mvcc_scanner.go
line 76 at r6 (raw file):
To me, saying "before calling put() with the same key" includes both of the cases - i.e. the current getOne call hasn't called put with this key, nor have the previous getOne calls - which leads to this correct assumption: `getOne` could call `put` with key `a` and then enquire about key `b`, even though it has no intention of calling `put` with key `b` in this call. That seems like something that is permitted by this contract.
Even if it were not permitted, anything like this that is not obvious is worth spelling out in a code comment so that future readers can understand it quickly.
Can you add something here about both the contract and how it implies that nothing is buffered in the `singleResults` implementation of `results`? I don't mind comments here that refer to any of the implementations, since it helps folks understand why these constraints are being defined. Abstraction is good in terms of having clean code, but we often want to understand why this abstraction has its various invariants.
Force-pushed c6a32ca to 00362eb
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)
pkg/sql/vars.go
line 490 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
IIRC `ALTER ROLE ALL` doesn't affect the `root` user, and for logic tests I think we do use `root` often. The direct columnar scans need to be disabled for some types of logic tests, and it seems like we do need an explicit cluster setting to handle that. Perhaps we'd just need to do both `ALTER ROLE ALL` and `ALTER ROLE root`. Happy to be taught here since the choice of using the cluster setting can be driven simply by me being used to this way.
My current understanding of the situation is that we no longer allow cluster settings with the `sql.defaults` prefix, and perhaps we steer engineers away from introducing public cluster settings that also have corresponding session variables. In this case, this is a private undocumented setting, and I think it's ok to keep it that way.
That said, I did ask in Slack, and will see what the SQL Session team recommends.
pkg/sql/colfetcher/cfetcher.go
line 812 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
One last option - we could make the first key of the row an argument to `NextKV`.
Anyway, I think we've been pretty thorough at this point. I think I'd be happy with either of the WIPs you added. I'll do a final pass once you've finalized this bit but LGTM for now.
I like the getter method approach the most - passing a key when calling `NextKV` (which returns a key) seems wrong.
pkg/storage/col_mvcc.go
line 124 at r10 (raw file):
Previously, sumeerbhola wrote…
was this supposed to say `FirstKeyOfRowGetter`?
Yes, this was a leftover, fixed.
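(A hedged sketch of the handshake that the corrected comment describes; the types below are illustrative stand-ins, not the real definitions in pkg/storage/col_mvcc.go.)

```go
// Package sketch models the partial-row handshake between the storage
// layer and the cFetcher.
package sketch

// FirstKeyOfRowGetter returns the first key of the SQL row currently being
// fetched, so the storage layer can use it as the resume key when it has
// to trim an incomplete last row from the response.
type FirstKeyOfRowGetter func() []byte

// nextKVResult mirrors the (ok, partialRow) part of a NextKV return value.
type nextKVResult struct {
	ok         bool
	partialRow bool
}

// discardPartialRow models the caller's side of the contract: when NextKV
// reports (ok=false, partialRow=true), all buffered KVs of the incomplete
// last SQL row are thrown away; the scan resumes later from the key that
// the FirstKeyOfRowGetter previously supplied.
func discardPartialRow(res nextKVResult, buffered [][]byte) [][]byte {
	if !res.ok && res.partialRow {
		return buffered[:0]
	}
	return buffered
}
```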
pkg/storage/pebble_mvcc_scanner.go
line 76 at r6 (raw file):
Previously, sumeerbhola wrote…
To me, saying "before calling put() with the same key" includes both of the cases - i.e. the current getOne call hasn't called put with this key, nor have the previous getOne calls - which leads to this correct assumption: `getOne` could call `put` with key `a` and then enquire about key `b`, even though it has no intention of calling `put` with key `b` in this call. That seems like something that is permitted by this contract.
Even if it were not permitted, anything like this that is not obvious is worth spelling out in a code comment so that future readers can understand it quickly.
Can you add something here about both the contract and how it implies that nothing is buffered in the `singleResults` implementation of `results`? I don't mind comments here that refer to any of the implementations, since it helps folks understand why these constraints are being defined. Abstraction is good in terms of having clean code, but we often want to understand why this abstraction has its various invariants.
Thanks, makes sense, added this information into the comments.
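(For readers following along, a hypothetical example of the kind of contract comment being discussed; the method names are borrowed from this thread, but the exact wording in the code may differ.)

```go
// Package sketch illustrates a contract comment on a results-like
// interface.
package sketch

// results is a simplified stand-in for the interface that getOne writes
// into.
type results interface {
	// put adds a KV pair to the results.
	put(key, value []byte) error
	// continuesFirstRow returns whether the given key belongs to the same
	// SQL row as the first KV that was put. Contract note: getOne is
	// allowed to ask about a key it never intends to put in the current
	// call (e.g. put key a, then enquire about key b). For the
	// singleResults implementation this is safe because nothing is
	// buffered: the single KV handed to put is consumed before the next
	// getOne call.
	continuesFirstRow(key []byte) bool
}
```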
Reviewed 8 of 22 files at r7, 10 of 10 files at r11, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2, @sumeerbhola, and @yuzefovich)
I've got some performance numbers to share. The microbenchmarks are as expected:
The TPCH numbers, though, aren't as good:
This is a bit disappointing given the very good numbers obtained some time ago on the prototype. I did a bisect last week and couldn't reproduce those results. I'll do another bisect tomorrow, but most likely I messed something up when obtaining those numbers (although I do remember collecting them twice).
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @yuzefovich)
pkg/storage/pebble_mvcc_scanner.go
line 80 at r11 (raw file):
// returned on the NextKVer.NextKV call). Therefore, the singleResults can
// make a determination on its own whether the given key belongs to the
// first SQL row.
Now that `singleResults` is making a determination on its own, this constraint is not needed, yes?
It copies the first row eagerly in `put`, so even if that KV is still buffered, it is ok.
Force-pushed 00362eb to cbb68d0
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)
pkg/storage/pebble_mvcc_scanner.go
line 80 at r11 (raw file):
Previously, sumeerbhola wrote…
Now that `singleResults` is making a determination on its own, this constraint is not needed, yes?
It copies the first row eagerly in `put`, so even if that KV is still buffered, it is ok.
Yes, you're right. I tightened the contract here a bit further since `continuesFirstRow` is only called if `put` has already been called at least once.
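(A simplified, hypothetical sketch of the behavior described above - eagerly copying the first row's key in `put` so that later `continuesFirstRow` checks stay valid even though the KV handed to `put` is unstable; the real implementation differs in the details.)

```go
// Package sketch models the eager first-row copy in a singleResults-like
// struct.
package sketch

import "bytes"

type singleResults struct {
	// firstRowKey is a deep copy of the first row's key, taken eagerly so
	// that it remains valid after the unstable KV memory is reused.
	firstRowKey []byte
}

// put receives the current (unstable) KV. On the very first call it copies
// the key so that row-boundary checks can be answered later.
func (s *singleResults) put(key, value []byte) {
	if s.firstRowKey == nil {
		s.firstRowKey = append([]byte(nil), key...)
	}
	// Handing the single KV off to the consumer is omitted here.
}

// continuesFirstRow reports whether key belongs to the same SQL row as the
// first KV. Per the tightened contract, it is only called after put has
// been called at least once.
func (s *singleResults) continuesFirstRow(key []byte) bool {
	return bytes.Equal(rowPrefix(key), rowPrefix(s.firstRowKey))
}

// rowPrefix stands in for stripping the column-family suffix from a key; a
// real implementation would cut the key precisely, this placeholder just
// returns it unchanged.
func rowPrefix(key []byte) []byte {
	return key
}
```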
For reference, with the local fastpath for in-memory tenants (WIP in #95033) the benchmarks are here (the comparison is master vs this PR + #95033 with direct columnar scans always enabled). They show that for in-memory tenants (this is how UA will be achieved for self-hosted customers) the KV projection pushdown work already has benefits in some scenarios, and I hope with more targeted optimization work we can improve there noticeably.
Finally, I was able to reproduce the "great" results I obtained on the prototype two months ago. In short, they were due to a subtle bug in the prototype. In particular, as can be seen in #95793 (ignoring the last commit), the prototype didn't set `NumKeys` and `NumBytes` on the response after processing the Scan request via the direct columnar scans. As a result, after evaluating the Scan we didn't update the limits of the `BatchRequest`.
Once this bug is fixed (i.e. the last commit in #95793 is applied), we get results that show effectively no difference of the prototype on TPCH queries:
(It's interesting to observe that on the current polished PR there are some noticeable improvements and some noticeable regressions; I haven't explored this in depth yet, and will do so in the next milestone.) This begs the question whether we should increase these limits, since they have such a non-trivial impact on performance (TBD whether this impact is more pronounced in multi-tenant setups and whether the usage of the direct columnar scans plays a role too). Perhaps, if we set
Thus, I think there are no more blockers for this PR to be merged. It shows the expected improvement in microbenchmarks when only a subset of columns is needed in multi-tenant separate-process setups. I'll wait for Michael's approval.
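(A minimal, hypothetical illustration - not the actual KV server code - of why leaving `NumKeys`/`NumBytes` unset on a response can effectively disable the batch-wide limits for the remaining requests in the batch.)

```go
// Package sketch shows how per-response NumKeys/NumBytes feed the remaining
// batch-wide limits.
package sketch

type response struct {
	numKeys, numBytes int64
}

// evaluateOne stands in for evaluating a single request of a BatchRequest;
// it is assumed to honor the limits it is given and to report what it
// actually returned.
func evaluateOne(maxKeys, targetBytes int64) response {
	// ...evaluation omitted...
	return response{}
}

// runBatch evaluates the requests of a batch under shared key/byte limits.
// If a response (incorrectly) reports zero NumKeys/NumBytes, the remaining
// budget never shrinks, so later requests run with a much larger effective
// limit - the kind of accidental speedup described above.
func runBatch(numRequests int, maxKeys, targetBytes int64) {
	for i := 0; i < numRequests && maxKeys > 0 && targetBytes > 0; i++ {
		resp := evaluateOne(maxKeys, targetBytes)
		maxKeys -= resp.numKeys
		targetBytes -= resp.numBytes
	}
}
```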
Maybe this isn't the best place to discuss, but I will anyway :) Some memory usage in queries is hard to throttle up and down, but it seems like batch sizing isn't one of those things - the
Force-pushed cbb68d0 to 2ae0116
This sounds reasonable. However, setting a reasonable value for
Also, I just ran some quick benchmarks (using
Still, we now know for sure how those "great" numbers from the prototype were obtained.
Nice work! Thanks for waiting for me. Thank you for reworking those layer violations.
Interesting about the `NumKeys` / `NumBytes` omission causing the performance difference. Seems really promising, if we can figure out what else they affect besides `BatchRequest` limits.
Reviewed 2 of 55 files at r1, 2 of 15 files at r5, 11 of 22 files at r7, 7 of 10 files at r11, 2 of 3 files at r12, 10 of 11 files at r13, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball, @sumeerbhola, and @yuzefovich)
pkg/sql/vars.go
line 490 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
My current understanding of the situation is that we no longer allow cluster settings with the `sql.defaults` prefix, and perhaps we steer engineers away from introducing public cluster settings that also have corresponding session variables. In this case, this is a private undocumented setting, and I think it's ok to keep it that way.
That said, I did ask in Slack, and will see what the SQL Session team recommends.
TIL about `ALTER ROLE` not working for root. Ok, this is fine.
pkg/storage/col_mvcc.go
line 229 at r13 (raw file):
// The KV wasn't added for whatever reason (e.g. it could have been
// skipped over due to having been deleted), so just move on.
return f.NextKV(ctx, mvccDecodingStrategy)
nit: Using recursion here makes me a little nervous. I don't think Go has tail-call optimization, so the stack could grow deep if we have to skip over many KVs. Instead of recursion could we use either `goto` or a loop?
Force-pushed 2ae0116 to 556cb98
I needed to adjust one of the upgrade tests which was incorrectly written.
Thanks for the reviews everyone!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)
pkg/storage/col_mvcc.go
line 229 at r13 (raw file):
Previously, michae2 (Michael Erickson) wrote…
nit: Using recursion here makes me a little nervous. I don't think Go has tail-call optimization, so the stack could grow deep if we have to skip over many KVs. Instead of recursion could we use either `goto` or a loop?
I didn't see any usages of `goto` in our codebase, so I opted for the loop. However, I'm a bit concerned that the loop will have a noticeable performance impact, so I left a TODO to explore this - if it does have a noticeable impact, then I'd rather keep the recursion, since for the recursion to lead to a crash we'd need something like 1 billion KVs to be skipped, which I'm assuming should not be possible if compactions are working as intended.
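(A simplified sketch of the recursion-to-loop rewrite being discussed; the names are illustrative, not the actual code in pkg/storage/col_mvcc.go.)

```go
// Package sketch rewrites a tail-recursive "skip and retry" as a loop so
// the stack cannot grow with the number of skipped KVs.
package sketch

type kv struct {
	key, value []byte
}

// adapter is a stand-in for the NextKVer implementation; advance moves the
// underlying scanner and reports whether a KV was actually added.
type adapter struct {
	advance func() (added bool, next kv, done bool)
}

// NextKV keeps advancing until a KV is produced or the scan is done,
// instead of calling itself recursively for every skipped KV (e.g. one
// that was skipped over because it had been deleted).
func (a *adapter) NextKV() (ok bool, result kv) {
	for {
		added, next, done := a.advance()
		if done {
			return false, kv{}
		}
		if !added {
			continue
		}
		return true, next
	}
}
```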
Build succeeded.
This commit introduces a new `COL_BATCH_RESPONSE` scan format for Scans
and ReverseScans which results only in needed columns to be returned
from the KV server. In other words, this commit introduces the ability
to perform the KV projection pushdown.

The main idea of this feature is to use the injected decoding logic from
SQL in order to process each KV and keep only the needed parts (i.e.
necessary SQL columns). Those needed parts are then propagated back to
the KV client as coldata.Batch'es (serialized in the Apache Arrow format).

Here is the outline of all components involved:

```
┌────────────────────────────────────────────────┐
│                      SQL                       │
│________________________________________________│
│         colfetcher.ColBatchDirectScan          │
│                       │                        │
│                       ▼                        │
│                row.txnKVFetcher                │
│   (behind the row.KVBatchFetcher interface)    │
└────────────────────────────────────────────────┘
                        │
                        ▼
┌────────────────────────────────────────────────┐
│                   KV Client                    │
└────────────────────────────────────────────────┘
                        │
                        ▼
┌────────────────────────────────────────────────┐
│                   KV Server                    │
│________________________________________________│
│           colfetcher.cFetcherWrapper           │
│ (behind the storage.CFetcherWrapper interface) │
│                       │                        │
│                       ▼                        │
│              colfetcher.cFetcher               │
│                       │                        │
│                       ▼                        │
│          storage.mvccScanFetchAdapter ────────┐│
│    (behind the storage.NextKVer interface)    ││
│                       │                       ││
│                       ▼                       ││
│           storage.pebbleMVCCScanner           ││
│ (which put's KVs into storage.singleResults) <┘│
└────────────────────────────────────────────────┘
```

On the KV client side, `row.txnKVFetcher` issues Scans and ReverseScans
with the `COL_BATCH_RESPONSE` format and returns the response (which
contains the columnar data) to the `colfetcher.ColBatchDirectScan`.

On the KV server side, we create a `storage.CFetcherWrapper` that asks
the `colfetcher.cFetcher` for the next `coldata.Batch`. The `cFetcher`,
in turn, fetches the next KV, decodes it, and keeps only values for the
needed SQL columns, discarding the rest of the KV. The KV is emitted by
the `mvccScanFetchAdapter` which - via the `singleResults` struct -
exposes access to the current KV that the `pebbleMVCCScanner` is
pointing at.

Note that there is an additional "implicit synchronization" between
components that is not shown on this diagram. In particular,
`storage.singleResults.maybeTrimPartialLastRow` must be in sync with the
`colfetcher.cFetcher`, which is achieved by
- the `cFetcher` exposing access to the first key of the last incomplete
  SQL row via the `FirstKeyOfRowGetter`,
- the `singleResults` using that key as the resume key for the response,
- and the `cFetcher` removing that last partial SQL row when `NextKV()`
  returns `partialRow=true`.

This "upstream" link (although breaking the layering a bit) allows us to
avoid a performance penalty for handling the case with multiple column
families. (This case is handled by the `storage.pebbleResults` via
tracking offsets into the `pebbleResults.repr`.)

This code structure deserves some elaboration. First, there is a
mismatch between the "push" mode in which the `pebbleMVCCScanner`
operates and the "pull" mode that the `NextKVer` exposes. The adaptation
between the two different modes is achieved via the
`mvccScanFetcherAdapter` grabbing (when the control returns to it) the
current unstable KV pair from the `singleResults` struct, which serves
as a one KV pair buffer that the `pebbleMVCCScanner` `put`s into.

Second, in order to be able to use the unstable KV pair without
performing a copy, the `pebbleMVCCScanner` stops at the current KV pair
and returns the control flow (which is exactly what
`pebbleMVCCScanner.getOne` does) back to the `mvccScanFetcherAdapter`,
with the adapter advancing the scanner only when the next KV pair is
needed.

There are multiple scenarios which are currently not supported:
- SQL cannot issue Get requests (likely will support in 23.1)
- `TraceKV` option is not supported (likely will support in 23.1)
- user-defined types other than enums are not supported (will _not_
  support in 23.1)
- non-default key locking strength as well as SKIP LOCKED wait policy
  are not supported (will _not_ support in 23.1).

The usage of this feature is currently disabled by default, but I intend
to enable it by default for multi-tenant setups. The rationale is that
currently there is a large performance hit when enabling it for
single-tenant deployments whereas it offers significant speed up in the
multi-tenant world.

The microbenchmarks
[show](https://gist.github.com/yuzefovich/669c295a8a4fdffa6490532284c5a719)
the expected improvement in multi-tenant setups when the tenant runs in
a separate process whenever we don't need to decode all of the columns
from the table.

The TPCH numbers, though, don't show the expected speedup:

```
Q1:  before: 11.47s  after: 8.84s   -22.89%
Q2:  before: 0.41s   after: 0.29s   -27.71%
Q3:  before: 7.89s   after: 9.68s    22.63%
Q4:  before: 4.48s   after: 4.52s     0.86%
Q5:  before: 10.39s  after: 10.35s   -0.29%
Q6:  before: 33.57s  after: 33.41s   -0.48%
Q7:  before: 23.82s  after: 23.81s   -0.02%
Q8:  before: 3.78s   after: 3.76s    -0.68%
Q9:  before: 28.15s  after: 28.03s   -0.42%
Q10: before: 5.00s   after: 4.98s    -0.42%
Q11: before: 2.44s   after: 2.44s     0.22%
Q12: before: 34.78s  after: 34.65s   -0.37%
Q13: before: 3.20s   after: 2.94s    -8.28%
Q14: before: 3.13s   after: 3.21s     2.43%
Q15: before: 16.80s  after: 16.73s   -0.38%
Q16: before: 1.60s   after: 1.65s     2.96%
Q17: before: 0.85s   after: 0.96s    13.04%
Q18: before: 16.39s  after: 15.47s   -5.61%
Q19: before: 13.76s  after: 13.01s   -5.45%
Q20: before: 55.33s  after: 55.12s   -0.38%
Q21: before: 24.31s  after: 24.31s   -0.00%
Q22: before: 1.28s   after: 1.41s    10.26%
```

Note that much better numbers were obtained on a buggy prototype. Those
numbers are invalid, and more details can be found
[here](cockroachdb#94438 (comment)).

At the moment, the `coldata.Batch` that is included into the response is
always serialized into the Arrow format, but I intend to introduce the
local fastpath to avoid that serialization. That work will be done in
a follow-up and should be able to reduce the perf hit for single-tenant
deployments.

A quick note on the TODOs sprinkled in this commit:
- `TODO(yuzefovich)` means that this will be left for 23.2 or later.
- `TODO(yuzefovich, 23.1)` means that it should be addressed in 23.1.

A quick note on testing: this commit randomizes the fact whether the new
infrastructure is used in almost all test builds. Introducing some unit
testing (say, in the `storage` package) seems rather annoying since we
must create keys that are valid SQL keys (i.e. have TableID / Index ID
prefix) and need to come with the corresponding
`fetchpb.IndexFetchSpec`. Not having unit tests in `storage` seems ok to
me given that the "meat" of the work there is still done by the
`pebbleMVCCScanner`, which is exercised using the regular Scans.
End-to-end testing is well covered by all of our existing tests, which
now run randomly. I did run the CI multiple times with the new feature
enabled by default with no failure, so I hope that it shouldn't become
flaky.
Addresses: #82323.
Informs: #87610.
Epic: CRDB-14837
Release note: None