kv,sql: integrate row-level TTL reads with CPU limiter #108815
LGTM, but I'll defer to others for approval.
Reviewed 7 of 7 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @sumeerbhola)
pkg/kv/kvserver/kvadmission/kvadmission.go
line 68 at r2 (raw file):
    "controls how many CPU tokens are allotted for each ttl read request",
    10*time.Millisecond,
    func(duration time.Duration) error {

nit: consider extracting this validation function as a helper (or as a helper constructor), probably in the admission package, to be used in three places.
looks good
Reviewed 7 of 7 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, and @rafiss)
-- commits
line 29 at r2:
fwiw, one could argue that this should be elastic work even if the slot based mechanism was working better, because an improved cpu admission mechanism for regular work will not try to keep scheduling latency as low as we do for elastic work.
pkg/kv/kvserver/kvadmission/kvadmission.go
line 397 at r2 (raw file):
    // - We do the same for KV work done on the behalf of row-level TTL
    //   reads.
    isTTLRead := ba.IsReadOnly() && admissionInfo.Priority == admissionpb.TTLLowPri
I'm planning to fiddle with priorities, based on whether locks are held or not, and maybe more such changes in the future. So using the priority to decide whether something is a TTL scan is a bit brittle. One could argue that anything < UserLowPri is elastic work and ought to go through this path. Thoughts on the latter?
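The "anything below UserLowPri is elastic" rule being proposed can be captured in a few lines. The sketch below is self-contained and hedged: `WorkPriority` and its constants are illustrative stand-ins for the real `admissionpb` package (which orders TTLLowPri below UserLowPri below NormalPri), not the actual API.

```go
package main

import "fmt"

// WorkPriority stands in for admissionpb.WorkPriority; the exact values
// here are illustrative, only their relative ordering matters.
type WorkPriority int8

const (
	TTLLowPri  WorkPriority = -100
	UserLowPri WorkPriority = -50
	NormalPri  WorkPriority = 0
)

// shouldUseElasticCPU captures the suggested rule: rather than matching
// TTLLowPri exactly (brittle if priorities are later reshuffled), treat
// any read-only request below UserLowPri as elastic work that goes
// through the elastic CPU limiter.
func shouldUseElasticCPU(isReadOnly bool, pri WorkPriority) bool {
	return isReadOnly && pri < UserLowPri
}

func main() {
	fmt.Println(shouldUseElasticCPU(true, TTLLowPri))  // TTL reads qualify
	fmt.Println(shouldUseElasticCPU(false, TTLLowPri)) // writes do not
	fmt.Println(shouldUseElasticCPU(true, NormalPri))  // normal-pri reads do not
}
```

The threshold comparison is what makes this robust to future priority additions: any newly introduced sub-UserLowPri priority automatically takes the elastic path.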
pkg/kv/kvserver/kvadmission/kvadmission.go
line 412 at r2 (raw file):
    elasticWorkHandle, err := n.elasticCPUGrantCoordinator.ElasticCPUWorkQueue.Admit(
        ctx, admitDuration, admissionInfo,
    )

we're doing the initial admission, but the corresponding request processing code is not doing admission.ElasticCPUWorkHandleFromContext(ctx) in order to return if it consumes too much time. I suppose it isn't harming us because these requests are short lived, but I think this needs a code comment and a TODO.
pkg/sql/row/kv_batch_fetcher.go
line 376 at r2 (raw file):
        CreateTime: txn.AdmissionHeader().CreateTime,
    })
}
this code is duplicated. Can you pull it into a method, and call it TryInitAdmissionPacing (or something like that). And include a comment that for now we are only doing this for TTL.
pkg/sql/row/kv_batch_fetcher.go
line 621 at r2 (raw file):
    // Do admission control after we've accounted for the response bytes.
    if br != nil {
This admission block is worth pulling out into a method too, and having a comment in that method implementation on why we are doing one or the other.
Also we need a code comment justifying why this pacing is effective for TTL work. Please work in some of the comments from the slack thread and maybe post the profiles onto the github issue so that the callstacks can be referenced here. I am assuming this is working because the bulk of the SQL CPU consumption for TTL work is happening on the goroutine that is driving this fetch.
And how will this be robust to future changes in the SQL code? This problem is not unique to this TTL instrumentation, so hopefully @yuzefovich can comment on how someone (perhaps sql-queries) could add unit tests. Also, I'm not suggesting it happen in this PR, but I worry about brittleness since:
- The admission roachtests typically don't have pass/fail conclusions, so we may not notice if TTL isolation breaks because the instrumentation point is no longer relevant.
- We have similar response path instrumentation in streamer.go. For some reason the TTL code is using txnKVFetcher and not that code, but presumably that could change.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @sumeerbhola)
pkg/sql/row/kv_batch_fetcher.go
line 621 at r2 (raw file):
I am assuming this is working because the bulk of the SQL CPU consumption for TTL work is happening on the goroutine that is driving this fetch.
Yes, I believe that's the case because the SELECTs done by the TTL job are not distributed at SQL level (since our DistSQL physical planning heuristics deem it not worthy of distribution), and with the local plan we only have a single goroutine (unless maybeParallelizeLocalScans splits up the single scan into multiple TableReader processors).
And how will this be robust to future changes in the SQL code? This problem is not unique to this TTL instrumentation, so hopefully @yuzefovich can comment on how someone (perhaps sql-queries) could add unit tests.
If you're wondering how we'd verify that TTL code still benefits from this elastic CPU pacing, I'd consider adding tests for the SELECT queries issued by the TTL to ensure that they have local plans with a single TableReader processor in multi-node clusters. The assumption that such plans have a single goroutine is pretty fundamental and won't be changed (until the TableReaders are powered by the streamer #82164).
We have similar response path instrumentation in streamer.go. For some reason the TTL code is using txnKVFetcher and not that code, but presumably that could change.
Yes, that's a good point. This will change once #82164 is implemented, but that's not going to happen in 23.2. The streamer does have many goroutines involved, and the CPU-intensive SQL work will happen on a different goroutine from the ones that evaluate the BatchRequests, so the integration will be much trickier there. However, we'll always have an option to disable the usage of the streamer in the TTL SELECT queries to preserve this elastic pacing.
Force-pushed from 96e69b9 to 0cdc0e7
PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @rafiss, @sumeerbhola, and @yuzefovich)
Previously, sumeerbhola wrote…
fwiw, one could argue that this should be elastic work even if the slot based mechanism was working better, because an improved cpu admission mechanism for regular work will not try to keep scheduling latency as low as we do for elastic work.
Edited the sentence earlier to reflect this.
pkg/kv/kvserver/kvadmission/kvadmission.go
line 68 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: consider extracting this validation function as a helper (or as a helper constructor), probably in the admission package, to be used in three places.
Good idea, done.
pkg/kv/kvserver/kvadmission/kvadmission.go
line 397 at r2 (raw file):
Previously, sumeerbhola wrote…
I'm planning to fiddle with priorities, based on whether locks are held or not, and maybe more such changes in the future. So using the priority to decide whether something is a TTL scan is a bit brittle. One could argue that anything < UserLowPri is elastic work and ought to go through this path. Thoughts on the latter?
Just saw the other PR. I was initially trying to be very targeted here, trying to affect only TTL reads, but applying it to < UserLowPri makes sense. Done.
pkg/kv/kvserver/kvadmission/kvadmission.go
line 412 at r2 (raw file):
Previously, sumeerbhola wrote…
we're doing the initial admission, but the corresponding request processing code is not doing admission.ElasticCPUWorkHandleFromContext(ctx) in order to return if it consumes too much time. I suppose it isn't harming us because these requests are short lived, but I think this needs a code comment and a TODO.
We are using admission.ElasticCPUWorkHandleFromContext(ctx) in order to return unused tokens. See below where we populate ah.elasticCPUWorkHandle and the calling code in pkg/server/node.go. The unused CPU nanos get returned when calling AdmittedKVWorkDone.
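The token accounting described in this reply can be sketched in miniature. This is a hypothetical model, not the actual granter implementation: admission deducts the full allotment up front, and the completion path refunds whatever went unused.

```go
package main

import "fmt"

// elasticCPUGranter is a hypothetical stand-in for the granter behind
// the elastic CPU work queue.
type elasticCPUGranter struct {
	availableNanos int64
}

// admit deducts the full allotment when the request is admitted.
func (g *elasticCPUGranter) admit(allottedNanos int64) {
	g.availableNanos -= allottedNanos
}

// workDone refunds the unused portion of the allotment once the request
// finishes, mirroring the "unused CPU nanos get returned when calling
// AdmittedKVWorkDone" behavior described above.
func (g *elasticCPUGranter) workDone(allottedNanos, consumedNanos int64) {
	if unused := allottedNanos - consumedNanos; unused > 0 {
		g.availableNanos += unused
	}
}

func main() {
	g := &elasticCPUGranter{availableNanos: 100}
	g.admit(10)       // 10ns allotment granted up front
	g.workDone(10, 4) // request only consumed 4ns; 6ns refunded
	fmt.Println(g.availableNanos)
}
```

This refund is why up-front over-allotment (10ms per read) is cheap: short requests return most of their grant, so only actual CPU consumption is charged against the elastic budget.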
pkg/sql/row/kv_batch_fetcher.go
line 376 at r2 (raw file):
Previously, sumeerbhola wrote…
this code is duplicated. Can you pull it into a method, and call it TryInitAdmissionPacing (or something like that). And include a comment that for now we are only doing this for TTL.
Done.
pkg/sql/row/kv_batch_fetcher.go
line 621 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I am assuming this is working because the bulk of the SQL CPU consumption for TTL work is happening on the goroutine that is driving this fetch.
Yes, I believe that's the case because the SELECTs done by the TTL job are not distributed at SQL level (since our DistSQL physical planning heuristics deem it not worthy of distribution), and with the local plan we only have a single goroutine (unless maybeParallelizeLocalScans splits up the single scan into multiple TableReader processors).

And how will this be robust to future changes in the SQL code? This problem is not unique to this TTL instrumentation, so hopefully @yuzefovich can comment on how someone (perhaps sql-queries) could add unit tests.
If you're wondering how we'd verify that TTL code still benefits from this elastic CPU pacing, I'd consider adding tests for the SELECT queries issued by the TTL to ensure that they have local plans with a single TableReader processor in multi-node clusters. The assumption that such plans have a single goroutine is pretty fundamental and won't be changed (until the TableReaders are powered by the streamer #82164).
We have similar response path instrumentation in streamer.go. For some reason the TTL code is using txnKVFetcher and not that code, but presumably that could change.

Yes, that's a good point. This will change once #82164 is implemented, but that's not going to happen in 23.2. The streamer does have many goroutines involved, and the CPU-intensive SQL work will happen on a different goroutine from the ones that evaluate the BatchRequests, so the integration will be much trickier there. However, we'll always have an option to disable the usage of the streamer in the TTL SELECT queries to preserve this elastic pacing.
Added a comment, TODOs, and a back reference in #82164 to this thread. Posting some internal experimental results over to #98722 for reference-ability.
Force-pushed from 745df70 to b373890
Reviewed 10 of 10 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @sumeerbhola)
pkg/util/admission/elastic_cpu_work_queue.go
line 34 at r3 (raw file):
    // ValidateElasticCPUDuration is a validation function for the {minimum,maximum}
    // on-CPU time elastic requests can ask for when seeking admission.
    func ValidateElasticCPUDuration(duration time.Duration) error {
nit: I think this is no longer needed (and unused) given Raphael's changes to settings API.
Force-pushed from b373890 to 2c22647
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @rafiss, @sumeerbhola, and @yuzefovich)
pkg/util/admission/elastic_cpu_work_queue.go
line 34 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I think this is no longer needed (and unused) given Raphael's changes to settings API.
Removed.
Force-pushed from 2c22647 to 3e54e99
Reviewed 8 of 10 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @yuzefovich)
Previously, irfansharif (irfan sharif) wrote…
Edited the sentence earlier to reflect this.
Did you mean to collapse into 1 commit? The second commit description has been lost.
pkg/cmd/roachtest/tests/admission_control_row_level_ttl.go
line 97 at r4 (raw file):
    // The TPCC fixtures have dates from 2006 for the ol_delivery_d column.
    expirationExpr = `'((ol_delivery_d::TIMESTAMP) + INTERVAL ''1000 years'') AT TIME ZONE ''UTC'''`
}
Can you add a TODO for what metric(s) we should read and aggregate to check that AC is working as expected.
pkg/kv/kvserver/kvadmission/kvadmission.go
line 412 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
We are using admission.ElasticCPUWorkHandleFromContext(ctx) in order to return unused tokens. See below where we populate ah.elasticCPUWorkHandle and the calling code in pkg/server/node.go. The unused CPU nanos get returned when calling AdmittedKVWorkDone.
Yes, I realize that. I was referring to the fact that there is nothing in the processing code (like we have in mvcc.go for ExportRequest), that returns if 10ms has been exceeded. So in some sense the integration is incomplete, even though it is probably harmless.
pkg/kv/kvserver/kvadmission/kvadmission.go
line 365 at r4 (raw file):
    //   isolation to non-elastic ("latency sensitive") work running on the
    //   same machine.
    // - We do the same internally submitted low priority reads in general
nit: same for internally
pkg/sql/row/kv_batch_fetcher.go
line 69 at r4 (raw file):
    settings.SystemOnly,
    "sqladmission.elastic_cpu.duration_per_low_pri_read_response",
    "controls how many CPU tokens are allotted for handling responses for internally submitted low priority reads",
kvadmission.elastic_cpu.duration_per_low_pri_read is 10ms while this is 100ms. Can you add some code commentary on why the difference? Presumably there was some experimental basis for this.
pkg/sql/row/kv_batch_fetcher.go
line 75 at r4 (raw file):
    // internalLowPriReadElasticControlEnabled determines whether the sql portion of
    // internally submitted low-priorority reads (like row-level TTL selects)
nit: priority
pkg/sql/row/kv_batch_fetcher.go
line 350 at r4 (raw file):
    ) {
        if sv == nil {
            return // only nil in tests
and in SQL pods?
pkg/sql/row/kv_batch_fetcher.go
line 650 at r4 (raw file):
    if f.admissionPacer != nil {
        // If admissionPacer is initialized, we're using the elastic CPU control
        // mechanism (the work is elastic in nature using the slots based
nit: nature and using
Rename + own as an AC integration test, similar to the ones we have for backup/changefeeds/etc. We'll integrate row-level TTL reads in the subsequent commit.

Part of cockroachdb#89208.

Release note: None
Part of cockroachdb#98722. We do it at two levels, each appearing prominently in CPU profiles:

- Down in KV, where we're handling batch requests issued as part of row-level TTL selects. This is gated by kvadmission.low_pri_read_elastic_control.enabled.
- Up in SQL, when handling KV responses to said batch requests. This is gated by sqladmission.low_pri_read_response_elastic_control.enabled.

Similar to backups, rangefeed initial scans, and changefeed event processing, we've observed latency impact during CPU-intensive scans issued as part of row-level TTL jobs. We know from before that the existing slots-based mechanism for CPU work can result in excessive scheduling latency for high-pri work in the presence of lower-pri work, affecting end-user latencies. This is because the slots mechanism aims for full utilization of the underlying resource, which is incompatible with low scheduling latencies. This commit then tries to control the total CPU% used by row-level TTL selects through the elastic CPU limiter.

For the KV work, this was trivial -- we already have integrations at the batch request level, and now we pick out requests with priorities less than admissionpb.UserLowPri, which includes admissionpb.TTLLowPri.

For the SQL portion of the work we introduce some minimal plumbing. Where previously we sought admission in the SQL-KV response queues after fetching each batch of KVs from KV as part of our volcano operator iteration, we now incrementally acquire CPU nanos. We do this specifically for row-level TTL work. Experimentally, the CPU nanos we acquire here map roughly to the CPU utilization due to SQL work for row-level TTL selects. (Note that we apply the elastic CPU limiter for all reads with priorities less than admissionpb.UserLowPri. This is typically internally submitted reads, and includes row-level TTL selects.)

Release note: None
Force-pushed from 3e54e99 to 4451735
TFTR!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @rafiss, @sumeerbhola, and @yuzefovich)
Previously, sumeerbhola wrote…
Did you mean to collapse into 1 commit? The second commit description has been lost.
Oops, separated back out.
pkg/cmd/roachtest/tests/admission_control_row_level_ttl.go
line 97 at r4 (raw file):
Previously, sumeerbhola wrote…
Can you add a TODO for what metric(s) we should read and aggregate to check that AC is working as expected.
Done.
pkg/kv/kvserver/kvadmission/kvadmission.go
line 412 at r2 (raw file):
Previously, sumeerbhola wrote…
Yes, I realize that. I was referring to the fact that there is nothing in the processing code (like we have in mvcc.go for ExportRequest), that returns if 10ms has been exceeded. So in some sense the integration is incomplete, even though it is probably harmless.
Ah, understood. Yes it should be harmless, but I added a TODO.
pkg/sql/row/kv_batch_fetcher.go
line 69 at r4 (raw file):
Previously, sumeerbhola wrote…
kvadmission.elastic_cpu.duration_per_low_pri_read is 10ms while this is 100ms. Can you add some code commentary on why the difference? Presumably there was some experimental basis for this.
Done.
pkg/sql/row/kv_batch_fetcher.go
line 350 at r4 (raw file):
Previously, sumeerbhola wrote…
and in SQL pods?
Done.
Build succeeded.
This was backported to 23.1.9 and 22.2.14. The behavior is off-by-default in those releases. To enable, use the following hidden cluster settings:
In v23.2, this behavior is on-by-default, so those cluster settings are no longer needed.
Part of #98722.
We do it at two levels, each appearing prominently in CPU profiles:

- Down in KV, where we're handling batch requests issued as part of row-level TTL selects. This is gated by kvadmission.low_pri_read_elastic_control.enabled.
- Up in SQL, when handling KV responses to said batch requests. This is gated by sqladmission.low_pri_read_response_elastic_control.enabled.
Similar to backups, rangefeed initial scans, and changefeed event processing, we've observed latency impact during CPU-intensive scans issued as part of row-level TTL jobs. We know from before that the existing slots-based mechanism for CPU work can result in excessive scheduling latency for high-pri work in the presence of lower-pri work, affecting end-user latencies. This commit then tries to control the total CPU% used by row-level TTL selects through the elastic CPU limiter.

For the KV work, this was trivial -- we already have integrations at the batch request level, and now we pick out requests with priorities less than admissionpb.UserLowPri, which includes admissionpb.TTLLowPri.
For the SQL portion of the work we introduce some minimal plumbing. Where previously we sought admission in the SQL-KV response queues after fetching each batch of KVs from KV as part of our volcano operator iteration, we now incrementally acquire CPU nanos. We do this specifically for row-level TTL work. Experimentally the CPU nanos we acquire here map roughly to the CPU utilization due to SQL work for row-level TTL selects.
Release note: None