kv,sql: integrate row-level TTL reads with CPU limiter #108815

Merged
merged 2 commits into cockroachdb:master from 230810.ttl-ac on Aug 22, 2023

Conversation

Contributor

@irfansharif irfansharif commented Aug 15, 2023

Part of #98722.

We do it at two levels, each appearing prominently in CPU profiles:

  • Down in KV, where we're handling batch requests issued as part of row-level TTL selects. This is gated by kvadmission.low_pri_read_elastic_control.enabled.
  • Up in SQL, when handling KV responses to said batch requests. This is gated by sqladmission.low_pri_read_response_elastic_control.enabled.

Similar to backups, rangefeed initial scans, and changefeed event processing, we've observed latency impact during CPU-intensive scans issued as part of row-level TTL jobs. We know from before that the existing slot-based mechanism for CPU work can result in excessive scheduling latency for high-pri work in the presence of lower-pri work, affecting end-user latencies. This commit then tries to control the total CPU% used by row-level TTL selects through the elastic CPU limiter. For the KV work, this was trivial -- we already have integrations at the batch request level and now we pick out requests with the admissionpb.TTLLowPri bit set.

For the SQL portion of the work we introduce some minimal plumbing. Where previously we sought admission in the SQL-KV response queues after fetching each batch of KVs from KV as part of our volcano operator iteration, we now incrementally acquire CPU nanos. We do this specifically for row-level TTL work. Experimentally the CPU nanos we acquire here map roughly to the CPU utilization due to SQL work for row-level TTL selects.
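(For illustration only: a minimal, self-contained sketch of the pacing idea described above. The names here -- cpuPacer, pace -- are made up for the example and are not the actual kvadmission/admission APIs.)

```go
package main

import (
	"fmt"
	"time"
)

// cpuPacer is an illustrative stand-in for the pacing done on the SQL response
// path: instead of queueing once per KV batch response, the caller periodically
// acquires a small allotment of on-CPU time and draws it down as it works.
type cpuPacer struct {
	allotment time.Duration // CPU time granted per admission
	remaining time.Duration // unused portion of the current grant
	grants    int           // number of grants handed out (for the demo)
}

// pace is called before processing each KV response. When the budget runs out
// it asks for more; in the real integration that is the admission point where
// low-priority work would wait if the CPU is contended.
func (p *cpuPacer) pace(estimated time.Duration) {
	if p.remaining < estimated {
		p.remaining += p.allotment // hypothetical admission: acquire more CPU nanos
		p.grants++
	}
	p.remaining -= estimated
}

func main() {
	p := &cpuPacer{allotment: 100 * time.Millisecond}
	for i := 0; i < 50; i++ {
		p.pace(10 * time.Millisecond) // pretend each response costs ~10ms of CPU
	}
	fmt.Printf("grants: %d, budget left: %s\n", p.grants, p.remaining)
}
```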

Release note: None

@irfansharif irfansharif requested review from a team as code owners August 15, 2023 21:06
@irfansharif irfansharif requested review from rachitgsrivastava and DarrylWong and removed request for a team August 15, 2023 21:06
@irfansharif irfansharif changed the title from "kv,sql: integrate row-level TTL reads with CPU limiterj" to "kv,sql: integrate row-level TTL reads with CPU limiter" Aug 15, 2023
@cockroach-teamcity
Member

This change is Reviewable

Member

@yuzefovich yuzefovich left a comment

LGTM, but I'll defer to others for approval.

Reviewed 7 of 7 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @sumeerbhola)


pkg/kv/kvserver/kvadmission/kvadmission.go line 68 at r2 (raw file):

	"controls how many CPU tokens are allotted for each ttl read request",
	10*time.Millisecond,
	func(duration time.Duration) error {

nit: consider extracting this validation function as a helper (or as a helper constructor), probably in admission package, to be used in three places.

Collaborator

@sumeerbhola sumeerbhola left a comment

looks good

Reviewed 7 of 7 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, and @rafiss)


-- commits line 29 at r2:
fwiw, one could argue that this should be elastic work even if the slot-based mechanism were working better, because an improved CPU admission mechanism for regular work will not try to keep scheduling latency as low as we do for elastic work.


pkg/kv/kvserver/kvadmission/kvadmission.go line 397 at r2 (raw file):

		// - We do the same for KV work done on the behalf of row-level TTL
		//   reads.
		isTTLRead := ba.IsReadOnly() && admissionInfo.Priority == admissionpb.TTLLowPri

I'm planning to fiddle with priorities, based on whether locks are held or not, and maybe more such changes in the future. So using the priority to decide whether something is a TTL scan is a bit brittle. One could argue that anything < UserLowPri is elastic work and ought to go through this path. Thoughts on the latter?
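(Illustrative sketch of the broadened check, with made-up priority values -- the real constants live in admissionpb:)

```go
package main

import "fmt"

// pri mimics an admission priority; the numeric values below are invented for
// the example and do not match admissionpb.
type pri int8

const (
	ttlLowPri  pri = -20 // e.g. row-level TTL reads
	userLowPri pri = -10
	normalPri  pri = 0
)

// isElasticRead is the broadened predicate: any read-only batch below
// UserLowPri is treated as elastic work and routed through the CPU limiter,
// rather than matching the TTL priority exactly.
func isElasticRead(isReadOnly bool, p pri) bool {
	return isReadOnly && p < userLowPri
}

func main() {
	fmt.Println(isElasticRead(true, ttlLowPri)) // true: elastic CPU limiter
	fmt.Println(isElasticRead(true, normalPri)) // false: regular admission
}
```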


pkg/kv/kvserver/kvadmission/kvadmission.go line 412 at r2 (raw file):

			elasticWorkHandle, err := n.elasticCPUGrantCoordinator.ElasticCPUWorkQueue.Admit(
				ctx, admitDuration, admissionInfo,
			)

we're doing the initial admission, but the corresponding request processing code is not doing admission.ElasticCPUWorkHandleFromContext(ctx) in order to return if it consumes too much time. I suppose it isn't harming us because these requests are short lived, but I think this needs a code comment and a TODO.
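(Roughly the pattern being asked for, sketched with stand-in types -- not the actual admission package API:)

```go
package main

import (
	"fmt"
	"time"
)

// workHandle stands in for the elastic CPU work handle; the real one is pulled
// from the context (admission.ElasticCPUWorkHandleFromContext) and tracks
// on-CPU time rather than wall time.
type workHandle struct {
	start   time.Time
	granted time.Duration
}

func (h *workHandle) overLimit() bool {
	return h != nil && time.Since(h.start) > h.granted
}

// process sketches the resumption pattern: check the handle periodically and
// return early with a resume point once the grant is used up, the way
// long-running requests (e.g. ExportRequest in mvcc.go) do.
func process(h *workHandle, rows int) (processed int) {
	for i := 0; i < rows; i++ {
		if h.overLimit() {
			return i // caller re-admits and resumes from here
		}
		time.Sleep(time.Millisecond) // pretend each row costs ~1ms
	}
	return rows
}

func main() {
	h := &workHandle{start: time.Now(), granted: 10 * time.Millisecond}
	fmt.Println("rows processed before yielding:", process(h, 100))
}
```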


pkg/sql/row/kv_batch_fetcher.go line 376 at r2 (raw file):

				CreateTime: txn.AdmissionHeader().CreateTime,
			})
	}

this code is duplicated. Can you pull it into a method, and call it TryInitAdmissionPacing (or something like that). And include a comment that for now we are only doing this for TTL.


pkg/sql/row/kv_batch_fetcher.go line 621 at r2 (raw file):

	// Do admission control after we've accounted for the response bytes.

	if br != nil {

This admission block is worth pulling out into a method too, and having a comment in that method implementation on why we are doing one or the other.

Also we need a code comment justifying why this pacing is effective for TTL work. Please work in some of the comments from the slack thread and maybe post the profiles onto the github issue so that the callstacks can be referenced here. I am assuming this is working because the bulk of the SQL CPU consumption for TTL work is happening on the goroutine that is driving this fetch.

And how will this be robust to future changes in the SQL code? This problem is not unique to this TTL instrumentation, so hopefully @yuzefovich can comment on how someone (perhaps sql-queries) could add unit tests. Also, I'm not suggesting it happen in this PR, but I worry about brittleness since:

  • The admission roachtests typically don't have pass/fail conclusions, so we may not notice if TTL isolation breaks because the instrumentation point is no longer relevant.
  • We have similar response path instrumentation in streamer.go. For some reason the TTL code is using txnKVFetcher and not that code, but presumably that could change.

Member

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @sumeerbhola)


pkg/sql/row/kv_batch_fetcher.go line 621 at r2 (raw file):

I am assuming this is working because the bulk of the SQL CPU consumption for TTL work is happening on the goroutine that is driving this fetch.

Yes, I believe that's the case because the SELECTs done by the TTL job are not distributed at the SQL level (since our DistSQL physical planning heuristics deem it not worthy of distribution), and with the local plan we only have a single goroutine (unless maybeParallelizeLocalScans splits up the single scan into multiple TableReader processors).

And how will this be robust to future changes in the SQL code? This problem is not unique to this TTL instrumentation, so hopefully @yuzefovich can comment on how someone (perhaps sql-queries) could add unit tests.

If you're wondering how we'd verify that TTL code still benefits from this elastic CPU pacing, I'd consider adding tests for the SELECT queries issued by the TTL to ensure that they have local plans with a single TableReader processor in multi-node clusters. The assumption that such plans have a single goroutine is pretty fundamental and won't be changed (until the TableReaders are powered by the streamer #82164).

We have similar response path instrumentation in streamer.go. For some reason the TTL code is using txnKVFetcher and not that code, but presumably that could change.

Yes, that's a good point. This will change once #82164 is implemented, but that's not going to happen in 23.2. The streamer does have many goroutines involved, and the CPU-intensive SQL work will happen on a different goroutine from the ones that evaluate the BatchRequests, so the integration will be much trickier there. However, we'll always have an option to disable the usage of the streamer in the TTL SELECT queries to preserve this elastic pacing.

Contributor Author

@irfansharif irfansharif left a comment

PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @rafiss, @sumeerbhola, and @yuzefovich)


-- commits line 29 at r2:

Previously, sumeerbhola wrote…

fwiw, one could argue that this should be elastic work even if the slot-based mechanism were working better, because an improved CPU admission mechanism for regular work will not try to keep scheduling latency as low as we do for elastic work.

Edited the sentence earlier to reflect this.


pkg/kv/kvserver/kvadmission/kvadmission.go line 68 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: consider extracting this validation function as a helper (or as a helper constructor), probably in admission package, to be used in three places.

Good idea, done.


pkg/kv/kvserver/kvadmission/kvadmission.go line 397 at r2 (raw file):

Previously, sumeerbhola wrote…

I'm planning to fiddle with priorities, based on whether locks are held or not, and maybe more such changes in the future. So using the priority to decide whether something is a TTL scan is a bit brittle. One could argue that anything < UserLowPri is elastic work and ought to go through this path. Thoughts on the latter?

Just saw the other PR. I was initially trying to be very targeted here, affecting only TTL reads, but applying it to < UserLowPri makes sense. Done.


pkg/kv/kvserver/kvadmission/kvadmission.go line 412 at r2 (raw file):

Previously, sumeerbhola wrote…

we're doing the initial admission, but the corresponding request processing code is not doing admission.ElasticCPUWorkHandleFromContext(ctx) in order to return if it consumes too much time. I suppose it isn't harming us because these requests are short lived, but I think this needs a code comment and a TODO.

We are using admission.ElasticCPUWorkHandleFromContext(ctx) in order to return unused tokens. See below where we populate ah.elasticCPUWorkHandle and the calling code in pkg/server/node.go. The unused CPU nanos get returned when calling AdmittedKVWorkDone.
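(A self-contained sketch of the grant-and-return-unused flow described here; the names are illustrative, not the actual handle API:)

```go
package main

import (
	"fmt"
	"time"
)

// grant stands in for the handle populated at admission time: KV work is
// admitted with a fixed allotment of on-CPU time, and whatever it does not
// use is credited back when the work completes (the AdmittedKVWorkDone step).
type grant struct {
	allotted time.Duration
	started  time.Time
}

// done returns the unused portion of the allotment, which the real code hands
// back to the elastic CPU limiter for other elastic work to use.
func (g *grant) done() time.Duration {
	if used := time.Since(g.started); used < g.allotted {
		return g.allotted - used
	}
	return 0
}

func main() {
	g := &grant{allotted: 10 * time.Millisecond, started: time.Now()}
	time.Sleep(2 * time.Millisecond) // the batch request finishes quickly
	fmt.Printf("returned %s of unused CPU time\n", g.done())
}
```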


pkg/sql/row/kv_batch_fetcher.go line 376 at r2 (raw file):

Previously, sumeerbhola wrote…

this code is duplicated. Can you pull it into a method, and call it TryInitAdmissionPacing (or something like that). And include a comment that for now we are only doing this for TTL.

Done.


pkg/sql/row/kv_batch_fetcher.go line 621 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I am assuming this is working because the bulk of the SQL CPU consumption for TTL work is happening on the goroutine that is driving this fetch.

Yes, I believe that's the case because the SELECTs done by the TTL job are not distributed at the SQL level (since our DistSQL physical planning heuristics deem it not worthy of distribution), and with the local plan we only have a single goroutine (unless maybeParallelizeLocalScans splits up the single scan into multiple TableReader processors).

And how will this be robust to future changes in the SQL code? This problem is not unique to this TTL instrumentation, so hopefully @yuzefovich can comment on how someone (perhaps sql-queries) could add unit tests.

If you're wondering how we'd verify that TTL code still benefits from this elastic CPU pacing, I'd consider adding tests for the SELECT queries issued by the TTL to ensure that they have local plans with a single TableReader processor in multi-node clusters. The assumption that such plans have a single goroutine is pretty fundamental and won't be changed (until the TableReaders are powered by the streamer #82164).

We have similar response path instrumentation in streamer.go. For some reason the TTL code is using txnKVFetcher and not that code, but presumably that could change.

Yes, that's a good point. This will change once #82164 is implemented, but that's not going to happen in 23.2. The streamer does have many goroutines involved, and the CPU-intensive SQL work will happen on a different goroutine from the ones that evaluate the BatchRequests, so the integration will be much trickier there. However, we'll always have an option to disable the usage of the streamer in the TTL SELECT queries to preserve this elastic pacing.

Added a comment, TODOs, and a back reference in #82164 to this thread. Posting some internal experimental results over to #98722 for reference-ability.

Member

@yuzefovich yuzefovich left a comment

Reviewed 10 of 10 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @sumeerbhola)


pkg/util/admission/elastic_cpu_work_queue.go line 34 at r3 (raw file):

// ValidateElasticCPUDuration is a validation function for the {minimum,maximum}
// on-CPU time elastic requests can ask for when seeking admission.
func ValidateElasticCPUDuration(duration time.Duration) error {

nit: I think this is no longer needed (and unused) given Raphael's changes to settings API.

Contributor Author

@irfansharif irfansharif left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @rafiss, @sumeerbhola, and @yuzefovich)


pkg/util/admission/elastic_cpu_work_queue.go line 34 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I think this is no longer needed (and unused) given Raphael's changes to settings API.

Removed.

Collaborator

@sumeerbhola sumeerbhola left a comment

:lgtm:

Reviewed 8 of 10 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @yuzefovich)


-- commits line 29 at r2:

Previously, irfansharif (irfan sharif) wrote…

Edited the sentence earlier to reflect this.

Did you mean to collapse into 1 commit? The second commit description has been lost.


pkg/cmd/roachtest/tests/admission_control_row_level_ttl.go line 97 at r4 (raw file):

						// The TPCC fixtures have dates from 2006 for the ol_delivery_d column.
						expirationExpr = `'((ol_delivery_d::TIMESTAMP) + INTERVAL ''1000 years'') AT TIME ZONE ''UTC'''`
					}

Can you add a TODO for what metric(s) we should read and aggregate to check that AC is working as expected.


pkg/kv/kvserver/kvadmission/kvadmission.go line 412 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We are using admission.ElasticCPUWorkHandleFromContext(ctx) in order to return unused tokens. See below where we populate ah.elasticCPUWorkHandle and the calling code in pkg/server/node.go. The unused CPU nanos get returned when calling AdmittedKVWorkDone.

Yes, I realize that. I was referring to the fact that there is nothing in the processing code (like we have in mvcc.go for ExportRequest) that returns if 10ms has been exceeded. So in some sense the integration is incomplete, even though it is probably harmless.


pkg/kv/kvserver/kvadmission/kvadmission.go line 365 at r4 (raw file):

		//   isolation to non-elastic ("latency sensitive") work running on the
		//   same machine.
		// - We do the same internally submitted low priority reads in general

nit: same for internally


pkg/sql/row/kv_batch_fetcher.go line 69 at r4 (raw file):

	settings.SystemOnly,
	"sqladmission.elastic_cpu.duration_per_low_pri_read_response",
	"controls how many CPU tokens are allotted for handling responses for internally submitted low priority reads",

kvadmission.elastic_cpu.duration_per_low_pri_read is 10ms while this is 100ms. Can you add some code commentary on why the difference? Presumably there was some experimental basis for this.


pkg/sql/row/kv_batch_fetcher.go line 75 at r4 (raw file):

// internalLowPriReadElasticControlEnabled determines whether the sql portion of
// internally submitted low-priorority reads (like row-level TTL selects)

nit: priority


pkg/sql/row/kv_batch_fetcher.go line 350 at r4 (raw file):

) {
	if sv == nil {
		return // only nil in tests

and in SQL pods?


pkg/sql/row/kv_batch_fetcher.go line 650 at r4 (raw file):

	if f.admissionPacer != nil {
		// If admissionPacer is initialized, we're using the elastic CPU control
		// mechanism (the work is elastic in nature using the slots based

nit: nature and using

The two commit messages:

Rename + own as an AC integration test, similar to ones we have for
backup/changefeeds/etc. We'll integrate row-level TTL reads in the
subsequent commit

Part of cockroachdb#89208.

Release note: None

Part of cockroachdb#98722.

We do it at two levels, each appearing prominently in CPU profiles:

- Down in KV, where we're handling batch requests issued as part of
  row-level TTL selects. This is gated by
  kvadmission.low_pri_read_elastic_control.enabled.
- Up in SQL, when handling KV responses to said batch requests. This is
  gated by sqladmission.low_pri_read_response_elastic_control.enabled.

Similar to backups, rangefeed initial scans, and changefeed event
processing, we've observed latency impact during CPU-intensive scans
issued as part of row-level TTL jobs. We know from before that the
existing slot-based mechanism for CPU work can result in excessive
scheduling latency for high-pri work in the presence of lower-pri
work, affecting end-user latencies. This is because the slot-based
mechanism aims for full utilization of the underlying resource, which
is incompatible with low scheduling latencies. This commit then tries to
control the total CPU% used by row-level TTL selects through the elastic
CPU limiter. For the KV work, this was trivial -- we already have
integrations at the batch request level and now we pick out requests
with priorities less than admissionpb.UserLowPri, which includes
admissionpb.TTLLowPri.

For the SQL portion of the work we introduce some minimal plumbing.
Where previously we sought admission in the SQL-KV response queues after
fetching each batch of KVs from KV as part of our volcano operator
iteration, we now incrementally acquire CPU nanos. We do this
specifically for row-level TTL work. Experimentally the CPU nanos we
acquire here map roughly to the CPU utilization due to SQL work for
row-level TTL selects.

(Note that we apply the elastic CPU limiter to all reads with priorities
less than admissionpb.UserLowPri. These are typically internally submitted
reads, and include row-level TTL selects.)

Release note: None

Contributor Author

@irfansharif irfansharif left a comment

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @rafiss, @sumeerbhola, and @yuzefovich)


-- commits line 29 at r2:

Previously, sumeerbhola wrote…

Did you mean to collapse into 1 commit? The second commit description has been lost.

Oops, separated back out.


pkg/cmd/roachtest/tests/admission_control_row_level_ttl.go line 97 at r4 (raw file):

Previously, sumeerbhola wrote…

Can you add a TODO for what metric(s) we should read and aggregate to check that AC is working as expected.

Done.


pkg/kv/kvserver/kvadmission/kvadmission.go line 412 at r2 (raw file):

Previously, sumeerbhola wrote…

Yes, I realize that. I was referring to the fact that there is nothing in the processing code (like we have in mvcc.go for ExportRequest) that returns if 10ms has been exceeded. So in some sense the integration is incomplete, even though it is probably harmless.

Ah, understood. Yes it should be harmless, but I added a TODO.


pkg/sql/row/kv_batch_fetcher.go line 69 at r4 (raw file):

Previously, sumeerbhola wrote…

kvadmission.elastic_cpu.duration_per_low_pri_read is 10ms while this is 100ms. Can you add some code commentary on why the difference? Presumably there was some experimental basis for this.

Done.


pkg/sql/row/kv_batch_fetcher.go line 350 at r4 (raw file):

Previously, sumeerbhola wrote…

and in SQL pods?

Done.

@craig
Contributor

craig bot commented Aug 22, 2023

Build succeeded:

@craig craig bot merged commit fd3a78b into cockroachdb:master Aug 22, 2023
5 of 7 checks passed
@irfansharif irfansharif deleted the 230810.ttl-ac branch August 22, 2023 15:41
@rafiss
Collaborator

rafiss commented Dec 4, 2023

This was backported to 23.1.9 and 22.2.14. The behavior is off-by-default in those releases. To enable, use the following hidden cluster settings:

SET CLUSTER SETTING kvadmission.low_pri_read_elastic_control.enabled = true;
SET CLUSTER SETTING sqladmission.low_pri_read_response_elastic_control.enabled = true;

In v23.2, this behavior is on-by-default, so those cluster settings are no longer needed.
