kv,sql: integrate row-level TTL reads with CPU limiter #108815

Merged
merged 2 commits into cockroachdb:master from 230810.ttl-ac on Aug 22, 2023

Conversation

Contributor

@irfansharif irfansharif commented Aug 15, 2023

Part of #98722.

We do it at two levels, each appearing prominently in CPU profiles:

  • Down in KV, where we're handling batch requests issued as part of row-level TTL selects. This is gated by kvadmission.low_pri_read_elastic_control.enabled.
  • Up in SQL, when handling KV responses to said batch requests. This is gated by sqladmission.low_pri_read_response_elastic_control.enabled.

Similar to backups, rangefeed initial scans, and changefeed event processing, we've observed latency impact during CPU-intensive scans issued as part of row-level TTL jobs. We know from before that the existing slot-based mechanism for CPU work can result in excessive scheduling latency for high-pri work in the presence of lower-pri work, affecting end-user latencies. This commit then tries to control the total CPU% used by row-level TTL selects through the elastic CPU limiter. For the KV work, this was trivial -- we already have integrations at the batch request level and now we pick out requests with the admissionpb.TTLLowPri bit set.

For the SQL portion of the work we introduce some minimal plumbing. Where previously we sought admission in the SQL-KV response queues after fetching each batch of KVs from KV as part of our volcano operator iteration, we now incrementally acquire CPU nanos. We do this specifically for row-level TTL work. Experimentally the CPU nanos we acquire here map roughly to the CPU utilization due to SQL work for row-level TTL selects.
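(For illustration only: a minimal, self-contained sketch of the pacing idea described above. The names here -- cpuPacer, pace -- are made up for the example and are not the actual kvadmission/admission APIs.)

```go
package main

import (
	"fmt"
	"time"
)

// cpuPacer is an illustrative stand-in for the pacing done on the SQL response
// path: instead of queueing once per KV batch response, the caller periodically
// acquires a small allotment of on-CPU time and draws it down as it works.
type cpuPacer struct {
	allotment time.Duration // CPU time granted per admission
	remaining time.Duration // unused portion of the current grant
	grants    int           // number of grants handed out (for the demo)
}

// pace is called before processing each KV response. When the budget runs out
// it asks for more; in the real integration that is the admission point where
// low-priority work would wait if the CPU is contended.
func (p *cpuPacer) pace(estimated time.Duration) {
	if p.remaining < estimated {
		p.remaining += p.allotment // hypothetical admission: acquire more CPU nanos
		p.grants++
	}
	p.remaining -= estimated
}

func main() {
	p := &cpuPacer{allotment: 100 * time.Millisecond}
	for i := 0; i < 50; i++ {
		p.pace(10 * time.Millisecond) // pretend each response costs ~10ms of CPU
	}
	fmt.Printf("grants: %d, budget left: %s\n", p.grants, p.remaining)
}
```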

Release note: None

@irfansharif irfansharif requested review from a team as code owners August 15, 2023 21:06
@irfansharif irfansharif requested review from rachitgsrivastava and DarrylWong and removed request for a team August 15, 2023 21:06
@irfansharif irfansharif changed the title from "kv,sql: integrate row-level TTL reads with CPU limiterj" to "kv,sql: integrate row-level TTL reads with CPU limiter" Aug 15, 2023
@cockroach-teamcity
Member

This change is Reviewable

Member

@yuzefovich yuzefovich left a comment

LGTM, but I'll defer to others for approval.

Reviewed 7 of 7 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @sumeerbhola)


pkg/kv/kvserver/kvadmission/kvadmission.go line 68 at r2 (raw file):

	"controls how many CPU tokens are allotted for each ttl read request",
	10*time.Millisecond,
	func(duration time.Duration) error {

nit: consider extracting this validation function as a helper (or as a helper constructor), probably in admission package, to be used in three places.

Collaborator

@sumeerbhola sumeerbhola left a comment

looks good

Reviewed 7 of 7 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, and @rafiss)


-- commits line 29 at r2:
fwiw, one could argue that this should be elastic work even if the slot-based mechanism were working better, because an improved CPU admission mechanism for regular work will not try to keep scheduling latency as low as we do for elastic work.


pkg/kv/kvserver/kvadmission/kvadmission.go line 397 at r2 (raw file):

		// - We do the same for KV work done on the behalf of row-level TTL
		//   reads.
		isTTLRead := ba.IsReadOnly() && admissionInfo.Priority == admissionpb.TTLLowPri

I'm planning to fiddle with priorities, based on whether locks are held or not, and maybe more such changes in the future. So using the priority to decide whether something is a TTL scan is a bit brittle. One could argue that anything < UserLowPri is elastic work and ought to go through this path. Thoughts on the latter?
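(Illustrative sketch of the broadened check, with made-up priority values -- the real constants live in admissionpb:)

```go
package main

import "fmt"

// pri mimics an admission priority; the numeric values below are invented for
// the example and do not match admissionpb.
type pri int8

const (
	ttlLowPri  pri = -20 // e.g. row-level TTL reads
	userLowPri pri = -10
	normalPri  pri = 0
)

// isElasticRead is the broadened predicate: any read-only batch below
// UserLowPri is treated as elastic work and routed through the CPU limiter,
// rather than matching the TTL priority exactly.
func isElasticRead(isReadOnly bool, p pri) bool {
	return isReadOnly && p < userLowPri
}

func main() {
	fmt.Println(isElasticRead(true, ttlLowPri)) // true: elastic CPU limiter
	fmt.Println(isElasticRead(true, normalPri)) // false: regular admission
}
```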


pkg/kv/kvserver/kvadmission/kvadmission.go line 412 at r2 (raw file):

			elasticWorkHandle, err := n.elasticCPUGrantCoordinator.ElasticCPUWorkQueue.Admit(
				ctx, admitDuration, admissionInfo,
			)

we're doing the initial admission, but the corresponding request processing code is not doing admission.ElasticCPUWorkHandleFromContext(ctx) in order to return if it consumes too much time. I suppose it isn't harming us because these requests are short lived, but I think this needs a code comment and a TODO.
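(Roughly the pattern being asked for, sketched with stand-in types -- not the actual admission package API:)

```go
package main

import (
	"fmt"
	"time"
)

// workHandle stands in for the elastic CPU work handle; the real one is pulled
// from the context (admission.ElasticCPUWorkHandleFromContext) and tracks
// on-CPU time rather than wall time.
type workHandle struct {
	start   time.Time
	granted time.Duration
}

func (h *workHandle) overLimit() bool {
	return h != nil && time.Since(h.start) > h.granted
}

// process sketches the resumption pattern: check the handle periodically and
// return early with a resume point once the grant is used up, the way
// long-running requests (e.g. ExportRequest in mvcc.go) do.
func process(h *workHandle, rows int) (processed int) {
	for i := 0; i < rows; i++ {
		if h.overLimit() {
			return i // caller re-admits and resumes from here
		}
		time.Sleep(time.Millisecond) // pretend each row costs ~1ms
	}
	return rows
}

func main() {
	h := &workHandle{start: time.Now(), granted: 10 * time.Millisecond}
	fmt.Println("rows processed before yielding:", process(h, 100))
}
```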


pkg/sql/row/kv_batch_fetcher.go line 376 at r2 (raw file):

				CreateTime: txn.AdmissionHeader().CreateTime,
			})
	}

this code is duplicated. Can you pull it into a method, and call it TryInitAdmissionPacing (or something like that). And include a comment that for now we are only doing this for TTL.


pkg/sql/row/kv_batch_fetcher.go line 621 at r2 (raw file):

	// Do admission control after we've accounted for the response bytes.

	if br != nil {

This admission block is worth pulling out into a method too, and having a comment in that method implementation on why we are doing one or the other.

Also we need a code comment justifying why this pacing is effective for TTL work. Please work in some of the comments from the slack thread and maybe post the profiles onto the github issue so that the callstacks can be referenced here. I am assuming this is working because the bulk of the SQL CPU consumption for TTL work is happening on the goroutine that is driving this fetch.

And how will this be robust to future changes in the SQL code? This problem is not unique to this TTL instrumentation, so hopefully @yuzefovich can comment on how someone (perhaps sql-queries) could add unit tests. Also, I'm not suggesting it happen in this PR, but I worry about brittleness since:

  • The admission roachtests typically don't have pass/fail conclusions, so we may not notice if TTL isolation breaks because the instrumentation point is no longer relevant.
  • We have similar response path instrumentation in streamer.go. For some reason the TTL code is using txnKVFetcher and not that code, but presumably that could change.

Member

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @sumeerbhola)


pkg/sql/row/kv_batch_fetcher.go line 621 at r2 (raw file):

I am assuming this is working because the bulk of the SQL CPU consumption for TTL work is happening on the goroutine that is driving this fetch.

Yes, I believe that's the case because the SELECTs done by the TTL job are not distributed at the SQL level (since our DistSQL physical planning heuristics deem it not worthy of distribution), and with the local plan we only have a single goroutine (unless maybeParallelizeLocalScans splits up the single scan into multiple TableReader processors).

And how will this be robust to future changes in the SQL code? This problem is not unique to this TTL instrumentation, so hopefully @yuzefovich can comment on how someone (perhaps sql-queries) could add unit tests.

If you're wondering how we'd verify that TTL code still benefits from this elastic CPU pacing, I'd consider adding tests for the SELECT queries issued by the TTL to ensure that they have local plans with a single TableReader processor in multi-node clusters. The assumption that such plans have a single goroutine is pretty fundamental and won't be changed (until the TableReaders are powered by the streamer #82164).

We have similar response path instrumentation in streamer.go. For some reason the TTL code is using txnKVFetcher and not that code, but presumably that could change.

Yes, that's a good point. This will change once #82164 is implemented, but that's not going to happen in 23.2. The streamer does have many goroutines involved, and the CPU-intensive SQL work will happen on a different goroutine from the ones that evaluate the BatchRequests, so the integration will be much trickier there. However, we'll always have an option to disable the usage of the streamer in the TTL SELECT queries to preserve this elastic pacing.

Contributor Author

@irfansharif irfansharif left a comment

PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @rafiss, @sumeerbhola, and @yuzefovich)


-- commits line 29 at r2:

Previously, sumeerbhola wrote…

fwiw, one could argue that this should be elastic work even if the slot-based mechanism were working better, because an improved CPU admission mechanism for regular work will not try to keep scheduling latency as low as we do for elastic work.

Edited the sentence earlier to reflect this.


pkg/kv/kvserver/kvadmission/kvadmission.go line 68 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: consider extracting this validation function as a helper (or as a helper constructor), probably in admission package, to be used in three places.

Good idea, done.


pkg/kv/kvserver/kvadmission/kvadmission.go line 397 at r2 (raw file):

Previously, sumeerbhola wrote…

I'm planning to fiddle with priorities, based on whether locks are held or not, and maybe more such changes in the future. So using the priority to decide whether something is a TTL scan is a bit brittle. One could argue that anything < UserLowPri is elastic work and ought to go through this path. Thoughts on the latter?

Just saw the other PR. I was initially trying to be very targeted here, affecting only TTL reads, but applying it to < UserLowPri makes sense. Done.


pkg/kv/kvserver/kvadmission/kvadmission.go line 412 at r2 (raw file):

Previously, sumeerbhola wrote…

we're doing the initial admission, but the corresponding request processing code is not doing admission.ElasticCPUWorkHandleFromContext(ctx) in order to return if it consumes too much time. I suppose it isn't harming us because these requests are short lived, but I think this needs a code comment and a TODO.

We are using admission.ElasticCPUWorkHandleFromContext(ctx) in order to return unused tokens. See below where we populate ah.elasticCPUWorkHandle and the calling code in pkg/server/node.go. The unused CPU nanos get returned when calling AdmittedKVWorkDone.
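(A self-contained sketch of the grant-and-return-unused flow described here; the names are illustrative, not the actual handle API:)

```go
package main

import (
	"fmt"
	"time"
)

// grant stands in for the handle populated at admission time: KV work is
// admitted with a fixed allotment of on-CPU time, and whatever it does not
// use is credited back when the work completes (the AdmittedKVWorkDone step).
type grant struct {
	allotted time.Duration
	started  time.Time
}

// done returns the unused portion of the allotment, which the real code hands
// back to the elastic CPU limiter for other elastic work to use.
func (g *grant) done() time.Duration {
	if used := time.Since(g.started); used < g.allotted {
		return g.allotted - used
	}
	return 0
}

func main() {
	g := &grant{allotted: 10 * time.Millisecond, started: time.Now()}
	time.Sleep(2 * time.Millisecond) // the batch request finishes quickly
	fmt.Printf("returned %s of unused CPU time\n", g.done())
}
```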


pkg/sql/row/kv_batch_fetcher.go line 376 at r2 (raw file):

Previously, sumeerbhola wrote…

this code is duplicated. Can you pull it into a method, and call it TryInitAdmissionPacing (or something like that). And include a comment that for now we are only doing this for TTL.

Done.


pkg/sql/row/kv_batch_fetcher.go line 621 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I am assuming this is working because the bulk of the SQL CPU consumption for TTL work is happening on the goroutine that is driving this fetch.

Yes, I believe that's the case because the SELECTs done by the TTL job are not distributed at the SQL level (since our DistSQL physical planning heuristics deem it not worthy of distribution), and with the local plan we only have a single goroutine (unless maybeParallelizeLocalScans splits up the single scan into multiple TableReader processors).

And how will this be robust to future changes in the SQL code? This problem is not unique to this TTL instrumentation, so hopefully @yuzefovich can comment on how someone (perhaps sql-queries) could add unit tests.

If you're wondering how we'd verify that TTL code still benefits from this elastic CPU pacing, I'd consider adding tests for the SELECT queries issued by the TTL to ensure that they have local plans with a single TableReader processor in multi-node clusters. The assumption that such plans have a single goroutine is pretty fundamental and won't be changed (until the TableReaders are powered by the streamer #82164).

We have similar response path instrumentation in streamer.go. For some reason the TTL code is using txnKVFetcher and not that code, but presumably that could change.

Yes, that's a good point. This will change once #82164 is implemented, but that's not going to happen in 23.2. The streamer does have many goroutines involved, and the CPU-intensive SQL work will happen on a different goroutine from the ones that evaluate the BatchRequests, so the integration will be much trickier there. However, we'll always have an option to disable the usage of the streamer in the TTL SELECT queries to preserve this elastic pacing.

Added a comment, TODOs, and a back reference in #82164 to this thread. Posting some internal experimental results over to #98722 for reference-ability.

Member

@yuzefovich yuzefovich left a comment

Reviewed 10 of 10 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @sumeerbhola)


pkg/util/admission/elastic_cpu_work_queue.go line 34 at r3 (raw file):

// ValidateElasticCPUDuration is a validation function for the {minimum,maximum}
// on-CPU time elastic requests can ask for when seeking admission.
func ValidateElasticCPUDuration(duration time.Duration) error {

nit: I think this is no longer needed (and unused) given Raphael's changes to settings API.

Contributor Author

@irfansharif irfansharif left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @rafiss, @sumeerbhola, and @yuzefovich)


pkg/util/admission/elastic_cpu_work_queue.go line 34 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I think this is no longer needed (and unused) given Raphael's changes to settings API.

Removed.

Collaborator

@sumeerbhola sumeerbhola left a comment

:lgtm:

Reviewed 8 of 10 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @irfansharif, @rachitgsrivastava, @rafiss, and @yuzefovich)


-- commits line 29 at r2:

Previously, irfansharif (irfan sharif) wrote…

Edited the sentence earlier to reflect this.

Did you mean to collapse into 1 commit? The second commit description has been lost.


pkg/cmd/roachtest/tests/admission_control_row_level_ttl.go line 97 at r4 (raw file):

						// The TPCC fixtures have dates from 2006 for the ol_delivery_d column.
						expirationExpr = `'((ol_delivery_d::TIMESTAMP) + INTERVAL ''1000 years'') AT TIME ZONE ''UTC'''`
					}

Can you add a TODO for what metric(s) we should read and aggregate to check that AC is working as expected.


pkg/kv/kvserver/kvadmission/kvadmission.go line 412 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We are using admission.ElasticCPUWorkHandleFromContext(ctx) in order to return unused tokens. See below where we populate ah.elasticCPUWorkHandle and the calling code in pkg/server/node.go. The unused CPU nanos get returned when calling AdmittedKVWorkDone.

Yes, I realize that. I was referring to the fact that there is nothing in the processing code (like we have in mvcc.go for ExportRequest) that returns if 10ms has been exceeded. So in some sense the integration is incomplete, even though it is probably harmless.


pkg/kv/kvserver/kvadmission/kvadmission.go line 365 at r4 (raw file):

		//   isolation to non-elastic ("latency sensitive") work running on the
		//   same machine.
		// - We do the same internally submitted low priority reads in general

nit: same for internally


pkg/sql/row/kv_batch_fetcher.go line 69 at r4 (raw file):

	settings.SystemOnly,
	"sqladmission.elastic_cpu.duration_per_low_pri_read_response",
	"controls how many CPU tokens are allotted for handling responses for internally submitted low priority reads",

kvadmission.elastic_cpu.duration_per_low_pri_read is 10ms while this is 100ms. Can you add some code commentary on why the difference? Presumably there was some experimental basis for this.


pkg/sql/row/kv_batch_fetcher.go line 75 at r4 (raw file):

// internalLowPriReadElasticControlEnabled determines whether the sql portion of
// internally submitted low-priorority reads (like row-level TTL selects)

nit: priority


pkg/sql/row/kv_batch_fetcher.go line 350 at r4 (raw file):

) {
	if sv == nil {
		return // only nil in tests

and in SQL pods?


pkg/sql/row/kv_batch_fetcher.go line 650 at r4 (raw file):

	if f.admissionPacer != nil {
		// If admissionPacer is initialized, we're using the elastic CPU control
		// mechanism (the work is elastic in nature using the slots based

nit: nature and using

The two commit messages:

Rename + own as an AC integration test, similar to ones we have for
backup/changefeeds/etc. We'll integrate row-level TTL reads in the
subsequent commit

Part of cockroachdb#89208.

Release note: None

Part of cockroachdb#98722.

We do it at two levels, each appearing prominently in CPU profiles:

- Down in KV, where we're handling batch requests issued as part of
  row-level TTL selects. This is gated by
  kvadmission.low_pri_read_elastic_control.enabled.
- Up in SQL, when handling KV responses to said batch requests. This is
  gated by sqladmission.low_pri_read_response_elastic_control.enabled.

Similar to backups, rangefeed initial scans, and changefeed event
processing, we've observed latency impact during CPU-intensive scans
issued as part of row-level TTL jobs. We know from before that the
existing slot-based mechanism for CPU work can result in excessive
scheduling latency for high-pri work in the presence of lower-pri
work, affecting end-user latencies. This is because the slot-based
mechanism aims for full utilization of the underlying resource, which
is incompatible with low scheduling latencies. This commit then tries to
control the total CPU% used by row-level TTL selects through the elastic
CPU limiter. For the KV work, this was trivial -- we already have
integrations at the batch request level and now we pick out requests
with priorities less than admissionpb.UserLowPri, which includes
admissionpb.TTLLowPri.

For the SQL portion of the work we introduce some minimal plumbing.
Where previously we sought admission in the SQL-KV response queues after
fetching each batch of KVs from KV as part of our volcano operator
iteration, we now incrementally acquire CPU nanos. We do this
specifically for row-level TTL work. Experimentally the CPU nanos we
acquire here map roughly to the CPU utilization due to SQL work for
row-level TTL selects.

(Note that we apply the elastic CPU limiter to all reads with priorities
less than admissionpb.UserLowPri. These are typically internally submitted
reads, and include row-level TTL selects.)

Release note: None

Contributor Author

@irfansharif irfansharif left a comment

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @rafiss, @sumeerbhola, and @yuzefovich)


-- commits line 29 at r2:

Previously, sumeerbhola wrote…

Did you mean to collapse into 1 commit? The second commit description has been lost.

Oops, separated back out.


pkg/cmd/roachtest/tests/admission_control_row_level_ttl.go line 97 at r4 (raw file):

Previously, sumeerbhola wrote…

Can you add a TODO for what metric(s) we should read and aggregate to check that AC is working as expected.

Done.


pkg/kv/kvserver/kvadmission/kvadmission.go line 412 at r2 (raw file):

Previously, sumeerbhola wrote…

Yes, I realize that. I was referring to the fact that there is nothing in the processing code (like we have in mvcc.go for ExportRequest) that returns if 10ms has been exceeded. So in some sense the integration is incomplete, even though it is probably harmless.

Ah, understood. Yes it should be harmless, but I added a TODO.


pkg/sql/row/kv_batch_fetcher.go line 69 at r4 (raw file):

Previously, sumeerbhola wrote…

kvadmission.elastic_cpu.duration_per_low_pri_read is 10ms while this is 100ms. Can you add some code commentary on why the difference? Presumably there was some experimental basis for this.

Done.


pkg/sql/row/kv_batch_fetcher.go line 350 at r4 (raw file):

Previously, sumeerbhola wrote…

and in SQL pods?

Done.

@craig
Contributor

craig bot commented Aug 22, 2023

Build succeeded:

@craig craig bot merged commit fd3a78b into cockroachdb:master Aug 22, 2023
5 of 7 checks passed
@irfansharif irfansharif deleted the 230810.ttl-ac branch August 22, 2023 15:41
@rafiss
Collaborator

rafiss commented Dec 4, 2023

This was backported to 23.1.9 and 22.2.14. The behavior is off-by-default in those releases. To enable, use the following hidden cluster settings:

SET CLUSTER SETTING kvadmission.low_pri_read_elastic_control.enabled = true;
SET CLUSTER SETTING sqladmission.low_pri_read_response_elastic_control.enabled = true;

In v23.2, this behavior is on-by-default, so those cluster settings are no longer needed.
