
ttl,admission: reduce performance impact of large row-level TTL jobs #98722

Closed · irfansharif opened this issue Mar 15, 2023 · 5 comments

Labels: A-admission-control, A-row-level-ttl, C-enhancement (solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception), T-kv (KV Team)

Comments

irfansharif (Contributor) commented Mar 15, 2023

Is your feature request related to a problem? Please describe.

We've seen in several internal incidents that using row-level TTL can be a performance footgun with large tables. We observe throughput/latency impact either when initially applying row-level TTL to an existing table using ttl_expire_after (which kicks off a column backfill; using ttl_expiration_expression does not cause a backfill), or when a TTL job is unpaused and starts aggressively issuing row deletes through SQL. The impact can also appear during routine row deletions without any prolonged job pause. This issue focuses on the delete-driven cases, which can look like the following:

[screenshots: throughput/latency impact during a row-level TTL job]
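
As a quick reference for the distinction above, here is a minimal sketch of the two ways row-level TTL can be enabled on an existing table. The table name, the created_at TIMESTAMPTZ column, and the connection string are illustrative assumptions only:

package main

import (
    "database/sql"
    "log"

    _ "github.com/lib/pq" // any Postgres-wire driver works against CockroachDB
)

func main() {
    // Illustrative local, insecure single-node cluster; adjust as needed.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Option 1: ttl_expire_after. This adds a hidden expiration column to the
    // table, so enabling it on a large existing table kicks off a column backfill.
    if _, err := db.Exec(`ALTER TABLE events SET (ttl_expire_after = '30 days')`); err != nil {
        log.Fatal(err)
    }

    // Option 2: ttl_expiration_expression over an existing TIMESTAMPTZ column.
    // No new column is added, so no backfill runs. In practice you'd pick one
    // of the two options rather than issuing both statements.
    if _, err := db.Exec(`ALTER TABLE events SET (ttl_expiration_expression = '(created_at + INTERVAL ''30 days'')')`); err != nil {
        log.Fatal(err)
    }
}

Either form enables the background TTL deletion job; the difference is only whether a schema change (the column backfill) has to run first.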

For the column backfill case, we expect replication admission control (#95563) to help. Internal incidents:

Jira issue: CRDB-25463

Epic CRDB-25458

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control labels Mar 15, 2023
@exalate-issue-sync exalate-issue-sync bot added the T-kv KV Team label Mar 15, 2023
ecwall (Contributor) commented Jul 24, 2023

@irfansharif is there a way to specify a max CPU usage for low-priority tasks like this?

For example, if the threshold is set to 75%, then the task should not run unless CPU usage falls below that level. If other tasks are already using 75%, then this task should never run.

Context https://cockroachlabs.slack.com/archives/CHKQGKYEM/p1690222820278609?thread_ts=1690207802.498629&cid=CHKQGKYEM
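
As a rough sketch of the idea being asked for here (this is not an existing CockroachDB API; SampleCPUUtilization is a hypothetical stand-in for however utilization would be measured), a threshold gate for low-priority work could look like the following:

package cpugate

import (
    "context"
    "time"
)

// Gate admits low-priority work only while recent CPU utilization is below a
// configured threshold (e.g. 0.75). SampleCPUUtilization is a hypothetical
// stand-in for a real utilization sampler; it returns a fraction in [0, 1].
type Gate struct {
    Threshold            float64
    SampleCPUUtilization func() float64
}

// WaitForHeadroom blocks until CPU utilization drops below the threshold or
// the context is cancelled. If other work keeps utilization at or above the
// threshold, the low-priority task never runs, matching the behavior described
// in the comment above.
func (g Gate) WaitForHeadroom(ctx context.Context) error {
    ticker := time.NewTicker(250 * time.Millisecond)
    defer ticker.Stop()
    for {
        if g.SampleCPUUtilization() < g.Threshold {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
    }
}

A raw utilization threshold like this admits work in bursts; the elastic CPU limiter approach taken later in this issue instead grants low-priority work small slices of CPU time so its total share stays near a configured target.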

rafiss (Collaborator) commented Jul 28, 2023

The support ticket https://github.com/cockroachlabs/support/issues/2469#issuecomment-1640640043 contains information on how to reproduce the problem manually. SQL Foundations is also going to work on turning that into a nightly test (#107496).

rafiss (Collaborator) commented Aug 1, 2023

#107959 adds a roachtest that can be used to reproduce this.

Run with:

roachtest run row-level-ttl/during/tpcc/expired-rows=false --cockroach /path/to/cockroach-v23.1.6.linux-amd64
# or
roachtest run row-level-ttl/during/tpcc/expired-rows=true --cockroach /path/to/cockroach-v23.1.6.linux-amd64

In both of these, the TTL job will run 10 minutes after the TPCC load begins.

With the first command, the TTL job will scan for expired rows, but will not delete anything (since no rows will match the expiration expression). With the second command, the TTL job will scan and delete rows.

Here's some data from a run with expired-rows=false:
[screenshot: cluster CPU usage during the run]
The TPCC workload starts at 15:58. The TTL job wakes up at 16:08, and we can see that the CPU usage goes up to 100% at that point.

Here's a snippet from the workload artifacts showing that as soon as the TTL job starts running, foreground latencies go up (median as well as tail):

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  260.0s        0           52.0           43.7     25.2     30.4     32.5     33.6 newOrder
  261.0s        0           41.0           43.7     25.2     32.5     37.7     37.7 newOrder
  262.0s        0           49.0           43.7     26.2     46.1     54.5     54.5 newOrder
  263.0s        0           45.0           43.7     25.2     71.3     92.3     92.3 newOrder
  264.0s        0           41.0           43.7     24.1     29.4     31.5     31.5 newOrder
  265.0s        0           35.0           43.7     25.2     52.4     83.9     83.9 newOrder
  266.0s        0           43.0           43.7     25.2     33.6     35.7     35.7 newOrder
  267.0s        0           43.0           43.7     26.2     54.5     67.1     67.1 newOrder
  268.0s        0           41.0           43.7     26.2     44.0     79.7     79.7 newOrder
  269.0s        0           45.0           43.7     26.2     37.7     48.2     48.2 newOrder
  270.0s        0           43.0           43.7     25.2     37.7     62.9     62.9 newOrder
  271.0s        0           52.0           43.7     25.2     29.4     33.6     35.7 newOrder
  272.0s        0           39.0           43.7     25.2     31.5     35.7     35.7 newOrder
  273.0s        0           41.0           43.7     25.2     31.5     56.6     56.6 newOrder
  274.0s        0           32.0           43.6     24.1     35.7     37.7     37.7 newOrder
  275.0s        0           26.0           43.6    570.4    838.9    939.5    939.5 newOrder
  276.0s        0           40.0           43.6    637.5   1006.6   1342.2   1342.2 newOrder
  277.0s        0           41.0           43.5    704.6   1040.2   1879.0   1879.0 newOrder
  278.0s        0           42.0           43.5    637.5   1040.2   1208.0   1208.0 newOrder
  279.0s        0           46.0           43.5    604.0   1073.7   1342.2   1342.2 newOrder
  280.0s        0           30.0           43.5    570.4    939.5    973.1    973.1 newOrder
  281.0s        0           46.0           43.5    704.6   1140.9   1879.0   1879.0 newOrder
  282.0s        0           40.0           43.5    637.5   1208.0   1543.5   1543.5 newOrder
  283.0s        0           38.0           43.5    570.4    939.5    973.1    973.1 newOrder
  284.0s        0           38.0           43.5    570.4    973.1   1342.2   1342.2 newOrder
  285.0s        0           35.0           43.4    570.4    805.3   1677.7   1677.7 newOrder
  286.0s        0           43.0           43.4    604.0    939.5   1140.9   1140.9 newOrder
  287.0s        0           47.0           43.4    637.5    939.5   1543.5   1543.5 newOrder
  288.0s        0           48.0           43.5    604.0    939.5   1208.0   1208.0 newOrder
  289.0s        0           43.0           43.5    570.4    973.1   1208.0   1208.0 newOrder
  290.0s        0           38.0           43.4    637.5    906.0   1006.6   1006.6 newOrder
  291.0s        0           49.0           43.5    604.0    872.4   1006.6   1006.6 newOrder
  292.0s        0           37.0           43.4    570.4    939.5   1006.6   1006.6 newOrder
  293.0s        0           42.0           43.4    671.1   1208.0   1275.1   1275.1 newOrder

irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 15, 2023
Part of cockroachdb#98722.

We do it at two levels, each appearing prominently in CPU profiles:
- Down in KV, where we're handling batch requests issued as part of
  row-level TTL selects. This is gated by
  kvadmission.ttl_read_elastic_control.enabled.
- Up in SQL, when handling KV responses to said batch requests. This is
  gated by sqladmission.ttl_read_response_elastic_control.enabled.

Similar to backups, rangefeed initial scans, and changefeed event
processing, we've observed latency impact during CPU-intensive scans
issued as part of row-level TTL jobs. We know from before that the
existing slots based mechanism for CPU work can result in excessive
scheduling latency for high-pri work in the presence of lower-pri work,
affecting end-user latencies. This commit then tries to control the
total CPU% used by row-level TTL selects through the elastic CPU
limiter. For the KV work, this was trivial -- we already have
integrations at the batch request level and now we pick out requests
with the admissionpb.TTLLowPri bit set.

For the SQL portion of the work we introduce some minimal plumbing.
Where previously we sought admission in the SQL-KV response queues after
fetching each batch of KVs from KV as part of our volcano operator
iteration, we now incrementally acquire CPU nanos. We do this
specifically for row-level TTL work. Experimentally the CPU nanos we
acquire here map roughly to the CPU utilization due to SQL work for
row-level TTL selects.

Release note: None
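
As a sketch of the "incrementally acquire CPU nanos" pattern this commit message describes for the SQL side, under simplified, assumed names (CPUGranter and its methods are stand-ins, not the actual admission-control API):

package elasticsketch

import (
    "context"
    "time"
)

// CPUGranter hands out slices of CPU time to elastic (low-priority) work.
// Both methods are hypothetical stand-ins for the elastic CPU limiter's API.
type CPUGranter interface {
    // Acquire blocks until d of CPU time is granted under the limiter's target.
    Acquire(ctx context.Context, d time.Duration) error
    // Return hands back any unused portion of a grant.
    Return(unused time.Duration)
}

// ProcessTTLResponses sketches the SQL-side integration: rather than seeking
// admission once per batch in the SQL-KV response queues, TTL reads acquire
// small slices of CPU time as they process each KV response and return
// whatever they don't use.
func ProcessTTLResponses(ctx context.Context, g CPUGranter, responses [][]byte, handle func([]byte)) error {
    const slice = 10 * time.Millisecond // acquire CPU in small increments
    for _, r := range responses {
        if err := g.Acquire(ctx, slice); err != nil {
            return err
        }
        start := time.Now()
        handle(r) // CPU-intensive handling of the KV response up in SQL
        if used := time.Since(start); used < slice {
            g.Return(slice - used)
        }
    }
    return nil
}

Because the grants are small and unused time is returned, the limiter can hold the total CPU share of TTL selects near its target rather than letting them saturate the machine.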
irfansharif (Contributor, Author) commented:

Copying over notes from internal experiments.

Comparing runs with and without #108815: with the elastic CPU integration, p99 goes from 229ms to 48ms (a 4.7x decrease) and p90 from 95.8ms to 8.7ms (an 11x decrease) during TTL reads. The limiter holds CPU% to whatever it's configured to -- instead of the 100% utilization seen before, ~50% utilization is now what's considered acceptable by default WRT p99 scheduling latency.

[screenshot: latency comparison with and without #108815]

Beyond the below-KV CPU utilization (at the batch request level), a substantial amount of CPU during these TTL selects is spent up in SQL, which is why we also acquire CPU tokens for the SQL work: profile-19.pb.txt

[screenshot: CPU profile of the SQL-side TTL work]

@irfansharif irfansharif self-assigned this Aug 21, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 22, 2023
Part of cockroachdb#98722.

We do it at two levels, each appearing prominently in CPU profiles:

- Down in KV, where we're handling batch requests issued as part of
  row-level TTL selects. This is gated by
  kvadmission.low_pri_read_elastic_control.enabled.
- Up in SQL, when handling KV responses to said batch requests. This is
  gated by sqladmission.low_pri_read_response_elastic_control.enabled.

Similar to backups, rangefeed initial scans, and changefeed event
processing, we've observed latency impact during CPU-intensive scans
issued as part of row-level TTL jobs. We know from before that the
existing slots based mechanism for CPU work can result in excessive
scheduling latency for high-pri work in the presence of lower-pri
work, affecting end-user latencies. This is because the slots mechanism
aims for full utilization of the underlying resource, which is
incompatible with low scheduling latencies. This commit then tries to
control the total CPU% used by row-level TTL selects through the elastic
CPU limiter. For the KV work, this was trivial -- we already have
integrations at the batch request level and now we pick out requests
with priorities less than admissionpb.UserLowPri, which includes
admissionpb.TTLLowPri.

For the SQL portion of the work we introduce some minimal plumbing.
Where previously we sought admission in the SQL-KV response queues after
fetching each batch of KVs from KV as part of our volcano operator
iteration, we now incrementally acquire CPU nanos. We do this
specifically for row-level TTL work. Experimentally the CPU nanos we
acquire here map roughly to the CPU utilization due to SQL work for
row-level TTL selects.

(Note that we apply the elastic CPU limiter for all reads with priorities
less than admissionpb.UserLowPri. This is typically internally submitted
reads, and includes row-level TTL selects.)

Release note: None
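
Since both integrations are gated by cluster settings (names as given in the commit message above; they may differ across versions), an operator could switch them off if the elastic limiting ever needed to be bypassed. A minimal sketch, again assuming a local insecure cluster:

package main

import (
    "database/sql"
    "log"

    _ "github.com/lib/pq" // any Postgres-wire driver
)

func main() {
    // Illustrative connection string; adjust for your cluster.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Setting names as they appear in the commit message above.
    for _, stmt := range []string{
        `SET CLUSTER SETTING kvadmission.low_pri_read_elastic_control.enabled = false`,
        `SET CLUSTER SETTING sqladmission.low_pri_read_response_elastic_control.enabled = false`,
    } {
        if _, err := db.Exec(stmt); err != nil {
            log.Fatal(err)
        }
    }
}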
craig bot pushed a commit that referenced this issue Aug 22, 2023
108815: kv,sql: integrate row-level TTL reads with CPU limiter r=irfansharif a=irfansharif

Part of #98722.

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
irfansharif added three more commits to irfansharif/cockroach that referenced this issue Aug 22, 2023, each carrying the same commit message as above.
irfansharif (Contributor, Author) commented Aug 23, 2023

#108815 improves things. I'll kick off another run for the delete work and the column backfill, but I expect that they're both improved with flow control.
