
ttl,admission: reduce performance impact of large row-level TTL jobs #98722

Closed · irfansharif opened this issue Mar 15, 2023 · 5 comments

Labels: A-admission-control, A-row-level-ttl, C-enhancement (solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception), T-kv (KV Team)

Comments

irfansharif (Contributor) commented Mar 15, 2023

Is your feature request related to a problem? Please describe.

We've seen in several internal incidents that using row-level TTL can be a performance footgun with large tables. We observe throughput/latency impact either when initially applying row-level TTL to an existing table using ttl_expire_after (which kicks off a column backfill; using ttl_expiration_expression does not cause a backfill), or when a TTL job is unpaused and starts aggressively issuing row deletes through SQL. The impact can also appear during routine row deletions without any prolonged job pause. This issue focuses on the delete-driven cases, which can look like the following:

[screenshots: throughput/latency impact during a row-level TTL job]
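
As a quick reference for the distinction above, here is a minimal sketch of the two ways row-level TTL can be enabled on an existing table. The table name, the created_at TIMESTAMPTZ column, and the connection string are illustrative assumptions only:

package main

import (
    "database/sql"
    "log"

    _ "github.com/lib/pq" // any Postgres-wire driver works against CockroachDB
)

func main() {
    // Illustrative local, insecure single-node cluster; adjust as needed.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Option 1: ttl_expire_after. This adds a hidden expiration column to the
    // table, so enabling it on a large existing table kicks off a column backfill.
    if _, err := db.Exec(`ALTER TABLE events SET (ttl_expire_after = '30 days')`); err != nil {
        log.Fatal(err)
    }

    // Option 2: ttl_expiration_expression over an existing TIMESTAMPTZ column.
    // No new column is added, so no backfill runs. In practice you'd pick one
    // of the two options rather than issuing both statements.
    if _, err := db.Exec(`ALTER TABLE events SET (ttl_expiration_expression = '(created_at + INTERVAL ''30 days'')')`); err != nil {
        log.Fatal(err)
    }
}

Either form enables the background TTL deletion job; the difference is only whether a schema change (the column backfill) has to run first.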

For the column backfill case, we expect replication admission control (#95563) to help. Internal incidents:

Jira issue: CRDB-25463

Epic CRDB-25458

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control labels Mar 15, 2023
@exalate-issue-sync exalate-issue-sync bot added the T-kv KV Team label Mar 15, 2023
ecwall (Contributor) commented Jul 24, 2023

@irfansharif is there a way to specify a max CPU usage for low-priority tasks like this?

For example, if the threshold is set to 75%, then the task should not run unless CPU usage falls below that level. If other tasks are already using 75%, then this task should never run.

Context https://cockroachlabs.slack.com/archives/CHKQGKYEM/p1690222820278609?thread_ts=1690207802.498629&cid=CHKQGKYEM
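
As a rough sketch of the idea being asked for here (this is not an existing CockroachDB API; SampleCPUUtilization is a hypothetical stand-in for however utilization would be measured), a threshold gate for low-priority work could look like the following:

package cpugate

import (
    "context"
    "time"
)

// Gate admits low-priority work only while recent CPU utilization is below a
// configured threshold (e.g. 0.75). SampleCPUUtilization is a hypothetical
// stand-in for a real utilization sampler; it returns a fraction in [0, 1].
type Gate struct {
    Threshold            float64
    SampleCPUUtilization func() float64
}

// WaitForHeadroom blocks until CPU utilization drops below the threshold or
// the context is cancelled. If other work keeps utilization at or above the
// threshold, the low-priority task never runs, matching the behavior described
// in the comment above.
func (g Gate) WaitForHeadroom(ctx context.Context) error {
    ticker := time.NewTicker(250 * time.Millisecond)
    defer ticker.Stop()
    for {
        if g.SampleCPUUtilization() < g.Threshold {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
    }
}

A raw utilization threshold like this admits work in bursts; the elastic CPU limiter approach taken later in this issue instead grants low-priority work small slices of CPU time so its total share stays near a configured target.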

rafiss (Collaborator) commented Jul 28, 2023

The support ticket https://github.com/cockroachlabs/support/issues/2469#issuecomment-1640640043 contains information on how to reproduce the problem manually. SQL Foundations is also going to work on turning that into a nightly test (#107496).

rafiss (Collaborator) commented Aug 1, 2023

#107959 adds a roachtest that can be used to reproduce this.

Run with:

roachtest run row-level-ttl/during/tpcc/expired-rows=false --cockroach /path/to/cockroach-v23.1.6.linux-amd64
# or
roachtest run row-level-ttl/during/tpcc/expired-rows=true --cockroach /path/to/cockroach-v23.1.6.linux-amd64

In both of these, the TTL job will run 10 minutes after the TPCC load begins.

With the first command, the TTL job will scan for expired rows, but will not delete anything (since no rows will match the expiration expression). With the second command, the TTL job will scan and delete rows.

Here's some data from a run with expired-rows=false:
[screenshot: cluster CPU usage during the run]
The TPCC workload starts at 15:58. The TTL job wakes up at 16:08, and we can see that the CPU usage goes up to 100% at that point.

Here's a snippet from the workload artifacts showing that as soon as the TTL job starts running, foreground latencies go up (median as well as tail):

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  260.0s        0           52.0           43.7     25.2     30.4     32.5     33.6 newOrder
  261.0s        0           41.0           43.7     25.2     32.5     37.7     37.7 newOrder
  262.0s        0           49.0           43.7     26.2     46.1     54.5     54.5 newOrder
  263.0s        0           45.0           43.7     25.2     71.3     92.3     92.3 newOrder
  264.0s        0           41.0           43.7     24.1     29.4     31.5     31.5 newOrder
  265.0s        0           35.0           43.7     25.2     52.4     83.9     83.9 newOrder
  266.0s        0           43.0           43.7     25.2     33.6     35.7     35.7 newOrder
  267.0s        0           43.0           43.7     26.2     54.5     67.1     67.1 newOrder
  268.0s        0           41.0           43.7     26.2     44.0     79.7     79.7 newOrder
  269.0s        0           45.0           43.7     26.2     37.7     48.2     48.2 newOrder
  270.0s        0           43.0           43.7     25.2     37.7     62.9     62.9 newOrder
  271.0s        0           52.0           43.7     25.2     29.4     33.6     35.7 newOrder
  272.0s        0           39.0           43.7     25.2     31.5     35.7     35.7 newOrder
  273.0s        0           41.0           43.7     25.2     31.5     56.6     56.6 newOrder
  274.0s        0           32.0           43.6     24.1     35.7     37.7     37.7 newOrder
  275.0s        0           26.0           43.6    570.4    838.9    939.5    939.5 newOrder
  276.0s        0           40.0           43.6    637.5   1006.6   1342.2   1342.2 newOrder
  277.0s        0           41.0           43.5    704.6   1040.2   1879.0   1879.0 newOrder
  278.0s        0           42.0           43.5    637.5   1040.2   1208.0   1208.0 newOrder
  279.0s        0           46.0           43.5    604.0   1073.7   1342.2   1342.2 newOrder
  280.0s        0           30.0           43.5    570.4    939.5    973.1    973.1 newOrder
  281.0s        0           46.0           43.5    704.6   1140.9   1879.0   1879.0 newOrder
  282.0s        0           40.0           43.5    637.5   1208.0   1543.5   1543.5 newOrder
  283.0s        0           38.0           43.5    570.4    939.5    973.1    973.1 newOrder
  284.0s        0           38.0           43.5    570.4    973.1   1342.2   1342.2 newOrder
  285.0s        0           35.0           43.4    570.4    805.3   1677.7   1677.7 newOrder
  286.0s        0           43.0           43.4    604.0    939.5   1140.9   1140.9 newOrder
  287.0s        0           47.0           43.4    637.5    939.5   1543.5   1543.5 newOrder
  288.0s        0           48.0           43.5    604.0    939.5   1208.0   1208.0 newOrder
  289.0s        0           43.0           43.5    570.4    973.1   1208.0   1208.0 newOrder
  290.0s        0           38.0           43.4    637.5    906.0   1006.6   1006.6 newOrder
  291.0s        0           49.0           43.5    604.0    872.4   1006.6   1006.6 newOrder
  292.0s        0           37.0           43.4    570.4    939.5   1006.6   1006.6 newOrder
  293.0s        0           42.0           43.4    671.1   1208.0   1275.1   1275.1 newOrder

irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 15, 2023
Part of cockroachdb#98722.

We do it at two levels, each appearing prominently in CPU profiles:
- Down in KV, where we're handling batch requests issued as part of
  row-level TTL selects. This is gated by
  kvadmission.ttl_read_elastic_control.enabled.
- Up in SQL, when handling KV responses to said batch requests. This is
  gated by sqladmission.ttl_read_response_elastic_control.enabled.

Similar to backups, rangefeed initial scans, and changefeed event
processing, we've observed latency impact during CPU-intensive scans
issued as part of row-level TTL jobs. We know from before that the
existing slots based mechanism for CPU work can result in excessive
scheduling latency for high-pri work in the presence of lower-pri work,
affecting end-user latencies. This commit then tries to control the
total CPU% used by row-level TTL selects through the elastic CPU
limiter. For the KV work, this was trivial -- we already have
integrations at the batch request level and now we pick out requests
with the admissionpb.TTLLowPri bit set.

For the SQL portion of the work we introduce some minimal plumbing.
Where previously we sought admission in the SQL-KV response queues after
fetching each batch of KVs from KV as part of our volcano operator
iteration, we now incrementally acquire CPU nanos. We do this
specifically for row-level TTL work. Experimentally the CPU nanos we
acquire here map roughly to the CPU utilization due to SQL work for
row-level TTL selects.

Release note: None
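
As a sketch of the "incrementally acquire CPU nanos" pattern this commit message describes for the SQL side, under simplified, assumed names (CPUGranter and its methods are stand-ins, not the actual admission-control API):

package elasticsketch

import (
    "context"
    "time"
)

// CPUGranter hands out slices of CPU time to elastic (low-priority) work.
// Both methods are hypothetical stand-ins for the elastic CPU limiter's API.
type CPUGranter interface {
    // Acquire blocks until d of CPU time is granted under the limiter's target.
    Acquire(ctx context.Context, d time.Duration) error
    // Return hands back any unused portion of a grant.
    Return(unused time.Duration)
}

// ProcessTTLResponses sketches the SQL-side integration: rather than seeking
// admission once per batch in the SQL-KV response queues, TTL reads acquire
// small slices of CPU time as they process each KV response and return
// whatever they don't use.
func ProcessTTLResponses(ctx context.Context, g CPUGranter, responses [][]byte, handle func([]byte)) error {
    const slice = 10 * time.Millisecond // acquire CPU in small increments
    for _, r := range responses {
        if err := g.Acquire(ctx, slice); err != nil {
            return err
        }
        start := time.Now()
        handle(r) // CPU-intensive handling of the KV response up in SQL
        if used := time.Since(start); used < slice {
            g.Return(slice - used)
        }
    }
    return nil
}

Because the grants are small and unused time is returned, the limiter can hold the total CPU share of TTL selects near its target rather than letting them saturate the machine.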
irfansharif (Contributor, Author) commented:

Copying over notes from internal experiments.

Comparing runs with and without #108815: with the elastic CPU integration, p99 goes from 229ms to 48ms (a 4.7x decrease) and p90 from 95.8ms to 8.7ms (an 11x decrease) during TTL reads. The limiter holds CPU% to whatever it's configured to -- instead of the 100% utilization seen before, ~50% utilization is now what's considered acceptable by default WRT p99 scheduling latency.

[screenshot: latency comparison with and without #108815]

Beyond the below-KV CPU utilization (at the batch request level), a substantial amount of CPU during these TTL selects is spent up in SQL, which is why we also acquire CPU tokens for the SQL work: profile-19.pb.txt

[screenshot: CPU profile of the SQL-side TTL work]

@irfansharif irfansharif self-assigned this Aug 21, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 22, 2023
Part of cockroachdb#98722.

We do it at two levels, each appearing prominently in CPU profiles:

- Down in KV, where we're handling batch requests issued as part of
  row-level TTL selects. This is gated by
  kvadmission.low_pri_read_elastic_control.enabled.
- Up in SQL, when handling KV responses to said batch requests. This is
  gated by sqladmission.low_pri_read_response_elastic_control.enabled.

Similar to backups, rangefeed initial scans, and changefeed event
processing, we've observed latency impact during CPU-intensive scans
issued as part of row-level TTL jobs. We know from before that the
existing slots based mechanism for CPU work can result in excessive
scheduling latency for high-pri work in the presence of lower-pri
work, affecting end-user latencies. This is because the slots mechanism
aims for full utilization of the underlying resource, which is
incompatible with low scheduling latencies. This commit then tries to
control the total CPU% used by row-level TTL selects through the elastic
CPU limiter. For the KV work, this was trivial -- we already have
integrations at the batch request level and now we pick out requests
with priorities less than admissionpb.UserLowPri, which includes
admissionpb.TTLLowPri.

For the SQL portion of the work we introduce some minimal plumbing.
Where previously we sought admission in the SQL-KV response queues after
fetching each batch of KVs from KV as part of our volcano operator
iteration, we now incrementally acquire CPU nanos. We do this
specifically for row-level TTL work. Experimentally the CPU nanos we
acquire here map roughly to the CPU utilization due to SQL work for
row-level TTL selects.

(Note that we apply the elastic CPU limiter for all reads with priorities
less than admissionpb.UserLowPri. This is typically internally submitted
reads, and includes row-level TTL selects.)

Release note: None
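
Since both integrations are gated by cluster settings (names as given in the commit message above; they may differ across versions), an operator could switch them off if the elastic limiting ever needed to be bypassed. A minimal sketch, again assuming a local insecure cluster:

package main

import (
    "database/sql"
    "log"

    _ "github.com/lib/pq" // any Postgres-wire driver
)

func main() {
    // Illustrative connection string; adjust for your cluster.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Setting names as they appear in the commit message above.
    for _, stmt := range []string{
        `SET CLUSTER SETTING kvadmission.low_pri_read_elastic_control.enabled = false`,
        `SET CLUSTER SETTING sqladmission.low_pri_read_response_elastic_control.enabled = false`,
    } {
        if _, err := db.Exec(stmt); err != nil {
            log.Fatal(err)
        }
    }
}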
craig bot pushed a commit that referenced this issue Aug 22, 2023
108815: kv,sql: integrate row-level TTL reads with CPU limiter r=irfansharif a=irfansharif

Part of #98722.

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
irfansharif added three more commits to irfansharif/cockroach that referenced this issue Aug 22, 2023, each carrying the same commit message as above.
irfansharif (Contributor, Author) commented Aug 23, 2023

#108815 improves things. I'll kick off another run for the delete work and the column backfill, but I expect that they're both improved with flow control.
