
eval delete: move batching of deletes into RPC handler and state #15117

Merged
merged 7 commits into main from eval-safe-delete-filter
Nov 14, 2022

Conversation

@tgross tgross (Member) commented Nov 2, 2022

During unusual outage recovery scenarios on large clusters, a backlog of millions of evaluations can appear. In these cases, the `eval delete` command can put excessive load on the cluster by listing large sets of evals to extract the IDs and then sending large batches of IDs. Although the command's batch size was carefully tuned, the servers still need to JSON-deserialize the request, re-serialize it to MessagePack, send the log entries through raft, and apply them to the FSM.

To improve performance of this recovery case, move the batching process into the RPC handler and the state store. The design here is a little weird, so let's look at the failed options first:

  • A naive solution here would be to just send the filter as the raft request and let the FSM apply delete the whole set in a single operation. Benchmarking with 1M evals on a 3-node cluster demonstrated this can block the FSM apply for several minutes, which puts the cluster at risk if there's a leadership failover (the barrier write can't be made while this apply is in-flight).

  • A less naive but still bad solution would be to have the RPC handler filter and paginate, and then hand a list of IDs to the existing raft log entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and took roughly an hour to complete.

Instead, we're filtering and paginating in the RPC handler to find a page token, and then passing both the filter and page token in the raft log. The FSM apply recreates the paginator using the filter and page token to get roughly the same page of evaluations, which it then deletes. The pagination process is fairly cheap (only about 5% of the total FSM apply time), so counter-intuitively this rework of the pagination ends up being much faster. A benchmark of 1M evaluations showed this blocked the FSM apply for only 20-30ms at a time (typical for normal operations) and completed in less than 4 minutes.

Note that, as with the existing design, this delete is not consistent: a new evaluation inserted "behind" the cursor of the pagination will fail to be deleted.
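To make the shape of the change concrete, here is a minimal, hypothetical Go sketch of the two halves described above. The type and function names (`EvalDeleteRequest`, `Pager`, `rpcDeleteByFilter`, `fsmApplyDeletePage`) are illustrative stand-ins, not Nomad's actual API:

```go
package sketch

// EvalDeleteRequest is the payload carried in each raft log entry: only the
// filter expression and a pagination cursor, never the eval IDs themselves.
type EvalDeleteRequest struct {
	Filter    string // filter expression over evaluations
	PerPage   int    // page size tuned so one apply stays in the tens of milliseconds
	NextToken string // cursor recorded by the RPC handler for this page
}

// Pager abstracts the paginator: given a filter and a starting cursor, it
// returns the eval IDs on that page plus the cursor for the next page.
type Pager interface {
	Page(filter, fromToken string, perPage int) (ids []string, next string, err error)
}

// rpcDeleteByFilter runs in the RPC handler on the leader. It walks the pages
// and submits one raft entry per page, so no single FSM apply ever has to
// delete the whole matching set.
func rpcDeleteByFilter(p Pager, raftApply func(EvalDeleteRequest) error, filter string, perPage int) error {
	token := ""
	for {
		ids, next, err := p.Page(filter, token, perPage)
		if err != nil {
			return err
		}
		if len(ids) == 0 {
			return nil // nothing left to delete
		}
		// Ship only the filter and the cursor through raft, not the IDs.
		if err := raftApply(EvalDeleteRequest{Filter: filter, PerPage: perPage, NextToken: token}); err != nil {
			return err
		}
		token = next
	}
}

// fsmApplyDeletePage runs on every server when the raft entry is applied. It
// rebuilds the paginator from the same filter and cursor, landing on roughly
// the same page the RPC handler saw, and deletes just that page.
func fsmApplyDeletePage(p Pager, deleteEvals func(ids []string) error, req EvalDeleteRequest) error {
	ids, _, err := p.Page(req.Filter, req.NextToken, req.PerPage)
	if err != nil {
		return err
	}
	return deleteEvals(ids)
}
```

The key property is that each raft entry stays small and each apply touches only one page, which is what keeps the per-apply block time close to the normal 20-30ms observed in the benchmarks.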

@tgross tgross added the theme/cli and backport/1.4.x (backport to 1.4.x release line) labels Nov 2, 2022
@tgross tgross added this to the 1.4.3 milestone Nov 2, 2022
@tgross tgross marked this pull request as ready for review November 2, 2022 21:02
@jrasell jrasell left a comment


LGTM, barring the lint failure and any updates based on adding a count/len endpoint for evals.

@tgross tgross (Member, Author) commented Nov 4, 2022

I've pulled the Eval.Count work out into #15147 and will rebase this PR on that once it's been merged.

@schmichael schmichael left a comment


Don't forget to update docs.

nomad/eval_endpoint.go, comment on lines +554 to +557:
// We *can* send larger raft logs but rough benchmarks for deleting 1M evals
// show that a smaller page size strikes a balance between throughput and
// time we block the FSM apply for other operations
perPage := structs.MaxUUIDsPerWriteRequest / 10
If only all of our magic numbers had such thorough comments. 😅
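As a rough back-of-the-envelope illustration of the trade-off that comment describes, the sketch below uses made-up numbers (in particular, `fullBatch` is a stand-in and not the real value of `structs.MaxUUIDsPerWriteRequest`, and the per-eval cost is assumed): a smaller page shortens each FSM block at the cost of more raft entries.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		totalEvals  = 1_000_000             // size of the backlog in the PR's benchmark
		fullBatch   = 5_000                 // stand-in for structs.MaxUUIDsPerWriteRequest; not the real value
		perPage     = fullBatch / 10        // the divisor chosen in the quoted code
		costPerEval = 50 * time.Microsecond // assumed per-eval delete cost inside one apply
	)

	blockPerApply := time.Duration(perPage) * costPerEval // how long one apply holds the FSM
	applies := totalEvals / perPage                       // how many raft entries are needed

	fmt.Printf("%d applies, each blocking the FSM for ~%s\n", applies, blockPerApply)
	// A 10x larger page would mean 10x fewer raft entries, but each apply
	// would block the FSM roughly 10x longer.
}
```

With these assumed figures each apply blocks for roughly 25ms, in the same ballpark as the 20-30ms per apply reported in the PR description.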

tgross and others added 3 commits November 14, 2022 13:29
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
@tgross tgross merged commit 65b3d01 into main Nov 14, 2022
@tgross tgross deleted the eval-safe-delete-filter branch November 14, 2022 19:08
shoenig pushed a commit that referenced this pull request Nov 16, 2022
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 16, 2023