
Backport of eval broker: shed all but one blocked eval per job after ack into release/1.4.x #15272


hc-github-team-nomad-core

Backport

This PR is auto-generated from #14621 to be assessed for backporting due to the inclusion of the label backport/1.4.x.

The below text is copied from the body of the original PR.


When an evaluation is acknowledged by a scheduler, the resulting plan is guaranteed to cover up to the ModifyIndex ("wait index") set by the worker based on the most recent evaluation for that job in the state store. At that point, we no longer need to retain blocked evaluations in the broker that are older than that index.
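A minimal Python sketch of the shedding rule, for illustration only. It assumes a broker that holds a list of blocked evals per job. On ack, it moves every blocked eval whose `ModifyIndex` is older than the worker's wait index into a list awaiting cancelation, and it always keeps the newest blocked eval so the job still gets one more scheduling pass. The `Eval` and `Broker` classes and the `ack` method are hypothetical and are not Nomad's Go API.

```python
from dataclasses import dataclass, field


@dataclass
class Eval:
    id: str
    job_id: str
    modify_index: int


@dataclass
class Broker:
    # Hypothetical model, not Nomad's eval broker: job ID -> blocked evals,
    # plus the evals that have been shed and are waiting to be canceled.
    blocked: dict[str, list[Eval]] = field(default_factory=dict)
    cancelable: list[Eval] = field(default_factory=list)

    def ack(self, job_id: str, wait_index: int) -> None:
        """Shed stale blocked evals for job_id after an ack at wait_index."""
        pending = self.blocked.get(job_id, [])
        if not pending:
            return
        # Always retain the newest blocked eval for the job.
        *older, newest = sorted(pending, key=lambda e: e.modify_index)
        # Evals older than the wait index are covered by the acked plan.
        stale = [e for e in older if e.modify_index < wait_index]
        fresh = [e for e in older if e.modify_index >= wait_index]
        self.cancelable.extend(stale)
        self.blocked[job_id] = fresh + [newest]
```

For example, with blocked evals at indexes 10, 20, 30 and 40 and a wait index of 25, the evals at 10 and 20 are shed, while 30 and 40 stay blocked.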

Move these stale evals into a cancelable set. When the Eval.Ack RPC returns from the eval broker, it retrieves a batch of cancelable evals to write to raft. This paces cancelations to the rate at which schedulers acknowledge evals, which should reduce the risk of cancelations overwhelming raft relative to scheduler progress.
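The pacing idea can be sketched in the same spirit. In this hypothetical Python model, `CancelQueue` holds the IDs of shed evals and `handle_ack` stands in for the tail of the Eval.Ack RPC: after each ack, it drains at most `batch_size` IDs and hands them to a `raft_apply` callback as a single write. The class, the callback, and the batch size are assumptions for illustration, not Nomad's implementation.

```python
from collections import deque
from typing import Callable, Iterable


class CancelQueue:
    """Hypothetical holding area for evals the broker has shed."""

    def __init__(self) -> None:
        self._pending: deque[str] = deque()

    def add(self, eval_ids: Iterable[str]) -> None:
        self._pending.extend(eval_ids)

    def take_batch(self, max_size: int) -> list[str]:
        # Pop up to max_size IDs in the order they were shed.
        batch: list[str] = []
        while self._pending and len(batch) < max_size:
            batch.append(self._pending.popleft())
        return batch

    def __len__(self) -> int:
        return len(self._pending)


def handle_ack(queue: CancelQueue,
               raft_apply: Callable[[list[str]], None],
               batch_size: int = 3) -> int:
    """After an ack, write at most one batch of cancelations to raft."""
    batch = queue.take_batch(batch_size)
    if batch:
        raft_apply(batch)  # one write per ack, not one per eval
    return len(batch)
```

Because each ack drains at most one batch, a burst of shed evals becomes a bounded number of raft writes per scheduler acknowledgement rather than one large flush.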

Note that the evals will still need to be deleted during garbage collection, but there's not much we can do about that without preventing the evals from being created in the first place.

original approach

When a node updates its fingerprint or status, we need to create new evaluations to ensure that jobs waiting for resources get a chance to be evaluated. But in a cluster with a large evaluation backlog and flapping nodes, this can pile up many evaluations for the same job. Most of these will be canceled, but each one still requires a raft write to create it and another raft write to delete it.

This changeset proposes that we avoid creating evals for jobs that already have a blocked eval in the eval broker. A blocked eval means that the broker already has work in-flight and work waiting to be re-enqueued, so it's safe to skip creating the new evaluation.
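A hypothetical Python sketch of that check: given the jobs affected by a node update, split them into jobs that need a new eval and jobs that can be skipped because the broker already holds a blocked eval for them. `BrokerView` and `evals_to_create` are illustrative names, not part of Nomad.

```python
from dataclasses import dataclass, field


@dataclass
class BrokerView:
    # Hypothetical view of the broker: job ID -> number of blocked evals.
    blocked_counts: dict[str, int] = field(default_factory=dict)

    def has_blocked(self, job_id: str) -> bool:
        return self.blocked_counts.get(job_id, 0) > 0


def evals_to_create(broker: BrokerView,
                    job_ids: list[str]) -> tuple[list[str], list[str]]:
    """Split jobs into (needs a new eval, can be skipped) for a node update."""
    create: list[str] = []
    skipped: list[str] = []
    for job_id in job_ids:
        if broker.has_blocked(job_id):
            # Work is already in flight and a follow-up is waiting to be
            # re-enqueued, so a new eval would add only raft traffic.
            skipped.append(job_id)
        else:
            create.append(job_id)
    return create, skipped
```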

@hc-github-team-nomad-core force-pushed the backport/f-evaluation-load-shedding/terribly-diverse-dane branch from 9c2aef7 to f28633c on November 16, 2022 21:10
@hc-github-team-nomad-core merged commit 1e7a918 into release/1.4.x on Nov 16, 2022
@hc-github-team-nomad-core deleted the backport/f-evaluation-load-shedding/terribly-diverse-dane branch on November 16, 2022 21:10
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Mar 17, 2023