eval broker: shed all but one blocked eval per job after ack #14621
nomad/eval_broker.go
Outdated
```go
// HasBlockedEval is used to check if the eval broker already has a blocked eval for this job.
func (b *EvalBroker) HasBlockedEval(namespacedJobID structs.NamespacedID) bool {
	b.l.Lock()
	defer b.l.Unlock()
	_, ok := b.blocked[namespacedJobID]
	return ok
}
```
Node update evals always come in with their priority set to the job priority. Maybe we should check if the blocked eval has a lower priority here as well, so that we can bump-over an eval that was submitted by a user with a low priority?
My initial concerns around this idea are that things like `SnapshotIndex` and `NodeModifyIndex` aren't getting evals with updated values, but `NodeModifyIndex` is just informational (never consumed by Nomad) and the `SnapshotIndex` is set at scheduling time, so there's nothing for us to screw up with that either -- none of the evals in the `PendingEvals` list we call "blocked" here will have `SnapshotIndex` set either.
(ref #6480 (comment) for the terrible confusion between a "blocked" eval and an eval with `EvalStatusBlocked`)
Great analysis. I had assumed `{Job,Node}ModifyIndex` would be used for the minimum index of the state store when initially dequeuing the eval in the worker, but that doesn't appear to be the case. The index is the max `ModifyIndex` of all evals for the job.
(That actually seems like an overly pessimistic index to use as it ensures the oldest eval for a job evaluates against the same index as the latest eval for a job... why even bother processing those subsequent evals then since there's no guarantee the state will be later than when the first eval was committed? If a new eval comes in in the meantime, it would need to get processed.)
Unfortunately I think this logic poses a problem for your otherwise fantastic patch: you will no longer be creating an eval with a potentially higher `ModifyIndex` than all of the blocked evals for the job. So the job may end up not getting evaluated against the latest index... which is particularly a problem if the whole point is to reevaluate the job against this index of the node.
To try to illustrate: The old behavior ("Blocked" means "dupe pending eval for a job" -- BlockedEvals don't factor in here AFAICT)
- Eval E1 for Job J1 created @ Index 10
- Eval E2 for Job J1 created and blocked @ Index 20
- Worker dequeues E1, gets a WaitIndex of 20
- Node.Update occurs, updates Node and creates Eval E3 for J1 at Index 30
- Worker dequeues E2, gets a WaitIndex of 30
- Worker dequeues E3, gets a WaitIndex of 30
With this PR:
- Eval E1 for Job J1 created @ Index 10
- Eval E2 for Job J1 created and blocked @ Index 20
- Worker dequeues E1, gets a WaitIndex of 20
- Node.Update occurs, updates Node at index 30, sees E2 blocked and skips eval creation
- Worker dequeues E2, gets a WaitIndex of 20
So we may never evaluate the node at its latest ModifyIndex (30).
I could be wrong though... lots of moving parts here. Honestly I'm not sure why we calculate WaitIndex the way we do... using `Eval.{Job,Node}ModifyIndex` (based on `TriggeredBy`) initially seems correct. Using `max(Eval.ModifyIndex, Eval.SnapshotIndex)` always seems safe and potentially optimal though (I would assume Eval.ModifyIndex is always >= its `{Job,Node}ModifyIndex`).

I almost want to remove `{Job,Node}ModifyIndex` if they're not used. I can't imagine them being useful for end users. 🤷
If my assessment of the current behavior is correct, then obviously the eval you're skipping creating is useless to actually process, but it is necessary to ensure the modify index is high enough.
Yeah.
I was trying to work up a model of the schedulers/broker/raft inspired by the aws s3 lightweight formal methods paper (which is an excellent read, btw). But I needed an invariant to test: we want any change to cluster state (i.e. state at a given index) to be eventually reflected in the last evaluation processed. But without writing the evaluation to raft we don't have a valid index to compare against anywhere in the scheduler.
So for some reason I had it in my head we only had a wait index concept in the planner but it's obvious in retrospect that we'd need it in the scheduler too.
> (That actually seems like an overly pessimistic index to use as it ensures the oldest eval for a job evaluates against the same index as the latest eval for a job... why even bother processing those subsequent evals then since there's no guarantee the state will be later than when the first eval was committed? If a new eval comes in in the meantime, it would need to get processed.)
This observation could inform an approach to shed duplicates from the eval broker, even if we can't prevent them from getting written to raft. Don't try to avoid adding them to the `blocked` list of `PendingEvals`. But if we get an ack for an eval at `ModifyIndex = n`, then I think we can immediately discard all evals in `PendingEvals` for that job with `ModifyIndex < n` and batch cancel them for efficiency.
Hacking on this idea over in https://github.com/hashicorp/nomad/compare/main..f-evaluation-load-shedding-take-2 but will update this PR with the results once it's at all suitable for discussion.
Something I've run into here is that when we `heap.Pop` the `blocked` queue, we're always getting the oldest evaluation instead of the newest one. So when it later gets ack'd, it's never newer than the others. The `PendingEvals` type sorts by Priority and then CreateIndex FIFO, which is fine for `pending` (we probably need to do that for fairness across jobs). But for the `blocked` queue we want to sort by Priority but CreateIndex LIFO for this idea to work. Will have to split the types and see where that goes next.
> we're always getting the oldest evaluation instead of the newest one. So when it later gets ack'd, it's never newer than the others.
While this is true, I don't think it matters because the `WaitIndex` given to workers as the min index to work from is calculated to be the max modify index for all evals for the job: https://github.com/hashicorp/nomad/blob/v1.4.0-beta.1/nomad/eval_endpoint.go#L195-L208
So for a given job it doesn't matter which of its evals we process first, we'll always process it against the latest known index at the time of dequeue. I think when completing an eval we could go ahead and cancel all other evals for the same job with a <= ModifyIndex to the plan's SnapshotIndex as that's guaranteed to be >= the WaitIndex.
Put another way, if you have the following evals for the same job:
- node-update @ 100
- preemption @ 200
Then even though the `node-update @ 100` is dequeued first, the `WaitIndex` will be `200`. The resulting Plan's `SnapshotIndex` will be >= `200`, so if the `node-update @ 100` eval is complete, the `preemption @ 200` eval can be canceled as well.
Oldest Eval ModifyIndex <= WaitIndex at Dequeue time <= Worker's Snapshot Index
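The `WaitIndex` behavior underlying this invariant can be sketched as a tiny helper. This is a hedged illustration of the max-ModifyIndex rule described above (the real logic lives in `Eval.Dequeue` in `nomad/eval_endpoint.go`; `maxModifyIndex` is a hypothetical name):

```go
package main

import "fmt"

// maxModifyIndex mirrors the WaitIndex calculation described above: the
// worker is handed the max ModifyIndex across all evals for the job, so
// it doesn't matter which of the job's evals is dequeued first.
func maxModifyIndex(indexes []uint64) uint64 {
	var max uint64
	for _, idx := range indexes {
		if idx > max {
			max = idx
		}
	}
	return max
}

func main() {
	// node-update eval @ 100 and preemption eval @ 200 for the same job:
	// even if the node-update eval is dequeued first, the worker waits for
	// index 200, so its snapshot covers the change that triggered the
	// preemption eval.
	fmt.Println(maxModifyIndex([]uint64{100, 200})) // 200
}
```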
Ok, that behavior doesn't show up at a unit testing level because we're not running a real worker, just making `Dequeue` and `Ack` calls, but that eliminates some extra code I've had to write here. Just need to update the tests to match that behavior and I think I've actually got this working.
diagram for the original approach to this PR, no longer applicable:

```mermaid
sequenceDiagram
    Node RPC-->>Broker: node-update 1: has blocked eval? (no)
    Node RPC->>Raft: node-update 1
    Raft-->>Broker: enqueue eval
    Node RPC-->>Broker: node-update 2: has blocked eval? (no)
    Node RPC->>Raft: node-update 2
    Raft-->>Broker: can't enqueue eval (blocked)
    Node RPC-->>Broker: node-update 3: has blocked eval? (yes)
    Broker-->>Scheduler: GetEval
    Node RPC-->>Broker: node-update 4: has blocked eval? (no)
    Node RPC->>Raft: node-update 4
    Raft-->>Broker: can't enqueue eval (blocked)
```
@schmichael @lgfa29 @shoenig I've now got this into working shape (and green on CI except for the usual test flakes in unrelated areas of code). I'm reasonably comfortable with the idea of landing this in 1.4.0-rc1 for next week but it's probably worth having a discussion about how good we feel about this change coming so late.
Great job!
I was trying to think if there was another place where the batch raft update happens to keep the logic closer together (like some kind of cancelled evals GC). I was also a bit confused initially why the endpoint was calling `Cancelable(1000)` since it would return evals not necessarily related to the eval being ACK'ed, but I don't think the broker has access to raft.
I've spent most of my day trying to reason about this and my brain is fully and completely melted. 😅
I think everything you have done is correct and safe, but I don't think it's optimal. The current approach may not cancel any evals in realistic situations, but I'm not sure about that.
nomad/eval_broker.go
Outdated
```diff
@@ -559,6 +564,7 @@ func (b *EvalBroker) Ack(evalID, token string) error {
 		return fmt.Errorf("Token does not match for Evaluation ID")
 	}
 	jobID := unack.Eval.JobID
+	recentlyAckedIndex := unack.Eval.ModifyIndex
```
Update: See my comment below about canceling all but 1 eval. I don't think the rest of this comment matters at all, but I spent like 4 hours on it so by golly I'm leaving it. 😵💫
tl;dr -> I'm not sure this will cancel any evals. I think having the worker put `Eval.SnapshotIndex` into `structs.AckRequest` would allow the broker to cancel the maximum number of evals safely. This gets extremely difficult to reason about as it seems every layer of the scheduler (broker, rpc, worker, scheduler) does its own funny little index handling, so please check my work! 😬

AFAICT `Eval.ModifyIndex` does not include the index at which the eval was updated after committing a plan. `unack` is pulled out of `EvalBroker.unack`, which contains the original copy of the eval from when it was first inserted into the broker as ready.

This means that `Eval.ModifyIndex` is pessimistic in the sense that it is the minimum possible value the Worker's snapshot was at when this evaluation was processed. Luckily we have a whole smorgasbord of indexes in the worker to choose from:

- `EvalDequeueResponse.WaitIndex` - the maximum Eval.ModifyIndex of all evals for this job when it was dequeued.
- `Eval.SnapshotIndex` - the actual Raft index the worker's state snapshot used. Only available on the updated Eval.
- `PlanResult.AllocIndex / PlanResponse.Index` - returned by the `Plan.Submit` RPC to the worker who submitted the plan, but never used. I think the worker only ever uses the `PlanResult.RefreshIndex` and that index never bubbles up to where the Worker calls `Ack`.
I think the following holds as an invariant:

```
PlanResult.AllocIndex > {Updated}Eval.SnapshotIndex >= EvalDequeueResponse.WaitIndex >= {Broker}Eval.ModifyIndex
```

- `PlanResult.AllocIndex` - too new! cluster may have changed prior to this
- `{Updated}Eval.SnapshotIndex` - no point in processing new evals < this index
- `EvalDequeueResponse.WaitIndex` - safe choice, as if it's > ModifyIndex that indicates there are pending evals that can be optimized out!
- `{Broker}Eval.ModifyIndex` - this is what's being used
I think `Eval.SnapshotIndex` is precisely what we're looking for. There's no point processing a new eval for the same job if we're not going to process it at a higher index than this, because we will come up with the same result as before since we haven't allowed the cluster state to change!

Since there could be updates between `PR.AllocIndex` and `SnapshotIndex`, we can't drop any evals based on `AllocIndex`.
nomad/eval_broker.go
Outdated
```go
// MarkForCancel is used to remove any evaluations older than the index from the
// blocked list and returns a list of cancelable evals so that Eval.Ack RPCs can
// write batched raft entries to cancel them. This must be called inside the
// broker's lock.
func (p *PendingEvaluations) MarkForCancel(index uint64) PendingEvaluations {

	// In pathological cases, we can have a large number of blocked evals but
	// will want to cancel most of them. Using heap.Remove requires we re-sort
	// for each eval we remove. Because we expect to have very few if any
	// blocked remaining, we'll just create a new heap
	retain := PendingEvaluations{}
	cancelable := PendingEvaluations{}

	for _, eval := range *p {
		if eval.ModifyIndex >= index {
			heap.Push(&retain, eval)
		} else {
			heap.Push(&cancelable, eval)
		}
	}

	*p = retain
	return cancelable
}
```
Hold on... due to my "favorite" logic in the `Eval.Dequeue` RPC: https://github.com/hashicorp/nomad/blob/v1.4.1/nomad/eval_endpoint.go#L185-L211

...I think we can cancel every eval except the one with the highest ModifyIndex?!

Since on Dequeue we always set `WaitIndex = max(ModifyIndex)`, it doesn't matter which eval we dequeue next for a job: we'll always process it against a new enough snapshot to capture every cluster change that triggered evals!

...it almost makes me think we could cancel these evals in the FSM, but we'd still need to cancel them in EvalBroker since the broker isn't a "live" view of the evals in the statestore... might not be impossible, but when in doubt I like to keep the FSM simple.
I think you're right on this, but I'm going to work up a diagram of what indexes we have and what eval states we have in the broker just to double-check.

My next instinct here is to mark any `blocked` evals as cancelable at the time we enqueue rather than at the time we ack. (Edit: from my diagram below I think this doesn't work because we don't drop blocked evals from the broker on Nack, so let's put a pin in this.) Have `EvalBroker.blocked` contain a spot for at most one eval, and have `EvalBroker.Enqueue` bump the `blocked` eval into the `cancelable` list immediately if it's got a newer one. (As you've said, I'd love to be able to have the FSM detect that and delete the eval right away as well, but the followers wouldn't see the eval broker state. This would be the next best thing.) Let me work up that diagram first and we can come back to this discussion.
I think one of the things that's misleading about the existing eval broker code is the `PendingEvaluations` sorting by `CreateIndex`. The `PendingEvaluations` alias over a slice of evals gets used by both `blocked` and `ready`, but only `ready` actually cares about the index b/c it contains evals from multiple jobs. The `blocked` slice is full of evals from the same job, so the order we pick them off doesn't matter.
Well merging a state diagram together with the sequence diagram for the RPCs turned out to be a bad idea. But after a couple hours re-re-reading this code I've got a state diagram here that makes most of the problem clear, which I can annotate.
```mermaid
stateDiagram-v2
    isPending: check if jobEvals\n(pending JobID) is empty
    state if_state <<choice>>
    [*] --> isPending: Enqueue
    isPending --> if_state
    if_state --> ready: no pending JobID.\nready is sorted\nby priority and CreateIndex
    if_state --> blocked: pending JobID
    blocked --> ready: Ack
    ready --> unack: Dequeue
    unack --> [*]: Ack
    unack --> isPending: Nack
```
When we enqueue, we check `jobEvals` (better described as "pending jobs") to see if it's non-empty.

- If it's empty, we insert the job ID into `jobEvals` and push the eval onto the `ready` heap.
- If it's non-empty, we insert the eval into `blocked` and return.
When we dequeue, we:

- `scanForSchedulers`:
  - peeks evals from `ready` for each scheduler
  - finds the eval with the highest priority (or picks between ones with the same)
  - in `dequeueForSched`, pops the selected eval from `ready` and adds it to `unack`
- `getWaitIndex(eval.ns, eval.jobid, eval.ModifyIndex)` for the eval we got:
  - gets the highest modify index for any eval for the job, or the dequeued eval's modify index, whichever is greater. (In practice I don't think we'll ever use the dequeued eval's modify index here if that eval was ever `blocked`.)
When we submit the plan:

- `updateEvalModifyIndex` updates the eval, but this will not be reflected in the eval broker:

```go
// updateEvalModifyIndex is used to update the modify index of an evaluation that has been
// through a scheduler pass. This is done as part of plan apply. It ensures that when a subsequent
// scheduler workers process a re-queued evaluation it sees any partial updates from the plan apply.
```
When we ack the eval:

- remove from `unack`
- delete jobID from `jobEvals` so the job isn't marked "pending"
- enqueue top `blocked` item (which will mark us as "pending" again)
So `ready` sorting is the only thing in the eval broker that cares about `CreateIndex`, and only `getWaitIndex` checks the modify index. As long as we're always setting the eval with the highest modify index as ready, we're good. I think we can cancel the entire `blocked` list when we do so.
```mermaid
stateDiagram-v2
    isPending: check if jobEvals\n(pending JobID) is empty
    state if_state <<choice>>
    [*] --> isPending: Enqueue
    isPending --> if_state
    if_state --> ready: no pending JobID.\nready is sorted\nby priority and CreateIndex
    if_state --> blocked: pending JobID
    blocked --> ready: on Ack\neval w/ highest ModifyIndex
    blocked --> canceled: cancel all the rest
    ready --> unack: Dequeue
    unack --> [*]: Ack
    unack --> isPending: Nack
```
Alternately, if we wanted to be extra paranoid to ensure we haven't missed something, we could cancel all but one eval out of `blocked`. That still helps us for the "large number of blocked evals" scenario but makes for bimodal operation where we won't hit this new cancel workflow except in cases where there's a backlog, which I don't love. (Edit: this ends up being the safe route, so I've done that.)
I'll implement this and make sure we've got an integration test that exercises it.
Started on implementation and immediately remembered that evals have their own priority, so we can't just naively use the highest modify index either: we could get a higher-priority eval (ex 100) and then a lower-priority one (ex 50). If we just use highest modify index the lower-priority eval would be the one that gets set to ready and we'd discard the high-priority one. Then that low-priority eval could get starved by middling-priority evals for other jobs (ex 70). We'll need to account for both fields.
I've solved this by having a separate `PendingEvaluations` and `BlockedEvaluations` with different sort criteria.
When an evaluation is acknowledged by a scheduler, the resulting plan is guaranteed to cover up to the `waitIndex` set by the worker based on the most recent evaluation for that job in the state store. At that point, we no longer need to retain blocked evaluations in the broker that are older than that index.

Move all but the highest priority / highest `ModifyIndex` blocked eval into a canceled set. When the `Eval.Ack` RPC returns from the eval broker it will retrieve a batch of cancelable evals to write to raft. This paces the cancelations limited by how frequently the schedulers are acknowledging evals; this should reduce the risk of cancelations overwhelming raft relative to scheduler progress. In order to avoid straggling batches when the cluster is quiet, we also include a periodic sweep through the cancelable list.
I've run a middling-scale performance test of this PR and the results look pretty good. I stood up a cluster of 3

At this point, there are 49730 pending evaluations in the state store and in the After 15s (19:30):

After ~2m (19:32), there are still 16950 evals in the At this point, I stopped the 5000 client nodes. Within 10s, all remaining evals on the

As each of the 5000 nodes miss heartbeats they spawn 10 evaluations each. We can see from the broker metrics that at no point do we have any backlog on the

@tgross That test is phenomenal! Any chance you have the terraform or whatever laying around to run it again against 1.4.2? I'd love to compare the graphs.

I re-ran the test with 1.4.2 and it takes about 25m to reach the state we were able to reach in 5m previously. Here's the chart for 1.4.2. A few notes:

And here's the same chart from above, with the time normalized to an elapsed time so we can compare apples-to-apples with the 1.4.2 chart. Conclusion: we're handling the same work in 20% of the time.
No blockers or even logic changes! Merge away and fantastic work. This is the most significant eval broker improvement in years. 🎉
```go
// PendingEvaluations is a list of ready evaluations across multiple jobs. We
// implement the container/heap interface so that this is a priority queue.
type PendingEvaluations []*structs.Evaluation

// BlockedEvaluations is a list of blocked evaluations for a given job. We
// implement the container/heap interface so that this is a priority queue.
type BlockedEvaluations []*structs.Evaluation
```
I have a slight preference for renaming `[Bb]locked -> [Pp]ending` everywhere and using `ReadyEvaluations` for the `ready` queue, but your naming scheme makes sense and changes less code, which is nice for such a critical PR.

If we do want to rename for future readers, let's do it in a followup.
Agreed.
Plus I still need to fix the metric name for #6480 so we can do that at the same time.
```diff
@@ -230,7 +230,8 @@ func (e *Eval) Ack(args *structs.EvalAckRequest,
 	if err := e.srv.evalBroker.Ack(args.EvalID, args.Token); err != nil {
 		return err
 	}
-	return nil
+
+	return cancelCancelableEvals(e.srv)
```
Hm, so this introduces a raft operation in `Ack(...)` which didn't have one before.

Since cancelable evals are already moved out of the blocked queue, we don't need to commit them to make the optimization work. However the longer we wait between updating the Ack'd eval and committing canceled evals, the longer the statestore will confusingly have `ready` evals that aren't being processed because they're sitting in the cancelable queue... 🤔

Since your benchmarking demonstrated this approach is already a massive improvement, let's leave this be.

However, I think we could make this async (maybe by just making it wake up the reapCancelableEvaluations loop?) and potentially speed things up further.
If I'm right it might be worth commenting, because it will be very hard for future folks to load the context to make this determination:

```diff
-	return cancelCancelableEvals(e.srv)
+	// It's not necessary to cancel evals before Ack returns, but it's done here to
+	// commit canceled evals as close to the Ack'd eval being committed as possible.
+	return cancelCancelableEvals(e.srv)
```
> However, I think we could make this async (maybe by just making it wakeup the reapCancelableEvaluations loop?) and potentially speed things up further.

Yeah, that's part of why the `reapCancelableEvaluations` loop runs fairly frequently. I like the idea of having it nudge the reap loop. I can do that under a separate PR.

In the meantime, made the comment fix.
Actually, the wake-up channel is probably a safer implementation -- if it's a buffered channel, that'll ensure that `Ack` never gets blocked on raft. I've done that here. That also lets me move the `cancelCancelableEvals` into `leader.go`, as it now has only that one caller.
Did you mean to change this call to use the wakeup chan, because it's still performing the raft apply synchronously on main.
I don't think it's a big deal, especially because your benchmark demonstrated it's an immense speedup already.
🤦 Yes. Thanks for catching that. This line should now be:

```go
s.reapCancelableEvalsCh <- struct{}{}
return nil
```
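One subtlety worth noting: a bare send like the line above relies on the channel being buffered, and a common defensive variant is a `select` with a `default` so the sender can never block even if a wake-up is already pending. A minimal sketch of that pattern (the `nudge` helper is illustrative, not the actual Nomad code):

```go
package main

import "fmt"

// nudge performs a non-blocking send on a buffered wake-up channel. If a
// wake-up signal is already pending, dropping the extra one is harmless
// because the consumer drains all available work on each pass.
func nudge(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true // signal buffered for the reap loop
	default:
		return false // a wake-up is already pending; drop this one
	}
}

func main() {
	wakeCh := make(chan struct{}, 1)
	fmt.Println(nudge(wakeCh)) // true: signal buffered
	fmt.Println(nudge(wakeCh)) // false: already pending, dropped
	<-wakeCh                   // the reap loop consumes the signal
	fmt.Println(nudge(wakeCh)) // true: buffer has room again
}
```

With a buffer of one and a single consumer, coalescing dropped signals this way is safe: the pending signal already guarantees the consumer will run again.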
Fixed in #15294
In #14621 we added an eval cancelation reaper goroutine with a channel that allowed us to wake it up. But we forgot to actually send on this channel from `Eval.Ack` and are still committing the cancelations synchronously. Fix this by sending on the buffered channel to wake up the reaper instead.
When an evaluation is acknowledged by a scheduler, the resulting plan is guaranteed to cover up to the `ModifyIndex` ("wait index") set by the worker based on the most recent evaluation for that job in the state store. At that point, we no longer need to retain blocked evaluations in the broker that are older than that index.

Move these stale evals into a canceled set. When the `Eval.Ack` RPC returns from the eval broker it will retrieve a batch of cancelable evals to write to raft. This paces the cancelations limited by how frequently the schedulers are acknowledging evals; this should reduce the risk of cancelations overwhelming raft relative to scheduler progress.

Note that the evals will still need to be deleted during garbage collection, but there's not much we can do about that without preventing the evals from being created in the first place.

original approach

When a node updates its fingerprint or status, we need to create new evaluations to ensure that jobs waiting for resources get a chance to be evaluated. But in the case of a cluster with a large backlog of evaluations and flapping nodes, we can get a large backlog of evaluations for the same job. Most of these will be canceled but we still need to write the evaluation in raft and then write its deletion in raft.

This changeset proposes that we avoid creating evals for jobs that already have a blocked eval in the eval broker. A blocked eval means that the broker already has work in-flight and work waiting to be re-enqueued, so it's safe to drop the evaluation.