
eval broker: shed all but one blocked eval per job after ack #14621

Merged · 3 commits · Nov 16, 2022

Conversation

tgross (Member) commented Sep 19, 2022:

When an evaluation is acknowledged by a scheduler, the resulting plan is guaranteed to cover up to the ModifyIndex ("wait index") set by the worker based on the most recent evaluation for that job in the state store. At that point, we no longer need to retain blocked evaluations in the broker that are older than that index.

Move these stale evals into a canceled set. When the Eval.Ack RPC returns from the eval broker it will retrieve a batch of cancelable evals to write to raft. This paces the cancelations by how frequently the schedulers are acknowledging evals, which should reduce the risk of cancelations overwhelming raft relative to scheduler progress.

Note that the evals will still need to be deleted during garbage collection, but there's not much we can do about that without preventing the evals from being created in the first place.
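A rough sketch of that ack-side flow, for orientation only: the helper body, batch size, and raft-apply call below are illustrative assumptions, not the merged implementation.

// Sketch only: after a successful Ack, drain a batch of cancelable evals and
// commit them to raft in one batched update, so cancelation work is paced by
// scheduler acks rather than by node churn.
func cancelCancelableEvals(srv *Server) error {
    const batchSize = 1000 // illustrative; the real batch size may differ

    cancelable := srv.evalBroker.Cancelable(batchSize)
    if len(cancelable) == 0 {
        return nil
    }
    // Hypothetical raft helper: mark the whole batch canceled in a single
    // raft entry instead of one entry per eval.
    return srv.applyEvalCancelBatch(cancelable)
}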

original approach

When a node updates its fingerprint or status, we need to create new evaluations to ensure that jobs waiting for resources get a chance to be evaluated. But in the case of a cluster with a large backlog of evaluations and flapping nodes, we can get a large number of evaluations for the same job. Most of these will be canceled, but we still need to write each evaluation to raft and then write its deletion to raft.

This changeset proposes that we avoid creating evals for jobs that already have a blocked eval in the eval broker. A blocked eval means that the broker already has work in-flight and work waiting to be re-enqueued, so it's safe to drop the evaluation.

Comment on lines 879 to 885
// HasBlockedEval is used to check if the eval broker already has a blocked eval for this job.
func (b *EvalBroker) HasBlockedEval(namespacedJobID structs.NamespacedID) bool {
    b.l.Lock()
    defer b.l.Unlock()
    _, ok := b.blocked[namespacedJobID]
    return ok
}
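For context, a hedged sketch of how the node-update code path might have used this helper under the original approach; the surrounding names (jobsOnNode, newNodeUpdateEval) are paraphrased placeholders, not the actual Nomad source.

// Sketch only: when creating evals in response to a node update, skip jobs
// that already have a blocked eval waiting in the broker.
for _, job := range jobsOnNode {
    id := structs.NamespacedID{Namespace: job.Namespace, ID: job.ID}
    if n.srv.evalBroker.HasBlockedEval(id) {
        continue // the broker already has work queued for this job
    }
    evals = append(evals, newNodeUpdateEval(job, node)) // hypothetical constructor
}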
tgross (Member, Author) commented:

Node update evals always come in with their priority set to the job priority. Maybe we should check if the blocked eval has a lower priority here as well, so that we can bump-over an eval that was submitted by a user with a low priority?

tgross (Member, Author) commented Sep 19, 2022:

My initial concern with this idea is that fields like SnapshotIndex and NodeModifyIndex won't get updated values on new evals, but NodeModifyIndex is just informational (never consumed by Nomad) and SnapshotIndex is set at scheduling time, so there's nothing for us to screw up there either -- none of the evals in the PendingEvals list we call "blocked" here will have SnapshotIndex set either.

(ref #6480 (comment) for the terrible confusion between a "blocked" eval and an eval with EvalStatusBlocked)

Member commented:

Great analysis. I had assumed {Job,Node}ModifyIndex would be used for the minimum index of the state store when initially dequeuing the eval in the worker, but that doesn't appear to be the case. The index is the max ModifyIndex of all evals for the job. ‼️

(That actually seems like an overly pessimistic index to use as it ensures the oldest eval for a job evaluates against the same index as the latest eval for a job... why even bother processing those subsequent evals then since there's no guarantee the state will be later than when the first eval was committed? If a new eval comes in in the mean time, it would need to get processed.)

Unfortunately I think this logic poses a problem for your otherwise fantastic patch: you will no longer be creating an eval with a potentially higher ModifyIndex than all of the blocked evals for the job. So the job may end up not getting evaluated against the latest index... which is particularly a problem if the whole point is to reevaluate the job against this index of the node.

To try to illustrate: The old behavior ("Blocked" means "dupe pending eval for a job" -- BlockedEvals don't factor in here AFAICT)

  1. Eval E1 for Job J1 created @ Index 10
  2. Eval E2 for Job J1 created and blocked @ Index 20
  3. Worker dequeues E1, gets a WaitIndex of 20
  4. Node.Update occurs, updates Node and creates Eval E3 for J1 at Index 30
  5. Worker dequeues E2, gets a WaitIndex of 30
  6. Worker dequeues E3, gets a WaitIndex of 30

With this PR:

  1. Eval E1 for Job J1 created @ Index 10
  2. Eval E2 for Job J1 created and blocked @ Index 20
  3. Worker dequeues E1, gets a WaitIndex of 20
  4. Node.Update occurs, updates Node at index 30, sees E2 blocked and skips eval creation
  5. Worker dequeues E2, gets a WaitIndex of 20

So we may never evaluate the node at its latest ModifyIndex (30).

I could be wrong though... lots of moving parts here. Honestly I'm not sure why we calculate WaitIndex the way we do... using Eval.{Job,Node}ModifyIndex (based on TriggeredBy) initially seems correct. Using max(Eval.ModifyIndex, Eval.SnapshotIndex) always seems safe and potentially optimal though (I would assume Eval.ModifyIndex is always >= its {Job,Node}ModifyIndex)

I almost want to remove {Job,Node}ModifyIndex if they're not used. I can't imagine them being useful for end users. 🤷

If my assessment of the current behavior is correct, then obviously the eval you're skipping creating is useless to actually process, but it is necessary to ensure the modify index is high enough.

tgross (Member, Author) commented:

Yeah. ☹️ I hit this problem from a different direction last night.

I was trying to work up a model of the schedulers/broker/raft inspired by the AWS S3 lightweight formal methods paper (which is an excellent read, btw). But I needed an invariant to test: we want any change to cluster state (i.e. state at a given index) to be eventually reflected in the last evaluation processed. But without writing the evaluation to raft we don't have a valid index to compare against anywhere in the scheduler.

So for some reason I had it in my head we only had a wait index concept in the planner but it's obvious in retrospect that we'd need it in the scheduler too.

tgross (Member, Author) commented:

(That actually seems like an overly pessimistic index to use as it ensures the oldest eval for a job evaluates against the same index as the latest eval for a job... why even bother processing those subsequent evals then since there's no guarantee the state will be later than when the first eval was committed? If a new eval comes in in the mean time, it would need to get processed.)

This observation could inform an approach to shed duplicates from the eval broker, even if we can't prevent them from getting written to raft. Don't try to avoid adding them to the blocked list of PendingEvals. But if we get an ack for an eval at ModifyIndex = n, then I think we can immediately discard all evals in PendingEvals for that job with ModifyIndex < n and batch cancel them for efficiency.

tgross (Member, Author) commented:

Hacking on this idea over in https://github.com/hashicorp/nomad/compare/main..f-evaluation-load-shedding-take-2 but will update this PR with the results once it's at all suitable for discussion.

tgross (Member, Author) commented:

Something I've run into here is that when we heap.Pop the blocked queue, we're always getting the oldest evaluation instead of the newest one. So when it later gets ack'd, it's never newer than the others. The PendingEvals type sorts by Priority and then CreateIndex FIFO, which is fine for pending (we probably need to do that for fairness across jobs). But for the blocked queue we want to sort by Priority but CreateIndex LIFO for this idea to work. Will have to split the types and see where that goes next.
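A quick sketch of the LIFO ordering being proposed here; the type name anticipates the split discussed later in the thread, and the exact tie-break is illustrative rather than the merged code.

// Sketch only: a heap ordering for a single job's blocked evals that prefers
// higher priority first, then the newest eval (CreateIndex LIFO), so Pop
// returns the eval most likely to carry the highest indexes.
func (b BlockedEvaluations) Less(i, j int) bool {
    if b[i].Priority != b[j].Priority {
        return b[i].Priority > b[j].Priority
    }
    return b[i].CreateIndex > b[j].CreateIndex
}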

schmichael (Member) commented Sep 22, 2022:

we're always getting the oldest evaluation instead of the newest one. So when it later gets ack'd, it's never newer than the others.

While this is true, I don't think it matters, because the WaitIndex given to workers as the minimum index to work from is calculated to be the max modify index of all evals for the job: https://github.com/hashicorp/nomad/blob/v1.4.0-beta.1/nomad/eval_endpoint.go#L195-L208

So for a given job it doesn't matter which of its evals we process first, we'll always process it against the latest known index at the time of dequeue. I think when completing an eval we could go ahead and cancel all other evals for the same job with a ModifyIndex <= the plan's SnapshotIndex, as that's guaranteed to be >= the WaitIndex.

Put another way, if you have the following evals for the same job:

  1. node-update @ 100
  2. preemption @ 200

Then even though the node-update @ 100 is dequeued first, the WaitIndex will be 200. The resulting Plan's SnapshotIndex will be >= 200 so if the node-update @ 100 eval is complete, the preemption @ 200 eval can be canceled as well.

Oldest Eval ModifyIndex <= WaitIndex at Dequeue time <= Worker's Snapshot Index
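A paraphrased sketch of that WaitIndex calculation (linked above), assuming a state-store accessor along the lines of EvalsByJob; see the linked source for the exact code.

// Sketch: the worker must wait for a snapshot at least as new as the highest
// ModifyIndex of any eval for the job, so it never schedules against state
// older than the newest eval that triggered work.
func getWaitIndex(store *state.StateStore, namespace, jobID string, evalModifyIndex uint64) (uint64, error) {
    evals, err := store.EvalsByJob(nil, namespace, jobID)
    if err != nil {
        return 0, err
    }
    waitIndex := evalModifyIndex
    for _, e := range evals {
        if e.ModifyIndex > waitIndex {
            waitIndex = e.ModifyIndex
        }
    }
    return waitIndex, nil
}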

tgross (Member, Author) commented:

Ok, that behavior doesn't show up at a unit testing level because we're not running a real worker, just making Dequeue and Ack calls, but that eliminates some extra code I've had to write here. Just need to update the tests to match that behavior and I think I've actually got this working.

tgross (Member, Author) commented Sep 19, 2022:

diagram for the original approach to this PR, no longer applicable
sequenceDiagram
    Node RPC-->>Broker: node-update 1: has blocked eval? (no)
    Node RPC->>Raft: node-update 1
    Raft-->>Broker: enqueue eval
    Node RPC-->>Broker: node-update 2: has blocked eval? (no)
    Node RPC->>Raft: node-update 2
    Raft-->>Broker: can't enqueue eval (blocked)
    Node RPC-->>Broker: node-update 3: has blocked eval? (yes)    
    Broker-->>Scheduler: GetEval
    Node RPC-->>Broker: node-update 4: has blocked eval? (no)
    Node RPC->>Raft: node-update 4
    Raft-->>Broker: can't enqueue eval (blocked)

tgross (Member, Author) commented Sep 23, 2022:

@schmichael @lgfa29 @shoenig I've now got this into working shape (and green on CI except for the usual test flakes in unrelated areas of code). I'm reasonably comfortable with the idea of landing this in 1.4.0-rc1 for next week but it's probably worth having a discussion about how good we feel about this change coming so late.

lgfa29 (Contributor) left a comment:

Great job!

I was trying to think whether there was another place where the batch raft update could happen, to keep the logic closer together (like some kind of canceled-evals GC). I was also a bit confused initially about why the endpoint was calling Cancelable(1000), since it would return evals not necessarily related to the eval being ACK'd, but I don't think the broker has access to raft.
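For illustration, the broker-side accessor the endpoint calls might look roughly like this; the cancelable field and locking details are assumptions about the broker's internals, not the merged code.

// Sketch only: return up to batchSize evals from the broker's cancelable set,
// removing them so each is only written to raft once.
func (b *EvalBroker) Cancelable(batchSize int) []*structs.Evaluation {
    b.l.Lock()
    defer b.l.Unlock()

    if batchSize > len(b.cancelable) {
        batchSize = len(b.cancelable)
    }
    batch := b.cancelable[:batchSize]
    b.cancelable = b.cancelable[batchSize:]
    return batch
}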

schmichael (Member) left a comment:

I've spent most of my day trying to reason about this and my brain is fully and completely melted. 😅

I think everything you have done is correct and safe, but I don't think it's optimal. The current approach may not cancel any evals in realistic situations, but I'm not sure about that.

@@ -559,6 +564,7 @@ func (b *EvalBroker) Ack(evalID, token string) error {
         return fmt.Errorf("Token does not match for Evaluation ID")
     }
     jobID := unack.Eval.JobID
+    recentlyAckedIndex := unack.Eval.ModifyIndex
Member commented:

Update: See my comment below about canceling all but 1 eval. I don't think the rest of this comment matters at all, but I spent like 4 hours on it so by golly I'm leaving it. 😵‍💫

tl;dr -> I'm not sure this will cancel any evals. I think having the worker put Eval.SnapshotIndex into structs.AckRequest would allow the broker to cancel the maximum number of evals safely. This gets extremely difficult to reason about because every layer of the scheduler stack (broker, RPC, worker, scheduler) seems to do its own funny little index handling, so please check my work! 😬

AFAICT Eval.ModifyIndex does not include the index at which the eval was updated after committing a plan. unack is pulled out of EvalBroker.unack which contains the original copy of the eval from when it was first inserted into the broker as ready.

This means that Eval.ModifyIndex is pessimistic in the sense that it is the minimum possible value the Worker's snapshot was at when this evaluation was processed. Luckily we have a whole smorgasbord of indexes in the worker to choose from:

  • EvalDequeueResponse.WaitIndex - the maximum Eval.ModifyIndex of all evals for this job when it was dequeued.
  • Eval.SnapshotIndex - the actual Raft index the worker's state snapshot used. Only available on the updated Eval.
  • PlanResult.AllocIndex / PlanResponse.Index - returned by the Plan.Submit RPC to the worker who submitted the plan, but never used. I think the worker only ever uses the PlanResult.RefreshIndex and that index never bubbles up to where the Worker calls Ack.

I think the following holds as an invariant:

PlanResult.AllocIndex > {Updated}Eval.SnapshotIndex >= EvalDequeueResponse.WaitIndex >= {Broker}Eval.ModifyIndex
          ^                          ^                               ^                              ^
          |                          |                               |                              |
          |    no point in processing new evals < this index         |                 this is what's being used
          |                                                          |
too new! cluster may have changed prior to this                      |
                                                                     |      
     safe choice as if it's > ModifyIndex that indicates there are pending evals that can be optimized out!

I think Eval.SnapshotIndex is precisely what we're looking for. There's no point processing a new eval for the same job if we're not going to process it at a higher index than this because we will come up with the same result as before since we haven't allowed the cluster state to change!

Since there could be updates between PR.AllocIndex and SnapshotIndex, we can't drop any evals based on AllocIndex.

Comment on lines 988 to 1036
// MarkForCancel is used to remove any evaluations older than the index from the
// blocked list and returns a list of cancelable evals so that Eval.Ack RPCs can
// write batched raft entries to cancel them. This must be called inside the
// broker's lock.
func (p *PendingEvaluations) MarkForCancel(index uint64) PendingEvaluations {

    // In pathological cases, we can have a large number of blocked evals but
    // will want to cancel most of them. Using heap.Remove requires we re-sort
    // for each eval we remove. Because we expect to have very few if any
    // blocked remaining, we'll just create a new heap.
    retain := PendingEvaluations{}
    cancelable := PendingEvaluations{}

    for _, eval := range *p {
        if eval.ModifyIndex >= index {
            heap.Push(&retain, eval)
        } else {
            heap.Push(&cancelable, eval)
        }
    }

    *p = retain
    return cancelable
}
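To connect this with the recentlyAckedIndex diff above, here is a hedged sketch of how Ack might have invoked it; this index-based variant was later replaced by the all-but-one approach, and the cancelable field is an assumption.

// Sketch only: inside Ack, shed blocked evals already covered by the acked
// eval's ModifyIndex and queue them for batched cancelation.
// namespacedID is built from the acked eval's Namespace and JobID.
if pending, ok := b.blocked[namespacedID]; ok {
    cancelable := pending.MarkForCancel(recentlyAckedIndex)
    b.blocked[namespacedID] = pending
    b.cancelable = append(b.cancelable, cancelable...) // assumed broker field
}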
Member commented:

Hold on... due to my "favorite" logic in the Eval.Dequeue RPC: https://github.com/hashicorp/nomad/blob/v1.4.1/nomad/eval_endpoint.go#L185-L211

...I think we can cancel every eval except the one with the highest ModifyIndex?!

Since on Dequeue we always set WaitIndex = max(ModifyIndex) it doesn't matter which eval we dequeue next for a job: we'll always process it against a new enough snapshot to capture every cluster change that triggered evals!

...it almost makes me think we could cancel these evals in the FSM, but we'd still need to cancel them in EvalBroker since the broker isn't a "live" view of the evals in the statestore... might not be impossible, but when in doubt I like to keep the FSM simple.

tgross (Member, Author) commented Nov 4, 2022:

I think you're right on this, but I'm going to work up a diagram of what indexes we have and what eval states we have in the broker just to double-check.

My next instinct here is to mark any blocked evals as cancelable at the time we enqueue rather than at the time we ack. Have EvalBroker.blocked contain a spot for at most one eval, and have EvalBroker.Enqueue bump the blocked eval into the cancelable immediately if it's got a newer one. (As you've said, I'd love to be able to have the FSM detect that and delete the eval right away as well, but the followers wouldn't see the eval broker state. This would be the next best thing.) Let me work up that diagram first and we can come back to this discussion. Edit: from my diagram below I think this doesn't work because we don't drop blocked evals from the broker on Nack, so let's put a pin in this

tgross (Member, Author) commented:

I think one of the things that's misleading about the existing eval broker code is the PendingEvaluations sorting by CreateIndex. The PendingEvaluations alias over a slice of evals gets used by both blocked and ready but only ready actually cares about the index b/c it contains evals from multiple jobs. The blocked slice is full of evals from the same job, so the order we pick them off doesn't matter.

tgross (Member, Author) commented Nov 4, 2022:

Well merging a state diagram together with the sequence diagram for the RPCs turned out to be a bad idea. But after a couple hours re-re-reading this code I've got a state diagram here that makes most of the problem clear, which I can annotate.

stateDiagram-v2

    isPending: check if jobEvals\n(pending JobID) is empty
    state if_state <<choice>>

    [*] --> isPending: Enqueue
    isPending --> if_state
    if_state --> ready: no pending JobID.\nready is sorted\nby priority and CreateIndex
    if_state --> blocked: pending JobID

    blocked --> ready: Ack

    ready --> unack: Dequeue
    unack --> [*]: Ack

    unack --> isPending: Nack

When we enqueue, we check jobEvals (better described as "pending jobs") to see if it's non-empty.

  • If it's empty, we insert the job ID into jobEvals and push the eval onto the ready heap.
  • If it's non-empty, we insert the eval into blocked and return.

When we dequeue, we:

  • scanForSchedulers:
    • peeks evals from ready for each scheduler
    • finds the eval with the highest priority (or picks between ones with the same)
    • in dequeueForSched, pops the selected eval from ready and adds it to unack
  • getWaitIndex(eval.ns, eval.jobid, eval.ModifyIndex) for the eval we got:
    • gets the highest modify index for any eval for the job, or the dequeued eval's modify index, whichever is greater. (In practice I don't think we'll ever use the dequeued eval's modify index here if that eval was ever blocked.)

When we submit the plan:

// updateEvalModifyIndex is used to update the modify index of an evaluation that has been
// through a scheduler pass. This is done as part of plan apply. It ensures that when a subsequent
// scheduler workers process a re-queued evaluation it sees any partial updates from the plan apply.

When we ack the eval:

  • remove from unack
  • delete jobID from jobEvals so the job isn't marked "pending"
  • enqueue top blocked item (which will mark us as "pending" again)

So ready sorting is the only thing in the eval broker that cares about CreateIndex, and only getWaitIndex checks the modify index. As long as we're always setting the eval with the highest modify index as ready, we're good. I think we can cancel the entire blocked list when we do so.

stateDiagram-v2

    isPending: check if jobEvals\n(pending JobID) is empty
    state if_state <<choice>>

    [*] --> isPending: Enqueue
    isPending --> if_state
    if_state --> ready: no pending JobID.\nready is sorted\nby priority and CreateIndex
    if_state --> blocked: pending JobID

    blocked --> ready: on Ack\neval w/ highest ModifyIndex
    blocked --> canceled: cancel all the rest

    ready --> unack: Dequeue
    unack --> [*]: Ack

    unack --> isPending: Nack

Alternately, if we wanted to be extra paranoid to ensure we haven't missed something we could cancel all but one eval out of blocked. That still helps us for the "large number of blocked evals" scenario but makes for bimodal operation where we won't hit this new cancel workflow except in cases where there's a backlog, which I don't love. Edit: this ends up being the safe route, so I've done that.

tgross (Member, Author) commented:

I'll implement this and make sure we've got an integration test that exercises it.

tgross (Member, Author) commented:

Started on implementation and immediately remembered that evals have their own priority, so we can't just naively use the highest modify index either: we could get a higher-priority eval (e.g. 100) and then a lower-priority one (e.g. 50). If we just use the highest modify index, the lower-priority eval would be the one that gets set to ready and we'd discard the high-priority one. Then that low-priority eval could get starved by middling-priority evals for other jobs (e.g. 70). We'll need to account for both fields.

tgross (Member, Author) commented:

I've solved this by having a separate PendingEvaluations and BlockedEvaluations with different sort criteria.
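A hedged sketch of what the resulting ack-time shedding might look like with the split types; the helper name, field names, and use of an internal enqueue helper are assumptions, not the merged code.

// Sketch only: on Ack, promote the best blocked eval (highest priority, then
// newest) to ready and move every other blocked eval for the job into the
// cancelable set for batched raft cancelation.
func (b *EvalBroker) shedBlockedAfterAck(id structs.NamespacedID) {
    blocked := b.blocked[id]
    if len(blocked) == 0 {
        return
    }
    keep := heap.Pop(&blocked).(*structs.Evaluation)

    // Everything left is redundant once the promoted eval is processed.
    b.cancelable = append(b.cancelable, blocked...)
    delete(b.blocked, id)

    b.enqueueLocked(keep, keep.Type) // assumed internal enqueue helper
}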

@tgross tgross marked this pull request as draft November 7, 2022 16:23
When an evaluation is acknowledged by a scheduler, the resulting plan is
guaranteed to cover up to the `waitIndex` set by the worker based on the most
recent evaluation for that job in the state store. At that point, we no longer
need to retain blocked evaluations in the broker that are older than that index.

Move all but the highest priority / highest `ModifyIndex` blocked eval into a
canceled set. When the `Eval.Ack` RPC returns from the eval broker it will
retrieve a batch of cancelable evals to write to raft. This paces the
cancelations by how frequently the schedulers are acknowledging evals, which
should reduce the risk of cancelations overwhelming raft relative to
scheduler progress. In order to avoid straggling batches when the cluster is
quiet, we also include a periodic sweep through the cancelable list.
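A hedged sketch of that periodic sweep: the signature, interval, and logging below are assumptions (the test notes further down mention a roughly 5-second cadence), and the helper is the one sketched near the top of this thread.

// Sketch only: a leader goroutine that periodically drains the cancelable set
// even when no acks are arriving to drive the work.
func (s *Server) reapCancelableEvaluations(stopCh <-chan struct{}) {
    ticker := time.NewTicker(5 * time.Second) // assumed interval
    defer ticker.Stop()
    for {
        select {
        case <-stopCh:
            return
        case <-ticker.C:
            // Drain a batch of cancelable evals and commit them to raft as
            // canceled, so a quiet cluster still makes progress.
            if err := cancelCancelableEvals(s); err != nil {
                s.logger.Warn("eval cancelation failed", "error", err)
            }
        }
    }
}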
tgross (Member, Author) commented Nov 9, 2022:

I've run a middling-scale performance test of this PR and the results look pretty good. I stood up a cluster of 3 t3.large instances on AWS (8GiB RAM).

  • Disable the schedulers on all 3 instances by setting server.num_schedulers = 0 and reloading.
  • Deploy 10 system jobs.
  • Deploy 5000 simulated client nodes via nomad-nodesim.
  • Wait until all 5000 nodes are registered.

At this point, there are 49730 pending evaluations in the state store and in the blocked queue on the broker. After commenting-out server.num_schedulers = 0 and reloading, the schedulers are restarted and begin processing (19:30).

After 15s (19:30):

  • The blocked queue is empty.
  • Because some evals have been acked, we've already updated 16159 evals to canceled in raft, with 33694 evals in the cancelable queue on the broker.
  • At this point, the eval broker has no more work in-flight (in the ready queue).
  • No more acks are happening to drive cancelations, so we've fallen back to moving batches of evals out of the cancelable list and writing them to raft as canceled every 5 sec.

After ~2m (19:32), there are still 16950 evals in the pending state (still on the cancelable queue).

At this point, I stopped the 5000 client nodes. Within 10s, all remaining evals on the cancelable queue have been written to raft as canceled.

As each of the 5000 nodes miss heartbeats they spawn 10 evaluations each. We can see from the broker metrics that at no point do we have any backlog on the blocked list, because we're immediately moving batches off to cancelable and clearing them to canceled on the next ack. So long as you've got new evals coming in, the cancelable list is getting quickly cleared out.

[Screenshot: eval broker metrics during the test, 2022-11-09]

schmichael (Member) commented:

@tgross That test is phenomenal! Any chance you have the terraform or whatever laying around to run it again against 1.4.2? I'd love to compare the graphs.

tgross (Member, Author) commented Nov 15, 2022:

I re-ran the test with 1.4.2 and it takes about 25m to reach the state we were able to reach in 5m previously.

Here's the chart for 1.4.2. A few notes:

  • We don't have the new Eval.Count here and I didn't notice until a little while into the test, so I had to dump nomad eval list and post-process to fill in that data. So that's got some trendlines over some of the data points, but the broker_blocked line shows the general progress.
  • If you look at the source data the number of client nodes isn't steadily 5001 -- doing this work loads the servers so much that we miss some heartbeats!
  • I stopped the nodes right around the same point as the original tests in terms of amount of outstanding work, but of course that was quite a bit later in terms of elapsed time.

[Screenshot: 1.4.2 baseline chart, 2022-11-15]

And here's the same chart from above, with the time normalized to an elapsed time so we can compare apples-to-apples with the 1.4.2 chart. Conclusion: we're handling the same work in 20% of the time.

[Screenshot: this PR's chart with time normalized for comparison, 2022-11-15]

schmichael (Member) left a comment:

No blockers or even logic changes! Merge away and fantastic work. This is the most significant eval broker improvement in years. 🎉

Comment on lines +122 to +128
// PendingEvaluations is a list of ready evaluations across multiple jobs. We
// implement the container/heap interface so that this is a priority queue.
type PendingEvaluations []*structs.Evaluation

// BlockedEvaluations is a list of blocked evaluations for a given job. We
// implement the container/heap interface so that this is a priority queue.
type BlockedEvaluations []*structs.Evaluation
Member commented:

I have a slight preference for renaming [Bb]locked -> [Pp]ending everywhere and using ReadyEvaluations for the ready queue, but your naming scheme makes sense and changes less code which is nice for such a critical PR.

If we do want to rename for future readers, let's do it in a followup.

tgross (Member, Author) commented:

Agreed.

tgross (Member, Author) commented:

Plus I still need to fix the metric name for #6480 so we can do that at the same time.

@@ -230,7 +230,8 @@ func (e *Eval) Ack(args *structs.EvalAckRequest,
     if err := e.srv.evalBroker.Ack(args.EvalID, args.Token); err != nil {
         return err
     }
-    return nil
+
+    return cancelCancelableEvals(e.srv)
Member commented:

Hm, so this introduces a raft operation in Ack(...) which didn't have one before.

Since cancelable evals are already moved out of the blocked queue, we don't need to commit them to make the optimization work. However, the longer we wait between updating the Ack'd eval and committing canceled evals, the longer the statestore will confusingly have ready evals that aren't being processed because they're sitting in the cancelable queue... 🤔

Since your benchmarking demonstrated this approach is already a massive improvement, let's leave this be.

However, I think we could make this async (maybe by just making it wakeup the reapCancelableEvaluations loop?) and potentially speed things up further.

If I'm right it might be worth commenting, because it will be very hard for future folks to load the context to make this determination:

Suggested change:
-    return cancelCancelableEvals(e.srv)
+    // It's not necessary to cancel evals before Ack returns, but it's done here to
+    // commit canceled evals as close to the Ack'd eval being committed as possible.
+    return cancelCancelableEvals(e.srv)

tgross (Member, Author) commented Nov 16, 2022:

However, I think we could make this async (maybe by just making it wakeup the reapCancelableEvaluations loop?) and potentially speed things up further.

Yeah, that's part of why the reapCancelableEvaluations loop runs fairly frequently. I like the idea of having it nudge the reap loop. I can do that under a separate PR.

In the meantime, I've made the comment fix.

tgross (Member, Author) commented:

Actually, the wake-up channel is probably a safer implementation -- if it's a buffered channel that'll ensure that Ack never gets blocked on raft. I've done that here. That also lets me move the cancelCancelableEvals into leader.go, as it now has only that one caller.
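For illustration, a common Go pattern for this kind of wake-up is a non-blocking send on a capacity-1 buffered channel; this is a sketch of the idea only, and as the next comments note, the merged code still applied synchronously until a follow-up change.

// Sketch only: nudge the reap loop if it isn't already scheduled to wake up,
// without ever letting Ack block on the channel.
select {
case s.reapCancelableEvalsCh <- struct{}{}:
default:
    // a wake-up is already pending; the reaper will drain everything anyway
}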

Member commented:

Did you mean to change this call to use the wakeup chan? It's still performing the raft apply synchronously on main.

I don't think it's a big deal, especially because your benchmark demonstrated it's an immense speedup already.

tgross (Member, Author) commented Nov 17, 2022:

🤦 Yes. Thanks for catching that. This line should now be

s.reapCancelableEvalsCh <- struct{}{}
return nil

tgross (Member, Author) commented:

Fixed in #15294

@tgross tgross merged commit 1c4307b into main Nov 16, 2022
@tgross tgross deleted the f-evaluation-load-shedding branch November 16, 2022 21:10
tgross added a commit that referenced this pull request Nov 17, 2022
In #14621 we added an eval cancelation reaper goroutine with a channel that
allowed us to wake it up. But we forgot to actually send on this channel from
`Eval.Ack` and are still committing the cancelations synchronously. Fix this by
sending on the buffered channel to wake up the reaper instead.
tgross added a commit that referenced this pull request Nov 17, 2022
In #14621 we added an eval cancelation reaper goroutine with a channel that
allowed us to wake it up. But we forgot to actually send on this channel from
`Eval.Ack` and are still committing the cancelations synchronously. Fix this by
sending on the buffered channel to wake up the reaper instead.
github-actions (bot) commented:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 18, 2023