
scheduler: cache resource comparisons for each alloc #11726

Closed

Conversation

@tgross tgross (Member) commented Dec 21, 2021

During scheduling, a given allocation will potentially be compared for
fit against a large number of nodes, particularly if the job has
`spread` blocks or the cluster is large and few nodes pass the
feasibility checks. The scheduler recalculates the `ComparableResources`
struct on the allocation for each node it is compared against, but this
data doesn't change for this evaluation.

Cache this information to reduce allocations and CPU time. This patch
was benchmarked on a large cluster with a `spread` job and reduced
heap allocations by ~16MB (which only amounts to ~2% of CPU time, but
can't hurt).

This popped up while investigating #11712 and obviously is small potatoes,
but a little extra efficiency without any tradeoff sounds good to me.
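A minimal, self-contained sketch of the memoization pattern this patch applies. The simplified types are stand-ins for the real structs in `nomad/structs`; only the unexported `comparableResources` field name is taken from the diff below, everything else is assumed for illustration.

```go
package main

import "fmt"

// ComparableResources stands in for the struct the scheduler compares
// against each node's available capacity.
type ComparableResources struct {
	CPU      int64
	MemoryMB int64
}

// Allocation stands in for structs.Allocation; the unexported field caches
// the derived comparison struct so it is computed at most once.
type Allocation struct {
	CPU      int64
	MemoryMB int64

	comparableResources *ComparableResources // lazily populated cache
}

// ComparableResources derives the comparison struct. In Nomad this walks the
// allocation's task resources; here it just copies two fields.
func (a *Allocation) ComparableResources() *ComparableResources {
	return &ComparableResources{CPU: a.CPU, MemoryMB: a.MemoryMB}
}

// allocsFit mirrors the shape of the patched AllocsFit loop: reuse the
// cached value when present instead of recomputing it for every node.
func allocsFit(allocs []*Allocation) {
	for _, alloc := range allocs {
		if alloc.comparableResources == nil {
			alloc.comparableResources = alloc.ComparableResources()
		}
		cr := alloc.comparableResources
		fmt.Printf("checking fit for cpu=%d mem=%d\n", cr.CPU, cr.MemoryMB)
	}
}

func main() {
	allocsFit([]*Allocation{{CPU: 500, MemoryMB: 256}})
}
```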

@@ -158,7 +158,11 @@ func AllocsFit(node *Node, allocs []*Allocation, netIdx *NetworkIndex, checkDevi
 			continue
 		}
 
-		cr := alloc.ComparableResources()
+		if alloc.comparableResources == nil {
+			alloc.comparableResources = alloc.ComparableResources()
@schmichael schmichael (Member) commented

I think this is technically a race condition because 2 or more worker goroutines in the same process could be calculating the resources for the same allocs on the same node concurrently. It is the "safest" race I can imagine because (a) each goroutine will write the same values, and (b) cross cpu-core visibility of these writes isn't even necessary for the optimization to be beneficial.
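A toy illustration of that race, using stand-in types rather than Nomad code: two goroutines lazily populate the same unexported cache field without synchronization, so `go run -race` flags the concurrent read/write even though both writers store equivalent values.

```go
package main

import "sync"

// alloc stands in for structs.Allocation with a hypothetical cache field.
type alloc struct {
	cached *int
}

// comparableResources lazily fills the cache: the unguarded read and write
// race when called from multiple goroutines at once.
func (a *alloc) comparableResources() *int {
	if a.cached == nil {
		v := 42
		a.cached = &v
	}
	return a.cached
}

func main() {
	a := &alloc{}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // simulate two scheduler workers sharing one alloc
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = a.comparableResources()
		}()
	}
	wg.Wait()
}
```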

However, I think it also breaks an invariant: objects in memdb are never mutated and only overwritten by the FSM. Again, I think this is the "safest" form of breaking that invariant. I can't think of a problem it could cause with existing code.

So I think we should consider another approach. It'd be nice to reliably use the race detector someday. 😬 (An even greater dream is some sort of immutability checker to assert the memdb invariant, but unless Go gains first class immutability support I don't see that happening.)

...so now I feel obligated to try to come up with an alternative, and I'm regretting everything. 😅 So many architectural aspects of Nomad work against us here:

  1. No universal Alloc constructor where we could memoize
  2. No thread locals or context bags to memoize into
  3. The allocs in this func come from 3 distinct sources! Worker->Memdb, Worker, and Plan Applier. This makes it harder to make this the caller's problem.
  4. Throwing a Mutex into ComparableResources() and memoizing in it doesn't fix the no mutating memdb objects invariant and sets a dangerous precedent that might lead people to break that invariant more in future.
  5. We presumably calculate this on the fly to save memory and bytes on the wire, so promoting it to a "real" field created by Workers and persisted by the FSM may cost more performance than memoization saves?

A map[alloc.ID]*ComparableResources on servers is one way to fix this by storing the state elsewhere. Sadly, either that state needs to be guarded by a mutex (due to access by multiple Workers) or it can't be shared. Not to mention the code would be far more complicated than this 4-liner. An LRU would let us cap the memory used, but I have no idea how to choose a useful size for it.
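A rough sketch of that guarded-map alternative, assuming the cache lives outside memdb and is shared by the Workers on a server; every name here is hypothetical, and an LRU with eviction would replace the plain map if memory use needed a cap.

```go
package main

import (
	"fmt"
	"sync"
)

type ComparableResources struct {
	CPU      int64
	MemoryMB int64
}

// comparableCache holds derived comparison structs keyed by alloc ID,
// outside memdb, so no memdb object is ever mutated.
type comparableCache struct {
	mu    sync.Mutex
	items map[string]*ComparableResources
}

func newComparableCache() *comparableCache {
	return &comparableCache{items: make(map[string]*ComparableResources)}
}

// getOrCompute returns the cached value for allocID, computing and storing it
// on a miss. In practice compute would call alloc.ComparableResources().
func (c *comparableCache) getOrCompute(allocID string, compute func() *ComparableResources) *ComparableResources {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cr, ok := c.items[allocID]; ok {
		return cr
	}
	cr := compute()
	c.items[allocID] = cr
	return cr
}

func main() {
	cache := newComparableCache()
	cr := cache.getOrCompute("alloc-123", func() *ComparableResources {
		return &ComparableResources{CPU: 500, MemoryMB: 256}
	})
	fmt.Printf("cpu=%d mem=%d\n", cr.CPU, cr.MemoryMB)
}
```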

Having the Worker generate alloc.comparableResources only for the allocs it creates is another solution, but I'm afraid leaking the calculated state to memdb is actually a really nice feature for real clusters (especially using spread{} where the same nodes with the same allocs might be iterated over multiple times).

If we intentionally persisted the results we'd obviously always save on CPU and memory allocations/gc, but I'm not sure how to judge that CPU vs disk/network/memory tradeoff.

Here's hoping I'm missing something and either this approach is ok or a slight variation.

@tgross tgross (Member, Author) commented Dec 22, 2021

Oof... I was thinking that we don't run into concurrency or memdb persistence issues because in the scheduler the Allocation object hasn't actually been persisted to memdb yet and is entirely owned by that scheduler worker... but I'd totally neglected to remember that we call AllocsFit in the plan applier as well. So this is definitely not safe as-is.

I'm hesitant to trade off a tiny benefit in scheduler CPU for disk/network because we know disk IOPS in particular has been a source of trouble for end users. But in any event my assumption was that these values were only unchanging within a single scheduler stack iteration, and persisting them would break that assumption.

I was happy to try to get this in for an hour or two's worth of work for a tiny benefit, but I suspect we'd be better served using any larger amount of time to make the scheduler more observable so we can target bigger architectural issues there. So I think I'm going to close this out if you don't object?

@schmichael schmichael (Member) commented

> I was happy to try to get this in for an hour or two's worth of work for a tiny benefit, but I suspect we'd be better served using any larger amount of time to make the scheduler more observable so we can target bigger architectural issues there. So I think I'm going to close this out if you don't object?

Yeah. I suspect we'll be referencing this PR for years to come as an example of how tricky these things are, so your efforts are not wasted!

@schmichael schmichael closed this Dec 22, 2021
@github-actions github-actions bot commented Nov 6, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2022