
scheduler: cache resource comparisons for each alloc #11726

Closed

Conversation

@tgross tgross (Member) commented Dec 21, 2021

During scheduling, a given allocation will potentially be compared for
fit against a large number of nodes, particularly if the job has
`spread` blocks or the cluster is large and few nodes pass the
feasibility checks. The scheduler recalculates the `ComparableResources`
struct on the allocation for each node it is compared against, but this
data doesn't change for this evaluation.

Cache this information to reduce allocations and CPU time. This patch
was benchmarked on a large cluster with a `spread` job and reduced
heap allocations by ~16MB (which only amounts to ~2% of CPU time, but
can't hurt).

This popped up while investigating #11712 and obviously is small potatoes,
but a little extra efficiency without any tradeoff sounds good to me.
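A minimal, self-contained sketch of the memoization pattern this patch applies. The simplified types are stand-ins for the real structs in `nomad/structs`; only the unexported `comparableResources` field name is taken from the diff below, everything else is assumed for illustration.

```go
package main

import "fmt"

// ComparableResources stands in for the struct the scheduler compares
// against each node's available capacity.
type ComparableResources struct {
	CPU      int64
	MemoryMB int64
}

// Allocation stands in for structs.Allocation; the unexported field caches
// the derived comparison struct so it is computed at most once.
type Allocation struct {
	CPU      int64
	MemoryMB int64

	comparableResources *ComparableResources // lazily populated cache
}

// ComparableResources derives the comparison struct. In Nomad this walks the
// allocation's task resources; here it just copies two fields.
func (a *Allocation) ComparableResources() *ComparableResources {
	return &ComparableResources{CPU: a.CPU, MemoryMB: a.MemoryMB}
}

// allocsFit mirrors the shape of the patched AllocsFit loop: reuse the
// cached value when present instead of recomputing it for every node.
func allocsFit(allocs []*Allocation) {
	for _, alloc := range allocs {
		if alloc.comparableResources == nil {
			alloc.comparableResources = alloc.ComparableResources()
		}
		cr := alloc.comparableResources
		fmt.Printf("checking fit for cpu=%d mem=%d\n", cr.CPU, cr.MemoryMB)
	}
}

func main() {
	allocsFit([]*Allocation{{CPU: 500, MemoryMB: 256}})
}
```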

@@ -158,7 +158,11 @@ func AllocsFit(node *Node, allocs []*Allocation, netIdx *NetworkIndex, checkDevi
 			continue
 		}
 
-		cr := alloc.ComparableResources()
+		if alloc.comparableResources == nil {
+			alloc.comparableResources = alloc.ComparableResources()
@schmichael schmichael (Member) commented

I think this is technically a race condition because 2 or more worker goroutines in the same process could be calculating the resources for the same allocs on the same node concurrently. It is the "safest" race I can imagine because (a) each goroutine will write the same values, and (b) cross cpu-core visibility of these writes isn't even necessary for the optimization to be beneficial.
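A toy illustration of that race, using stand-in types rather than Nomad code: two goroutines lazily populate the same unexported cache field without synchronization, so `go run -race` flags the concurrent read/write even though both writers store equivalent values.

```go
package main

import "sync"

// alloc stands in for structs.Allocation with a hypothetical cache field.
type alloc struct {
	cached *int
}

// comparableResources lazily fills the cache: the unguarded read and write
// race when called from multiple goroutines at once.
func (a *alloc) comparableResources() *int {
	if a.cached == nil {
		v := 42
		a.cached = &v
	}
	return a.cached
}

func main() {
	a := &alloc{}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // simulate two scheduler workers sharing one alloc
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = a.comparableResources()
		}()
	}
	wg.Wait()
}
```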

However, I think it also breaks an invariant: objects in memdb are never mutated and only overwritten by the FSM. Again, I think this is the "safest" form of breaking that invariant. I can't think of a problem it could cause with existing code.

So I think we should consider another approach. It'd be nice to reliably use the race detector someday. 😬 (An even greater dream is some sort of immutability checker to assert the memdb invariant, but unless Go gains first class immutability support I don't see that happening.)

...so now I feel obligated to try to come up with an alternative, and I'm regretting everything. 😅 So many architectural aspects of Nomad work against us here:

  1. No universal Alloc constructor where we could memoize
  2. No thread locals or context bags to memoize into
  3. The allocs in this func come from 3 distinct sources! Worker->Memdb, Worker, and Plan Applier. This makes it harder to make this the caller's problem.
  4. Throwing a Mutex into ComparableResources() and memoizing in it doesn't fix the no mutating memdb objects invariant and sets a dangerous precedent that might lead people to break that invariant more in future.
  5. We presumably calculate this on the fly to save memory and bytes on the wire, so promoting it to a "real" field created by Workers and persisted by the FSM may cost more performance than memoization saves?

A map[alloc.ID]*ComparableResources on servers is one way to fix this by storing the state elsewhere. Sadly, either that state needs to be guarded by a mutex (due to access by multiple Workers) or it can't be shared. Not to mention the code would be far more complicated than this 4-liner. An LRU would let us cap the memory used, but I have no idea how to choose a useful size for it.
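A rough sketch of that guarded-map alternative, assuming the cache lives outside memdb and is shared by the Workers on a server; every name here is hypothetical, and an LRU with eviction would replace the plain map if memory use needed a cap.

```go
package main

import (
	"fmt"
	"sync"
)

type ComparableResources struct {
	CPU      int64
	MemoryMB int64
}

// comparableCache holds derived comparison structs keyed by alloc ID,
// outside memdb, so no memdb object is ever mutated.
type comparableCache struct {
	mu    sync.Mutex
	items map[string]*ComparableResources
}

func newComparableCache() *comparableCache {
	return &comparableCache{items: make(map[string]*ComparableResources)}
}

// getOrCompute returns the cached value for allocID, computing and storing it
// on a miss. In practice compute would call alloc.ComparableResources().
func (c *comparableCache) getOrCompute(allocID string, compute func() *ComparableResources) *ComparableResources {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cr, ok := c.items[allocID]; ok {
		return cr
	}
	cr := compute()
	c.items[allocID] = cr
	return cr
}

func main() {
	cache := newComparableCache()
	cr := cache.getOrCompute("alloc-123", func() *ComparableResources {
		return &ComparableResources{CPU: 500, MemoryMB: 256}
	})
	fmt.Printf("cpu=%d mem=%d\n", cr.CPU, cr.MemoryMB)
}
```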

Having the Worker generate alloc.comparableResources only for the allocs it creates is another solution, but I'm afraid leaking the calculated state to memdb is actually a really nice feature for real clusters (especially using spread{} where the same nodes with the same allocs might be iterated over multiple times).

If we intentionally persisted the results we'd obviously always save on CPU and memory allocations/gc, but I'm not sure how to judge that CPU vs disk/network/memory tradeoff.

Here's hoping I'm missing something and either this approach is ok or a slight variation.

@tgross tgross (Member, Author) commented Dec 22, 2021

Oof... I was thinking that we don't run into concurrency or memdb persistence issues because in the scheduler the Allocation object hasn't actually been persisted to memdb yet and is entirely owned by that scheduler worker... but I'd totally neglected to remember that we call AllocsFit in the plan applier as well. So this is definitely not safe as-is.

I'm hesitant to trade off a tiny benefit in scheduler CPU for disk/network because we know disk IOPS in particular has been a source of trouble for end users. But in any event my assumption was that these values were only unchanging within a single scheduler stack iteration, and persisting them would break that assumption.

I was happy to try to get this in for an hour or two's worth of work for a tiny benefit, but I suspect we'd be better served using any larger amount of time to make the scheduler more observable so we can target bigger architectural issues there. So I think I'm going to close this out if you don't object?

@schmichael schmichael (Member) commented

> I was happy to try to get this in for an hour or two's worth of work for a tiny benefit, but I suspect we'd be better served using any larger amount of time to make the scheduler more observable so we can target bigger architectural issues there. So I think I'm going to close this out if you don't object?

Yeah. I suspect we'll be referencing this PR for years to come as an example of how tricky these things are, so your efforts are not wasted!

@schmichael schmichael closed this Dec 22, 2021
@github-actions github-actions bot commented Nov 6, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2022