
scheduler: cache resource comparisons for each alloc #11726

Closed
wants to merge 1 commit into from
6 changes: 5 additions & 1 deletion nomad/structs/funcs.go
@@ -158,7 +158,11 @@ func AllocsFit(node *Node, allocs []*Allocation, netIdx *NetworkIndex, checkDevi
continue
}

cr := alloc.ComparableResources()
if alloc.comparableResources == nil {
alloc.comparableResources = alloc.ComparableResources()
Member

I think this is technically a race condition because 2 or more worker goroutines in the same process could be calculating the resources for the same allocs on the same node concurrently. It is the "safest" race I can imagine because (a) each goroutine will write the same values, and (b) cross cpu-core visibility of these writes isn't even necessary for the optimization to be beneficial.
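
For illustration, a minimal standalone sketch of that check-then-write pattern, using stub types rather than the real nomad/structs definitions; running it with `go run -race` would typically flag the unsynchronized access:

```go
// Standalone sketch with stub types; not the real nomad/structs definitions.
// Two "scheduler worker" goroutines share one *Allocation, as they would when
// both read the same object out of memdb, and both hit the lazy-init path.
package main

import "sync"

type ComparableResources struct{ CPU, MemoryMB int }

type Allocation struct {
	comparableResources *ComparableResources // written without synchronization
}

func (a *Allocation) cachedComparableResources() *ComparableResources {
	if a.comparableResources == nil { // unsynchronized read
		a.comparableResources = &ComparableResources{CPU: 500, MemoryMB: 256} // unsynchronized write
	}
	return a.comparableResources
}

func main() {
	shared := &Allocation{} // same object visible to both goroutines
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = shared.cachedComparableResources()
		}()
	}
	wg.Wait()
}
```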

However, I think it also breaks an invariant: objects in memdb are never mutated and only overwritten by the FSM. Again, I think this is the "safest" form of breaking that invariant. I can't think of a problem it could cause with existing code.

So I think we should consider another approach. It'd be nice to reliably use the race detector someday. 😬 (An even greater dream is some sort of immutability checker to assert the memdb invariant, but unless Go gains first class immutability support I don't see that happening.)

...so now I feel obligated to try to come up with an alternative, and I'm regretting everything. 😅 So many architectural aspects of Nomad work against us here:

  1. No universal Alloc constructor where we could memoize.
  2. No thread locals or context bags to memoize into.
  3. allocs in this func come from 3 distinct sources! Worker->Memdb, Worker, and Plan Applier. That makes it harder to make this the caller's problem.
  4. Throwing a Mutex into ComparableResources() and memoizing there (sketched just below this list) doesn't fix the "no mutating memdb objects" invariant, and it sets a dangerous precedent that might lead people to break that invariant more in the future.
  5. We presumably calculate this on the fly to save memory and bytes on the wire, so promoting it to a "real" field created by Workers and persisted by the FSM may cost more performance than memoization saves?
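
To make item 4's tradeoff concrete, a minimal standalone sketch of that mutex-memoization variant (stub types, hypothetical field names): the lock removes the data race, but the receiver can still be an object owned by memdb, so the invariant is still broken.

```go
// Standalone sketch with stub types; not the real nomad/structs definitions.
package main

import "sync"

type ComparableResources struct{ CPU, MemoryMB int }

type Allocation struct {
	comparableMu        sync.Mutex           // hypothetical guard for the cache below
	comparableResources *ComparableResources // lazily filled: still mutates the memdb copy
}

// ComparableResources memoizes its result behind a mutex. This is race-free,
// but it writes to whatever object the caller holds, including ones stored in
// memdb, so the "never mutate memdb objects" invariant is still violated.
func (a *Allocation) ComparableResources() *ComparableResources {
	a.comparableMu.Lock()
	defer a.comparableMu.Unlock()
	if a.comparableResources == nil {
		a.comparableResources = &ComparableResources{CPU: 500, MemoryMB: 256} // stand-in for the real calculation
	}
	return a.comparableResources
}

func main() {
	_ = (&Allocation{}).ComparableResources()
}
```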

map[alloc.ID]*ComparableResources on servers is one way to fix this by storing the state elsewhere. Sadly either the state needs to be guarded by a mutex (due to access by multiple Workers) or it can't be shared. Not to mention the code would be far more complicated than this 4 liner. An LRU would let us use only a fixed amount of memory, but I have no idea how to choose a useful number for it.
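
A minimal standalone sketch of that map-based approach, with hypothetical names and stub types: the cache lives outside the Allocation objects so nothing in memdb is mutated, at the cost of a mutex shared across Workers (eviction/LRU sizing is deliberately left out).

```go
// Standalone sketch with stub types; not the real nomad/structs definitions.
package main

import (
	"fmt"
	"sync"
)

type ComparableResources struct{ CPU, MemoryMB int }

// comparableCache keeps computed results outside the Allocation objects, so
// nothing stored in memdb is ever mutated. It is guarded by a mutex because
// several scheduler Workers could consult it concurrently.
type comparableCache struct {
	mu   sync.Mutex
	byID map[string]*ComparableResources
}

func newComparableCache() *comparableCache {
	return &comparableCache{byID: make(map[string]*ComparableResources)}
}

// get returns the cached entry for allocID, computing and storing it on a
// miss; compute stands in for alloc.ComparableResources().
func (c *comparableCache) get(allocID string, compute func() *ComparableResources) *ComparableResources {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cr, ok := c.byID[allocID]; ok {
		return cr
	}
	cr := compute()
	c.byID[allocID] = cr
	return cr
}

func main() {
	cache := newComparableCache()
	cr := cache.get("alloc-1", func() *ComparableResources {
		return &ComparableResources{CPU: 500, MemoryMB: 256}
	})
	fmt.Println(cr.CPU, cr.MemoryMB)
}
```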

Having the Worker generate alloc.comparableResources only for the allocs it creates is another solution, but I'm afraid leaking the calculated state to memdb is actually a really nice feature for real clusters (especially using spread{} where the same nodes with the same allocs might be iterated over multiple times).

If we intentionally persisted the results we'd obviously always save on CPU and memory allocations/gc, but I'm not sure how to judge that CPU vs disk/network/memory tradeoff.

Here's hoping I'm missing something and either this approach is ok or a slight variation.

Member Author
@tgross Dec 22, 2021

Oof... I was thinking that we don't run into concurrency or memdb persistence issues because in the scheduler the Allocation object hasn't actually been persisted to memdb yet and is entirely owned by that scheduler worker... but I'd totally neglected to remember that we call AllocsFit in the plan applier as well. So this is definitely not safe as-is.

I'm hesitant to trade off a tiny benefit in scheduler CPU for disk/network, because we know disk IOPS in particular has been a source of trouble for end users. But in any event, my assumption was that these values are only guaranteed to be unchanging within a single scheduler stack iteration, so persisting them would break that assumption.

I was happy to try to get this in for an hour or two's worth of work for a tiny benefit, but I suspect we'd be better served using any larger amount of time to make the scheduler more observable so we can target bigger architectural issues there. So I think I'm going to close this out if you don't object?

Member

> I was happy to try to get this in for an hour or two's worth of work for a tiny benefit, but I suspect we'd be better served using any larger amount of time to make the scheduler more observable so we can target bigger architectural issues there. So I think I'm going to close this out if you don't object?

Yeah. I suspect we'll be referencing this PR for years to come as an example of how tricky these things are, so your efforts are not wasted!

}

cr := alloc.comparableResources
used.Add(cr)

// Adding the comparable resource unions reserved core sets, need to check if reserved cores overlap
6 changes: 6 additions & 0 deletions nomad/structs/structs.go
@@ -9341,6 +9341,12 @@ type Allocation struct {

// ModifyTime is the time the allocation was last updated.
ModifyTime int64

// comparableResources is a cache of calculated resources to
// reduce work done by the scheduler during node fit checking in
// AllocsFit. Note: this should never be updated to be visible
// outside this package, so that it's not serialized to disk.
comparableResources *ComparableResources
}

// ConsulNamespace returns the Consul namespace of the task group associated
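
For context on the "not serialized to disk" note in the added field's comment: reflection-based encoders skip unexported struct fields, so a cache stored in a lowercase field stays local to the process. A minimal illustration using encoding/json as a stand-in for whatever codec actually persists these structs:

```go
// Standalone illustration; the field names mirror the diff but the types are stubs.
package main

import (
	"encoding/json"
	"fmt"
)

type example struct {
	ModifyTime          int64  // exported: included by the encoder
	comparableResources string // unexported: silently skipped by the encoder
}

func main() {
	b, _ := json.Marshal(example{ModifyTime: 42, comparableResources: "cached"})
	fmt.Println(string(b)) // prints {"ModifyTime":42}
}
```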