
scheduler: fix quadratic performance with spread blocks #11712

Merged 1 commit into main on Dec 21, 2021

Conversation

@tgross (Member) commented Dec 20, 2021

When the scheduler picks a node for each evaluation, the
`LimitIterator` provides at most 2 eligible nodes for the
`MaxScoreIterator` to choose from. This keeps scheduling fast while
producing acceptable results because the results are binpacked.
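
As a rough sketch of that pattern (simplified stand-in types, not Nomad's actual iterator interfaces), limiting the candidates before taking the max score looks something like this:

```go
package main

import "fmt"

// node is a simplified stand-in for a feasible, scored node.
type node struct {
	name  string
	score float64
}

// limitThenPick mimics the LimitIterator/MaxScoreIterator pattern:
// consider at most `limit` candidates, then return the best of those.
func limitThenPick(candidates []node, limit int) node {
	if limit > len(candidates) {
		limit = len(candidates)
	}
	best := candidates[0]
	for _, n := range candidates[1:limit] {
		if n.score > best.score {
			best = n
		}
	}
	return best
}

func main() {
	nodes := []node{{"a", 0.4}, {"b", 0.9}, {"c", 0.7}}
	// With the default limit of 2, only the first two nodes compete.
	fmt.Println(limitThenPick(nodes, 2).name) // prints "b"
}
```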

Jobs with a `spread` block (or node affinity) remove this limit in
order to produce correct spread scoring. This means that every
allocation within a job with a `spread` block is evaluated against
_all_ eligible nodes. Operators of large clusters have reported that
jobs with `spread` blocks that are eligible on a large number of nodes
can take longer than the nack timeout (60s) to evaluate. Typical
evaluations are processed in milliseconds.
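
For a rough sense of the scale involved (hypothetical numbers, not from the reports above): a job with 1,000 allocations that is eligible on 5,000 nodes needs on the order of 1,000 × 5,000 = 5,000,000 scoring passes per evaluation, versus roughly 1,000 × 2 = 2,000 under the default limit.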

In practice, it's not necessary to evaluate every eligible node for
every allocation on large clusters, because the `RandomIterator` at
the base of the scheduler stack produces enough variation in each pass
that the likelihood of an uneven spread is negligible. Note that
feasibility is checked before the limit, so this only impacts the
number of _eligible_ nodes available for scoring, not the total number
of nodes.
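
One way to convince yourself of the variation argument is a toy simulation (hypothetical cluster shape, not Nomad code): randomize the node order for each placement, pick from the front of the list, and check how evenly placements land across racks.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const (
		racks      = 10  // spread target attribute values
		nodesPer   = 100 // eligible nodes per rack
		placements = 200 // allocations to place
	)

	// Build a flat list of nodes tagged with their rack index.
	nodes := make([]int, 0, racks*nodesPer)
	for r := 0; r < racks; r++ {
		for i := 0; i < nodesPer; i++ {
			nodes = append(nodes, r)
		}
	}

	counts := make([]int, racks)
	for p := 0; p < placements; p++ {
		// The shuffle stands in for the RandomIterator's per-pass
		// variation; taking the first node stands in for picking
		// from a small limited scoring window.
		rand.Shuffle(len(nodes), func(i, j int) { nodes[i], nodes[j] = nodes[j], nodes[i] })
		counts[nodes[0]]++
	}

	// With 200 placements over 10 racks, counts cluster near 20 each.
	fmt.Println(counts)
}
```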

This changeset sets the iterator limit for "large" `spread` block and
node affinity jobs to be equal to the number of desired
allocations. This brings an example problematic job evaluation down
from ~3min to ~10s. The included tests ensure that we have acceptable
spread results across a variety of large cluster topologies.
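
In spirit, the fix amounts to something like the following (a hedged sketch with illustrative names and threshold logic; the real change lives in the scheduler stack):

```go
// spreadScoreLimit sketches the adjusted limit: for spread/affinity
// jobs on large clusters, cap the scoring window at the number of
// desired allocations instead of scoring every eligible node. The
// names and exact threshold logic here are illustrative.
func spreadScoreLimit(eligibleNodes, desiredAllocs int) int {
	const largeCluster = 100 // empirically determined, per the discussion below
	if eligibleNodes <= largeCluster {
		// Small cluster: scoring everything is cheap enough.
		return eligibleNodes
	}
	// Large cluster: the number of placements bounds how much
	// extra scoring precision actually pays off.
	return desiredAllocs
}
```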


Note for reviewers: I have some additional PRs to make that'll improve our ability to debug this kind of problem in the future (e.g. instrumenting the iterators, tooling for handling large snapshots), but I've intentionally kept them out of this PR so we have a clean bugfix for backporting. I'll push those PRs up over the next couple of days.

@schmichael (Member) left a comment:

Great work. I wonder if we should improve our documentation too. I think this paragraph could be expanded from:

> Spread criteria are treated as a soft preference by the Nomad scheduler. If no nodes match a given spread criteria, placement is still successful.

To something like:

> Spread criteria are treated as a soft preference by the Nomad scheduler. If no nodes match a given spread criteria, placement is still successful. To avoid scoring every node for every placement, allocations may not be perfectly spread. Spread works best on attributes with a similar number of nodes: identically configured racks or similarly configured datacenters.

If anyone needs absolute precision with spread, we could make the evaluation timeout configurable. Given the high concurrency of scheduling workers, operators may tolerate a small number of very slow (e.g. 5-minute) evaluations.

@tgross (Member, Author) commented Dec 21, 2021

I've picked up your suggestions and added the docs change. I've adjusted the test so that it's a bit less strict (effectively allowing for [0, 1, 2], which we need for the large rack counts) and so it produces more readable error output.

I've also added a comment to the change in stack.go noting that the limit value was empirically determined, so future developers will know there's nothing magical about 100 and can feel free to adjust it if the scheduler changes significantly.
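
For context, an assertion of that looser shape might look like this (an illustrative helper, not the actual code in spread_test.go):

```go
package scheduler_test

import "testing"

// assertEvenSpread allows per-value placement counts to deviate from
// the ideal even split by at most maxDelta, and reports readable
// errors when they don't. Illustrative only.
func assertEvenSpread(t *testing.T, counts map[string]int, total, maxDelta int) {
	t.Helper()
	ideal := total / len(counts)
	for value, got := range counts {
		if diff := got - ideal; diff < -maxDelta || diff > maxDelta {
			t.Errorf("uneven spread for %q: got %d placements, want %d±%d",
				value, got, ideal, maxDelta)
		}
	}
}
```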

tgross merged commit 2d4e5b8 into main on Dec 21, 2021
tgross deleted the quadratic-spread-scheduling branch on December 21, 2021 at 15:10
tgross added a commit that referenced this pull request Dec 21, 2021
Adds a package `scheduler/benchmarks` with some examples of profiling
and benchmarking the scheduler, along with helpers for loading
real-world data for profiling.

This tooling comes out of work done for #11712. These profiles and
test benchmarks have not been added to CI because they are mostly
examples, and the runs would add an excessive amount of time to CI
for code that rarely changes in a way that impacts performance.
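
A skeleton of what such a benchmark might look like (the inner function below is a placeholder; the real helpers in `scheduler/benchmarks` restore and drive state snapshots):

```go
package benchmarks

import "testing"

// scheduleOnce stands in for a single scheduler pass over a restored
// cluster snapshot; the real benchmarks drive the Nomad scheduler
// itself rather than this placeholder loop.
func scheduleOnce(nodes, allocs int) (scored int) {
	for a := 0; a < allocs; a++ {
		scored += nodes // placeholder for scoring `nodes` candidates
	}
	return scored
}

func BenchmarkSpreadScheduling(b *testing.B) {
	for i := 0; i < b.N; i++ {
		scheduleOnce(5000, 1000)
	}
}
```

Run with the standard tooling, e.g. `go test -bench BenchmarkSpreadScheduling`.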
tgross added a commit that referenced this pull request Dec 22, 2021

tgross added a commit that referenced this pull request Dec 22, 2021
lgfa29 pushed a commit that referenced this pull request Jan 17, 2022

lgfa29 pushed a commit that referenced this pull request Jan 17, 2022

lgfa29 pushed a commit that referenced this pull request Jan 18, 2022
@github-actions bot commented Nov 3, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Nov 3, 2022