
scheduler: fix quadratic performance with spread blocks #11712

Merged 1 commit into main on Dec 21, 2021

Conversation

@tgross (Member) commented Dec 20, 2021

When the scheduler picks a node for each evaluation, the
`LimitIterator` provides at most 2 eligible nodes for the
`MaxScoreIterator` to choose from. This keeps scheduling fast while
producing acceptable results because the results are binpacked.
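
As a rough sketch of that pattern (simplified stand-in types, not Nomad's actual iterator interfaces), limiting the candidates before taking the max score looks something like this:

```go
package main

import "fmt"

// node is a simplified stand-in for a feasible, scored node.
type node struct {
	name  string
	score float64
}

// limitThenPick mimics the LimitIterator/MaxScoreIterator pattern:
// consider at most `limit` candidates, then return the best of those.
func limitThenPick(candidates []node, limit int) node {
	if limit > len(candidates) {
		limit = len(candidates)
	}
	best := candidates[0]
	for _, n := range candidates[1:limit] {
		if n.score > best.score {
			best = n
		}
	}
	return best
}

func main() {
	nodes := []node{{"a", 0.4}, {"b", 0.9}, {"c", 0.7}}
	// With the default limit of 2, only the first two nodes compete.
	fmt.Println(limitThenPick(nodes, 2).name) // prints "b"
}
```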

Jobs with a `spread` block (or node affinity) remove this limit in
order to produce correct spread scoring. This means that every
allocation within a job with a `spread` block is evaluated against
_all_ eligible nodes. Operators of large clusters have reported that
jobs with `spread` blocks that are eligible on a large number of nodes
can take longer than the nack timeout (60s) to evaluate. Typical
evaluations are processed in milliseconds.
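
For a rough sense of the scale involved (hypothetical numbers, not from the reports above): a job with 1,000 allocations that is eligible on 5,000 nodes needs on the order of 1,000 × 5,000 = 5,000,000 scoring passes per evaluation, versus roughly 1,000 × 2 = 2,000 under the default limit.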

In practice, it's not necessary to evaluate every eligible node for
every allocation on large clusters, because the `RandomIterator` at
the base of the scheduler stack produces enough variation in each pass
that the likelihood of an uneven spread is negligible. Note that
feasibility is checked before the limit, so this only impacts the
number of _eligible_ nodes available for scoring, not the total number
of nodes.
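
One way to convince yourself of the variation argument is a toy simulation (hypothetical cluster shape, not Nomad code): randomize the node order for each placement, pick from the front of the list, and check how evenly placements land across racks.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const (
		racks      = 10  // spread target attribute values
		nodesPer   = 100 // eligible nodes per rack
		placements = 200 // allocations to place
	)

	// Build a flat list of nodes tagged with their rack index.
	nodes := make([]int, 0, racks*nodesPer)
	for r := 0; r < racks; r++ {
		for i := 0; i < nodesPer; i++ {
			nodes = append(nodes, r)
		}
	}

	counts := make([]int, racks)
	for p := 0; p < placements; p++ {
		// The shuffle stands in for the RandomIterator's per-pass
		// variation; taking the first node stands in for picking
		// from a small limited scoring window.
		rand.Shuffle(len(nodes), func(i, j int) { nodes[i], nodes[j] = nodes[j], nodes[i] })
		counts[nodes[0]]++
	}

	// With 200 placements over 10 racks, counts cluster near 20 each.
	fmt.Println(counts)
}
```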

This changeset sets the iterator limit for "large" `spread` block and
node affinity jobs to be equal to the number of desired
allocations. This brings an example problematic job evaluation down
from ~3min to ~10s. The included tests ensure that we have acceptable
spread results across a variety of large cluster topologies.
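
In spirit, the fix amounts to something like the following (a hedged sketch with illustrative names and threshold logic; the real change lives in the scheduler stack):

```go
// spreadScoreLimit sketches the adjusted limit: for spread/affinity
// jobs on large clusters, cap the scoring window at the number of
// desired allocations instead of scoring every eligible node. The
// names and exact threshold logic here are illustrative.
func spreadScoreLimit(eligibleNodes, desiredAllocs int) int {
	const largeCluster = 100 // empirically determined, per the discussion below
	if eligibleNodes <= largeCluster {
		// Small cluster: scoring everything is cheap enough.
		return eligibleNodes
	}
	// Large cluster: the number of placements bounds how much
	// extra scoring precision actually pays off.
	return desiredAllocs
}
```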


Note for reviewers: I have some additional PRs to make that'll improve our ability to debug this kind of problem in the future (e.g. instrumenting the iterators, tooling for handling large snapshots), but I've intentionally kept them out of this PR so we have a clean bugfix for backporting. I'll push those PRs up over the next couple of days.

@schmichael (Member) left a comment:

Great work. I wonder if we should improve our documentation too. I think this paragraph could be expanded from:

> Spread criteria are treated as a soft preference by the Nomad scheduler. If no nodes match a given spread criteria, placement is still successful.

To something like:

> Spread criteria are treated as a soft preference by the Nomad scheduler. If no nodes match a given spread criteria, placement is still successful. To avoid scoring every node for every placement, allocations may not be perfectly spread. Spread works best on attributes with a similar number of nodes: identically configured racks or similarly configured datacenters.

If anyone needs absolute precision with spread, we could make the evaluation timeout configurable. Given the high concurrency of scheduling workers, operators may tolerate a small number of very slow (e.g. 5-minute) evaluations.

@tgross (Member, Author) commented Dec 21, 2021

I've picked up your suggestions and added the docs change. I've adjusted the test so that it's a bit less strict (effectively allowing for [0, 1, 2], which we need for the large rack counts) and so it produces more readable error output.

I've also added a comment to the change in stack.go noting that the limit value was empirically determined, so future developers will know there's nothing magical about 100 and can feel free to adjust it if the scheduler changes significantly.
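
For context, an assertion of that looser shape might look like this (an illustrative helper, not the actual code in spread_test.go):

```go
package scheduler_test

import "testing"

// assertEvenSpread allows per-value placement counts to deviate from
// the ideal even split by at most maxDelta, and reports readable
// errors when they don't. Illustrative only.
func assertEvenSpread(t *testing.T, counts map[string]int, total, maxDelta int) {
	t.Helper()
	ideal := total / len(counts)
	for value, got := range counts {
		if diff := got - ideal; diff < -maxDelta || diff > maxDelta {
			t.Errorf("uneven spread for %q: got %d placements, want %d±%d",
				value, got, ideal, maxDelta)
		}
	}
}
```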

tgross merged commit 2d4e5b8 into main on Dec 21, 2021
tgross deleted the quadratic-spread-scheduling branch on December 21, 2021 at 15:10
tgross added a commit that referenced this pull request Dec 21, 2021
Adds a package `scheduler/benchmarks` with some examples of profiling
and benchmarking the scheduler, along with helpers for loading
real-world data for profiling.

This tooling comes out of work done for #11712. These profiles and
test benchmarks have not been added to CI because they are mostly
examples, and the runs would add an excessive amount of time to CI
for code that rarely changes in a way that impacts performance.
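
A skeleton of what such a benchmark might look like (the inner function below is a placeholder; the real helpers in `scheduler/benchmarks` restore and drive state snapshots):

```go
package benchmarks

import "testing"

// scheduleOnce stands in for a single scheduler pass over a restored
// cluster snapshot; the real benchmarks drive the Nomad scheduler
// itself rather than this placeholder loop.
func scheduleOnce(nodes, allocs int) (scored int) {
	for a := 0; a < allocs; a++ {
		scored += nodes // placeholder for scoring `nodes` candidates
	}
	return scored
}

func BenchmarkSpreadScheduling(b *testing.B) {
	for i := 0; i < b.N; i++ {
		scheduleOnce(5000, 1000)
	}
}
```

Run with the standard tooling, e.g. `go test -bench BenchmarkSpreadScheduling`.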
tgross added a commit that referenced this pull request Dec 22, 2021

tgross added a commit that referenced this pull request Dec 22, 2021
lgfa29 pushed a commit that referenced this pull request Jan 17, 2022

lgfa29 pushed a commit that referenced this pull request Jan 17, 2022

lgfa29 pushed a commit that referenced this pull request Jan 18, 2022
@github-actions bot commented Nov 3, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Nov 3, 2022