Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation for Scheduler performance tuning #10048

Merged
merged 2 commits into from
Sep 12, 2018
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
113 changes: 113 additions & 0 deletions content/en/docs/concepts/configuration/scheduler-perf-tuning.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
---
reviewers:
- bsalamat
title: Scheduler Performance Tuning
content_template: templates/concept
weight: 70
---

{{% capture overview %}}

{{< feature-state for_k8s_version="1.12" >}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we show its version state here ? e.g. :
{{< feature-state for_k8s_version="v1.12" state="alpha" >}}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not include an API change. It is a performance improvement shipped with 1.12. So, the backward compatibility and its associated guarantees that exist with alpha, beta, GA features do not apply to this one.


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or probably, name this paragraph as background.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is in the overview section. In all of our documents the first section serves as the overview/background.

Kube-scheduler is the Kubernetes default scheduler. It is responsible for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this statement?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is in the overview section. It would be helpful for someone unfamiliar with the topic to know what they are reading about.

placement of Pods on proper Nodes of a cluster. Nodes of a cluster that meet the
scheduling requirements of a Pod are called feasible Nodes for the Pod. The
scheduler finds feasible Nodes for a Pod and then runs a set of functions to
score the feasible Nodes and picks a Node with the highest score among the
feasible ones to run the Pod. It then notifies the API server about this
decision in a process called "Binding".

{{% /capture %}}

{{% capture body %}}

## Percentage of Nodes to Score

Before Kubernetes 1.12, Kube-scheduler used to check the feasibility of all the
nodes in a cluster and then scored the feasible ones. Kubernetes 1.12 has a new
feature that allows the scheduler to stop looking for more feasible nodes once
it finds a certain number of them. This improves the scheduler's performance in
large clusters. The number is specified as a percentage of the cluster size and
is controlled by a configuration option called `percentageOfNodesToScore`. The
range should be between 1 and 100. Other values are considered as 100%. The
default value of this option is 50%. You can change this value by providing a
different value in the scheduler configuration, but read further to find whether
you should change the value.

```yaml
apiVersion: componentconfig/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
provider: DefaultProvider

...

percentageOfNodesToScore: 50
```

{{< note >}} **Note**: In clusters with zero or few feasible nodes, the
scheduler still checks all the nodes, simply because there are not enough
feasible nodes to stop the scheduler's search early. {{< /note >}}

**To disable this feature**, you can set `percentageOfNodesToScore` to 100.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a feature gate for this feature?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no feature gate, but as is mentioned in the doc, it can be practically disabled by setting the percentage to 100.


### Tuning percentageOfNodesToScore

As stated above, `percentageOfNodesToScore` must be a value between 1 and 100
with the default value of 50. There is also a hardcoded minimum value of 50
nodes which is applied internally. It means that the scheduler tries to find at
least 50 nodes regardless of the value of `percentageOfNodesToScore`. This means
that changing this option to lower values in clusters with several hundred nodes
will not have much impact on the number of feasible nodes that the scheduler
tries to find. This is intentional as this option is unlikely to improve
performance noticeably in smaller clusters. In large clusters with over a 1000
nodes setting this value to lower numbers may show a noticeable performance
improvement.

An important note to consider when setting this value is that when a smaller
number of nodes in a cluster are checked for feasibility, some nodes are not
sent to be scored for a given Pod. As a result, a Node which could possibly
score a higher value for running the given Pod might not even be passed to the
scoring phase. This would result in a less than ideal placement of the Pod. For
this reason, the value should not be set to very low percentages. A general rule
of thumb is to never set the value to anything lower than 30. Lower values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lower than 30 in any cluster?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is percentage. So, general rule of thumb is not to set this value below 30% unless you know what you are doing.

should be used only when the scheduler's throughput is critical for your
application and the score of nodes is not important. In other words, you prefer
to run the Pod on any Node as long as it is feasible.

We do not recommend lowering this value from its default if your cluster has
only several hundred Nodes. It is unlikely to improve the scheduler's
performance significantly.

### How scheduler iterates over Nodes

This section is intended for those who want to understand the internal details
of this feature.

In order to give all the Nodes in a cluster a fair chance of being considered
for running Pods, the scheduler iterates over the nodes in a round robin
fashion. You can imagine that Nodes are in an array. The scheduler starts from
the start of the array and checks feasibility of the nodes until it finds enough
Nodes as specified by `percentageOfNodesToScore`. For the next Pod, the
scheduler continues from the point in the array that it stopped when checking
feasibility of the previous Pod.

If Nodes are in multiple zones, the scheduler iterates over Nodes in various
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, we can mention the side-effect of having unequal number of nodes in zones causing unequal spread.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unequal number of nodes in zones shouldn't be a problem. It of course causes more pods to be scheduled in zones with larger number of zones, but that's expected and not different than before.

zones to ensure that Nodes from different zones are considered in the
feasibility checks. As an example, consider six nodes in two zones:

```
Zone 1: Node 1, Node 2, Node 3, Node 4
Zone 2: Node 5, Node 6
```

Scheduler evaluates feasibility of the nodes in this oder:

```
Node 1, Node 5, Node 2, Node 6, Node 3, Node 4
```

After going over all the Nodes, it goes back to Node 1.

{{% /capture %}}