Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design doc of 'Schedule DS Pod by default scheduler'. #1714

Merged
merged 1 commit into from
Feb 8, 2018

Conversation

k82cn
Copy link
Member

@k82cn k82cn commented Feb 1, 2018

xref kubernetes/enhancements#548

/cc @bgrant0607 , @bsalamat , @kubernetes/sig-apps-feature-requests

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/apps Categorizes an issue or PR as relevant to SIG Apps. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 1, 2018
@k8s-github-robot k8s-github-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 1, 2018
@k82cn
Copy link
Member Author

k82cn commented Feb 1, 2018

That should be alpha in 1.10, as TaintNodeByCondition is still alpha right now. Any comments?

@mattfarina
Copy link
Contributor

/cc @kow3ns

@@ -0,0 +1,71 @@
#Schedule DaemonSet Pods by default scheduler, not DaemonSet controller
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be a space after the #


* **Q**: Will this change/constrain update strategies, such as scheduling an updated pod to a node before the previous pod is gone?

**A**: nop, this will NOT change update strategies.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nop typo


Before the discussion of solutions/options, there’s some requirements/questions on DaemonSet:

* **Q**: DaemonSet controller can make pods even the network of node is unavailable, e.g. CNI network providers (Calico, Flannel), Will this impact bootstrapping, such as in the case that a DaemonSet is being used to provide the pod network?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even "if"


* **Q**: DaemonSet controller can make pods even the network of node is unavailable, e.g. CNI network providers (Calico, Flannel), Will this impact bootstrapping, such as in the case that a DaemonSet is being used to provide the pod network?

**A**: This will be handled in Support scheduling tolerating workloads on NotReady Nodes ([#45717](https://github.com/kubernetes/kubernetes/issues/45717)); after moving to check node’s taint, the DaemonSet pods will tolerant `NetworkUnavailable` taint.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should say 'tolerate' , not 'tolerant'

Currently, pods of DaemonSet are created and scheduled by DaemonSet controller:

1. DS controller filter nodes by nodeSelector and scheduler’s predicates
2. For each nodes, create a Pod for it by setting spec.hostName directly; it’ll skip default scheduler
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be 'node'

- dest_hostname
```

##Reference
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need space after ##

##Reference

* [DaemonsetController can't feel it when node has more resources, e.g. other Pod exits](https://github.com/kubernetes/kubernetes/issues/46935)
* [DaemonsetController can't feel it when node recovered from outofdisk state](https://github.com/kubernetes/kubernetes/issues/46935)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrong link, this is the same link as above , should be kubernetes/kubernetes#45628

@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Feb 6, 2018
Copy link
Member

@bsalamat bsalamat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @k82cn and sorry for my late review.

* Hard to debug why DaemonSet’s Pod is not created, e.g. not enough resources; it’s better to have a pending Pods with predicates’ event
* Hard to support preemption in different components, e.g. DS and default scheduler

After [discussions](https://docs.google.com/document/d/1v7hsusMaeImQrOagktQb40ePbK6Jxp1hzgFB9OZa_ew/edit#), we come to a agreement that making DaemonSet to just produce pods like every other controller, and let them be scheduled by the regular scheduler, than to be its own scheduler.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/a agreement/an agreement/

I would rephrase the sentence "we come to an agreement..." to:
SIG scheduling approved changing DaemonSet controller to create DaemonSet Pods and set their node-affinity and let them be scheduled by default scheduler. After this change, DaemonSet controller will no longer schedule DaemonSet Pods directly.


* **Q**: DaemonSet controller can make pods even the network of node is unavailable, e.g. CNI network providers (Calico, Flannel), Will this impact bootstrapping, such as in the case that a DaemonSet is being used to provide the pod network?

**A**: This will be handled in Support scheduling tolerating workloads on NotReady Nodes ([#45717](https://github.com/kubernetes/kubernetes/issues/45717)); after moving to check node’s taint, the DaemonSet pods will tolerant `NetworkUnavailable` taint.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/in Support/by supporting/

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also tolerant should be tolerate


* **Q**: DaemonSet controller can make pods even when the scheduler has not been started, which can help cluster bootstrap.

**A**: As the scheduling logic is moved to default scheduler, the kube-scheduler is required after this proposal.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/the kube-scheduler is required after this proposal/the kube-scheduler must be started during cluster start-up/


This option is to leverage NodeAffinity feature to avoid introducing scheduler’s predicates in DS controller:

1. DS controller filter nodes by nodeSelector, but did NOT check against scheduler’s predicates (e.g. PodFitHostResources)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/did not/does not/

This option is to leverage NodeAffinity feature to avoid introducing scheduler’s predicates in DS controller:

1. DS controller filter nodes by nodeSelector, but did NOT check against scheduler’s predicates (e.g. PodFitHostResources)
2. For each nodes, DS controller creates a Pod for it with following NodeAffinity
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/each nodes/each node/
s/with following/with the following/

1. DS controller filter nodes by nodeSelector, but did NOT check against scheduler’s predicates (e.g. PodFitHostResources)
2. For each nodes, DS controller creates a Pod for it with following NodeAffinity
3. When sync Pods, DS controller will map nodes and pods by this NodeAffinity to check whether Pods are started for nodes
4. In scheduler, the Daemon pods will keep pending if predicates failed, e.g. PodFitHostResources; for critical daemons, DS controller will create Pods with critical pods annotation and leverage scheduler/kubelet’s logic to handle it; similar practice to [priority/preemption](https://github.com/kubernetes/features/issues/268 )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a part of enabling priority and preemption, we must ensure that all critical DaemonSet Pods get an appropriate critical priority. If they have critical priority, scheduler will ensure that they will be scheduled even when the cluster is under resource pressure. Scheduler preempts other Pods in such condition to schedule critical Pods.

@k82cn k82cn force-pushed the k8s_548 branch 2 times, most recently from 6597e10 to f936c42 Compare February 7, 2018 14:57
@k82cn
Copy link
Member Author

k82cn commented Feb 7, 2018

Comments addressed :) ping @kow3ns , @bsalamat

1. DS controller filter nodes by nodeSelector, but does NOT check against scheduler’s predicates (e.g. PodFitHostResources)
2. For each node, DS controller creates a Pod for it with the following NodeAffinity
3. When sync Pods, DS controller will map nodes and pods by this NodeAffinity to check whether Pods are started for nodes
4. In scheduler, the Daemon pods will keep pending if predicates failed, e.g. PodFitHostResources; for critical daemons,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/the Daemon pods will keep pending if predicates failed, e.g. PodFitHostResources; for critical daemons,/Daemon Pods will stay pending if scheduling predicates fail. To avoid this,/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Da K. Ma <madaxa@cn.ibm.com>
@bsalamat
Copy link
Member

bsalamat commented Feb 8, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these OWNERS Files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2018
@k8s-ci-robot k8s-ci-robot merged commit 1cc2e01 into kubernetes:master Feb 8, 2018
@k82cn k82cn deleted the k8s_548 branch February 8, 2018 07:21
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
Design doc of 'Schedule DS Pod by default scheduler'.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants