
Support inter-Pod affinity to one or more Pods #68701

Closed
bsalamat opened this issue Sep 15, 2018 · 59 comments
Assignees
Labels
- kind/feature: Categorizes issue or PR as related to a new feature.
- lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
- needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
- sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@bsalamat
Member

In the current implementation of inter-Pod affinity, the scheduler looks for a single existing pod that can satisfy all the inter-pod affinity terms of an incoming pod.
With the recent changes to the implementation of inter-Pod affinity, we can now support multiple pods satisfying inter-pod affinity. One of the main reasons we didn't pursue the idea before was that the inter-pod affinity feature was very slow (3 orders of magnitude slower than other scheduler predicates), and we didn't want to add more complexity to an already slow predicate. However, we can now think about adding the feature.
With this feature, a pod can have multiple affinity terms satisfied by a group of pods, as opposed to only a single pod. For example:

- Assume that the cluster has two nodes:
    - nodeA, located in zone1/region1
    - nodeB, located in zone2/region1

- There are two existing pods on these nodes:
    - Pod1:
        - nodeName: "nodeA"
        - label "foo": ""
    - Pod2:
        - nodeName: "nodeB"
        - label "bar": ""

- Pod3 comes in with inter-pod affinity:
    - affinity terms:
        - {label "foo" exists, topologyKey: "region"}
        - {label "bar" exists, topologyKey: "zone"}

With our current (K8s 1.12) implementation, Pod3 is not schedulable, because there is no single pod that satisfies all of its affinity terms. However, if we support multiple pods satisfying the affinity terms, Pod3 can be scheduled on nodeB. Pod1 satisfies the first term of its affinity in region1 and Pod2 satisfies its second term in zone2. So, any node in zone2/region1 will be feasible for Pod3.
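
For reference, here is roughly what Pod3's manifest for the example above might look like. This is only a sketch: the topology keys, label keys, and container image are illustrative and not part of the original example.

# Pod3 from the example (sketch; label keys, topology keys, and image are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: pod3
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # First term: an existing pod with label "foo" in the same region
      - labelSelector:
          matchExpressions:
          - key: foo
            operator: Exists
        topologyKey: topology.kubernetes.io/region
      # Second term: an existing pod with label "bar" in the same zone
      - labelSelector:
          matchExpressions:
          - key: bar
            operator: Exists
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9

Under the proposal, the first term could be satisfied by Pod1 and the second by Pod2, even though no single existing pod matches both selectors.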

Given the current implementation of inter-pod affinity and the use of "Topology Pair Maps", I believe implementing this feature requires only small changes and won't have a noticeable performance impact.

/kind feature
/sig scheduling

cc/ @Huang-Wei @ahmad-diaa

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 15, 2018
@bsalamat
Member Author

bsalamat commented Sep 15, 2018

It is worth noting that this will apply only to inter-pod affinity, not inter-pod anti-affinity. Inter-pod anti-affinity is considered "violated" if there is a pod that matches ANY of the anti-affinity terms, so matching against a group of Pods does not make sense for anti-affinity.

@Huang-Wei
Member

@bsalamat I have a full picture of the affinity/anti-affinity code after delivering #68173. I'm more than happy to help with this :)

/assign

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2018
@misterikkit

This question was probably answered elsewhere, but could the change in behavior disrupt existing clusters?

E.g., a canary workload is launched with pod-affinity for labels {app="foo", env="canary"}. That workload could end up in a topology containing {app="foo", env="prod"} & {app="bar", env="canary"} after this change.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 20, 2019
@Huang-Wei
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 20, 2019
@Huang-Wei
Member

Re @misterikkit:

a canary workload is launched with pod-affinity for labels {app="foo", env="canary"}. That workload could end up in a topology containing {app="foo", env="prod"} & {app="bar", env="canary"} after this change.

If the goal is to gather pods with labels {app="foo", env="canary"}, the workload should define them within the same affinityTerm, as different expressions.

And if app="foo" and env="canary" are defined in two different affinityTerms, then yes, after the change a topology containing {app="foo", env="prod"} & {app="bar", env="canary"} can be a fit.
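
As an illustration of "same affinityTerm, different expressions" (a sketch, not taken from the thread; the topology key is arbitrary):

# One affinityTerm that requires a single target pod to carry both labels
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - foo
        - key: env
          operator: In
          values:
          - canary
      topologyKey: topology.kubernetes.io/zone

With this form, only a pod labeled both app=foo and env=canary can satisfy the term, so the canary workload cannot be pulled toward an {app="foo", env="prod"} pod.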

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 20, 2019
@Huang-Wei
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2019
@Huang-Wei
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 17, 2019
@Huang-Wei
Member

I'd suggest putting it on our plate only when there are a couple of viable, practical use cases.

@sanposhiho
Member

Quoting from the request:

Suppose we get an incoming pod with 2 pod affinity requirements, one at region level and the other at zone level. E.g., the first is for affinity to some light-weight RPC service it depends on, while the second is for affinity to some heavy-weight storage access.

It makes sense. To be more general, when people want to make sure dependent services exist in the same domain and the pod has more than one kind of dependent service (like the above example), they cannot use today's PodAffinity to achieve that.
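
Concretely, the quoted use case corresponds to a spec along these lines (a sketch with made-up label values; the topology keys are illustrative). The spec is already expressible today, but the scheduler requires a single existing pod to satisfy both terms, which is what this issue asks to relax:

# Two affinity terms at different topology levels (illustrative names)
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    # Affinity to the lightweight RPC service at region scope
    - labelSelector:
        matchLabels:
          app: light-rpc-service
      topologyKey: topology.kubernetes.io/region
    # Affinity to the heavyweight storage service at zone scope
    - labelSelector:
        matchLabels:
          app: heavy-storage
      topologyKey: topology.kubernetes.io/zone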

@alculquicondor
Member

I know it makes sense, but there doesn't seem to be enough pressure to get this done. This issue has been open since 2018, and there aren't more people asking for it. Maybe you can collect some feedback by sending an email to the mailing list?

@sanposhiho
Member

👍 In the sig-scheduling mailing list, or is there another suitable one?

@Huang-Wei
Member

yes, and discuss.k8s.io

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 10, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) Mar 12, 2023
@sanposhiho
Member

/remove-lifecycle rotten
/reopen

@k8s-ci-robot
Contributor

@sanposhiho: Reopened this issue.

In response to this:

/remove-lifecycle rotten
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Apr 18, 2023
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 18, 2023
@fentas

fentas commented Apr 19, 2023

Curious: shouldn't this already be possible, going by the design proposal?
My interpretation was that each entry in requiredDuringSchedulingIgnoredDuringExecution looks up a set of pods, and these pods reduce the set of nodes. I'm kind of confused why this is even an array then, as either matchExpressions or matchLabels can match multiple things. Why would I define a new entry in requiredD.. when I can add an expression in matchExpressions itself to match the same pod?
Also, it is nowhere written that you can't target multiple pods (or I over-interpreted/misread it).

But it seems that is not the case, so here is my use case:

We're running Cilium in chaining with the AKS CSI. But since mid-last year, removing taints is only possible via the Azure API, which breaks Cilium node readiness.
Right now, we're unable to run AKS in BYOCNI mode unless it leaves its preview phase with Cilium.

Following this comment from Microsoft, we took the advice and created a mutating webhook (instead of calling the API) that adds a podAffinity so that pods only get scheduled on nodes where the Cilium agent is already running.
But this now breaks other podAffinities.

@alculquicondor
Member

I can't speak for the original intent of the design, as it predates my time here, but the reality is that it's not implemented like that, and we can't change the behavior now because it would be backwards-incompatible.

So two things:

  • Docs need to be updated to reflect the current implementation. Feel free to open an issue, or I'll do so when I have a chance.
  • Is there still justification for this feature? In your case, the cilium agent is probably a DaemonSet, so it is already designed to run on a set of nodes that have a label. Then the pods can use node affinity instead of pod affinity (see the sketch below). Also, node affinity is faster to calculate, in case that matters to you.
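
For illustration, the node-affinity alternative suggested above might look like the following sketch, assuming the Cilium-capable nodes carry a label such as example.com/cilium-ready: "true" (the label name is hypothetical and not from this thread):

# Node-affinity alternative (sketch; the node label is hypothetical)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: example.com/cilium-ready
          operator: In
          values:
          - "true"

This only helps if something actually applies such a label once the Cilium agent is ready on a node, e.g. the agent itself or an operator.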

@fentas

fentas commented Apr 20, 2023

Thanks for the feedback.
Node affinity won't work in this use case.

The idea is/was to make sure that the Cilium agent (a DaemonSet) is scheduled first, before any other pod is scheduled.
Normally this happens via taints, with the agent removing the taint, but Microsoft broke, or rather disabled, this functionality.

Our mutating webhook simply merges this podAffinity into every pod created, but this now breaks any other podAffinity already set on the pod itself.

# resulting in this
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            example.com/name: myservice
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values:
            - cilium
        namespaces:
        - kube-system
        topologyKey: kubernetes.io/hostname

@kerthcet
Member

Sorry, a bit of confusion here: doesn't this work as expected? Unless the cilium pod launches, won't this newly created pod be stuck in Pending?

@alculquicondor
Member

@kerthcet the problem is that kube-scheduler looks for a single pod that satisfies all of the affinity terms.

Well, even if we add the feature you are requesting, it would only be available in 1.28 at the earliest (potentially 1.29, as the feature has to be disabled by default first). So I would suggest opening an issue with Azure support.

Other than that, you are welcome to work on this feature.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) Mar 19, 2024
10 participants