Inter-pod affinity/anti-affinity #60

aronchick · 2016-07-24T22:03:16Z

Design Doc: https://github.com/kubernetes/kubernetes/blob/master/docs/design/podaffinity.md

e.g. "put these pods in zone us-central1-a"
predicate and priority function are done, it was alpha in 1.2, no change in 1.3
remaining work (low priority, since nobody has been asking for it) is to implement the "RequiredDuringExecution" option, which means evicting a pod if node labels change, or the pod's affinity/anti-affinity request changes, such that the pod's affinity/anti-affinity is no longer satisfied
In theory we could move it to Beta in 1.4 but I think we should leave it as alpha for two reasons: (1) get more people using it so we can get feedback, (2) it shares the same annotation (scheduler.alpha.kubernetes.io/affinity) with inter-pod affinity/anti-affinity (see below), and we definitely need to keep that one in alpha in 1.4

Progress Tracker

FEATURE_STATUS is used for feature tracking and to be updated by @kubernetes/feature-reviewers.
FEATURE_STATUS: IN_DEVELOPMENT

More advice:

Design

Once you get LGTM from a @kubernetes/feature-reviewers member, you can check this checkbox, and the reviewer will apply the "design-complete" label.

Coding

Use as many PRs as you need. Write tests in the same or different PRs, as is convenient for you.
As each PR is merged, add a comment to this issue referencing the PRs. Code goes in the http://github.com/kubernetes/kubernetes repository,
and sometimes http://github.com/kubernetes/contrib, or other repos.
When you are done with the code, apply the "code-complete" label.
When the feature has user docs, please add a comment mentioning @kubernetes/feature-reviewers and they will
check that the code matches the proposed feature and design, and that everything is done, and that there is adequate
testing. They won't do detailed code review: that already happened when your PRs were reviewed.
When that is done, you can check this box and the reviewer will apply the "code-complete" label.

Docs

Write user docs and get them merged in.
User docs go into http://github.com/kubernetes/kubernetes.github.io.
When the feature has user docs, please add a comment mentioning @kubernetes/docs.
When you get LGTM, you can check this checkbox, and the reviewer will apply the "docs-complete" label.

idvoretskyi · 2016-07-25T15:49:55Z

cc @kubernetes/sig-scheduling

jberkus · 2016-07-26T20:27:22Z

One thing which might make a common case of anti-affinity simpler is to allow expansion of the "spread" concept to an arbitrary label. That is, if I could say:

spread: { type: database }

That would let me express the idea of "don't run a pod with type: database on a node with any other pod of type: database", and thus allow a very simple way of expressing "don't put two Postgres pods on the same node, and don't put them on the same node as Cassandra or MySQL".

I'd expect that there are a number of cases where a specific class of applications tends to use the same resources. For example, I can imagine not wanting two busy http routers to go on the same node due to network competition, even if one is HAProxy and the other is Nginx.

jberkus · 2016-07-26T20:29:15Z

... continued:

One refinement of this is that I can imagine wanting a user-controllable "weak" vs. "hard" spread rule. For example, in most of my Postgres deployments, I would rather be short one or two pods than put two Postgres pods on the same machine (a hard rule). On the other hand, for Etcd, I could imagine saying "don't put two pods from this class on the same node if you can help it", which would be a soft rule.

davidopp · 2016-07-26T20:51:39Z

Both of the things you mentioned are supported. See the design doc linked to above (it had the wrong URL originally and I fixed it last night, so you may have read the wrong doc if you already looked at that).

jberkus · 2016-07-26T21:39:23Z

Ah, I read the design doc and I couldn't find that particular feature. Keywords/lines?

davidopp · 2016-07-26T21:50:51Z

"don't run a pod with type: database on a node with any other pod of type: database"

See "Can only schedule P onto nodes that are running pods that satisfy P1. (Assumes all nodes have a label with key node and value specifying their node name.)". Then substitute

P1 is a label selector that expresses "pod of type: database"
Can only -> Cannot (by using PodAntiAffinity instead of PodAffinity)

"weak" vs. "hard" spread rule.

See the comment for PreferredDuringSchedulingIgnoredDuringExecution (that's the "soft" flavor), as compared to the other two, in the PodAffinity/PodAntiAffinity types in the API section of the doc.
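The substitution described above can be sketched in Python as two plain dictionaries, one per flavor. This is only an illustration: the field names follow the PodAntiAffinity type, while the helper names, the type: database selector, the kubernetes.io/hostname topology key and the weight of 100 are assumptions chosen for the example. In the alpha releases discussed here the structure would be serialized into the scheduler.alpha.kubernetes.io/affinity annotation rather than set directly on the pod.

```python
import json


def term(match_labels, topology_key):
    """Build one affinity term: a pod label selector plus the topology domain."""
    return {
        "labelSelector": {"matchLabels": dict(match_labels)},
        "topologyKey": topology_key,
    }


def hard_anti_affinity(match_labels, topology_key="kubernetes.io/hostname"):
    """'Hard' rule: never co-locate with a pod matching the selector."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                term(match_labels, topology_key)
            ]
        }
    }


def soft_anti_affinity(match_labels, weight=100,
                       topology_key="kubernetes.io/hostname"):
    """'Soft' rule: avoid co-locating if possible, but schedule anyway."""
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": term(match_labels, topology_key),
                }
            ]
        }
    }


if __name__ == "__main__":
    # Postgres-style hard spread and Etcd-style soft spread from the thread.
    print(json.dumps(hard_anti_affinity({"type": "database"}), indent=2))
    print(json.dumps(soft_anti_affinity({"type": "database"}), indent=2))
```

The only difference between the two is which list the term lands in, and that the soft form wraps the term with a weight.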

jberkus · 2016-07-26T22:03:24Z

Keen, thanks!

alex-mohr · 2016-08-17T18:40:32Z

@davidopp says this is done.

davidopp · 2016-08-17T19:40:01Z

Sorry, I was misremembering what this issue is; the part of this that is described in #51 is done, but this one was not intended to be finished in 1.4. I've moved it to the 1.5 milestone.

ivan4th · 2016-09-20T13:21:23Z

Trying to implement pod (anti)affinity for DaemonSets too. Someone PTAL: kubernetes/kubernetes#31136

davidopp · 2016-10-01T08:44:07Z

Goal for 1.5 is to move this to Beta. More details in kubernetes/kubernetes#25319

timothysc · 2016-10-03T16:26:01Z

/cc @rrati @jayunit100

idvoretskyi · 2016-10-18T16:48:56Z

@wojtek-t can you explain in which stage this feature is going to be delivered in 1.5? @davidopp has defined beta, while in this conversation kubernetes/kubernetes#31136 (comment) I see some comments with concerns?

wojtek-t · 2016-10-18T17:30:46Z

@idvoretskyi - most probably it won't get to beta, but that's not a final decision from what I know.

davidopp · 2016-10-18T17:33:12Z

It's not going to be beta. There are a few features we recently decided to remove from the set we were going to move to beta in 1.5. I'll update the feature bugs shortly.

timothysc · 2016-10-18T20:30:21Z

The details can be found here: kubernetes/kubernetes#30819 and here: kubernetes/kubernetes#34508

The general gist is: using annotations as a mechanism for alpha-beta-GA API promotion has a number of issues, and @kubernetes/sig-api-machinery is working on a "happy-path" which is still TBD.

davidopp · 2016-10-18T22:20:57Z

Yes, what @timothysc said. I'm removing the beta-in-1.5 label and the 1.5 milestone.

jimmycuadra · 2016-11-30T10:49:39Z

I have a use case for this feature which I don't think is covered by the current design, but please correct me if I'm wrong.

Imagine a simple example cluster with two nodes. I want to create deployments in this cluster with two pod replicas each. I want to require that the two pods are not on the same node. I can do this with pod anti-affinity based on a label like app=foo, but when I edit the deployment, creating a new replica set, the new pods can't be scheduled, because each node already has a pod with the label app=foo. I would have to change the deployment's labels and affinity rules each time I deploy.

What I really want is a way to require that pods with the same labels and the same pod-template-hash don't end up on the same node, but I don't think there's a way to express that in the current affinity system because there's no operator for "equal to the value of that label for this pod". In other words, I'd have to know the value of the pod-template-hash in advance somehow.
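The situation can be reproduced with a toy feasibility check. The sketch below is not scheduler code: `feasible_nodes`, the node names and the hash values 1111 and 2222 are hypothetical, and the rule is reduced to "no pod on the node may match all of these labels".

```python
def feasible_nodes(nodes, anti_affinity_labels):
    """Return the nodes that run no pod matching every anti-affinity label."""
    def matches(pod_labels):
        return all(pod_labels.get(key) == value
                   for key, value in anti_affinity_labels.items())

    return [name for name, pods in nodes.items()
            if not any(matches(pod) for pod in pods)]


# Two nodes, each already running one pod of the old replica set.
cluster = {
    "node-a": [{"app": "foo", "pod-template-hash": "1111"}],
    "node-b": [{"app": "foo", "pod-template-hash": "1111"}],
}

# Rule keyed only on app=foo: pods of the new replica set fit nowhere.
blocked = feasible_nodes(cluster, {"app": "foo"})

# Rule that also names the new pod-template-hash: old pods no longer conflict.
allowed = feasible_nodes(cluster, {"app": "foo", "pod-template-hash": "2222"})

print(blocked)
print(allowed)
```

The second rule is exactly what cannot be written ahead of time, because the hash of the new replica set is not known when the deployment is authored.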

davidopp · 2016-12-04T09:15:15Z

My understanding of the way Deployments work for rolling update is that a second RS is created, initially with 0 replicas, and then the first RS is scaled down as the second RS is scaled up. So the total number of replicas across the two RSes is 2, except perhaps for transient conditions. Initially both are in the "old version," then one is in the "new version" and one is in the "old version", and finally both are in the "new version."

smarterclayton · 2016-12-05T01:00:21Z

If you have maxSurge == 0, you get "up to but not more than N" behavior. If you want to keep availability >= 100% of your original N, you'd need something that is unique for each RS as you note. We don't really have downward API for regular fields, but I could imagine something like that (have pod affinity rules depend on the current value of a label for a pod).
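The trade-off between the two settings can be illustrated with a deliberately simplified model. It assumes pods become ready instantly, that a hard anti-affinity rule keyed on a shared label allows at most one pod per node, and that the controller alternates between scaling the new RS up and the old RS down; `rolling_update` and its parameters are hypothetical and do not mirror the real Deployment controller.

```python
def rolling_update(node_count, replicas, max_surge, max_unavailable):
    """Simulate a rolling update under a one-pod-per-node anti-affinity rule.

    Returns (completed, lowest_pod_count_seen).
    """
    old, new = replicas, 0
    lowest = old
    while new < replicas:
        total = old + new
        if total < replicas + max_surge and total < node_count:
            # Surge budget and a free node are available: scale up the new RS.
            new += 1
        elif old > 0 and total - 1 >= replicas - max_unavailable:
            # Unavailability budget allows it: scale down the old RS.
            old -= 1
        else:
            # Neither step is permitted, so the rollout is stuck.
            return False, lowest
        lowest = min(lowest, old + new)
    # Any old pods still running can now be removed without losing capacity.
    return True, lowest


# Two nodes, two replicas, maxSurge == 0: completes, but dips to one pod.
print(rolling_update(node_count=2, replicas=2, max_surge=0, max_unavailable=1))

# Same cluster, surge allowed but no unavailability: no node can take the pod.
print(rolling_update(node_count=2, replicas=2, max_surge=1, max_unavailable=0))
```

Under these assumptions, keeping all N pods available on N nodes needs either a spare node or a rule that distinguishes the two replica sets.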

davidopp · 2016-12-05T07:17:02Z

Ah, thanks for the explanation.

As a workaround, could you have a pod label in your podTemplate with key "version" (or "generation" or something like that) and a value that is initially 0, and a corresponding pod anti-affinity annotation with the same key/value pair, and each time you modify the podTemplate, you bump up both values (label and anti-affinity annotation)? The value could be the hash of everything in the podTemplate except this one field, in which case I think it's basically equivalent to the feature @jimmycuadra requested. (Though a simple version number you bump up on each modification is simpler.)
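A minimal sketch of the hash variant of this workaround follows. The label key "version" is taken from the comment above; `with_version` and the flat `anti_affinity_match_labels` field are hypothetical stand-ins for the real anti-affinity annotation, used only to show that the label and the rule always carry the same value.

```python
import copy
import hashlib
import json

VERSION_KEY = "version"                   # label key suggested in the comment
RULE_KEY = "anti_affinity_match_labels"   # hypothetical stand-in for the rule


def with_version(pod_template):
    """Return a copy whose version label and anti-affinity rule share one value.

    The value is a hash of everything in the template except the version
    label and the rule itself, so it changes whenever anything else changes.
    """
    template = copy.deepcopy(pod_template)
    labels = template.setdefault("metadata", {}).setdefault("labels", {})
    labels.pop(VERSION_KEY, None)
    template.pop(RULE_KEY, None)

    canonical = json.dumps(template, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()[:10]

    labels[VERSION_KEY] = digest
    # The rule repels only pods carrying the same labels and the same version.
    template[RULE_KEY] = dict(labels)
    return template


v1 = with_version({"metadata": {"labels": {"app": "foo"}},
                   "spec": {"image": "foo:1"}})
v2 = with_version({"metadata": {"labels": {"app": "foo"}},
                   "spec": {"image": "foo:2"}})
print(v1["metadata"]["labels"], v2["metadata"]["labels"])
```

Because the value is derived from the template, two edits that produce the same template get the same value, which a hand-bumped counter would not guarantee.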

davidopp · 2017-01-20T07:18:56Z

We will be moving this feature to beta in 1.6. Tracking issue is kubernetes/kubernetes#25319

Current user guide documentation is here

jimmycuadra · 2017-01-20T10:38:38Z

Any chance of the use case I mentioned being part of the roadmap for the stable release? The suggested workaround might be prone to error. It'd be great to have the server aware of the user's intent. If not, would this be considered for a future iteration on this API? In that case, should I open a new issue somewhere to track it?

davidopp · 2017-01-20T10:59:23Z

You can open a feature request in the kubernetes/kubernetes repo and link it to this issue. We could consider it if enough people want it. Personally I'd prefer if Deployment controller managed the label changes (i.e. automate the "workaround") and we didn't change the API for pod (anti-)affinity.

ivan4th · 2017-01-20T13:40:13Z

Given that inter-pod (anti)affinity is going to be beta soon, can we get back to kubernetes/kubernetes#34543 maybe? It's a quirk (unneeded dependency) that makes it hard to move inter-pod affinity to General Predicates for instance.

davidopp · 2017-01-20T19:55:11Z

kubernetes/kubernetes#34543 (and moving to General Predicates) isn't an API change, so it's not strictly necessary for beta (i.e. it can be done after moving to beta).

Sorry we haven't reviewed that PR yet. We're working on getting more people up to speed on the scheduler code, but right now we only have the bandwidth to review things that are critical or trivial. I hope we'll get to it in the next couple of weeks.

Thanks for your patience...

idvoretskyi · 2017-03-08T18:22:44Z

@davidopp any update on this feature? Docs and release notes are required (please provide them in the features spreadsheet).

davidopp · 2017-03-08T21:25:40Z

Updated spreadsheet with release note and link to documentation.

fejta-bot · 2017-12-21T20:23:37Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

fejta-bot · 2018-01-20T21:11:24Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot · 2018-02-19T21:18:00Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

aronchick added this to the v1.4 milestone Jul 24, 2016

aronchick assigned davidopp Jul 24, 2016

idvoretskyi added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jul 25, 2016

davidopp modified the milestones: v1.5, v1.4 Aug 17, 2016

davidopp mentioned this issue Oct 7, 2016

Workload spreading across failure domains (fix pod anti-affinity performance problem) #51

Closed

21 tasks

ivan4th mentioned this issue Oct 13, 2016

Implement Pod Affinity and AntiAffinity for DaemonSets kubernetes/kubernetes#31136

Closed

idvoretskyi added the beta-in-1.5 label Oct 13, 2016

davidopp modified the milestones: next-milestone, v1.5 Oct 18, 2016

davidopp removed the beta-in-1.5 label Oct 18, 2016

This was referenced Oct 18, 2016

Node affinity #106

Closed

Multiple/user-defined schedulers #107

Closed

kerneltime mentioned this issue Jan 20, 2017

Inter-pod affinity/anti-affinity #165

Closed

23 tasks

davidopp modified the milestones: v1.6, next-milestone Jan 20, 2017

idvoretskyi added the stage/beta Denotes an issue tracking an enhancement targeted for Beta status label Jan 26, 2017

davidopp mentioned this issue Mar 2, 2017

Update "node selection" documentation to reflect Beta affinity syntax kubernetes/website#2671

Closed

craigbox mentioned this issue Jun 1, 2017

Deployments to GA #194

Closed

alena1108 pushed a commit to alena1108/kubernetes-package that referenced this issue Jul 25, 2017

Use diff label in anti affinity for dns

21bfe31

to workaround kubernetes/enhancements#60 (comment) when pods with anti affinity fail to be upgraded

alena1108 mentioned this issue Jul 25, 2017

sky-dns deployment with self anti affinity fails to upgrade rancher/rancher#9461

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2017

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 20, 2018

k8s-ci-robot closed this as completed Feb 19, 2018

ingvagabund pushed a commit to ingvagabund/enhancements that referenced this issue Apr 2, 2020

Merge pull request kubernetes#60 from smarterclayton/reorganize

c5122ae

Reorganize files into appropriate directories
