Create KEP for Instance Host Labelling #1127
Conversation
Welcome @alculquicondor!
Hi @alculquicondor. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: alculquicondor. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/cc @ahg-g
Related discussion in the mailing list: https://groups.google.com/forum/#!topic/kubernetes-sig-cloud-provider/32N59IYXogY
/ok-to-test
#### Story 1

As an application developer, I want pods to be scheduled in the same instance host so that networking between them is faster.
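To make this story concrete, here is a minimal sketch of the inter-pod affinity rule it implies, assuming the proposed `topology.kubernetes.io/instance-host` label existed and were populated (the pod names, app labels, and image are illustrative only):

```yaml
# Hypothetical sketch: co-locate pods of app "web" on the same instance host.
# Assumes the proposed topology.kubernetes.io/instance-host label is populated
# on every node; this label did not exist at the time of this discussion.
apiVersion: v1
kind: Pod
metadata:
  name: web-1
  labels:
    app: web
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web
        topologyKey: topology.kubernetes.io/instance-host
  containers:
  - name: web
    image: nginx
```

This is the same mechanism used today with `kubernetes.io/hostname` or the zone label as `topologyKey`; only the key would be new.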
This is a very valid scenario. 👍
It would be great to highlight in which scenarios instance-host is useful over plain Kubernetes node topology for this case.
Are we saying that instance-host affinity would have to guarantee faster inter-instance networking than a zone?
While I would call this (depending on faster networking by co-scheduling) an anti-pattern, I don't think we should hide information or stop people from making their own decisions.
Under no circumstances can we prescribe the semantic implications of a platform when co-scheduling. It might be faster, it might be the same, or it might be slower. We cannot say anything about that in the abstract. A given implementation may choose to document this or not.
So this use case bothers me a lot. Pods are the unit of co-scheduling by design, to prevent this scenario, which is degenerative. Typically, if you care that much, you will make it part of the pod itself; otherwise you get into scheduling issues. A better way folks typically describe this is by use of other custom labels and calculations of distance: region/zone/cluster/rack are common labels, and folks can apply weights to imply distance from those labels. This is how data-gravity scheduling typically works.
Story 2 was actually the driver for this KEP #75274. However, even pod spreading is still a feature in development, whereas pod (anti)affinity already exists. We can remove this Story if it's something we don't want to promote.
This is definitely a story that I don't think we want to promote given the history behind pods (co-scheduling) and existing affinity rules.
#### Story 2

As an application developer, I want pods to be spread in different instance hosts so that if the host goes down, my service doesn’t have to scale from zero.
I use `topology.kubernetes.io/instance-host` as topology key in [even pod spreading](/keps/sig-scheduling/20190221-even-pods-spreading.md) rules.
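A sketch of what this story would look like with the even-pod-spreading API proposed in the referenced KEP (both the spread-constraints API and the `instance-host` label were still proposals at this point; names and image are illustrative):

```yaml
# Hypothetical sketch: spread "web" pods evenly across instance hosts.
# Assumes both topologySpreadConstraints (then in development) and the
# proposed topology.kubernetes.io/instance-host label are available.
apiVersion: v1
kind: Pod
metadata:
  name: web-1
  labels:
    app: web
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/instance-host
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: web
    image: nginx
```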
I would say that this can already be solved by using the `failure-domain.beta.kubernetes.io/zone` label. In public cloud environments, VM placement within a region is usually random (i.e. the user does not have full control over it). So a multi-node environment almost by default runs in multiple zones, so that the zone label is sufficient for this use case - and it also provides higher redundancy than host isolation, as zones in the public cloud are highly isolated from each other.
In private cloud environments, a zone is not strictly defined (it's almost arbitrary, like any other label that could be applied). Thus, in the vSphere cloud provider we give the operator flexibility to define what a zone is via vSphere tags associated with vSphere objects, e.g. a host, cluster or even data center. This tag is read and used for populating `failure-domain.beta.kubernetes.io/zone`. I acknowledge that different on-prem cloud provider implementations have different semantics here (which does not make this discussion easier...).
I would like to have feedback from other providers on this
Wearing my Google Cloud hat: GCP does not provide this information and does not guarantee that a VM will not be live-migrated to another physical host at runtime. In fact, we practically guarantee that any VM of sufficiently long life will get migrated eventually.
We are designing a solution for populating the label that supports migration: basically a daemonset that frequently checks whether the VM got migrated and updates the label accordingly. Of course, this assumes that the cloud provider has an API that allows us to do that (in Google Cloud, for example, perhaps this could be provided through the instance metadata service).
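A rough sketch of such a daemonset, heavily hedged: the metadata endpoint and its `physical-host` field are pure assumptions (no public cloud exposed this at the time of the discussion), the image is a placeholder for any image containing `kubectl` and `curl`, and labeling nodes would additionally require suitable RBAC:

```yaml
# Hypothetical sketch of the label-updating DaemonSet described above.
# The metadata URL and "physical-host" field are assumptions, not a real API.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: instance-host-labeler
spec:
  selector:
    matchLabels:
      app: instance-host-labeler
  template:
    metadata:
      labels:
        app: instance-host-labeler
    spec:
      containers:
      - name: labeler
        image: bitnami/kubectl  # placeholder: any image with kubectl and curl
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        command: ["/bin/sh", "-c"]
        args:
        - |
          while true; do
            # Hypothetical metadata endpoint exposing the physical host ID.
            host=$(curl -s -H "Metadata-Flavor: Google" \
              http://metadata/computeMetadata/v1/instance/physical-host || true)
            if [ -n "$host" ]; then
              kubectl label node "$NODE_NAME" --overwrite \
                "topology.kubernetes.io/instance-host=$host"
            fi
            sleep 60
          done
```

The polling loop is what makes this survive live migration: if the host ID changes between iterations, the node label is overwritten on the next pass.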
I have to double check, but you can do this today with your own labeling, and it's not necessarily applicable across providers.
This is covered in the Alternatives section.
Afaics, our advice to date has been that if you need failure resilience, you spread your pods by the abstract `failure-domain.beta.kubernetes.io/zone` label. Once you have done that, there is no need to also spread by instance host as proposed here (ditto for other possible concrete failure domains like "rack" or "power distribution unit").
Is there something in this story that is not already covered by our existing advice re `failure-domain.b.k.io/zone`?
I understand that some clouds (notably AWS) also have a specific concept called a "zone", and that the cloud zones might nest differently to the k8s "zone". Perhaps the motivation for this KEP comes from a misunderstanding as to what/how the k8s "zone" label should be used? (In which case I think the solution is better documentation, or possibly renaming the label, not adding additional labels.)
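For reference, the zone-based spreading this advice refers to can already be expressed with inter-pod anti-affinity against the then-beta zone label (pod names, app labels, and image below are illustrative):

```yaml
# Spread "web" pods across zones using the existing zone label,
# without any new instance-host label.
apiVersion: v1
kind: Pod
metadata:
  name: web-1
  labels:
    app: web
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web
        topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: web
    image: nginx
```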
There is one case worth mentioning:
An empty topology key for `preferredDuringSchedulingIgnoredDuringExecution` in inter-pod anti-affinity is interpreted as "all topologies", so the instance host should be considered as well.
If the label value is empty, then the anti-affinity is still ensured by the hostname.
Thanks for pointing this out!
- Risk: Some or all nodes are missing a value for the instance host label and an application developer tries to use the label as a topology key.

  Mitigation: This will be documented as undefined behavior.
👍
This should be no different than any other topo key.
- Risk: If a cloud provider or hypervisor performs a live migration of a VM, the host will change.

  Mitigation: The label should be updated. Evicting running pods is out-of-scope for the scheduler. New pods are scheduled according to the new labels.
This is the biggest concern I have with this KEP. YMMV and the following comments are highly dependent on the hypervisor/virtualization platform (capabilities available/exposed/configured).
@bsalamat brought up an important point:
We have accepted the fact that clusters are dynamic in nature and scheduling decisions may be invalidated in the future. This is an assumption that other new features should make too.
I agree with him (because fundamentally that's the current implementation of the scheduler). On the other side, from a UX perspective, this is not what is known/desired, especially in cases where hard pod anti-affinity is needed, e.g. critical stateful services. Now, a workaround would be to use the `zone` label here (assuming zone membership is static and won't change, which might or might not be the case...) instead of the `instance` label - and the users I've worked with do that and have never reported any issues with it.
My feeling is that a majority of users would use the instance label for resiliency, which ultimately is not guaranteed when VMs migrate, labels need to be updated, and workloads have to be evicted/restarted to reconcile. So why not keep using the zone label for resiliency then? I don't want to repeat every argument of the discussion here, so I'm pointing to the former KEP which has more details: #997 (review)
Some more findings from the field (real-world on-prem deployments I was involved with): when we planned to add zones support in the vSphere cloud provider, we went with a two-phase approach: first, implement the basic functionality; second, try to solve the "label drift" issue. It turned out that solving the "label drift" issue was impossible, i.e. it always had consequences (impact) for user workloads. This is not what we wanted to put users through, so phase two was never implemented. Did that cause a problem? Not for our users. They leverage zones topology (again, that could be arbitrary on-prem) to model resiliency constructs. Whenever needed, vSphere will be configured to never move VMs if that could cause "label drift", e.g. moving a VM from zone 1 to zone 2 because of cluster imbalance or host failure. I understand that this might not be possible (available) with all hypervisors - which unfortunately complicates the matter.
Also, @anfernee built a prototype to influence scheduling decisions honoring host information. It couldn't solve the "label drift" issue and also did not get a lot of traction because our user base was happy with the current (zones) implementation.
I am not saying to keep the status quo. But assuming that we could easily update labels and restart workloads is a weak assumption, especially for stateful production critical workloads in highly dynamic clusters. Also it might be worth pointing out that not all Kubernetes users can simply modify their workloads to comply with the Kubernetes semantics, be it because they don't have the manpower, access to source code, etc.
@tallclair do you have any thoughts on these semantics for live migration?
I would argue that if a VM migrates, the label MUST be updated. Whether that ripples out into a pod reschedule is a totally different question. We don't have a rescheduler today, but hard anti-affinity might be broken by a migration, which would effectively require either a resched or that we loosen the guarantees for affinity/anti-affinity.
(Just highlighting that live-migrating an instance/pod, and then restarting the pod because its label changed, is equivalent to just not live-migrating it in the first place. I advocate taking the simpler option and just not engaging with this arms race..)
@anguslees the problem is that the hypervisor likely doesn't have a way to tell whether it can migrate a VM without breaking any restrictions for the pod.
On the re-scheduling topic, note that we face the same problem when adding new taints to a node for which running pods don't have a toleration.
## Alternatives

Do nothing.
Actually, this sounds a bit naive but might just be ok 😀 On the other side, nothing speaks against adding this label if the use cases and impact (e.g. drift, etc.) are clearly articulated. Customers can then decide which label to use for proximity, resiliency, spreading, etc.
## Alternatives

Do nothing.
Cloud providers and on-prem clusters could support this topology by defining their own label and controllers to populate it.
We should survey the existing providers to see if exposing this topology is even possible. I get the sense that it's currently only useful for 1-2 providers, for the others it's just not feasible or the API does not expose that information. If we find the use-case is common enough we should codify it.
Maybe we can put up a vote on this one in SIG Cloud Provider? Just because a current cloud provider can't support it, doesn't mean they won't want it in the future.
I agree. Andrew, what is the best way to survey existing providers? Should we do it on this KEP, or send out something like a form?
Also, can you please tag the stakeholders from sig-cloud-provider?
Something in the mailing list is ideal and then final vote on the next SIG call.
## Proposal

Add a well-known label for instance host: `topology.kubernetes.io/instance-host` and make its default value empty.
I don't love the label name `topology.kubernetes.io/instance-host`; TBH, I feel it would get confused with just "instance". The alternative is `topology.kubernetes.io/hypervisor`, but maybe that is too specific?
Alternatives I can think of:
- `physical-machine`: sounds like jargon and it's long.
- `vm-host`: can it be confused with just vm?
Some others I've thought of:
- `topology.kubernetes.io/vmm` (virtual machine monitor)
- `topology.kubernetes.io/hardware`

None of them stick out to me though.
What about `topology.kubernetes.io/physical-host` (mentioned on #997)? It avoids the confusion and is quite descriptive as well.
`topology.kubernetes.io/physical-host` could be a better name IMO.
physical-host is not egregious. The default should not be empty, but the whole label should be undefined.
/hold This one is pretty thorny and IMO requires arch approval, because it means treating labels as fully fledged API constants.
See the thread in https://groups.google.com/forum/#!topic/kubernetes-sig-cloud-provider/32N59IYXogY for more context.
FYI to reviewers, a survey was created. /cc @bsalamat
@timothysc by removing semantic aspects of labels we will greatly limit ourselves in implementing advanced features in Kubernetes. Today, the scheduler uses labels to find about nodes, zones, and regions.
@bsalamat - we're lacking policy and formality here to treat them like 1st-class APIs, which they should be. TL;DR: sig-arch really needs to take an official stance on this issue. We've skirted around it for a long time, saying one thing and doing another. I'm not opposed to this work, but it's abundantly clear that we are saying and doing two different things.
@bsalamat said:
Today, the scheduler uses labels to find about nodes, zones, and regions
One could argue that it shouldn't, or at least that the set of "hardcoded" topology labels should be a flag or config-file. That way, a provider which offers this new key can put that key into the config. Then it doesn't matter if it is kubernetes.io/something or provider.com/something. Obviously that would matter to users. I am not convinced this is widespread enough to worry about user-provider portability.
@timothysc said:
sig-arch really needs to take an official stance on this issue
I don't disagree, but we have SO MANY things in flight, I don't know that another architectural proclamation is going to help anyone. I care about this topic, but I simply don't have enough hours in the day to do the work.
Summary: I don't hate this but it's a one-off fix to a specific instance of a class of problems. If enough providers support it and we can agree to (and clearly document) sane semantics in the face of migration, I won't block this.
Why does this want to be a Daemonset rather than a simple controller running $somewhere, e.g. the per-provider cloud-controller-manager?
We may need to use a Daemonset if the API for retrieving the instance's physical host is local to the instance (e.g., GCE instance metadata, although GCE currently does not support that yet), but using a cloud-controller-manager is certainly the simpler option if that is not the case.
Reminder for folks to fill out this survey by @alculquicondor so we can gauge how many providers want/can support this: https://docs.google.com/forms/d/e/1FAIpQLSew1s81jKE_dWt55CeZdbEqkgYJb04lJPshCNQqnFzg4VV3WA/viewform
Thank you for this. Cloud providers, AFAIK, do not make any guarantee on VM-to-host placement. I have not seen a cloud provider that provides a means to learn about and (leak) the underlying physical host. For on-prem scenarios, the physical host name can be labeled on the node or fed as a value for the existing topology labels. For the reasons above, I don't think this KEP should be accepted.
Since it came up in conversation again today, a recap of my POV:

Nothing stops any cloud provider from decorating nodes with any label they want. Topology is arbitrary and not necessarily hierarchical. A cloud provider might have any number of topology-related concepts (including 0) that a node exists in. If a cloud provider knows which "physical host" a node is on, it is free to add that as a label. If the "physical host" is a common enough thing that we want to grant it a well-known prefix, we can do that.

"Physical host" is not a terrible name, but it is very precise. We may want something more abstract (or not). It could just as easily be "power supply" or "blast zone" or "underlying OS UUID". No topology key has any particular requirements for semantic meaning. Common things like "zone" and "region" have conventional meanings, and providers who violate those conventions may give their users a bad trip, but they are not invalid, even if we give any label a well-known prefix.

If we want the scheduler to automatically spread across topologies, we should probably not hardcode the special labels. Instead we should take a flag or config which indicates which labels are to be treated specially. The default might be the existing well-known topology labels.

If sig-cloud-provider folks tell me they think this use-case is common enough to justify an elevated prefix, I will go with it, but I will want to see the answers to the above questions.
From a user's POV (rather than a cloud provider's): I just want a way to indicate "spread wide for fault resilience" or "group close for (presumably) better performance/cost". And I want to express that in a cloud-agnostic manner, so it can be burned into (e.g.) helm charts. I object to this KEP on those grounds. I think we already have the former, useful knobs (in the abstract region/zone and concrete node labels), and I think this KEP introduces the latter, harmful knobs.

For the 0.1% of workloads that really do care about specific underlying topology, they can (and should!) be configured using provider-specific labels, since they are already not naively portable across providers. Imo, any mapping between instance-host (or whatever other significant topology) and abstract k8s scheduling concepts (i.e. region/zone) should be performed inside the cloud provider code. On cloud providers where instance-hosts exist and are exposed to users, the provider can already make that information available to users in debugging status/logs or clearly-marked provider-specific node labels.
Thanks @thockin!
This is the part we are trying to figure out at the moment. It seems like most cloud providers don't expose this topology information, and for those that do, it requires some workarounds. The sentiment I'm getting so far is that the use-case is actually not common enough, because either the cloud provider doesn't expose the information or the live-migrate semantics are too complex and not worth the trade-off - in which case I think we're okay to close this KEP? Will defer to @alculquicondor based on the results of the survey he sent out.
I just want to state my concern with making it a recommended practice to use the zone label for physical hosts on-prem. While it may make the spec portable between on-prem and cloud, I suspect users will find the cost of this portability high, since some cloud providers charge for cross-zone traffic.
Just some info from the IBM Cloud side of things. Today we do not expose this topology info for instances that are running on shared infrastructure. There is some consideration for letting those who purchase dedicated hosts more easily gather metadata about which host an instance is on. If you work from the host down, you can discover this information, but it is not readily available from within the OS. In the future, if we were to add support for dedicated hosts in IKS, we would likely leverage this instance host label feature. In our internal designs to date, I was just planning on abusing the zone labeling. :)
For public clouds I don't think this applies, because they don't expose the physical host topology anyway, so you would use zones to represent AZs. For on-prem, you can still use the regions topology for DC or site topology if you use zones for physical hosts. But also for on-prem, you rarely see clusters spanning multiple DCs (at least in my experience), so topologies typically won't span cross-zone anyway.
Given the feedback above and the survey results, we are abandoning this KEP. Thanks to all the reviewers for the feedback.
Thanks Aldo for driving this issue to a decision, and thanks everyone for the feedback.
Proposing a new well-known label for instance host.
In this PR: Summary, Motivation, Proposal, Drawbacks and Alternatives.
Design Details will be included in a follow up.
This PR supersedes #997.
/sig cloud-provider
/sig scheduling
/ref kubernetes/kubernetes#75274