
Create KEP for Instance Host Labelling #1127

Conversation

@alculquicondor (Member) commented Jul 4, 2019:

Proposing a new well-known label for the instance host.
In this PR: Summary, Motivation, Proposal, Drawbacks, and Alternatives.
Design Details will be included in a follow-up.

This PR supersedes #997.

/sig cloud-provider
/sig scheduling

/ref kubernetes/kubernetes#75274

@k8s-ci-robot (Contributor):

Welcome @alculquicondor!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jul 4, 2019
@k8s-ci-robot (Contributor):

Hi @alculquicondor. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 4, 2019
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alculquicondor
To complete the pull request process, please assign jagosan
You can assign the PR to them by writing /assign @jagosan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 4, 2019
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jul 4, 2019
@alculquicondor (Member, Author):

/cc @ahg-g

@k8s-ci-robot k8s-ci-robot requested a review from ahg-g July 5, 2019 17:35
@andrewsykim (Member):

Related discussion in the mailing list https://groups.google.com/forum/#!topic/kubernetes-sig-cloud-provider/32N59IYXogY

@draveness (Contributor) left a comment:

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 9, 2019
@alculquicondor (Member, Author):

cc @thockin and @embano1, who laid down some ideas in the thread


#### Story 1

As an application developer, I want pods to be scheduled in the same instance host so that networking between them is faster.
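(For concreteness: Story 1 maps onto the existing inter-pod affinity API, with the proposed label as the topology key. Below is a minimal Go sketch using the k8s.io/api types; the label is hypothetical since this KEP was never merged, and the app=frontend selector is only an example.)

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sameHostAffinity co-locates pods labeled app=frontend on one instance host.
// The topology key is the label proposed in this KEP; it was never merged.
func sameHostAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAffinity: &corev1.PodAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "frontend"}, // example selector
				},
				TopologyKey: "topology.kubernetes.io/instance-host", // hypothetical
			}},
		},
	}
}
```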
Member:

This is a very valid scenario. 👍

Member:

It would be great to highlight in which scenarios instance-host is useful over just the Kubernetes node topology for this case.

Member (Author):

Are we saying that instance-host affinity would have to guarantee faster inter-instance networking than a zone?

Member:

While I would call this (depending on faster networking by co-scheduling) an anti-pattern I don't think we should hide information or stop people from making their own decisions.

Under no circumstances can we prescribe the semantic implications of a platform when co-scheduling. It might be faster, it might be the same, or it might be slower. We cannot say anything about that in the abstract. A given implementation may choose to document this or not.

Member:

So this use case bothers me a lot. Pods are the unit of co-scheduling by design, to prevent this degenerate scenario. Typically, if you care that much, you make it part of the pod itself; otherwise you get into scheduling issues. A better way folks typically describe this is by use of other custom labels and calculations of distance: region/zone/cluster/rack are common labels, and folks can apply weights to imply distance from those labels. This is how data-gravity scheduling typically works.

@alculquicondor (Member, Author) commented Jul 24, 2019:

Story 2 was actually the driver for this KEP (kubernetes/kubernetes#75274). However, even pod spreading is still a feature in development, whereas pod (anti)affinity already exists. We can remove this story if it's something we don't want to promote.

Member:

This is definitely a story that I don't think we want to promote given the history behind pods (co-scheduling) and existing affinity rules.

#### Story 2

As an application developer, I want pods to be spread in different instance hosts so that if the host goes down, my service doesn’t have to scale from zero.
I use `topology.kubernetes.io/instance-host` as topology key in [even pod spreading](/keps/sig-scheduling/20190221-even-pods-spreading.md) rules.
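(For concreteness: Story 2 corresponds to what later shipped as pod topology spread constraints. A hedged Go sketch follows; the label is hypothetical and the app=my-service selector is only an example.)

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// spreadAcrossHosts spreads the pods of one service across instance hosts so
// that a single host failure does not take out every replica. The topology
// key is the label proposed in this KEP; it was never merged.
func spreadAcrossHosts() corev1.TopologySpreadConstraint {
	return corev1.TopologySpreadConstraint{
		MaxSkew:           1,
		TopologyKey:       "topology.kubernetes.io/instance-host", // hypothetical
		WhenUnsatisfiable: corev1.DoNotSchedule,
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "my-service"}, // example selector
		},
	}
}
```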
Member:

I would say that this can already be solved by using the failure-domain.beta.kubernetes.io/zone label. In public cloud environments, VM placement within a region is usually random (i.e. the user does not have full control over it). So a multi-node environment almost by default runs in multiple zones, and the zone label is sufficient for this use case - and it also provides higher redundancy than host isolation, as zones in the public cloud are highly isolated from each other.

Member:

In private cloud environments, a zone is not strictly defined (it's almost arbitrary, like any other label that could be applied). Thus, in the vSphere cloud provider we give the operator flexibility to define what a zone is via vSphere tags associated with vSphere objects, e.g. a host, cluster, or even data center. This tag is read and used to populate failure-domain.beta.kubernetes.io/zone. I acknowledge that different on-prem cloud provider implementations have different semantics here (which does not make this discussion any easier...).

Member (Author):

I would like to have feedback from other providers on this

Member:

Wearing my Google Cloud hat: GCP does not provide this information and does not guarantee that a VM will not be live-migrated to another physical host at runtime. In fact, we practically guarantee that any VM of sufficiently long life will get migrated eventually.

@ahg-g (Member) commented Jul 24, 2019:

We are designing a solution for populating the label that supports migration: basically, a daemonset that frequently checks whether the VM got migrated and updates the label accordingly. Of course, this assumes that the cloud provider has an API that allows us to do that (in Google Cloud, for example, perhaps this could be provided through the instance metadata service).
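(A rough Go sketch of the per-node agent described above, for illustration only: the metadata URL is an assumption, since no public cloud actually exposed the physical host at the time, the label key is the one proposed in this KEP, and a recent client-go is assumed.)

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const labelKey = "topology.kubernetes.io/instance-host" // proposed in this KEP, never merged

// currentHost asks a hypothetical metadata endpoint which physical host the VM
// is currently running on. No real cloud metadata service exposes this path.
func currentHost() (string, error) {
	resp, err := http.Get("http://169.254.169.254/hypothetical/physical-host")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return strings.TrimSpace(string(b)), err
}

func main() {
	nodeName := os.Getenv("NODE_NAME") // injected via the downward API in the DaemonSet spec
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Poll and patch the node label whenever the reported host changes.
	for last := ""; ; time.Sleep(time.Minute) {
		host, err := currentHost()
		if err != nil || host == "" || host == last {
			continue
		}
		patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{%q:%q}}}`, labelKey, host))
		if _, err := client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
			types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err == nil {
			last = host
		}
	}
}
```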

Member:

I have to double check but you can do this today with your own labeling, and it's not necessarily applicable across providers.

Member (Author):

This is covered in Alternatives section.

Member:

Afaics, our advice to date has been that if you need failure resilience then you spread your pods by the abstract failure-domain.beta.kubernetes.io/zone label. Once you have done that, then there is no need to also spread by instance host as proposed here (ditto for other possible concrete failure domains like "rack" or "power distribution unit").

Is there something in this story that is not already covered by our existing advice re failure-domain.b.k.io/zone?

I understand that some clouds (notably AWS) also have a specific concept called a "zone", and that the cloud zones might nest differently to the k8s "zone". Perhaps the motivation for this KEP comes from a misunderstanding as to what/how the k8s "zone" label should be used? (in which case I think the solution is better documentation, or possibly renaming the label, not adding additional labels)

There is one case worth mentioning:
An empty topology key for `preferredDuringSchedulingIgnoredDuringExecution` in inter-pod anti-affinity is interpreted as
"all topologies", so the instance host should be considered as well.
If the label value is empty, then the anti-affinity is still ensured by the hostname.
Member:

Thanks for pointing this out!


- Risk: Some or all nodes are missing a value for the instance host label and an application developer tries to use the label as a topology key.

Mitigation: This will be documented as undefined behavior.
Member:

👍

Member:

This should be no different than any other topo key.

- Risk: If a cloud provider or hypervisor performs a live migration of a VM, the host will change.

Mitigation: The label should be updated. Evicting running pods is out-of-scope for the scheduler.
New pods are scheduled according to the new labels.
Member:

This is the biggest concern I have with this KEP. YMMV and the following comments are highly dependent on the hypervisor/virtualization platform (capabilities available/exposed/configured).

@bsalamat brought up an important point:

We have accepted the fact that clusters are dynamic in nature and scheduling decisions may be invalidated in the future. This is an assumption that other new features should make too.

I agree with him (because fundamentally that's the current implementation of the scheduler). On the other hand, from a user-experience perspective this is not what is known/desired, especially in cases where hard pod anti-affinity is needed, e.g. critical stateful services. Now, a workaround would be to use zone (assuming zone membership is static and won't change, which may or may not be the case...) instead of the instance label here - and the users I've worked with do that and have never reported any issues with it.

My feeling is that a majority of users would use the instance label for resiliency, which ultimately is not guaranteed when VMs migrate, labels need to be updated, and workloads have to be evicted/restarted to reconcile. So why not keep using the zone label for resiliency then? I don't want to repeat every argument of the discussion here, so I'm pointing to the former KEP which has more details: #997 (review)

Some more findings from the field (real-world on-prem deployments I was involved with): when we planned to add zones support in the vSphere cloud provider we went with a two-phase approach: first, implement the basic functionality; second, try to solve the "label drift" issue. It turned out that solving the "label drift" issue was impossible, i.e. it always had consequences (impact) for user workloads. That is not something we wanted to put users through, so phase two was never implemented. Did that cause a problem? Not for our users. They leverage zone topology (again, that can be arbitrary on-prem) to model resiliency constructs. Whenever needed, vSphere is configured to never move VMs if that could cause "label drift", e.g. moving a VM from zone 1 to zone 2 because of cluster imbalance or host failure. I understand that this might not be possible (available) with all hypervisors - which unfortunately complicates the matter.

Also, @anfernee built a prototype to influence scheduling decisions honoring host information. It couldn't solve the "label drift" issue and also did not get a lot of traction because our user base was happy with the current (zones) implementation.

I am not saying to keep the status quo. But assuming that we could easily update labels and restart workloads is a weak assumption, especially for stateful production critical workloads in highly dynamic clusters. Also it might be worth pointing out that not all Kubernetes users can simply modify their workloads to comply with the Kubernetes semantics, be it because they don't have the manpower, access to source code, etc.

Member (Author):

@tallclair do you have any thoughts on these semantics for live migration?

Member:

I would argue that if a VM migrates, the label MUST be updated. Whether that ripples out into a pod reschedule is a totally different question. We don't have a rescheduler today, but hard anti-affinity might be broken by a migration, which would effectively require either a resched or that we loosen the guarantees for affinity/anti-affinity.

Member:

(Just highlighting that live-migrating an instance/pod, and then restarting the pod because its label changed, is equivalent to just not live-migrating it in the first place. I advocate taking the simpler option and just not engaging in this arms race.)

Member (Author):

@anguslees the problem is that the hypervisor likely doesn't have a way to tell if it can migrate a VM without breaking any restrictions for the pod.
On the re-scheduling topic, note that we face the same problem when adding new taints to a node for which running pods don't have a toleration.


## Alternatives

Do nothing.
Member:

Actually, this sounds a bit naive but might just be ok 😀 On the other side, nothing speaks against adding this label if the use cases and impact (e.g. drift, etc.) are clearly articulated. Customers can then decide which label to use for proximity, resiliency, spreading, etc.

## Alternatives

Do nothing.
Cloud providers and on-prem clusters could support this topology by defining their own label and controllers to populate it.
Member:

We should survey the existing providers to see if exposing this topology is even possible. I get the sense that it's currently only useful for 1-2 providers, for the others it's just not feasible or the API does not expose that information. If we find the use-case is common enough we should codify it.

@andrewsykim (Member) commented Jul 12, 2019:

Maybe we can put up a vote on this one in SIG Cloud Provider? Just because a current cloud provider can't support it, doesn't mean they won't want it in the future.

@ahg-g (Member) commented Jul 12, 2019:

I agree. Andrew, what is the best way to survey existing providers? Should we do it on this KEP or send something like a form?

Also, can you please tag the stakeholders from sig-cloud-provider?

Member:

Something in the mailing list is ideal and then final vote on the next SIG call.


## Proposal

Add a well-known label for instance host: `topology.kubernetes.io/instance-host` and make its default value empty.
Member:

I don't love the label name `topology.kubernetes.io/instance-host` TBH; I feel it would get confused with just "instance". The alternative is `topology.kubernetes.io/hypervisor`, but maybe that is too specific?

Member (Author):

Alternatives I can think of:

  • physical-machine sounds like jargon and it's long.
  • vm-host: could it be confused with just vm?

Member:

Some others I've thought of:

  • topology.kubernetes.io/vmm (virtual machine monitor)
  • topology.kubernetes.io/hardware

None of them stick out to me though

Reviewer:

What about topology.kubernetes.io/physical-host (mentioned on #997)? it avoids the confusion and is quite descriptive as well.

Contributor:

topology.kubernetes.io/physical-host could be a better name IMO.

Member:

physical-host is not egregious. The default should not be empty, but the whole label should be undefined.
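(For reference, the existing well-known topology labels are defined as Go constants in k8s.io/api/core/v1, in well_known_labels.go. Had the physical-host name floated above been accepted, the addition would presumably have looked something like the sketch below; the constant is hypothetical and nothing like it was ever merged.)

```go
package example

// Existing well-known topology labels, for comparison (at the time of this
// thread the beta failure-domain.beta.kubernetes.io/* names were in use):
//   topology.kubernetes.io/region
//   topology.kubernetes.io/zone

// Hypothetical constant discussed in this thread; never merged.
const LabelTopologyPhysicalHost = "topology.kubernetes.io/physical-host"
```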

@timothysc (Member):

/hold
/assign @bgrant0607 @smarterclayton @liggitt @thockin
/cc @kubernetes/sig-architecture-feature-requests

This one is pretty thorny and IMO requires arch approval, because it means treating labels as fully fledged API constants.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 18, 2019
@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Jul 18, 2019
@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 18, 2019

@timothysc timothysc self-assigned this Jul 18, 2019
@alculquicondor (Member, Author):

FYI to reviewers, a survey was created

/cc @bsalamat

@bsalamat (Member):

This one is pretty thorny and IMO requires arch approval, because it means treating labels as fully fledged API constants.

@timothysc by removing semantic aspects of labels we will greatly limit our ability to implement advanced features in Kubernetes. Today, the scheduler uses labels to learn about nodes, zones, and regions. What is proposed here is very similar. We could remove all the logic that treats these labels specially, but then features such as default spreading of pods among nodes and zones, one of the most widely used features of K8s, would be removed as well.
It is fine and appreciated to have an arch review of this PR, but please consider all pros and cons of this feature when making a decision.

@timothysc (Member) commented Jul 22, 2019:

@bsalamat - we're lacking policy and formality here to treat them like first-class APIs, which they should be. TL;DR: sig-arch really needs to take an official stance on this issue. We've skirted around it for a long time, saying one thing and doing another.

I'm not opposed to this work, but it's abundantly clear that we are saying and doing two different things.

@thockin (Member) left a comment:

@bsalamat said:

Today, the scheduler uses labels to learn about nodes, zones, and regions

One could argue that it shouldn't, or at least that the set of "hardcoded" topology labels should be a flag or config-file. That way, a provider which offers this new key can put that key into the config. Then it doesn't matter if it is kubernetes.io/something or provider.com/something. Obviously that would matter to users. I am not convinced this is widespread enough to worry about user-provider portability.

@timothysc said:

sig-arch really needs to take an official stance on this issue

I don't disagree, but we have SO MANY things in flight, I don't know that another architectural proclamation is going to help anyone. I care about this topic, but I simply don't have enough hours in the day to do the work.

Summary: I don't hate this but it's a one-off fix to a specific instance of a class of problems. If enough providers support it and we can agree to (and clearly document) sane semantics in the face of migration, I won't block this.



@thockin (Member) commented Jul 24, 2019 via email:

Why does this want to be a Daemonset rather than a simple controller running $somewhere, e.g. the per-provider cloud-controller-manager?

@ahg-g (Member) commented Jul 24, 2019:

Why does this want to be a Daemonset rather than a simple controller running $somewhere, e.g. the per-provider cloud-controller-manager?

We may need to use a Daemonset if the API for retrieving the instance's physical host is local to the instance (e.g., GCE instance metadata, although GCE does not currently expose that), but using a cloud-controller-manager is certainly the simpler option if that is not the case.

@andrewsykim (Member):

Reminder for folks to fill out this survey by @alculquicondor so we can gauge how many providers want/can support this.

https://docs.google.com/forms/d/e/1FAIpQLSew1s81jKE_dWt55CeZdbEqkgYJb04lJPshCNQqnFzg4VV3WA/viewform

@khenidak (Contributor):

Thank you for this. Cloud providers, AFAIK, do not make any guarantee of faster networking for VMs on the same physical host. Instead they make guarantees around minimum bandwidth within a zone, datacenter, or similar geographic concept. Hence leaking the physical host info into the cluster won't create any additional value, at least from the public cloud point of view.

I have not seen a cloud provider that provides a means to learn about (and leak) the physical host name in a VM. Instead, constructs like zones and fault domains are used. Those constructs are then fed into the cluster as values for the zone label (or even custom labels) and are used by the node selector and pod affinity features. Collectively they satisfy the requirements described in Story 1 and Story 2.

For on-prem scenarios, the physical host name can be labeled on the node or fed as the value for the zone label and still be used the same way. That at least guarantees some level of portability between deployments created on public clouds and those created on-prem.

For the reasons above I don't think this KEP should be accepted.

@thockin (Member) commented Jul 25, 2019:

Since it came up in conversation again today, a recap of my POV:

Nothing stops any cloud provider from decorating nodes with any label they want.

Topology is arbitrary and not necessarily hierarchical. A cloud provider might have any number of topology-related concepts (including 0) that a node exists in.

If a cloud provider knows which "physical host" a node is on, it is free to add that as a label.

If the "physical host" is a common enough thing that we want to grant it a kubernetes.io/ prefix, but I am skeptical.

"Physical host" is not a terrible name, but it is very precise. We may want something more abstract (or not). It could just as easily be "power supply" or "blast zone" or "underlying OS UUID".

No topology key has any particular requirements for semantic meaning. Common things like "zone" and "region" have conventional meanings, and providers who violate those conventions may give their users a bad trip, but they are not invalid.

If we give any label a kubernetes.io/ prefix we must be clear on what is GUARANTEED to happen, what is guaranteed to NOT happen, and what is not guaranteed either way when that label changes (e.g. a live-migration of a VM to a new physical host). We must also define the implications for "hard" scheduling requirements that become invalidated by a migration.

If we want the scheduler to automatically spread across topologies, we should probably not hardcode the special labels. Instead we should take a flag or config which indicates which labels are to be treated specially. The default might be topology.kubernetes.io/zone, but any user could change that (if they control the master) and any hosted master (e.g. GKE, EKS, AKS, ...) could set it to whatever is appropriate for their cloud.

If sig-cloud-provider folks tell me they think this use-case is common enough to justify an elevated prefix, I will go with it, but I will want to see the answers to the above questions.

@anguslees (Member) commented Jul 25, 2019:

From a user's POV (rather than a cloud provider's): I just want a way to indicate "spread wide for fault resilience" or "group close for (presumably) better performance/cost". And I want to express that in a cloud-agnostic manner, so it can be burned into (e.g.) helm charts.
What I don't want is to have to come up with a unique way to express that for each cloud provider, based on the specific provider's choice of technology and terminology.

I object to this KEP on those grounds. I think we already have the former useful knobs (in the abstract region/zone and concrete node labels), and I think this KEP introduces the latter harmful knobs. For the 0.1% of workloads that really do care about specific underlying topology, then they can (and should!) be configured using provider-specific labels, since they are already not naively-portable across providers.

Imo, any mapping between instance-host (or whatever other significant topology) and abstract k8s scheduling concept (ie: region/zone) should be performed inside the cloud provider code. On cloud providers where instance-hosts exist and are exposed to users, then the provider can already make that information available to users in debugging status/logs or clearly-marked provider-specific node labels.

@andrewsykim (Member):

Thanks @thockin!

If sig-cloud-provider folks tell me they think this use-case is common enough to justify an elevated prefix, I will go with it, but I will want to see the answers to the above questions.

This is the part we are trying to figure out at the moment. It seems like most cloud providers don't expose this topology information, and for those that do it requires some workarounds. The sentiment I'm getting so far is that the use-case is actually not common enough, because either the cloud provider doesn't expose the information or the live-migration semantics are too complex and not worth the trade-off, in which case I think we're okay to close this KEP. I will defer to @alculquicondor based on the results of the survey he sent out.

@ahg-g (Member) commented Jul 25, 2019:

I just want to state my concern with making it a recommended practice to use the zone label for physical hosts for on-prem. While it may make the spec portable between on-prem and on-cloud, I suspect users will find the cost for this portability high since some cloud providers charge for cross-zone traffic.

@kitch commented Jul 25, 2019:

Just some info from the IBM Cloud side of things. Today we do not expose this topology info for instances that are running on shared infrastructure. There is some consideration for letting those who purchase dedicated hosts more easily gather metadata about which host an instance is on. If you work from the host down, you can discover this information, but it is not readily available from within the OS. In the future, if we were to add support for dedicated hosts in IKS, we would likely leverage this instance host label feature. In our internal designs to date I was just planning on abusing the zone labeling. :)

@andrewsykim (Member):

I suspect users will find the cost for this portability high since some cloud providers charge for cross-zone traffic.

For public clouds I don't think this applies, because they don't expose the physical host topology anyway, so you would use zones to represent AZs. For on-prem, you can still use the region topology for DC or site topology if you use zones for physical hosts. But also for on-prem, you rarely see clusters spanning multiple DCs (at least in my experience), so topologies typically won't span zones anyway.

@alculquicondor (Member, Author):

Given:

  • The lack of support from most cloud providers to expose the information, and
  • The possibility for on-prem clusters to make use of zone to represent this topology

We are abandoning this KEP. Thanks for the feedback to all the reviewers.

@ahg-g (Member) commented Jul 25, 2019:

Given:

  • The lack of support from most cloud providers to expose the information, and
  • The possibility for on-prem clusters to make use of zone to represent this topology

We are abandoning this KEP. Thanks for the feedback to all the reviewers.

Thanks Aldo for driving this issue to a decision, and thanks everyone for the feedback.
