
Add proposal for GetPreferredAllocation() to TopologyManager KEP #1121

Merged

Conversation

klueska
Contributor

@klueska klueska commented Jun 28, 2019

This proposal adds an API to allow a device plugin to forward a "preferred allocation" to the devicemanager so it can incorporate this information into its allocation decisions. It leaves the devicemanager in charge of making the final allocation, but gives the plugin the chance to help influence it more directly.

Using this new API call, the devicemanager will call out to a plugin at pod admission time, asking it for a preferred device allocation of a given size from a list of available devices. One call will be made per-container for each pod.

The list of available devices passed to the GetPreferredAllocation() call does not necessarily match the full list of available devices on the system. Instead, the devicemanager treats the GetPreferredAllocation() call as a "last-level" filter on the set of devices it has to choose from after taking all TopologyHint information into consideration. As such, the list of available devices passed to this call will already be pre-filtered by the topology constraints encoded in the TopologyHint.

Consequently, the preferred allocation is not guaranteed to be the allocation ultimately performed by the devicemanager. It is only designed to help the devicemanager make a more informed allocation decision when possible.

When deciding on a preferred allocation, a device plugin will likely take internal topology constraints into consideration that the devicemanager is unaware of. A good example of this is the case of allocating pairs of NVIDIA GPUs that are always connected by an NVLINK.

On an 8 GPU machine, with a request for 2 GPUs, the best connected pairs by NVLINK might be:

{{0,3}, {1,2}, {4,7}, {5,6}}

Using GetPreferredAllocation(), the NVIDIA device plugin is able to forward one of these preferred allocations to the devicemanager if the appropriate set of devices is still available. Without this extra bit of information, the devicemanager would end up picking GPUs at random from the list of GPUs available after filtering by TopologyHint. This API therefore allows it to ultimately perform a much better allocation, at very minimal cost.
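To illustrate, here is a minimal Go sketch of the plugin-side logic (not the actual device plugin gRPC API; `preferredPair` and the hard-coded pair table are hypothetical, and real device IDs are not small integers):

```go
package main

import "fmt"

// Hypothetical sketch: given the pre-filtered list of available GPUs and a
// request size of 2, return the first NVLINK-connected pair that is still
// fully available. The pair table mirrors the example above.
var nvlinkPairs = [][2]string{{"0", "3"}, {"1", "2"}, {"4", "7"}, {"5", "6"}}

func preferredPair(available []string) ([]string, bool) {
	avail := make(map[string]bool, len(available))
	for _, id := range available {
		avail[id] = true
	}
	for _, p := range nvlinkPairs {
		if avail[p[0]] && avail[p[1]] {
			return []string{p[0], p[1]}, true
		}
	}
	return nil, false // no well-connected pair left; devicemanager falls back
}

func main() {
	// GPUs 0 and 2 are already allocated, so {0,3} and {1,2} are unavailable.
	fmt.Println(preferredPair([]string{"1", "3", "4", "5", "6", "7"})) // → [4 7] true
}
```

In the real call, the plugin would receive the available device IDs and the requested allocation size in the GetPreferredAllocation() request and return its chosen IDs in the response; the devicemanager remains free to ignore the preference.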

If a plugin does not implement this new GetPreferredAllocation() method, then we should simply follow the strategy that exists today with no change (i.e. allocate devices directly from the available devices list).

However, if GetPreferredAllocation() is implemented, then the preferred allocation should be chosen over simply pulling devices at random from the available devices list.

There are 4 cases to consider:

  1. TopologyManager disabled, GetPreferredAllocation() not implemented
  2. TopologyManager enabled, GetPreferredAllocation() not implemented
  3. TopologyManager disabled, GetPreferredAllocation() implemented
  4. TopologyManager enabled, GetPreferredAllocation() implemented

With the TopologyManager disabled and GetPreferredAllocation() unimplemented, the existing strategy is to simply pull devices from the front of the available devices list -- this should go unchanged.

With the TopologyManager enabled and GetPreferredAllocation() unimplemented, the existing strategy is to pull devices from the available devices list, such that they have the desired NUMA affinity -- this should also go unchanged.

With the TopologyManager disabled and GetPreferredAllocation() implemented, the new strategy should be to prefer allocations from the list returned by GetPreferredAllocation() if possible, and fall back to pulling devices from the front of the available devices list if not.

With the TopologyManager enabled and GetPreferredAllocation() implemented, the new strategy should be to prefer allocations from the list returned by GetPreferredAllocation() such that they have the desired NUMA affinity presented by the TopologyManager.

If that is not possible, fall back to pulling devices at random from the available devices list, such that they have the desired NUMA affinity. In this way, we will always follow a best-effort policy for honoring preferred allocations specified by this interface. We will NOT fail pod admission due to it.
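The four cases above can be sketched as a single best-effort selection routine (hypothetical Go, simplified to plain string device IDs; `allocate` and its parameters are illustrative, not the devicemanager's real signatures):

```go
package main

import "fmt"

// allocate sketches the decision above. "preferred" is the plugin's answer
// (nil when GetPreferredAllocation() is unimplemented) and "aligned" is the
// subset of available devices satisfying the TopologyHint (nil when the
// TopologyManager is disabled).
func allocate(size int, available, aligned, preferred []string) []string {
	pool := aligned // TopologyManager enabled: only NUMA-aligned devices
	if pool == nil {
		pool = available // TopologyManager disabled: any available device
	}
	// Best-effort: honor the preferred allocation only if every preferred
	// device is still in the pool; never fail admission because of it.
	if len(preferred) == size && subset(preferred, pool) {
		return preferred
	}
	return pool[:size] // fall back to the front of the (filtered) list
}

func subset(sub, pool []string) bool {
	in := make(map[string]bool, len(pool))
	for _, id := range pool {
		in[id] = true
	}
	for _, id := range sub {
		if !in[id] {
			return false
		}
	}
	return true
}

func main() {
	avail := []string{"0", "1", "2", "3"}
	fmt.Println(allocate(2, avail, nil, nil))                              // case 1: → [0 1]
	fmt.Println(allocate(2, avail, []string{"2", "3"}, nil))               // case 2: → [2 3]
	fmt.Println(allocate(2, avail, nil, []string{"1", "3"}))               // case 3: → [1 3]
	fmt.Println(allocate(2, avail, []string{"2", "3"}, []string{"2", "3"})) // case 4: → [2 3]
}
```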

@k8s-ci-robot
Contributor

Hi @klueska. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 28, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 28, 2019
@klueska klueska force-pushed the add-preferred-allocations-to-tm-kep branch from 7b0591b to a5033cf Compare June 28, 2019 16:17
@ConnorDoyle
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 28, 2019
@klueska
Contributor Author

klueska commented Jun 28, 2019

One way to look at this proposal, is to think of it as a way of generating intra-device topology-aware allocation preferences from each plugin without having to expose any device specific topology information (e.g. NVLINK topologies) to the kubelet.

In this way, the TopologyManager can be restricted to only deal with common node-level topology constraints (e.g. NUMA node, PCIe bus, etc.), while still having a way of incorporating device-specific topology constraints into its allocation decisions.

@klueska klueska changed the title Add proposal for PreferredAllocationsRequest() to TopologyManager KEP Add proposal for GetPreferredAllocations() to TopologyManager KEP Jul 1, 2019
@klueska klueska force-pushed the add-preferred-allocations-to-tm-kep branch 2 times, most recently from 5526900 to 9b5b309 Compare July 1, 2019 15:12
keps/sig-node/0035-20190130-topology-manager.md: outdated review threads (resolved)
@derekwaynecarr
Member

/assign

Member

@derekwaynecarr derekwaynecarr left a comment


this seems like a simple way to make the kubelet indirectly aware of a device specific affinity preference. just a couple questions to make sure we have a common understanding. in particular, what is the behavior if a device plugin on a node does not yet implement the new method? will we fallback gracefully?

keps/sig-node/0035-20190130-topology-manager.md (outdated, resolved):

```
// - Allocate allows kubelet to exposes additional artifacts in a pod's
```

Using this new API call, the `devicemanager` will call out to each device
Member


do we think this should be a v1beta2 version of the api? if a plugin did not implement the new method, will we fall back on old behavior?

Contributor Author


If a plugin does not implement the new method, then there should be no change from existing behaviour. As such, I don't think this requires an API bump.

@klueska
Contributor Author

klueska commented Jul 8, 2019

in particular, what is the behavior if a device plugin on a node does not yet implement the new method? will we fallback gracefully?

If a plugin does not implement the new method, then we should simply follow the strategy that exists today (i.e. allocate devices directly from the available devices list). If however, GetPreferredAllocations() is implemented, then one of the preferred allocations should be chosen over simply pulling devices at random from the available devices list.

There are 4 cases to consider:

  1. TopologyManager disabled, GetPreferredAllocations() not implemented
  2. TopologyManager enabled, GetPreferredAllocations() not implemented
  3. TopologyManager disabled, GetPreferredAllocations() implemented
  4. TopologyManager enabled, GetPreferredAllocations() implemented

With the TopologyManager disabled and GetPreferredAllocations() unimplemented, the existing strategy is to simply pull devices from the front of the available devices list -- this should go unchanged.

With the TopologyManager enabled and GetPreferredAllocations() unimplemented, the strategy is to pull devices from the available devices list, such that they have the desired NUMA affinity -- this should also go unchanged.

With the TopologyManager disabled and GetPreferredAllocations() implemented, the new strategy should be to prefer allocations from the list returned by GetPreferredAllocations() if possible, and fall back to pulling devices from the front of the available devices list if not.

With the TopologyManager enabled and GetPreferredAllocations() implemented, the new strategy should be to prefer allocations from the list returned by GetPreferredAllocations() such that they have the desired NUMA affinity. If that is not possible, and we are in strict mode, then fail pod admission. If that is not possible and we are in preferred mode, then fall back to pulling devices at random from the available devices list, such that they have the desired NUMA affinity.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 7, 2019
@derekwaynecarr
Member

@klueska thanks for answering the question, do you see any further updates you would like to make for this?

@klueska klueska mentioned this pull request May 4, 2020
11 tasks
@klueska
Contributor Author

klueska commented May 20, 2020

There was an offline comment from @ipuustin a few months (!) back that I'm just getting around to adding here:

Regarding GetPreferredAllocations API: I think that in principle, if GetPreferredAllocations() had this type, it might be "future-proof" enough to implement this scheme:

```proto
rpc GetPreferredAllocations(PreferredAllocationsRequest) returns (PreferredAllocationsResponse) {}

message PreferredAllocationsForDomain {
    ContainerAllocateRequest preferred_allocations = 1;
    string resource_domain = 2;
    int32 cost = 3;
}

message PreferredAllocationsResponse {
    repeated PreferredAllocationsForDomain preferred_allocations_for_domain = 1;
}
```

The "cost" and "resource_domain" fields could be unspecified or const at this point. Even just having the one layer of indirection between the PreferredAllocationsResponse and ContainerAllocateRequest would help, so that the extra fields could be added later.

@klueska
Contributor Author

klueska commented May 20, 2020

I think adding a level of indirection here is a reasonable thing to do. That way we can easily extend the repeatable element over time with more information than just the set of devices.

This proposal allows a device plugin to forward lists of preferred
allocations to the `devicemanager` so it can incorporate this
information into its `TopologyHint` generation for shared NUMA affinity
as well as help influence its final allocation decision once all
`TopologyHint`s have been merged.
@klueska klueska force-pushed the add-preferred-allocations-to-tm-kep branch from dcc8c72 to d03c3b1 Compare May 26, 2020 12:03
@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label May 26, 2020
@klueska klueska changed the title Add proposal for GetPreferredAllocations() to TopologyManager KEP Add proposal for GetPreferredAllocation() to TopologyManager KEP May 26, 2020
@klueska
Contributor Author

klueska commented May 26, 2020

Updated KEP proposal based on feedback.

@klueska klueska force-pushed the add-preferred-allocations-to-tm-kep branch 2 times, most recently from 498bf54 to 1ef30a2 Compare May 26, 2020 14:12
@klueska klueska force-pushed the add-preferred-allocations-to-tm-kep branch from 1ef30a2 to 16f2244 Compare May 26, 2020 14:14
@klueska
Contributor Author

klueska commented Jun 4, 2020

@kad had concerns that plugins might "misuse" this new API call to try and game the system for the preferred allocations they decide to inform the kubelet about.

Given that this API call is optional, and is designed to be a "last-level" filter rather than a definitive call to perform an allocation, I am not too worried about this actually happening in practice.

If we come up with a future design of the TopologyManager that wouldn't benefit from making this call, then we could simply stop calling it without consequence.

As it stands today, adding this call can give us a large benefit for little cost, and I think it's worth moving forward with it for this reason.

@derekwaynecarr
Member

@klueska @ipuustin thanks for collaborating on this.

i agree this is a nice incremental low-risk change that can make a big impact.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 4, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 4, 2020
@k8s-ci-robot k8s-ci-robot merged commit 40bfa9f into kubernetes:master Jun 4, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jun 4, 2020
@kad
Member

kad commented Jun 5, 2020

I provided that feedback verbally in several discussions, but will write it here as well: even though the "last-level filter" approach does not perform the actual allocation, it can influence the decision such that the allocator in the devicemanager effectively has only one choice: to obey whatever the plugin returns as the result of the filter. Gaming of allocations by third-party plugins can't be prevented this way.

Simply ceasing to call this method over time is not really a deprecation path either: there will be plugins in the wild that depend on the call, and breaking them will be considered a bad experience. Removing methods from gRPC, and handling the resulting version skew, is not cheap either.

klueska added a commit to klueska/kubernetes that referenced this pull request Jul 2, 2020
k8s-publishing-bot pushed a commit to kubernetes/kubelet that referenced this pull request Jul 4, 2020
The details of this API can be found in:
kubernetes/enhancements#1121

Kubernetes-commit: 202c4f0816be76ece0a9ba8b94192f458e55b35a
k8s-publishing-bot pushed a commit to kubernetes-nightly/kubelet that referenced this pull request Jul 9, 2020