
Add proposal for GetPreferredAllocation() to TopologyManager KEP #1121

Merged

Conversation

klueska
Contributor

@klueska klueska commented Jun 28, 2019

This proposal adds an API to allow a device plugin to forward a "preferred allocation" to the devicemanager so it can incorporate this information into its allocation decisions. It leaves the devicemanager in charge of making the final allocation, but gives the plugin the chance to help influence it more directly.

Using this new API call, the devicemanager will call out to a plugin at pod admission time, asking it for a preferred device allocation of a given size from a list of available devices. One call will be made per-container for each pod.

The list of available devices passed to the GetPreferredAllocation() call does not necessarily match the full list of available devices on the system. Instead, the devicemanager treats the GetPreferredAllocation() call as a "last-level" filter on the set of devices it has to choose from after taking all TopologyHint information into consideration. As such, the list of available devices passed to this call will already be pre-filtered by the topology constraints encoded in the TopologyHint.

Consequently, the preferred allocation is not guaranteed to be the allocation ultimately performed by the devicemanager. It is only designed to help the devicemanager make a more informed allocation decision when possible.

When deciding on a preferred allocation, a device plugin will likely take internal topology constraints into consideration that the devicemanager is unaware of. A good example of this is the case of allocating pairs of NVIDIA GPUs that are always connected by an NVLINK.

On an 8 GPU machine, with a request for 2 GPUs, the best connected pairs by NVLINK might be:

{{0,3}, {1,2}, {4,7}, {5,6}}

Using GetPreferredAllocation(), the NVIDIA device plugin is able to forward one of these preferred allocations to the devicemanager if the appropriate set of devices is still available. Without this extra bit of information, the devicemanager would end up picking GPUs at random from the list of GPUs available after filtering by TopologyHint. This API therefore allows it to ultimately perform a much better allocation, at very minimal cost.
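To illustrate, here is a minimal Go sketch of the plugin-side logic (not the actual device plugin gRPC API; `preferredPair` and the hard-coded pair table are hypothetical, and real device IDs are not small integers):

```go
package main

import "fmt"

// Hypothetical sketch: given the pre-filtered list of available GPUs and a
// request size of 2, return the first NVLINK-connected pair that is still
// fully available. The pair table mirrors the example above.
var nvlinkPairs = [][2]string{{"0", "3"}, {"1", "2"}, {"4", "7"}, {"5", "6"}}

func preferredPair(available []string) ([]string, bool) {
	avail := make(map[string]bool, len(available))
	for _, id := range available {
		avail[id] = true
	}
	for _, p := range nvlinkPairs {
		if avail[p[0]] && avail[p[1]] {
			return []string{p[0], p[1]}, true
		}
	}
	return nil, false // no well-connected pair left; devicemanager falls back
}

func main() {
	// GPUs 0 and 2 are already allocated, so {0,3} and {1,2} are unavailable.
	fmt.Println(preferredPair([]string{"1", "3", "4", "5", "6", "7"})) // → [4 7] true
}
```

In the real call, the plugin would receive the available device IDs and the requested allocation size in the GetPreferredAllocation() request and return its chosen IDs in the response; the devicemanager remains free to ignore the preference.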

If a plugin does not implement this new GetPreferredAllocation() method, then we should simply follow the strategy that exists today with no change (i.e. allocate devices directly from the available devices list).

However, if GetPreferredAllocation() is implemented, then the preferred allocation should be chosen over simply pulling devices at random from the available devices list.

There are 4 cases to consider:

  1. TopologyManager disabled, GetPreferredAllocation() not implemented
  2. TopologyManager enabled, GetPreferredAllocation() not implemented
  3. TopologyManager disabled, GetPreferredAllocation() implemented
  4. TopologyManager enabled, GetPreferredAllocation() implemented

With the TopologyManager disabled and GetPreferredAllocation() unimplemented, the existing strategy is to simply pull devices from the front of the available devices list -- this should go unchanged.

With the TopologyManager enabled and GetPreferredAllocation() unimplemented, the existing strategy is to pull devices from the available devices list, such that they have the desired NUMA affinity -- this should also go unchanged.

With the TopologyManager disabled and GetPreferredAllocation() implemented, the new strategy should be to prefer allocations from the list returned by GetPreferredAllocation() if possible, and fall back to pulling devices from the front of the available devices list if not.

With the TopologyManager enabled and GetPreferredAllocation() implemented, the new strategy should be to prefer allocations from the list returned by GetPreferredAllocation() such that they have the desired NUMA affinity presented by the TopologyManager.

If that is not possible, fall back to pulling devices at random from the available devices list, such that they have the desired NUMA affinity. In this way, we will always follow a best-effort policy for honoring preferred allocations specified by this interface. We will NOT fail pod admission due to it.
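The four cases above can be sketched as a single best-effort selection routine (hypothetical Go, simplified to plain string device IDs; `allocate` and its parameters are illustrative, not the devicemanager's real signatures):

```go
package main

import "fmt"

// allocate sketches the decision above. "preferred" is the plugin's answer
// (nil when GetPreferredAllocation() is unimplemented) and "aligned" is the
// subset of available devices satisfying the TopologyHint (nil when the
// TopologyManager is disabled).
func allocate(size int, available, aligned, preferred []string) []string {
	pool := aligned // TopologyManager enabled: only NUMA-aligned devices
	if pool == nil {
		pool = available // TopologyManager disabled: any available device
	}
	// Best-effort: honor the preferred allocation only if every preferred
	// device is still in the pool; never fail admission because of it.
	if len(preferred) == size && subset(preferred, pool) {
		return preferred
	}
	return pool[:size] // fall back to the front of the (filtered) list
}

func subset(sub, pool []string) bool {
	in := make(map[string]bool, len(pool))
	for _, id := range pool {
		in[id] = true
	}
	for _, id := range sub {
		if !in[id] {
			return false
		}
	}
	return true
}

func main() {
	avail := []string{"0", "1", "2", "3"}
	fmt.Println(allocate(2, avail, nil, nil))                              // case 1: → [0 1]
	fmt.Println(allocate(2, avail, []string{"2", "3"}, nil))               // case 2: → [2 3]
	fmt.Println(allocate(2, avail, nil, []string{"1", "3"}))               // case 3: → [1 3]
	fmt.Println(allocate(2, avail, []string{"2", "3"}, []string{"2", "3"})) // case 4: → [2 3]
}
```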

@k8s-ci-robot
Contributor

Hi @klueska. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 28, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 28, 2019
@klueska klueska force-pushed the add-preferred-allocations-to-tm-kep branch from 7b0591b to a5033cf Compare June 28, 2019 16:17
@ConnorDoyle
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 28, 2019
@klueska
Contributor Author

klueska commented Jun 28, 2019

One way to look at this proposal, is to think of it as a way of generating intra-device topology-aware allocation preferences from each plugin without having to expose any device specific topology information (e.g. NVLINK topologies) to the kubelet.

In this way, the TopologyManager can be restricted to only deal with common node-level topology constraints (e.g. NUMA node, PCIe bus, etc.), while still having a way of incorporating device-specific topology constraints into its allocation decisions.

@klueska klueska changed the title Add proposal for PreferredAllocationsRequest() to TopologyManager KEP Add proposal for GetPreferredAllocations() to TopologyManager KEP Jul 1, 2019
@klueska klueska force-pushed the add-preferred-allocations-to-tm-kep branch 2 times, most recently from 5526900 to 9b5b309 Compare July 1, 2019 15:12
keps/sig-node/0035-20190130-topology-manager.md: outdated review threads (resolved)
@derekwaynecarr
Member

/assign

Member

@derekwaynecarr derekwaynecarr left a comment


this seems like a simple way to make the kubelet indirectly aware of a device specific affinity preference. just a couple questions to make sure we have a common understanding. in particular, what is the behavior if a device plugin on a node does not yet implement the new method? will we fallback gracefully?

keps/sig-node/0035-20190130-topology-manager.md (outdated, resolved):

```
// - Allocate allows kubelet to exposes additional artifacts in a pod's
```

Using this new API call, the `devicemanager` will call out to each device
Member


do we think this should be a v1beta2 version of the api? if a plugin did not implement the new method, will we fall back on old behavior?

Contributor Author


If a plugin does not implement the new method, then there should be no change from existing behaviour. As such, I don't think this requires an API bump.

@klueska
Contributor Author

klueska commented Jul 8, 2019

in particular, what is the behavior if a device plugin on a node does not yet implement the new method? will we fallback gracefully?

If a plugin does not implement the new method, then we should simply follow the strategy that exists today (i.e. allocate devices directly from the available devices list). If however, GetPreferredAllocations() is implemented, then one of the preferred allocations should be chosen over simply pulling devices at random from the available devices list.

There are 4 cases to consider:

  1. TopologyManager disabled, GetPreferredAllocations() not implemented
  2. TopologyManager enabled, GetPreferredAllocations() not implemented
  3. TopologyManager disabled, GetPreferredAllocations() implemented
  4. TopologyManager enabled, GetPreferredAllocations() implemented

With the TopologyManager disabled and GetPreferredAllocations() unimplemented, the existing strategy is to simply pull devices from the front of the available devices list -- this should go unchanged.

With the TopologyManager enabled and GetPreferredAllocations() unimplemented, the strategy is to pull devices from the available devices list, such that they have the desired NUMA affinity -- this should also go unchanged.

With the TopologyManager disabled and GetPreferredAllocations() implemented, the new strategy should be to prefer allocations from the list returned by GetPreferredAllocations() if possible, and fall back to pulling devices from the front of the available devices list if not.

With the TopologyManager enabled and GetPreferredAllocations() implemented, the new strategy should be to prefer allocations from the list returned by GetPreferredAllocations() such that they have the desired NUMA affinity. If that is not possible, and we are in strict mode, then fail pod admission. If that is not possible and we are in preferred mode, then fall back to pulling devices at random from the available devices list, such that they have the desired NUMA affinity.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 7, 2019
@derekwaynecarr
Member

@klueska thanks for answering the question, do you see any further updates you would like to make for this?

@klueska klueska mentioned this pull request May 4, 2020
11 tasks
@klueska
Contributor Author

klueska commented May 20, 2020

There was an offline comment from @ipuustin a few months (!) back that I'm just getting around to adding here:

Regarding GetPreferredAllocations API: I think that in principle, if GetPreferredAllocations() had this type, it might be "future-proof" enough to implement this scheme:

```proto
rpc GetPreferredAllocations(PreferredAllocationsRequest) returns (PreferredAllocationsResponse) {}

message PreferredAllocationsForDomain {
    ContainerAllocateRequest preferred_allocations = 1;
    string resource_domain = 2;
    int32 cost = 3;
}

message PreferredAllocationsResponse {
    repeated PreferredAllocationsForDomain preferred_allocations_for_domain = 1;
}
```

The "cost" and "resource_domain" fields could be unspecified or const at this point. Even just having the one layer of indirection between the PreferredAllocationsResponse and ContainerAllocateRequest would help, so that the extra fields could be added later.

@klueska
Contributor Author

klueska commented May 20, 2020

I think adding a level of indirection here is a reasonable thing to do. That way we can easily extend the repeatable element over time with more information than just the set of devices.

This proposal allows a device plugin to forward lists of preferred
allocations to the `devicemanager` so it can incorporate this
information into its `TopologyHint` generation for shared NUMA affinity
as well as help influence its final allocation decision once all
`TopologyHint`s have been merged.
@klueska klueska force-pushed the add-preferred-allocations-to-tm-kep branch from dcc8c72 to d03c3b1 Compare May 26, 2020 12:03
@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label May 26, 2020
@klueska klueska changed the title Add proposal for GetPreferredAllocations() to TopologyManager KEP Add proposal for GetPreferredAllocation() to TopologyManager KEP May 26, 2020
@klueska
Contributor Author

klueska commented May 26, 2020

Updated KEP proposal based on feedback.

@klueska klueska force-pushed the add-preferred-allocations-to-tm-kep branch 2 times, most recently from 498bf54 to 1ef30a2 Compare May 26, 2020 14:12
@klueska klueska force-pushed the add-preferred-allocations-to-tm-kep branch from 1ef30a2 to 16f2244 Compare May 26, 2020 14:14
@klueska
Contributor Author

klueska commented Jun 4, 2020

@kad had concerns that plugins might "misuse" this new API call to try and game the system for the preferred allocations they decide to inform the kubelet about.

Given that this API call is optional, and is designed to be a "last-level" filter rather than a definitive call to perform an allocation, I am not too worried about this actually happening in practice.

If we come up with a future design of the TopologyManager that wouldn't benefit from making this call, then we could simply stop calling it without consequence.

As it stands today, adding this call can give us a large benefit for little cost, and I think it's worth moving forward with it for this reason.

@derekwaynecarr
Member

@klueska @ipuustin thanks for collaborating on this.

i agree this is a nice incremental low-risk change that can make a big impact.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 4, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 4, 2020
@k8s-ci-robot k8s-ci-robot merged commit 40bfa9f into kubernetes:master Jun 4, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jun 4, 2020
@kad
Member

kad commented Jun 5, 2020

I provided that feedback verbally in several discussions, but will write it here as well: even though the "last-level filter" approach does not perform the actual allocation, it can influence the decision such that the allocator in the devicemanager effectively has only one choice: to obey whatever the plugin returns as the result of the filter. Gaming of allocations by third-party plugins can't be prevented this way.

Simply ceasing to call this method over time is not really a deprecation path either: there will be plugins in the wild that depend on the call, and breaking them will be considered a bad experience. Removing methods from gRPC, and handling the resulting version skew, is not cheap either.

klueska added a commit to klueska/kubernetes that referenced this pull request Jul 2, 2020
k8s-publishing-bot pushed a commit to kubernetes/kubelet that referenced this pull request Jul 4, 2020
The details of this API can be found in:
kubernetes/enhancements#1121

Kubernetes-commit: 202c4f0816be76ece0a9ba8b94192f458e55b35a
k8s-publishing-bot pushed a commit to kubernetes-nightly/kubelet that referenced this pull request Jul 9, 2020