Add proposal for GetPreferredAllocation() to TopologyManager KEP #1121
Conversation
Hi @klueska. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 7b0591b to a5033cf
/ok-to-test
One way to look at this proposal is to think of it as a way of generating intra-device topology-aware allocation preferences from each plugin without having to expose any device-specific topology information (e.g. NVLINK topologies) to the kubelet.
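As a rough sketch of what that could look like (type and method names here are assumptions drawn from this discussion, not the final API), the plugin-facing service would grow one optional call alongside `Allocate()`:

```go
package deviceplugin

import "context"

// Sketch only: type and method names are assumptions based on this
// discussion, not the final API.
type PreferredAllocationRequest struct{}  // would carry per-container requests
type PreferredAllocationResponse struct{} // would carry preferred device IDs

// DevicePluginServer shows the one optional rpc this proposal adds next to
// the existing Allocate() call; existing rpcs are elided.
type DevicePluginServer interface {
	// GetPreferredAllocation returns a preferred set of device IDs for an
	// allocation of a given size, computed from plugin-internal topology
	// (e.g. NVLINK) that is never exposed to the kubelet itself.
	GetPreferredAllocation(context.Context, *PreferredAllocationRequest) (*PreferredAllocationResponse, error)
}
```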
Force-pushed from 5526900 to 9b5b309
/assign
This seems like a simple way to make the kubelet indirectly aware of a device-specific affinity preference. Just a couple of questions to make sure we have a common understanding: in particular, what is the behavior if a device plugin on a node does not yet implement the new method? Will we fall back gracefully?
```
// - Allocate allows kubelet to exposes additional artifacts in a pod's
```

> Using this new API call, the `devicemanager` will call out to each device …
Do we think this should be a v1beta2 version of the API? If a plugin did not implement the new method, will we fall back on the old behavior?
If a plugin does not implement the new method, then there should be no change from existing behaviour. As such, I don't think this requires an API bump.
If a plugin does not implement the new method, then we should simply follow the strategy that exists today (i.e. allocate devices directly from the available devices list). If, however, `GetPreferredAllocation()` is implemented, then the preferred allocation should be chosen over simply pulling devices at random from the available devices list. There are 4 cases to consider:

1. `TopologyManager` disabled, `GetPreferredAllocation()` not implemented
2. `TopologyManager` enabled, `GetPreferredAllocation()` not implemented
3. `TopologyManager` disabled, `GetPreferredAllocation()` implemented
4. `TopologyManager` enabled, `GetPreferredAllocation()` implemented

With the `TopologyManager` disabled and `GetPreferredAllocation()` unimplemented, the existing strategy is to simply pull devices from the front of the available devices list. With the `TopologyManager` enabled and `GetPreferredAllocation()` unimplemented, the existing strategy is to pull devices from the available devices list such that they have the desired NUMA affinity. With the `TopologyManager` disabled and `GetPreferredAllocation()` implemented, the new strategy should be to prefer allocations returned by `GetPreferredAllocation()` if possible, falling back to the front of the available devices list if not. With the `TopologyManager` enabled and `GetPreferredAllocation()` implemented, the new strategy should be to prefer allocations returned by `GetPreferredAllocation()` that have the desired NUMA affinity, falling back to pulling devices at random with the desired NUMA affinity if not.
Force-pushed from 9b5b309 to dcc8c72
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@klueska thanks for answering the question. Do you see any further updates you would like to make for this?
There was an offline comment from @ipuustin a few months (!) back that I'm just getting around to adding here:
I think adding a level of indirection here is a reasonable thing to do. That way we can easily extend the repeatable element over time with more information than just the set of devices.
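For concreteness, here is a sketch of that shape as plain Go structs rather than the generated gRPC types; the field names are assumptions based on this discussion:

```go
package deviceplugin

// ContainerPreferredAllocationRequest is the repeatable element mentioned
// above: because it is a message of its own, new fields can be added to it
// later without touching the top-level request. Field names are assumptions
// based on this discussion.
type ContainerPreferredAllocationRequest struct {
	AvailableDeviceIDs   []string // devices still free after any TopologyHint filtering
	MustIncludeDeviceIDs []string // devices the returned allocation must contain
	AllocationSize       int32    // number of devices the container is requesting
}

type PreferredAllocationRequest struct {
	ContainerRequests []ContainerPreferredAllocationRequest // one entry per container
}

type ContainerPreferredAllocationResponse struct {
	DeviceIDs []string // the plugin's preferred allocation of AllocationSize devices
}

type PreferredAllocationResponse struct {
	ContainerResponses []ContainerPreferredAllocationResponse
}
```

Wrapping the per-container request in its own message is what makes the element extensible: new fields can be added to `ContainerPreferredAllocationRequest` later without changing the top-level rpc.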
This proposal allows a device plugin to forward lists of preferred allocations to the `devicemanager` so it can incorporate this information into its `TopologyHint` generation for shared NUMA affinity as well as help influence its final allocation decision once all `TopologyHint`s have been merged.
Force-pushed from dcc8c72 to d03c3b1
Updated KEP proposal based on feedback.
Force-pushed from 498bf54 to 1ef30a2
Force-pushed from 1ef30a2 to 16f2244
@kad had concerns that plugins might "misuse" this new API call to try and game the system for the preferred allocations they decide to inform the kubelet about. Given that this API call is only called optionally, and is designed to be a "last-level filter" rather than a definitive call to perform an allocation, I am not too worried about this actually happening in practice. If we come up with a future design that makes this call obsolete, we can simply stop calling it over time.

As it stands today, adding this call can give us a large benefit for little cost, and I think it's worth moving forward with it for this reason.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, klueska

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
I provided that feedback verbally in several discussions, but will write it here as well: even though the "last-level filter" approach does not perform the actual allocation, it is able to influence the decision such that the allocator in the device manager has only one choice: obey whatever the plugin returns as the result of the filter. Gaming of allocations by 3rd-party plugins can't be prevented that way. Simply not calling this method over time is not really a solution for deprecation either: there will be plugins in the wild that depend on that call, and breaking them will be considered a bad experience. Removing methods from gRPC, and the resulting version skews, are also not cheap things.
The details of this API can be found in: kubernetes/enhancements#1121
The details of this API can be found in: kubernetes/enhancements#1121 Kubernetes-commit: 202c4f0816be76ece0a9ba8b94192f458e55b35a
This proposal adds an API to allow a device plugin to forward a "preferred allocation" to the `devicemanager` so it can incorporate this information into its allocation decisions. It leaves the `devicemanager` in charge of making the final allocation, but gives the plugin the chance to help influence it more directly.

Using this new API call, the `devicemanager` will call out to a plugin at pod admission time, asking it for a preferred device allocation of a given size from a list of available devices. One call will be made per-container for each pod.

The list of available devices passed to the `GetPreferredAllocation()` call does not necessarily match the full list of available devices on the system. Instead, the `devicemanager` treats the `GetPreferredAllocation()` call as a "last-level" filter on the set of devices it has to choose from after taking all `TopologyHint` information into consideration. As such, the list of available devices passed to this call will already be pre-filtered by the topology constraints encoded in the `TopologyHint`.

As such, the preferred allocation is not guaranteed to be the allocation ultimately performed by the `devicemanager`. It is only designed to help the `devicemanager` make a more informed allocation decision when possible.

When deciding on a preferred allocation, a device plugin will likely take internal topology constraints into consideration that the `devicemanager` is unaware of. A good example of this is the case of allocating pairs of NVIDIA GPUs that always include an NVLINK.

On an 8 GPU machine, with a request for 2 GPUs, the best connected pairs by NVLINK might be:
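As an illustrative example (these indices are hypothetical, not taken from the proposal), such a pairing could be `{{0,3}, {1,2}, {4,7}, {5,6}}`. A minimal plugin-side sketch under that assumption:

```go
package main

import "fmt"

// Hypothetical NVLINK pairing for an 8-GPU machine; the indices are
// illustrative, not the pairs from the original proposal.
var nvlinkPairs = [][2]string{
	{"gpu0", "gpu3"}, {"gpu1", "gpu2"}, {"gpu4", "gpu7"}, {"gpu5", "gpu6"},
}

// preferredPair returns the first NVLINK-connected pair whose devices are
// both still in the available list, or nil if no such pair survived the
// devicemanager's TopologyHint pre-filtering.
func preferredPair(available []string) []string {
	avail := make(map[string]bool, len(available))
	for _, id := range available {
		avail[id] = true
	}
	for _, p := range nvlinkPairs {
		if avail[p[0]] && avail[p[1]] {
			return []string{p[0], p[1]}
		}
	}
	return nil // no preference: the devicemanager falls back to its own choice
}

func main() {
	// gpu0 is already allocated, so {gpu0, gpu3} is off the table and the
	// plugin prefers the next best-connected pair instead.
	fmt.Println(preferredPair([]string{"gpu1", "gpu2", "gpu3", "gpu4", "gpu7"}))
	// Output: [gpu1 gpu2]
}
```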
Using `GetPreferredAllocation()`, the NVIDIA device plugin is able to forward one of these preferred allocations to the `devicemanager` if the appropriate set of devices is still available. Without this extra bit of information, the `devicemanager` would end up picking GPUs at random from the list of GPUs available after filtering by `TopologyHint`. This API therefore allows it to ultimately perform a much better allocation, at very minimal cost.

If a plugin does not implement this new `GetPreferredAllocation()` method, then we should simply follow the strategy that exists today with no change (i.e. allocate devices directly from the available devices list). However, if `GetPreferredAllocation()` is implemented, then the preferred allocation should be chosen over simply pulling devices at random from the available devices list.

There are 4 cases to consider:
1. `TopologyManager` disabled, `GetPreferredAllocation()` not implemented
2. `TopologyManager` enabled, `GetPreferredAllocation()` not implemented
3. `TopologyManager` disabled, `GetPreferredAllocation()` implemented
4. `TopologyManager` enabled, `GetPreferredAllocation()` implemented

With the `TopologyManager` disabled and `GetPreferredAllocation()` unimplemented, the existing strategy is to simply pull devices from the front of the available devices list -- this should go unchanged.

With the `TopologyManager` enabled and `GetPreferredAllocation()` unimplemented, the existing strategy is to pull devices from the available devices list, such that they have the desired NUMA affinity -- this should also go unchanged.

With the `TopologyManager` disabled and `GetPreferredAllocation()` implemented, the new strategy should be to prefer allocations from the list returned by `GetPreferredAllocation()` if possible, and fall back to pulling devices from the front of the available devices list if not.

With the `TopologyManager` enabled and `GetPreferredAllocation()` implemented, the new strategy should be to prefer allocations from the list returned by `GetPreferredAllocation()` such that they have the desired NUMA affinity presented by the `TopologyManager`. If that is not possible, fall back to pulling devices at random from the available devices list, such that they have the desired NUMA affinity.

In this way, we will always follow a best-effort policy for honoring preferred allocations specified by this interface. We will NOT fail pod admission due to it.
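A minimal sketch of this best-effort flow, assuming a hypothetical `pluginPrefers` callback standing in for the gRPC round-trip to `GetPreferredAllocation()` (this is not the kubelet's actual implementation):

```go
package devicemanager

// bestEffortAllocate sketches the four-case decision flow above. It is not
// the kubelet's actual code: pluginPrefers is a hypothetical stand-in for
// the gRPC round-trip to GetPreferredAllocation(), and `available` is
// assumed to already be NUMA-filtered when the TopologyManager is enabled.
func bestEffortAllocate(
	available []string,
	needed int,
	pluginPrefers func(available []string, size int) []string, // nil if unimplemented
) []string {
	if pluginPrefers != nil {
		preferred := pluginPrefers(available, needed)
		if len(preferred) == needed && subsetOf(preferred, available) {
			return preferred // honor the preference when it is still satisfiable
		}
	}
	// Fall back to today's behavior: pull devices from the front of the
	// (possibly NUMA-filtered) available list. Pod admission is never
	// failed just because the preference could not be honored.
	if len(available) < needed {
		return nil // insufficient devices; reported by the caller, as today
	}
	return available[:needed]
}

// subsetOf reports whether every id in want is present in have.
func subsetOf(want, have []string) bool {
	set := make(map[string]bool, len(have))
	for _, id := range have {
		set[id] = true
	}
	for _, id := range want {
		if !set[id] {
			return false
		}
	}
	return true
}
```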