
Support for allocating all VFs from a single PF (bin packing) #255

Open
sseetharaman6 opened this issue Jul 21, 2020 · 15 comments · May be fixed by #443

Comments

@sseetharaman6

What would you like to be added?

If I have multiple PFs configured for SR-IOV and advertised as the same resource pool (sriov_foo), is it possible to enforce allocation of all VFs from a single PF before VFs from other PFs are allocated? It seems like pluginapi.AllocateRequest picks device IDs at random, so I am not sure whether this is possible / can be supported.

What is the use case for this feature / enhancement?

@zshi-redhat
Collaborator

@sseetharaman6 You're right that the kubelet randomly chooses a healthy device from the advertised pool (sriov_foo), so if the VFs from all PFs are grouped into one pool, there is no guarantee which PF an allocated VF comes from. You might want to group the VFs from a single PF as one pool and request devices directly from that pool.

@sseetharaman6
Author

Yea, but say I have 2 VFs per PF and request 3 VFs in the pod spec; advertising each PF as its own resource will make this pod unschedulable.
In order to allocate all VFs from one PF before moving on to the next, the DP has to support some kind of resource ordering or preferential allocation (could something like kubernetes/enhancements#1121 be used?)

@zshi-redhat
Collaborator

zshi-redhat commented Jul 22, 2020

> Yea, but say I have 2 VFs per PF and request 3 VFs in the pod spec; advertising each PF as its own resource will make this pod unschedulable.

In this case, you will need to put two resource requests in the pod spec: the first requesting 2 VFs, the second requesting 1 VF. I understand this may not be exactly what you asked for.
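A minimal sketch of the two-request pod spec described above; the resource pool names (`intel.com/sriov_pf0`, `intel.com/sriov_pf1`) and the image are illustrative assumptions, not names from this thread:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sriov-test-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        intel.com/sriov_pf0: "2"   # both VFs of the first PF
        intel.com/sriov_pf1: "1"   # one VF from the second PF
      limits:
        intel.com/sriov_pf0: "2"
        intel.com/sriov_pf1: "1"
```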

> In order to allocate all VFs from one PF before moving on to the next, the DP has to support some kind of resource ordering or preferential allocation (could something like kubernetes/enhancements#1121 be used?)

Thanks for linking the reference!
First of all, I think we should update the device plugin to support the new GetPreferredAllocation interface.
As for how the device plugin should decide the preferred allocation, my understanding is that it may differ per use case.
For example, sometimes users may want to distribute workloads across different PFs to balance the load on each interface;
in other cases, like the one you mentioned, it may be preferable to consume all resources from a single PF before using the next one.
It looks to me that we may not have a unified solution for how the device plugin should decide the preferred allocation,
but maybe it is possible to define several preferred-allocation policies and let the user choose which one to apply when launching the device plugin.

@RahulG115

Facing the same issue.
+1

@killianmuldoon
Collaborator

@zshi-redhat we should be able to implement this on a per-pool level, with some device pools marked as "packers" and others as "spreaders". Is there anything else the preferred allocation could be used for that might fit in, or be more relevant even?

@zshi-redhat
Collaborator

> @zshi-redhat we should be able to implement this on a per-pool level, with some device pools marked as "packers" and others as "spreaders". Is there anything else the preferred allocation could be used for that might fit in, or be more relevant even?

@killianmuldoon I think we could have two, as you already mentioned: one for allocating VFs evenly across multiple PFs (in the same pool), the other for allocating all VFs from one PF until it's exhausted, then moving to the next.

@sseetharaman6
Author

@zshi-redhat - this approach makes sense to me. Is there work underway to add an interface for GetPreferredAllocation?

@martinkennelly
Member

martinkennelly commented Aug 11, 2020

> @zshi-redhat - this approach makes sense to me. Is there work underway to add an interface for GetPreferredAllocation?

I do not think anyone is working on this. It will be discussed at the next network and resource management meeting.

@zshi-redhat
Collaborator

> > @zshi-redhat - this approach makes sense to me. Is there work underway to add an interface for GetPreferredAllocation?
>
> I do not think anyone is working on this. It will be discussed at the next network and resource management meeting.

Update: this was discussed at Monday's meeting; we agreed to support this new API in the SR-IOV device plugin. However, the work is not currently assigned to anyone, so please feel free to take it if you are interested in working on it.

@zshi-redhat
Collaborator

@sseetharaman6 FYI, this feature was added in PR #267, in case you'd like to do some testing or have any suggestions.

@qingshanyinyin

First scenario: I have two PFs (PF-A, PF-B) and I define two resources (R-A, R-B). Then I create a pod requesting both resources (R-A: 1, R-B: 1).
Second scenario: I have two PFs (PF-A, PF-B) and I define one resource (R). Then I create a pod requesting the resource (R: 2), and the kubelet allocates the two VFs from a single PF (A or B).
I would like to know whether there is any difference between these two scenarios for pod networking. For example, which one is best for deep learning (TensorFlow, PyTorch, and so on)?
Thanks!

@adrianchiris
Contributor

> I would like to know whether there is any difference between these two scenarios for pod networking.

If you need two additional network interfaces for the pod, configured by a supporting CNI plugin, then IIRC only the second scenario will work.

If you just want two VFs allocated to the pod (with no CNI config required), then sending traffic from different PFs (different uplinks) would probably be faster.

There is also another consideration that affects performance: NUMA alignment of memory, CPU, and PCI.
In this case you would want all of them aligned.

@martinkennelly
Member

> If you need two additional network interfaces for the pod, configured by a supporting CNI plugin, then IIRC only the second scenario will work.

For the first scenario, couldn't you just define two NADs (net-a, net-b) with associated DP selectors (pfNames), each selecting an individual PF? Then put net-a and net-b in your network request annotation, and you get a VF from each PF. What am I missing?
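A sketch of that two-NAD setup (resource names, PF names, and subnets are illustrative assumptions; the device plugin side would select each PF with a `pfNames` selector in its resource pool config):

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: net-a
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_pf0   # pool selecting pfNames: ["ens1f0"]
spec:
  config: '{ "cniVersion": "0.3.1", "type": "sriov", "ipam": { "type": "host-local", "subnet": "10.56.0.0/24" } }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: net-b
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_pf1   # pool selecting pfNames: ["ens1f1"]
spec:
  config: '{ "cniVersion": "0.3.1", "type": "sriov", "ipam": { "type": "host-local", "subnet": "10.56.1.0/24" } }'
```

The pod would then request both attachments via the annotation `k8s.v1.cni.cncf.io/networks: net-a, net-b`.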

@adrianchiris
Contributor

adrianchiris commented Sep 2, 2021

> If you need two additional network interfaces for the pod, configured by a supporting CNI plugin, then IIRC only the second scenario will work.

Correction: I meant the first scenario. Having two network-attachment-definitions, each associated with a different resource, will work.
Having both network-attachment-definitions associated with the same resource will (I think) not work,

since Multus would need to provide each attachment with a different DeviceID from the same resource on the CmdAdd call
(i.e. pass the first device ID to the delegate CNI on the first call and the second device ID on the second call).

@qingshanyinyin

qingshanyinyin commented Sep 9, 2021

I have solved the first scenario! Thanks! @adrianchiris @martinkennelly
Now I need to do another task.
I will define only one resource for different PFs (8 or more), and I want the kubelet to allocate VFs from each PF. For example:
request: sriov-resource: 1
allocation: 8 VFs (if there are 8 PFs on the node, the 8 VFs come from different PFs!)
I would like to know whether this will work if I only modify the sriov-device-plugin and do not modify Multus.

@wattmto wattmto linked a pull request Aug 30, 2022 that will close this issue