
How to use the allocation policy #1774

Closed
aq2013 opened this issue Jul 8, 2024 · 2 comments

Comments


aq2013 commented Jul 8, 2024

Describe the support request
We deployed the GPU device plugin with allocation-policy balanced and shared-dev-num 2. Each node has one Intel GPU Flex 140 card, so the cluster has 2 nodes with a Flex 140 card and each of those nodes exposes 2 gpu.intel.com/i915 resources.

Now we deployed 2 GPU applications, each requesting 1 gpu.intel.com/i915. As I understand it, when the allocation-policy is balanced ("balanced mode spreads workloads among GPU devices"), these 2 applications should be scheduled to 2 different nodes with GPU cards. But we found that both applications were scheduled to the same node.
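
For illustration, a minimal sketch of such a workload (the name and image are placeholders, not the actual manifests); each Pod requests one shared i915 slot:

    # Hypothetical sketch of one of the GPU applications; only the resource
    # request matters here (name and image are placeholders).
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-app-1
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gpu-app-1
      template:
        metadata:
          labels:
            app: gpu-app-1
        spec:
          containers:
          - name: app
            image: example.com/gpu-app:latest
            resources:
              limits:
                gpu.intel.com/i915: 1   # one shared GPU slot per Pod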

device-plugin:

Args:
      -shared-dev-num=2
      -enable-monitoring
      -allocation-policy=balanced
      -v=5

worknode-104:
[Screenshot: 2024-07-08 14:43:41]

worknode-105:
[Screenshot: 2024-07-08 14:44:31]

2 applications on one node:
[Screenshot: 2024-07-03 16:32:02]

System (please complete the following information if applicable):

  • OS version: [e.g. Ubuntu 22.04]
  • Kernel version: [e.g. Linux 5.15]
  • Device plugins version: [e.g. v0.29.0]
  • Hardware info: [e.g. Flex 140 gpu]

tkatila commented Jul 9, 2024

Hi @aq2013 and thanks for the issue.

Now we deployed 2 GPU applications, each requesting 1 gpu.intel.com/i915. As I understand it, when the allocation-policy is balanced ("balanced mode spreads workloads among GPU devices"), these 2 applications should be scheduled to 2 different nodes with GPU cards. But we found that both applications were scheduled to the same node.

The GPU plugin works at the node level, so it can only affect GPU selection among the GPUs under its control. Thus the "balanced" mode only applies within a node, not across the whole cluster. For example, when a user deploys two Pods and they happen to be scheduled to the same node, the GPU plugin will place one on GPU1 and the other on GPU2.

Depending on your goal, I can think of two ways to get where you'd want to be:

Also, the Flex 140 should have two GPUs per physical card. With shared-dev-num=2, there should be 4 i915 resources per node.
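
As a rough sketch, with both GPUs detected and shared-dev-num=2, the node's allocatable resources would be expected to show something like this (illustrative excerpt of kubectl get node <node> -o yaml, not actual output):

    status:
      allocatable:
        gpu.intel.com/i915: "4"   # 2 GPUs x shared-dev-num 2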

EDIT: GAS doesn't solve the problem either. Its "balancedResource" also works at node level.


aq2013 commented Aug 9, 2024

Thanks for the reply. We have now added pod anti-affinity rules to schedule the Pods to different nodes. I will close this issue.
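
For illustration, a minimal sketch of this kind of pod anti-affinity (the label name/value is a placeholder, not the actual manifests):

    # Hypothetical sketch: require Pods carrying the same app label to land
    # on different nodes (label value is a placeholder).
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: gpu-app
          topologyKey: kubernetes.io/hostname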

aq2013 closed this as completed Aug 9, 2024