[BUG] Koordinator doesn't support multiple card sharing #2097

Closed
ZiMengSheng opened this issue Jun 11, 2024 · 8 comments
Labels
area/koord-scheduler kind/bug Create a report to help us improve

@ZiMengSheng
Contributor

ZiMengSheng commented Jun 11, 2024

What happened:

A node has 8 GPU cards, each with 80 Gi of GPU memory. I want to use four cards with 40 Gi of GPU memory on each via koordinator.sh/gpu.shared, but the pod gets stuck in the Pending phase.

apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 40m
        memory: 40Mi
        koordinator.sh/gpu.shared: "4"
        koordinator.sh/gpu-memory: 160Gi
      requests:
        cpu: 40m
        memory: 40Mi
        koordinator.sh/gpu.shared: "4"
        koordinator.sh/gpu-memory: 160Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always

What you expected to happen:

Pod should be scheduled.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • App version:
  • Kubernetes version (use kubectl version):
  • Install details (e.g. helm install args):
  • Node environment (for koordlet/runtime-proxy issue):
    • Containerd/Docker version:
    • OS version:
    • Kernel version:
    • Cgroup driver: cgroupfs/systemd
  • Others:
@ZiMengSheng ZiMengSheng added the kind/bug Create a report to help us improve label Jun 11, 2024
@ZiMengSheng ZiMengSheng changed the title from "[BUG] Koordinator" to "[BUG] Koordinator doesn't support multiple card sharing" Jun 11, 2024
@AdrianMachao
Contributor

/assign

@ZiMengSheng
Contributor Author

> /assign

Welcome! You can refer to this proposal

@AdrianMachao
Contributor

I have started working on it, but I need some time to understand your design principles and code. I will try my best to complete it as soon as possible.

@ZiMengSheng
Contributor Author

> I have started working on it, but I need some time to understand your design principles and code. I will try my best to complete it as soon as possible.

OK. If you need help, questions and discussions are welcome either on this GitHub issue or on DingTalk (DingDing).

@AdrianMachao
Contributor

Should this be implemented in the mutating and validating webhook at pkg/webhook/pod/mutating/extended_resource_spec.go? I didn't see any handling of GPU extended resources there. I am working on this task now. @ZiMengSheng

@AdrianMachao
Contributor

> > I have started working on it, but I need some time to understand your design principles and code. I will try my best to complete it as soon as possible.

> OK. If you need help, questions and discussions are welcome either on this GitHub issue or on DingTalk (DingDing).

What is your DingDing account? Can I add you as a contact?

@ZiMengSheng
Contributor Author

> Should this be implemented in the mutating and validating webhook at pkg/webhook/pod/mutating/extended_resource_spec.go? I didn't see any handling of GPU exte

王建宇

@ZiMengSheng
Contributor Author

> Should this be implemented in the mutating and validating webhook at pkg/webhook/pod/mutating/extended_resource_spec.go? I didn't see any handling of GPU extended resources there. I am working on this task now. @ZiMengSheng

The scheduler needs to calculate requestsPerCard and numGPUs according to the gpu.shared protocol.
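
To make the calculation concrete, here is a rough sketch of what it could look like for the pod in this report. This is not Koordinator's actual code: `splitSharedRequest` and `fits` are hypothetical helpers, memory is passed as plain GiB integers rather than parsed resource quantities, and both the even split across cards and the divisibility check are assumptions about how the gpu.shared protocol should behave.

```go
package main

import (
	"errors"
	"fmt"
)

// splitSharedRequest derives the per-card memory request from a pod-level
// request. numGPUs stands for the koordinator.sh/gpu.shared value and
// totalMemGiB for the koordinator.sh/gpu-memory value in GiB.
// Hypothetical helper; the even split is an assumption.
func splitSharedRequest(numGPUs, totalMemGiB int64) (int64, error) {
	if numGPUs <= 0 {
		return 0, errors.New("gpu.shared must be a positive integer")
	}
	if totalMemGiB%numGPUs != 0 {
		return 0, fmt.Errorf("gpu-memory %dGi is not divisible by gpu.shared %d", totalMemGiB, numGPUs)
	}
	return totalMemGiB / numGPUs, nil
}

// fits reports whether at least numGPUs cards on the node each have
// perCardGiB of free GPU memory. Hypothetical helper.
func fits(freeGiBPerCard []int64, numGPUs, perCardGiB int64) bool {
	var usable int64
	for _, free := range freeGiBPerCard {
		if free >= perCardGiB {
			usable++
		}
	}
	return usable >= numGPUs
}

func main() {
	// Values from the pod spec in this issue: gpu.shared "4", gpu-memory 160Gi.
	perCard, err := splitSharedRequest(4, 160)
	if err != nil {
		panic(err)
	}
	// The node from this issue: 8 cards with 80 Gi each, all assumed free.
	node := []int64{80, 80, 80, 80, 80, 80, 80, 80}
	fmt.Println(perCard, fits(node, 4, perCard))
}
```

Under these assumptions the request works out to 40 Gi on each of four cards, which the 8 x 80 Gi node can satisfy, so the pod would be expected to schedule rather than stay Pending.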

@zwzhang0107 zwzhang0107 added this to the v1.6 milestone Jul 30, 2024
4 participants