training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630

Closed
nkflash opened this issue Jul 7, 2022 · 8 comments

nkflash commented Jul 7, 2022

I set up two new clusters, and both of them hit the same problem.

Enable gang scheduling in the training-operator (screenshot omitted; gang scheduling is turned on with the "--enable-gang-scheduling=true" operator flag).

Submit the PyTorch job from the examples directory:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple-2
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command: ["/bin/sh"]
              args: ["-c", "echo \"Hello\"; sleep 6000"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command: ["/bin/sh"]
              args: ["-c", "echo \"Hello\"; sleep 600"]

The job never creates any pods.

I checked the PodGroup status:

- apiVersion: scheduling.volcano.sh/v1beta1
  kind: PodGroup
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"kubeflow.org/v1","kind":"PyTorchJob","metadata":{"annotations":{},"name":"pytorch-simple-2","namespace":"kubeflow"},"spec":{"pytorchReplicaSpecs":{"Master":{"replicas":1,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"args":["-c","echo \"Hello\"; sleep 6000"],"command":["/bin/sh"],"image":"docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727","imagePullPolicy":"Always","name":"pytorch"}]}}},"Worker":{"replicas":1,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"args":["-c","echo \"Hello\"; sleep 600"],"command":["/bin/sh"],"image":"docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727","imagePullPolicy":"Always","name":"pytorch"}]}}}}}}
    creationTimestamp: "2022-07-07T02:54:58Z"
    generation: 17
    name: pytorch-simple-2
    namespace: kubeflow
    ownerReferences:
    - apiVersion: kubeflow.org/v1
      blockOwnerDeletion: true
      controller: true
      kind: PyTorchJob
      name: pytorch-simple-2
      uid: b32c09f3-85e3-4fc5-853e-00e4078048ab
    resourceVersion: "19798"
    uid: ef6eaedc-9eb8-4d16-b005-3a9f30051bf0
  spec:
    minMember: 2
    minResources: {}
    queue: default
  status:
    conditions:
    - lastTransitionTime: "2022-07-07T03:11:22Z"
      message: '2/0 tasks in gang unschedulable: pod group is not ready, 2 minAvailable'
      reason: NotEnoughResources
      status: "True"
      transitionID: 9a60d40c-4669-494f-9727-c3d96f99cb9f
      type: Unschedulable
    phase: Inqueue
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The status looks correct: "Inqueue".

From the operator log, the job is reported as unschedulable, and the message appears only twice:

time="2022-07-07T02:54:58Z" level=info msg="PyTorchJob pytorch-simple-2 is created."
1.657162498112169e+09   DEBUG   No ElasicPolicy or Metric is specified, skipping HPA reconciling process        {"pytorchjob": "pytorch-simple-2"}
time="2022-07-07T02:54:58Z" level=info msg="Reconciling for job pytorch-simple-2"
time="2022-07-07T02:54:58Z" level=warning msg="Ignore task Master priority class : priorityclass.scheduling.k8s.io \"\" not found"
time="2022-07-07T02:54:58Z" level=warning msg="Ignore task Worker priority class : priorityclass.scheduling.k8s.io \"\" not found"
time="2022-07-07T02:54:58Z" level=warning msg="PodGroup kubeflow/pytorch-simple-2 unschedulable"
1.6571624983127658e+09  DEBUG   No ElasicPolicy or Metric is specified, skipping HPA reconciling process        {"pytorchjob": "pytorch-simple-2"}
time="2022-07-07T02:54:58Z" level=info msg="Reconciling for job pytorch-simple-2"
time="2022-07-07T02:54:58Z" level=warning msg="Ignore task Master priority class : priorityclass.scheduling.k8s.io \"\" not found"
time="2022-07-07T02:54:58Z" level=warning msg="Ignore task Worker priority class : priorityclass.scheduling.k8s.io \"\" not found"
time="2022-07-07T02:54:58Z" level=warning msg="PodGroup kubeflow/pytorch-simple-2 unschedulable"

If I modify the job YAML, it triggers the job to run (the operator receives the Inqueue status and creates the pods).
If I disable gang scheduling with "--enable-gang-scheduling=false", it works fine.
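
For illustration, here is a minimal Go sketch of the reconcile-time check implied above. The package, helper name, and arguments are assumed for the example and are not the operator's actual code: the controller reads the job's Volcano PodGroup and only proceeds to create pods once the phase is Inqueue. Since nothing watches the PodGroup itself, a check like this runs only when some other event, such as editing the PyTorchJob, enqueues a reconcile.

package controllers

import (
    "context"

    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
    volcanov1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// podGroupInqueue reports whether the Volcano PodGroup created for a job has
// reached the Inqueue phase. A check along these lines only runs inside
// Reconcile, so a status change on the PodGroup alone never triggers it.
func podGroupInqueue(ctx context.Context, c client.Client, namespace, name string) (bool, error) {
    pg := &volcanov1beta1.PodGroup{}
    if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, pg); err != nil {
        return false, err
    }
    return pg.Status.Phase == volcanov1beta1.PodGroupInqueue, nil
}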

nkflash changed the title from "training-operator can not get podgroup" to "training-operator can not get podgroup status(inqueue) with volcano when enable gang" on Jul 7, 2022

nkflash commented Jul 8, 2022

(Two screenshots of the relevant operator code omitted.)

This is a bug: the training-operator does not watch for PodGroup status changes; it only checks the PodGroup status during reconcile. So a PodGroup status change on its own never triggers a reconcile.
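
A minimal Go sketch of the missing piece, assuming the older (pre-v0.15) controller-runtime builder/handler API and a hypothetical setup helper (the helper name and import paths are illustrative, not the actual fix): register a watch on Volcano PodGroups so that a status change, such as the phase becoming Inqueue, enqueues the owning PyTorchJob for reconciliation.

package controllers

import (
    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/handler"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
    "sigs.k8s.io/controller-runtime/pkg/source"
    volcanov1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"

    kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
)

// setupPyTorchJobWatches wires the controller so that, besides the PyTorchJob
// itself and the Pods it owns, changes to an owned Volcano PodGroup (including
// status-only updates such as the phase flipping to Inqueue) also enqueue the
// owning job for reconciliation. The PodGroup type must be registered in the
// manager's scheme.
func setupPyTorchJobWatches(mgr ctrl.Manager, r reconcile.Reconciler) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&kubeflowv1.PyTorchJob{}).
        Owns(&corev1.Pod{}).
        // The missing watch: without it, only job or pod events trigger Reconcile.
        Watches(&source.Kind{Type: &volcanov1beta1.PodGroup{}},
            &handler.EnqueueRequestForOwner{
                OwnerType:    &kubeflowv1.PyTorchJob{},
                IsController: true,
            }).
        Complete(r)
}

This relies on the PodGroup carrying a controller ownerReference to the PyTorchJob, which the PodGroup shown above already has.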


nkflash commented Jul 8, 2022

/assign @shinytang6

shinytang6 (Member) commented:

Thanks for the report @nkflash, it's a bug, would you like to fix that?

gaocegege (Member) commented:

Thanks for the issue!


nkflash commented Jul 15, 2022

Thanks for the report @nkflash, it's a bug, would you like to fix that?

Sure, I will fix that later.

shinytang6 (Member) commented:

#1666 should fix this issue

psheorangithub commented:

How long will it take to merge that PR? I'm also facing the same issue.

johnugeorge (Member) commented:

This is fixed by #1666
