training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630

Closed
nkflash opened this issue Jul 7, 2022 · 8 comments

nkflash commented Jul 7, 2022

I set up two new clusters, and both of them hit the same problem.

Enable gang scheduling in the training-operator (screenshot omitted; gang scheduling is turned on with the "--enable-gang-scheduling=true" operator flag).

Submit the PyTorch job from the examples directory:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple-2
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command: ["/bin/sh"]
              args: ["-c", "echo \"Hello\"; sleep 6000"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command: ["/bin/sh"]
              args: ["-c", "echo \"Hello\"; sleep 600"]

The job never creates any pods.

I checked the PodGroup status:

- apiVersion: scheduling.volcano.sh/v1beta1
  kind: PodGroup
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"kubeflow.org/v1","kind":"PyTorchJob","metadata":{"annotations":{},"name":"pytorch-simple-2","namespace":"kubeflow"},"spec":{"pytorchReplicaSpecs":{"Master":{"replicas":1,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"args":["-c","echo \"Hello\"; sleep 6000"],"command":["/bin/sh"],"image":"docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727","imagePullPolicy":"Always","name":"pytorch"}]}}},"Worker":{"replicas":1,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"args":["-c","echo \"Hello\"; sleep 600"],"command":["/bin/sh"],"image":"docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727","imagePullPolicy":"Always","name":"pytorch"}]}}}}}}
    creationTimestamp: "2022-07-07T02:54:58Z"
    generation: 17
    name: pytorch-simple-2
    namespace: kubeflow
    ownerReferences:
    - apiVersion: kubeflow.org/v1
      blockOwnerDeletion: true
      controller: true
      kind: PyTorchJob
      name: pytorch-simple-2
      uid: b32c09f3-85e3-4fc5-853e-00e4078048ab
    resourceVersion: "19798"
    uid: ef6eaedc-9eb8-4d16-b005-3a9f30051bf0
  spec:
    minMember: 2
    minResources: {}
    queue: default
  status:
    conditions:
    - lastTransitionTime: "2022-07-07T03:11:22Z"
      message: '2/0 tasks in gang unschedulable: pod group is not ready, 2 minAvailable'
      reason: NotEnoughResources
      status: "True"
      transitionID: 9a60d40c-4669-494f-9727-c3d96f99cb9f
      type: Unschedulable
    phase: Inqueue
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The status looks correct: "Inqueue".

From the operator log, the job is reported as unschedulable, and the message appears only twice:

time="2022-07-07T02:54:58Z" level=info msg="PyTorchJob pytorch-simple-2 is created."
1.657162498112169e+09   DEBUG   No ElasicPolicy or Metric is specified, skipping HPA reconciling process        {"pytorchjob": "pytorch-simple-2"}
time="2022-07-07T02:54:58Z" level=info msg="Reconciling for job pytorch-simple-2"
time="2022-07-07T02:54:58Z" level=warning msg="Ignore task Master priority class : priorityclass.scheduling.k8s.io \"\" not found"
time="2022-07-07T02:54:58Z" level=warning msg="Ignore task Worker priority class : priorityclass.scheduling.k8s.io \"\" not found"
time="2022-07-07T02:54:58Z" level=warning msg="PodGroup kubeflow/pytorch-simple-2 unschedulable"
1.6571624983127658e+09  DEBUG   No ElasicPolicy or Metric is specified, skipping HPA reconciling process        {"pytorchjob": "pytorch-simple-2"}
time="2022-07-07T02:54:58Z" level=info msg="Reconciling for job pytorch-simple-2"
time="2022-07-07T02:54:58Z" level=warning msg="Ignore task Master priority class : priorityclass.scheduling.k8s.io \"\" not found"
time="2022-07-07T02:54:58Z" level=warning msg="Ignore task Worker priority class : priorityclass.scheduling.k8s.io \"\" not found"
time="2022-07-07T02:54:58Z" level=warning msg="PodGroup kubeflow/pytorch-simple-2 unschedulable"

If I modify the job YAML, it triggers the job to run (the operator receives the Inqueue status and creates the pods).
If I disable gang scheduling with "--enable-gang-scheduling=false", it works fine.
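
For illustration, here is a minimal Go sketch of the reconcile-time check implied above. The package, helper name, and arguments are assumed for the example and are not the operator's actual code: the controller reads the job's Volcano PodGroup and only proceeds to create pods once the phase is Inqueue. Since nothing watches the PodGroup itself, a check like this runs only when some other event, such as editing the PyTorchJob, enqueues a reconcile.

package controllers

import (
    "context"

    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
    volcanov1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// podGroupInqueue reports whether the Volcano PodGroup created for a job has
// reached the Inqueue phase. A check along these lines only runs inside
// Reconcile, so a status change on the PodGroup alone never triggers it.
func podGroupInqueue(ctx context.Context, c client.Client, namespace, name string) (bool, error) {
    pg := &volcanov1beta1.PodGroup{}
    if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, pg); err != nil {
        return false, err
    }
    return pg.Status.Phase == volcanov1beta1.PodGroupInqueue, nil
}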

nkflash changed the title from "training-operator can not get podgroup" to "training-operator can not get podgroup status(inqueue) with volcano when enable gang" on Jul 7, 2022

nkflash commented Jul 8, 2022

(Two screenshots of the relevant operator code omitted.)

This is a bug: the training-operator does not watch for PodGroup status changes; it only checks the PodGroup status during reconcile. So a PodGroup status change on its own never triggers a reconcile.
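
A minimal Go sketch of the missing piece, assuming the older (pre-v0.15) controller-runtime builder/handler API and a hypothetical setup helper (the helper name and import paths are illustrative, not the actual fix): register a watch on Volcano PodGroups so that a status change, such as the phase becoming Inqueue, enqueues the owning PyTorchJob for reconciliation.

package controllers

import (
    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/handler"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
    "sigs.k8s.io/controller-runtime/pkg/source"
    volcanov1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"

    kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
)

// setupPyTorchJobWatches wires the controller so that, besides the PyTorchJob
// itself and the Pods it owns, changes to an owned Volcano PodGroup (including
// status-only updates such as the phase flipping to Inqueue) also enqueue the
// owning job for reconciliation. The PodGroup type must be registered in the
// manager's scheme.
func setupPyTorchJobWatches(mgr ctrl.Manager, r reconcile.Reconciler) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&kubeflowv1.PyTorchJob{}).
        Owns(&corev1.Pod{}).
        // The missing watch: without it, only job or pod events trigger Reconcile.
        Watches(&source.Kind{Type: &volcanov1beta1.PodGroup{}},
            &handler.EnqueueRequestForOwner{
                OwnerType:    &kubeflowv1.PyTorchJob{},
                IsController: true,
            }).
        Complete(r)
}

This relies on the PodGroup carrying a controller ownerReference to the PyTorchJob, which the PodGroup shown above already has.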


nkflash commented Jul 8, 2022

/assign @shinytang6

shinytang6 (Member) commented:

Thanks for the report @nkflash, it's a bug, would you like to fix that?

gaocegege (Member) commented:

Thanks for the issue!


nkflash commented Jul 15, 2022

Thanks for the report @nkflash, it's a bug, would you like to fix that?

Sure, I will fix that later.

shinytang6 (Member) commented:

#1666 should fix this issue

psheorangithub commented:

How long will it take to merge that PR? I'm also facing the same issue.

johnugeorge (Member) commented:

This is fixed by #1666
