paddle-operator cannot get PodGroup status (Inqueue) with Volcano when gang scheduling is enabled #1729

Closed
nkflash opened this issue Jan 17, 2023 · 4 comments · Fixed by #1730

nkflash commented Jan 17, 2023

This is similar to #1630.

The paddle job hangs because its pods are never created.

The job YAML is as follows:

apiVersion: kubeflow.org/v1
kind: PaddleJob
metadata:
  creationTimestamp: "2023-01-17T08:34:04Z"
  generation: 1
  labels:
    job.baai.ac.cn/creator: elrond
    job.baai.ac.cn/creator-id: "215305"
    job.baai.ac.cn/queue-id: 506ef7b1-1943-4d6a-aac5-e93c23cff768
    job.baai.ac.cn/type: batch
  name: job-e9e13362-dbab-4709-9954-bd38d298cf59
  namespace: airs
  resourceVersion: "36758105"
  uid: 34cfd47a-bf64-465f-9b7e-35d23afde375
spec:
  paddleReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations:
            airs-center-endpoint: airs-center.airs-citest.svc.cluster.local:6080
            proj.baai.ac.cn/id: dbe1edf4-d12d-406b-a64f-2f71f15ed613
            projset.baai.ac.cn/id: 00ffe2f5-2cf0-47ef-8631-bee744da1069
            volcano.sh/preemptable: "false"
          labels:
            job.baai.ac.cn/creator: elrond
            job.baai.ac.cn/creator-id: "215305"
            job.baai.ac.cn/name: job-e9e13362-dbab-4709-9954-bd38d298cf59
            job.baai.ac.cn/type: batch
            pod.baai.ac.cn/role: Worker
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: machine.baai.ac.cn/accelerator
                    operator: In
                    values:
                    - NVIDIA_T4
          containers:
          - command:
            - /bin/sh
            - -c
            - echo "PATH=/client-tools:$PATH" >> ~/.bashrc;env >> /etc/environment;/usr/sbin/sshd
              -f /etc/configmap/sshd_config;python -m paddle.distributed.launch run_check
            env:
            - name: TZ
              value: Asia/Shanghai
            image: harbor-dev.platform.baai-inner.ac.cn/library/paddle:gpu
            imagePullPolicy: Always
            name: paddle
            resources:
              limits:
                cpu: "10"
                memory: 30Gi
                nvidia.com/gpu: "2"
                rdma/mlnx_shared: "2"
              requests:
                cpu: "10"
                memory: 30Gi
                nvidia.com/gpu: "2"
                rdma/mlnx_shared: "2"
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
            volumeMounts:
            - mountPath: /dev/shm
              name: shm-volume
            - mountPath: /home/elrond
              name: storage-volume0
            - mountPath: /etc/localtime
              name: localtime
            - mountPath: /etc/downwardapi
              name: downward-api
              readOnly: true
            - mountPath: /etc/configmap
              name: sshd-config
              readOnly: true
            - mountPath: /etc/pub
              name: sshproxy-keys-config
              readOnly: true
            - mountPath: /client-tools
              mountPropagation: HostToContainer
              name: client-tools
              readOnly: true
          imagePullSecrets:
          - name: harbor-platform-readonly-secret
          schedulerName: volcano
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 15Gi
            name: shm-volume
          - hostPath:
              path: /mnt/airs-business/airs/sharefs/00ffe2f5-2cf0-47ef-8631-bee744da1069_dbe1edf4-d12d-406b-a64f-2f71f15ed613/215305
            name: storage-volume0
          - downwardAPI:
              items:
              - fieldRef:
                  fieldPath: metadata.labels
                path: labels
              - fieldRef:
                  fieldPath: metadata.annotations
                path: annotations
            name: downward-api
          - hostPath:
              path: /etc/localtime
            name: localtime
          - configMap:
              name: sshd-config
            name: sshd-config
          - configMap:
              items:
              - key: id_rsa.pub
                path: id_rsa.pub
              name: sshproxy-keys-config
            name: sshproxy-keys-config
          - hostPath:
              path: /mnt/airs-business/client-tools/tools/bin
            name: client-tools
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      priorityClass: high-priority
      queue: 506ef7b1-1943-4d6a-aac5-e93c23cff768
    ttlSecondsAfterFinished: 120
status:
  conditions:
  - lastTransitionTime: "2023-01-17T08:34:05Z"
    lastUpdateTime: "2023-01-17T08:34:05Z"
    message: PaddleJob job-e9e13362-dbab-4709-9954-bd38d298cf59 is created.
    reason: PaddleJobCreated
    status: "True"
    type: Created
  lastReconcileTime: "2023-01-17T08:34:05Z"
  replicaStatuses: {}

But this only happens with the paddle controller; TFJob and PyTorchJob work fine in the same environment.

If I restart the training-operator pod, the job moves to Running status.
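
For context: when gang scheduling with Volcano is enabled, the operator creates a PodGroup for the job and is expected to hold back pod creation until Volcano has admitted the PodGroup to the queue (status phase Inqueue). Below is a minimal Go sketch of that gate, assuming the Volcano scheduling/v1beta1 API types; the helper name is illustrative and is not the operator's actual code.

package gang

import (
	volcanov1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// podsCanBeCreated is an illustrative helper (not the operator's actual code):
// with Volcano gang scheduling, worker pods should only be created once the
// job's PodGroup has left the Pending phase, i.e. Volcano has admitted it to
// the queue and reports Inqueue (or a later phase). In this issue the
// controller never observes that transition, so the gate never opens and no
// pods appear.
func podsCanBeCreated(pg *volcanov1beta1.PodGroup) bool {
	return pg != nil && pg.Status.Phase != volcanov1beta1.PodGroupPending
}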

@kuizhiqing

cc @shinytang6 Can you tell whether anything is different about how the paddle scenario works with Volcano?

tenzen-y commented Jan 17, 2023

/assign

This is a bug caused by predicates.
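
In controller-runtime, predicates filter which watch events reach the reconciler. Below is a hypothetical sketch of the failure mode this points at, where an update predicate on the PodGroup watch drops status-only changes such as the Pending-to-Inqueue transition (illustrative only, not the actual code changed in #1730).

package gang

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// generationOnlyPredicate is a hypothetical predicate that reproduces the
// failure mode described above: it forwards an update only when
// metadata.generation changes. A PodGroup going from Pending to Inqueue
// changes only its status, so its generation is unchanged, the event is
// filtered out, and the owning PaddleJob is never reconciled.
var generationOnlyPredicate = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration()
	},
}

A fix along these lines would let status updates through (or compare the PodGroup phase explicitly) so that the Inqueue transition triggers a reconcile of the owning job.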

@tenzen-y

@nkflash That bug was probably fixed. If you face the same error, feel free to reopen this issue.

nkflash commented Jan 19, 2023

> @nkflash That bug was probably fixed. If you face the same error, feel free to reopen this issue.

I have verified this fix; it works well.
