Can't get mpijob status when pod template is invalid #604

congpeiqing · 2023-11-15T05:25:17Z

I created an MPIJob with an invalid pod template, and its status is never populated (I think the status should be Failed).
As a result, I can't distinguish MPIJobs that are too new to have a status from MPIJobs with an invalid pod template.

My MPIJob is shown below:
kubectl get mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod -oyaml
The output is:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  creationTimestamp: "2023-11-15T02:01:44Z"
  generation: 1
  labels:
    deadline: 2023-11-15_02-06-44
  name: ai62da0dbe-6406-4252-85d6-51ef87eab10d
  namespace: cpod
  resourceVersion: "2787007"
  uid: e5703c73-f27e-45ef-9049-fd40c152d4d6
spec:
  launcherCreationPolicy: WaitForWorkersReady
  mpiImplementation: OpenMPI
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: "111"
            imagePullPolicy: IfNotPresent
            name: launcher
          hostIPC: true
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - image: "111"
            imagePullPolicy: IfNotPresent
            name: worker
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - mountPath: "111"
              name: ckpt-pv
            - mountPath: "111"
              name: saved-model-pv
          hostIPC: true
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-GeForce-RTX-3090
          volumes:
          - name: ckpt-pv
            persistentVolumeClaim:
              claimName: ai62da0dbe-6406-4252-85d6-51ef87eab10d-ckpt
              readOnly: false
          - name: saved-model-pv
            persistentVolumeClaim:
              claimName: ai62da0dbe-6406-4252-85d6-51ef87eab10d-modelsave
              readOnly: false
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      minAvailable: 1
    suspend: false
  slotsPerWorker: 1
  sshAuthMountPath: /root/.ssh

When I describe the MPIJob:

kubectl describe mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod

The output is:

Name:         ai62da0dbe-6406-4252-85d6-51ef87eab10d
Namespace:    cpod
Labels:       deadline=2023-11-15_02-06-44
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2023-11-15T02:01:44Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v2beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:deadline:
      f:spec:
        .:
        f:launcherCreationPolicy:
        f:mpiImplementation:
        f:mpiReplicaSpecs:
          .:
          f:Launcher:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:hostIPC:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:hostIPC:
                f:nodeSelector:
                f:volumes:
        f:runPolicy:
          .:
          f:cleanPodPolicy:
          f:schedulingPolicy:
            .:
            f:minAvailable:
          f:suspend:
        f:slotsPerWorker:
        f:sshAuthMountPath:
    Manager:         cpodmanager
    Operation:       Update
    Time:            2023-11-15T02:01:44Z
  Resource Version:  2787007
  UID:               e5703c73-f27e-45ef-9049-fd40c152d4d6
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Image:              111
            Image Pull Policy:  IfNotPresent
            Name:               launcher
          Host IPC:             true
    Worker:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Image:              111
            Image Pull Policy:  IfNotPresent
            Name:               worker
            Resources:
              Limits:
                nvidia.com/gpu:  1
            Volume Mounts:
              Mount Path:  111
              Name:        ckpt-pv
              Mount Path:  111
              Name:        saved-model-pv
          Host IPC:        true
          Node Selector:
            nvidia.com/gpu.product:  NVIDIA-GeForce-RTX-3090
          Volumes:
            Name:  ckpt-pv
            Persistent Volume Claim:
              Claim Name:  ai62da0dbe-6406-4252-85d6-51ef87eab10d-ckpt
              Read Only:   false
            Name:          saved-model-pv
            Persistent Volume Claim:
              Claim Name:  ai62da0dbe-6406-4252-85d6-51ef87eab10d-modelsave
              Read Only:   false
  Run Policy:
    Clean Pod Policy:  Running
    Scheduling Policy:
      Min Available:    1
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Events:
  Type     Reason         Age                   From                Message
  ----     ------         ----                  ----                -------
  Normal   MPIJobCreated  5m48s (x12 over 27m)  mpi-job-controller  MPIJob cpod/ai62da0dbe-6406-4252-85d6-51ef87eab10d is created.
  Warning  MPIJobFailed   5m48s (x12 over 27m)  mpi-job-controller  worker pod created failed: Pod "ai62da0dbe-6406-4252-85d6-51ef87eab10d-worker-0" is invalid: spec.containers[0].volumeMounts[1].mountPath: Invalid value: "111": must be unique


terrytangyuan · 2023-11-15T12:50:34Z

The controller currently requeues the item when there are errors during worker pod creation. It might be problematic to requeue regardless of what the error is. If there are pod spec errors, then it should just fail instead of requeueing repeatedly.

https://github.com/kubeflow/mpi-operator/blob/master/pkg/controller/mpi_job_controller.go#L964

alculquicondor · 2023-11-15T13:57:36Z

Ideally, we should have a webhook, but this was never prioritized.

Alternatively, we can add a CEL validator https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#validation-expression

Happy to review a PR if you are interested in working on it.

tenzen-y · 2023-11-15T16:28:50Z

Previously, I tried to introduce CEL validation to the training-operator:

kubeflow/training-operator#1708

However, I gave up on it because validating the podTemplate is hard given the cost budget of CEL validations.

kubeflow/training-operator#1708 (comment)

Hence, we must introduce webhooks if we want to validate the podTemplates.

alculquicondor · 2023-11-15T16:58:29Z

You mean that CEL was too slow or what exactly?

tenzen-y · 2023-11-15T17:01:54Z

> You mean that CEL was too slow or what exactly?

No, I meant CEL validation can not work due to the following errors:

Forbidden: contributed to estimated rule cost total exceeding cost limit for entire OpenAPIv3 schema, spec.validation.openAPIV3Schema: Forbidden: x-kubernetes-validations estimated rule cost total for entire OpenAPIv3 schema exceeds budget by factor of more than 100x (try simplifying the rule, or adding maxItems, maxProperties, and maxLength where arrays, maps, and strings are declared)]

This was caused by the CEL cost budget.

alculquicondor · 2023-11-15T17:05:01Z

Oh, so too many validation rules :)

tenzen-y · 2023-11-15T17:10:10Z

> Oh, so too many validation rules :)

I guess the budget is exceeded because replicaSpecs is defined as a map: we cannot set a limit on the number of replicas, so the search depth is treated as infinite :(

alculquicondor · 2023-11-15T17:12:33Z

Ah, we shot ourselves in the foot by using a map instead of explicit fields.

congpeiqing · 2023-11-17T15:09:15Z

> The controller currently requeues the item when there are errors during worker pod creation. It might be problematic to requeue regardless of what the error is. If there are pod spec errors, then it should just fail instead of requeueing repeatedly.
>
> https://github.com/kubeflow/mpi-operator/blob/master/pkg/controller/mpi_job_controller.go#L964

#606
@terrytangyuan PR submitted; it works in our environment.

congpeiqing linked a pull request Nov 17, 2023 that will close this issue

fix bug about status absence when worker pod spec is invalid #606

Open
