Can't get MPIJob status when pod template is invalid #604

Open
congpeiqing opened this issue Nov 15, 2023 · 9 comments · May be fixed by #606

@congpeiqing

I created an MPIJob with an invalid pod template, and I can never get the MPIJob status (I think the status should be Failed).
As it stands, I can't distinguish MPIJobs that are simply too new to have a status from MPIJobs with an invalid pod template.

My MPIJob is shown below.
kubectl get mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod -oyaml
The output is:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  creationTimestamp: "2023-11-15T02:01:44Z"
  generation: 1
  labels:
    deadline: 2023-11-15_02-06-44
  name: ai62da0dbe-6406-4252-85d6-51ef87eab10d
  namespace: cpod
  resourceVersion: "2787007"
  uid: e5703c73-f27e-45ef-9049-fd40c152d4d6
spec:
  launcherCreationPolicy: WaitForWorkersReady
  mpiImplementation: OpenMPI
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: "111"
            imagePullPolicy: IfNotPresent
            name: launcher
          hostIPC: true
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - image: "111"
            imagePullPolicy: IfNotPresent
            name: worker
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - mountPath: "111"
              name: ckpt-pv
            - mountPath: "111"
              name: saved-model-pv
          hostIPC: true
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-GeForce-RTX-3090
          volumes:
          - name: ckpt-pv
            persistentVolumeClaim:
              claimName: ai62da0dbe-6406-4252-85d6-51ef87eab10d-ckpt
              readOnly: false
          - name: saved-model-pv
            persistentVolumeClaim:
              claimName: ai62da0dbe-6406-4252-85d6-51ef87eab10d-modelsave
              readOnly: false
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      minAvailable: 1
    suspend: false
  slotsPerWorker: 1
  sshAuthMountPath: /root/.ssh

When describing the MPIJob:

kubectl describe mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod

the output is:

Name:         ai62da0dbe-6406-4252-85d6-51ef87eab10d
Namespace:    cpod
Labels:       deadline=2023-11-15_02-06-44
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2023-11-15T02:01:44Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v2beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:deadline:
      f:spec:
        .:
        f:launcherCreationPolicy:
        f:mpiImplementation:
        f:mpiReplicaSpecs:
          .:
          f:Launcher:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:hostIPC:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:hostIPC:
                f:nodeSelector:
                f:volumes:
        f:runPolicy:
          .:
          f:cleanPodPolicy:
          f:schedulingPolicy:
            .:
            f:minAvailable:
          f:suspend:
        f:slotsPerWorker:
        f:sshAuthMountPath:
    Manager:         cpodmanager
    Operation:       Update
    Time:            2023-11-15T02:01:44Z
  Resource Version:  2787007
  UID:               e5703c73-f27e-45ef-9049-fd40c152d4d6
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Image:              111
            Image Pull Policy:  IfNotPresent
            Name:               launcher
          Host IPC:             true
    Worker:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Image:              111
            Image Pull Policy:  IfNotPresent
            Name:               worker
            Resources:
              Limits:
                nvidia.com/gpu:  1
            Volume Mounts:
              Mount Path:  111
              Name:        ckpt-pv
              Mount Path:  111
              Name:        saved-model-pv
          Host IPC:        true
          Node Selector:
            nvidia.com/gpu.product:  NVIDIA-GeForce-RTX-3090
          Volumes:
            Name:  ckpt-pv
            Persistent Volume Claim:
              Claim Name:  ai62da0dbe-6406-4252-85d6-51ef87eab10d-ckpt
              Read Only:   false
            Name:          saved-model-pv
            Persistent Volume Claim:
              Claim Name:  ai62da0dbe-6406-4252-85d6-51ef87eab10d-modelsave
              Read Only:   false
  Run Policy:
    Clean Pod Policy:  Running
    Scheduling Policy:
      Min Available:    1
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Events:
  Type     Reason         Age                   From                Message
  ----     ------         ----                  ----                -------
  Normal   MPIJobCreated  5m48s (x12 over 27m)  mpi-job-controller  MPIJob cpod/ai62da0dbe-6406-4252-85d6-51ef87eab10d is created.
  Warning  MPIJobFailed   5m48s (x12 over 27m)  mpi-job-controller  worker pod created failed: Pod "ai62da0dbe-6406-4252-85d6-51ef87eab10d-worker-0" is invalid: spec.containers[0].volumeMounts[1].mountPath: Invalid value: "111": must be unique
@terrytangyuan
Member

terrytangyuan commented Nov 15, 2023

The controller currently requeues the item when there are errors during worker pod creation. It might be problematic to requeue regardless of what the error is. If there are pod spec errors, then it should just fail instead of requeueing repeatedly.

https://github.com/kubeflow/mpi-operator/blob/master/pkg/controller/mpi_job_controller.go#L964
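
For illustration, here is a minimal sketch (not the actual change in #606; the helper name is made up) of how the error type returned by pod creation could be used to tell a permanently invalid pod template apart from a transient failure, using apierrors.IsInvalid:

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// isPermanentPodError reports whether a pod-creation error can never be fixed
// by retrying, e.g. a pod template rejected by API-server validation. On such
// errors the controller could set a Failed condition on the MPIJob instead of
// requeueing the key forever; transient errors would still be retried.
func isPermanentPodError(err error) bool {
	return apierrors.IsInvalid(err)
}

func main() {
	// Simulate the error from this issue: two volumeMounts share mountPath "111".
	err := apierrors.NewInvalid(
		schema.GroupKind{Kind: "Pod"},
		"ai62da0dbe-6406-4252-85d6-51ef87eab10d-worker-0",
		field.ErrorList{field.Invalid(
			field.NewPath("spec", "containers").Index(0).Child("volumeMounts").Index(1).Child("mountPath"),
			"111", "must be unique")},
	)
	fmt.Println(isPermanentPodError(err)) // true -> mark the MPIJob Failed, stop retrying
}
```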

@alculquicondor
Collaborator

Ideally, we should have a webhook, but this was never prioritized.

Alternatively, we can add a CEL validator https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#validation-expression

Happy to review a PR if you are interested in working on it.
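
As a concrete illustration of what such a webhook (or CEL rule) would need to express, here is a minimal, hypothetical sketch of the exact check the API server applied in this issue, duplicate mountPath detection in the pod template. It is not actual operator code, just an example of the kind of validation that would surface the error at admission time:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// validateMountPaths is an illustrative example of a podTemplate check an
// MPIJob validating webhook could run: mountPath must be unique within each
// container, the rule the API server enforced when rejecting the worker pod.
func validateMountPaths(spec *corev1.PodSpec) error {
	for _, c := range spec.Containers {
		seen := map[string]bool{}
		for _, vm := range c.VolumeMounts {
			if seen[vm.MountPath] {
				return fmt.Errorf("container %q: duplicate mountPath %q", c.Name, vm.MountPath)
			}
			seen[vm.MountPath] = true
		}
	}
	return nil
}

func main() {
	// The worker template from this issue mounts two volumes at "111".
	spec := &corev1.PodSpec{
		Containers: []corev1.Container{{
			Name: "worker",
			VolumeMounts: []corev1.VolumeMount{
				{Name: "ckpt-pv", MountPath: "111"},
				{Name: "saved-model-pv", MountPath: "111"},
			},
		}},
	}
	fmt.Println(validateMountPaths(spec)) // container "worker": duplicate mountPath "111"
}
```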

@tenzen-y
Member

Previously, I tried to introduce CEL validation to the training-operator:

kubeflow/training-operator#1708

However, I gave up on it since it is hard to validate the podTemplate within the cost budget of CEL validations.

kubeflow/training-operator#1708 (comment)

Hence, we must introduce webhooks if we want to validate the podTemplates.

@alculquicondor
Collaborator

You mean that CEL was too slow or what exactly?

@tenzen-y
Member

tenzen-y commented Nov 15, 2023

You mean that CEL was too slow or what exactly?

No, I meant that CEL validation cannot work due to the following error:

Forbidden: contributed to estimated rule cost total exceeding cost limit for entire OpenAPIv3 schema, spec.validation.openAPIV3Schema: Forbidden: x-kubernetes-validations estimated rule cost total for entire OpenAPIv3 schema exceeds budget by factor of more than 100x (try simplifying the rule, or adding maxItems, maxProperties, and maxLength where arrays, maps, and strings are declared)]

This was caused by the cost budget.

@alculquicondor
Collaborator

Oh, so too many validation rules :)

@tenzen-y
Member

Oh, so too many validation rules :)

I guess these overruns happen because replicaSpecs is defined as a map: we cannot cap the number of replicas, so the search depth is unbounded :(

@alculquicondor
Collaborator

Ah, we shot ourselves in the foot by using a map instead of explicit fields.

@congpeiqing
Author

The controller currently requeues the item when there are errors during worker pod creation. It might be problematic to requeue regardless of what the error is. If there are pod spec errors, then it should just fail instead of requeueing repeatedly.

https://github.com/kubeflow/mpi-operator/blob/master/pkg/controller/mpi_job_controller.go#L964

#606
@terrytangyuan PR submitted; it works in our environment.
