
PyTorch and MPI Operator pulls hardcoded initContainer #1696

Closed

MuhammadZeeshan34 opened this issue Dec 2, 2022 · 2 comments

@MuhammadZeeshan34

Problem
When running any PyTorch or MPI operator job on an on-prem cluster, the worker pods' init containers try to pull a hardcoded default image: alpine:3.12 for PyTorch and mpioperator/kubectl-delivery for MPI. Because these jobs run on-prem, they have no access to pull these images from public registries.

In the previous version of the PyTorch operator, there was an option to override the default image used for the init container.

The override also can't be set once in the core training-operator deployment, since different operators require different initContainer images.

Is there any way to override this image for specific operators? The affected worker pods report:

state:
  waiting:
    message: Back-off pulling image "alpine:3.10"
    reason: ImagePullBackOff

In previous versions, when each operator had its own deployment, we were overriding it like this:

spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-operator
      app.kubernetes.io/component: pytorch
      app.kubernetes.io/name: pytorch-operator
      kustomize.component: pytorch-operator
      name: pytorch-operator
  template:
    metadata:
      labels:
        app: pytorch-operator
        app.kubernetes.io/component: pytorch
        app.kubernetes.io/name: pytorch-operator
        kustomize.component: pytorch-operator
        name: pytorch-operator
    spec:
      containers:
      - command:
        - /pytorch-operator.v1
        - --alsologtostderr
        - -v=1
        - --monitoring-port=8443
        - --enable-gang-scheduling=true
        - --init-container-image=<custom-docker-repo>alpine:3.10
@andreyvelich
Member

Thank you for raising this @MuhammadZeeshan34!

For the PyTorchJob and MPIJob you can still use the PyTorchInitContainerImage and MPIKubectlDeliveryImage flags on the Training Operator deployment to modify the InitContainer image.
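
For reference, a minimal sketch of how those overrides could be wired into the training-operator Deployment args. The flag spellings --pytorch-init-container-image and --mpi-kubectl-delivery-image, the container name, and the mirrored image paths under <custom-docker-repo> are assumptions; verify the exact flag names against your operator version's --help before applying:

spec:
  template:
    spec:
      containers:
      - name: training-operator
        args:
        # Assumed flag: image injected as the init container for PyTorchJob worker pods
        - --pytorch-init-container-image=<custom-docker-repo>/alpine:3.10
        # Assumed flag: image injected as the kubectl-delivery init container for MPIJob launchers
        - --mpi-kubectl-delivery-image=<custom-docker-repo>/kubectl-delivery:latest

With the mirrored images supplied this way, newly created PyTorchJob and MPIJob pods should reference the private registry instead of the public defaults.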

@MuhammadZeeshan34
Author

Thanks @andreyvelich. That resolved the problem.
