Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661

Closed
vtlrazin opened this issue Sep 14, 2022 · 0 comments · Fixed by #1668

Hi,
The training operator pod fails to start on an OpenShift cluster (v4.10.30) with the following error:

# oc -n kubeflow get pod
NAME                                 READY   STATUS                 RESTARTS   AGE
training-operator-5cc8cdfdd6-fpthk   0/1     CreateContainerError   0          2m29s

Warning  Failed  113s   kubelet  Error: container create failed: time="2022-09-14T11:35:13Z" level=error msg="runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?)"

System info:
OCP - v4.10.30
Worker node - NVIDIA DGX A100

Proposed solution: increase the resource limits in the training operator deployment to:

        resources:
          limits:
            cpu: 500m
            memory: 300Mi
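
For context, here is a minimal sketch of where these limits would sit in the manifest. The container name and the surrounding field layout are assumptions based on the standard Kubernetes pod template schema, not copied from the project's manifest; only the two limit values come from this report.

```yaml
# Sketch only: fragment of the training operator workload manifest.
spec:
  template:
    spec:
      containers:
        - name: training-operator   # assumed container name, check the actual manifest
          resources:
            limits:
              cpu: 500m             # value proposed in this issue
              memory: 300Mi         # value proposed in this issue
```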

Best regards,
Vitaliy
