Infinitely looping init-pytorch container may eventually exceed its memory limit #1734

Closed
arlofaria opened this issue Jan 21, 2023 · 6 comments · Fixed by #1756
@arlofaria

arlofaria commented Jan 21, 2023

The init-pytorch container generally uses very little memory and is expected to run for a relatively short time. However, in a failure condition where the master pod never becomes reachable, this container will loop indefinitely ... and, depending on your host OS, it may eventually be killed, leaving the worker pod in an Init:OOMKilled status -- which is a bit confusing.

The root cause is that this simple loop slowly increases its memory consumption, possibly due to a memory leak or, perhaps more likely, OS-level memory fragmentation, and may eventually exceed the hard-coded 20Mi memory limit (see line 42 below).

- name: init-pytorch
  image: {{.InitContainerImage}}
  imagePullPolicy: IfNotPresent
  resources:
    limits:
      cpu: 100m
      memory: 20Mi
    requests:
      cpu: 50m
      memory: 10Mi
  command: ['sh', '-c', 'until nslookup {{.MasterAddr}}; do echo waiting for master; sleep 2; done;']

One approach would be to simply increase the memory limit. However, it might also be nice if that loop could be exited with failure status after a certain timeout interval has elapsed.

@johnugeorge
Member

johnugeorge commented Jan 21, 2023

@arlofaria Thanks for creating the issue. Can you contribute a fix? You could make the timeout interval a configurable parameter.

@arlofaria
Author

Hi @johnugeorge, thanks for looking at this!

Unfortunately, my current employer's policies make it very cumbersome for me to contribute to open-source projects -- and I'm not particularly familiar with this codebase or how to properly test it either.

However, I would think that a simple fix would be to modify pkg/config/config.go to add something like PyTorchInitContainerMaxTries ... and then adjust the logic of the one-line command in pytorch/initcontainer.go to use something like a finite loop (e.g. for i in $(seq {{ .InitContainerMaxTries }}); do ...) and then make sure to break out of the loop with a successful exit 0 if the master node is found, or otherwise exit 1 after the finite loop is completed.
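For illustration, a minimal sketch of what that bounded loop might look like (assuming a hypothetical .InitContainerMaxTries template parameter; the exact name and how it gets plumbed through pkg/config/config.go would be up to the maintainers):

  command: ['sh', '-c', 'err=1; for i in $(seq {{.InitContainerMaxTries}}); do if nslookup {{.MasterAddr}}; then err=0; break; fi; echo waiting for master; sleep 2; done; exit $err']

This keeps the existing behavior on success (exit 0 as soon as the master resolves) while guaranteeing the container fails with exit 1 after a bounded number of attempts instead of looping forever.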

You could also do something like configuring a PyTorchInitContainerTimeout, which would have somewhat more complex logic ... unless you changed the loop to sleep 1 instead of sleep 2, and then it'd be approximately the same as PyTorchInitContainerMaxTries.
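If a wall-clock timeout is preferred, one possible (untested) variant that relies only on date and test, which a typical busybox-style init image provides, could use a hypothetical .InitContainerTimeoutSeconds parameter:

  command: ['sh', '-c', 'start=$(date +%s); until nslookup {{.MasterAddr}}; do [ $(( $(date +%s) - $start )) -ge {{.InitContainerTimeoutSeconds}} ] && exit 1; echo waiting for master; sleep 2; done']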

Hope this helps, and sorry that I can't contribute more directly 😞

@tenzen-y
Member

/kind feature

@AxeZhan
Contributor

AxeZhan commented Feb 8, 2023

I just recently learned about kubeflow and am very interested in it. I'm looking for an opportunity to contribute, and this looks like a good one to start with. I'd like to give it a try.

@tenzen-y
Member

tenzen-y commented Feb 8, 2023

Hi, @kidddddddddddddddddddddd. Welcome to the kubeflow community.

Once you are ready to work on this issue, feel free to assign yourself to this issue with /assign, the same as k/k.

@AxeZhan
Contributor

AxeZhan commented Feb 9, 2023

/assign
Will start working on this weekend.
