Infinitely looping init-pytorch container may eventually exceed its memory limit #1734
Comments
@arlofaria Thanks for creating the issue. Could you add a fix? You can keep the timeout interval as a configurable parameter.
Hi @johnugeorge, thanks for looking at this! Unfortunately, my current employer's policies make it very cumbersome for me to contribute to open-source projects -- and I'm not particularly familiar with this codebase or how to properly test it either. However, I would think that a simple fix would be to modify […]. You could also do something like configuring a […]. Hope this helps, and sorry that I can't contribute more directly 😞
/kind feature
I just recently learned about Kubeflow and am very interested in it. I'm looking for opportunities to contribute, and this looks like a good one to start with. I'd like to have a try.
Hi @kidddddddddddddddddddddd, welcome to the Kubeflow community. Once you are ready to work on this issue, feel free to assign yourself to it with /assign.
/assign
Issue description

The init-pytorch container generally uses very little memory and is expected to run for a relatively short time. However, in a failure condition where the master pod will never be reachable, this container will loop forever ... until, depending on your host OS, it may eventually terminate and result in a worker pod status of Init:OOMKilled -- which is a bit confusing.

The root cause is that this simple loop can slowly increase its memory consumption, possibly due to a memory leak or perhaps more likely due to OS-level memory fragmentation, and may eventually exceed the hard-coded 20Mi memory limit (see line 42 of the file referenced below).

Referenced code: training-operator/pkg/controller.v1/pytorch/initcontainer.go, lines 36 to 46 at commit 54a408f.
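For context, the relevant template in that file looks roughly like the sketch below. This is a reconstruction, not a verbatim copy of commit 54a408f, so surrounding code and exact field values may differ; the parts the issue hinges on are the hard-coded 20Mi memory limit and the unbounded nslookup loop.

```go
// Approximate contents of pkg/controller.v1/pytorch/initcontainer.go around
// the referenced lines: the init container template with its hard-coded
// resource limits and the DNS-lookup loop that waits for the master pod's
// service name to become resolvable.
var initContainerTemplate = `
- name: init-pytorch
  image: {{.InitContainerImage}}
  command: ['sh', '-c', 'until nslookup {{.MasterAddr}}; do echo waiting for master; sleep 2; done;']
  resources:
    limits:
      cpu: 100m
      memory: 20Mi`
```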
One approach would be to simply increase the memory limit. However, it might also be nice if that loop could be exited with failure status after a certain timeout interval has elapsed.
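A minimal sketch of that second idea, assuming the template shown above: give the loop a bounded retry budget so the init container exits with a failure status once the budget is exhausted, instead of looping until it is OOM-killed. The variable name and the 300-retry budget (about ten minutes at a 2-second sleep) are illustrative assumptions, not the project's actual code; making the budget configurable, as suggested in the comments, would mean passing it through the template data alongside InitContainerImage and MasterAddr.

```go
// Sketch only, not the project's actual template: a bounded variant of the
// wait loop. The shell script counts retries and gives up after 300 attempts,
// exiting non-zero so the failure surfaces as an init-container error rather
// than an eventual OOM kill.
var boundedInitContainerTemplate = `
- name: init-pytorch
  image: {{.InitContainerImage}}
  command:
    - sh
    - -c
    - |
      retries=0
      until nslookup {{.MasterAddr}}; do
        retries=$((retries + 1))
        if [ "$retries" -ge 300 ]; then
          echo "timed out waiting for master" >&2
          exit 1
        fi
        echo waiting for master
        sleep 2
      done
  resources:
    limits:
      cpu: 100m
      memory: 20Mi`
```

Raising the 20Mi limit would also avoid the OOM kill, but bounding the loop addresses the confusing Init:OOMKilled status directly by turning a hung job into an explicit init failure.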