Infinitely looping init-pytorch container may eventually exceed its memory limit #1734

Closed
arlofaria opened this issue Jan 21, 2023 · 6 comments · Fixed by #1756
@arlofaria

arlofaria commented Jan 21, 2023

The init-pytorch container generally uses very little memory and is expected to run for a relatively short time. However, in a failure condition where the master pod never becomes reachable, this container will loop indefinitely ... and, depending on your host OS, it may eventually be killed, leaving the worker pod in an Init:OOMKilled status -- which is a bit confusing.

The root cause is that this simple loop slowly increases its memory consumption, possibly due to a memory leak or, perhaps more likely, OS-level memory fragmentation, and may eventually exceed the hard-coded 20Mi memory limit (see line 42 below).

- name: init-pytorch
  image: {{.InitContainerImage}}
  imagePullPolicy: IfNotPresent
  resources:
    limits:
      cpu: 100m
      memory: 20Mi
    requests:
      cpu: 50m
      memory: 10Mi
  command: ['sh', '-c', 'until nslookup {{.MasterAddr}}; do echo waiting for master; sleep 2; done;']

One approach would be to simply increase the memory limit. However, it might also be nice if that loop could be exited with failure status after a certain timeout interval has elapsed.

@johnugeorge
Member

johnugeorge commented Jan 21, 2023

@arlofaria Thanks for creating the issue. Can you contribute a fix? You could make the timeout interval a configurable parameter.

@arlofaria
Author

Hi @johnugeorge, thanks for looking at this!

Unfortunately, my current employer's policies make it very cumbersome for me to contribute to open-source projects -- and I'm not particularly familiar with this codebase or how to properly test it either.

However, I would think that a simple fix would be to modify pkg/config/config.go to add something like PyTorchInitContainerMaxTries ... and then adjust the logic of the one-line command in pytorch/initcontainer.go to use something like a finite loop (e.g. for i in $(seq {{ .InitContainerMaxTries }}); do ...) and then make sure to break out of the loop with a successful exit 0 if the master node is found, or otherwise exit 1 after the finite loop is completed.
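For illustration, a minimal sketch of what that bounded loop might look like (assuming a hypothetical .InitContainerMaxTries template parameter; the exact name and how it gets plumbed through pkg/config/config.go would be up to the maintainers):

  command: ['sh', '-c', 'err=1; for i in $(seq {{.InitContainerMaxTries}}); do if nslookup {{.MasterAddr}}; then err=0; break; fi; echo waiting for master; sleep 2; done; exit $err']

This keeps the existing behavior on success (exit 0 as soon as the master resolves) while guaranteeing the container fails with exit 1 after a bounded number of attempts instead of looping forever.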

You could also do something like configuring a PyTorchInitContainerTimeout, which would have somewhat more complex logic ... unless you changed the loop to sleep 1 instead of sleep 2, and then it'd be approximately the same as PyTorchInitContainerMaxTries.
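If a wall-clock timeout is preferred, one possible (untested) variant that relies only on date and test, which a typical busybox-style init image provides, could use a hypothetical .InitContainerTimeoutSeconds parameter:

  command: ['sh', '-c', 'start=$(date +%s); until nslookup {{.MasterAddr}}; do [ $(( $(date +%s) - $start )) -ge {{.InitContainerTimeoutSeconds}} ] && exit 1; echo waiting for master; sleep 2; done']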

Hope this helps, and sorry that I can't contribute more directly 😞

@tenzen-y
Member

/kind feature

@AxeZhan
Contributor

AxeZhan commented Feb 8, 2023

I just recently learned about kubeflow and am very interested in it. I'm looking for an opportunity to contribute, and this looks like a good one to start with. I'd like to give it a try.

@tenzen-y
Member

tenzen-y commented Feb 8, 2023

Hi, @kidddddddddddddddddddddd. Welcome to the kubeflow community.

Once you are ready to work on this issue, feel free to assign yourself to this issue with /assign, the same as k/k.

@AxeZhan
Contributor

AxeZhan commented Feb 9, 2023

/assign
Will start working on this weekend.
