[BUG] Controller tries to place pods on nodes with insufficient resources #1005

Open

patrick-vonsteht opened this issue Mar 27, 2024 · 1 comment

Labels
bug Something isn't working

@patrick-vonsteht
Contributor

Version of Eraser

v1.3.1

Expected Behavior

The controller should skip nodes that don't have enough free resources for the eraser pods. There's a check implemented in the code for this: https://github.com/eraser-dev/eraser/blob/2ea877ca8ac933cc7b233f3dd123d67754d476f5/controllers/imagejob/imagejob_controller.go#L418C3-L418C49
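
For illustration only (this is not the actual eraser code), such a fit check boils down to comparing the eraser pod's CPU request against the node's allocatable CPU minus what is already requested on it:

// Illustrative only, not eraser's implementation: a node passes the CPU fit
// check only if the pod's CPU request fits into allocatable minus the CPU
// already requested by pods on the node. The package name is made up.
package fitcheck

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

func podFitsNodeCPU(pod *corev1.Pod, nodeInfo *framework.NodeInfo) bool {
	var reqMilliCPU int64
	for _, c := range pod.Spec.Containers {
		reqMilliCPU += c.Resources.Requests.Cpu().MilliValue()
	}
	freeMilliCPU := nodeInfo.Allocatable.MilliCPU - nodeInfo.Requested.MilliCPU
	return reqMilliCPU <= freeMilliCPU
}

If nodeInfo.Requested is never populated, a check of this shape sees the entire allocatable CPU as free and accepts every node.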

Actual Behavior

The controller tries to place pods on nodes with insufficient resources. These pods then fail to start and remain in a status such as OutOfcpu.

So far I have only verified this behavior for CPU, but looking at the code I expect other resources, such as memory, to be affected as well.

Steps To Reproduce

  1. Fill up one of the nodes in your Kubernetes cluster so that all of its allocatable CPU is requested (one way to do this is sketched after this list).
  2. Deploy and run eraser.
  3. You'll see that:
    a. There's an eraser pod on the full node with status OutOfcpu.
    b. The eraser-controller-manager's log does not contain the expected "pod does not fit on node, skipping" message.
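
For reference, here is a rough client-go sketch of step 1 (every name in it, such as cpu-filler, is made up for illustration, and terminated pods and init containers are ignored for brevity): it creates a pod pinned to the target node that requests whatever CPU is still unreserved, leaving the node with no requestable CPU.

// Rough sketch for step 1. All names (cpu-filler, pause, namespace) are
// illustrative; terminated pods and init containers are ignored for brevity.
package main

import (
	"context"
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	nodeName := os.Args[1]
	ctx := context.Background()

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Sum the CPU requests of the pods already bound to the node.
	pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}
	requested := resource.NewMilliQuantity(0, resource.DecimalSI)
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
			continue // terminated pods no longer count against the node
		}
		for _, c := range p.Spec.Containers {
			if cpu, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
				requested.Add(cpu)
			}
		}
	}

	// Request exactly the remaining allocatable CPU.
	alloc := node.Status.Allocatable[corev1.ResourceCPU]
	remaining := alloc.DeepCopy()
	remaining.Sub(*requested)
	fmt.Println("filling remaining CPU:", remaining.String())

	filler := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cpu-filler", Namespace: "default"},
		Spec: corev1.PodSpec{
			// Pinning the pod to the node bypasses the scheduler.
			NodeName: nodeName,
			Containers: []corev1.Container{{
				Name:  "pause",
				Image: "registry.k8s.io/pause:3.9",
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{corev1.ResourceCPU: remaining},
				},
			}},
		},
	}
	if _, err := clientset.CoreV1().Pods("default").Create(ctx, filler, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}

With the node saturated this way, step 2 should reproduce the OutOfcpu pod described in step 3.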

Are you willing to submit PRs to contribute to this bug fix?

  • Yes, I am willing to implement it.
@patrick-vonsteht added the bug label on Mar 27, 2024
@patrick-vonsteht
Contributor Author

Here's my first analysis of this:

By adding some additional logging, I found out that the field nodeInfo.Requested.MilliCPU, which is used in the check for insufficient resources, always has a zero value.

The reason for this seems to be that the controller creates the nodeInfo object with the following code:

// Note: no pods are passed to NewNodeInfo here, so nodeInfo.Requested is never populated.
nodeInfo := framework.NewNodeInfo()
nodeInfo.SetNode(node)

But looking at the code of the NodeInfo type (https://github.com/kubernetes/kubernetes/blob/3cd242c51317aed8858119529ccab22079f523b1/pkg/scheduler/framework/types.go#L543), I see that we need to pass it a list of pods; only then will it sum up the pods' resource requests to fill nodeInfo.Requested and the other fields in its update() function (https://github.com/kubernetes/kubernetes/blob/3cd242c51317aed8858119529ccab22079f523b1/pkg/scheduler/framework/types.go#L697).

I propose to add some code to list all the pods that belong to each node and then pass this list of pods to framework.NewNodeInfo(). Then, nodeInfo.Requested should be filled correctly and the check should work as expected.
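
A rough sketch of what that could look like, assuming a controller-runtime client is available in the reconciler (the helper name buildNodeInfo and the package name are made up, and listing pods by spec.nodeName through the cached client requires a field index to be registered with the manager):

// Rough sketch of the proposed fix. buildNodeInfo and the package name are
// hypothetical; filtering on spec.nodeName via the cached client needs a
// field index, otherwise a field selector against the API server can be used.
package imagejob

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// buildNodeInfo lists the pods bound to the node and passes them to
// NewNodeInfo, so that nodeInfo.Requested is populated from their resource
// requests before the fit check runs.
func buildNodeInfo(ctx context.Context, c client.Client, node *corev1.Node) (*framework.NodeInfo, error) {
	var podList corev1.PodList
	if err := c.List(ctx, &podList, client.MatchingFields{"spec.nodeName": node.Name}); err != nil {
		return nil, err
	}

	pods := make([]*corev1.Pod, 0, len(podList.Items))
	for i := range podList.Items {
		pods = append(pods, &podList.Items[i])
	}

	nodeInfo := framework.NewNodeInfo(pods...)
	nodeInfo.SetNode(node)
	return nodeInfo, nil
}

The existing nodeInfo.SetNode(node) call stays as it is; the only difference is that the NodeInfo now starts from the node's current pods, so the insufficient-resources check compares against what is actually requested on the node.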
