[BUG] Controller tries to place pods on nodes with insufficient resources #1005

Open

patrick-vonsteht opened this issue Mar 27, 2024 · 1 comment

Labels
bug Something isn't working

@patrick-vonsteht
Contributor

Version of Eraser

v1.3.1

Expected Behavior

The controller should skip nodes that don't have enough free resources for the eraser pods. There's a check implemented in the code for this: https://github.com/eraser-dev/eraser/blob/2ea877ca8ac933cc7b233f3dd123d67754d476f5/controllers/imagejob/imagejob_controller.go#L418C3-L418C49
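
For illustration only (this is not the actual eraser code), such a fit check boils down to comparing the eraser pod's CPU request against the node's allocatable CPU minus what is already requested on it:

// Illustrative only, not eraser's implementation: a node passes the CPU fit
// check only if the pod's CPU request fits into allocatable minus the CPU
// already requested by pods on the node. The package name is made up.
package fitcheck

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

func podFitsNodeCPU(pod *corev1.Pod, nodeInfo *framework.NodeInfo) bool {
	var reqMilliCPU int64
	for _, c := range pod.Spec.Containers {
		reqMilliCPU += c.Resources.Requests.Cpu().MilliValue()
	}
	freeMilliCPU := nodeInfo.Allocatable.MilliCPU - nodeInfo.Requested.MilliCPU
	return reqMilliCPU <= freeMilliCPU
}

If nodeInfo.Requested is never populated, a check of this shape sees the entire allocatable CPU as free and accepts every node.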

Actual Behavior

The controller tries to place pods on nodes with insufficient resources. These pods then fail to start and remain in a status such as OutOfcpu.

So far I have only verified this behavior for CPU, but looking at the code I expect other resources, such as memory, to be affected as well.

Steps To Reproduce

  1. Fill up one of the nodes in your Kubernetes cluster so that all of its allocatable CPU is requested (one way to do this is sketched after this list).
  2. Deploy and run eraser.
  3. You'll see that:
    a. There's an eraser pod on the full node with status OutOfcpu.
    b. The eraser-controller-manager's log does not contain the expected "pod does not fit on node, skipping" message.
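
For reference, here is a rough client-go sketch of step 1 (every name in it, such as cpu-filler, is made up for illustration, and terminated pods and init containers are ignored for brevity): it creates a pod pinned to the target node that requests whatever CPU is still unreserved, leaving the node with no requestable CPU.

// Rough sketch for step 1. All names (cpu-filler, pause, namespace) are
// illustrative; terminated pods and init containers are ignored for brevity.
package main

import (
	"context"
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	nodeName := os.Args[1]
	ctx := context.Background()

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Sum the CPU requests of the pods already bound to the node.
	pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}
	requested := resource.NewMilliQuantity(0, resource.DecimalSI)
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
			continue // terminated pods no longer count against the node
		}
		for _, c := range p.Spec.Containers {
			if cpu, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
				requested.Add(cpu)
			}
		}
	}

	// Request exactly the remaining allocatable CPU.
	alloc := node.Status.Allocatable[corev1.ResourceCPU]
	remaining := alloc.DeepCopy()
	remaining.Sub(*requested)
	fmt.Println("filling remaining CPU:", remaining.String())

	filler := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cpu-filler", Namespace: "default"},
		Spec: corev1.PodSpec{
			// Pinning the pod to the node bypasses the scheduler.
			NodeName: nodeName,
			Containers: []corev1.Container{{
				Name:  "pause",
				Image: "registry.k8s.io/pause:3.9",
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{corev1.ResourceCPU: remaining},
				},
			}},
		},
	}
	if _, err := clientset.CoreV1().Pods("default").Create(ctx, filler, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}

With the node saturated this way, step 2 should reproduce the OutOfcpu pod described in step 3.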

Are you willing to submit PRs to contribute to this bug fix?

  • Yes, I am willing to implement it.
@patrick-vonsteht added the bug label on Mar 27, 2024
@patrick-vonsteht
Contributor Author

Here's my first analysis of this:

By adding some additional logging, I found out that the field nodeInfo.Requested.MilliCPU, which is used in the check for insufficient resources, always has a zero value.

The reason for this seems to be that the controller creates the nodeInfo object with the following code:

// Note: no pods are passed to NewNodeInfo here, so nodeInfo.Requested is never populated.
nodeInfo := framework.NewNodeInfo()
nodeInfo.SetNode(node)

But looking at the code of the NodeInfo type (https://github.com/kubernetes/kubernetes/blob/3cd242c51317aed8858119529ccab22079f523b1/pkg/scheduler/framework/types.go#L543), I see that we need to pass it a list of pods; only then will it sum up the pods' resource requests to fill nodeInfo.Requested and the other fields in its update() function (https://github.com/kubernetes/kubernetes/blob/3cd242c51317aed8858119529ccab22079f523b1/pkg/scheduler/framework/types.go#L697).

I propose to add some code to list all the pods that belong to each node and then pass this list of pods to framework.NewNodeInfo(). Then, nodeInfo.Requested should be filled correctly and the check should work as expected.
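
A rough sketch of what that could look like, assuming a controller-runtime client is available in the reconciler (the helper name buildNodeInfo and the package name are made up, and listing pods by spec.nodeName through the cached client requires a field index to be registered with the manager):

// Rough sketch of the proposed fix. buildNodeInfo and the package name are
// hypothetical; filtering on spec.nodeName via the cached client needs a
// field index, otherwise a field selector against the API server can be used.
package imagejob

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// buildNodeInfo lists the pods bound to the node and passes them to
// NewNodeInfo, so that nodeInfo.Requested is populated from their resource
// requests before the fit check runs.
func buildNodeInfo(ctx context.Context, c client.Client, node *corev1.Node) (*framework.NodeInfo, error) {
	var podList corev1.PodList
	if err := c.List(ctx, &podList, client.MatchingFields{"spec.nodeName": node.Name}); err != nil {
		return nil, err
	}

	pods := make([]*corev1.Pod, 0, len(podList.Items))
	for i := range podList.Items {
		pods = append(pods, &podList.Items[i])
	}

	nodeInfo := framework.NewNodeInfo(pods...)
	nodeInfo.SetNode(node)
	return nodeInfo, nil
}

The existing nodeInfo.SetNode(node) call stays as it is; the only difference is that the NodeInfo now starts from the node's current pods, so the insufficient-resources check compares against what is actually requested on the node.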
