Prepare pod evictor for the descheduling framework plugin #846
Conversation
Force-pushed from 21acbc7 to a024e0c (Compare)
CC @knelasevero
/retest
Force-pushed from 7fda69f to 9b38519 (Compare)
klog.ErrorS(err, "Error evicting pod", "pod", klog.KObj(pod))
break
}
podEvictor.EvictPod(ctx, pod)
It appears that functionality is being changed here? In other words, before it would break on a single "failed" eviction but now it loops through all pods.
Should it be preserved?
Suggested change:
-	podEvictor.EvictPod(ctx, pod)
+	if !podEvictor.EvictPod(ctx, pod) {
+		break
+	}
In the past there was only a single error issued (exceeding the limit on the number of pods allowed to be evicted per node), so it made sense to break and move to another node. However, a few months back we added a limit for the number of pods evicted per namespace. So when the namespace limit gets exceeded we can still continue evicting.
Though, I plan to introduce a check for when the node limit is exceeded. I will incorporate it in this PR once #847 is merged. I am still making changes to figure out the smallest amount of refactoring needed so I can move the evictor filter bits into a plugin. Once that's done we can start moving the strategies into plugins.
Just introduced a NodeLimitExceeded method for performing the check. #847 needs to be merged first.
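As a rough sketch of how a strategy loop might use that check once #847 lands (the NodeLimitExceeded signature below is assumed for illustration, and podsOnNode/node are placeholders for the strategy's local variables, not the final API):

for _, pod := range podsOnNode {
	if podEvictor.EvictPod(ctx, pod) {
		continue
	}
	// Eviction did not happen. If the per-node limit was hit there is no
	// point in trying more pods on this node; otherwise (e.g. a per-namespace
	// limit) keep going with the next pod on the same node.
	if podEvictor.NodeLimitExceeded(node) {
		break
	}
}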
Force-pushed from 9b38519 to 8874ce6 (Compare)
Rebasing on top of #847
Force-pushed from 8874ce6 to f47cfe4 (Compare)
reason = ctx.Value("evictionReason").(string)
}

if pod.Spec.NodeName == "" {
The PodLifeTime strategy allows Pods in the Pending state. I assume there could be a state where the Pod has not been scheduled yet (or cannot be) but needs to be evicted for a retry?
/b2418ef481298c6caf185a5f88dd0bb6ddc1cdbf/pkg/descheduler/strategies/pod_lifetime.go#L42
I will alter the code to take this case into account. Thanks for noticing!
On second thought... after scanning the code, it seems that it only lists Pods on Nodes. We should probably discuss changing that, but not as part of this PR.
We might remove the condition in https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/pod/pods.go#L129-L131 to have pending pods included in the getPodsAssignedToNode function, treating an empty nodeName as a special case.
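Purely as an illustration of that idea (a hypothetical filter sketch, not the actual change to pods.go), the special case could look roughly like this:

// Hypothetical sketch: keep pending pods (empty .spec.nodeName) in the
// listing instead of filtering them out when indexing pods by node.
func includePod(pod *v1.Pod, nodeName string) bool {
	if pod.Spec.NodeName == "" {
		// Pending pod: treated as a special case rather than dropped.
		return true
	}
	return pod.Spec.NodeName == nodeName
}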
The method uses the node object only to get the node name, and the node name can be retrieved from the pod object instead. Some strategies might try to evict a pod in Pending state which does not have the .spec.nodeName field set; in that case the check for the node limit is skipped.
When an error is returned, a strategy either stops completely or starts processing another node. Given the error can be transient, or only one of the limits may have been exceeded, it is fairer to just skip a pod that failed eviction and proceed to the next one instead. To optimize the processing and stop earlier, it is more practical to implement a check which reports when a limit has been exceeded.
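A minimal sketch of what that can look like inside EvictPod, assuming placeholder helper names (nodeLimitReached, namespaceLimitReached, the private evictPod call, pe.client, pe.policyGroupVersion) rather than the actual implementation:

func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod) bool {
	// A pod in Pending state has no .spec.nodeName, so the per-node limit
	// check is skipped for it.
	if pod.Spec.NodeName != "" && pe.nodeLimitReached(pod.Spec.NodeName) {
		return false
	}
	if pe.namespaceLimitReached(pod.Namespace) {
		return false
	}
	// Delegate the actual API call to the private helper; the caller decides
	// whether a false return means "skip this pod" or "stop processing the node".
	if err := evictPod(ctx, pe.client, pod, pe.policyGroupVersion); err != nil {
		klog.ErrorS(err, "Error evicting pod", "pod", klog.KObj(pod))
		return false
	}
	return true
}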
Force-pushed from f47cfe4 to c838614 (Compare)
// EvictPod evicts a pod while exercising eviction limits.
// Returns true when the pod is evicted on the server side.
// Eviction reason can be set through the ctx's evictionReason:STRING pair
func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod) bool {
I'm curious to know why the return signature was changed to not return the error.
Previously, we had the flexibility to return an error for namespace limit reached, node limit reached, and the potential for some other limit in the future. But now it's up to each strategy to handle limit exceeded logic.
Is it because each plugin should now be responsible for deciding if it should continue/abort?
Is it because each plugin should now be responsible for deciding if it should continue/abort?
There are several limits now. Not all of them require stopping the processing of the current node. E.g. when a namespace limit is reached, a plugin can continue processing another pod on the same node. So yes, a plugin is allowed to decide whether the right course of action is to stop processing a node and continue with another one, or to just skip a pod and take another on the same node.
// EvictPod evicts a pod while exercising eviction limits.
// Returns true when the pod is evicted on the server side.
// Eviction reason can be set through the ctx's evictionReason:STRING pair
func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod) bool {
Can we continue to leave the v1.Node parameter? See #859 for use-case
Replied in #859 (comment)
/lgtm
@@ -74,6 +74,7 @@ func RemovePodsViolatingNodeAffinity(ctx context.Context, client clientset.Inter

switch nodeAffinity {
case "requiredDuringSchedulingIgnoredDuringExecution":
lgtm actually, but I just wonder why this name, requiredDuringSchedulingIgnoredDuringExecution, is so long...
Chosen by the designers: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity. There's probably a historical reason for it.
Got it. Looks good to me now~
/ok-to-test
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: a7i
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -302,7 +302,7 @@ func RunDeschedulerStrategies(ctx context.Context, rs *options.DeschedulerServer
continue
}
evictorFilter := evictions.NewEvictorFilter(nodes, getPodsAssignedToNode, evictLocalStoragePods, evictSystemCriticalPods, ignorePvcPods, evictBarePods, evictions.WithNodeFit(nodeFit), evictions.WithPriorityThreshold(thresholdPriority))
f(ctx, rs.Client, strategy, nodes, podEvictor, evictorFilter, getPodsAssignedToNode)
f(context.WithValue(ctx, "strategyName", string(name)), rs.Client, strategy, nodes, podEvictor, evictorFilter, getPodsAssignedToNode)
Is this intended to be the final approach, or just a work-in-progress step? (passing values through the context)
Passing some values through the context is the final approach, at least the strategyName, which will get changed into pluginName. The context key can then be read and used in other places.
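For example, reading those values back out of the context (a sketch only; the keys match the diff above, everything else is illustrative):

strategy := ""
if v, ok := ctx.Value("strategyName").(string); ok {
	strategy = v
}
reason := ""
if v, ok := ctx.Value("evictionReason").(string); ok {
	reason = v
}
// The values are only used for logging/metrics, e.g.:
klog.V(1).InfoS("Evicted pod", "pod", klog.KObj(pod), "reason", reason, "strategy", strategy)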
is there a reason for doing that rather than adding a new parameter to the functions that use it? Looks like this is just being used to pass the strategy name and reason to EvictPod, is that right?
I think a better approach would be to define an "options" struct to pass as an optional param to EvictPod. That struct can have strategy and reasons fields to start with, and if we decide to add more options in the future then we don't need to change EvictPod's signature.
What do you think about something like that? As this is, I don't think it's an appropriate use of context.
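Something roughly like this (a sketch of the suggestion only; the EvictOptions name and its fields are hypothetical, not an agreed API):

// EvictOptions carries optional, informational eviction metadata.
type EvictOptions struct {
	// StrategyName identifies the strategy/plugin requesting the eviction.
	StrategyName string
	// Reason is a human-readable explanation recorded with the eviction.
	Reason string
}

func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod, opts EvictOptions) bool {
	// ... eviction logic unchanged; opts only feeds logging and metrics ...
	return true
}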
that said, I don't need to block on this right now since this PR has been open for a while (and that's my bad for not getting around to reviewing until now). If my idea sounds good I'll take it as a follow up refactor and we can unhold this PR
We might as well move the strategy name out of EvictPod. It's used mostly for metrics, plus two lines of logging which can be logged without the strategy name (the same goes for the reason). Instead, both the strategy name and the reason (if there is any) can be printed outside of EvictPod through a wrapper which will get introduced after the framework primitives are implemented.
My reasoning is to use EvictPod only for the actual eviction. The method does not need to know anything about the strategy/plugin/reason/etc. in order to evict a pod. We can have higher-level invokers log the additional information.
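i.e. something like a thin wrapper around EvictPod (a hypothetical sketch; the helper name and parameters are made up for illustration):

// evictWithContext logs the plugin name and reason at a higher level,
// while EvictPod itself only performs the actual eviction.
func evictWithContext(ctx context.Context, pe *evictions.PodEvictor, pod *v1.Pod, pluginName, reason string) bool {
	evicted := pe.EvictPod(ctx, pod)
	if evicted {
		klog.V(1).InfoS("Evicted pod", "pod", klog.KObj(pod), "plugin", pluginName, "reason", reason)
	}
	return evicted
}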
If my idea sounds good I'll take it as a follow up refactor and we can unhold this PR
+1 for refactoring the code more in the follow up PR.
I think the idea of passing additional info/options to EvictPod makes sense. It's already a wrapper for a private evictPod function (not sure why though), but as the public interface it seems like a good spot to expose a handle for customization/logging/metrics.
/hold
Unholding based on #846 (comment)
Prerequisite for #837