Why does leader re-election only happen after the default 5-minute timeout when a worker node fails? #3498
Labels
triage/unresolved
Milestone
Feature Request
Is your feature request related to a problem? Please describe.
From /pkg/leader/leader.go, when a worker node fails, leader re-election only happens after the default 5-minute timeout, because the leader lock is released only once the condition
Pod.Status.Phase == "Failed" && Pod.Status.Reason == "Evicted" is met.
In my opinion, leader re-election could happen almost immediately if the condition also checked the status of the node where the leader pod is running.
Describe the solution you'd like
A clear and concise description of what you want to happen. Add any considered drawbacks.
The condition should change from [Pod.Status.Phase == "Failed" && Pod.Status.Reason == "Evicted"] to [(Pod.Status.Phase == "Failed" && Pod.Status.Reason == "Evicted") || Node.Status == "NotReady"].
Making --pod-eviction-timeout shorter could be another approach. However, I am sure the approach above is more reliable, since we do not know an appropriate timeout in advance.
One more question: what drawbacks are there to making --pod-eviction-timeout very short?