Why leader re-election should work after the default timeout 5-min when a worker node is failed? #3498

HyungJune · 2020-07-23T05:23:47Z

Feature Request

Is your feature request related to a problem? Please describe.

According to /pkg/leader/leader.go, when a worker node fails, leader re-election only happens after the default 5-minute timeout, because the condition being checked is
Pod.Status.Phase == "Failed" && Pod.Status.Reason == "Evicted".
I think leader re-election could happen almost immediately if the condition also checked the status of the node where the leader pod is running.

Describe the solution you'd like

The condition should change from [Pod.Status.Phase == "Failed" && Pod.Status.Reason == "Evicted"] to [Pod.Status.Phase == "Failed" && Pod.Status.Reason == "Evicted" || Node.Status == "Not Ready"].
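To make the proposed condition concrete, here is a minimal sketch in Go. It is not the code in leader.go: the `Pod` and `Node` structs and the `leaderIsStale` function are hypothetical, simplified stand-ins (real code would use the Kubernetes API types), and the `Ready` field is assumed to mirror the node's Ready condition.

```go
package main

import "fmt"

// Pod and Node are hypothetical, simplified stand-ins for the Kubernetes
// objects. They only carry the fields the proposed condition looks at.
type Pod struct {
	Phase    string // e.g. "Running" or "Failed"
	Reason   string // e.g. "Evicted"
	NodeName string // name of the node the pod is scheduled on
}

type Node struct {
	Name  string
	Ready bool // assumption: mirrors the node's Ready condition
}

// leaderIsStale reports whether the current leader pod can be treated as
// gone. It keeps the existing "Failed and Evicted" check and additionally
// returns true when the node hosting the leader is known and not ready.
func leaderIsStale(pod Pod, node *Node) bool {
	evicted := pod.Phase == "Failed" && pod.Reason == "Evicted"
	nodeNotReady := node != nil && !node.Ready
	return evicted || nodeNotReady
}

func main() {
	running := Pod{Phase: "Running", NodeName: "worker-1"}

	// Healthy node: the leader is kept.
	fmt.Println(leaderIsStale(running, &Node{Name: "worker-1", Ready: true}))

	// Node not ready: the leader is treated as stale without waiting for eviction.
	fmt.Println(leaderIsStale(running, &Node{Name: "worker-1", Ready: false}))

	// Existing behaviour: an evicted pod is stale even if the node is unknown.
	fmt.Println(leaderIsStale(Pod{Phase: "Failed", Reason: "Evicted"}, nil))
}
```

In this sketch the node check is an additional trigger, so the existing eviction path keeps working when the node lookup fails or returns nothing.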

Shortening --pod-eviction-timeout is another approach. However, I am sure the approach above is more reliable, since we don't know what an appropriate timeout would be.

I have one more question:
what drawbacks are there to making --pod-eviction-timeout very short?

estroz · 2020-07-23T15:36:24Z

@HyungJune given the placement of the condition you're referring to, you seem to be suggesting to look up the node the leader is on, and if it isn't ready then delete the pod on that node? That doesn't make sense to me, since the pod won't even exist if the node isn't ready. If that isn't what you meant, can you elaborate on your solution?

Also it's worth taking a look at #784, which discusses some of what you're talking about.


HyungJune · 2020-07-31T01:22:02Z

@estroz While I was writing up my solution, the package containing leader.go was moved out of the master branch into the "operator-lib" repository.
Should I create a new issue (the same issue) and open a pull request in the "operator-lib" repository?

HyungJune · 2020-08-06T06:52:14Z

I have moved this issue to a more appropriate repository (operator-framework/operator-lib#24).

estroz added the triage/unresolved label ("Indicates an issue that can not or will not be resolved.") Jul 23, 2020

estroz assigned asmacdo Jul 27, 2020

estroz added this to the Backlog milestone Jul 27, 2020

kasonglee added a commit to HyungJune/operator-sdk that referenced this issue Jul 29, 2020

Ansible: check a node status when selecting a leader (operator-framew…

94eb440

…ork#3498)

kasonglee added a commit to HyungJune/operator-sdk that referenced this issue Jul 29, 2020

Ansible: check a node status when selecting a leader (operator-framew…

5072bd2

…ork#3498)

HyungJune mentioned this issue Jul 30, 2020

Ansible: check a node status when selecting a leader (#3498) HyungJune/operator-sdk#1

Open

2 tasks

kasonglee mentioned this issue Jul 31, 2020

Any contribution guide? operator-framework/operator-lib#22

Closed

HyungJune closed this as completed Aug 6, 2020
