Why leader re-election should work after the default timeout 5-min when a worker node is failed? #3498

Closed
HyungJune opened this issue Jul 23, 2020 · 3 comments
Labels: triage/unresolved (Indicates an issue that can not or will not be resolved.)

HyungJune commented Jul 23, 2020

Feature Request

Is your feature request related to a problem? Please describe.

In /pkg/leader/leader.go, when a worker node fails, leader re-election happens only after the default 5-minute timeout, because the code waits for the condition
Pod.Status.Phase == "Failed" && Pod.Status.Reason == "Evicted".
I think leader re-election could happen almost immediately if the condition also checked the status of the node where the leader pod is running.

Describe the solution you'd like

The condition should change from [Pod.Status.Phase == "Failed" && Pod.Status.Reason == "Evicted"] to [Pod.Status.Phase == "Failed" && Pod.Status.Reason == "Evicted" || Node.Status == "Not Ready"].
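As a rough illustration of the difference between the two conditions, here is a minimal sketch. It assumes simplified stand-in types (PodStatus, NodeStatus) rather than the real Kubernetes API structs, and the function names isPodEvicted and leaderIsStale are hypothetical, not taken from leader.go.

```go
package main

import "fmt"

// PodStatus and NodeStatus are simplified, hypothetical stand-ins for the
// Kubernetes API types; only the fields discussed in this issue are modeled.
type PodStatus struct {
	Phase  string
	Reason string
}

type NodeStatus struct {
	Ready bool // assumed to mirror the node's "Ready" condition
}

// isPodEvicted models the existing condition described in this issue.
func isPodEvicted(pod PodStatus) bool {
	return pod.Phase == "Failed" && pod.Reason == "Evicted"
}

// leaderIsStale models the proposed condition: the leader pod was evicted,
// or the node it runs on is not ready.
func leaderIsStale(pod PodStatus, node NodeStatus) bool {
	return isPodEvicted(pod) || !node.Ready
}

func main() {
	// The node has failed, but the pod has not been marked as evicted yet.
	pod := PodStatus{Phase: "Running"}
	node := NodeStatus{Ready: false}

	fmt.Println(isPodEvicted(pod))        // false: existing check keeps waiting
	fmt.Println(leaderIsStale(pod, node)) // true: proposed check reacts at once
}
```

In a real implementation the node readiness would have to be read from the API server, which is not shown here.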

Shortening --pod-eviction-timeout is another approach. However, I believe the approach above is more reliable, since we don't know what an appropriate timeout would be.

I have one more question:
what drawbacks are there to making --pod-eviction-timeout very short?

estroz (Member) commented Jul 23, 2020

@HyungJune given the placement of the condition you're referring to, you seem to be suggesting that we look up the node the leader is on and, if it isn't ready, delete the pod on that node? That doesn't make sense to me, since the pod won't even exist if the node isn't ready. If that isn't what you meant, can you elaborate on your solution?

Also it's worth taking a look at #784, which discusses some of what you're talking about.

estroz added the triage/unresolved label Jul 23, 2020
estroz added this to the Backlog milestone Jul 27, 2020
kasonglee added a commit to HyungJune/operator-sdk that referenced this issue Jul 29, 2020
kasonglee added a commit to HyungJune/operator-sdk that referenced this issue Jul 29, 2020
HyungJune (Author) commented Jul 31, 2020

@estroz While I was writing up my solution, the package containing leader.go was moved out of the master branch into the "operator-lib" repository.
Should I open a new issue (for the same problem) and make a pull request in the "operator-lib" repository?

HyungJune (Author) commented

I moved this issue to a more appropriate repository (operator-framework/operator-lib#24).
