Leader election may be too slow to re-elect new master #784

Closed
jsafrane opened this issue Nov 28, 2018 · 4 comments

@jsafrane

Current leader election depends on Kubernetes deleting faulty pods relatively quickly. It does not work well when the leader is on a node that becomes unresponsive (network partition, kubelet hang, ...). The pod is not deleted automatically, the leader stops working, and a new one cannot be elected. I'd expect a new leader to be available in ~1 minute even in the worst conditions.

@mhrivnak
Member

Leader election for operators is primarily focused on guaranteeing that, in a scenario where multiple pods are running as the same operator, only one of them can be active. Most operators run a single pod at a time, but overlap can happen during operator upgrades, pod rescheduling for whatever reason, etc. Leader election is less focused on providing high availability of the sort where you run multiple pods all the time in order to have a warm spare that can take over should the leader fail. You can do that, and leader election will work for that case. But as you observe, the scenario of a failed node is difficult.
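
For reference, here is roughly what opting into this leader-for-life election looks like at operator startup; the lock name below is just an example:

```go
package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	// Block until this pod becomes the leader. The lock is a ConfigMap that is
	// owner-referenced to the pod, so it is only released when the pod itself
	// is deleted ("leader for life").
	if err := leader.Become(context.TODO(), "memcached-operator-lock"); err != nil {
		log.Fatal(err)
	}

	// ... set up and start the controller-runtime manager here ...
}
```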

This documentation describes in detail the challenges and potential ambiguity that come with node failure. Simply put, if a node fails, it may be impossible to determine whether a pod is still running on it. If your leader happens to be on a failed node, in most cases it will be deleted after the pod-eviction-timeout.

If it is important to you that an operator on an unreachable node gets rescheduled more quickly, you may also have the same concern about other workloads, and it would make sense to look at lowering the pod-eviction-timeout. I'm not sure why the default is 5 minutes, which seems rather long, except perhaps to be conservative for the case of pods that are expensive to re-schedule.

Otherwise, if your priority is quick recovery from a node that has gone silent, you might prefer the lease-based leader election provided by controller-runtime. Using the lease-based approach is a trade-off: it has weaker guarantees about preventing concurrent leadership, but it recovers from the missing/frozen/disconnected/silent node problem more quickly.
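
As a rough sketch, turning that on via the controller-runtime manager options might look like the following; the election ID and namespace are placeholders, and option names can vary between controller-runtime versions:

```go
package main

import (
	"log"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	cfg, err := config.GetConfig()
	if err != nil {
		log.Fatal(err)
	}

	// Lease-based election: the leader must keep renewing its lease, so if its
	// node goes silent, another replica can take over after the lease expires.
	mgr, err := manager.New(cfg, manager.Options{
		LeaderElection:          true,
		LeaderElectionID:        "memcached-operator-lock",
		LeaderElectionNamespace: "operators",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Register controllers here, then block on mgr.Start(...). The stop/context
	// argument to Start differs across controller-runtime versions.
	_ = mgr
}
```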

Hopefully total failure of a node, of the kind where it suddenly goes silent and Kubernetes cannot otherwise determine whether it's still alive, will be rare for you. I think most people will prefer the guarantees that come with our leader election and will be able to work with a > 1 minute SLA in case of ambiguous node failure.

@hasbro17
Contributor

Just for reference, controller-runtime supports turning on lease-based leader election via the manager options.

Perhaps we can document that more clearly as an alternative, so users can choose the trade-off they want with their choice of leader election.

@estroz added the docs label Nov 29, 2018
@jsafrane
Author

> Using the lease-based approach is a trade-off: it has weaker guarantees about preventing concurrent leadership, but it recovers from the missing/frozen/disconnected/silent node problem more quickly.

Is there a list of known issues with client-go/leader-election? In my opinion it's more reliable than relying on pod deletion. I prefer faster recovery; critical OpenShift components depend on it. IMO we can't afford a controller being unavailable for 5 minutes.

> Perhaps we can document that more clearly as an alternative, so users can choose the trade-off they want with their choice of leader election.

+1

@hasbro17
Contributor

Long overdue, but with #1052 we now have a section that explains the two options for leader election.
