
doc/user-guide: show how to enable both leader election options #1052

Merged
hasbro17 merged 8 commits into operator-framework:master on Feb 18, 2019

Conversation


@hasbro17 hasbro17 commented Feb 4, 2019

Description of the change:
Added a brief overview of the two leader election implementations that can be used, their respective tradeoffs, and example code snippets showing how to use them.

Motivation for the change:

Fixes #665 #784

I'm aiming to keep the explanations as concise as possible without being too vague, and to refer readers to the pkg godocs for more details on the implementations themselves.

I also need a second opinion on whether this is bloating the user guide, and whether we would rather keep this in a separate doc and just link to it from the user guide.

@hasbro17 hasbro17 added the docs label Feb 4, 2019
@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 4, 2019

@lilic lilic left a comment


Few nits, overall makes sense to me 👍

lilic and others added 2 commits February 4, 2019 10:57
@hasbro17 hasbro17 removed the request for review from mhrivnak February 4, 2019 18:58

@joelanford joelanford left a comment


LGTM. Just one nit


@estroz estroz left a comment


LGTM, a few nits and a question.


When the operator is not running in a cluster, the Manager will return an error when starting, since it can't detect the operator's namespace, which it needs in order to create the ConfigMap for leader election. You can override this namespace by setting the Manager's `LeaderElectionNamespace` option.
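As a minimal sketch of that override (the lock ID and namespace values are just examples, and this assumes leader election is done through the Manager as described under Leader-with-lease below):

```Go
import (
	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManager is a hypothetical helper showing the override; the ID and
// namespace values are examples only.
func newManager() (manager.Manager, error) {
	cfg, err := config.GetConfig()
	if err != nil {
		return nil, err
	}
	return manager.New(cfg, manager.Options{
		LeaderElection:   true,
		LeaderElectionID: "memcached-operator-lock",
		// Needed when running outside the cluster, where the operator's
		// namespace can't be detected automatically.
		LeaderElectionNamespace: "memcached",
	})
}
```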

[tools_leaderelection]: https://godoc.org/github.com/kubernetes/client-go/tools/leaderelection
Member


Is this link supposed to be used anywhere?

Contributor Author


Good catch, no, this got left in from an earlier change. I'll remove it.


@mhrivnak mhrivnak left a comment


Thanks for writing this! I made a few suggestions.

@@ -400,6 +400,70 @@ func main() {

After adding new import paths to your operator project, run `dep ensure` in the root of your project directory to fulfill these dependencies.

## Leader election

For improved reliability and quick failover, it may be necessary to run multiple replicas of the operator with leader election, so that one leader instance handles the reconciliation while the other instances are inactive but ready to take over when the leader fails.
Member


Just to give people the right guidance, do we know of any controllers or operators that actually do this? I think there's a risk that people will read this as advice on something good to do, when in reality it is uncommon. For most people, we don't want to double their number of operator pods, and brief operator downtime is completely acceptable (eventual consistency being the goal).

As I understand it, leader election is more about avoiding contention during rollout of a newer operator, and when an unhealthy operator gets replaced.

Contributor Author


As far as I know it's mostly the control plane controllers (like the scheduler) for which high availability is an important requirement.
But I agree that improved reliability is not the dominant use case for using leader election and we probably shouldn't give that impression. I'll update this to point out that the focus is to avoid contention between multiple instances during rollouts.


There are two different leader election implementations to choose from, each with its own tradeoff.

- [Leader-for-life][leader_for_life]: The leader pod only gives up leadership (via garbage collection) when it is deleted. This avoids the possibility of two instances mistakenly running as leaders (split brain). However, this method can be slow to elect a new leader when the leader pod is on an unresponsive (partitioned) node, where it takes a long time for the leader pod to be deleted.
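As a rough sketch (not the exact text added in this PR), enabling Leader-for-life amounts to calling `leader.Become` from `github.com/operator-framework/operator-sdk/pkg/leader` before starting the Manager; the lock name below is only an example:

```Go
import (
	"context"
	"fmt"
	"os"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	// Block until this instance holds the leader lock ConfigMap.
	// Leadership is only released when the pod that owns the lock
	// is deleted.
	if err := leader.Become(context.TODO(), "memcached-operator-lock"); err != nil {
		fmt.Fprintln(os.Stderr, "failed to become leader:", err)
		os.Exit(1)
	}

	// ... set up and start the Manager as usual.
}
```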
Member


Rather than use words like "slow", which is relative and subjective, I suggest referencing the node failure timings. I don't remember the name of the setting off the top of my head, but there is a configurable time frame (default 5 minutes) after which workloads on an unresponsive node get deleted. Referencing that setting here will also give people an opportunity to consider modifying it. I'll try to look it up later on.

I'm learning more about node lifecycle, and my suspicion is that in practice the period of uncertainty will normally not last for 5 minutes. Using the cloud-controller-manager, the status of the underlying machine can be queried with the cloud environment (AWS, azure, openstack, whatever), and a dead machine can have its node deleted more promptly. So I suspect in real life this won't be a common scenario.

Contributor Author


I believe that's the pod-eviction-timeout. From the node architecture docs:

If the Status of the Ready condition remains Unknown or False for longer than the pod-eviction-timeout, an argument is passed to the kube-controller-manager and all the Pods on the node are scheduled for deletion by the Node Controller. The default eviction timeout duration is five minutes

I'll make this more specific by pointing out the default time for pod eviction that can result in an unresponsive leader for the given period of time.

- [Leader-with-lease][leader_with_lease]: The leader pod periodically renews the leader lease and gives up leadership when it can't renew the lease. This method allows for a faster transition to a new leader when the existing leader is isolated, but there is a possibility of split brain in [certain situations][lease_split_brain].
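A comparable sketch for Leader-with-lease, which is enabled through the Manager's options rather than a separate call (again, the lock ID is an example; when running outside the cluster, also set `LeaderElectionNamespace` as shown in the earlier hunk):

```Go
import (
	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/manager/signals"
)

func main() {
	cfg, err := config.GetConfig()
	if err != nil {
		panic(err)
	}
	// The Manager periodically renews the leader lease and exits the
	// process if the lease is lost, at which point another replica can
	// acquire it.
	mgr, err := manager.New(cfg, manager.Options{
		LeaderElection:   true,
		LeaderElectionID: "memcached-operator-lock",
	})
	if err != nil {
		panic(err)
	}
	// ... register controllers with mgr, then:
	if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```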
Member


I would add that it assumes the leader is capable of stopping what it is doing immediately upon lease expiry. Depending on how a controller is written, it may or may not be able to make that guarantee.

Contributor Author


@mhrivnak Aren't all leaders capable of stopping abruptly when using leader election via the manager? That will just exit the program when the leader's lease expires and it stops leading.
https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/manager/internal.go#L279-L283
https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/manager/internal.go#L230-L233
https://github.com/operator-framework/operator-sdk-samples/blob/master/memcached-operator/cmd/manager/main.go#L85-L88

In my understanding, this should exit all background goroutines abruptly as well, without blocking on any defer statements.

Member


Good point, it does look like they've guarded it well. It may be too late to cancel API requests that are in-progress, but that's quite an edge case.

estroz and others added 4 commits February 11, 2019 13:38
@hasbro17

@mhrivnak PTAL. Updated per your comments.

@hasbro17 hasbro17 merged commit 7e578b4 into operator-framework:master Feb 18, 2019