
Leader election locking for Operators #136

Closed
philips opened this issue Mar 28, 2018 · 13 comments
Labels
design kind/feature Categorizes issue or PR as related to a new feature.

Comments

@philips

philips commented Mar 28, 2018

We need to ensure that in each namespace an Operator is a singleton, and that only one Operator instance is reconciling against the resources at any one time.

To accomplish this we should provide automatic singleton leader election logic in the SDK.

@spahl spahl added kind/feature Categorizes issue or PR as related to a new feature. design labels Jun 5, 2018
@squeed
Contributor

squeed commented Aug 10, 2018

As we spin up operator development, this will be critical. Otherwise, every single team will have to write their own leader election code.

@mhrivnak
Member

Some form of leader election may come mostly for free by way of controller-runtime: kubernetes-sigs/controller-runtime#93

@mhrivnak
Member

To ensure we get this right, it will help to capture more specific requirements. I'll state some assumptions (please help clarify them if they're not right) and ask some questions. @spahl and @hasbro17 please add your thoughts.

In a leader election scenario, we expect that for some operator, two or more instances of it will be running concurrently in the same namespace, but in separate pods, and only one instance will be active. Non-active instances will be waiting for the leader to disappear, in which case a new election decides which becomes the new leader.

Question: What requirement drives the need for a "hot spare" operator to be already running when the leader goes away, vs. just starting a new operator pod when needed? Put another way, what use case does leader election meet vs. using a Deployment with replicas: 1?

Speculating an answer: Is leader election for operators not about HA, but actually about the scenario where two operators get deployed in the same namespace for unrelated purposes (for example, two unrelated apps get deployed to the same namespace, and each independently comes with a deployment of a mysql operator)?

Question: Is it acceptable to have two leaders for a very brief period of time?

Question: Is it acceptable to have no leader at all for a very brief period of time? In this case, some events may be missed, but the new leader's reconciliation would presumably do the right things regardless.

@hasbro17
Contributor

what use case does leader election meet vs. using a Deployment with replicas: 1

You're right that it's not primarily for HA. It's more about preventing two operator instances from handling events in the same namespace.

Is it acceptable to have two leaders for a very brief period of time?

I would think not. Even if it's momentary, two operators reconciling the same events would be problematic.

Is it acceptable to have no leader at all for a very brief period of time?

Yes, that's normal. It's expected that an operator can be down or partitioned from the cluster and miss events, but once it's back up it should reconcile from the current state.

I saw the leader election PR on controller-runtime. I think it's fine if we just review and stay on top of that to ensure it fulfills our use case (and is configurable for retry period, lease duration, etc.).

@kfox1111

kfox1111 commented Sep 4, 2018

A Deployment with replicas=1 does not guarantee that only one instance will ever run at a time. I think a StatefulSet with replicas=1 does, though a StatefulSet on a down node doesn't always recover automatically.

@mhrivnak
Member

mhrivnak commented Sep 4, 2018

That all lines up with my thinking as well. Thanks for the input.

The leader election algorithm being proposed for addition into controller-runtime allows for the possibility of two controllers being active at the same time. It works on a concept of leases, and it expects two key behaviors:

  1. A leader will step down if it's not able to renew its lease. Such inability could happen due to temporary network failure, system overload preventing the process from doing anything for a short period, or other temporary but impactful circumstances.

  2. Other candidates can take the lease if the leader fails to renew it in time.

The important observation to make is that a leader could be in the middle of still trying to do "leader things", like change cluster state, unaware that some other candidate has taken the lease out from under it. Presumably it would notice eventually and step down, but there could be a period of overlap with the new leader.

Another factor is the ongoing burden of the leader needing to renew its lease, check if it needs to step down, and try to step down gracefully if so.

Given all of that, I think the leader election in the controller-runtime PR may be sufficient in cases where nothing goes wrong, and we just want to enable multiple operators to coordinate within a namespace. But it does not meet all of our desired behaviors, and it does require a controller/operator to be designed for the ability to step down at arbitrary times.

I'll pitch an alternative in the next comment.

@mhrivnak
Member

mhrivnak commented Sep 4, 2018

I'll now describe a "leader for life" approach, and I'd appreciate feedback on its viability.

In the "leader for life" approach, a specific Pod is the leader. Once established, the Pod is the leader until it is destroyed. There is no possibility for multiple pods to think they the leader at the same time. There is no lease renewal or stepping down.

Simplified algorithm (a sketch in Go follows the list):

  1. At startup, our operator process attempts to create a ConfigMap with a known name (perhaps "myapp-operator-lock"). The ConfigMap will have its OwnerReference set to our operator's Pod.
  2. If the create fails because a ConfigMap with that name already exists, that means a leader exists. Our operator will periodically continue trying to create the ConfigMap until it eventually succeeds.
  3. Once our operator process becomes the leader by successfully creating a ConfigMap, it has no further work to do with respect to elections. There is no lease renewal, and certainly no stepping down to worry about.
  4. When our operator's Pod eventually gets removed for some reason, the garbage collector will also remove its ConfigMap.
  5. With the ConfigMap gone, another candidate process is able to create its own and become the new leader.
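
For illustration, here is a minimal Go sketch of that flow, assuming recent client-go APIs and a POD_NAME environment variable injected via the downward API; the function name, lock name, and retry period are placeholders rather than the SDK's eventual implementation:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// becomeLeader blocks until this process owns the lock ConfigMap.
// The ConfigMap is owned by our Pod, so the garbage collector deletes
// it when the Pod goes away, freeing the lock for the next candidate.
func becomeLeader(ctx context.Context, lockName, namespace string) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Look up our own Pod; POD_NAME is assumed to be set via the downward API.
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, os.Getenv("POD_NAME"), metav1.GetOptions{})
	if err != nil {
		return err
	}

	lock := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      lockName,
			Namespace: namespace,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Pod",
				Name:       pod.Name,
				UID:        pod.UID,
			}},
		},
	}

	for {
		_, err := client.CoreV1().ConfigMaps(namespace).Create(ctx, lock, metav1.CreateOptions{})
		switch {
		case err == nil:
			fmt.Println("became the leader")
			return nil
		case apierrors.IsAlreadyExists(err):
			// Another Pod holds the lock; retry until the GC removes it.
			time.Sleep(5 * time.Second)
		default:
			return err
		}
	}
}
```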

There are of course some potential enhancements to that approach, but I'm looking for feedback on the general concept. Would this approach meet our needs?

A real advantage of this approach is the simplicity of use. A developer can call a blocking function called something like BecomeLeader() at startup, and then move on with life as the leader.

I put together a working PoC and tested it with the ansible-operator. You can see it here: https://github.com/water-hole/ansible-operator/compare/master...mhrivnak:leader?expand=1

If this seems like a reasonable approach, I can write up a more complete proposal.

@shawn-hurley
Member

PTAL @fanminshi @hasbro17 @AlexNPavel

I think the simplified approach will work for 80-90% of the use cases and makes it incredibly easy for users. If a user needs a more advanced approach with lease renewal or stepping down, they can use the controller-runtime's version. Thoughts?

@hasbro17
Contributor

hasbro17 commented Sep 7, 2018

@mhrivnak Thanks for the detailed comparison. This is quite helpful. And I think you meant to point to this link for your implementation:
mhrivnak/ansible-operator@93d8334
I've added my comments on both approaches below, but I'm leaning towards the controller-runtime's utility.

Controller-runtime's election

The important observation to make is that a leader could be in the middle of still trying to do "leader things", like change cluster state, unaware that some other candidate has taken the lease out from under it. Presumably it would notice eventually and step down, but there could be a period of overlap with the new leader.

We could minimize this overlap to a reasonable extent by configuring the lease duration and renew deadline. Even the controller-runtime's defaults should work for most cases:

		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,

https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection.go#L21-L32

This works for most cases, and in the case of extreme clock skew we can still configure the LeaseDuration to be 3-4x the RenewDeadline to ensure leaders expire well before any followers attempt to acquire the lease.
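
For concreteness, here's a hedged sketch of how those knobs map onto client-go's leader election config. Recent client-go versions use a coordination.k8s.io Lease as the lock object (at the time of this discussion it would have been a ConfigMap or Endpoints lock), and the lock name and callback bodies are illustrative:

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLease starts run() only after winning a lease-based election,
// and keeps renewing the lease in the background while leading.
func runWithLease(ctx context.Context, client kubernetes.Interface, namespace, id string, run func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "myapp-operator-lock", Namespace: namespace},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long followers wait before trying to take over
		RenewDeadline: 10 * time.Second, // the leader must renew within this window or give up leadership
		RetryPeriod:   2 * time.Second,  // how often candidates retry acquisition/renewal
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				// Lost the lease: exit so the process restarts and rejoins as a follower.
			},
		},
	})
}
```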

Another factor is the ongoing burden of the leader needing to renew its lease, check if it needs to step down, and try to step down gracefully if so.

Doing a lease check as a follower every 15s and renewing the lease every 10s as the leader is not really a performance burden compared to an operator's normal operations (watches, client operations, etc.). And those periods are configurable to change the performance impact.

it does require a controller/operator to be designed for the ability to step down at arbitrary times

Isn't that part already taken care of by the runtime's utility? The manager/operator will exit the moment it loses (can't renew) the lock.
https://github.com/kubernetes-sigs/controller-runtime/pull/118/files#diff-77faf6b20512574869434402d5c5b6a2R165
And besides, an operator is expected to crash/restart for reasons other than stepping down, and it should be able to handle that.

Leader for Life

I'm a little hesitant about building our own leader election algorithm without leases. Leader election can be quite tricky, and while I can't immediately see any edge cases in your implementation, we'll need to carefully consider any non-obvious effects of the "for life" approach.

My biggest concern with maintaining our own implementation is ensuring its correctness, as opposed to reusing an upstream solution that's already used and tested by the k8s control-plane components like the controller-manager, scheduler, etc.
https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-controller-manager/app/controllermanager.go#L218-L241

A real advantage of this approach is the simplicity of use. A developer can call a blocking function called something like BecomeLeader() at startup, and then move on with life as the leader.

I don't know that enabling the runtime's leader election is any more complicated from a usage perspective. Users just have to pass one extra option to the manager in main.go:

manager.New(cfg, Options{LeaderElection: true})
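
For context, a slightly fuller (hedged) sketch of what that looks like with a recent controller-runtime; the election ID and namespace values here are illustrative:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	mgr, err := manager.New(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:          true,
		LeaderElectionID:        "myapp-operator-lock", // name of the lock resource
		LeaderElectionNamespace: "myapp-system",        // namespace the lock lives in
	})
	if err != nil {
		os.Exit(1)
	}

	// Controllers registered with mgr only start reconciling once this
	// instance wins the election.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```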

TL;DR: I don't see the upstream solution as being more complicated from a user's perspective (for the default case).
And until our own implementation has been sufficiently tested, I would have more confidence in one that's already used by upstream components.

So I would vote to wait and use the runtime's utility. But if more people see that as too complicated to use we can revisit our own implementation.
/cc @spahl @fanminshi

@mhrivnak
Member

Thanks for taking a look. I'll add a little more context and some responses.

FWIW, neither of these algorithms is new or unique. Simple leader election, unless you're going for a consensus algorithm (like Raft), very often involves having all the candidates try to create the same record in a data store that enforces uniqueness constraints. Often it's a table in a DB, but it can just as easily be the k8s API.

Leases are an imperfect, but usually the least-bad, way to identify when the leader is gone. In most systems you don't have a way to trigger an action when the leader disappears, but in our case we have garbage collection on our side. Or we could even watch the leader's pod directly and respond when it gets deleted. The key advantage is that Kubernetes lets us tie directly into the leader's lifecycle, rather than observe it indirectly.

It appears that leader election in client-go was implemented before owner references and associated garbage collection existed. I can also see why low-level components such as controller-manager may not want to depend on garbage collection or other lifecycle events. So I think that's why we have lease-based election to begin with, and why it will continue to have value for some components.

We could minimize this overlap to a reasonable extent...

Regarding the requirements question I asked above: "Is it acceptable to have two leaders for a very brief period of time?" The answer above seemed to say "no", but I read this as a "yes", so that could use clarification. Using leases and tolerating a few seconds of overlap is usually done because there's no other choice, but in our case we do have another option. Personally I think it's much safer to draw a line in the sand and give operator authors confidence that two instances will not run concurrently. A lot can happen in a few seconds of overlap.

Clearly this would be valuable to most users of controller-runtime, so I would hope to get a leader-for-life approach added there. Starting with an implementation in the SDK and exercising it there before submitting it to controller-runtime seems reasonable, but I'm open to other approaches.

Doing a lease check as a follower every 15s and renewing the lease every 10s as the leader is not really a performance burden

Completely agreed. I did not mean to imply there was any performance burden. The burden is more in having and maintaining code that does active lease renewal. We get it "for free" from client-go and the code in controller-runtime that will use it, but it is an alpha feature of client-go that carries the warning "Depend on this API at your own risk." It's not a big factor, since it's not a lot of code or background work, but it is an extra layer of complexity on an already complex and highly concurrent process that may not be necessary. My main point is: if we don't need a whole extra background activity in the leader continuously interacting with an API over the network, why have it?

The manager/operator will exit...

I agree it's doing the right thing, and this approach does not present much burden. But why build in a non-graceful exit path if we don't need it?

I'm with you that using the feature that's built into client-go and already in use elsewhere is an easy default choice. But in this case it does not match our requirement of no overlap between leaders. Since the whole lease concept is just a mechanism for recognizing when a leader is gone, and we have a simpler and better mechanism available, I think it's worth pursuing an approach that connects leader election directly to the leader's lifecycle rather than observing it indirectly through leases.

@hasbro17
Contributor

"Is it acceptable to have two leaders for a very brief period of time?" The answer above seemed to say "no", but I read this as a "yes", so that could use clarification.

You're right, I did go back and forth on that. I agree that we should fulfill the requirement of guaranteeing that no two leaders run at the same time.
So we can go with the leader-for-life approach as the default and add our implementation to the SDK.

And it doesn't stop the user from switching to the "leader with leases" approach via the controller-runtime if they need to for some reason.

Can you please put up a brief proposal explaining this approach and the API from your PoC (i.e. how leader.Become() should be called in main.go)? Once that's agreed upon, we can clean up and repurpose your PoC and add it to the SDK.
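
For reference, one shape that call could take in main.go, based on the PoC; the package path, lock name, and exact signature are illustrative rather than a finalized API:

```go
package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	// Blocks until this Pod owns the lock ConfigMap, i.e. until it is
	// the leader for life. The lock name is a placeholder.
	if err := leader.Become(context.TODO(), "myapp-operator-lock"); err != nil {
		log.Fatal(err)
	}

	// ...set up the manager, register controllers, and start as usual...
}
```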

/cc @spahl and @fanminshi let me know if you have any issues with this approach.

mhrivnak added a commit to mhrivnak/operator-sdk that referenced this issue Sep 13, 2018
mhrivnak added a commit to mhrivnak/operator-sdk that referenced this issue Sep 18, 2018
hasbro17 pushed a commit that referenced this issue Sep 18, 2018
@joelanford
Member

@mhrivnak is this issue closed by #724?

@hasbro17
Contributor

@joelanford Yes, we have leader election enabled by default using the SDK's leader-for-life utility.
The follow-up to this is #784, where we need to better highlight how and when to use the controller-runtime's leader election with leases.

m1kola pushed a commit to m1kola/operator-sdk that referenced this issue Jun 7, 2024