# proposes leader-for-life election #481

Merged (1 commit) on Sep 18, 2018


@mhrivnak (Member):

re #136

@kfox1111 left a comment:

If the goal is to allow failing over from one instance to another, you can also accomplish this by using a StatefulSet and always treating index 0 as the leader (i.e., size > 1 has no effect). StatefulSets guarantee that no more than one pod with a given ordinal will run at a time, even when a node is unreachable and might still be running the workload. The pod is only removed and replaced when the user deletes the pod or the node object in Kubernetes, signaling to Kubernetes that it is safe to replace it with a new pod.
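
For illustration, a minimal sketch of that pattern in Go, assuming the operator runs as a StatefulSet and can read its own pod name from the hostname; the helper name and the idle behavior are illustrative, not part of any SDK:

```go
package main

import (
	"os"
	"strings"
)

// isStatefulSetLeader reports whether this pod is ordinal 0 of its
// StatefulSet. StatefulSet pods are named <statefulset-name>-<ordinal>,
// so the pod whose hostname ends in "-0" is treated as the leader.
func isStatefulSetLeader() bool {
	hostname, err := os.Hostname()
	if err != nil {
		return false
	}
	return strings.HasSuffix(hostname, "-0")
}

func main() {
	if !isStatefulSetLeader() {
		// Not ordinal 0: sit idle and let pod -0 do the work.
		select {}
	}
	// Ordinal 0: run the operator's reconcile loop here.
}
```

As noted above, any replicas beyond ordinal 0 are pure standbys in this scheme.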

@mhrivnak (Member Author):

@kfox1111 that's useful to know, but I don't think it will completely fit this use case. I think a goal here is to allow for the possibility that the same operator gets deployed more than once to the same namespace. Maybe I deploy wordpress and mediawiki to the same namespace, and they each come with a mariadb operator included. So now there are two separate instances of the mariadb operator that happen to be deployed in the same namespace, and we need them to pick a leader.

@kfox1111:

ah. gotcha. +1 then.

A review comment on the Goals section of the proposal:

> ## Goals
>
> * Provide leader election that is easy to use
> * Provide leader election that prohibits multiple leaders

Basically you need the leader to automatically step down X seconds after its last successful state update if it fails to renew that state, and you need to ensure that no new leader can be elected within X seconds of the last time the old leader was known to be active.

@mhrivnak (Member Author):

That would be characteristic of a lease-based approach. This leader-for-life approach avoids that complexity and any uncertainty about when or if a leader steps down.
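
For contrast, a rough sketch of what the lease-based alternative looks like with client-go's leaderelection package; the lock name, namespace, identity, and durations are illustrative assumptions, and the Lease-based lock reflects current client-go rather than what was available at the time:

```go
package leaderexample

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLease runs fn only while this process holds a renewable lease.
// If the leader fails to renew within RenewDeadline it must stop leading,
// and no replacement is elected until LeaseDuration has elapsed since the
// last observed renewal -- the timing rules described in the comment above.
func runWithLease(ctx context.Context, clientset kubernetes.Interface, identity string, fn func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "mariadb-operator-lock", Namespace: "default"},
		Client:     clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // the "X seconds" before a new election is allowed
		RenewDeadline: 10 * time.Second, // the leader must renew within this window
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: fn,
			OnStoppedLeading: func() { /* stop doing leader-only work */ },
		},
	})
}
```

The leader-for-life approach proposed here replaces all of this timing bookkeeping with a single create-or-fail on an owned lock object.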

@hasbro17 (Contributor):

@xiang90 We're looking at a simplified leader-for-life approach.
In summary, the leader pod creates a ConfigMap (the lock) with its ownerRef, and when that leader pod exits or fails, the Kubernetes GC should remove that ConfigMap (releasing the lock) so that another pod can become the leader.
https://github.com/mhrivnak/leaderelection#simplified-algorithm

Let me know if you can think of any issues with this approach.
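
A rough sketch of that lock acquisition in Go, assuming current client-go call signatures and a POD_NAME environment variable injected via the downward API; the helper name and ConfigMap name are illustrative, not necessarily what the SDK will ship:

```go
package leader

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryBecomeLeader attempts to create the lock ConfigMap owned by this pod.
// Creation succeeds for exactly one pod; everyone else gets AlreadyExists
// and should retry later. When the owning pod is deleted, garbage collection
// deletes the ConfigMap, releasing the lock.
func tryBecomeLeader(ctx context.Context, client kubernetes.Interface, namespace string) (bool, error) {
	podName := os.Getenv("POD_NAME") // assumed to be injected via the downward API

	pod, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}

	lock := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "mariadb-operator-lock",
			Namespace: namespace,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Pod",
				Name:       pod.Name,
				UID:        pod.UID,
			}},
		},
	}

	_, err = client.CoreV1().ConfigMaps(namespace).Create(ctx, lock, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return false, nil // someone else holds the lock; retry later
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```

Whichever pod's Create call succeeds is the leader for the life of that pod; every other pod sees AlreadyExists and keeps retrying.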

@xiang90:

@hasbro17

From a theoretical point of view, it is still the same as the lease-based approach and suffers from the same problem.

It only seems simpler because we now rely on pod deletion, which MIGHT have a causal relationship with the actual process death. The process can still be running even after the pod object itself is deleted and the ConfigMap is therefore GC'ed by the Kubernetes controller.

@mhrivnak (Member Author):

My understanding is that each process gets sent a TERM signal and must either exit gracefully or, after a grace period, be sent a KILL signal, before the Pod itself gets deleted. That's my interpretation of these docs:

https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

So based on that, it seems there is no opportunity for the process to continue. Or is there some scenario I'm missing?

A contributor:

What about the case of a pod being evicted because its node can't reach the API server for an extended period of time? The pod should get stopped once the node can reach the API again, though it's possible it's still acting as the leader in the meantime.

I'm not 100% sure of the difference, but in lease-based election your pod has a grace period and could potentially renew the lease after regaining connectivity.

In the leader-for-life approach I'm not sure, since I don't know exactly when the pod is actually deleted from the API during eviction.

Either way, I think it's worth spending time researching and documenting more cases like this, since a few have already come up.

Another commenter:

@hasbro17 The node controller might just remove the node if it has been inactive for long enough, without confirming that the pod has indeed stopped. There are quite a few cases I can think of. The correctness of this API will probably depend on the configuration of the Kubernetes cluster or on the external environment. If you can control that configuration, it might not be a big issue.

And another commenter:

@chancez

Exactly. You beat me by 1 minute :P.

@mhrivnak (Member Author):

For force deletion, I agree that's not much of a concern. It's well-documented in multiple places that it is unsafe to use, and this case would be no exception.

Pulling a node's network connection is an interesting scenario to vet. This documentation seems to spell out clearly that we do not need to be concerned:

https://kubernetes.io/docs/concepts/architecture/nodes/#condition

Regarding pods on an unreachable node: "the node controller does not force delete pods until it is confirmed that they have stopped running in the cluster. You can see the pods that might be running on an unreachable node as being in the Terminating or Unknown state. In cases where Kubernetes cannot deduce from the underlying infrastructure if a node has permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects running on the node to be deleted from the apiserver, and frees up their names."

Thank you all for thoroughly vetting this. :)

A contributor:

@mhrivnak Right, so if the node controller doesn't delete Pods from the apiserver for unreachable nodes, then the ConfigMap owned by the unreachable pod won't get GC'ed, and we won't have two leaders running at the same time.
We'll just be stuck with the partitioned leader pod until its node is reachable again, or an admin deletes the node.

We can keep documenting more cases where the pod-deletion=>process-stopped causality breaks down as we go along. But for now we can move forward with this approach.

@hasbro17 (Contributor):

@mhrivnak One thing we need to point out is that if the pod is deleted, it should be deleted with the background cascading deletion policy.
If the deletion policy is set to foreground, then the owned ConfigMap gets GC'ed before the leader pod is deleted.
I need to double-check, but I think background deletion is the default.

Another issue, as @chancez pointed out, is that if someone force-deletes the leader pod, it will immediately remove the pod from the API and kick off the GC (removing the ConfigMap and releasing the lock), but the pod could still be running on the node.
https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#force-deletion
That means the old leader could still run for a while if it's force-deleted.

Not sure how big of an issue this would be, since we don't expect users to force-delete operator pods that are part of a deployment or stateful set, but we should point out this caveat in the proposal or document it somewhere.
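
On the deletion-policy point, this is roughly how a caller could pin the propagation policy when deleting the leader pod directly; a sketch assuming current client-go signatures, with an illustrative function name:

```go
package leader

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteLeaderPod deletes the pod with background propagation: the pod object
// goes away first and the owned lock ConfigMap is garbage-collected afterwards.
// With foreground propagation the dependents are removed first, so the ConfigMap
// (and the lock) would be gone while the leader pod may still be running.
func deleteLeaderPod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	policy := metav1.DeletePropagationBackground
	return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		PropagationPolicy: &policy,
	})
}
```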

@mhrivnak (Member Author):

@hasbro17 sounds good. We'll document clearly: leave the deletion policy to the default, and don't force-delete the leader.

I'm pretty sure the default policy is "background", but I'm having a tough time finding that documented anywhere. I asked in sig-api-machinery, but didn't get very far except to "look at the handler strategy". But I'm not sure where to find that, and my grepping isn't turning up anything conclusive.

@fanminshi (Contributor) commented Sep 18, 2018:

I think we should document the caveats and also how to properly use the leader-for-life API. The leader-for-life API seems much easier to reason about than the leader election API provided by Kubernetes. I'd say we go ahead with the leader-for-life election approach. If it doesn't work out, users can always fall back to the Kubernetes leader election API.

LGTM

@hasbro17 (Contributor):

LGTM

@hasbro17 (Contributor):

Just a minor nit: can you clean up the commit title to `doc/proposals: leader election`?

@mhrivnak (Member Author):

> Just a minor nit: can you clean up the commit title to `doc/proposals: leader election`?

@hasbro17 done
