# proposes leader-for-life election #481

Merged (1 commit) on Sep 18, 2018


@mhrivnak (Member):

re #136

@kfox1111 left a comment:

If the goal is to allow failing over from one instance to another, you can also accomplish this by using a StatefulSet and always treating index 0 as the leader (i.e., size > 1 has no effect). StatefulSets guarantee that no more than one pod with a given ordinal will run at a time, even when a node is unreachable and might still be running the workload. The pod is only removed and replaced when the user deletes the pod or the node object in Kubernetes, signaling to Kubernetes that it is safe to replace it with a new pod.
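
For illustration, a minimal sketch of that pattern in Go, assuming the operator runs as a StatefulSet and can read its own pod name from the hostname; the helper name and the idle behavior are illustrative, not part of any SDK:

```go
package main

import (
	"os"
	"strings"
)

// isStatefulSetLeader reports whether this pod is ordinal 0 of its
// StatefulSet. StatefulSet pods are named <statefulset-name>-<ordinal>,
// so the pod whose hostname ends in "-0" is treated as the leader.
func isStatefulSetLeader() bool {
	hostname, err := os.Hostname()
	if err != nil {
		return false
	}
	return strings.HasSuffix(hostname, "-0")
}

func main() {
	if !isStatefulSetLeader() {
		// Not ordinal 0: sit idle and let pod -0 do the work.
		select {}
	}
	// Ordinal 0: run the operator's reconcile loop here.
}
```

As noted above, any replicas beyond ordinal 0 are pure standbys in this scheme.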

@mhrivnak (Member Author):

@kfox1111 that's useful to know, but I don't think it will completely fit this use case. I think a goal here is to allow for the possibility that the same operator gets deployed more than once to the same namespace. Maybe I deploy wordpress and mediawiki to the same namespace, and they each come with a mariadb operator included. So now there are two separate instances of the mariadb operator that happen to be deployed in the same namespace, and we need them to pick a leader.

@kfox1111:

ah. gotcha. +1 then.

A review comment on the Goals section of the proposal:

> ## Goals
>
> * Provide leader election that is easy to use
> * Provide leader election that prohibits multiple leaders

Basically you need the leader to automatically step down X seconds after its last successful state update if it fails to renew that state, and you need to ensure that no new leader can be elected within X seconds of the last time the old leader was known to be active.

@mhrivnak (Member Author):

That would be characteristic of a lease-based approach. This leader-for-life approach avoids that complexity and any uncertainty about when or if a leader steps down.
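
For contrast, a rough sketch of what the lease-based alternative looks like with client-go's leaderelection package; the lock name, namespace, identity, and durations are illustrative assumptions, and the Lease-based lock reflects current client-go rather than what was available at the time:

```go
package leaderexample

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLease runs fn only while this process holds a renewable lease.
// If the leader fails to renew within RenewDeadline it must stop leading,
// and no replacement is elected until LeaseDuration has elapsed since the
// last observed renewal -- the timing rules described in the comment above.
func runWithLease(ctx context.Context, clientset kubernetes.Interface, identity string, fn func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "mariadb-operator-lock", Namespace: "default"},
		Client:     clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // the "X seconds" before a new election is allowed
		RenewDeadline: 10 * time.Second, // the leader must renew within this window
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: fn,
			OnStoppedLeading: func() { /* stop doing leader-only work */ },
		},
	})
}
```

The leader-for-life approach proposed here replaces all of this timing bookkeeping with a single create-or-fail on an owned lock object.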

@hasbro17 (Contributor):

@xiang90 We're looking at a simplified leader-for-life approach.
In summary, the leader pod creates a ConfigMap (the lock) with its ownerRef, and when that leader pod exits or fails, the Kubernetes GC should remove that ConfigMap (releasing the lock) so that another pod can become the leader.
https://github.com/mhrivnak/leaderelection#simplified-algorithm

Let me know if you can think of any issues with this approach.
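
A rough sketch of that lock acquisition in Go, assuming current client-go call signatures and a POD_NAME environment variable injected via the downward API; the helper name and ConfigMap name are illustrative, not necessarily what the SDK will ship:

```go
package leader

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryBecomeLeader attempts to create the lock ConfigMap owned by this pod.
// Creation succeeds for exactly one pod; everyone else gets AlreadyExists
// and should retry later. When the owning pod is deleted, garbage collection
// deletes the ConfigMap, releasing the lock.
func tryBecomeLeader(ctx context.Context, client kubernetes.Interface, namespace string) (bool, error) {
	podName := os.Getenv("POD_NAME") // assumed to be injected via the downward API

	pod, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}

	lock := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "mariadb-operator-lock",
			Namespace: namespace,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Pod",
				Name:       pod.Name,
				UID:        pod.UID,
			}},
		},
	}

	_, err = client.CoreV1().ConfigMaps(namespace).Create(ctx, lock, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return false, nil // someone else holds the lock; retry later
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```

Whichever pod's Create call succeeds is the leader for the life of that pod; every other pod sees AlreadyExists and keeps retrying.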

@xiang90:

@hasbro17

From a theoretical point of view, it is still the same as the lease-based approach and suffers from the same problem.

It only seems simpler because we now rely on pod deletion, which MIGHT have a causal relationship with the actual process death. The process can still be running even after the pod object itself is deleted and the ConfigMap is therefore GC'ed by the Kubernetes controller.

@mhrivnak (Member Author):

My understanding is that each process gets sent a TERM signal and must either exit gracefully or, after a grace period, be sent a KILL signal, before the Pod itself gets deleted. That's my interpretation of these docs:

https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

So based on that, it seems there is no opportunity for the process to continue. Or is there some scenario I'm missing?

A contributor:

What about the case of a pod being evicted because its node can't reach the API server for an extended period of time? The pod should get stopped once the node can reach the API again, though it's possible it's still acting as the leader in the meantime.

I'm not 100% sure of the difference, but in lease-based election your pod has a grace period and could potentially renew the lease after regaining connectivity.

In the leader-for-life approach I'm not sure, since I don't know exactly when the pod is actually deleted from the API during eviction.

Either way, I think it's worth spending time researching and documenting more cases like this, since a few have already come up.

Another commenter:

@hasbro17 The node controller might just remove the node if it has been inactive for long enough, without confirming that the pod has indeed stopped. There are quite a few cases I can think of. The correctness of this API will probably depend on the configuration of the Kubernetes cluster or on the external environment. If you can control that configuration, it might not be a big issue.

And another commenter:

@chancez

Exactly. You beat me by 1 minute :P.

@mhrivnak (Member Author):

For force deletion, I agree that's not much of a concern. It's well-documented in multiple places that it is unsafe to use, and this case would be no exception.

Pulling a node's network connection is an interesting scenario to vet. This documentation seems to spell out clearly that we do not need to be concerned:

https://kubernetes.io/docs/concepts/architecture/nodes/#condition

Regarding pods on an unreachable node: "the node controller does not force delete pods until it is confirmed that they have stopped running in the cluster. You can see the pods that might be running on an unreachable node as being in the Terminating or Unknown state. In cases where Kubernetes cannot deduce from the underlying infrastructure if a node has permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects running on the node to be deleted from the apiserver, and frees up their names."

Thank you all for thoroughly vetting this. :)

A contributor:

@mhrivnak Right, so if the node controller doesn't delete Pods from the apiserver for unreachable nodes, then the ConfigMap owned by the unreachable pod won't get GC'ed, and we won't have two leaders running at the same time.
We'll just be stuck with the partitioned leader pod until its node is reachable again, or an admin deletes the node.

We can keep documenting more cases where the pod-deletion=>process-stopped causality breaks down as we go along. But for now we can move forward with this approach.

@hasbro17 (Contributor):

@mhrivnak One thing we need to point out is that if the pod is deleted, it should be deleted with the background cascading deletion policy.
If the deletion policy is set to foreground, then the owned ConfigMap gets GC'ed before the leader pod is deleted.
I need to double-check, but I think background deletion is the default.

Another issue, as @chancez pointed out, is that if someone force-deletes the leader pod, it will immediately remove the pod from the API and kick off the GC (removing the ConfigMap and releasing the lock), but the pod could still be running on the node.
https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#force-deletion
That means the old leader could still run for a while if it's force-deleted.

Not sure how big of an issue this would be, since we don't expect users to force-delete operator pods that are part of a deployment or stateful set, but we should point out this caveat in the proposal or document it somewhere.
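
On the deletion-policy point, this is roughly how a caller could pin the propagation policy when deleting the leader pod directly; a sketch assuming current client-go signatures, with an illustrative function name:

```go
package leader

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteLeaderPod deletes the pod with background propagation: the pod object
// goes away first and the owned lock ConfigMap is garbage-collected afterwards.
// With foreground propagation the dependents are removed first, so the ConfigMap
// (and the lock) would be gone while the leader pod may still be running.
func deleteLeaderPod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	policy := metav1.DeletePropagationBackground
	return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		PropagationPolicy: &policy,
	})
}
```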

@mhrivnak (Member Author):

@hasbro17 sounds good. We'll document clearly: leave the deletion policy to the default, and don't force-delete the leader.

I'm pretty sure the default policy is "background", but I'm having a tough time finding that documented anywhere. I asked in sig-api-machinery, but didn't get very far except to "look at the handler strategy". But I'm not sure where to find that, and my grepping isn't turning up anything conclusive.

@fanminshi (Contributor) commented Sep 18, 2018:

I think we should document the caveats and also how to properly use the leader-for-life API. The leader-for-life API seems much easier to reason about than the leader election API provided by Kubernetes. I'd say we go ahead with the leader-for-life election approach. If it doesn't work out, users can always fall back to the Kubernetes leader election API.

LGTM

@hasbro17 (Contributor):

LGTM

@hasbro17 (Contributor):

Just a minor nit: can you clean up the commit title to `doc/proposals: leader election`?

@mhrivnak (Member Author):

> Just a minor nit: can you clean up the commit title to `doc/proposals: leader election`?

@hasbro17 done
