Make leader-for-life leader election more integrated with controller-runtime #48

Open
joelanford opened this issue Jan 11, 2021 · 4 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@joelanford
Member

Feature Request

Is your feature request related to a problem? Please describe.
Yes. It isn't possible to use leader-for-life leader election with controller-runtime's manager when also using liveness and readiness probes.

Using controller-runtime's manager out of the box, the following sequence of events happens when manager.Start() is called:

  1. Liveness and readiness probes are started.
  2. Leader election is started.
  3. Controllers are started.

When using leader-for-life from this repo, it must be called prior to manager.Start() since controller-runtime doesn't support pluggable leader election implementations. The sequence of events in this case is:

  1. Leader election is started.
  2. Liveness and readiness probes are started.
  3. Controllers are started.

Notice that steps 1 and 2 are swapped relative to the controller-runtime sequence. This swap causes deadlocks when upgrading operator deployments that use leader-for-life. When the deployment attempts to roll out a new version, the new pod starts up and first attempts to become the leader, failing indefinitely until the old pod relinquishes ownership. However, the old pod will not relinquish ownership until it disappears, and it won't disappear until the new pod reports that it's healthy. Unfortunately, the new pod will never be able to report that it's healthy, because it needs to be the leader before it starts its liveness and readiness probe servers.
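To make the circular wait concrete, here is a minimal, stdlib-only sketch that models the rollout as a loop over discrete ticks. It is a toy model, not code from this repo or from controller-runtime: the names `startupOrder`, `probesFirst`, `leaderFirst`, and `rolloutCompletes` are made up for illustration, and it assumes one old pod, one new pod, and a rollout that removes the old pod only after the new pod reports ready.

```go
package main

import "fmt"

// startupOrder is a hypothetical type naming which step a pod performs first.
type startupOrder int

const (
	probesFirst startupOrder = iota // probes, then leader election (controller-runtime's order)
	leaderFirst                     // leader election, then probes (leader-for-life before manager.Start())
)

// rolloutCompletes simulates a rollout with one old pod (holding the lock
// for as long as it exists) and one new pod. It reports whether the new pod
// ends up ready and leading within maxTicks steps.
func rolloutCompletes(order startupOrder, maxTicks int) bool {
	oldPodAlive := true
	newPodReady := false
	newPodLeader := false

	for tick := 0; tick < maxTicks; tick++ {
		// Leader-for-life: the lock is only released when the old pod is gone.
		lockFree := !oldPodAlive

		switch order {
		case probesFirst:
			// Probe servers come up regardless of leadership.
			newPodReady = true
			if lockFree {
				newPodLeader = true
			}
		case leaderFirst:
			// Probe servers only start once leadership has been acquired.
			if lockFree {
				newPodLeader = true
			}
			if newPodLeader {
				newPodReady = true
			}
		}

		// Assumption: the old pod is removed only after the new pod is ready.
		if newPodReady {
			oldPodAlive = false
		}
		if newPodLeader && newPodReady && !oldPodAlive {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("probes first, rollout completes:", rolloutCompletes(probesFirst, 10))
	fmt.Println("leader first, rollout completes:", rolloutCompletes(leaderFirst, 10))
}
```

Under these assumptions the probes-first order finishes on the second tick, while the leader-first order never makes progress no matter how large `maxTicks` is, which is the deadlock described above.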

Describe the solution you'd like
Work upstream so that controller-runtime supports pluggable leader election implementations, allowing the manager to use leader-for-life.

@estroz
Member

estroz commented Jan 19, 2021

I'd like to suggest deprecating this package in favor of controller-runtime/pkg/leaderelection, or at least adding a note that it has this bug until it is fixed, to deter users. client-go's leader-with-lease (and controller-runtime's wrapper) is quite stable and easy to use now (it was not back when this leader-for-life library was originally written), and even though it does not guarantee that leaders never overlap between elections, it seems to be the de facto standard upstream.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 20, 2021
@joelanford
Member Author

/lifecycle frozen

@openshift-ci-robot openshift-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 20, 2021
@erikgb

erikgb commented Jan 30, 2022

Is anyone working on this? What I would love to see is this leader-for-life feature available in controller-runtime! A pluggable leader election mechanism could be useful on its own, but I think getting leader-for-life into controller-runtime would be more sustainable.
