
Return error when leader lease is lost #1130

Merged: 1 commit merged into nginxinc:main from the fix/leader-election branch on Oct 12, 2023

Conversation

@kate-osborn (Contributor) commented on Oct 11, 2023

Proposed changes

Problem: If the NGF Pod loses its connection to the API server and cannot renew the leader election lease, it stops leading and cannot become the leader again. After it stops leading, it no longer reports any statuses until it is restarted.

Solution: Update the leader elector Start method to return an error if the leader lease is lost. This will cause the controller-runtime manager to exit, and the NGF container will restart. The new container will then attempt to become the leader again. This aligns with how the controller-runtime library handles losing a leader lease.
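
For illustration, here is a minimal sketch of that pattern, assuming hypothetical names (the leaderelect package and leaderElector type) and that the elector is registered with the controller-runtime manager via mgr.Add(); this is not the actual NGF code. The Start method blocks while participating in the election and returns a non-nil error only if the lease is lost, which stops the manager and lets the container restart:

// Sketch only: hypothetical names and wiring, not the actual NGF implementation.
package leaderelect

import (
	"context"
	"errors"
	"fmt"

	"k8s.io/client-go/tools/leaderelection"
)

// leaderElector is assumed to be registered with the controller-runtime
// manager via mgr.Add(), so the manager calls Start and shuts down if it errors.
type leaderElector struct {
	// config is assumed to carry the resource Lock, the timing settings, and
	// the OnStartedLeading/OnStoppedLeading callbacks that client-go requires.
	config leaderelection.LeaderElectionConfig
}

func (l *leaderElector) Start(ctx context.Context) error {
	elector, err := leaderelection.NewLeaderElector(l.config)
	if err != nil {
		return fmt.Errorf("failed to create leader elector: %w", err)
	}

	// Run blocks: it acquires the lease, keeps renewing it, and only returns
	// when the lease is lost or ctx is canceled.
	elector.Run(ctx)

	select {
	case <-ctx.Done():
		// The manager asked us to stop; this is a normal shutdown.
		return nil
	default:
		// We were leading and lost the lease: returning an error stops the
		// controller-runtime manager, so the NGF container restarts and
		// re-enters the election.
		return errors.New("leader election lost")
	}
}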

Testing: Verified that if NGF cannot renew the leader lease, it retries until it reaches the renew deadline of 10s and then exits:

I1011 16:53:39.872956       8 leaderelection.go:250] attempting to acquire leader lease nginx-gateway/nginx-gateway-leader-election...
I1011 16:53:39.891998       8 leaderelection.go:260] successfully acquired lease nginx-gateway/nginx-gateway-leader-election
{"level":"info","ts":"2023-10-11T16:53:39Z","logger":"leaderElector","msg":"Started leading"}
E1011 16:54:36.257646       8 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
E1011 16:54:38.261217       8 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
E1011 16:54:40.262373       8 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
E1011 16:54:42.261366       8 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
E1011 16:54:44.261783       8 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
I1011 16:54:46.254506       8 leaderelection.go:285] failed to renew lease nginx-gateway/nginx-gateway-leader-election: timed out waiting for the condition
{"level":"info","ts":"2023-10-11T16:54:46Z","logger":"leaderElector","msg":"Stopped leading"}
{"level":"info","ts":"2023-10-11T16:54:46Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2023-10-11T16:54:46Z","msg":"Stopping and waiting for leader election runnables"}
Error: failed to start control loop: leader election lost
      --leader-election-disable            Disable leader election. Leader election is used to avoid multiple replicas of the NGINX Gateway Fabric reporting the status of the Gateway API resources. If disabled, all replicas of NGINX Gateway Fabric will update the statuses of the Gateway API resources.
      --leader-election-lock-name string   The name of the leader election lock. A Lease object with this name will be created in the same Namespace as the controller. (default "nginx-gateway-leader-election-lock")
failed to start control loop: leader election lost

Once the underlying error is resolved (in this case, missing RBAC permissions), the restarted NGF container becomes the leader:

I1011 16:54:47.416192      67 leaderelection.go:250] attempting to acquire leader lease nginx-gateway/nginx-gateway-leader-election...
E1011 16:54:47.421393      67 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
{"level":"info","ts":"2023-10-11T16:54:47Z","logger":"statusUpdater","msg":"Skipping updating Gateway API status because not leader"}
{"level":"info","ts":"2023-10-11T16:54:47Z","logger":"statusUpdater","msg":"Skipping updating Nginx Gateway status because not leader"}
E1011 16:54:50.564170      67 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
I1011 16:54:54.367481      67 leaderelection.go:260] successfully acquired lease nginx-gateway/nginx-gateway-leader-election
{"level":"info","ts":"2023-10-11T16:54:54Z","logger":"leaderElector","msg":"Started leading"}

Please focus on (optional): Right now, we use the core client's default values for leader election:

// The duration that the acting master will retry refreshing leadership before giving up.
renewDeadline = 10 * time.Second
// The duration that non-leader candidates will wait to force acquire leadership.
leaseDuration = 15 * time.Second
// The duration the LeaderElector clients should wait between tries of actions.
retryPeriod   = 2 * time.Second

If NGF loses connectivity to the API server, it will try to renew the leader lease 5 times over a 10-second period (the 10-second renewDeadline divided by the 2-second retryPeriod) before giving up.

We can increase the lease duration and renew deadline so that NGF retries renewing the leader lease for a longer period and can ride out temporary connection issues. However, increasing these values also means that recovery takes longer after the leader is lost: a candidate Pod waits until the existing lease has gone unrenewed for the full lease duration before a new election is forced.
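
As a rough illustration (not part of this PR), overriding the defaults with client-go's LeaderElectionConfig could look like the sketch below. The helper name, the timing values, and the lock parameter (an already-constructed Lease-backed resourcelock.Interface) are all assumptions for the example:

// Sketch only: illustrative values, not the NGF defaults and not part of this PR.
package leaderelect

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// tunedElectionConfig is a hypothetical helper showing where the three timing
// knobs live; lock is an already-built Lease-backed resourcelock.Interface.
func tunedElectionConfig(lock resourcelock.Interface) leaderelection.LeaderElectionConfig {
	return leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 30 * time.Second, // candidates wait this long of lease staleness before forcing a new election
		RenewDeadline: 20 * time.Second, // the leader keeps retrying renewal for up to this long before giving up
		RetryPeriod:   2 * time.Second,  // interval between renewal/acquire attempts
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start reporting statuses */ },
			OnStoppedLeading: func() { /* surface the "leader election lost" error */ },
		},
	}
}

With these example values, the leader would retry renewal roughly 20s / 2s = 10 times before giving up, while a failed-over candidate could take up to 30 seconds to acquire the lease.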

Closes #1100

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

@kate-osborn requested a review from a team as a code owner on October 11, 2023 18:10
@github-actions bot added the bug (Something isn't working) label on Oct 11, 2023
@pleshakov (Contributor) commented:

This will cause the controller-runtime manager to exit, and the Pod will restart. The new Pod will then attempt to become the leader again.

Will the NGINX container continue to run, or will it also be brought down?

If NGF loses connectivity to the API server, it will try to renew the leader lease 5 times over a 10-second period.

Is this true for non-leaders as well?

Let's imagine we have 5 NGF Pods running and they temporarily lose connectivity to the API server (let's say it is undergoing an upgrade). Could that lead to a restart of all 5 Pods, or only the leader?

@kate-osborn (Contributor, Author) commented:

Will the NGINX container continue to run, or will it also be brought down?

It continues to run.

Is this true for non-leaders as well?

No. The non-leaders continue to retry until they succeed or the context is canceled.

Let's imagine we have 5 NGF Pods running and they temporarily lose connectivity to the API server (let's say it is undergoing an upgrade). Could that lead to a restart of all 5 Pods, or only the leader?

I tested this and only the leader restarts.

@kate-osborn force-pushed the fix/leader-election branch 2 times, most recently from b4b4f73 to 988c4ee on October 12, 2023 14:11
@kate-osborn merged commit 4f40fca into nginxinc:main on Oct 12, 2023
23 checks passed
Labels: bug (Something isn't working)
Projects: Status: Done
Development: Successfully merging this pull request may close these issues: Statuses not reported because no leader gets elected
4 participants