Enabled HNC leader election #88

Closed

Conversation

@joe2far joe2far commented Oct 6, 2021

  • block reconciler controller setup until instance becomes the leader using mgr.Elected() channel
  • adds leases permissions to cluster role
  • tunes leader election default configurations
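For context, controller-runtime's Manager exposes an `Elected()` channel that is closed once the instance wins leader election. A minimal stdlib-only sketch of the gating pattern described in the first bullet (hypothetical names, not HNC's actual code):

```go
package main

import (
	"fmt"
	"time"
)

// startControllers blocks until elected is closed (the pattern
// controller-runtime exposes via mgr.Elected()), then starts work.
func startControllers(elected <-chan struct{}) {
	<-elected // block until this instance wins leader election
	fmt.Println("became leader, starting reconcilers")
}

func main() {
	elected := make(chan struct{})
	go func() {
		// Simulate winning the election after a short delay.
		time.Sleep(10 * time.Millisecond)
		close(elected)
	}()
	startControllers(elected)
}
```

In the real setup, the reconciler registration would run after this wait, so followers hold off on controller startup until they are elected.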

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 6, 2021
@k8s-ci-robot

Hi @joe2far. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 6, 2021
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joe2far
To complete the pull request process, please assign tashimi after the PR has been reviewed.
You can assign the PR to them by writing /assign @tashimi in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@adrianludwin adrianludwin left a comment


Thanks for this @joe2far ! I've left some small comments in the review but I have some deeper ones.

Today, HNC runs as a single pod, and if it's killed, it'll be restarted. This typically happens quite quickly AIUI, but on any cluster of reasonable size, it then takes a while for the controllers to sync the state of every namespace and object in the system. This PR doesn't change that: it will (IIUC) wait for the prior leader to fail for at least 10s, and only then start the controllers. This is only an improvement in availability if it previously took more than 10s to restart the pod, which I don't think is the case. By contrast, on larger clusters, it can take minutes for the controllers to fully start. So this doesn't really add much in the way of availability.

Moreover, the followers in this deployment can't actually respond to webhook requests without the controllers running, but the webhook Service will select these pods and try to send them traffic. We would need to prevent this using readiness probes.

I think there's a better alternative, but it takes more work. Every controller in HNC follows a strict loop: read info from apiserver; make decisions; update in-memory data structures; apply changes to the apiserver. In HA mode, if an instance is a follower, only that last step needs to be suppressed. So you could have all three pods fully up and running (and responding to webhooks), but only enable the final step if the pod is the leader.
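A minimal sketch of that leader-gated final step, using a stdlib atomic flag in place of the real election callbacks (all names here are hypothetical, not HNC's actual code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// isLeader would be flipped by the leader-election callbacks.
var isLeader atomic.Bool

// reconcile runs the full loop on every instance, but only the
// leader performs the final write back to the apiserver.
func reconcile(ns string) string {
	desired := "desired-state-for-" + ns // read + decide + update in-memory state
	if !isLeader.Load() {
		return "skipped apply (follower)"
	}
	// apply changes to the apiserver (elided in this sketch)
	return "applied " + desired
}

func main() {
	fmt.Println(reconcile("team-a")) // follower: decision made, write suppressed
	isLeader.Store(true)
	fmt.Println(reconcile("team-a")) // leader: write proceeds
}
```

The point of the design is that followers keep their in-memory state warm and can serve webhooks, so failover only has to enable the write step rather than rebuild everything.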

In this scenario, there will be a race condition between the old leader dying and the new one taking over, so it's possible that none of the pods will commit a decision that needs to be committed, because they'll assume that the (now dead) leader did it. So when a pod takes over as leader, it will need to re-synchronize the entire cluster, in order to catch anything it missed. Usually, it won't have missed anything, so this will be a no-op, but it will still take time and use quota. So we'd probably want to manually throttle it, e.g. have a goroutine that tried to sync, say, one namespace every 200ms or so (we'd have to tune this). This would leave HNC responsive enough to deal with any new user actions, while still ensuring that anything that had fallen through the cracks would be fixed in a reasonable amount of time.

Does all this make sense?

Review comments on cmd/manager/main.go (outdated, resolved).
@joe2far

joe2far commented Oct 11, 2021

> [quotes @adrianludwin's comment above in full]

Yes, the above makes sense. The intention here was to improve availability using the existing leader-election logic. During upgrades, where there is a lot of namespace activity coupled with node updates/scaling, we have seen issues where HNC could be taken offline by a node upgrade, and during that time certain namespace operations were blocked from completing.

@joe2far

joe2far commented Oct 12, 2021

While I had made updates to this PR, given that there will be a race condition between the old leader dying and the new one taking over (so it's possible that none of the pods will commit a decision that needs to be committed), is there much point in adding this basic leader election to HNC? Or is the recommendation to do the more involved work of making the reconcilers aware of leader/follower mode and re-queueing changes when a new leader is elected?

To this end, I have created a new PR which starts updating the controllers to be leader-aware, rather than running with one leader and standby instances waiting to become the leader.
ref: #94

@adrianludwin

My intuition tells me that this "basic" leader election won't help availability, so unless you've tried this out locally and seen an improvement, I think you're right. Let's close this one off and continue on #94.

@adrianludwin

/close

@k8s-ci-robot

@adrianludwin: Closed this PR.

In response to this:

/close

