
Enable HNC leader election in controllers #94

Closed

Conversation

joe2far
Contributor

@joe2far joe2far commented Oct 12, 2021

  • Updates the HNC reconcilers to be leader-aware and only perform writes when this instance is the leader (see the sketch below)
  • Non-leader instances keep updating their forest tree cache, but do not perform any writes
  • All HNC instances can respond to validating/mutating webhooks
  • When a leader election occurs, objects are re-queued
  • Adds lease permissions to the cluster role
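
For illustration, here is a minimal sketch of the leader-aware pattern described above (the isLeader flag, the Reconciler type, and the helper methods are placeholders, not the actual HNC code):

package example

import (
	"context"
	"sync/atomic"

	ctrl "sigs.k8s.io/controller-runtime"
)

// isLeader is an assumed process-wide flag, flipped when this replica wins the lease.
var isLeader atomic.Bool

// Reconciler stands in for an HNC reconciler; the real types differ.
type Reconciler struct{}

// Reconcile always refreshes the in-memory forest so followers stay warm, but only
// the leader writes anything back to the apiserver.
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	r.syncForest(req) // placeholder: update the local forest cache from the request

	if !isLeader.Load() {
		// Follower: the cache is up to date, but skip all writes.
		return ctrl.Result{}, nil
	}
	return ctrl.Result{}, r.writeChanges(ctx, req) // placeholder write path
}

func (r *Reconciler) syncForest(req ctrl.Request) {}

func (r *Reconciler) writeChanges(ctx context.Context, req ctrl.Request) error { return nil }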

Manual testing was completed with 3 instances and --enable-leader-election:

  • Manually killed the leader instance at different times while running tests that create many parent/child namespace hierarchies
  • Validated that the new leader took over and handled updating the child namespace resources
  • Also validated that all 3 instances were successfully handling validation webhook calls

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 12, 2021
@k8s-ci-robot
Contributor

Hi @joe2far. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 12, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joe2far
To complete the pull request process, please assign tashimi after the PR has been reviewed.
You can assign the PR to them by writing /assign @tashimi in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@adrianludwin
Contributor

Hey I'll review this shortly, but FYI I didn't know anyone had any big changes under way and moved a lot of files around this weekend :( Sorry about that! See #92 for details, but the summary is that instead of arranging the code by function (reconcilers, mutators, and validators), I've arranged the code by feature (anchors, hierarchyconfig, hncconfig, namespace, objects). Most of the old files still exist but have been renamed, e.g.:

internal/reconcilers/anchor.go -> internal/anchor/reconciler.go
internal/reconcilers/hierarchy_config.go -> internal/hierarchyconfig/reconciler.go
internal/reconcilers/hnc_config.go -> internal/hncconfig/reconciler.go
internal/reconcilers/object.go -> internal/objects/reconciler.go
internal/reconcilers/setup.go -> internal/setup/reconcilers.go

Hopefully most of the changes should be easy to port.

Contributor

@adrianludwin adrianludwin left a comment


Good start, thanks!

internal/reconcilers/anchor.go (outdated review thread, resolved)
internal/reconcilers/anchor.go (outdated review thread, resolved)
internal/reconcilers/hnc_config.go (outdated review thread, resolved)
internal/reconcilers/object.go (outdated review thread, resolved)
internal/forest/forest.go (outdated review thread, resolved)
@adrianludwin
Contributor

Hey @joe2far , I don't want to rush you or anything but I'm thinking about releasing HNC v0.9 shortly (e.g. in the next week or so). Do you think you'll be working on this over the next few days - i.e., do you want me to wait for this to go in? Or is this a lower-priority project for you? Thanks!

@joe2far
Contributor Author

joe2far commented Oct 15, 2021

@adrianludwin sorry, I haven't had time to progress this further. It would be nice to have this in the upcoming release, but it won't be ready in time. I hope to get this PR into a better state for review by the middle of next week.

@adrianludwin
Contributor

adrianludwin commented Oct 15, 2021 via email

@joe2far
Contributor Author

joe2far commented Oct 18, 2021

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2021
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 18, 2021
@joe2far joe2far force-pushed the leader-election-support2 branch 2 times, most recently from 3ac4bb9 to 6a6250c on October 18, 2021 14:00
cmd/manager/main.go (outdated review thread, resolved)
cmd/manager/main.go (outdated review thread, resolved)
internal/anchor/reconciler.go (outdated review thread, resolved)
internal/hncconfig/reconciler.go (outdated review thread, resolved)
internal/forest/forest.go (outdated review thread, resolved)
internal/setup/reconcilers.go (outdated review thread, resolved)
@joe2far
Contributor Author

joe2far commented Oct 19, 2021

@adrianludwin I have addressed your comments, but while testing I have hit a more fundamental problem with the approach of having reconcilers and webhooks active/running on non-leader HNC instances.

The reconciler for the webhook cert rotator, along with the other HNC reconcilers, is added as a leaderElectionRunnable (ref: https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/manager/internal.go#L69).

The sequence in the controller manager for starting runnables is:

go startNonLeaderElectionRunnables
  -> waitForCache
  -> start nonLeaderElectionRunnables
[After leader election]
go startLeaderElectionRunnables
  -> waitForCache
  -> start leaderElectionRunnables

Because of the above, the reconciler for the webhook cert rotator is only ever started after the instance has successfully been elected leader,
so the webhooks, even though they are nonLeaderElectionRunnables, are blocked on cert setup and will only ever start after the instance is elected leader.

From the logic in the manager, it determines leader vs non-leader runnables via the NeedLeaderElection() method:
https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/manager/internal.go#L209
I had started adding NeedLeaderElection()=false to the HNC reconcilers, but the cert rotator reconciler is set up in the external library open-policy-agent/cert-controller and blocks webhook + reconciler setup.

I am now not sure what the best approach is, and would like your input on direction (a rough sketch of option 2 follows below):
1.) Should we implement our own HNC leader-election lease logic to set an IsLeader config flag on each instance (allowing the controller manager to start all runnables independently of this toggle)?
2.) Or should we use the leader election logic in the controller-runtime manager, making all webhooks and reconcilers nonLeaderElectionRunnables (and watching the mgr.Elected() channel to toggle their behaviour via an IsLeader config flag)?
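
Rough sketch of option 2 (the isLeader flag and runnable type are illustrative, not from this PR; mgr.Elected() and manager.LeaderElectionRunnable are existing controller-runtime APIs):

package example

import (
	"context"
	"sync/atomic"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

var isLeader atomic.Bool

// runnable stands in for an HNC reconciler or webhook wrapper.
type runnable struct{}

func (runnable) Start(ctx context.Context) error { return nil }

// NeedLeaderElection=false makes the manager start this runnable on every replica
// instead of waiting for the lease.
func (runnable) NeedLeaderElection() bool { return false }

// Compile-time check: this is the interface controller-runtime inspects when
// grouping runnables into leader vs non-leader sets.
var _ manager.LeaderElectionRunnable = runnable{}

// watchLeadership flips the shared flag once this replica wins the election.
// mgr.Elected() is closed when the manager becomes leader (or when it starts,
// if leader election is disabled).
func watchLeadership(mgr ctrl.Manager) {
	go func() {
		<-mgr.Elected()
		isLeader.Store(true)
	}()
}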

@adrianludwin
Contributor

adrianludwin commented Oct 19, 2021

Hmm. Is it a problem that the cert rotator isn't running? The cert rotator's output is dumped into the webhook config and secret on the apiserver, so it shouldn't matter if it's only running in one of the three replicas, right? The only problem I might see is that after the secret is updated, it can take up to 90s for all the pods to get updated with the new secret, so if the leader dies within 90s of first starting, the other pods might be in a bad state until the 90s are up. But this seems like a bit of a corner case.

So functionally, we shouldn't care (IIUC). Are you saying that something in the framework is stopping the webhooks from running at all?

@joe2far
Contributor Author

joe2far commented Oct 19, 2021

> Hmm. Is it a problem that the cert rotator isn't running? The cert rotator's output is dumped into the webhook config and secret on the apiserver, so it shouldn't matter if it's only running in one of the three replicas, right? The only problem I might see is that after the secret is updated, it can take up to 90s for all the pods to get updated with the new secret, so if the leader dies within 90s of first starting, the other pods might be in a bad state until the 90s are up. But this seems like a bit of a corner case.
>
> So functionally, we shouldn't care (IIUC). Are you saying that something in the framework is stopping the webhooks from running at all?

At the following point:
https://github.com/kubernetes-sigs/hierarchical-namespaces/blob/master/cmd/manager/main.go#L203
we currently block registration of the webhooks until the certs are created, so the webhooks will only be added/set up after the cert rotator setup has run, which (as a leaderElectionRunnable) only happens once the current instance has been elected leader. Roughly, the ordering looks like the sketch below.
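
Simplified sketch of that ordering (not the actual main.go; the channel and helper names are made up):

package example

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

// setupWithCerts sketches the current startup ordering: webhook registration waits
// on a channel that the cert rotator is assumed to close once the certs exist.
// Because the cert rotator is a leaderElectionRunnable, the channel is only ever
// closed on the elected leader, so the webhooks never come up on follower replicas.
func setupWithCerts(mgr ctrl.Manager) {
	certsReady := make(chan struct{})

	// Placeholder for the cert rotator setup, which runs only on the leader.
	addCertRotator(mgr, certsReady)

	go func() {
		<-certsReady          // blocks forever on followers
		registerWebhooks(mgr) // placeholder for the webhook setup in main.go
	}()
}

func addCertRotator(mgr ctrl.Manager, ready chan struct{}) {}

func registerWebhooks(mgr ctrl.Manager) {}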

@adrianludwin
Contributor

adrianludwin commented Oct 19, 2021 via email

@joe2far
Contributor Author

joe2far commented Oct 19, 2021

I haven't looked into the cert-rotator logic, but as you said, it already knows not to touch anything if the certs are all good.

I have done a test where I updated the logic in the controller manager (at https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/manager/internal.go#L211) to force all HNC webhooks and controllers/reconcilers (including the cert rotator) into the manager's nonLeaderElectionRunnables.

From this test, I got the expected behaviour, with all controllers and webhooks starting up on all instances (not just the leader), and I was able to see mutating/validating webhooks being hit on all instances (not just the leader). So it looks like the cert controller might work, but there are likely to be some race-condition issues (which I may have just been lucky to miss...).

I will do some more testing and look further into the cert-controller library... but getting all HNC runnables into the manager as nonLeaderElectionRunnables may give us the leader-follower behaviour we are looking for, provided they can all function in leader-follower mode (or possibly with multiple instances running...).

@erikgb
Contributor

erikgb commented Jan 30, 2022

@joe2far What is the status on this? I think enabling HNC leader election would be excellent to get into the next release (GA, 1.0?). Please let me know if I can assist in any way to get this over the finish line. 😄

@erikgb
Contributor

erikgb commented Jan 30, 2022

I am not sure if introducing an additional dependency would be a good idea, but we might have a look at the "Leader for life" approach implemented as part of operator-sdk/operator-lib:

https://github.com/operator-framework/operator-lib/blob/0449c0c509287014f1095739b06279e9113105da/leader/leader.go#L88

Docs: https://sdk.operatorframework.io/docs/building-operators/golang/advanced-topics/#leader-for-life

Relevant issue: operator-framework/operator-lib#48

@joe2far
Contributor Author

joe2far commented Jan 30, 2022

Hi @erikgb, I have not looked at this for some time, so the PR is stale, but the leader election behaviour was manually tested when I was last looking into this...

The solution here uses the leader election from controller-runtime, which is Kubernetes lease-based. However, in order for the HNC reconcilers to work in leader-follower mode with it, they need to be updated with NeedLeaderElection=false so that they start as nonLeaderElectionRunnables, and they then become the leader instance based on the mgr.Elected() channel. Only the leader instance performs writes; follower instances don't do any Kubernetes writes but still respond to webhooks and perform other actions to keep their forest cache in sync, so they are warm instances that can take over when elected.

I feel that using controller-runtime leader election but setting up all of the manager's reconcilers as non-leader-election runnables is a bit hacky, and I would agree that a solution similar to the leader election in https://github.com/operator-framework/operator-lib/
may be better than working around controller-runtime, which is really designed for just a single active leader instance.
@adrianludwin what are your thoughts on swapping from controller-runtime leader election to the operator-framework/operator-lib leader, or a similar implementation?
Better yet might be to use https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection.go (similar to what controller-runtime uses) directly in HNC, so we have better control over it; a rough sketch of that follows below.
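
For reference, a rough sketch of driving client-go's leaderelection directly (the lease name, namespace, and isLeader flag are assumptions, not part of this PR):

package example

import (
	"context"
	"sync/atomic"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

var isLeader atomic.Bool

// runLeaderElection blocks, competing for a Lease and toggling the shared flag as
// leadership is gained or lost.
func runLeaderElection(ctx context.Context, client kubernetes.Interface, id string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "hnc-leader", Namespace: "hnc-system"}, // assumed names
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { isLeader.Store(true) },
			OnStoppedLeading: func() { isLeader.Store(false) },
		},
	})
}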

I will have some free cycles in the coming week and will look into reviving this PR (and @erikgb, I would welcome any assistance on offer), and I will probably swap over the leader election implementation used in this change.

@joe2far
Contributor Author

joe2far commented Jan 31, 2022

For example, here is a leader-follower setup using the operator-framework/operator-lib leader, as used by Vault:
see: https://github.com/hashicorp/vault-k8s/blob/c02054a37dd6c141c02975679a08cfa1053d245d/leader/leader.go

@linux-foundation-easycla

linux-foundation-easycla bot commented Feb 1, 2022

CLA Signed

The committers are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 1, 2022
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 1, 2022
@joe2far
Contributor Author

joe2far commented Feb 1, 2022

I have now rebased this PR onto the latest HNC master, added some logging around the reconcilers' leader election, and again tested the basic logic locally.
@erikgb it would be good to get additional review/testing to verify the logic.

...
While my comments above still stand, I feel that moving the leader election logic into the hierarchical namespace controller is on one hand cleaner (as controller-runtime leader election is not really meant to support leader-follower mode), but on the other hand it would likely mean duplicating similar logic to wrap client-go/tools/leaderelection and disabling the leader election in controller-runtime, which may be equally confusing...

@erikgb
Contributor

erikgb commented Feb 1, 2022

One thing I am not able to understand: why do we need a leader-follower leader election? I would prefer if we could use the standard single-leader approach provided by default by controller-runtime. We use that for all our in-house controllers, and can scale the number of replicas solely based on webhook requirements. The webhooks must work at all times, as this is extremely important for general cluster availability, and I have heard of people setting up 9 replicas just to reduce the chance of having no webhook endpoint available.

I tested the standard controller-runtime leader election locally with HNC, and it does not work at all. I guess that is the reason this PR suggests building some kind of custom leader election mechanism. Why wouldn't it be possible to use the standard leader election with some refactoring/changes? I think the webhooks should be able to work independently of the reconciler manager. So if you spin up 10 replicas, you would have 10 fully functional webhooks, but just a single manager performing reconciles. AFAIK most in-tree reconcilers in K8s run with a single worker, e.g. ReplicaSet->Pod.

Maybe @adrianludwin can explain why HNC cannot do this? IMO the webhooks should have a read-only cache of the forest, while the reconcilers should have a read-write cache. Would that be possible to achieve? Where is the potential blocker? The webhook cert-issuer?

@adrianludwin
Contributor

adrianludwin commented Feb 1, 2022 via email

@erikgb
Contributor

erikgb commented Feb 1, 2022

I am not worried about webhook performance, but webhook availability is essential IMO. A failing webhook endpoint can affect all API calls in the cluster, so you typically want to run at least two replicas serving webhook endpoints, to be able to handle smooth updates and possible cluster issues like losing a node. Several controllers I am aware of run their webhooks in dedicated workloads for exactly this reason; one example is cert-manager. This was also noted/suggested in this comment: #68 (comment)

Why can't the webhooks run with their own dedicated (read-only) forest, based on code shared with the reconcilers? I understand that this could end up being a huge change in HNC, but HA for webhooks is important, and as a bonus we might be able to use the standard leader election in controller-runtime.

When is the next community meeting? Maybe this subject is better addressed there, than as comments in this PR. 😉

@adrianludwin
Contributor

adrianludwin commented Feb 1, 2022 via email

@erikgb
Contributor

erikgb commented Feb 1, 2022

> Good point about availability. Although at least the object webhook fails open IIRC so your cluster doesn't come to a crashing halt. But better if we don't need to rely on that.

I would consider any webhook failing for "technical reasons" something that can be improved (bug). 😉

> I'm not sure how you'd share code with the webhooks without basically reimplementing almost the entire reconciler. Most of the reconcilers' logic is in fact about updating the forest to match what's on the server, so it seems much easier to me to modify the reconcilers to run in read-only mode (i.e., exactly what's necessary to create a read-only forest) than to try to factor out the common code, and then set up our own custom replacement for the controller-manager, just so that we can use standard leader election. Something's going to have to be nonstandard :)

As I wrote in my previous comment, I understand that changing this into something reusable will be a BIG refactoring. That said, I think it could make sense to decouple the forest cache and the reconcilers in the long term. I imagine that the forest cache could include an informer component that would be responsible for dispatching events to the reconcilers, and that would function as an adapter around the standard write-through cache + informer in controller-runtime.

> We don't usually discuss issues like this in our regular meetings; if needed we could have a sync in the next couple of days, but otherwise I'm ok to keep this on PRs.

Yes, I would be interested in a sync - if you are available. Just ping me on Slack.

To summarize my stand on short-term goals, I think enabling HA on webhooks is very important. And to get it, it seems like we need some kind of leader election.

@adrianludwin
Contributor

adrianludwin commented Feb 2, 2022 via email

@erikgb
Contributor

erikgb commented Feb 2, 2022

> (leader election is a pretty simple feature)

I disagree 😉, and I think I have some support for my stand from controller-runtime and operator-lib/operator-sdk: https://sdk.operatorframework.io/docs/building-operators/golang/advanced-topics/#leader-election and https://github.com/kubernetes/client-go/blob/30b06a83d67458700a5378239df6b96948cb9160/tools/leaderelection/leaderelection.go#L21-L24.

> I'll try to give you a call on Friday morning Toronto time

Ack! 👍

@adrianludwin
Contributor

adrianludwin commented Feb 2, 2022 via email

@adrianludwin
Contributor

Oh man I forgot to hit "submit" on a review I wrote in early November. Sorry! Will re-review now.

Contributor

@adrianludwin adrianludwin left a comment


Left a few comments. @joe2far , thanks so much for reviving this.

I'll try to summarize the conversation I had with @erikgb on Friday. Basically, his top priority is making the webhooks highly available, and not the reconciler. Technically, this means we don't actually need LE for this one purpose; instead, we could:

  • Have a deployment of a single pod for the reconciler which doesn't serve webhooks
  • Have another deployment of 3 pods that only serve the webhooks

These might be exactly the same image, just running with a different command-line flag. And it would probably work just fine. The only problem would be that if the reconciler pod ever died, it could be a few minutes (on a large cluster) before it became active again. And in that time, the webhooks could show some undefined behaviour if someone was busy modifying the hierarchy.

However, why should we wait for a few minutes for the reconciler pod to restart, if all the webhook pods already have all the information needed to reconcile as well? Also, this could use up a lot of the apiserver's bandwidth for a while. Instead, we can use LE to just turn one of the replicas into the leader, since it already has all its information loaded into memory and ready to go. Even re-syncing all its reconcilers is probably very fast since all the API objects will be in the controller-runtime cache. There might still be a short period where the webhooks are not fully in sync with what's on the apiserver but that will at least resolve itself far more quickly.

@erikgb , just to your specific comments:

  • What Joe's implemented is essentially the same as the "leader for life" approach you suggested, at least as I understand it. I've even suggested using the atomic.Value approach from that code here as well.
  • You mentioned that you "tested the standard controller-runtime leader election locally with HNC, and that does not work at all." Were you testing what's currently built into HNC (which is known not to work) or this change that Joe's proposing? I wouldn't expect anything to work until this change is merged.

@erikgb , if you're happy with this I'm ready to merge this change (after a few nits), especially if @joe2far can manually test it one last time. @rjbez17 , can you please have a look at this as well?

Thanks everyone and sorry for the delays here!

@@ -56,6 +58,11 @@ type Reconciler struct {
// Reconcile sets up some basic variables and then calls the business logic. It currently
// only handles the creation of the namespaces but no deletion or state reporting yet.
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
// If not the leader, exit without performing any write operations
Contributor

It's not obvious that we can safely skip this entire reconciler. Maybe a better comment would be: this reconciler never updates the forest, so it's fine to skip it entirely if it's not the leader.

@@ -415,9 +422,72 @@ func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
}},
}
}
return ctrl.NewControllerManagedBy(mgr).
Contributor

Rather than changing this initialization logic, can't we just add the NeedLeaderElection() method to all the reconcilers? That seems easier and less disruptive to me.

Contributor

Oh never mind, I forgot about the distinction between Controller and Reconciler. Sorry.


// enqueueAllObjects enqueues all the current anchor objects in all namespaces.
func (r *Reconciler) enqueueAllObjects(ctx context.Context, log logr.Logger) error {
keys := r.Forest.GetNamespaceNames()
Contributor

Need r.Forest.Lock() and defer r.Forest.Unlock() here
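
A sketch of the requested change, with the forest simplified to a mutex-guarded list (not the real HNC type; the logging stands in for the actual enqueue logic):

package example

import (
	"context"
	"sync"

	"github.com/go-logr/logr"
)

// forest is a simplified stand-in for HNC's in-memory forest.
type forest struct {
	sync.Mutex
	names []string
}

func (f *forest) GetNamespaceNames() []string { return f.names }

type Reconciler struct {
	Forest *forest
}

// enqueueAllObjects holds the forest lock while reading the namespace list, so a
// concurrent reconcile can't mutate the forest underneath it.
func (r *Reconciler) enqueueAllObjects(ctx context.Context, log logr.Logger) error {
	r.Forest.Lock()
	defer r.Forest.Unlock()
	for _, ns := range r.Forest.GetNamespaceNames() {
		log.V(1).Info("re-enqueuing namespace", "namespace", ns) // placeholder for the actual enqueue
	}
	return nil
}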

Comment on lines +11 to +13

// IsLeader is a global which represents whether this HNC instance is the leader
var IsLeader bool
Contributor

Consider using atomic.Value()? I'd guess that adds a memory fence which might be useful if the value has just changed.
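
For example, a minimal version of that suggestion (shown with atomic.Bool here rather than atomic.Value, purely as an illustration):

package example

import "sync/atomic"

// isLeader replaces the plain bool; Load/Store provide the memory ordering the
// review comment mentions, so a leadership flip is safely visible to all goroutines.
var isLeader atomic.Bool

// SetLeader is called when this instance gains or loses leadership.
func SetLeader(leader bool) { isLeader.Store(leader) }

// IsLeader is checked by the reconcilers before performing any writes.
func IsLeader() bool { return isLeader.Load() }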


// enqueueAllObjects enqueues all the current objects in all namespaces.
func (r *Reconciler) enqueueAllObjects(ctx context.Context, log logr.Logger) error {
keys := r.Forest.GetNamespaceNames()
Contributor

Lock/unlock forest here

Comment on lines +166 to +170
// NeedLeaderElection is false as this runnable can run in standby mode without leader lock
func (cr *CertRotator) NeedLeaderElection() bool {
return false
}

Contributor

This seems a bit weird - what happens if multiple instances try to create conflicting certs for the same webhook at the same time? I wonder if this will normally work but not always. Do you know?

Otherwise, an alternative option here might be to leave the runnables as requiring leader election, but then provide a function for the followers to determine when it's safe to start. E.g. maybe modify ensureCertsMounted() so that it's a standalone function that you can call (or just factor it out so as to preserve the existing interface), and it simply blocks until the certs exist. wdyt?
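
A small sketch of that alternative (the function name and cert file layout are assumptions; cert-controller's actual interface may differ):

package example

import (
	"context"
	"os"
	"path/filepath"
	"time"
)

// waitForCerts blocks until the webhook cert files exist on disk, so follower
// replicas can register their webhooks without ever running the rotator themselves.
func waitForCerts(ctx context.Context, certDir string) error {
	for {
		if _, err := os.Stat(filepath.Join(certDir, "tls.crt")); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
			// poll again
		}
	}
}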

Either way, we should add this to the main cert-controller as well. Can you please file an issue to do so? Thanks!

@k8s-ci-robot
Contributor

@joe2far: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 8, 2022
@adrianludwin
Contributor

I think that #163 will be our solution for HA support, not leader election. We'll also focus on greatly accelerating startup time in v1.1, further reducing the need for active/passive replicas. Since there's not a lot of activity on this PR, let's close this and continue work on #163 and its derivatives.

/close

@k8s-ci-robot
Contributor

@adrianludwin: Closed this PR.

In response to this:

I think that #163 will be our solution for HA support, not leader election. We'll also focus on greatly accelerating startup time in v1.1, further reducing the need for active/passive replicas. Since there's not a lot of activity on this PR, let's close this and continue work on #163 and its derivatives.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
