Refactor NTO to use controller runtime lib #302
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: yanirq. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test e2e-aws-operator
/test e2e-aws-operator
Congrats on the
All the previous PRs can be closed. /cc @cynepco3hahue
/retest
/retest
/retest
/retest
I did a bit more testing similar to what Jiri already shared. I tested 3 cases for the old code and the new code: 1. fully idle, 2. creating ~1000 pods in the background without any custom profile, 3. creating ~1000 pods in the background with a Profile that matches those pods. Here are the results:
This confirms that the controller-runtime implementation is doing some work when pods are created even when the pod label matching functionality is unused. It also shows that when the pod label matching functionality is used, the new implementation uses many more CPU cycles than the old implementation.
/retest
@yanirq: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/hold - #316 is being examined as an alternative to this PR
LeaderElection:          true,
LeaderElectionID:        config.OperatorLockName,
LeaderElectionNamespace: ntoNamespace,
LeaseDuration:           &le.LeaseDuration.Duration,
This was to keep the original behavior before the refactor.
restConfig := ctrl.GetConfigOrDie()
le := util.GetLeaderElectionConfig(restConfig, enableLeaderElection)
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	NewCache: cache.MultiNamespacedCacheBuilder(namespaces),
Why would you like to use this option?
How many namespaces would you like to pass here?
Note that if you try to cache all (or all but one or two) of the namespaces on the cluster, you will probably run into performance issues.
Please check the comment: https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/cache/multi_namespace_cache.go#L40-L46
The main reason here was to watch cluster-wide resources such as nodes and pods, and to have a distinct namespace for NTO to be used for filtering.
LeaseDuration: &le.LeaseDuration.Duration,
RetryPeriod:   &le.RetryPeriod.Duration,
RenewDeadline: &le.RenewDeadline.Duration,
Namespace:     ntoNamespace,
You are passing a Namespace and NewCache: cache.MultiNamespacedCacheBuilder(namespaces).
Note that you need permissions (scope) to read/update/delete resources, and thereby cache them, in the namespace(s) where the operator will be installed. Then, you can:
a) Watch/cache resources in a set of namespaces
It is possible to use MultiNamespacedCacheBuilder from Options to watch and manage resources in a set of namespaces.
OR
b) Watch/cache resources in a single namespace (where the operator will be installed) by using the Namespace option. See:
// Namespace if specified restricts the manager's cache to watch objects in
// the desired namespace. Defaults to all namespaces.
//
// Note: If a namespace is specified, controllers can still Watch for a
// cluster-scoped resource (e.g Node). For namespaced resources the cache
// will only hold objects from the desired namespace.
Namespace string
Ref: https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.11.1/pkg/manager#Manager
OR
c) Do not set either, which means granting cluster-scope permissions to the project; your operator will then watch/cache the whole cluster.
Be aware that the more resources/namespaces you watch and cache, the more resources you will consume.
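For reference, a minimal sketch contrasting options (a) and (b) above; the namespace values and variable names are illustrative assumptions, while the Options fields are the standard controller-runtime v0.11 API:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func main() {
	// Illustrative values only.
	ntoNamespace := "openshift-cluster-node-tuning-operator"
	namespaces := []string{ntoNamespace, "openshift-config"}

	// a) Cache a fixed set of namespaces.
	optsMulti := ctrl.Options{
		NewCache: cache.MultiNamespacedCacheBuilder(namespaces),
	}

	// b) Cache a single namespace; controllers can still watch
	// cluster-scoped resources such as Nodes.
	optsSingle := ctrl.Options{
		Namespace: ntoNamespace,
	}

	// Only one of the two would normally be passed to the manager;
	// setting both Namespace and NewCache is what the comment above flags.
	if _, err := ctrl.NewManager(ctrl.GetConfigOrDie(), optsSingle); err != nil {
		panic(err)
	}
	_ = optsMulti
}
```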
	RetryPeriod:   &le.RetryPeriod.Duration,
	RenewDeadline: &le.RenewDeadline.Duration,
	Namespace:     ntoNamespace,
})
By default, SDK/Kubebuilder scaffolds projects as cluster-scoped, see: https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v3/memcached-operator/main.go#L68-L79
But you can change the scope: https://sdk.operatorframework.io/docs/building-operators/golang/operator-scope/
@@ -34,10 +34,10 @@ rules:
- apiGroups: ["security.openshift.io"]
By adopting SDK/Kubebuilder you will be able to work with markers, e.g.: https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v3/memcached-operator/controllers/memcached_controller.go#L44-L48
Then, when you run make generate, the RBAC will be generated in the config/rbac/ dir: https://github.com/operator-framework/operator-sdk/tree/master/testdata/go/v3/memcached-operator/config/rbac
Also, you can use make bundle and have the whole OLM bundle generated for you from all your kustomize configs, see: https://github.com/operator-framework/operator-sdk/tree/master/testdata/go/v3/memcached-operator/bundle
That is very helpful for working with releases and providing the solution via OLM.
You can also add customizations in your base CSV: operator-sdk/testdata/go/v3/memcached-operator/config/manifests/bases/
To learn more about the default layout see: https://sdk.operatorframework.io/docs/overview/project-layout/
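For illustration, a hypothetical reconciler carrying RBAC markers; the group, resource, and type names below are placeholders rather than code from this PR, and controller-gen (driven by the make targets mentioned above) turns the marker comments into rules under config/rbac/:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ExampleReconciler is a placeholder reconciler used only to anchor the markers.
type ExampleReconciler struct {
	client.Client
}

// The kubebuilder markers below are what controller-gen reads to generate RBAC rules.
//+kubebuilder:rbac:groups=tuned.openshift.io,resources=tuneds,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=tuned.openshift.io,resources=tuneds/status,verbs=get;update;patch
//+kubebuilder:rbac:groups="",resources=nodes;pods,verbs=get;list;watch

// Reconcile is the standard controller-runtime entry point; it is a stub here.
func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}
```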
@@ -1,13 +1,15 @@
package operator
What does mc.go mean?
What is its purpose?
	return e.Object.GetName() == tunedv1.TunedClusterOperatorResourceName
},
UpdateFunc: func(e event.UpdateEvent) bool {
	if !validateUpdateEvent(&e) {
Are you doing this to address the scenario where reconciliation fails because the resource changed on the cluster? If yes, I'd suggest using the client to fetch the resource that you want to change/update before calling the update. Then, if it fails, return the err in the reconciliation to ensure that it will be executed again.
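A sketch of the fetch-then-update flow being suggested, assuming a hypothetical profileReconciler with an embedded controller-runtime client; the tunedv1 import path is an assumption based on this repository's layout:

```go
package operator

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	tunedv1 "github.com/openshift/cluster-node-tuning-operator/pkg/apis/tuned/v1"
)

type profileReconciler struct {
	client.Client
}

func (r *profileReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the latest version of the object before mutating it, so the
	// update is based on the current state of the cluster.
	tuned := &tunedv1.Tuned{}
	if err := r.Get(ctx, req.NamespacedName, tuned); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ... mutate tuned here ...

	if err := r.Update(ctx, tuned); err != nil {
		// A conflict (the resource changed on the cluster) surfaces here;
		// returning the error requeues the request so it runs again.
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```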
@@ -10,22 +10,26 @@ import (
	"k8s.io/klog"
)

func GetLeaderElectionConfig(ctx context.Context, restcfg *rest.Config) configv1.LeaderElection {
// GetLeaderElectionConfig returns leader election configs defaults based on the cluster topology
func GetLeaderElectionConfig(restcfg *rest.Config, enabled bool) configv1.LeaderElection {
Why do you need these utils for leader election?
By default, you have the controller-runtime implementation for leader election: https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/leaderelection/leader_election.go
What is the extra requirement on top of that?
Also, you might take a look at the doc https://sdk.operatorframework.io/docs/building-operators/golang/advanced-topics/#leader-election which has comprehensive info about it.
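A minimal sketch of the built-in leader election being referred to; the lock ID and namespace are illustrative, and the custom lease timings from GetLeaderElectionConfig would only be needed when these defaults are not suitable:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Enabling these options is enough for controller-runtime's default
	// lease-based leader election; LeaseDuration/RenewDeadline/RetryPeriod
	// only need to be overridden when the defaults do not fit (e.g. the
	// topology-based tuning this PR does in GetLeaderElectionConfig).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-operator-lock",
		LeaderElectionNamespace: "example-namespace",
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```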
@@ -24,10 +24,12 @@ require (
	k8s.io/klog v1.0.0
	k8s.io/klog/v2 v2.30.0
	k8s.io/utils v0.0.0-20210930125809-cb0fa318a74b
	sigs.k8s.io/controller-runtime v0.11.0
Also, wdyt about checking it against the bug-fix release: https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.11.1
sigs.k8s.io/controller-runtime => sigs.k8s.io/controller-runtime v0.11.0
sigs.k8s.io/controller-tools => sigs.k8s.io/controller-tools v0.7.0
Why do you need these replaces?
Why do you import sigs.k8s.io/controller-tools?
@camilamacedo86 Thank you for the extensive review, it is super informative and helpful (also for a deeper understanding of controller-runtime "under the hood").
@yanirq: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @yanirq, #302 (comment) Feel free to reach out if you need.
This PR will not be the format we will be using.
Refactor the cluster node tuning operator to use the controller-runtime library (release 0.11).
The functionality is internal only and replaces the direct application of a controller with the controller-runtime scheme.
The new controller(s) structure:
Current implementation that might be subject to change depending on operator performance (or reviews):
Pending tasks checklist:
This is also preliminary work to set the stage for moving the Performance Addon Operator under NTO, as documented here: openshift/enhancements#867