Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore events that are outside of namespaces that the controller manages #5058

Closed

Conversation

naemono
Copy link
Contributor

@naemono naemono commented Nov 18, 2021

This will ignore any CRDs laid down within a Kubernetes cluster that are outside of the namespaces that ECK is set to manage (managedNamespaces option).

resolves #4168

@botelastic botelastic bot added the triage label Nov 18, 2021
@naemono naemono added the >enhancement Enhancement of existing functionality label Nov 19, 2021
@botelastic botelastic bot removed the triage label Nov 19, 2021
@naemono
Copy link
Contributor Author

naemono commented Nov 22, 2021

This PR takes the most minimal approach to solving the issue referenced in #4168, which references the issue where our custom CRDs themselves, from namespace a, are ignored by the operator which is managing a set of namespaces b,c,d. I'm currently testing whether we'd also need to include this/these predicates in other watches, for objects owned by these parent objects, such as https://github.com/elastic/cloud-on-k8s/blob/master/pkg/controller/agent/controller.go#L68-L73 Any early 👀 on this would be welcomed.

@naemono naemono marked this pull request as ready for review November 22, 2021 15:21
@naemono
Copy link
Contributor Author

naemono commented Nov 22, 2021

Yeah, my testing is showing that these will need to be ignored as well. That change is a WIP.

@naemono
Copy link
Contributor Author

naemono commented Nov 22, 2021

Ok, changes complete. I'm not seeing these error messages any longer when running 2 operators within the same cluster, managing 2x separate namespaces:

operator 1

❯ helm get values -n elastic-system elastic-operator
USER-SUPPLIED VALUES:
managedNamespaces:
- default

operator 2

❯ helm get values -n es-system eck-operator
USER-SUPPLIED VALUES:
fullnameOverride: eck-operator
installCRDs: false
managedNamespaces:
- elastic
nameOverride: eck-operator

With the following resources

❯ kc get es,kibana,agent -n default
NAME                                                 HEALTH    NODES   VERSION   PHASE             AGE
elasticsearch.elasticsearch.k8s.elastic.co/testing   unknown   1       7.15.1    ApplyingChanges   5d20h

NAME                                          HEALTH   NODES   VERSION   AGE
kibana.kibana.k8s.elastic.co/kibana-testing   green    1       7.15.0    5d20h

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we could not assume that there is only one predicate, and also that this predicate is immutable. Based on this assumption we would have the predicate initialized in only one place, and shared as a public variable.

For example, the predicate can be managed in a predicate package:

package predicate

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

var ManagedNamespacePredicate predicate.Predicate

// NewManagedNamespacesPredicate will return a predicate that will ignore events
// that exist outside of the given managed namespaces,
func NewManagedNamespacesPredicate(managedNamespaces []string) predicate.Predicate {
	/* logic to create the predicate*/
}

The predicate would be then initialized in main.go :

// configure the manager cache based on the number of managed namespaces
managedNamespaces := viper.GetStringSlice(operator.NamespacesFlag)
predicate.ManagedNamespacePredicate = predicate.NewManagedNamespacesPredicate(managedNamespaces)
switch {
   case len(managedNamespaces) == 0:
....

The predicate can now be used from any controller, without the need of adding a new function parameter, or a new member in a struct.
For example:

// addWatches adds watches for all resources this controller cares about
func addWatches(c controller.Controller, r *ReconcileBeat) error {
	// Watch for changes to Beat
	if err := c.Watch(&source.Kind{Type: &beatv1beta1.Beat{}}, &handler.EnqueueRequestForObject{}, predicate.ManagedNamespacePredicate); err != nil {
		return err
	}

	// Watch DaemonSets
	if err := c.Watch(&source.Kind{Type: &appsv1.DaemonSet{}}, &handler.EnqueueRequestForOwner{
		IsController: true,
		OwnerType:    &beatv1beta1.Beat{},
	}, predicate.ManagedNamespacePredicate); err != nil {
		return err
	}

	// Watch Deployments
	if err := c.Watch(&source.Kind{Type: &appsv1.Deployment{}}, &handler.EnqueueRequestForOwner{
		IsController: true,
		OwnerType:    &beatv1beta1.Beat{},
	}, predicate.ManagedNamespacePredicate); err != nil {
		return err
	}

	// Watch Pods, to ensure `status.version` is correctly reconciled on any change.
	// Watching Deployments or DaemonSets only may lead to missing some events.
	if err := watches.WatchPods(c, beatcommon.NameLabelName, predicate.ManagedNamespacePredicate); err != nil {
		return err
	}

	// Watch owned and soft-owned Secrets
	if err := c.Watch(&source.Kind{Type: &corev1.Secret{}}, &handler.EnqueueRequestForOwner{
		IsController: true,
		OwnerType:    &beatv1beta1.Beat{},
	}, predicate.ManagedNamespacePredicate); err != nil {
		return err
	}
	if err := watches.WatchSoftOwnedSecrets(c, beatv1beta1.Kind, predicate.ManagedNamespacePredicate); err != nil {
		return err
	}

	// Watch dynamically referenced Secrets
	return c.Watch(&source.Kind{Type: &corev1.Secret{}}, r.dynamicWatches.Secrets, predicate.ManagedNamespacePredicate)
}

@naemono
Copy link
Contributor Author

naemono commented Dec 1, 2021

@barkbay Yeah, those suggested changes would make things cleaner. Thanks for the suggestion. I'll get those changes taken care of.

@naemono naemono force-pushed the 4168-ignore-unmanaged-namespace-events branch from 777a727 to 8c32098 Compare December 1, 2021 17:54
@naemono
Copy link
Contributor Author

naemono commented Dec 1, 2021

I'm investigating why integration tests are failing in github.com/elastic/cloud-on-k8s/pkg/controller/license, otherwise it's ready for 👀 again.

@naemono
Copy link
Contributor Author

naemono commented Dec 2, 2021

I'm investigating why integration tests are failing in github.com/elastic/cloud-on-k8s/pkg/controller/license, otherwise it's ready for 👀 again.

Resolved.

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While doing a new review I realized that the predicate is missing in the trial controller.

This made me think that we should maybe try another approach, in which we don't have to pass the predicate as a parameter in all the Watch(...) functions. I remembered that most of the controllers are created using a custom function, common.NewController. For example in the case of Elasticsearch:

// Add creates a new Elasticsearch Controller and adds it to the Manager with default RBAC. The Manager will set fields
// on the Controller and Start it when the Manager is Started.
// this is also called by cmd/main.go
func Add(mgr manager.Manager, params operator.Parameters) error {
	reconciler := newReconciler(mgr, params)
	c, err := common.NewController(mgr, name, reconciler, params)
	if err != nil {
		return err
	}
	return addWatches(c, reconciler)
}

(There is the exception of the webhook certificate controller , but maybe we can also use common.NewController(...) here ?)

I think an alternative would be to inject the predicate in a new implementation of the controller.Controller interface, something along those lines in pkg/controller/common/controller.go:

// NewController creates a new controller with the given name, reconciler and parameters and registers it with the manager.
func NewController(mgr manager.Manager, name string, r reconcile.Reconciler, p operator.Parameters) (controller.Controller, error) {
	c, err := controller.New(name, mgr, controller.Options{Reconciler: r, MaxConcurrentReconciles: p.MaxConcurrentReconciles})
	if err != nil {
		return nil, err
	}
	return newNamespaceAwareWatchersController(c, p.ManagedNamespaces, p.OperatorNamespace), nil
}

var _ controller.Controller = &namespaceAwareController{}

// namespaceAwareController implements the controller.Controller interface and automatically include a predicate to filter events
// which are not in a managed namespace.
type namespaceAwareController struct {
	controller.Controller
	namespacePredicate predicate.Predicate
}

func newNamespaceAwareWatchersController(c controller.Controller, managedNamespaces []string, operatorNamespace string) controller.Controller {
	watchedNamespaces := managedNamespaces
	if !slices.Contains(managedNamespaces, operatorNamespace) {
		watchedNamespaces = append(watchedNamespaces, operatorNamespace)
	}
	return &namespaceAwareController{
		Controller: c,
		namespacePredicate: predicate.Funcs{
			CreateFunc: func(e event.CreateEvent) bool {
				return IsNamespaceManaged(e.Object.GetNamespace(), managedNamespaces)
			},
			UpdateFunc: func(e event.UpdateEvent) bool {
				return IsNamespaceManaged(e.ObjectNew.GetNamespace(), managedNamespaces)
			},
			DeleteFunc: func(e event.DeleteEvent) bool {
				return IsNamespaceManaged(e.Object.GetNamespace(), managedNamespaces)
			},
			GenericFunc: func(e event.GenericEvent) bool {
				return IsNamespaceManaged(e.Object.GetNamespace(), managedNamespaces)
			},
		},
	}
}

func (n *namespaceAwareController) Watch(src source.Source, eventhandler handler.EventHandler, predicates ...predicate.Predicate) error {
	return n.Controller.Watch(src, eventhandler, append(predicates, n.namespacePredicate)...)
}

Sorry for all those back and forth, I think that if this approach works it'll make this PR a bit smaller and reduce the chance of a missing a controller.

cmd/manager/main.go Show resolved Hide resolved
@@ -466,6 +467,9 @@ func startOperator(ctx context.Context) error {

// configure the manager cache based on the number of managed namespaces
managedNamespaces := viper.GetStringSlice(operator.NamespacesFlag)
// initialize the managed namespace predicate to ignore events outside of the namespaces the operator is concerned with
predicates.ManagedNamespacePredicate = predicates.NewManagedNamespacesPredicate(managedNamespaces)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also watch resources in the operatorNamespace, to reconcile on license events for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @barkbay for all of the suggestions in this PR. Don't worry about the back and forth at all. The last suggestion significantly cleaned up the changes. 👍 As for this comment, the operator namespace is considered in the common controller newNamespaceAwareWatchersController func now, and this code is simply gone.

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After doing some additional tests it turns out that we also need to update telemetry.go. First error is raised when reading Kibana secrets here:

if err := r.client.Get(context.Background(), nsName, &secret); err != nil {
log.Error(err, "failed to get Kibana secret")
continue
}

{
	"log.level": "error",
	"@timestamp": "2021-12-09T08:49:49.727Z",
	"log.logger": "usage",
	"message": "failed to get Kibana secret",
	"service.version": "2.0.0-SNAPSHOT+a25633ec",
	"service.type": "eck",
	"ecs.version": "1.4.0",
	"error": "unable to get: namespace-a/kb-apm-sample-kb-config because of unknown namespace for the cache",
	"error.stack_trace": "github.com/elastic/cloud-on-k8s/cmd/manager.asyncTasks.func2\n\t/go/src/github.com/elastic/cloud-on-k8s/cmd/manager/main.go:655"
}

An option might be to use once again the predicate as we do for GarbageCollectAllSoftOwnedOrphanSecrets. However I fear we'll have the same problem in all the getStatsFn functions (esStats, kbStats...). I'm then wondering if a better solution would not be to reuse the original flag, as provided by the user, when the reporter is created here:

	if !disableTelemetry {
		// Start the telemetry reporter
		go func() {
			tr := telemetry.NewReporter(operatorInfo, mgr.GetClient(), operatorNamespace, viper.GetStringSlice(operator.NamespacesFlag), telemetryInterval)
			tr.Start()
		}()
	}

Overall I think we have a problem with the managedNamespaces, it's confusing to have this empty string. Maybe we should just make it immutable:

		// Add the empty namespace to allow watching cluster-scoped resources if storage class validation is enabled.
		if viper.GetBool(operator.ValidateStorageClassFlag) {
			opts.NewCache = cache.MultiNamespacedCacheBuilder(append(managedNamespaces, ""))
		} else {
			opts.NewCache = cache.MultiNamespacedCacheBuilder(managedNamespaces)
		}

@naemono
Copy link
Contributor Author

naemono commented Dec 9, 2021

@barkbay may I ask how you generated this error?

GarbageCollectAllSoftOwnedOrphanSecrets has this potential issue, as it's doing a non-namespace scoped list operation, and then iterating over the list, and running a .Get, which is where the error originates:

if err := c.List(context.Background(),
&secrets,
client.HasLabels{SoftOwnerNamespaceLabel, SoftOwnerNameLabel, SoftOwnerKindLabel},
); err != nil {

but telemetry is doing a namespace scoped list operation, and then iterating over it, which should only get Kibanas in it's set of managed namespaces, right?

if err := r.client.List(context.Background(), &kibanaList, client.InNamespace(ns)); err != nil {

reuse the original flag

To which flag are you referring? Surely not disableTelemetry, as this code path isn't even traversed when telemetry is disabled. As for the managedNamespaces/operator.NamespacesFlag, that is passed along to the NewReporter func.

Overall I think we have a problem with the managedNamespaces, it's confusing to have this empty string

I agree that it's confusing, but isn't that what the apimachinery uses to designate all namespaces:

https://github.com/kubernetes/apimachinery/blob/15885e270e2cefaaff9aa2b7b05eb004192bb5db/pkg/apis/meta/v1/types.go#L294

	// NamespaceAll is the default argument to specify on a context when you want to list or filter resources across all namespaces
	NamespaceAll = ""

Do you see a path to making this less confusing in this code?

@barkbay
Copy link
Contributor

barkbay commented Dec 13, 2021

@barkbay may I ask how you generated this error?

I have a "restricted" operator running with --telemetry-interval=5s (default interval is 1 hour which might explain why it is easy to miss it)

but telemetry is doing a namespace scoped list operation, and then iterating over it, which should only get Kibanas in it's set of managed namespaces, right?

No because r.managedNamespaces may contain the empty string "". The operator first iterates over the "real" managed namespaces, but eventually it will try to list Kibana instances across all of them:

image

To which flag are you referring? [...] As for the managedNamespaces/operator.NamespacesFlag, that is passed along to the NewReporter func.

Yes, I was referring to viper.GetStringSlice(operator.NamespacesFlag). What is passed to NewReporter is actually append(viper.GetStringSlice(operator.NamespacesFlag), ""): it is the original flag and the empty string "". It is not the original value of viper.GetStringSlice(operator.NamespacesFlag), and this is the root of this issue.

I agree that it's confusing, but isn't that what the apimachinery uses to designate all namespaces:

I don't think we want to watch all the namespaces. We want to watch StorageClass resources, which are cluster-scoped resources. Unfortunately it seems that the only way to watch cluster-scoped resources is to add "" to the managed namespace. Watching all the namespaces is an unfortunate side-effect.

Do you see a path to making this less confusing in this code?

I was wondering if adding NamespaceAll only when the manager is created would help:

// Add the empty namespace to allow watching cluster-scoped resources if storage class validation is enabled.
if viper.GetBool(operator.ValidateStorageClassFlag) {
   opts.NewCache = cache.MultiNamespacedCacheBuilder(append(managedNamespaces, ""))
} else {
   opts.NewCache = cache.MultiNamespacedCacheBuilder(managedNamespaces)
}

@naemono
Copy link
Contributor Author

naemono commented Dec 13, 2021

My apologies. I didn't realize all your initial comments were coming from the fact that we do this:

		if viper.GetBool(operator.ValidateStorageClassFlag) {
			managedNamespaces = append(managedNamespaces, "")
		}

now it all makes sense.

I completely agree that it's quite confusing to have the "all namespaces" namespace in the list of managed namespaces. I'll get these changes in.

naemono and others added 18 commits December 13, 2021 12:03
Co-authored-by: Michael Morello <michael.morello@gmail.com>
always add managedNamespacePredicate in addWatches in association package
Adjust Webhooks controller to use the common path for creating a controller.
…ed atm.

remove WithPredicates as it's not used anymore either.
remove predicates in remoteca controller create as it's using common controller, so unneeded.
… of controller options on create.

Use NewControllerWithOptions in webhook controller to avoid the default setting of MaxConcurrentReconciles.
@naemono naemono force-pushed the 4168-ignore-unmanaged-namespace-events branch from 7699ec5 to 9f83fff Compare December 13, 2021 19:40
@naemono
Copy link
Contributor Author

naemono commented Dec 13, 2021

Fixed the integration test, which caught an issue where we weren't managing the empty slice of managed namespaces properly when initializing the controller.

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spotted another place where some resources are listed and processed without predicates/options:

// list all clusters
err := c.List(context.Background(), &clusters)
if err != nil {
return nil, err
}

We might have the same issue in the remote ca controller:

var list esv1.ElasticsearchList
if err := c.List(context.Background(), &list, &client.ListOptions{}); err != nil {
return nil, err
}
(not tested, to be confirmed)

I'm afraid fixing all these errors / events and maintaining the code is tricky 😕

func newNamespaceAwareWatchersController(c controller.Controller, managedNamespaces []string, operatorNamespace string) controller.Controller {
watchedNamespaces := managedNamespaces
// if the length of watchedNamespaces is 0, then we're watching all namespaces, and shouldn't append anything to the slice, as
// it will just cause issues wth the managed namespaces predicate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// it will just cause issues wth the managed namespaces predicate.
// it will just cause issues with the managed namespaces predicate.

// if the length of watchedNamespaces is 0, then we're watching all namespaces, and shouldn't append anything to the slice, as
// it will just cause issues wth the managed namespaces predicate.
if len(watchedNamespaces) > 0 && !slices.Contains(watchedNamespaces, operatorNamespace) {
watchedNamespaces = append(watchedNamespaces, operatorNamespace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really required ? The operatorNamespace is supposed to be already added to the managedNamespaces: https://github.com/elastic/cloud-on-k8s/pull/5058/files#diff-435eecb1d2af40ee96747f88ab71eb2758da989c92f76dcb1f714fc1b300c633R479-R480

@barkbay
Copy link
Contributor

barkbay commented Dec 15, 2021

While having a look at the multicache implementation I noticed that cluster scoped resources are now handled:

	// If the object is clusterscoped, get the informer from clusterCache,
	// if not use the namespaced caches.
	isNamespaced, err := objectutil.IsAPINamespaced(obj, c.Scheme, c.RESTMapper)
	if err != nil {
		return nil, err
	}
	if !isNamespaced {
		clusterCacheInf, err := c.clusterCache.GetInformer(ctx, obj)
		if err != nil {
			return nil, err
		}
		informers[globalCache] = clusterCacheInf

		return &multiNamespaceInformer{namespaceToInformer: informers}, nil
	}

(see this PR)

I think we can just stop adding the empty namespace: we should still be able to read StorageClasses and it should solve the issue without the need of a predicate.

@naemono
Copy link
Contributor Author

naemono commented Dec 20, 2021

@barkbay great catch. I'm testing to see if this simple fix is sufficient now in a separate branch. Will report back.

@naemono
Copy link
Contributor Author

naemono commented Dec 20, 2021

closing this per #5187

@naemono naemono closed this Dec 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement Enhancement of existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unable to get: x/y because of unknown namespace for the cache
2 participants