
[ansible] be able to define watch[Dependent,ClusterScoped]Resources as CR annotations #3364

Closed
jmazzitelli opened this issue Jul 8, 2020 · 27 comments

Labels: kind/feature (Categorizes issue or PR as related to a new feature.), language/ansible (Issue is related to an Ansible operator project), lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)
Milestone: Backlog

@jmazzitelli

Today, a user can modify the reconcile period by annotating the CR with ansible.operator-sdk/reconcile-period: "120s". The default is defined in the watches.yaml setting called reconcilePeriod.
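For context, here is what that existing override looks like on a CR (the kind, name, and spec are just illustrative placeholders; only the annotation itself comes from the feature described above):

apiVersion: "cache.example.com/v1alpha1"
kind: "Memcached"
metadata:
  name: "example-memcached"
  annotations:
    ansible.operator-sdk/reconcile-period: "120s"
spec:
  size: 3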

There is no analogous CR annotation to disable/enable dependent resource watching. For example, in watches.yaml there are settings called watchDependentResources and watchClusterScopedResources. This enhancement request is to create annotations that can be put on the CR to set those values in the same way the reconcile-period annotation can override the watches.yaml's reconcilePeriod setting.

From @fabianvf on slack:

I think it would definitely be doable, it's just the sort of thing we'd
want to be careful about adding as it allows the user to fundamentally
alter the behavior of the operator in a way that the operator author
may not have intended. Not sure what security options we have but
if we allow per-cr customization to that level we might need to also
add some controls so that authors can disable the behaviors
@camilamacedo86 camilamacedo86 self-assigned this Jul 9, 2020
@camilamacedo86 camilamacedo86 added language/ansible Issue is related to an Ansible operator project triage/support Indicates an issue that is a support question. labels Jul 9, 2020
@camilamacedo86 camilamacedo86 reopened this Jul 9, 2020
@camilamacedo86 camilamacedo86 removed their assignment Jul 9, 2020
@camilamacedo86 camilamacedo86 added kind/feature Categorizes issue or PR as related to a new feature. and removed triage/support Indicates an issue that is a support question. labels Jul 9, 2020
@jmazzitelli
Author

@camilamacedo86 Let me explain how I think things work today (in case I misunderstand something myself), and perhaps make this enhancement request clearer. In short, it does not involve the reconcile period at all.

I understand the watches file and the CR annotation can define a reconcile period of X seconds - this allows the playbook to be blindly run every X seconds. So, if reconcilePeriod is set to "60s" that means every 60 seconds the playbook is executed.

That is NOT what I am concerned about.

Now, I also know the watches file has two other settings called watchDependentResources and watchClusterScopedResources. These are booleans (true or false) - they are NOT reconcile periods or time durations.

As I understand it, when those settings are true, they will create k8s watches on all resources that the operator creates (cluster scoped and namespace scoped). When those watches are triggered (say, by a user deleting or modifying one of those operator-created resources), the playbook is immediately executed. There is no reconcile period consulted - there is no time delay or "every X seconds" - involved here. Whenever a dependent resource (i.e. an operator-created resource) is touched, the playbook is immediately invoked.
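To make those two settings concrete, a watches.yaml entry using them might look like this (the GVK and role path are just placeholders):

- version: v1alpha1
  group: cache.example.com
  kind: Memcached
  role: /opt/ansible/roles/memcached
  watchDependentResources: true
  watchClusterScopedResources: true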

That behavior can only be enabled or disabled by the operator author via the watches.yaml settings. However, I am asking if it is possible to allow the CR author to enable or disable that functionality via some new annotations, say, something like:

annotations:
  ansible.operator-sdk/watch-dependent-resources: "true"
  ansible.operator-sdk/watch-cluster-scoped-resources: "true"

I understand @fabianvf's concern that this might turn on or off capabilities that the operator author does not intend. However, in my use case, it would be beneficial to allow a user to enable this capability.

@camilamacedo86
Contributor

camilamacedo86 commented Jul 10, 2020

Hi @jmazzitelli,

So, following your suggestion, we would need to implement the watch so that it ignores any resource whose CR has, for example, the annotation ansible.operator-sdk/watch-dependent-resources: "false". Am I right?

The watch for the dependent resources is implemented here, and it would then need to be extended with filters for the labels as described in kubernetes-sigs/controller-runtime#244.

Also, note that you can already disable the watch for specific GVKs by using the blacklist feature, which I understand is not exactly what you are looking for, but it may help for now.

Example:

# ConfigMaps owned by a Memcached CR will not be watched or cached.
- version: v1alpha1
  group: cache.example.com
  kind: Memcached
  role: /opt/ansible/roles/memcached
  blacklist:
    - group: ""
      version: v1
      kind: ConfigMap

@jmazzitelli
Author

jmazzitelli commented Jul 10, 2020

I do remember the blacklist feature now that you mention it, but I don't think you can specify a blacklist in the CR (i.e. it can only be declared by the operator author in the watches.yaml) - see https://sdk.operatorframework.io/docs/ansible/reference/watches/

So, following your suggestion, we would need to implement the watch so that it ignores any resource whose CR has, for example, the annotation ansible.operator-sdk/watch-dependent-resources: "false". Am I right?

Right. That's what I am proposing. So, following your example there, suppose I have a Memcached CR that looks like this:

apiVersion: "cache.example.com/v1alpha1"
kind: "Memcached"
metadata:
  name: "example-memcached"
  annotations:
    ansible.operator-sdk/watch-dependent-resources: "false"
spec:
  size: 3

This would mean the operator will not watch any dependent resources, EVEN IF the Memcached operator has watchDependentResources: true in its watches.yaml.

Conversely, if I annotate the Memcached CR with this instead:

    ansible.operator-sdk/watch-dependent-resources: "true"

Then I'm telling the Memcached Operator that I want it to watch all dependent resources and re-run its playbook when a dependent resource is touched, EVEN IF the Memcached operator has watchDependentResources: false in its watches.yaml.

Now, this gets to the issue that @fabianvf raises. What if an operator is written with the explicit requirement that watchDependentResources must be true (i.e. the operator needs those watches and must be told when a dependent resource is touched)? Or vice versa - perhaps the operator explicitly does NOT want to be told when dependent resources are touched and so needs watchDependentResources to be false. In those cases, we should not allow the user to overwrite the watch flag in the CR via that annotation. So, perhaps we would need to add a flag in watches.yaml that the operator author can use to indicate "I do NOT want the user to be able to overwrite the watch-dependent-resources flag". Something like this in watches.yaml:

watchDependentResources: true
watchDependentResourcesOverrideAllowed: false

When the author sets watchDependentResourcesOverrideAllowed to false then the operator SDK should emit an error if a user puts that annotation in the CR and its value is different than the value in the watches.yaml - i.e. the user should not be allowed to change the value that the author wants.

Therefore, I propose the following:

Add two new settings in watches.yaml:

  1. watchDependentResourcesOverrideAllowed (whose value is a boolean)
  2. watchClusterScopedResourcesOverrideAllowed (whose value is a boolean)

Add two new annotations that can be added to a CR:

annotations:
  ansible.operator-sdk/watch-dependent-resources: "<boolean>"
  ansible.operator-sdk/watch-cluster-scoped-resources: "<boolean>"

If the annotation ansible.operator-sdk/watch-dependent-resources is true, the operator SDK should add watches to dependent resources, regardless of the value of watchDependentResources in watches.yaml. If the annotation is false, the operator SDK should NOT add watches to dependent resources, regardless of the value in watches.yaml. Analogous behavior should occur for ansible.operator-sdk/watch-cluster-scoped-resources and watchClusterScopedResources.

But if watches.yaml sets watchDependentResourcesOverrideAllowed to false and the user added the annotation ansible.operator-sdk/watch-dependent-resources: <bool> then the annotation value must be the same as the value of watchDependentResources in the watches.yaml or an error should result. Same analogous behavior should occur with watchClusterScopedResourcesOverrideAllowed and ansible.operator-sdk/watch-cluster-scoped-resources.
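To put the whole proposal in one place, here is a purely hypothetical example (neither the two new watches.yaml settings nor the two annotations exist today; the GVK is just the Memcached placeholder). The proposed watches.yaml entry:

- version: v1alpha1
  group: cache.example.com
  kind: Memcached
  role: /opt/ansible/roles/memcached
  watchDependentResources: false
  watchDependentResourcesOverrideAllowed: true
  watchClusterScopedResources: false
  watchClusterScopedResourcesOverrideAllowed: false

And a CR that uses the proposed annotations:

apiVersion: "cache.example.com/v1alpha1"
kind: "Memcached"
metadata:
  name: "example-memcached"
  annotations:
    # allowed: overrides watchDependentResources from watches.yaml
    ansible.operator-sdk/watch-dependent-resources: "true"
    # this one would be rejected with an error, because the author disallowed overriding it
    # ansible.operator-sdk/watch-cluster-scoped-resources: "true"
spec:
  size: 3

With that configuration, the operator would watch dependent resources for this CR even though watches.yaml defaults to false, while any attempt to flip the cluster-scoped setting via the annotation would produce an error.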

Hopefully that makes sense. From the brief discussion I had on this with @fabianvf, he seemed to indicate this kind of thing was possible, with the caveat that it might cause problems if the operator author expects a certain watch-dependent-resource behavior - but perhaps my suggestion for the two "OverrideAllowed" settings in watches.yaml alleviates that concern.

@camilamacedo86
Contributor

camilamacedo86 commented Jul 10, 2020

Hi @jmazzitelli,

I get @fabianvf's point, and this would introduce complexity not just in the implementation (to keep it maintainable and working as you described) but also for users who need to understand how to work with it. So, what are the problems and use cases that would be solved by, or benefit from, these features? Could you describe the scenarios that would justify introducing this?

@jmazzitelli
Author

Could you describe what are the scenarios that would justify this introduction?

Yes. In my particular case - the Kiali Operator - the operator works perfectly fine without any periodic reconciliation (reconcilePeriod=0) and without any dependent-resource watching (watchDependentResources=false, watchClusterScopedResources=false). In other words, the Kiali Operator does not require those features in order to function properly. It sits dormant for the vast majority of the time (taking up very little cluster resources) until someone creates or modifies a Kiali CR. I like that the Kiali Operator doesn't have to consume resources when it doesn't need to.

However, I had a user that wanted the operator to run a reconciliation when, say, someone accidentally/purposefully deleted the Kiali Deployment - he wanted the Deployment to get resurrected by the operator automatically. A perfect reason to turn on either periodic reconciliation or dependent-resource-watching. Because of this feature request, I wrote up this issue here.

But I noticed that when I turned on periodic reconciliation in watches.yaml (e.g. "reconcilePeriod=60s"), obviously my operator's playbook is run EVERY 60 seconds, even when it doesn't need to (and remember, Kiali Operator works fine without periodic reconciliation). So my operator immediately went from using almost no cluster resources (e.g. CPU/memory) the vast majority of time to needing resources every minute; and this is bad especially considering that each run of the Kiali Operator's playbook could take 30 seconds or even a minute to complete (thus the Kiali Operator would effectively be continually running its playbook non-stop). So I did not want to enable that by default. I could tell the user to turn on periodic reconciliation via the CR annotation, but that would mean that user would be seeing the Kiali Operator take up those resources every minute and I didn't think that would be good. The alternative would be to change it to something like 5 minutes, but.. meh...

OK, so then I said to myself, "why not try this dependent-resource-watch feature?" So I went back and set reconcilePeriod back to "0" but enabled watchDependentResources=true and watchClusterScopedResources=true (note: my operator does create and manage cluster-scoped resources). And that worked better. Now the operator stays dormant until either the CR is created or modified OR one of the resources that operator creates is touched. That's exactly what the user wanted.

However, there is one thing that bugs me about how that feature works. I noticed that every time a new CR is created (i.e. a Kiali CR is created to indicate that the operator needs to install an operand - the Kiali Server) my playbook is invoked twice. I talked to @fabianvf about this to see if I can find out what event causes this second invocation with the hope of being able to quickly abort the playbook run if I can determine it is an unimportant event that doesn't require the playbook to re-run, but he told me the SDK does not even log what event or dependent resource triggered the playbook invocation. So I cannot tell you exactly why this immediate second invocation happens, but it happens, and the operator has no way of determining if it should circumvent the full playbook reconciliation run. And notice that my playbook takes about 30 to 60 seconds to fully complete (even when it changes nothing). So it is annoying to effectively have the Kiali Operator playbook run twice when a Kiali CR is created, rather than just the one time that is necessary in order to install the operand at CR creation time.

So the solution to this would be: I'll turn off dependent-resource watching by default. If a user WANTS the capability to reconcile changes to dependent resources (with the negative side effect of having the playbook triggered multiple times, thus the operator taking more time and resources before going "dormant" again), the user merely has to annotate the CR with the annotations we are talking about: ansible.operator-sdk/watch-dependent-resources and ansible.operator-sdk/watch-cluster-scoped-resources.

Thus the best of both worlds - the operator can be configured, by default, to use very little cluster resources and run only when absolutely necessary (at CR creation or modification time), but the user can configure the operator to reconcile dependent resources if they want that additional capability (at the expense of having the operator use more resources). This trade-off is why I would like the flexibility of letting the user turn the dependent-resource-watching feature on and off.

@camilamacedo86
Contributor

camilamacedo86 commented Jul 10, 2020

Hi @jmazzitelli,

So, IMO we might need to address this via docs instead, i.e. make clear to users that:

  • we could make a statement for the reconcilePeriod similar to the sync period attribute of manager.Options:
// A lower period will correct entropy more quickly, but reduce
// responsiveness to change if there are many watched resources. Change this
// value only if you know what you are doing.
  • they can create many CRDs and use the APIs following concepts such as single responsibility, cohesion and encapsulation
  • and then, by following these concepts, they are able to reuse more of their code, improve maintainability, take more advantage of the Ansible operator features, and have more flexibility for granular configuration, since they can define watchDependentResources, watchClusterScopedResources, blacklist and the reconciliation period per GVK (role/playbook) in the watches file.

Example:


- version: v1alpha1
  group: app.example.com
  kind: AppService
  playbook: playbook.yml
  maxRunnerArtifacts: 30
  reconcilePeriod: 60s
  manageStatus: False
  watchDependentResources: True

- version: v1alpha1
  group: app.example.com
  kind: Database
  playbook: playbook.yml

In this way, users probably would not end up in this scenario needing these features. However, if users still require this kind of customization, it can be addressed with hybrid operators. See: #670

@fabianvf wdyt?

@jmazzitelli
Author

they are able to reuse more of their code, improve maintainability, take more advantage of the Ansible operator features, and have more flexibility for granular configuration, since they can define watchDependentResources, watchClusterScopedResources, blacklist and the reconciliation period per GVK (role/playbook) in the watches file.
...
if users still require this kind of customization, it can be addressed with hybrid operators.

Is the suggestion that users of Kiali build their own operator (extending the existing Kiali Operator) if they want to optionally perform the dependent resource watch? I certainly do not want to ask my users to build an operator. Just getting them to deploy and use an operator is a difficult task (do you know how many times people have asked me to just provide a helm chart for my operand? People are not enthusiastic when asked to just USE an operator - I shudder to think what the reaction would be if we told them they have to build their own just to get an operand to install).

My users only care about "I want to install Kiali." Today, they have to install the Kiali Operator and create a Kiali CR in a watch namespace to see the operand get installed somewhere. Most of my users (I contend, the majority of them) don't even know what an operator is, let alone how to build one.

I think it is a non-starter if a solution includes asking operand users to build their own operators.

That suggestion in the previous comment's example:

- version: v1alpha1
  group: app.example.com
  kind: AppService
  playbook: playbook.yml
  maxRunnerArtifacts: 30
  reconcilePeriod: 60s
  manageStatus: False
  watchDependentResources: True

- version: v1alpha1
  group: app.example.com
  kind: Database
  playbook: playbook.yml

requires the user to actually build an operator. Of course, for Red Hat customers, this is a non-starter because their operator will not show up in the OperatorHub as a "Red Hat Provided" operator and obviously will not be supported as one.

So for these reasons I really think this is not a viable option.

@camilamacedo86
Contributor

camilamacedo86 commented Jul 10, 2020

Hi @jmazzitelli,

Just to clarify: when I say users, I mean SDK users, not users of some project built with this tool. Ansible Operator SDK developers can build their projects following good practices. In this way, IMO the problem described, which motivated you to raise the RFE, could be solved/addressed by improving the docs to make some recommendations clearer.

Otherwise, it can also be addressed with hybrid operators. See: #670

@jmazzitelli
Author

In this way, IMO the problem described, which motivated you to raise the RFE, could be solved/addressed by improving the docs to make some recommendations clearer.

So would the recommendation be something like: set reconcilePeriod to some large value (say, 15 minutes) but also turn on dependent-resource watching? I actually don't think I need both - I just need one or the other. But each has pros and cons - and I would prefer to give the user the option/authority to pick which one they want, rather than me, as the operator author, imposing that choice on the user (e.g. I don't want to force the user to use extra cluster resources if that is something the user doesn't want to compromise on).
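For reference, the recommendation being asked about would presumably boil down to a single watches.yaml entry like this (placeholder GVK; the values are just the ones floated above):

- version: v1alpha1
  group: cache.example.com
  kind: Memcached
  role: /opt/ansible/roles/memcached
  reconcilePeriod: 15m
  watchDependentResources: true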

@joelanford
Member

Doing a drive-by here (@fabianvf shoot this down if this isn't what you have in mind)

This makes sense to me, but the implementation is going to look a little different from what the UX makes it seem like is happening.

If a watch is set up for a particular resource type, events will come in regardless of whether the CR has this new annotation. So when an event comes in, we would look up the owner, check its annotation, and only then allow the event to trigger a reconciliation.

The naive implementation would require that watchDependentResources is true and then allow the CR annotation to set it to false.

Doing it the other way around (default to NOT watching dependent resources, but allow it if a CR annotation is set to true), would mean still actually watching all the dependent resources, but only triggering reconciliation if the annotation is true.

@jmazzitelli
Author

@joelanford ah.. yeah, I didn't think about the implementation of my suggestion. Now, considering I do not know anything about the Ansible operator SDK implementation details, what I say here may be naive, but -- couldn't the SDK wait to decide if it needs to watch the dependent resources until after the CR is initially created? Because (a) the resources aren't created until the CR is created, and (b) at that point, the CR annotations are also known.

So, the operator could have watchDependentResources set to false in watches.yaml - thus the SDK defaults to not watching. But then when the CR is created (with the annotation turning watch on) - at that point, the SDK sees the resources need to be watched - and it is at that same time the SDK triggers the playbook (so it can set a flag of some sort to say, "anything this playbook creates needs to be watched").

Ignore me if this makes no sense because, as I say, I'm oblivious to the SDK Go implementation code :)

@joelanford
Member

I think that would be problematic. Consider this scenario:

  • We have a primary CR that reconciles a Secret
  • watchDependentResources: false is set in watches.yaml
  • CR1 comes in with annotations["watchDependentResources"]: true, so we start watches for Secrets owned by CR instances.
  • CR2 comes in with no annotation set. It is reconciled and its child Secret is created.
  • CR2's child Secret is manually changed.
  • Due to the watch setup from when CR1 was reconciled, an event comes in for CR2's Secret that triggers CR2 to be reconciled. This would be unexpected since the global watches.yaml value is set to false and it was not overridden in CR2's annotations.
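For concreteness, the two CRs in that scenario might look something like this (purely illustrative; using the annotation name proposed above and a placeholder GVK):

apiVersion: "cache.example.com/v1alpha1"
kind: "Memcached"
metadata:
  name: "cr1"
  annotations:
    # opts in, so a watch on owned Secrets gets started
    ansible.operator-sdk/watch-dependent-resources: "true"
spec:
  size: 3
---
apiVersion: "cache.example.com/v1alpha1"
kind: "Memcached"
metadata:
  # no annotation; per watches.yaml (false), its Secret should not trigger reconciles
  name: "cr2"
spec:
  size: 3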

@jmazzitelli
Author

jmazzitelli commented Jul 10, 2020

  • Due to the watch setup from when CR1 was reconciled, an event comes in for CR2's Secret that triggers CR2 to be reconciled. This would be unexpected since the global watches.yaml value is set to false and it was not overridden in CR2's annotations.

Yes, I see your point. Though, couldn't the SDK check, "there is no annotation on the CR that owns the Secret (CR2), and the watches.yaml value is "false" so even though I got an event for this CR2 Secret resource, the owning resource (CR2) desired behavior (based on no annotation and global watches.yaml) dictates that the playbook should not be triggered right now"

@joelanford
Member

It could. I'm just pointing out that we need to be careful with the implementation to handle these cases (and write unit tests to verify them!)

@camilamacedo86
Contributor

My suggestion to address the scenario which motivates the RFE is #3384. IMHO this kind of customization would be more appropriately done, for specific cases, by the owning projects when/if they find it required, via hybrid operators, instead of being provided by default.

@jmazzitelli
Author

jmazzitelli commented Jul 10, 2020

@camilamacedo86 I finally understand what you are suggesting to do. You are saying, "Rather than have the user add annotations to the CR, have the user create a different Kind of CR to indicate what behavior they want".

So in the Memcached example, I could create this:

apiVersion: "cache.example.com/v1alpha1"
kind: "Memcached"
metadata:
  name: "example-memcached"
spec:
  size: 3

Or this:

apiVersion: "cache.example.com/v1alpha1"
kind: "MemcachedWatchDependentResources"
metadata:
  name: "example-memcached"
spec:
  size: 3

and the operator would watch for the different CRDs (one to indicate dependent resource watching is on, one with it off).

I'm meh on that because (a) it adds more CRDs that my operator has to be associated with (what happens when I need to bump up my CRD versioning - I now have 2 to bump and manage). I just don't like maintaining more CRDs than I have to. And (b) to me it is confusing/misleading to have different CRDs just to be able to alter this watch behavior (to me, a different CRD would indicate a different operand - not just a small change in operator implementation behavior - that seems the classic purpose for annotations).

Also, what if the operator utilizes the built-in variable "_$GROUP_$KIND"? Now the operator has to be refactored so it can handle the different CRD kinds (e.g. _cache_example_com_memcached and _cache_example_com_memcachedwatchdependentresources) even though it really shouldn't matter (at least in my case) - the playbook functionality is the same. It almost seems like a hack to do it with multiple CRDs.

It just seems to me that triggering the reconciliation (i.e. the calling of the playbook) on changes to a dependent resource is very analogous to the "reconcile-period" annotation - it makes sense to use the same mechanism (annotations) to trigger the playbook when a dependent resource is touched, just as the expiry of some arbitrary time period does the same thing.

@camilamacedo86
Contributor

camilamacedo86 commented Jul 10, 2020

Hi @jmazzitelli,

Just to clarify my POV on "And (b) to me it is confusing/misleading to have different CRDs just to be able to alter this watch behavior":

The suggestion is NOT only about letting developers have this watch behaviour. The suggestion is for developers to follow good practices.

Note that creating/defining a CR/CRD means a new resource/class (representation) with its attributes. So, if we use the same class to define all the different sorts of resources, we may be hurting concepts such as encapsulation, the single responsibility principle, and almost certainly cohesion, which can cause unexpected side effects such as difficulty to read, maintain/improve and extend, just to mention a few. To clarify my POV, it is the same as defining a User class with attributes related to other objects that do not respect its domain; for example, adding to a Client Kind attributes that belong to a Shoe Kind's domain. I'd particularly recommend you follow the DDD concepts to design the solution.

@kensipe kensipe added this to the Backlog milestone Jul 13, 2020
camilamacedo86 added a commit that referenced this issue Jul 15, 2020
**Description of the change:**
- add info over reconcilePeriod similar to the sync period attribute of [manager.Options](https://godoc.org/sigs.k8s.io/controller-runtime/pkg/manager#Options) to make clear that users should avoid small periods
- clarifies that users are able to create many CRDs and use the APIs following concepts such as single responsibility, cohesion and encapsulation, which brings more flexibility and optimization since they are able to do configurations per GVK via the watches file.

**Motivation for the change:**

#3364
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2020
@jmazzitelli
Author

I still would like this feature. However, I hacked around it myself by creating my own setting in the CR to be able to turn on and off reconciliation manually if the user sees the operator going into infinite reconciliation loops (typically this happens due to bugs in the operator). Yes, it's an ugly hack :}

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 26, 2020
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
