Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] Does Operator-SDK have some sort of "don't reconcile this" concept? #3418

Closed
samuel-hawker opened this issue Jul 14, 2020 · 5 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs discussion
Milestone

Comments

@samuel-hawker
Copy link

Type of question

Are you asking about community best practices, how to implement a specific feature, or about general context and help around the operator-sdk?
community best practices

Question

Currently I am looking into the concept of a pause reconciliation (further specified in the proposal below), a way for a custom resource to be flagged as do not reconcile me to allow for maintenance and manual intervention. Does this concept exist witthin the operator-sdk framework, or in Kubernetes as a whole? I have looked into this and found little on the topic. Was hoping for advice and to start a wider conversation around this if not already existent.

Additional context

I put together a proposal of sorts internally for my company and wanted to share it, in case additional context is needed:

Pause custom resource reconcile proposal

A user of an operator might want to pause their operator's function for multiple reasons, this could include:

  • Make unsupported changes to an instance, without the operator interfering (for example for diagnostics)
  • Manually fix an erroneous state that the operator cannot fix (Custom support scenario)
  • Migration data/persistent volumes from one version of an application to another version (Manual migration tasks)
  • Controlled shutdown, ahead of a hardware shutdown or power outage. To ensure the state of an application instance is set to tolerate such an interruption. (Controlled Shutdown)

The spec.pause field was added to prometheus some time ago: prometheus-operator/prometheus-operator@b8e650b

// When a **** deployment is paused, no actions except for deletion
// will be performed on the underlying objects.

is the behaviour they currently have in code.

I propose ew adopt a model similar to this, though semantically I think retaining any functionality aside from the 'delete' reconciliation action is potentially dangerous on pause.
To expand on this, I think the operator should still do any 'clean-up' operations relating to an actual deletion of a custom resource, but it should not remove a component if that component is no longer specified in the spec of a custom resource.
An example of this in Event Streams is if spec.pause is true when you delete a Kafka custom resource, the operator will still delete your PVCs, if spec.kafka.storage.deleteClaim is set to true, so that persistent claims are deleted on custom resource deletion.
If a custom resource is deleted, since so many resources will be using owner references, it must continue to use deletion logic, even if paused. A pause should not pause deletion logic, the pause is no longer in effect when you delete the custom resource, since the field is actually no longer present (along with the rest of the custom resource).
A user could potentially put the instance in a weird state where bespoke deletion logic would not be able to work, but this is user error and deletion logic should tolerate some elements of failure during deletion.
For the most part deletion logic is to be avoided for operators anyway as owner references are a convenient and simple solution to cleanup.

The advantage of such a mechanism is that when an operator managing an instance would normally have to be uninstalled, instead the operator can continue operation,
but the 'paused' instance will no longer be managed.

Proposal

The exact model I propose is as follows:

  • spec.pause is added as a boolean field to all custom resources.

  • If 'paused' a custom resource will exit reconcile before any interactions are done with the application resources, application or related applications, the one exception to this would be the addition of a new status condition that indicates it is Paused - or an Unknown status condition to indicate the operator has no way of verifying the instance state.
    This means operator validation requires the field is false or not present.
    It is not checked on web-hook validation as the spec.pause field being true is a valid custom resource state, but on validation within the operator, a custom resource having spec.pause true denotes it is paused and that the reconcile should end (successfully) with a pause state

  • Reconciles will still be triggered on custom resource changes and periodic reconciles if configured within the operator, but will drop out of reconcile loop.

  • Operator will write a Paused phase or condition to the custom resource to make clear it is unmanaged.

  • Once manual intervention is finished, user deletes the pause field, or sets the field to false. This will trigger a reconcile and operator will continue normal operations.

Alternative names for spec.pause

spec.pause may not be the best name, alternatives could be:

  • pauseReconcilliation
  • skipReconcilliation
  • unmanaged
  • ignored
  • ignoreChanges
@estroz estroz added kind/feature Categorizes issue or PR as related to a new feature. needs discussion labels Jul 20, 2020
@estroz estroz self-assigned this Jul 20, 2020
@estroz estroz added this to the Backlog milestone Jul 20, 2020
@mhrivnak
Copy link
Member

This feature might be more compelling as an annotation.

Suppose we have a Wordpress CR and a Wordpress operator. The CR is our API for describing the wordpress we want, aka the operand. The feature you have described is a way to influence not the operand, but the operator itself. The pause feature is literally meta-data with respect to the API, and I think it would be a better fit as an annotation. There are many examples of using annotations to influence how a controller behaves beyond the strict implementation of its API.

This feature could be implemented in a library of helper code for controllers, or even just documented as a pattern in a controller patterns cookbook.

Regardless, this would be a good feature to bring up with the controller-runtime community.

@estroz estroz removed their assignment Jul 20, 2020
@samuel-hawker
Copy link
Author

@mhrivnak you mention this should be brought up to the controller-runtime community, do you have any recommendations on how I reach out to them? Any specific place I should drop this proposal, or a community meeting to engage in?

@openshift-bot
Copy link

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 29, 2020
@camilamacedo86
Copy link
Contributor

HI @samuel-hawker,

The place is https://github.com/kubernetes-sigs/controller-runtime. It is not something that is part of the domain of responsibility of SDK. I'd recommend you raise your suggestion there.

I am closing this one as sorted out.

@estroz
Copy link
Member

estroz commented Jul 23, 2021

FYI: this has been implemented as an annotation applied to CRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs discussion
Projects
None yet
Development

No branches or pull requests

6 participants