Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipeline deployment deletion can not kick off the reconcile function, leading to no deployment recreation #25

Closed
houshengbo opened this issue Aug 15, 2019 · 24 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.

Comments

@houshengbo
Copy link

Expected Behavior

pipeline deployment deletion should kick off reconcile func, so that the pipeline deployment can be recreated.

Actual Behavior

no recreation of deployment, though deployment is registered:


	err = c.Watch(
		&source.Kind{Type: &appsv1.Deployment{}},
		&handler.EnqueueRequestForOwner{
			IsController: true,
			OwnerType:    &op.Config{},
		})
	if err != nil {
		return err
	}

Steps to Reproduce the Problem

  1. install the operator, remove the either deployment of the pipeline
  2. pipeline deployment can not be revived.
  3. reconcile func is not called.

Additional Info

@houshengbo
Copy link
Author

@nikhil-thomas I have not found out the reason. Do you have any clue?

@nikhil-thomas
Copy link
Member

At present, the operator doesnot have that logic built into it.

In the current implemetation, the deployment deletion event will be detected by the reconcile function, but it will be ignored either here:

if ignoreRequest(req) {

or
here:

@nikhil-thomas
Copy link
Member

we need to add the deployment deleted reconcile logic to the reconciler.

@nikhil-thomas nikhil-thomas added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Aug 19, 2019
@vincent-pli
Copy link
Member

@nikhil-thomas
I think the reason is not only you mentioned.
controller-runtime is still not support cluster-scope owner in v0.1.12 (We adopted), let alone to v0.1.10 (operator sdk v0.9 adopted).
See the link: kubernetes-sigs/controller-runtime#274

I try to use controller-runtime v0.2.0-xx but get issue raised from operator sdk

/root/go/pkg/mod/github.com/operator-framework/operator-sdk@v0.9.0/pkg/log/zap/logger.go:37:18: undefined: "sigs.k8s.io/controller-runtime/pkg/runtime/log".KubeAwareEncoder

So I guess to resolve the issue, we need enhance operator sdk or make Config as namespace scope.

@houshengbo
Copy link
Author

houshengbo commented Aug 19, 2019

@nikhil-thomas Here is what I found: if the reason is due to the IGNORE logic, then we should have at least the log message "" shows up, as "log.Info("reconciling config change")", but when I deleted the deployment, this message did not pop up in the log, which means the reconcile was not kicked off by the deployment deletion.

@vincent-pli I was guessing it might be the reason you mentioned(serving-operator is using the a namespace-scope CR and the deployment change can call the reconcile function), but not verified so far.
BTW, serving-operator uses operator-sdk v0.8.0, but tekton operator uses v0.9.1-0.20190715204459-936584d47ff9 already.

@nikhil-thomas
Copy link
Member

@vincent-pli, @houshengbo, thank you for catching it and filing an issue in operator-sdk project. I have raised a query in https://coreos.slack.com forum-operator-fw. Let see what they say and then we can make a decision.

@nikhil-thomas
Copy link
Member

i got a reply from operator-sdk team

yes, we plan to upgrade the SDK to C-R and C-T v0.2.0 when the stable releases are made upstream. upstream should be this week or next week barring any last minute bugs that come up 🤞. SDK will follow shortly thereafter with a release of our own.

@nikhil-thomas
Copy link
Member

and

Also, if you’re really wanting to get started early, you can use a build from SDK’s refactor/controller-runtime-v0.2.0 branch.

https://coreos.slack.com/archives/C3VS0LV41/p1566225013185700

@nikhil-thomas
Copy link
Member

in my opinion, let us wait for the operator-sdk update. it should be an simple change for us.

@vdemeester @sthaha

@nikhil-thomas
Copy link
Member

@hrishin

@sthaha
Copy link
Member

sthaha commented Aug 20, 2019

@nikhil-thomas @houshengbo @vincent-pli

3. reconcile func is not called.

Please help me understand this better. Here is the test that I ran

  1. setup operator
  2. let it install pipeline
  3. delete the webhook deployment

Noticed the reconcile does indeed get called but it is ignored

019-08-20T17:30:06.685+1000    INFO    ctrl.config.reconcile   reconciling config change       {"Request.Namespace": "tekton-pipeline
s", "Request.NamespaceName": "tekton-pipelines/cluster", "Request.Name": "cluster"}
2019-08-20T17:30:06.685+1000    INFO    ctrl.config.reconcile   ignoring event of resource watched that is not cluster-wide     {"Requ
est.Namespace": "tekton-pipelines", "Request.NamespaceName": "tekton-pipelines/cluster", "Request.Name": "cluster"}

If you don't see Reconcile being called, could you please ensure

 ownerReferences:
  - apiVersion: operator.tekton.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Config
    name: cluster

the deployments have the ownerReferences as above?

Is this different to what you see?

@nikhil-thomas
Copy link
Member

@sthaha
#25 (comment)

remove the either deployment of the pipeline

did you try deleting either the tekton-pipeline-controller or tekton-pipeline-webhook deployments

@sthaha
Copy link
Member

sthaha commented Aug 20, 2019

did you try deleting either the tekton-pipeline-controller or tekton-pipeline-webhook deployments

of course :)

@vincent-pli
Copy link
Member

@sthaha
The log is not about deployment delete.
see the content:

{"Request.Namespace": "tekton-pipelines", 
"Request.NamespaceName": "tekton-pipelines/cluster", 
"Request.Name": "cluster"}

It's about a cluster object in namespace tekton-pipelines

@nikhil-thomas
Copy link
Member

@houshengbo @vincent-pli
are you guys using the --namespace flag

in operator-sdk up local --namespace=""
without it operator-sdk takes current namespace as watch-namespace

ref: https://github.com/operator-framework/operator-sdk/blob/640171cc31c8a442cf8d669603ffc1b62f5d6b44/cmd/operator-sdk/up/local.go#L84

@nikhil-thomas
Copy link
Member

@vincent-pli

The log is not about deployment delete.
see the content:

That is what is expected. Operator-sdk give you a referece to the primary resource
ref: https://github.com/operator-framework/operator-sdk/blob/master/doc/user-guide.md#reconcile-loop

From a kubernetes stand point it makes sense (Edge-Triggered but Level-Driven)

Edge-Triggered: we get a notification that something has happened
Level-Driven: we have to query the api (lister) and decide what to do to reconcile the state

@nikhil-thomas
Copy link
Member

@vincent-pli, @houshengbo
could you guys confirm whether the deployment events are not triggering a reconcile loop
when run using
operator-sdk up local --namespace=""

@vincent-pli
Copy link
Member

I use the standard approach, install operator in k8s cluster.
I can try operator-sdk up local --namespace="" if necessary, but I think the issue is about the issue in Operator sdk i mentioned upper, isn't it?

@nikhil-thomas
Copy link
Member

nikhil-thomas commented Aug 20, 2019

@vincent-pli

I can try operator-sdk up local --namespace="" if necessary, but I think the issue is about the issue in Operator sdk i mentioned upper, isn't it?

in the context of this issue, the reconcile will be triggered if we use the --namespace="" flag. The reason being when the --namespace flag is not set operator-sdk uses current namespace` from kubeconfig.

$ operator-sdk up local
INFO[0000] Running the operator locally.                
INFO[0000] Using namespace default.

When we set `--namespace="", the operator watches all namespaces

$ operator-sdk up local --namespace=""
INFO[0000] Running the operator locally.                
INFO[0000] Using namespace .

We do this while development using up local command
When we run the operator with an image we specify the WATCH_NAMESPACE="" in deployment

- name: WATCH_NAMESPACE

I found out just now which is incorrectly set :)

@nikhil-thomas
Copy link
Member

i have pushed a fix here:
#27

@vincent-pli
Copy link
Member

@nikhil-thomas
Great work, thanks👍

@nikhil-thomas
Copy link
Member

@houshengbo shall we close this issue

@houshengbo
Copy link
Author

@nikhil-thomas I will verify it today. Thx.

@houshengbo
Copy link
Author

We close it, since it has been resolved. Next, we should make sure the deployment can be recreated after deletion.

nikhil-thomas pushed a commit to nikhil-thomas/tektoncd-operator that referenced this issue Jul 7, 2021
Using tkn-aac-sa as SA and using nightly-ci-github-hub-token
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
Projects
None yet
Development

No branches or pull requests

4 participants