
Prevent creating a controller if a CRD does not exist. #840

Closed
sophieliu15 opened this issue Mar 5, 2020 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/support Categorizes issue or PR as a support question. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Comments

sophieliu15 commented Mar 5, 2020

I want to create a controller for each GVK that I add dynamically. Previously I noticed that if I started the manager and then installed a new CRD on the fly, I would get an error saying the CRD did not exist when I called NewControllerManagedBy(mgr)...Complete(r). Specifically, this line would return an error. The error message was the following, where CronTab was a new CRD that I installed after starting the controller manager:

Error while trying to create ObjectReconciler","gvk":"stable.example.com/v1, Kind=CronTab","error":"no matches for kind \"CronTab\" in version \"stable.example.com/v1\"

Then I found #554 and updated the controller-runtime version to the latest one (v0.5.0) to use DynamicRESTMapper. After updating, I started the manager without installing the CRD for CronTab. I found that I could create the controller for CronTab successfully (i.e., this line did not return any error). However, when the Reconcile method for the CronTab controller was called, I got an error from pkg/source/source.go complaining that the CRD was not found. The following was the error:

{"level":"error","ts":1583448155.5018284,"logger":"controller-runtime.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":"CronTab.stable.example.com","error":"no matches for kind \"CronTab\" in version \"stable.example.com/v1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/source/source.go:88\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:165\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:198\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Add.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/manager/internal.go:226"}

My questions are:

  • Is there any way to prevent creating a controller if a specific CRD does not exist in the apiserver? I am currently using the check below (a fuller sketch appears after this list):
_, err := r.Manager.GetRESTMapper().RESTMapping(gvk.GroupKind(), gvk.Version)
if err != nil {
  // Do not create the controller.
}

Not sure if there is a better way (or common practice) to do that?

  • After updating to the latest controller-runtime, I would expect controller creation to fail if I did not install the CRD, and to succeed after I created the CRD on the fly. I am curious why I was able to create the controller without installing the CRD after the update. Is this change expected?
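
For reference, a self-contained sketch of that pre-check, assuming controller-runtime v0.5.0; the maybeCreateController helper name and the use of an unstructured object are illustrative, not an established API:

package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// maybeCreateController (hypothetical helper) wires up a controller for gvk
// only if the corresponding CRD is already installed.
func maybeCreateController(mgr ctrl.Manager, gvk schema.GroupVersionKind, r reconcile.Reconciler) error {
	// With the DynamicRESTMapper (the default since v0.5.0), a cache miss
	// triggers a discovery reload, so a CRD installed after the manager
	// started can still be found here.
	if _, err := mgr.GetRESTMapper().RESTMapping(gvk.GroupKind(), gvk.Version); err != nil {
		if meta.IsNoMatchError(err) {
			// The CRD is not installed yet: skip this optional controller
			// instead of letting its watch fail at start time.
			return nil
		}
		return err
	}
	obj := &unstructured.Unstructured{}
	obj.SetGroupVersionKind(gvk)
	return ctrl.NewControllerManagedBy(mgr).
		For(obj).
		Complete(r)
}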
sophieliu15 (Author) commented Mar 5, 2020

/help

k8s-ci-robot (Contributor) commented:
@sophieliu15:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help
/triage/support

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot (Contributor) commented:
@sophieliu15: The label(s) triage/ cannot be applied, because the repository doesn't have them

In response to this:

/help
/triage/support

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 5, 2020
DirectXMan12 (Contributor) commented:
Using the RESTMapper is a reasonable choice if you want to continue on, but not create the controller.

As for the "no longer fails on .Complete", I have a suspicion that this is due to 0fdf465, which moved the point at which we start watches, to avoid leaks.

DirectXMan12 (Contributor) commented:
Based on the linked issue, this is mainly a "don't start this optional controller till this thing is installed", right?

DirectXMan12 (Contributor) commented:
As for the "no longer fails on .Complete"

Specifically, it should now fail when you start the manager instead.

DirectXMan12 (Contributor) commented:
or if the manager is already started (which it looks like from the stack trace), this error should induce manager shutdown
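
For concreteness, a minimal sketch of where that failure now surfaces, assuming a typical v0.5.0 main (options elided; no controllers registered here):

package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	logf "sigs.k8s.io/controller-runtime/pkg/log"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		logf.Log.Error(err, "unable to create manager")
		os.Exit(1)
	}
	// With v0.5.0 semantics, a controller whose watch cannot be started
	// (e.g. because its CRD is missing) surfaces the error here: Start
	// returns it and the whole manager shuts down.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		logf.Log.Error(err, "manager exited with error")
		os.Exit(1)
	}
}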

DirectXMan12 commented Mar 6, 2020

In retrospect this is semi-expected behavior, but it was not intended to break workflows. I've added a note to the v0.4.0 release. Technically we're okay because that was a breaking release, but we shouldn't have done it without at least a note.

@DirectXMan12 DirectXMan12 added kind/bug Categorizes issue or PR as related to a bug. kind/support Categorizes issue or PR as a support question. and removed help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Mar 6, 2020
sophieliu15 added a commit to sophieliu15/multi-tenancy that referenced this issue Mar 6, 2020
Reason for upgrade:
The new version uses [DynamicRESTMapper](https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/client/apiutil/dynamicrestmapper.go) as the default `RESTMapper` for the controller-runtime manager. `DynamicRESTMapper` will "reload the delegated `meta.RESTMapper` on a cache miss" (see kubernetes-sigs/controller-runtime#554), which solves the problem that, with controller-runtime v0.2.2, we had to restart HNC after adding a new CRD in order to create the corresponding object reconciler (see details in kubernetes-retired#488).

Incompatibility issues addressed in this PR:

- Upgrade go version in `go.mod` and `Dockerfile` to 1.13. The [errors.As](https://golang.org/pkg/errors/#As) in [dynamicrestmapper.go](https://github.com/kubernetes-sigs/controller-runtime/blob/bfc982769615817ee15db8a543662652d900d27b/pkg/client/apiutil/dynamicrestmapper.go#L48) requires go 1.13.
- A higher version for k8s.io/cli-runtime and k8s.io/client-go are required after upgrading controller-runtime to v0.5.0
- Version changes of other packages in `go.mod` are updated automatically.
- serializer.DirectCodecFactory was renamed to [serializer.WithoutConversionCodecFactory](https://godoc.org/k8s.io/apimachinery/pkg/runtime/serializer#WithoutConversionCodecFactory) after k8s.io/apimachinery 1.16 (see [here](kubernetes/apimachinery@ed8af17), and [here](kubernetes/apimachinery@4fac835))
- Default [LeaderElectionID](https://github.com/kubernetes-sigs/controller-runtime/blob/bfc982769615817ee15db8a543662652d900d27b/pkg/leaderelection/leader_election.go#L46) in controller-runtime manager is removed after kubernetes-sigs/controller-runtime#446.
- [NewlineReporter](https://godoc.org/sigs.k8s.io/controller-runtime/pkg/envtest/printer) is moved from `sigs.k8s.io/controller-runtime/pkg/envtest/` to `https://godoc.org/sigs.k8s.io/controller-runtime/pkg/envtest/printer` by [this](kubernetes-sigs/controller-runtime@748f55d#diff-42de1d59fbe8f8b90154f01dd01f5748) commit.
- In controller-runtime v0.2.2, if a resource does not exist, a
controller cannot be created successfully. After update
controller-runtime to v0.5.0, a controller can be created without error.
However, when the `Reconcile` method is triggered, there will be an
error complaining the resource does not exist. Therefore, we will
explicitly check if a resource exists before creating the corresponding
object reconciler in `createObjectReconciler` in `hnc_config.go` (see
details in kubernetes-sigs/controller-runtime#840)

Tested:

- Unit tests.
- Went through [demo script](https://docs.google.com/document/d/1tKQgtMSf0wfT3NOGQx9ExUQ-B8UkkdVZB6m4o3Zqn64/edit#) to make sure HNC behaves as expected on a GKE cluster.
- Manually tested that the PR solves the restart problem described in kubernetes-retired#488 with the following workflow:
     - Install HNC
     - Install a new CRD
     - Configure the new type in the `config` singleton

Before this PR, the corresponding object reconciler for the new type was not created unless we restarted HNC. After the change, the corresponding object reconciler is created and reconciles objects of the new type as expected, without restarting HNC.

This partly solves: kubernetes-retired#488
DirectXMan12 (Contributor) commented:
I do feel mildly uncomfortable about inducing manager failure here, but the general policy is that individual controller failure causes general manager failure, so it's in line with that.

sophieliu15 (Author) commented:
Hi @DirectXMan12

Thanks so much for the explanation! See my replies below.

Based on the linked issue, this is mainly a "don't start this optional controller till this thing is installed", right?

Yes that is correct!

or if the manager is already started (which it looks like from the stack trace), this error should induce manager shutdown

The manager is already started. It looks like in v0.2.2, if the kind did not exist, we could not create the new controller, but the manager was also not shut down. In the latest version, the controller can be created, but it then fails at controller start time and causes the manager to shut down (basically what you said in the release notes).

Thanks for the answers and updating the release notes.

adrianludwin commented:
or if the manager is already started (which it looks like from the stack trace), this error should induce manager shutdown

Hey @DirectXMan12, wdyt about just having NewControllerManagedBy(mgr)...Complete(r) fail if the type doesn't exist, as opposed to having the manager as a whole terminate? There are now several use cases of users effectively starting new reconcilers on the fly (HNC and Gatekeeper), so it might be nice to have controller-runtime return an error rather than just shutting down if there's a problem.

Obviously this can wait for a future release of controller-runtime.

sophieliu15 added a commit to sophieliu15/multi-tenancy that referenced this issue Mar 10, 2020
… reconciler

After upgrading sigs.k8s.io/controller-runtime to v0.5.0, we can create a reconciler successfully even when the resource does not exist in the cluster. Therefore, we explicitly check whether the resource exists before creating the reconciler. See the detailed discussion in kubernetes-sigs/controller-runtime#840.

Tested: unit tests, GKE cluster
sophieliu15 added a commit to sophieliu15/multi-tenancy that referenced this issue Mar 10, 2020
After upgrading sigs.k8s.io/controller-runtime to v0.5.0, we can create a reconciler successfully even when the resource does not exist in the cluster. Therefore, we explicitly check whether the resource exists before creating the reconciler. See the detailed discussion in kubernetes-sigs/controller-runtime#840.

Tested: unit tests, GKE cluster
DirectXMan12 (Contributor) commented:
Hmm...

The whole weird thing here is that now non-existence is potentially transient, and that it doesn't actually matter if the type doesn't exist on a call to New -- it only matters when we go to run the watches.

I think this is a question of the semantics of New -- in CR it's traditionally (mostly) been "prepare a thing", while Start is "make the thing go". I generally see New as "struct initialization + extra setup" (since Go can't do RAII), so I'm not sure I'd expect "type doesn't exist on the server" to make it fail, since that's more of a "make the thing go"-type error.

On the other hand, delaying it till start doesn't really make sense when the manager is the one in charge of calling start on all the things -- there's no good way to capture the specific failure without wrapping the controller in a new runnable (sketched below), and there's currently no good way to say "run this and don't stop the manager if it fails".

All of this is a long-winded way of saying "I'm not sure". I can't help but think this is a symptom of something being wrong with the overall architecture of the manager, but I'm not quite certain what (maybe error handling in Start and the stoppability of controllers? something with context? unclear).

Anyone have any thoughts? Making .Complete go back to failing, while somewhat semantically troubling, is a pretty straightforward solution, so it's possible I'm just overcomplicating things.
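
For illustration, a minimal sketch of the "wrap the controller in a new runnable" idea mentioned above, against v0.5.0's Runnable interface; tolerantRunnable is a hypothetical name, not something controller-runtime provides:

package controllers

import (
	logf "sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// tolerantRunnable (hypothetical) wraps another Runnable and captures its
// start-time failure instead of propagating it, so one failing optional
// controller does not bring the whole manager down.
type tolerantRunnable struct {
	inner manager.Runnable
}

func (t tolerantRunnable) Start(stop <-chan struct{}) error {
	if err := t.inner.Start(stop); err != nil {
		// e.g. "no matches for kind ..." when the CRD is missing at
		// watch start: log and swallow rather than stop the manager.
		logf.Log.Error(err, "optional controller failed; continuing without it")
	}
	return nil
}

You would still need to register this via mgr.Add and obtain the controller without the builder auto-adding it (e.g. controller.NewUnmanaged, if your version has it), which is part of the awkwardness described above.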

DirectXMan12 (Contributor) commented:
FWIW, doing non-initialization things in New has led to leaks in the past. This obviously wouldn't, but making the semantic distinction is potentially useful for avoiding that.

@vincepri vincepri added this to the Next milestone Mar 26, 2020
fejta-bot commented:
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2020
fejta-bot commented:
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 24, 2020
fejta-bot commented:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented:
@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
