
Operator panics if it starts before creating the CRD #183

Closed
fanminshi opened this issue Apr 11, 2018 · 8 comments
Assignees
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug.

Comments

@fanminshi
Contributor

When deploying an operator image built by operator-sdk, the operator panics if its Deployment is created before the CRD is registered.

Reproducible steps:

# setup
$ operator-sdk new app-operator --api-version=app.example.com/v1alpha1 --kind=App
$ cd app-operator/
$ operator-sdk build quay.io/coreos/operator-sdk-dev:app-operator
$ docker push quay.io/coreos/operator-sdk-dev:app-operator

# remove crd creation from operator.yaml.
$ cat deploy/operator.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: app-operator
  template:
    metadata:
      labels:
        name: app-operator
    spec:
      containers:
        - name: app-operator
          image: quay.io/coreos/operator-sdk-dev:app-operator
          command:
          - app-operator
          imagePullPolicy: Always
      imagePullSecrets:
      - name: operator-sdk-secret

# deploy app-operator
kubectl create -f deploy/operator.yaml

# logs
$ kubectl logs -f app-operator-67c6b694-jtgjt
time="2018-04-11T17:57:57Z" level=info msg="Go Version: go1.10"
time="2018-04-11T17:57:57Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-04-11T17:57:57Z" level=info msg="operator-sdk Version: 0.0.4"
time="2018-04-11T17:57:57Z" level=error msg="failed to get resource client for (apiVersion:app.example.com/v1alpha1, kind:App, ns:default): failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(app.example.com/v1alpha1, Kind=App): no matches for app.example.com/, Kind=App"
panic: failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(app.example.com/v1alpha1, Kind=App): no matches for app.example.com/, Kind=App

goroutine 1 [running]:
github.com/coreos-inc/app-operator/vendor/github.com/coreos/operator-sdk/pkg/sdk.Watch(0x1025c40, 0x18, 0x1017e6d, 0x3, 0x10199f2, 0x7, 0x5)
	/Users/fanminshi/work/src/github.com/coreos-inc/app-operator/vendor/github.com/coreos/operator-sdk/pkg/sdk/api.go:48 +0x389
main.main()
	/Users/fanminshi/work/src/github.com/coreos-inc/app-operator/cmd/app-operator/main.go:22 +0x72
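For context, the registration the RESTMapper fails to find is the CRD for the App kind. A manifest along these lines would need to be applied before the Deployment (a sketch: field names follow the apiextensions.k8s.io/v1beta1 API current at the time, and the plural/singular names are assumptions based on the kind):

```yaml
# Sketch of the CRD the operator expects; apply with kubectl before
# deploying the operator. Field names per apiextensions.k8s.io/v1beta1.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: apps.app.example.com
spec:
  group: app.example.com
  version: v1alpha1
  scope: Namespaced
  names:
    kind: App
    listKind: AppList
    plural: apps
    singular: app
```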

Because the operator Deployment defaults to restartPolicy: Always, the operator is restarted until the CRD is created, so it eventually works as usual.

Potential fix:
The operator should wait until the CRD is created before proceeding.
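That fix could be sketched as a bounded retry loop around the lookup that currently panics. This is not SDK code: `probe` is a hypothetical stand-in for resolving the REST mapping of the watched GroupVersionKind against the discovery API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoMatches stands in for the "no matches for kind" error the
// RESTMapper returns while the CRD is not yet registered.
var errNoMatches = errors.New("no matches for app.example.com/, Kind=App")

// waitForCRD polls probe until it succeeds or attempts run out.
// probe is a stand-in for the real lookup (resolving the REST mapping
// for the watched GroupVersionKind via the discovery API).
func waitForCRD(probe func() error, attempts int, interval time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = probe(); err == nil {
			return nil // CRD is registered; safe to start watching
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("giving up waiting for CRD: %w", err)
}

func main() {
	// Simulate a CRD that appears after two failed lookups.
	calls := 0
	probe := func() error {
		calls++
		if calls < 3 {
			return errNoMatches
		}
		return nil
	}
	if err := waitForCRD(probe, 5, time.Millisecond); err != nil {
		panic(err)
	}
	fmt.Println("CRD found after", calls, "lookups")
	// Prints: CRD found after 3 lookups
}
```

A bounded attempt count keeps the pod from hanging silently forever; on exhaustion it still exits with an error and lets the Deployment's restart policy take over.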

@fanminshi
Contributor Author

cc @mikewied I think this is the issue you found and described in #169.

@bradbeam
Contributor

bradbeam commented May 4, 2018

Would it be appropriate for the operator to register/manage the CRD spec internally instead of applying it externally? Similar to what the prometheus-operator does -- https://github.com/coreos/prometheus-operator/blob/72b2e6847dec58269dfe4950097344155f8bc6cf/pkg/alertmanager/operator.go#L536-L562
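A minimal sketch of that idea, with a map standing in for the cluster's set of registered CRDs: the real implementation would use the apiextensions clientset the way the linked prometheus-operator code does, creating the CRD and tolerating "already exists" errors.

```go
package main

import "fmt"

// crdRegistry stands in for the cluster's set of registered CRDs;
// real code would talk to the apiextensions API instead of a map.
type crdRegistry map[string]bool

// ensureCRD registers the named CRD if absent and reports whether it
// had to create it. The create-then-tolerate-"already exists" shape
// mirrors what self-registering operators do on startup.
func ensureCRD(reg crdRegistry, name string) bool {
	if reg[name] {
		return false // already registered: nothing to do
	}
	reg[name] = true
	return true
}

func main() {
	reg := crdRegistry{}
	fmt.Println(ensureCRD(reg, "apps.app.example.com")) // true: created
	fmt.Println(ensureCRD(reg, "apps.app.example.com")) // false: no-op
}
```

The key property is idempotence: the operator can run ensureCRD on every startup without caring whether a previous run (or an admin) already created the CRD.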

@mikewied

@bradbeam This won't always work in production use cases because installing the CRD requires cluster-level permissions, but running the operator in a namespace only requires permissions in that namespace. Some installations will require the CRD to be installed by a more privileged user than the person who is actually using the operator.

@sermilrod

It could be put behind a flag that you enable if you want the framework to generate it automatically.

@spahl
Contributor

spahl commented Jun 5, 2018

The operator should exit with an obvious error message if it can't find the CRD.

@spahl spahl added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. labels Jun 5, 2018
trusch added a commit to trusch/operator-sdk that referenced this issue Jun 11, 2018
*What*

* indefinitely retry to get a resource client in the watch call
* add a mutex to save the `informers` global from concurrent access
* move informer setup into own go-routine

*Why*

* until now, all operators die when they are deployed before the CRD they manage
* `informers` is a global variable in the sdk package; to make it safe for concurrent use it is now guarded by a mutex
* this fixes operator-framework#183
@shawn-hurley
Member

I believe that controller-runtime prints this out:

error no matches for kind "<Kind>" in version "<group>/<version>"

Do people think this is sufficient?

/cc @lilic @hasbro17 @joelanford

@lilic
Member

lilic commented Nov 16, 2018

@shawn-hurley That looks clear enough to me. I think we can close this and get back to it if some user in the future doesn't find it clear enough. 👍 Unless someone has other ideas?

@hasbro17
Contributor

Yeah, I think this is clear enough.
On a related note, we don't need to do anything smarter like having the operator retry (refresh with a deferred RESTMapper) and wait for the CRD to be registered.
The operator's Deployment pod keeps restarting as it exits with an error, and eventually proceeds once the CRD is registered.

8 participants