
Operator panics if it starts before creating the CRD #183

Closed
fanminshi opened this issue Apr 11, 2018 · 8 comments
Assignees
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug.

Comments

@fanminshi
Contributor

When deploying an operator image built by operator-sdk, the operator panics if its Deployment is created before the CRD is registered.

Reproducible steps:

# setup
$ operator-sdk new app-operator --api-version=app.example.com/v1alpha1 --kind=App
$ cd app-operator/
$ operator-sdk build quay.io/coreos/operator-sdk-dev:app-operator
$ docker push quay.io/coreos/operator-sdk-dev:app-operator

# remove crd creation from operator.yaml.
$ cat deploy/operator.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: app-operator
  template:
    metadata:
      labels:
        name: app-operator
    spec:
      containers:
        - name: app-operator
          image: quay.io/coreos/operator-sdk-dev:app-operator
          command:
          - app-operator
          imagePullPolicy: Always
      imagePullSecrets:
      - name: operator-sdk-secret

# deploy app-operator
kubectl create -f deploy/operator.yaml

# logs
$ kubectl logs -f app-operator-67c6b694-jtgjt
time="2018-04-11T17:57:57Z" level=info msg="Go Version: go1.10"
time="2018-04-11T17:57:57Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-04-11T17:57:57Z" level=info msg="operator-sdk Version: 0.0.4"
time="2018-04-11T17:57:57Z" level=error msg="failed to get resource client for (apiVersion:app.example.com/v1alpha1, kind:App, ns:default): failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(app.example.com/v1alpha1, Kind=App): no matches for app.example.com/, Kind=App"
panic: failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(app.example.com/v1alpha1, Kind=App): no matches for app.example.com/, Kind=App

goroutine 1 [running]:
github.com/coreos-inc/app-operator/vendor/github.com/coreos/operator-sdk/pkg/sdk.Watch(0x1025c40, 0x18, 0x1017e6d, 0x3, 0x10199f2, 0x7, 0x5)
	/Users/fanminshi/work/src/github.com/coreos-inc/app-operator/vendor/github.com/coreos/operator-sdk/pkg/sdk/api.go:48 +0x389
main.main()
	/Users/fanminshi/work/src/github.com/coreos-inc/app-operator/cmd/app-operator/main.go:22 +0x72
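For context, the registration the RESTMapper fails to find is the CRD for the App kind. A manifest along these lines would need to be applied before the Deployment (a sketch: field names follow the apiextensions.k8s.io/v1beta1 API current at the time, and the plural/singular names are assumptions based on the kind):

```yaml
# Sketch of the CRD the operator expects; apply with kubectl before
# deploying the operator. Field names per apiextensions.k8s.io/v1beta1.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: apps.app.example.com
spec:
  group: app.example.com
  version: v1alpha1
  scope: Namespaced
  names:
    kind: App
    listKind: AppList
    plural: apps
    singular: app
```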

Because the operator Deployment defaults to restartPolicy: Always, the operator is restarted until the CRD is created, so it eventually works as usual.

Potential fix:
The operator should wait until the CRD is created before proceeding.
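That fix could be sketched as a bounded retry loop around the lookup that currently panics. This is not SDK code: `probe` is a hypothetical stand-in for resolving the REST mapping of the watched GroupVersionKind against the discovery API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoMatches stands in for the "no matches for kind" error the
// RESTMapper returns while the CRD is not yet registered.
var errNoMatches = errors.New("no matches for app.example.com/, Kind=App")

// waitForCRD polls probe until it succeeds or attempts run out.
// probe is a stand-in for the real lookup (resolving the REST mapping
// for the watched GroupVersionKind via the discovery API).
func waitForCRD(probe func() error, attempts int, interval time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = probe(); err == nil {
			return nil // CRD is registered; safe to start watching
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("giving up waiting for CRD: %w", err)
}

func main() {
	// Simulate a CRD that appears after two failed lookups.
	calls := 0
	probe := func() error {
		calls++
		if calls < 3 {
			return errNoMatches
		}
		return nil
	}
	if err := waitForCRD(probe, 5, time.Millisecond); err != nil {
		panic(err)
	}
	fmt.Println("CRD found after", calls, "lookups")
	// Prints: CRD found after 3 lookups
}
```

A bounded attempt count keeps the pod from hanging silently forever; on exhaustion it still exits with an error and lets the Deployment's restart policy take over.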

@fanminshi
Contributor Author

cc @mikewied I think this is the issue you found and described in #169.

@bradbeam
Contributor

bradbeam commented May 4, 2018

Would it be appropriate for the operator to register/manage the CRD spec internally instead of applying it externally? Similar to what the prometheus-operator does -- https://github.com/coreos/prometheus-operator/blob/72b2e6847dec58269dfe4950097344155f8bc6cf/pkg/alertmanager/operator.go#L536-L562
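A minimal sketch of that idea, with a map standing in for the cluster's set of registered CRDs: the real implementation would use the apiextensions clientset the way the linked prometheus-operator code does, creating the CRD and tolerating "already exists" errors.

```go
package main

import "fmt"

// crdRegistry stands in for the cluster's set of registered CRDs;
// real code would talk to the apiextensions API instead of a map.
type crdRegistry map[string]bool

// ensureCRD registers the named CRD if absent and reports whether it
// had to create it. The create-then-tolerate-"already exists" shape
// mirrors what self-registering operators do on startup.
func ensureCRD(reg crdRegistry, name string) bool {
	if reg[name] {
		return false // already registered: nothing to do
	}
	reg[name] = true
	return true
}

func main() {
	reg := crdRegistry{}
	fmt.Println(ensureCRD(reg, "apps.app.example.com")) // true: created
	fmt.Println(ensureCRD(reg, "apps.app.example.com")) // false: no-op
}
```

The key property is idempotence: the operator can run ensureCRD on every startup without caring whether a previous run (or an admin) already created the CRD.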

@mikewied

@bradbeam This won't always work in production use cases because installing the CRD requires cluster-level permissions, but running the operator in a namespace only requires permissions in that namespace. Some installations will require the CRD to be installed by a more privileged user than the person who is actually using the operator.

@sermilrod

It could be put behind a flag that you enable if you want the framework to generate it automatically.

@spahl
Contributor

spahl commented Jun 5, 2018

The operator should exit with an obvious error message if it can't find the CRD.

@spahl spahl added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. labels Jun 5, 2018
trusch added a commit to trusch/operator-sdk that referenced this issue Jun 11, 2018
*What*

* indefinitely retry to get a resource client in the watch call
* add a mutex to save the `informers` global from concurrent access
* move informer setup into own go-routine

*Why*

* until now, all operators die when they are deployed before the CRD they manage
* `informers` is a global variable in the sdk package; to make it safe for concurrent use it is now guarded by a mutex
* this fixes operator-framework#183
@shawn-hurley
Member

I believe that controller-runtime prints this out:

error no matches for kind "<Kind>" in version "<group>/<version>"

Do people think this is sufficient?

/cc @lilic @hasbro17 @joelanford

@lilic
Member

lilic commented Nov 16, 2018

@shawn-hurley That looks clear enough to me. I think we can close this and get back to it if some user in the future doesn't find it clear enough. 👍 Unless someone has other ideas?

@hasbro17
Contributor

Yeah, I think this is clear enough.
On a related note, we don't need to do anything smarter like having the operator retry (refresh with a deferred RESTMapper) and wait for the CRD to be registered.
The operator's Deployment pod keeps restarting as it exits with an error, and eventually proceeds once the CRD is registered.

8 participants