
fix(controller): handle error for invalid osp #396

Merged
merged 1 commit into from
Jun 11, 2024

Conversation

oliverbaehler
Contributor

What this PR does / why we need it:

I have created my own CustomOperatingSystemProfile and attempted to use it in my clusters. However, whenever a MachineDeployment was associated with the profile, the OSM controller started crashing because of an uncaught nil pointer dereference:

$ kubectl logs -f operating-system-manager-f849789b8-hgk7w -n cluster-n57gh8qjcd
Defaulted container "operating-system-manager" out of: operating-system-manager, copy-http-prober (init)
{"level":"info","time":"2024-06-10T17:43:43.487Z","logger":"http-prober","caller":"http-prober/main.go:137","msg":"Probing","attempt":1,"max-attempts":100,"target":"https://apiserver-external.cluster-n57gh8qjcd.svc.cluster.local./healthz"}
{"level":"info","time":"2024-06-10T17:43:43.491Z","logger":"http-prober","caller":"http-prober/main.go:126","msg":"Hostname resolved","hostname":"apiserver-external.cluster-n57gh8qjcd.svc.cluster.local.","address":"10.100.3.234:443"}
{"level":"info","time":"2024-06-10T17:43:43.494Z","logger":"http-prober","caller":"http-prober/main.go:150","msg":"Endpoint is available"}
{"level":"info","time":"2024-06-10T17:43:43.518Z","caller":"osm-controller/main.go:309","msg":"starting manager"}
{"level":"info","time":"2024-06-10T17:43:43.519Z","logger":"controller-runtime.metrics","caller":"manager/runnable_group.go:223","msg":"Starting metrics server"}
{"level":"info","time":"2024-06-10T17:43:43.519Z","logger":"controller-runtime.metrics","caller":"manager/runnable_group.go:223","msg":"Serving metrics server","bindAddress":"0.0.0.0:8080","secure":false}
{"level":"info","time":"2024-06-10T17:43:43.519Z","caller":"manager/runnable_group.go:223","msg":"starting server","kind":"health probe","addr":"[::]:8085"}
I0610 17:43:43.519278       1 leaderelection.go:250] attempting to acquire leader lease kube-system/operating-system-manager...
I0610 17:45:13.387812       1 leaderelection.go:260] successfully acquired lease kube-system/operating-system-manager
{"level":"info","time":"2024-06-10T17:45:13.388Z","caller":"controller/controller.go:234","msg":"Starting EventSource","controller":"operating-system-config-controller","source":"kind source: *v1alpha1.MachineDeployment"}
{"level":"info","time":"2024-06-10T17:45:13.388Z","caller":"controller/controller.go:234","msg":"Starting EventSource","controller":"OperatingSystemProfileController","source":"kind source: *v1.Deployment"}
{"level":"info","time":"2024-06-10T17:45:13.388Z","caller":"controller/controller.go:234","msg":"Starting Controller","controller":"operating-system-config-controller"}
{"level":"info","time":"2024-06-10T17:45:13.388Z","caller":"controller/controller.go:234","msg":"Starting Controller","controller":"OperatingSystemProfileController"}
{"level":"info","time":"2024-06-10T17:45:13.490Z","caller":"controller/controller.go:234","msg":"Starting workers","controller":"operating-system-config-controller","worker count":10}
{"level":"info","time":"2024-06-10T17:45:13.491Z","caller":"osc/osc_controller.go:138","msg":"Reconciling OSC resource..","request":"kube-system/practical-blackwell"}
{"level":"info","time":"2024-06-10T17:45:13.593Z","caller":"controller/controller.go:234","msg":"Starting workers","controller":"OperatingSystemProfileController","worker count":10}
{"level":"info","time":"2024-06-10T17:45:13.594Z","caller":"osp/osp_controller.go:105","msg":"Reconciling default OSP resource.."}
{"level":"info","time":"2024-06-10T17:45:13.594Z","caller":"osp/osp_controller.go:105","msg":"Reconciling default OSP resource.."}
{"level":"info","time":"2024-06-10T17:45:13.594Z","caller":"osp/osp_controller.go:105","msg":"Reconciling default OSP resource.."}
{"level":"info","time":"2024-06-10T17:45:13.594Z","caller":"osp/osp_controller.go:105","msg":"Reconciling default OSP resource.."}
{"level":"info","time":"2024-06-10T17:45:13.690Z","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-flatcar"}
{"level":"info","time":"2024-06-10T17:45:13.707Z","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-amzn2"}
{"level":"info","time":"2024-06-10T17:45:13.797Z","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-rockylinux"}
{"level":"info","time":"2024-06-10T17:45:13.899Z","caller":"runtime/panic.go:770","msg":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","controller":"operating-system-config-controller","object":{"name":"practical-blackwell","namespace":"kube-system"},"namespace":"kube-system","name":"practical-blackwell","reconcileID":"2b09108a-c84b-4114-b88e-5aad7b00b559"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x130 pc=0x15f6c40]

goroutine 305 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:116 +0x1e5
panic({0x177e000?, 0x28da590?})
	runtime/panic.go:770 +0x132
k8c.io/operating-system-manager/pkg/controllers/osc.(*Reconciler).reconcileOperatingSystemConfigs(0xc00034e680, {0x1c665b8, 0xc000711bf0}, 0xc00035f408)
	k8c.io/operating-system-manager/pkg/controllers/osc/osc_controller.go:272 +0x8e0
k8c.io/operating-system-manager/pkg/controllers/osc.(*Reconciler).reconcile(0xc00034e680, {0x1c665b8, 0xc000711bf0}, 0xc00035f408)
	k8c.io/operating-system-manager/pkg/controllers/osc/osc_controller.go:184 +0xdd
k8c.io/operating-system-manager/pkg/controllers/osc.(*Reconciler).Reconcile(0xc00034e680, {0x1c665b8, 0xc000711bf0}, {{{0xc000a02180, 0xb}, {0xc00079e5a0, 0x13}}})
	k8c.io/operating-system-manager/pkg/controllers/osc/osc_controller.go:166 +0x405
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1c6a578?, {0x1c665b8?, 0xc000711bf0?}, {{{0xc000a02180?, 0xb?}, {0xc00079e5a0?, 0x0?}}})
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00072b680, {0x1c665f0, 0xc0005df630}, {0x17fe720, 0xc000020860})
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316 +0x3bc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00072b680, {0x1c665f0, 0xc0005df630})
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266 +0x1be
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 278
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:223 +0x50c

And the pod ended up in a crash loop:

operating-system-manager-f849789b8-hgk7w             0/1     CrashLoopBackOff   233 (3m3s ago)   20h

This made it impossible to lifecycle other nodes in that cluster. This change catches the nil pointer and returns an error instead. With that in place, it also becomes immediately clear that the profile itself is broken:

...
{"level":"info","time":"2024-06-10T19:45:31.402+0200","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-rockylinux"}
{"level":"info","time":"2024-06-10T19:45:31.403+0200","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-rhel"}
{"level":"info","time":"2024-06-10T19:45:31.492+0200","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-ubuntu"}
{"level":"info","time":"2024-06-10T19:45:31.492+0200","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-amzn2"}
{"level":"info","time":"2024-06-10T19:45:31.593+0200","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-flatcar"}
{"level":"error","time":"2024-06-10T19:45:31.648+0200","caller":"osc/osc_controller.go:167","msg":"Reconciling failed","error":"failed to reconcile operating system config: failed to generate OSC: failed to render bootstrapping file templates: failed to populate OSP file template: failed to parse OSP file [/opt/bin/node-start.sh] template: template: /opt/bin/node-start.sh:3: unexpected \"\\\\\" in template clause"}
{"level":"error","time":"2024-06-10T19:45:31.648+0200","caller":"controller/controller.go:261","msg":"Reconciler error","controller":"operating-system-config-controller","controllerGroup":"cluster.k8s.io","controllerKind":"MachineDeployment","MachineDeployment":{"name":"practical-blackwell","namespace":"kube-system"},"namespace":"kube-system","name":"practical-blackwell","reconcileID":"b6c9d8c9-e19f-4949-b94b-a308ea27cbf8","error":"failed to reconcile operating system config: failed to generate OSC: failed to render bootstrapping file templates: failed to populate OSP file template: failed to parse OSP file [/opt/bin/node-start.sh] template: template: /opt/bin/node-start.sh:3: unexpected \"\\\\\" in template clause"}

However, since there is no prior validation of the profile, and it is replicated directly to all seed clusters, an invalid profile may cause partial degradation of Kubermatic components (OSM). We should therefore just handle the error.

Which issue(s) this PR fixes:

Fixes #

What type of PR is this?

Special notes for your reviewer:

Does this PR introduce a user-facing change? Then add your Release Note here:

NONE

Documentation:

NONE

@kubermatic-bot kubermatic-bot added docs/none Denotes a PR that doesn't need documentation (changes). release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Denotes that all commits in the pull request have the valid DCO signoff message. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 10, 2024
@kubermatic-bot
Contributor

Hi @oliverbaehler. Thanks for your PR.

I'm waiting for a kubermatic member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubermatic-bot kubermatic-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jun 10, 2024
@kubermatic-bot kubermatic-bot added dco-signoff: no Denotes that at least one commit in the pull request doesn't have a valid DCO signoff message. and removed dco-signoff: yes Denotes that all commits in the pull request have the valid DCO signoff message. labels Jun 11, 2024
Signed-off-by: Oliver Bähler <oliverbaehler@hotmail.com>
@kubermatic-bot kubermatic-bot added dco-signoff: yes Denotes that all commits in the pull request have the valid DCO signoff message. and removed dco-signoff: no Denotes that at least one commit in the pull request doesn't have a valid DCO signoff message. labels Jun 11, 2024
Member

@ahmedwaleedmalik ahmedwaleedmalik left a comment


/approve

@kubermatic-bot kubermatic-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 11, 2024
@kubermatic-bot
Contributor

LGTM label has been added.

Git tree hash: 07334420f5313de51ec7b2d78a195fb8a93d8442

@kubermatic-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahmedwaleedmalik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubermatic-bot kubermatic-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2024
@kubermatic-bot kubermatic-bot merged commit 57e5476 into kubermatic:main Jun 11, 2024
10 of 11 checks passed
@ahmedwaleedmalik
Member

/cherry-pick release/v1.5

@kubermatic-bot
Contributor

@ahmedwaleedmalik: new pull request created: #397

In response to this:

/cherry-pick release/v1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
