Start smart clone controller from datavolume controller when needed #2265

Merged
merged 2 commits into from
May 12, 2022

Conversation

awels
Member

@awels awels commented May 6, 2022

Signed-off-by: Alexander Wels <awels@redhat.com>

What this PR does / why we need it:
It was possible for the CSI snapshot CRD check to fail silently during CDI deployment pod startup, which prevented the smart clone controller from starting. As a result, smart clone would not work properly.

This moves the creation of the smart clone controller into the datavolume controller, which creates it the first time someone attempts a smart clone. Until that happens, the smart clone controller is not started at all. To ensure we never start more than one smart clone controller, we use a channel to synchronize creation; once the controller has been created, the start function is a no-op.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

BugFix: Fix smart clone controller not starting if an error occurred during startup

@kubevirt-bot kubevirt-bot added the release-note and dco-signoff: yes labels May 6, 2022
@@ -288,7 +289,7 @@ func deleteReadyFile() {
}

 func addCrdInformerEventHandlers(crdInformer cache.SharedIndexInformer, extclient extclientset.Interface, mgr manager.Manager, log logr.Logger) {
-	crdInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
+	crdInformer.AddEventHandlerWithResyncPeriod(cache.ResourceEventHandlerFuncs{
 		AddFunc: func(obj interface{}) {
Member

This should handle UpdateFunc as well as AddFunc, since I believe UpdateFunc is what gets called on resync.

}

 func startSmartController(extclient extclientset.Interface, mgr manager.Manager, log logr.Logger) {
-	if controller.IsCsiCrdsDeployed(extclient) {
+	if controller.IsCsiCrdsDeployed(extclient, log) {
 		log.Info("CSI CRDs detected, starting smart clone controller")
 		if _, err := controller.NewSmartCloneController(mgr, log, installerLabels); err != nil {
Member

Have to make sure to not do this multiple times I think

Member Author

Yeah, this probably worked because we never tried to do it more than once.

@awels awels force-pushed the add_refresh_on_snapshot_crd_check branch from fdec6b2 to c751542 Compare May 6, 2022 21:58
@kubevirt-bot kubevirt-bot added size/M and removed size/S labels May 6, 2022
@akalenyu
Collaborator

akalenyu commented May 9, 2022

What about panicking when this GET fails? Just putting it out there; I'm curious as to why you ruled it out.

This GET failing is an indication of pretty bad things going on, right?

@awels
Member Author

awels commented May 9, 2022

It could be a transient network issue, for instance, so there is no reason to panic. In essence, a panic would cause the control plane of CDI to disappear for some period of time, which could add to whatever is causing the transient failure. So why not handle it properly?

@awels awels force-pushed the add_refresh_on_snapshot_crd_check branch from c751542 to 4d9f1a3 Compare May 9, 2022 22:12
@kubevirt-bot kubevirt-bot added size/L and removed size/M labels May 9, 2022
@awels
Member Author

awels commented May 10, 2022

/retest

@awels awels changed the title from "periodic sync CSI snapshot CRD check" to "Start smart clone controller from datavolume controller when needed" May 10, 2022
@@ -291,6 +294,7 @@ func NewDatavolumeController(
 	if err := addDatavolumeControllerWatches(mgr, datavolumeController); err != nil {
 		return nil, err
 	}
+	go reconciler.startSnapshotCsiCrdDetection()
Member

I think it would be nicer, rather than creating our own goroutine, let the controller runtime manager deal with it. See here: https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/manager/manager.go#L61

Basically just pass a Runnable and when the manager is started, it will create/manage the goroutine.

I believe this will clean up all the funniness in the unit test with the dummy goroutine and passing the additional context

Member

It was possible for the CSI snapshot CRD check to fail silently and
prevent the smart clone controller from starting during the cdi deployment
pod start up. This would prevent smart clone from working properly.

This adds a periodic sync of 1 minute for checking the CRDs. We also
log failures other than "not found" so we can more easily detect this
situation as humans.

Signed-off-by: Alexander Wels <awels@redhat.com>
@awels awels force-pushed the add_refresh_on_snapshot_crd_check branch from 4d9f1a3 to f68376f Compare May 11, 2022 15:40
log.Info("Missing CSI snapshotter CRDs, falling back to host assisted clone")
return "", nil
}
r.sccs.StartController()
Member

This seems a little too hidden in this function, how about here:

Check if strategy is smart clone and start controller if so. In this location (main reconcile func) it is more explicit, no weird side effects

Member Author

But we should only start the smart clone controller if the snapshot CRDs are there. I guess us creating the controller will fail because the watches won't work, and we can try again later.

Member

@mhenriks mhenriks May 11, 2022

We should not choose smart clone strategy without checking the CRDs. Not sure what you mean about failing.

Member Author

I mean that if we do manage to attempt to start the smart clone controller, it will fail when we add the watches in the controller because the CRDs don't exist. But you make a good point that we won't pick that strategy without checking if the CRDs exist. I already fixed it, by the way.
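
The ordering agreed on in this thread can be sketched as follows: the strategy is chosen only after the CRD check, and the controller is started explicitly in the reconcile path when that strategy is smart clone. This is a simplified illustration with hypothetical names (`selectCloneStrategy`, `reconcileClone`); a real strategy decision involves more conditions than CRD presence.

```go
package main

import "fmt"

type cloneStrategy string

const (
	hostAssistedClone cloneStrategy = "host-assisted"
	smartClone        cloneStrategy = "smart-clone"
)

// selectCloneStrategy is a hypothetical, simplified selection: smart clone is
// only chosen when the snapshot CRDs are present.
func selectCloneStrategy(snapshotCRDsPresent bool) cloneStrategy {
	if !snapshotCRDsPresent {
		return hostAssistedClone
	}
	return smartClone
}

// reconcileClone makes the side effect explicit: the controller is started
// here, in the reconcile path, and only for the smart clone strategy.
func reconcileClone(snapshotCRDsPresent bool, startSmartCloneController func()) cloneStrategy {
	strategy := selectCloneStrategy(snapshotCRDsPresent)
	if strategy == smartClone {
		startSmartCloneController()
	}
	return strategy
}

func main() {
	started := 0
	start := func() { started++ }

	fmt.Println(reconcileClone(false, start), started)
	fmt.Println(reconcileClone(true, start), started)
}
```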

Signed-off-by: Alexander Wels <awels@redhat.com>
@mhenriks
Member

/retest

@mhenriks
Member

/lgtm
/approve

@kubevirt-bot kubevirt-bot added the lgtm label May 11, 2022
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mhenriks

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved label May 11, 2022
@kubevirt-bot kubevirt-bot merged commit b3d0826 into kubevirt:main May 12, 2022
@awels
Member Author

awels commented May 13, 2022

/cherrypick release-v1.43

@kubevirt-bot
Contributor

@awels: #2265 failed to apply on top of branch "release-v1.43":

Applying: periodic sync CSI snapshot CRD check
Using index info to reconstruct a base tree...
M	Makefile
M	cmd/cdi-controller/BUILD.bazel
M	cmd/cdi-controller/controller.go
M	pkg/controller/BUILD.bazel
M	pkg/controller/datavolume-controller.go
M	pkg/controller/datavolume-controller_test.go
M	pkg/controller/util.go
M	tests/datavolume_test.go
Falling back to patching base and 3-way merge...
Auto-merging tests/datavolume_test.go
Auto-merging pkg/controller/util.go
CONFLICT (content): Merge conflict in pkg/controller/util.go
Auto-merging pkg/controller/datavolume-controller_test.go
CONFLICT (content): Merge conflict in pkg/controller/datavolume-controller_test.go
Auto-merging pkg/controller/datavolume-controller.go
Auto-merging pkg/controller/BUILD.bazel
Auto-merging cmd/cdi-controller/controller.go
CONFLICT (content): Merge conflict in cmd/cdi-controller/controller.go
Auto-merging cmd/cdi-controller/BUILD.bazel
CONFLICT (content): Merge conflict in cmd/cdi-controller/BUILD.bazel
Auto-merging Makefile
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 periodic sync CSI snapshot CRD check
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-v1.43

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

awels added a commit to awels/containerized-data-importer that referenced this pull request May 13, 2022
…ubevirt#2265)

* periodic sync CSI snapshot CRD check

It was possible for the CSI snapshot CRD check to fail silently and
prevent the smart clone controller from starting during the cdi deployment
pod start up. This would prevent smart clone from working properly.

This adds a periodic sync of 1 minute for checking the CRDs. We also
log failures other than "not found" so we can more easily detect this
situation as humans.

Signed-off-by: Alexander Wels <awels@redhat.com>

* Change location of the start controller call.

Signed-off-by: Alexander Wels <awels@redhat.com>
kubevirt-bot pushed a commit that referenced this pull request May 17, 2022
…2265) (#2271)

* periodic sync CSI snapshot CRD check

It was possible for the CSI snapshot CRD check to fail silently and
prevent the smart clone controller from starting during the cdi deployment
pod start up. This would prevent smart clone from working properly.

This adds a periodic sync of 1 minute for checking the CRDs. We also
log failures other than "not found" so we can more easily detect this
situation as humans.

Signed-off-by: Alexander Wels <awels@redhat.com>

* Change location of the start controller call.

Signed-off-by: Alexander Wels <awels@redhat.com>
awels added a commit to awels/containerized-data-importer that referenced this pull request Jun 1, 2022
…ubevirt#2265)

* periodic sync CSI snapshot CRD check

It was possible for the CSI snapshot CRD check to fail silently and
prevent the smart clone controller from starting during the cdi deployment
pod start up. This would prevent smart clone from working properly.

This adds a periodic sync of 1 minute for checking the CRDs. We also
log failures other than "not found" so we can more easily detect this
situation as humans.

Signed-off-by: Alexander Wels <awels@redhat.com>

* Change location of the start controller call.

Signed-off-by: Alexander Wels <awels@redhat.com>
kubevirt-bot pushed a commit that referenced this pull request Jun 2, 2022
…er when needed #2265 (#2307)

* Start smart clone controller from datavolume controller when needed (#2265)

* periodic sync CSI snapshot CRD check

It was possible for the CSI snapshot CRD check to fail silently and
prevent the smart clone controller from starting during the cdi deployment
pod start up. This would prevent smart clone from working properly.

This adds a periodic sync of 1 minute for checking the CRDs. We also
log failures other than "not found" so we can more easily detect this
situation as humans.

Signed-off-by: Alexander Wels <awels@redhat.com>

* Change location of the start controller call.

Signed-off-by: Alexander Wels <awels@redhat.com>

* Modifications needed for backport. In particular added check for both betav1
and v1 crds.

Signed-off-by: Alexander Wels <awels@redhat.com>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L
4 participants