🐛 Fix reconciliation blocked on improper permissions for establishing watches on managed content #1119


everettraven
Contributor

@everettraven everettraven commented Aug 13, 2024

Description

This PR completely refactors the content management logic to work at a lower level, using client-go directly to create informers. This gives us more flexibility to:

  • Stop the informer and requeue the ClusterExtension if an informer gets an error during a watch after a previously successful sync
  • Stop the informer on a failed sync to prevent reconciliation from being blocked
  • Capture the sync error an informer encountered and put it on the ClusterExtension status instead of only logging it

Additionally, this PR:

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 13, 2024

netlify bot commented Aug 13, 2024

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 71bccb9
🔍 Latest deploy log https://app.netlify.com/sites/olmv1/deploys/66d9b9fbb284890008225d48
😎 Deploy Preview https://deploy-preview-1119--olmv1.netlify.app

@@ -138,9 +147,8 @@ func (h *Helm) getReleaseState(cl helmclient.ActionInterface, ext *ocv1alpha1.Cl
	desiredRelease, err := cl.Install(ext.GetName(), ext.Spec.InstallNamespace, chrt, values, func(i *action.Install) error {
		i.DryRun = true
		i.DryRunOption = "server"
		i.Labels = labels
Contributor Author

Question for reviewers: Should the Helm release secrets also have these labels? I assumed they weren't necessary, since the release name is deterministic based on the ClusterExtension name.

Contributor

Maybe not necessary; would they otherwise be getting merged with labels by the postrenderer? If so I can see how they might get into a funny looking state.

Contributor Author

I don't think so. I think these labels are explicitly on the release secrets created by the Helm client.

My understanding is that the postrenderer would add the labels to the manifests only within the Helm Release

Contributor Author

I would take ^ with a grain of salt though. I'm assuming the release secret is not included in the set of release manifests, but I don't actually know for sure.

Member

I think @gallettilance is planning to add release secret labels to populate CSV annotations there.

Seems like we should be able to trace what labels we currently set on the release secret and make sure we understand their purpose to decide whether we can stop populating them. I think we put these labels there, and they are used in a variety of ways (setting up performant informers, getting the installed bundles).

@everettraven everettraven force-pushed the bugfix/contentmanager-permission-errors branch 2 times, most recently from d3c8159 to c20b41b Compare August 15, 2024 20:53
@everettraven everettraven changed the title [WIP] Bugfix/contentmanager permission errors 🐛 Fix reconciliation blocked on improper permissions for establishing watches on managed content Aug 15, 2024
@everettraven everettraven marked this pull request as ready for review August 15, 2024 21:07
@everettraven everettraven requested a review from a team as a code owner August 15, 2024 21:07
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 15, 2024
@everettraven everettraven force-pushed the bugfix/contentmanager-permission-errors branch from c20b41b to 5cf0503 Compare August 15, 2024 21:08
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 15, 2024

codecov bot commented Aug 15, 2024

Codecov Report

Attention: Patch coverage is 72.04819% with 116 lines in your changes missing coverage. Please review.

Project coverage is 76.59%. Comparing base (c53453b) to head (71bccb9).
Report is 1 commits behind head on main.

Files with missing lines                                 Patch %   Lines
...nal/contentmanager/source/internal/eventhandler.go    53.62%    25 Missing and 7 partials ⚠️
internal/contentmanager/cache/cache.go                   79.20%    19 Missing and 7 partials ⚠️
internal/contentmanager/source/dynamicsource.go          67.56%    21 Missing and 3 partials ⚠️
internal/contentmanager/contentmanager.go                66.66%    12 Missing and 3 partials ⚠️
internal/applier/helm.go                                 68.96%    6 Missing and 3 partials ⚠️
internal/contentmanager/sourcerer.go                     80.00%    4 Missing and 3 partials ⚠️
cmd/manager/main.go                                      70.00%    2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1119      +/-   ##
==========================================
- Coverage   77.12%   76.59%   -0.53%     
==========================================
  Files          36       40       +4     
  Lines        2020     2329     +309     
==========================================
+ Hits         1558     1784     +226     
- Misses        327      389      +62     
- Partials      135      156      +21     
Flag   Coverage Δ
e2e    57.70% <57.83%> (+0.08%) ⬆️
unit   52.42% <41.68%> (-2.98%) ⬇️


// managed content sources. It also exposes methods
// for retrieving the stored references of the managed
// content
type Cache interface {
Contributor

Seeing another custom cache implementation set off alarm bells for me, given v0 and all the cache issues there, but that comparison seems unfair as long as this stays as simple as it is written here.

Contributor Author

I agree that there is a negative history with needing a custom caching layer. I 1000% agree with you that we need to ensure this caching layer is as simple as possible.

Unfortunately I can't see any other way of achieving what we are looking for. If naming is an issue I'm open to suggestions for naming other than "Cache" :)

@@ -1,207 +0,0 @@
package contentmanager
Contributor

Could we create an issue to make new tests for the new Manager?

Contributor Author

Absolutely! I had left a comment on this file about tests but looks like it didn't stick after some git-fu: #1119 (comment)

If folks are comfortable with postponing addition of tests to a follow-up to this PR I'm happy to create a separate issue, but wanted to verify that preference wasn't for me to implement them as part of this PR.

Contributor Author

There are now unit tests for the underlying components, but not the Manager itself. I felt the logic was fairly straightforward, but if we would still like to have unit tests for it I can see if I can wiggle some in.

require.Equal(t, metav1.ConditionTrue, installedCond.Status)
require.Equal(t, ocv1alpha1.ReasonSuccess, installedCond.Reason)

t.Log("By checking the expected managed conditions")
Contributor

Is there a way to remediate this state? I assume it's not a critical state because the CE is still installed, but we just can't respond to any changes to the managed resources. Is that correct?

Contributor Author

That's correct.

I think the most common issue that will happen here will be permission related. With the new setup introduced by this PR, any informer errors result in a requeue of the ClusterExtension with exponential backoff.

Remediation for the permission error is manual user intervention to assign the missing permissions to the ServiceAccount referenced in ClusterExtension.Spec.ServiceAccount.Name

Contributor

@dtfranz dtfranz left a comment

Everything here looks great to me! I am however pretty unfamiliar with the ins-and-outs of the low-level informer stuff, so I'd like to have somebody more familiar with that put in the final approval.

@everettraven everettraven force-pushed the bugfix/contentmanager-permission-errors branch from 8891a5e to b9a2637 Compare August 22, 2024 14:29
@openshift-merge-robot openshift-merge-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 31, 2024
Comment on lines +198 to +226
	cacheGvks := sets.New[schema.GroupVersionKind]()
	for gvk := range c.sources {
		cacheGvks.Insert(gvk)
	}
	return cacheGvks
Member

Just an FYI: Go 1.23 introduces maps.Values in the stdlib as an iterator, which I suspect would slightly simplify this function. Not sure where we are on bumping to 1.23, but thought I'd mention this.

Contributor Author

nice!

Comment on lines +179 to +195
func (dis *dynamicInformerSource) hasSynced() bool {
	select {
	case <-dis.syncedChan:
		return true
	default:
		return false
	}
}

func (dis *dynamicInformerSource) hasStarted() bool {
	select {
	case <-dis.startedChan:
		return true
	default:
		return false
	}
}
Member

Nit: looks like an opportunity for a little reuse (and a slight improvement: checking whether the channel is closed vs. having received a value).

Suggested change
- func (dis *dynamicInformerSource) hasSynced() bool {
- 	select {
- 	case <-dis.syncedChan:
- 		return true
- 	default:
- 		return false
- 	}
- }
- func (dis *dynamicInformerSource) hasStarted() bool {
- 	select {
- 	case <-dis.startedChan:
- 		return true
- 	default:
- 		return false
- 	}
- }
+ func isChannelClosed(ch <-chan struct{}) bool {
+ 	select {
+ 	case _, isOpen := <-ch:
+ 		// we could maybe panic here if we get `isOpen=true`, because that would be an unexpected receive of a value
+ 		return !isOpen
+ 	default:
+ 		return false
+ 	}
+ }
+ func (dis *dynamicInformerSource) hasSynced() bool {
+ 	return isChannelClosed(dis.syncedChan)
+ }
+ func (dis *dynamicInformerSource) hasStarted() bool {
+ 	return isChannelClosed(dis.startedChan)
+ }

Contributor Author

While I think having it as a single function could be nice I'm not sure it buys us all that much. Having distinct functions for hasSynced() and hasStarted() allows us to modify the behavior as we see fit and gives us flexibility to update the synced vs started logic as needed.

Going to leave as is for now.
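For readers less familiar with the distinction raised in this thread, the following standalone sketch contrasts the two non-blocking checks. The helper names `received` and `closed` are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// received reports whether a receive succeeds right now. It is true both
// when a value was waiting and when the channel has been closed.
func received(ch <-chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

// closed uses the two-value receive form, so it reports true only when the
// channel is closed (and has no buffered values left to deliver).
func closed(ch <-chan struct{}) bool {
	select {
	case _, open := <-ch:
		return !open
	default:
		return false
	}
}

func main() {
	// A buffered value on an open channel satisfies the one-value receive.
	sent := make(chan struct{}, 1)
	sent <- struct{}{}
	fmt.Println("received on open channel with a value:", received(sent))

	// The two-value form consumes the value but sees the channel is still open.
	sent <- struct{}{}
	fmt.Println("closed on open channel with a value:", closed(sent))

	// Both checks agree once the channel is actually closed.
	done := make(chan struct{})
	close(done)
	fmt.Println("after close:", received(done), closed(done))
}
```

If the channels are only ever closed and never sent on, the two forms behave identically, which is consistent with keeping the simpler one-value receive. Note that both helpers consume a value if one is waiting.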

@everettraven everettraven force-pushed the bugfix/contentmanager-permission-errors branch from f6f2770 to 14a9be3 Compare September 4, 2024 12:28
to fix bugs associated with insufficient permissions resulting
in halting reconciliation of all ClusterExtensions and informer
sync errors not being reported via the ClusterExtension
status conditions.

Signed-off-by: everettraven <everettraven@gmail.com>
@everettraven everettraven force-pushed the bugfix/contentmanager-permission-errors branch from 14a9be3 to 71bccb9 Compare September 5, 2024 14:02
@joelanford joelanford added this pull request to the merge queue Sep 5, 2024
Merged via the queue into operator-framework:main with commit 75bb03d Sep 5, 2024
16 checks passed
@skattoju skattoju mentioned this pull request Sep 25, 2024
4 tasks
6 participants