-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Fix reconciliation blocked on improper permissions for establishing watches on managed content #1119
🐛 Fix reconciliation blocked on improper permissions for establishing watches on managed content #1119
Conversation
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
@@ -138,9 +147,8 @@ func (h *Helm) getReleaseState(cl helmclient.ActionInterface, ext *ocv1alpha1.Cl | |||
desiredRelease, err := cl.Install(ext.GetName(), ext.Spec.InstallNamespace, chrt, values, func(i *action.Install) error { | |||
i.DryRun = true | |||
i.DryRunOption = "server" | |||
i.Labels = labels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question for reviewers: Should the helm release secrets also have these labels? I assumed they weren't necessary with the release name being deterministic based on the ClusterExtension name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not necessary; would they otherwise be getting merged with labels by the postrenderer
? If so I can see how they might get into a funny looking state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. I think these labels are explicitly on the release secrets created by the Helm client.
My understanding is that the postrenderer
would add the labels to the manifests only within the Helm Release
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would take ^ with a grain of salt though. I'm assuming the release secret is not included in the set of release manifests, but I don't actually know for sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @gallettilance is planning to add release secret labels to populate CSV annotations there.
Seems like we should be able to trace what labels we currently set on the release secret and make sure we understand their purpose to decide whether we can stop populating them . I think we put these labels there, and they are used in a variety of ways (setup performant informers, get the installed bundles)
d3c8159
to
c20b41b
Compare
c20b41b
to
5cf0503
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1119 +/- ##
==========================================
- Coverage 77.12% 76.59% -0.53%
==========================================
Files 36 40 +4
Lines 2020 2329 +309
==========================================
+ Hits 1558 1784 +226
- Misses 327 389 +62
- Partials 135 156 +21
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
internal/contentmanager/cache.go
Outdated
// managed content sources. It also exposes methods | ||
// for retrieving the stored references of the managed | ||
// content | ||
type Cache interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing another custom cache implementation sort of sounded alarm-bells for me coming from v0's and all the cache issues there but this seems kind of unfair to compare to that as long as it stays as simple as it is written here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that there is a negative history with needing a custom caching layer. I 1000% agree with you that we need to ensure this caching layer is as simple as possible.
Unfortunately I can't see any other way of achieving what we are looking for. If naming is an issue I'm open to suggestions for naming other than "Cache" :)
@@ -1,207 +0,0 @@ | |||
package contentmanager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we create an issue to make new tests for the new Manager
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely! I had left a comment on this file about tests but looks like it didn't stick after some git-fu: #1119 (comment)
If folks are comfortable with postponing addition of tests to a follow-up to this PR I'm happy to create a separate issue, but wanted to verify that preference wasn't for me to implement them as part of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are now unit tests for the underlying components, but not the Manager
itself. i felt the logic was fairly straightforward, but if we still would like to have unit tests for it I can see if I can wiggle some in.
require.Equal(t, metav1.ConditionTrue, installedCond.Status) | ||
require.Equal(t, ocv1alpha1.ReasonSuccess, installedCond.Reason) | ||
|
||
t.Log("By checking the expected managed conditions") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to remediate this state? I assume it's not a critical state because the CE is still installed, but we just can't respond to any changes to the managed resources. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct.
I think the most common issue that will happen here will be permission related. With the new setup introduced by this PR, any informer errors result in a requeue of the ClusterExtension
with exponential backoff.
Remediation for the permission error is manual user intervention to assign the missing permissions to the ServiceAccount
referenced in ClusterExtension.Spec.ServiceAccount.Name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything here looks great to me! I am however pretty unfamiliar with the ins-and-outs of the low-level informer stuff, so I'd like to have somebody more familiar with that put in the final approval.
8891a5e
to
b9a2637
Compare
cacheGvks := sets.New[schema.GroupVersionKind]() | ||
for gvk := range c.sources { | ||
cacheGvks.Insert(gvk) | ||
} | ||
return cacheGvks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an FYI: 1.23 introduces maps.Values
in the stdlib as an iterator, which I suspect would slightly simplify this function. Not sure where we are on bumping to 1.23, but thought I'd mention this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
func (dis *dynamicInformerSource) hasSynced() bool { | ||
select { | ||
case <-dis.syncedChan: | ||
return true | ||
default: | ||
return false | ||
} | ||
} | ||
|
||
func (dis *dynamicInformerSource) hasStarted() bool { | ||
select { | ||
case <-dis.startedChan: | ||
return true | ||
default: | ||
return false | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: looks like an opportunity for a little reuse (and a slight improvement on checking if the channel is closed (vs having received a value)
func (dis *dynamicInformerSource) hasSynced() bool { | |
select { | |
case <-dis.syncedChan: | |
return true | |
default: | |
return false | |
} | |
} | |
func (dis *dynamicInformerSource) hasStarted() bool { | |
select { | |
case <-dis.startedChan: | |
return true | |
default: | |
return false | |
} | |
} | |
func isChannelClosed(ch <-chan struct{}) bool { | |
select { | |
case _, isOpen <-ch: | |
// we could maybe panic here if we get `isOpen=true`, because that would be an expected receive of a value | |
return !isOpen | |
default: | |
return false | |
} | |
} | |
func (dis *dynamicInformerSource) hasSynced() bool { | |
return isChannelClosed(dis.syncedChan) | |
} | |
func (dis *dynamicInformerSource) hasStarted() bool { | |
return isChannelClosed(dis.startedChan) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I think having it as a single function could be nice I'm not sure it buys us all that much. Having distinct functions for hasSynced()
and hasStarted()
allows us to modify the behavior as we see fit and gives us flexibility to update the synced vs started logic as needed.
Going to leave as is for now.
f6f2770
to
14a9be3
Compare
to fix bugs associated with insufficient permissions resulting in halting reconciliation of all ClusterExtension and informer sync errors not being reported via the ClusterExtension status conditions. Signed-off-by: everettraven <everettraven@gmail.com>
14a9be3
to
71bccb9
Compare
Description
This PR completely refactors the content management logic to get lower-level and utilize client-go to create informers and give us more flexibility to:
Additionally, this PR:
Fixes a bug where the resources managed by the ClusterExtension were missing the expected labels
Optimizes the caching layer by:
Adds a finalizer so that when a ClusterExtension is deleted, the managed content cache associated with it is stopped and deleted
Adds a new
Managed
condition that is set toFalse
when there are issues setting up informers to react to changes to managed content andTrue
when all informers are successfully configured and reacting to events for all managed resourcesFixes (Provided ServiceAccount) Follow-up: update dynamic cache to re-use cache.Cache instead of creating a new one every time #1025
Fixes Bug: When insufficient permissions exist to watch managed resources, reconciliation halts #1109
Fixes Sample ArgoCD: List requests fail due resourceName in the RBAC rule #1195
Fixes ClusterExtension status/condition not getting updated #1192
Reviewer Checklist