
Add catalogsource entitysource #129

Merged
merged 17 commits on Mar 28, 2023

Conversation

ankitathomas
Contributor

Add an entitysource that can query v0 catalogsources backed by gRPC.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 20, 2023
@ankitathomas ankitathomas changed the title add catalogsource entitysource [WIP] add catalogsource entitysource Feb 20, 2023
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 20, 2023
Comment on lines 17 to 27
type UpgradeEdge struct {
    property.Channel
    Replaces  string   `json:"replaces,omitempty"`
    Skips     []string `json:"skips,omitempty"`
    SkipRange string   `json:"skipRange,omitempty"`
    Version   string   `json:"version,omitempty"`
}

type DefaultChannel struct {
    DefaultChannel string `json:"defaultchannel"`
}
Member

@joelanford joelanford Feb 21, 2023

Having thought more about this, IMO, we should not generate new properties in the bundle entities for package- and channel-level info. I started a Slack thread on this topic here: https://kubernetes.slack.com/archives/C0181L6JYQ2/p1675882355517039

My thoughts are:

  1. Use a separate variable source for upgrade edge constraints (see the sketch after this list).
  2. Do we need/want to keep the concept of a default channel in OLMv1? Maybe it isn't important to answer that right now, but by the same token, do we need to implement default channel logic right now?
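
To make (1) concrete, here is a minimal sketch of what a separate upgrade-edge variable source could look like, assuming a deppy-style GetVariables interface; the type names, the ChannelEntry shape, and the GetVariables signature are illustrative assumptions, not the actual deppy or operator-controller API:

package upgradeedges

import "context"

// Hypothetical, deppy-style shapes for illustration only.
type Variable interface{ Identifier() string }

// ChannelEntry mirrors an FBC channel entry (package/channel graph data).
type ChannelEntry struct {
    Package, Channel, Name, Replaces, SkipRange string
    Skips                                       []string
}

// UpgradeEdgeVariable models one upgrade edge as its own variable instead of a
// property injected into the bundle entity.
type UpgradeEdgeVariable struct {
    Package, Channel, Name, Replaces, SkipRange string
    Skips                                       []string
}

func (v UpgradeEdgeVariable) Identifier() string {
    return v.Package + "/" + v.Channel + "/" + v.Name
}

// UpgradeEdgeVariableSource turns channel entries into upgrade-edge variables,
// keeping channel/edge data out of the bundle entities entirely.
type UpgradeEdgeVariableSource struct {
    Entries []ChannelEntry
}

func (s UpgradeEdgeVariableSource) GetVariables(_ context.Context) ([]Variable, error) {
    vars := make([]Variable, 0, len(s.Entries))
    for _, e := range s.Entries {
        vars = append(vars, UpgradeEdgeVariable(e))
    }
    return vars, nil
}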

Member

IIRC in OLMv0, the purpose of the default channel is for dependency resolution. OLMv0 creates subscriptions for a dependency always using the dependency's default channel.

In OLMv1, the current plan (IIUC) is:

  • We're not going to have the operator controller create new Operator objects for dependencies
  • Even if we did, IMO, the channel should be an optional constraint on the Operator. In the case of dependencies, I assume that we'll likely tell users to avoid setting a channel for dependent operators to avoid issues when upgrades to parent Operators require dependent versions in other channels.

Contributor Author

Channels likely don't matter until we're dealing with upgrades/specific versions (the resolver picks the latest semver on an ordered list of channels to install). Since we install by packageName with the operator-controller, we'd need to rely on the catalogSource/FBC/user constraints to declare this order. FBC does support channel priorities for channel ordering, so we don't necessarily need to use defaultChannel except for backward compatibility.

We can also follow a channel-free route if we use some other method for weighting versions/upgrade edges in the resolver, maybe using a different set of version constraints.

I can see the Upgrade Edge becoming its own variable rather than a property on the EntitySource. Whether it needs to be from a separate variableSource or should be part of the variables generated by the CatalogSource EntitySource depends on how we'd define the edge constraints.

Contributor

IIRC in OLMv0, the purpose of the default channel is for dependency resolution

I'd argue that the default channel concept in v0 is used beyond just dependency resolution. E.g. you can create a Subscription to a package without specifying a channel, and that installs the latest from the default channel. That to me is the concept being used for installation (and yes, that itself is dependency resolution, but probably not in the sense you were talking about @joelanford, i.e. resolving the dependencies of this operator I installed).

That to me is a natural concept that should stay regardless. It gives package authors a way of publishing different versions in different channels while keeping a sane default in place. I'm personally definitely used to this (I'm sure others are) when installing binaries with rpm/brew etc.

IMO, we should not generate new properties in the bundle entities for package and channel level info

Huge +1. It's probably better to have Bundle Entities contain only bundle-level information, and Package Entities contain package-level information from the get-go, so that the clients we're building out now (deppy) are built with that knowledge from scratch, since that's the direction we want to go (i.e. separate out bundle- and package-level metadata, and only include bundle-level info inside bundles). This means clients (like deppy) will have to think in terms of getting Bundle Entities and Package Entities to resolve dependencies, and not just Bundle Entities. We're still thinking this way because that's the way information is laid out currently in v0 bundles, but I'm guessing @joelanford is also trying to say let's break out this information now so that our underlying components expect scoped-out information, and then we can work our way up to teach package authors to rethink how they package their operators.

Contributor Author

let's break out this information now so that our underlying components are expecting scoped out information

@joelanford @anik120 @perdasilva
It makes sense to have the edge properties contained in their own variable rather than mixed with the bundle variable. Are you suggesting that these be omitted from the bundles for now, and added later when we extend the resolver to support upgrades?

Member

@joelanford joelanford left a comment

This is a pretty large PR that combines lots of feature implementations. WDYT about splitting this up so that we start out with a registry-based entity source that doesn't implement any caching at all? i.e. ListEntities would talk directly to the gRPC server every time.

Then we could follow up with incremental, smaller, easier-to-review PR improvements.
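
For illustration, a rough sketch of that cache-free starting point, assuming the operator-registry gRPC client's NewClient/ListBundles API and the OLMv0 CatalogSource status fields as I recall them; the function shape is an assumption, not the PR's actual code:

package catalogsource

import (
    "context"
    "fmt"

    "github.com/operator-framework/api/pkg/operators/v1alpha1"
    "github.com/operator-framework/operator-registry/pkg/api"
    "github.com/operator-framework/operator-registry/pkg/client"
)

// listBundles dials the catalog source's gRPC address on every call instead of
// maintaining a cache; converting bundles into entities is left out here.
func listBundles(ctx context.Context, catsrc *v1alpha1.CatalogSource) ([]*api.Bundle, error) {
    if catsrc.Status.GRPCConnectionState == nil {
        return nil, fmt.Errorf("catalog source %q has no gRPC address in its status yet", catsrc.GetName())
    }
    c, err := client.NewClient(catsrc.Status.GRPCConnectionState.Address)
    if err != nil {
        return nil, fmt.Errorf("connecting to catalog source %q: %w", catsrc.GetName(), err)
    }
    defer c.Close()

    it, err := c.ListBundles(ctx)
    if err != nil {
        return nil, err
    }
    var bundles []*api.Bundle
    for b := it.Next(); b != nil; b = it.Next() {
        bundles = append(bundles, b)
    }
    return bundles, it.Error()
}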

test/testdata/prometheus-index-v0.37.Dockerfile (outdated; resolved)
Spec: corev1.PodSpec{
    Containers: []corev1.Container{{
        Name:  "registry",
        Image: "quay.io/ankitathomas/index:prometheus-index-v0.37.0",
Member

I'm sure the plan is to change this later, but we should do a similar thing here to what we do in rukpak. That is, have these test images built from sources in this repo and loaded into the kind cluster during e2es. That way:

  1. No need to rely on a remote image registry (which can be a source of flakes).
  2. We get to really easily tweak the image as part of the normal dev/test process.

@@ -0,0 +1,323 @@
package e2e
Member

An e2e without the real OLMv0 catalog operator might not be super useful. WDYT?

I wonder if we could simplify and still get most of the coverage by moving the tests to a set of catalogsource reconciler unit tests, where we could run a localhost grpc server directly in the unit test process (no container image, pod, kubelet, or catalog operator necessary), create a catalog source pointing to our localhost grpc server, run Reconcile, and then assert on the expected ListEntities output. That would cover "this thing returns the entities we expect based on the state of a catalog source and grpc server".

var catalogSource = &v1alpha1.CatalogSource{}
if err := r.Client.Get(ctx, req.NamespacedName, catalogSource); err != nil {
if errors.IsNotFound(err) {
r.dropSource(req.String())
Member

@varshaprasad96 varshaprasad96 Mar 2, 2023

We are deleting the catsrc immediately if we are not able to fetch it from the cluster. I'm wondering, given this use case, whether it would make sense to use the uncached client for this purpose so that we don't run into additional reconciles for stale controller-runtime caches.

@joelanford would that make more sense, instead of using a cached client and redoing the work of converting to entities if there is a stale cache? Would we hit any performance issues without caching for this specific case of watching catsrc (in the sense: do catalog sources in a cluster get modified often enough that caching is important)?
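
A sketch of the uncached-read variant being described, using controller-runtime's API reader (mgr.GetAPIReader()); the struct and field names are illustrative, not the PR's actual reconciler:

package controllers

import (
    "context"

    "github.com/operator-framework/api/pkg/operators/v1alpha1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// Reads of the CatalogSource go through an API reader rather than the
// cache-backed client, so a NotFound reflects the API server, not a stale
// controller-runtime cache.
type catalogSourceReconcilerSketch struct {
    client.Client               // cached client, still fine for writes/events
    Reader        client.Reader // uncached reader, wired from mgr.GetAPIReader()
}

func (r *catalogSourceReconcilerSketch) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var catalogSource v1alpha1.CatalogSource
    if err := r.Reader.Get(ctx, req.NamespacedName, &catalogSource); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    // ... drop or refresh the entity cache for this catalog source as before ...
    return ctrl.Result{}, nil
}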

Member

@joelanford joelanford Mar 3, 2023

Since the controller is currently watching only catalog sources, I don't think the cache will ever be stale. I think the cached client is fine.

You bring up a good point about redoing work of listing entities though. It's a bummer that there's no GRPC endpoint that can tell us if we need to relist. This is something we should probably handle in the new OLMv1 catalog source spec (cc @anik120).

Member

@varshaprasad96 varshaprasad96 Mar 7, 2023

Another naive question: as an OLM user, would I in the future be allowed to specifically select which catsrc I want to restrict resolution to? Probably through annotations/labels?

For now we are watching all the catalog sources in the namespace. Is that necessary if the user wants to use only a specific set of catsrcs for resolution? And how do we provide the user that option? If it's through the Operator object, then we would need to dynamically change the watches after the catsrc controller has started.

Or do we make this controller watch all the catsrc objects, and do that filtering in the operator controller when it takes in the entities for resolution?

Contributor

@varshaprasad96 the way we do this today is by allowing admins to set the priority of the CatalogSources. I don't think we'll ever go down the route of "when you as a tenant want to install an operator, you can specify which CatalogSources you want to install the dependencies from". Instead, what we have today is that an admin can set the priorities for CatalogSources on cluster, and the resolver tries to find installable dependencies in the following order:

Priority 0 -> CatalogSource that the top-level installable is being installed from (i.e. its own catalogSource)
Priority 1 -> CatalogSource with priority set to 1
Priority 2 -> CatalogSource with priority set to 2
...

That being said, it does mean that the admin also has the privilege to change the priorities of the CatalogSources at any point in time too.
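
For illustration only, a tiny sketch that mirrors the ranking described above (own catalog first, then the rest by the admin-assigned priority value); the catalogRef type and orderCatalogs helper are made up for this example and are not OLMv0's actual resolver code:

package resolution

import "sort"

type catalogRef struct {
    Name     string
    Priority int // as set by the admin on the CatalogSource
}

// orderCatalogs returns candidate catalogs in the order described above: the
// catalog the top-level installable came from first, then the rest by priority.
func orderCatalogs(own string, catalogs []catalogRef) []catalogRef {
    ordered := make([]catalogRef, len(catalogs))
    copy(ordered, catalogs)
    sort.SliceStable(ordered, func(i, j int) bool {
        if (ordered[i].Name == own) != (ordered[j].Name == own) {
            return ordered[i].Name == own // the "own" catalog always sorts first
        }
        return ordered[i].Priority < ordered[j].Priority
    })
    return ordered
}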

Member

@varshaprasad96 varshaprasad96 Mar 9, 2023

@anik120 Follow-up question on that: if not the user, does the admin have the privilege of not selecting a particular catalogsrc to install dependencies from, for a given operator?

The reason for asking was from the perspective of whether we necessarily need to watch all catalogsources in the cluster.

But now my other question is - is there a particular design reason why we would not allow someone (whether it's the admin or any other persona) to exclude a catsrc from resolution (say, blacklist a catsrc from being used for resolution)?

Also, since you mentioned that we would not want to go down that path (this wouldn't affect the watches part of it, but just asking to understand if there is a particular reason):

"when you as a tenant want to install an operator, you can specify which CatalogSources you want to install the dependencies from".

Why not? Say I don't trust the publisher of this catsrc; I don't want to install the contents it provides.

Contributor Author

@varshaprasad96 @anik120 I'm not sure the admin can disable a catalogsource so that it is ignored when installing dependencies, apart from deleting the catalogsource. Since the admin is the one that creates and manages what catalogsources are available on cluster, it makes sense to me that they can also decide their priority when resolving something to be installed on cluster.

CatalogSource priorities are currently declared per catalogSource, so this priority doesn't change each time we perform resolution. If we had tenants declaring their own priorities, it could potentially get very messy, especially since two tenants could declare different priorities for operators being resolved at the same time.

controllers/catalogsource_controller.go (resolved)
@@ -63,7 +63,7 @@ type OperatorReconciler struct {
 // For more details, check Reconcile and its Result here:
 // - https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.13.1/pkg/reconcile
 func (r *OperatorReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
-    l := log.FromContext(ctx).WithName("reconcile")
+    l := log.FromContext(ctx).WithName("operator-controller")
Member

Not for this PR (we can work on this in follow-ups), but just confirming that I'm getting it right:

Since the operator controller hasn't been configured to receive events from the catsrc controller, for now we have 2 isolated controllers, i.e.:

  • The operator reconciler triggers on changes to the Operator object or BD, fetching entities from the entity cache.
  • The catsrc controller populates the entity cache based on events it receives by watching catsrcs on cluster.

So we have a possibility of the operator controller reading a stale entity source cache, and giving us wrong results when a particular catsrc is modified?

Contributor

Yeah, at the moment it is possible that resolution can happen off a stale cache state. Most often this should only affect upgrades, which we don't handle yet. It is possible that a resolution can fail because the cache has not been populated yet. However, this should result in an error and a re-enqueue, IIRC.

Contributor

We'll definitely need to address this. But I think the resolver design will need to change if we want to do multi-stage resolution. It's likely that the resolver will need to be a separate process.

Member

Most often this should only affect upgrades, which we don't handle yet. It is possible that a resolution can fail because the cache has not been populated yet. However, this should result in an error and a re-enqueue, iirc

Apart from upgrades, this scenario could also occur right?

  • Catsrc controller updates the entity cache with say the entities for "prometheus" package.
  • Operator Controller reconciles with the current state of entity cache successfully, installing the prometheus package on cluster.
  • Catsrc is removed, catsrc controller reconciles and updates the entity cache.
  • Nothing happens with the operator controller since there is no re-trigger - unless the Operator object is modified, an event on the BD triggers the reconciler, or the operator controller re-triggers after the default time interval.

The prometheus operator would still be available on cluster even after the respective package is removed from catsrc.

Which brings in my question on uninstall (sorry if it's a naive question; I'm not sure how OLM handles it currently - by subscriptions, I guess?) - what should trigger uninstall: the Operator object being deleted, or the change in packages in the catsrc? Should the prometheus operator persist on cluster even after the respective package is removed from the catsrc, even though the user has not explicitly requested an uninstall by deleting the Operator object?

Contributor

Uninstall has been an Achilles heel (besides other heels, obviously) of OLM v0. TL;DR: the uninstall process is cumbersome and confusing. There have been many conversations that go:
"I deleted the subscription, why did the operator not get uninstalled?"
"Oh, but the Subscription you created is an intent to subscribe to a particular channel for a package. If you deleted the subscription, we won't touch the workloads because that's dangerous."
"Okay, that's not what I expected. When I created the Subscription the operator was installed, so when I delete the Subscription I expect the operator to be uninstalled."

That said, you bring up a good point. The install/uninstall flow for v1 warrants a larger discussion.

On the face of it, it seems to me that we should only uninstall the operator if the Operator CR is deleted. If the CatalogSource is updated and the package is deleted from the CatalogSource, that should only trigger a notification in the Operator CR along the lines of "The package that was installed from CatalogSource foo was removed from foo; please get in touch with the CatalogSource owner to discuss next steps if you want to receive updates for the operator you installed".

We've also never had to think about scenarios where a package was removed from a CatalogSource, because so far that hasn't happened. Maybe that helps us decide "okay, we don't have to over-optimize for that scenario just yet and we can go with some sane default for now, as long as we keep scope for future enhancement, should that scenario occur".

Contributor

Should the prometheus operator persist on cluster even after the respective package is removed from the catsrc, even though the user has not explicitly requested an uninstall by deleting the Operator object?

I would say that once a package is installed, even if it no longer exists in a catsrc/cache, it should remain installed. IMHO the catalogsource is just a source of information. E.g. if we installed the prometheus operator directly from prometheus sources and there's an outage, you wouldn't want prometheus to be uninstalled. Having said that, I think it would be good to include a condition on the operator that says whether the underlying source is still available.

Contributor

The way I think about uninstall, at least at the moment, is that if you delete the operator CR, everything goes! But it is a hairy landscape as Anik pointed out.

Comment on lines 92 to 94
    r.recorder.Event(catalogSource, eventTypeWarning, eventReasonCacheUpdateFailed, fmt.Sprintf("Failed to update bundle cache from %s/%s: %v", catalogSource.GetNamespace(), catalogSource.GetName(), err))
    return ctrl.Result{Requeue: true}, err
Member

For later: I wonder if there's a way to propagate this to the Operator object.

Something along the lines of:

  1. Mark this cache as stale in a way that the operator controller can see.
  2. When reconciling an Operator, if any of the caches are stale, set a separate condition (e.g. UsingStaleCatalog or CatalogUnavailable or something); a rough sketch of that follows below.

We could even later build on that and have a knob in the Operator spec that gives the cluster admin some control over how the resolver should handle that situation. In OLMv0 (IIRC), we fail resolution if any of the catalog sources are unavailable, and I'm not sure there's a way to say "ok yeah, just resolve without that catalog".
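
A minimal sketch of what step 2 might look like on the Operator status, using apimachinery's condition helpers; the condition type and reason strings are made up for illustration:

package controllers

import (
    "strings"

    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setCatalogStaleCondition records on the Operator's status whether resolution
// ran against stale or unavailable catalog data. The condition type and reason
// strings here are illustrative only.
func setCatalogStaleCondition(conditions *[]metav1.Condition, staleCatalogs []string, generation int64) {
    cond := metav1.Condition{
        Type:               "CatalogUnavailable",
        Status:             metav1.ConditionFalse,
        Reason:             "AllCatalogsHealthy",
        Message:            "resolution used up-to-date catalog data",
        ObservedGeneration: generation,
    }
    if len(staleCatalogs) > 0 {
        cond.Status = metav1.ConditionTrue
        cond.Reason = "UsingStaleCatalog"
        cond.Message = "resolution used stale or missing data from: " + strings.Join(staleCatalogs, ", ")
    }
    meta.SetStatusCondition(conditions, cond)
}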

Contributor Author

I'm not sure we want to fail all resolution outright for a single stale/unreachable catalogSource. Currently the catalog reconciler should continue resolving with whatever the last healthy state of the failed catalog was, despite emitting events about the failure.

It may be worth considering some way to mark different levels of staleness on the caches, like the number of updates that have failed due to an unreachable catalogSource, and to use that as a guide to determine how to deal with the cached version.

Member

I'm not sure we want to fail all resolution outright for a single stale/unreachable catalogSource. Currently the catalog reconciler should continue resolving with whatever the last healthy state of the failed catalog was, despite emitting events about the failure.

👍 Yeah, agreed. In my mind we can evolve this over time. Something along the lines of:

  • Iteration 1: we resolve with whatever we have, even if expected stuff is missing or stale.
  • Iteration 2: we add detection for unreachable catsrcs and tell the user that resolution is using incomplete data. (My initial instinct would be to never use a stale cache. If a catsrc worked at one point, we populated the cache, and then it becomes unreachable, I'm thinking we just empty out the cache so that we don't use old data.)
  • Iteration 3: we give the user a knob that lets them decide whether resolution should fail on unreachable catsrcs.

Contributor

In the statement "continue resolving with stale cache", what does "continue resolving" mean? Does it install dependencies that admins will have to clean up if that wasn't the intended dependency resolution from the user's perspective? (Think: it installed etcd operator from the community-operator index but I absolutely cannot have any workload from the community-operator index running, so now I have to re-trigger the resolution once the certified-operators catalog is available again, and for that I need to uninstall a bunch of dependencies.)

Contributor Author

it installed etcd operator from the community-operator index but I absolutely cannot have any workload from the community-operator index running, so now I have to re-trigger the resolution once the certified-operators catalog is available again, and for that I need to uninstall a bunch of dependencies

I'm imagining you're talking about a scenario where both catalogSources are installed at once, but the certified-operators index takes some time to become available, during which one round of resolution completes with the community-operator index. In that case, we'd certainly run into the problem of having incorrect operators installed on cluster. We might be able to work around this by extending the Operator object to have more constraints over what gets selected - like maybe a restriction on a custom property like the publisher or similar.

What I mean by "continue resolving with stale cache" is more like: if we have catalog v-1 on cluster whose underlying source is updated to serve v-2, and the cache update fails because the catalog source is unreachable or similar, the resolver would continue using its stale v-1 cache. If this isn't noticed or fixed in time, we could have dependencies installed that are difficult to clean up.

One solution is to void any cached catalogSource that isn't available for even one sync - so we'd need to rebuild the cache any time a catalogSource becomes temporarily unavailable. This might have some impact on actual cache performance. Another is to have the cache marked stale, stopping any attempt at resolution that doesn't specify it's OK to resolve from a stale cache.

        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    entities, err := r.registry.ListEntities(ctx, catalogSource)
Member

Before we attempt to list entities, should we check if the catalog source says it's ready?

Contributor Author

We had this check at one point - it should reduce some of the noisiness of transient errors at startup. It was removed to support unmanaged catalogSources, where the only way to tell if the catalogSource is ready is to try connecting to it.

Member

The catalog operator doesn't maintain that readiness information in the catsrc status if the catsrc spec directly specifies an address? If that's true, that seems like it could be considered a bug.

My memory is fuzzy, but I thought there was something in the catsrc status about this regardless of managed vs. unmanaged.

Contributor Author

You're right about unmanaged catalogSources - the catsrc status for gRPC-backed catsrcs does get populated by the OLMv0 catalog operator if it is present on the cluster, regardless of whether the catsrc is managed or unmanaged. We're skipping the health check under the assumption that the catalog operator need not exist for the catalogSource to work. If we can guarantee that the OLMv0 catalog operator will be present on any cluster that has OLMv0 catalogSources, then adding a health check here makes sense.
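
For reference, the guard being discussed might look roughly like this, assuming the v0 CatalogSource status fields (GRPCConnectionState, LastObservedState) as I recall them; treat it as a sketch rather than the agreed-upon check:

package controllers

import "github.com/operator-framework/api/pkg/operators/v1alpha1"

// catalogReady reports whether something (normally the OLMv0 catalog operator)
// has observed a READY gRPC connection for this catalog source. If nothing
// populates the status - e.g. no catalog operator on cluster - this is always
// false, which is exactly the trade-off discussed above.
func catalogReady(catsrc *v1alpha1.CatalogSource) bool {
    state := catsrc.Status.GRPCConnectionState
    return state != nil && state.LastObservedState == "READY"
}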

Contributor

If the catalog-operator isn't present, there's nothing reconciling v0 CatalogSources, right? I.e. the API server will just throw back a "kind CatalogSource is not recognized"?

Contributor Author

We can have the CatalogSource CRD present on cluster without having the catalog-operator running. Whether we want this to be the case is another question entirely.

    entities, err := r.registry.ListEntities(ctx, catalogSource)
    if err != nil {
        r.recorder.Event(catalogSource, eventTypeWarning, eventReasonCacheUpdateFailed, fmt.Sprintf("Failed to update bundle cache from %s/%s: %v", catalogSource.GetNamespace(), catalogSource.GetName(), err))
        return ctrl.Result{Requeue: true}, err
Member

Sort of in line with my previous comment about checking certain aspects of the catsrc status before attempting to ListEntities... do we need Requeue: true here? This will result in an exponential retry even when the catsrc object doesn't change.

I imagine that when a catsrc is coming up, it'll be unavailable until the pod spins up and the grpc server starts. In all that time, we'll be retrying over and over unnecessarily.

It seems like we could wait for catsrc events to come into our watch, look at the status, and be a bit more diligent about waiting to attempt ListEntities.

WDYT?

Contributor Author

This failure should only happen if a catalogSource's underlying registry server isn't reachable. The requeue here is only necessary for unmanaged catalogSources (ones that declare a static address as their source and are not updated by the OLMv0 catalog-operator).

The exponential backoff seemed appropriate for waiting out transient errors.

Member

Bummer. It would be great if the catalog operator could be the thing probing the underlying GRPC server and telling us via catsrc status updates when it is ready.

Contributor

I thought this information (if the registry server is ready or not) is already exposed via the GrpcConnectionState object in the status?

Contributor Author

I thought this information (if the registry server is ready or not) is already exposed via the GrpcConnectionState object in the status?

@anik120 Connection state is exposed via GrpcConnectionState in catalogSources when the OLMv0 catalog-operator is running on cluster. On re-checking the code, it looks like it does this for static-address catalogSources as well.

Having this as a strict requirement would mean we can't run operator-controller without running the OLMv0 catalog-operator on cluster, though with the current reconciler, it'd still be relying on something external to ensure that the catalogSource is being served at the address it points to.

    if isManagedCatalogSource(*catalogSource) {
        return ctrl.Result{}, nil
    }
    return ctrl.Result{RequeueAfter: defaultCatalogSourceSyncInterval}, nil
Member

Is RequeueAfter necessary?

Does the OLMv0 catalog-operator update the catsrc status in some way when a catsrc is pivoted to a new image? If so, we can just rely on events coming in from our watch, right?

Contributor Author

This is for images that the catalog-operator doesn't update, i.e. those serving from a static address. We can remove this if we're saying that the catalog reconciler will only work when the v0 catalog-operator is present on cluster.

Having a periodic requeue (though maybe not as frequent as every 5 minutes) means we don't strictly need any olm v0 components for the catalog reconciler to work.

Member

I see now. IMO we can assume that if the OLMv0 catalog source API is present, so is the OLMv0 catalog operator.

Contributor Author

@ankitathomas ankitathomas Mar 6, 2023

I'm assuming this is to allow an easier transition away from OLMv0 - to allow removing the v0 components before we've completed the v1 catalogs. It may be helpful when we're planning migration in the future.

Member

@joelanford joelanford Mar 7, 2023

Migration is a good point. But I'm wondering if we'll still need at least the OLMv0 catalog operator (or equivalent)? If we don't have it, that'll mean:

  1. Catalog sources that define images will never end up with a catalog pod or address to connect to (it's the catalog operator that manages that).
  2. No catalog sources will have their status block updated.

It seems like we'll either need the OLMv0 catalog operator, or we'll need something that can migrate existing OLMv0 catalog sources to the new OLMv1 cluster catalog source.

Having said all that, my personal opinion is that worrying about migration right now is not in our interest. I feel like we're still in the experiment, learn, and iterate phase of OLMv1, and my thinking is that the migration concerns can wait until we've landed on something that is a bit closer to our OLM v1.0.0 release.

Contributor

I'd also say that the underlying workloads of v0 CatalogSources (e.g. registry pods) are not (ideally, hopefully) the underlying workloads for v1 CatalogSources.

I'm also curious whether we need to make this adapter aware of different types of CatalogSources. If all of this is already fully fleshed out then sure, let's go for it, but I'm not sure it's worth solving for every type of CatalogSource, if that'll make the implementation easy. E.g. in this case, if we just solve for grpc CatalogSources, LatestImageRegistryPoll is the thing that lets you know if there's been a new polling update rollout.
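
A tiny sketch of using that field for the grpc/image case; the field name is from the v0 CatalogSource status as I recall it, and whether this is the right staleness signal is exactly the open question:

package controllers

import (
    "github.com/operator-framework/api/pkg/operators/v1alpha1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// needsRelist reports whether the catalog image has been re-polled since we
// last listed entities from this catalog source.
func needsRelist(catsrc *v1alpha1.CatalogSource, lastListed metav1.Time) bool {
    poll := catsrc.Status.LatestImageRegistryPoll
    return poll != nil && poll.After(lastListed.Time)
}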

Contributor Author

@anik120 @perdasilva I think the idea is to have multiple sources for the resolver to pull from - v0 CatalogSources, v1 CatalogSources, etc. - each with their own controller/adapter.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2023
@ankitathomas ankitathomas force-pushed the entitysource branch 7 times, most recently from 3e8d754 to e0d7add Compare March 3, 2023 15:41
@ankitathomas ankitathomas force-pushed the entitysource branch 3 times, most recently from 81db438 to 7e9f8df Compare March 6, 2023 14:22
@ankitathomas ankitathomas changed the title [WIP] add catalogsource entitysource Add catalogsource entitysource Mar 6, 2023
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 6, 2023
@perdasilva
Contributor

@joelanford @ankitathomas @awgreene @anik120 @varshaprasad96 Here are my 2c:

  • Adding v0 catalog support is meant as a value-add so that (even if imperfect) we can have an on-cluster data source we can drive the operator-controller from. For now, I think it's OK that we need v0 to be there in order for things to work.
  • I think it's OK to leave the upgrade edge stuff in the entity for now. It's likely that we will abandon entities/entity sources anyway, in which case we'd switch to the api.Bundle struct from operator-registry and just use that.
  • The self-managed catalog source is nice, but maybe not necessary at this stage. I would leave things as they are and document any limitations in the form of tickets. This should also help inform the next CatSrc API. I.e. just having the operatorhubio catalog source should be sufficient to install some of the operators therein. Self-managed is nice because it allows people to create their own sources and play around with the operator-controller, test out their operators, etc.
  • Ultimately, the goal of this PR should be making on-cluster sources available for the operator-controller so that upstream (OLM) users can already start to play with the new APIs.
  • In hindsight, we should have done as Joel suggested and punted on the cache system for now, since we would already be adding value even without it. Lesson for the next ticket.
  • I think we should just freeze scope, fix only what is necessary, and document what we think could be improved in tickets.
  • We may need to inform users how to install OLM, etc. as a quick follow-up.

@joelanford joelanford dismissed their stale review March 23, 2023 20:21

Agree with @perdasilva's two cents comment.

Member

@joelanford joelanford left a comment

Agreed with everything @perdasilva said in his two cents comment. This gets us playing with some real-world operators and we can always make improvements or replace pieces of this as we keep moving. I'm +1 on merging once the conflicts are resolved and tests pass.

Freeze scope and capture any follow-ups for anything else we find.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 23, 2023
ankitathomas and others added 17 commits March 27, 2023 13:03
Signed-off-by: Ankita Thomas <ankithom@redhat.com>
Signed-off-by: Ankita Thomas <ankithom@redhat.com>
Signed-off-by: perdasilva <perdasilva@redhat.com>
Signed-off-by: Ankita Thomas <ankithom@redhat.com>
Signed-off-by: perdasilva <perdasilva@redhat.com>
Signed-off-by: perdasilva <perdasilva@redhat.com>
Signed-off-by: perdasilva <perdasilva@redhat.com>
Signed-off-by: perdasilva <perdasilva@redhat.com>
Signed-off-by: perdasilva <perdasilva@redhat.com>
Signed-off-by: Ankita Thomas <ankithom@redhat.com>
Signed-off-by: perdasilva <perdasilva@redhat.com>
Signed-off-by: perdasilva <perdasilva@redhat.com>
Signed-off-by: perdasilva <perdasilva@redhat.com>
Signed-off-by: perdasilva <perdasilva@redhat.com>
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Signed-off-by: Ankita Thomas <ankithom@redhat.com>
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 27, 2023
@anik120
Contributor

anik120 commented Mar 27, 2023

lgtm 👍🏽

@ankitathomas ankitathomas merged commit 2973a14 into operator-framework:main Mar 28, 2023
anik120 added a commit to anik120/operator-controller that referenced this pull request Apr 24, 2023
This reverts commit 2973a14.

Signed-off-by: Anik <anikbhattacharya93@gmail.com>
anik120 added a commit to anik120/operator-controller that referenced this pull request Apr 24, 2023
This reverts commit 2973a14.

Signed-off-by: Anik <anikbhattacharya93@gmail.com>
anik120 added a commit to anik120/operator-controller that referenced this pull request Apr 27, 2023
This reverts commit 2973a14.

Signed-off-by: Anik <anikbhattacharya93@gmail.com>
tmshort pushed a commit that referenced this pull request May 8, 2023
This reverts commit 2973a14.

Signed-off-by: Anik <anikbhattacharya93@gmail.com>