-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
📖 F28 - Provided service versions (aka 'Operand versioning') #180
Conversation
docs/olmv1_roadmap.md
Outdated
@@ -77,6 +77,8 @@ _Priority Rating: 1 highest, 2 medium, 3 lower (e.g. P2 = Medium Priority)_ | |||
|
|||
**F27 - Pluggable certificate management (P2):** OLM should rely on other controllers to create and lifecycle TLS certificates required to drive the functionality of certain extensions, like webhooks that need to trust / need to be trusted by the cluster's API server. OLM should not implement any certificate handling itself. In a first implementation support should be established for cert-manager. | |||
|
|||
**F28 - Provided service versions (P2):** Extensions can offer multiple services with specific version ranges. To help cluster users make informed decisions about which extension version to install or update, OLM should provide a separate field for extension developers to specify the provided services and their respective versions. This will allow for declaring extension dependencies on specific service versions and enable flexibility in releasing extensions and their corresponding services at different cadences. In disconnected environments, cluster users should be able to deselect provided services and versions during image mirroring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will allow for declaring extension dependencies on specific service versions and enable flexibility in releasing extensions and their corresponding services at different cadences.
Can we walk through some examples of how this would work? If an operator author declared a dependency like "I need an operator that provides Kafka v3.4.0", and two different operators (call them A and B) with two different APIs say they manage Kafka v3.4.0, OLM might pick B, the "wrong" operator. The depender may only know about A's API. It seems like the service version dependency would need to be an addition to the required GVK or operator version constraints. So you would have to say "I depend on this GVK and service version range" OR "I depend on this package, package version range, and service version range". Is that what we have in mind here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't really thought about operand versioning and its use in operator-to-operator dependencies but it does make sense as a use case.
I feel that a service is represented by a GVK. If an operator provide v3.4.0 of Kafka then I would expect to make a Kafka
CR with a spec.version: 3.4.0
, or something along those lines. So the dependency would be:
- GVK: kafkas.apigroup.provider.com/v1beta1
- Operand version: 3.4.0
- Specific package range: (optional)
In this instance, we wouldn't even need OLM to understand that its the spec.version
field in the CR that identifies the version being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this instance, we wouldn't even need OLM to understand that its the spec.version field in the CR that identifies the version being used.
Correct.
My main point here is that this can't be a standalone constraint. It has to be tied to the operator name or the provided API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an operator author declared a dependency like "I need an operator that provides Kafka v3.4.0", and two different operators (call them A and B) with two different APIs say they manage Kafka v3.4.0, OLM might pick B, the "wrong" operator. The depender may only know about A's API. It seems like the service version dependency would need to be an addition to the required GVK or operator version constraints. So you would have to say "I depend on this GVK and service version range" OR "I depend on this package, package version range, and service version range". Is that what we have in mind here?
Correct, this "service version dependency" should be used as an addition to the "required GVK" or "Operator version constraint" to narrow down to:
- a specific range of Operator versions (if used with "GVK" which is shared across multiple operand versions)
- or "a specific image of Operand/provided service" (if used "Operator version constraint" and a single Operator version provides multiple versions of operand/provided service).
In this instance, we wouldn't even need OLM to understand that its the spec.version field in the CR that identifies the version being used.
Correct.
My main point here is that this can't be a standalone constraint. It has to be tied to the operator name or the provided API.
Correct, the main point is this "service version dependency" is not a standalone thing.
But in this Kafka example, Operand version: 3.4.0
is also useful because the GVK kafkas.apigroup.provider.com/v1beta1
is the same for multiple Kafka version. So Kafka/Operand version: 3.4.0
is useful to narrow down to a tighter Operator version range and pin point the desired Operand image for Kafka 3.4.0
within the target Operator version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will allow for declaring extension dependencies as an additional constraint on specific service versions and enable flexibility in releasing extensions and their corresponding services at different cadences.
Add "as an additional constraint" to clarify it is not intended as a standalone constraint
docs/olmv1_roadmap.md
Outdated
@@ -77,6 +77,8 @@ _Priority Rating: 1 highest, 2 medium, 3 lower (e.g. P2 = Medium Priority)_ | |||
|
|||
**F27 - Pluggable certificate management (P2):** OLM should rely on other controllers to create and lifecycle TLS certificates required to drive the functionality of certain extensions, like webhooks that need to trust / need to be trusted by the cluster's API server. OLM should not implement any certificate handling itself. In a first implementation support should be established for cert-manager. | |||
|
|||
**F28 - Provided service versions (P2):** Extensions can offer multiple services with specific version ranges. To help cluster users make informed decisions about which extension version to install or update, OLM should provide a separate field for extension developers to specify the provided services and their respective versions. This will allow for declaring extension dependencies on specific service versions and enable flexibility in releasing extensions and their corresponding services at different cadences. In disconnected environments, cluster users should be able to deselect provided services and versions during image mirroring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In disconnected environments, cluster users should be able to deselect provided services and versions during image mirroring.
This seems like it will lead to a poor UX for end users of the operator. There are several scenarios that could be problematic:
- An operator provides services A and B, but the mirroring user mirrors only service A. The operator API for service B is still present, so an end user tries to use that API. The operator doesn't know which services have been mirrored (i.e. which operand images have been mirrored), so it happily attempts to handle the request for service B. The result is "ImagePullBackoff" on all of the service B images with no indication to the user about why. To them, it looks like a bug in the operator.
- Operators that support multiple versions are very likely to have complex upgrade logic between those versions. If mirroring users are able to selectively remove arbitrary versions at mirror time, an operator may not be able to perform its essential function of performing operand migrations via a series of service version upgrades. Again, the operator and end user will see ImagePullBackoff if a mirror user has inadvertently opted not to mirror an intermediate service version required for a migration.
Are these acceptable outcomes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find myself thinking whether this type of configuration (reducing the the operators behavior surface) should be OLMs concern, or rather that of some operator side library/framework which makes this kind of configuration easy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to reach some consensus on the boundary between the OLM's and the operator's responsibilities. IMHO OLM should only concern itself with installation/update/removal of operators. Configuration is something outside of that. Though, I'd be curious to see if we could have some kind of separate component within the OLM ecosystem that can handle this in a nice way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that OLM should only be concerned with installation/update/removal.
I can't see a good way for OLM (or the operator) to know if an image pull is going to succeed or fail before it is attempted. Although it would be great to be able to indicate to a user that the problem is that something wasn't mirrored vs a registry auth error, a missing image, or any other multitude of problems, I wouldn't consider that to be in scope either.
I do think though that we need the metadata available in the package to allows the user to select the service versions that they want to mirror. This does leave potential sharp edges at runtime, where if you attempt to use a version that wasn't mirrored you will get pull errors, but the alternative is that the user has to mirror everything. As we're explicitly saying that fan-out operators are going to be more common because all operators will be cluster singletons, the mirror size expands extremely fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The potential problem I'm trying to highlight is that if OLM has a first class API for declaring service versions AND a first class mirror API for selecting specific service versions, then OLM (or maybe the mirroring tool) will be pinned with the blame when an image pull inevitably fails because an intermediate version wasn't mirrored.
"I used the mirror API to select the service version images I cared about, and now I'm getting image pull errors. OLM and the mirror tool should mirror all necessary images based on my stated desires" is what the bug reports will say.
And that means we'll have to extend the service version API with an upgrade edge specification, we'll have to tell operator authors to keep that up-to-date with their operator code, and the mirror tool API will have to allow the mirror user to tell it which service versions have already been mirrored. With all of that, then and only then will the mirror tool be able to follow the graph to mirror all the intermediate versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To confirm, you're talking about the case where the operator needs to pass a workload through intermediate versions for upgrade, so:
- Operand version 1.0.0 exists
- The user mirrors 1.0.0 and starts using it
- Operand versions 1.1.0, 1.2.0 are released in the latest operator
- The user adds 1.2.0 to their mirror, because that's what they want to upgrade to
- The user changes the workload CR to 1.2.0
- The operator starts by attempting a 1.1.0 upgrade
- The pull fails!
Is that the scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In disconnected environments, cluster users should be able to deselect provided services and versions during image mirroring.
This is coming from a requirement of Operators that provide multiple Add-Ons in addition to the primary provided services. e.g., Upstream OCM (or RHACM) Operator that comes with VolSync
and Submariner
Add-ons.
Enable users to deselect mirroring images of a certain provided service that are not interested in can greatly improve the mirroring time, traffic, and storage size.
There are several scenarios that could be problematic:
- An operator provides services A and B, but the mirroring user mirrors only service A. The operator API for service B is still present, so an end user tries to use that API. The operator doesn't know which services have been mirrored (i.e. which operand images have been mirrored), so it happily attempts to handle the request for service B. The result is "ImagePullBackoff" on all of the service B images with no indication to the user about why. To them, it looks like a bug in the operator.
Right, so in that case, it's very clear this is the intended/expected behavior explicitly set up by the admins, so it should be on the admins to make this clear to their teams. And as you mentioned, end-users/developers should be able to see the ImagePullBackof
too (so they know the image is missing in their internal registry).
- Operators that support multiple versions are very likely to have complex upgrade logic between those versions. If mirroring users are able to selectively remove arbitrary versions at mirror time, an operator may not be able to perform its essential function of performing operand migrations via a series of service version upgrades. Again, the operator and end user will see ImagePullBackoff if a mirror user has inadvertently opted not to mirror an intermediate service version required for a migration.
The thinking here is to set up a minVersion
and maxVersion
as the Operand/provided service version range for mirroring. e.g.,
A user specifies the desired Kafka versions as:
minVersion
: 2.8.0maxVersion
: 3.2.0
, given Strimzi Operator declares:
- version: 0.27.1, provides Kafka: 2.8.0, 2.8.1, 3.0.0
- version: 0.28.0, provides Kafka: 3.0.0, 3.1.0
- version: 0.29.0, provides Kafka: 3.0.0, 3.0.1, 3.1.0, 3.1.1, 3.2.0
, so mirroring tool can derive the target versions for mirroring Strimzi Operator are:
- version: 0.27.1
- version: 0.28.0
- version: 0.29.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enable users to deselect mirroring images of a certain provided service that are not interested in can greatly improve the mirroring time, traffic, and storage size.
This thread makes me think we need to provide a more robust "relatedImages" API. Basically by definition, an operator is complex code that orchestrates $something in place of an expert human. In that context, the set of related images needed for that orchestration is also likely complex. Mirroring by service version is somewhat simplistic and may not align with the ways that operators want to expose what to mirror when.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a separate thread @Jamstah said:
I don't think we can rely on operand versions being semver
I agree. Which means using minVersion
and maxVersion
for operand versions is likely not feasible.
Will review early next week, thanks for sharing! |
This PR focuses mainly on operator-to-operator dependencies, which is a great use case for being able to determine the service versions and keep the operator dependency graph complete. The other high level use case for this is user selection of service versions, which I feel we should include more strongly. For example, if the user has a Potentially, even if the user doesn't have a CR using that operand version right now, if that's what they asked for at install time then they should be warned if that service version is going away, although I'm not sure how/where that dependency would be recorded, or if we need it. Perhaps in the subscription? I guess this would be an issue for job style CRs that are created temporarily. Certainly a less urgent use case. |
That's a breaking change and that would warrant a major version bump of the operator. And I would expect:
Beyond that, the implications here seems to be that:
IMO, that is too opinionated. What if an API wants to expose the desired version as a range or as a channel containing multiple versions? IMO, going this far is beyond the scope of what I think OLM should have awareness of. |
I would have said that it would be a status field not a spec field, where the operator has resolved the constraints and reconciled a specific version, that then needs to be supported through any operator change.
Yes, OLM would need to know which field to look for, and that would need to be specific in some way (in the CSV?).
I'm in two minds here. Part of me thinks that its a good way to stop users shooting themselves in the foot, but I do agree that its a lot of additional complexity just for that use case. Removal of supported operand versions being a breaking change might be the right call, which would mean the user can rely on being in the right operator channel instead. I was kind of equating it to the function in OpenShift that checks for deprecated API usage in the cluster and says "are you sure you want to upgrade? You're still using these API versions which won't be in the next OpenShift". I guess that's easier to track. |
That's the case because the thing that removes the deprecated API is the same thing that is doing this tracking, right? The analogy holds if the operator (which owns this API) is the thing doing this checking. We're planning to have an OperatorHealth sort of API where the operator can raise these sorts of things and potentially prevent upgrades. |
Right, but the currently installed operator doesn't know the capabilities of the next one, so it doesn't know to prevent the upgrade. Maybe that's why we need a pre-install hook of some kind. I'm definitely all for the operator making the call, because I don't think we can rely on operand versions being semver, which means only the operator can safely parse them. |
Signed-off-by: Tony Wu <tlwu2013@gmail.comexport>
eea388b
to
0778bfd
Compare
My overall perspective I think remains the same.
TL;DR: I think we're trying to make OLM "all things for all people", and in doing so, I think we're wrapping ourselves into knots. I'm worried this is another knot. It solves one set of use cases, but adds friction to other -- equally valid -- use cases. |
This API being optional is just so that the API does not need to support all kinds of approaches Operators take to orchestrate their operands, but the majority of the use cases (e.g., operand versioning following semver, etc).
This API does incentivize Operator authors to denote provided service versions clearly to their users following the guidance if their Operators do provide managed service(s) to showcase their maturity and I’d argue it’s in a good way. There are Operators out there that don’t ship any CRDs at all, but they are not viewed as inferior simply because of that. But if they do provide managed service(s), and they don’t clearly denote the supported version, then that’s a valid maturity issue.
I don’t see how adding this will exacerbate “operator authors to remove support for operand versions in a non-major-version-bump of an operator” but rather with this Operand versioning thing, it helps greatly to expose this to the end-users, so users can either better prepared for that, or simply walked away toward another Operator does better in honoring semver. Just as you mentioned, without this Operand versioning, even OLM is having a hard time telling “if existing operands are at versions unsupported by the upgrade”. This is exactly the reason this API is critical to improving the visibility of this to both OLM and the end-users. |
Late to the party, it seems like we are discussing a couple of things in parallel so here is my attempt to bring into one larger picture: A) users, relying on workload-managing operators, generally have a great desire to know upfront (before install or upgrade) what versions of the workload the desired operator version can manage To your point @joelanford - not all operators are workload-managing operators. This kind of data will not make sense for something like Tekton or Isitio, which generally represent control planes and event-based automation facilities instead of managing a workload that is the actual thing that is consumed. Also automation-focussed operators like a namespace-configuration operator or cert-manager will not take value from this. That said, I don't think it incentives breaking changes in minor release versions anymore than OLM currently incentivises breaking changes in the operator APIs due to OLMs enforcement of semantic version strings for bundles. OLM doesn't bark if you remove an entire CRD in a minor version. Still, we see value in enshrining the semver concept in operator versioning to convey the recommendations associated with that. That value isn't lowered just because we aren't always able to enforce it at runtime in all cases. Rather the presence of this data allows test systems to check for this behavior. Equally, with operand information we would allow test systems to call our operators that are removing operand versions within a minor or patch release. On the other hand, I would be curious to learn more about the example:
Do we know of an example? I almost seems like a missed opportunity by the operator to provide more controls and nuance. And even when an operator does that at runtime, isn't the logic around that caused by variable input from the user? I.e. a version of another component. Regardless, if an operator has a good reason not to expose a choice her, they can opt out of providing the operand version data as well. B) OLM resolving based on operand information I think it's important to remind ourselves that OLM is not a workload manager. It is generally not aware of what the operators are exactly doing, nor should it be. What is in OLMs realm is using this information as additional constraints when resolving dependencies. If operator A says it needs operator B and operand B.v1.1, OLM shouldn't update B to a version that only provides operand B.v2.0 and not v1.1 anymore (of course an admin would always be able to force it through). That's where this feature provides additional value in the form of safety. The scenario I see is: popular operators used by multiple products and open source projects as a shared dependency but serving different requirements in regards to the version of the managed workload provided. I would go as far saying that without operand versioning information here, the concept of shared dependencies of operator will not be something the community can adopt, ultimately hampering reuseability and growth of open-source projects. C) Mirroring based on operand information There seem to be two issues here:
We could tackle this in a couple of ways: we could either say that like operators, also operands should follow semver. And by looking at the ecosystem, I think we stand a good chance to capture the majority of users / projects / software here.
IMHO, with our current mirroring concept, would likely end up doing the equivalent for operand versioning of what we are already doing for channels and packages in catalogs today: we mingle with the list of stuff in the mirrored catalog. That is, if a users filters out an operand version during mirroring, the resulting catalog would also not contain that operand version reference. This way, we convey to the user correctly what's available in the mirrored environment. |
@Jamstah @perdasilva @joelanford @tlwu2013 What do we think, how do we move forward here? |
I'm seeing four requirements:
Maybe I'm getting hung up on the prescriptiveness of the implementation in the PR (i.e. there shall be a new field for a list of service versions, operand version shall be constrainable). Perhaps we could reword it to look more like the above? |
I think that's a good list, I might add:
The easy bit here is an operator knowing itself what it supports. The hard bit is knowing what something that isn't yet installed supports to know its going to be suitable for the job. To write that down in a way that is flexible enough may be impossible, we're expecting to be able to communicate knowledge to OLM that it really shouldn't need to have, like:
If we do build that model, I'm 90% sure that there will immediately be requests to extend it with additional checks for more specific cases. I feel like we may need a way for an uninstalled operator to look at a cluster and say if it can or cannot reconcile what is already existing on it. Perhaps a CSV could include a pre-install/pre-update Job of some kind for that purpose.... it wouldn't solve all the use cases, but I feel many of them could be solved with documentation if we take the sharp edges out of install. |
Done with the editing. @Jamstah @perdasilva @joelanford @dmesser |
Codecov Report
@@ Coverage Diff @@
## main #180 +/- ##
=======================================
Coverage 80.69% 80.69%
=======================================
Files 18 18
Lines 803 803
=======================================
Hits 648 648
Misses 112 112
Partials 43 43
Flags with carried forward coverage won't be shown. Click here to find out more. |
Signed-off-by: Tony Wu <tlwu2013@gmail.comexport>
@Jamstah, I've also added the "upgrade" piece to the latest change:
|
This looks good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm ^^ thank you!!!
High-level, I'm comfortable with some sort of optional API operator authors can use to declare which versions of an operand are supported by a particular operator version. This seems to dovetail with some recent conversations we've had about making operators more multi-tenant capable. However, I think there is still MUCH to discuss about the mechanics of this and how it would work. E.g. what if an operator supports upgrading an existing workload from a particular operand version, but not installing or upgrading to that version? I still think this feature description seems too presumptive about how operators would (or wouldn't) expose operand version selection to users. So my stance is that I'm okay to merge if the goal here is to capture a high-level idea of supporting some sort of feature about operand version support awareness for admins. But I am not ready to say that what is described can or should be implemented without more design discussion. |
This is the feature request received in the past for OLM to expose the "provided service versions" (aka "Operand versioning") offered by the Operator/extension.
As extensions can only be installed once in a cluster as cluster-wide singletons with OLM 1.0, it is likely for an extension to oversee multiple versions of service and utilize a distinct versioning scheme. This should be a part of the OLM v1 requirement.
Related work in the past (Feb, 2021): https://hackmd.io/IEYEaOYkR-eBkMQkbVrC1A?view