
Rolling updates nullify use of readiness probes #1028

Closed
djzager opened this issue Sep 5, 2019 · 8 comments

Comments

@djzager

djzager commented Sep 5, 2019

Problem

Using rolling updates for operator upgrades, it is not possible for an operator upgrade to perform any initialization/transition tasks inside the scope of an OLM update.

Background

Let us assume that the operator is configured to delay readiness in order to perform some set of initialization tasks during an upgrade (upgrading operands, transitioning resource versions, etc.), and that these operators use leader election. When the rollout begins, there are two operator instances that want to own the same set of resources: the old operator instance has no work to do and stays "ready", while the new operator is not able to do any work because it isn't in control. At this point, either way, the use of readiness is not effective:

  • If the new operator refuses to go "ready" until it has completed initialization/transition, the update will fail because it will not win the leader election until the old operator is taken down.
  • If the new operator goes "ready" before completing initialization/transition, the update will complete but all of the upgrade-related tasks will be done after OLM has marked the upgrade as complete.
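The dilemma above assumes a readiness probe that is gated on the operator's initialization state. A minimal sketch of what such a Deployment might look like (the endpoint, port, and names are illustrative assumptions, not taken from the issue):

```yaml
# Hypothetical operator Deployment whose readiness probe only passes once
# upgrade/initialization tasks have completed. Under a rolling update, the
# new pod can never pass this probe: the old pod still holds the leader
# lock, so the new pod cannot start its initialization work.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: example-operator
  template:
    metadata:
      labels:
        name: example-operator
    spec:
      containers:
        - name: example-operator
          image: quay.io/example/example-operator:v2.0.0
          readinessProbe:
            httpGet:
              path: /readyz   # illustrative endpoint, reports ready only after init completes
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
```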

Impact

This affects only operators that are using a rolling update strategy (either by changing the names of deployments in the CSV or by using the "RollingUpdate" k8s deployment strategy), regardless of whether they are using the "Many Version" or "Tightly Coupled" approach.

Solution

It seems that, in a majority of scenarios, operator developers should be encouraged to use update/patch (keep deployment names the same) plus the Recreate strategy type to grant the updated operator control over the resources. This makes #952 more of an issue for operator developers.
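A minimal sketch of the suggested configuration, assuming a standard apps/v1 Deployment embedded in the CSV's install strategy (Recreate is a real Kubernetes deployment strategy; the operator name and image are illustrative):

```yaml
# Keeping the deployment name stable across CSV versions and using the
# Recreate strategy means the old pod is terminated before the new one
# starts, so the new operator can win leader election and finish its
# initialization before reporting ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator   # keep this name the same across CSV versions
spec:
  replicas: 1
  strategy:
    type: Recreate   # old pod is removed before the new pod is created
  selector:
    matchLabels:
      name: example-operator
  template:
    metadata:
      labels:
        name: example-operator
    spec:
      containers:
        - name: example-operator
          image: quay.io/example/example-operator:v2.0.0
```

The tradeoff, raised later in this thread, is a window of operator downtime between the old pod terminating and the new pod becoming schedulable.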

At the same time this appears, to me at least, to highlight the discussion in #922, where a more robust mechanism for communicating operator status back to OLM, whether during install, continuous operation, or upgrades, would be beneficial to operators being managed by OLM.

@rthallisey

cc @davidvossel

@rthallisey

cc @tchughesiv

@tchughesiv
Contributor

We're using rolling updates and upgrades are working as expected.

@djzager
Author

djzager commented Sep 5, 2019

We're using rolling updates and upgrades are working as expected.

I think this would only impact you (maybe I need to update the Impact section 😎) if you need your new operator to do work during an upgrade. Even then, this issue would not really prevent that work from being done; it would just be done after OLM had said the operator update was complete.

MarSik pushed a commit to kubevirt/kubevirt-ssp-operator that referenced this issue Sep 9, 2019
To let the upgrades go nicely and make good use of conditions reporting,
we need to make sure that k8s don't change things behind our back.
To do that, we force the update strategy to "Recreate".

Relevant OLM issues:
operator-framework/operator-lifecycle-manager#1028
operator-framework/operator-lifecycle-manager#952

Signed-off-by: Francesco Romani <fromani@redhat.com>
@ecordell
Member

If rolling updates or other deployment-level options don't work for your case, you can change the name of the deployment in your install strategy. This will spin up the new deployment and remove the old deployment without any rolling.

@mhrivnak
Member

@ecordell I thought that if you changed the name of the Deployment, OLM will still wait for the new Pod to become ready before removing the old Deployment. Is that not the case?

@djzager
Author

djzager commented Sep 17, 2019

@ecordell Changing the name of the deployment in the install strategy is the most problematic of all the options available to an operator developer. In that scenario, OLM will bring up all of the new deployments, wait for them to be ready, and then take down the old deployments.

This will spin up the new deployment and remove the old deployment without any rolling.

You may not call this a rolling update but for all practical purposes this is a rolling update. There is no correlation, from an OLM perspective, between example-operator-1 in example-operator.v1.0.0 and example-operator-2 in example-operator.v2.0.0.

I don't understand why this issue was closed without any further discussion. At a minimum this is a doc issue where we should be warning operator developers of potential pitfalls related to upgrades.

@rmohr

@ecordell shouldn't this be reopened? Recreate should not be the first option if you want your operators to be available. Just consider the case where the new pods are not schedulable for a longer time, you operator will be down then. This can of course also happen otherwise (e.g. cluster is out of capacity and the operator is evicted or crashes), but it is an unnecessary additional risk during a normal cluster operation.
