
Rolling updates nullify use of readiness probes #1028

Closed
djzager opened this issue Sep 5, 2019 · 8 comments

Comments

@djzager

djzager commented Sep 5, 2019

Problem

Using rolling updates for operator upgrades, it is not possible for an operator upgrade to perform any initialization/transition tasks inside the scope of an OLM update.

Background

Let us assume that the operator is configured to delay readiness in order to perform some set of initialization tasks during an upgrade (upgrading operands, transitioning resource versions, etc.), and that these operators use leader election. When the rollout begins, there are two operator instances that want to own the same set of resources: the old operator instance has no work to do and stays "ready", while the new operator is not able to do any work because it isn't in control. At this point, either way, the use of readiness is not effective:

  • If the new operator refuses to go "ready" until it has completed initialization/transition, the update will fail because it will not win the leader election until the old operator is taken down.
  • If the new operator goes "ready" before completing initialization/transition, the update will complete but all of the upgrade-related tasks will be done after OLM has marked the upgrade as complete.
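The dilemma above assumes a readiness probe that is gated on the operator's initialization state. A minimal sketch of what such a Deployment might look like (the endpoint, port, and names are illustrative assumptions, not taken from the issue):

```yaml
# Hypothetical operator Deployment whose readiness probe only passes once
# upgrade/initialization tasks have completed. Under a rolling update, the
# new pod can never pass this probe: the old pod still holds the leader
# lock, so the new pod cannot start its initialization work.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: example-operator
  template:
    metadata:
      labels:
        name: example-operator
    spec:
      containers:
        - name: example-operator
          image: quay.io/example/example-operator:v2.0.0
          readinessProbe:
            httpGet:
              path: /readyz   # illustrative endpoint, reports ready only after init completes
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
```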

Impact

This affects only operators that are using a rolling update strategy (either by changing the names of deployments in the CSV or by using the "RollingUpdate" k8s deployment strategy), regardless of whether they are using the "Many Version" or "Tightly Coupled" approach.

Solution

It seems that, in a majority of scenarios, operator developers should be encouraged to use update/patch (keep deployment names the same) plus the Recreate strategy type to grant the updated operator control over the resources. This makes #952 more of an issue for operator developers.
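A minimal sketch of the suggested configuration, assuming a standard apps/v1 Deployment embedded in the CSV's install strategy (Recreate is a real Kubernetes deployment strategy; the operator name and image are illustrative):

```yaml
# Keeping the deployment name stable across CSV versions and using the
# Recreate strategy means the old pod is terminated before the new one
# starts, so the new operator can win leader election and finish its
# initialization before reporting ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator   # keep this name the same across CSV versions
spec:
  replicas: 1
  strategy:
    type: Recreate   # old pod is removed before the new pod is created
  selector:
    matchLabels:
      name: example-operator
  template:
    metadata:
      labels:
        name: example-operator
    spec:
      containers:
        - name: example-operator
          image: quay.io/example/example-operator:v2.0.0
```

The tradeoff, raised later in this thread, is a window of operator downtime between the old pod terminating and the new pod becoming schedulable.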

At the same time this appears, to me at least, to highlight the discussion in #922, where a more robust mechanism for communicating operator status back to OLM, whether during install, continuous operation, or upgrades, would be beneficial to operators being managed by OLM.

@rthallisey

cc @davidvossel

@rthallisey

cc @tchughesiv

@tchughesiv
Contributor

We're using rolling updates and upgrades are working as expected.

@djzager
Author

djzager commented Sep 5, 2019

We're using rolling updates and upgrades are working as expected.

I think this would only impact you (maybe I need to update the Impact section 😎) if you need your new operator to do work during an upgrade. Even then, this issue would not really prevent that work from being done; it would just be done after OLM had said the operator update was complete.

MarSik pushed a commit to kubevirt/kubevirt-ssp-operator that referenced this issue Sep 9, 2019
To let the upgrades go nicely and make good use of conditions reporting,
we need to make sure that k8s don't change things behind our back.
To do that, we force the update strategy to "Recreate".

Relevant OLM issues:
operator-framework/operator-lifecycle-manager#1028
operator-framework/operator-lifecycle-manager#952

Signed-off-by: Francesco Romani <fromani@redhat.com>
@ecordell
Member

If rolling updates or other deployment-level options don't work for your case, you can change the name of the deployment in your install strategy. This will spin up the new deployment and remove the old deployment without any rolling.

@mhrivnak
Member

@ecordell I thought that if you changed the name of the Deployment, OLM will still wait for the new Pod to become ready before removing the old Deployment. Is that not the case?

@djzager
Author

djzager commented Sep 17, 2019

@ecordell Changing the name of the deployment in the install strategy is the most problematic of all the options available to an operator developer. In that scenario, OLM will bring up all of the new deployments, wait for them to be ready, and then take down the old deployments.

This will spin up the new deployment and remove the old deployment without any rolling.

You may not call this a rolling update but for all practical purposes this is a rolling update. There is no correlation, from an OLM perspective, between example-operator-1 in example-operator.v1.0.0 and example-operator-2 in example-operator.v2.0.0.

I don't understand why this issue was closed without any further discussion. At a minimum this is a doc issue where we should be warning operator developers of potential pitfalls related to upgrades.

@rmohr

@ecordell shouldn't this be reopened? Recreate should not be the first option if you want your operators to be available. Just consider the case where the new pods are not schedulable for a longer time, you operator will be down then. This can of course also happen otherwise (e.g. cluster is out of capacity and the operator is evicted or crashes), but it is an unnecessary additional risk during a normal cluster operation.
