Retry initialization error conditions #2979

tmshort · 2023-06-14T14:59:50Z

Description of the change:

When the API server is flaky (e.g. during a cluster install), it is possible for some of the OLM initialization to fail. When this happens, OLM gets into a bad state (e.g. a monitoring goroutine terminates) and can't recover without a restart.

There were at least two places I found where a retry mechanism is needed to handle initialization errors. This was as far as I peeled the onion. It's not an exponential backoff retry, but a 1 minute retry interval should be sufficient (no other backoffs are exponential).

Motivation for the change:

Downstream bug report.

Architectural changes:

None.

Testing remarks:

Reviewer Checklist

ankitathomas · 2023-06-23T16:05:58Z

pkg/lib/queueinformer/queueinformer_operator_test.go

@@ -78,7 +78,7 @@ func TestOperatorRunChannelClosure(t *testing.T) {

 			o.Run(ctx)

-			timeout := time.After(time.Second)
+			timeout := time.After(2 * time.Minute)


Is this meant to represent 2 * defaultServerVersionInterval?

Yeah, it probably should be.

ankitathomas · 2023-06-23T16:12:12Z

pkg/lib/operatorstatus/monitor.go

+			select {
+			case <-time.After(defaultProbeInterval):
+			case <-stopCh:
+				return
+			}


Do we want to try initialization indefinitely?

If this fails, then OLM is unable to monitor and update the packageserver cluster operator. After chatting with some people, it was decided to keep on retrying, rather than terminate the OLM process (which would restart the pod), which could look worse in metrics.

The other retry was added, but is restricted because it impacts unit tests.

awgreene · 2023-06-26T20:29:35Z

pkg/lib/queueinformer/queueinformer_operator.go

 		v, err := o.serverVersion.ServerVersion()
+		if err == nil {
+			o.logger.Infof("connection established. cluster-version: %v", v)
+			return
+		}
+		select {
+		case <-time.After(defaultServerVersionInterval):
+		case <-ctx.Done():
+			return
+		}
+		v, err = o.serverVersion.ServerVersion()


So if I'm reading this correctly, we attempt to get the server version twice. If the first attempt fails, we retry after the defaultServerVersionInterval. If we ever succeed we return the serverVersion. Could you share your reasoning why we make two attempts here but infinite attempts above?

Unit tests. The unit tests actually wait for this initialization to complete (success or failure), and in the tests it is guaranteed to fail.

(I would have liked an infinite wait, but then the unit tests time out.)

pkg/lib/operatorstatus/monitor.go

ankitathomas

/lgtm

When the API server is flaky (e.g. during a cluster install), it is possible for some of the OLM initialization to fail. When this happens, OLM gets into a bad state (e.g. a monitoring goroutine terminates) and can't recover without a restart.

There were at least two places I found where a retry mechanism is needed to handle initialization errors. This was as far as I peeled the onion. It's not an exponential backoff retry, but a 1 minute retry interval should be sufficient (no other backoffs are exponential).

The ServerVersion only retries once with a minute in between. This required fixing a unit test to take the retry into account.

Signed-off-by: Todd Short <todd.short@me.com>

ankitathomas

/lgtm

openshift-ci · 2023-07-05T14:37:28Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ankitathomas, tmshort
Once this PR has been reviewed and has the lgtm label, please assign kevinrizza for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot requested review from dtfranz and ecordell June 14, 2023 15:00

tmshort force-pushed the OCPBUGS-13128 branch 4 times, most recently from 48f8cdc to 3de9dc2 Compare June 14, 2023 19:58

ankitathomas reviewed Jun 23, 2023

tmshort force-pushed the OCPBUGS-13128 branch from 3de9dc2 to 2515158 Compare June 23, 2023 18:31

awgreene reviewed Jun 26, 2023

ankitathomas approved these changes Jun 29, 2023

openshift-ci bot assigned ankitathomas Jun 29, 2023

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2023

tmshort force-pushed the OCPBUGS-13128 branch from 2515158 to b080d21 Compare July 5, 2023 14:06

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2023

ankitathomas approved these changes Jul 5, 2023

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2023

tmshort merged commit e908cfc into operator-framework:master Jul 5, 2023

tmshort deleted the OCPBUGS-13128 branch July 5, 2023 14:39
