Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry initialization error conditions #2979

Merged
merged 1 commit into from
Jul 5, 2023

Conversation

tmshort
Copy link
Contributor

@tmshort tmshort commented Jun 14, 2023

Description of the change:

When the api server is flakey (e.g. during a cluster install), it is possible for some of the OLM initialization to fail. When this happens, OLM gets into a bad state (e.g. a monitoring go routine terminates) and can't recover without a restart.

There were at least two places I found where a retry mechanism is needed to handle intialization errors. This was as far as I peeled the onion. It's not an exponential backoff retry, but a 1 minute retry interval should be sufficient (no other backoffs are exponential).

Motivation for the change:

Downstream bug report.

Architectural changes:

None.

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@openshift-ci openshift-ci bot requested review from dtfranz and ecordell June 14, 2023 15:00
@tmshort tmshort force-pushed the OCPBUGS-13128 branch 4 times, most recently from 48f8cdc to 3de9dc2 Compare June 14, 2023 19:58
@@ -78,7 +78,7 @@ func TestOperatorRunChannelClosure(t *testing.T) {

o.Run(ctx)

timeout := time.After(time.Second)
timeout := time.After(2 * time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this meant to represent 2*defaultServerVersionInterval ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it probably should be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +139 to +143
select {
case <-time.After(defaultProbeInterval):
case <-stopCh:
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to try initialization indefinitely?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fails, then OLM is unable to monitor and update the packageserver cluster operator. After chatting with some people, it was decided to keep on retrying, rather than terminate the OLM process (which would restart the pod), which could look worse in metrics.

The other retry was added, but is restricted because it impacts unit tests.

Comment on lines 201 to +211
v, err := o.serverVersion.ServerVersion()
if err == nil {
o.logger.Infof("connection established. cluster-version: %v", v)
return
}
select {
case <-time.After(defaultServerVersionInterval):
case <-ctx.Done():
return
}
v, err = o.serverVersion.ServerVersion()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I'm reading this correctly, we attempt to get the server version twice. If the first attempt fails, we retry after the defaultServerVersionInterval. If we ever succeed we return the serverVersion. Could you share your reasoning why we make two attempts here but infinite attempts above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests... the unit tests actually wait for this initialization to complete (success or failure), which is guaranteed to fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I would've liked to have had an infinite wait, but then the unit-tests timeout)

Copy link
Contributor

@ankitathomas ankitathomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2023
When the api server is flakey (e.g. during a cluster install), it is
possible for some of the OLM initialization to fail. When this happens,
OLM gets into a bad state (e.g. a monitoring go routine terminates)
and can't recover without a restart.

There were at least two places I found where a retry mechanism is
needed to handle intialization errors. This was as far as I peeled the
onion. It's not an exponential backoff retry, but a 1 minute retry
interval should be sufficient (no other backoffs are exponential).

The ServerVersion only retries once with a minute in between. This
required fixing a unit-test to take the retry into account.

Signed-off-by: Todd Short <todd.short@me.com>
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2023
Copy link
Contributor

@ankitathomas ankitathomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2023
@openshift-ci
Copy link

openshift-ci bot commented Jul 5, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ankitathomas, tmshort
Once this PR has been reviewed and has the lgtm label, please assign kevinrizza for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tmshort tmshort merged commit e908cfc into operator-framework:master Jul 5, 2023
@tmshort tmshort deleted the OCPBUGS-13128 branch July 5, 2023 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants