Retry initialization error conditions #2979
@@ -78,7 +78,7 @@ func TestOperatorRunChannelClosure(t *testing.T) {

 	o.Run(ctx)

-	timeout := time.After(time.Second)
+	timeout := time.After(2 * time.Minute)
Is this meant to represent 2*defaultServerVersionInterval?
Yeah, it probably should be.
Fixed
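As a rough sketch of the point in this thread, a test timeout can be derived from the retry interval instead of a hard-coded duration. Everything below is hypothetical and not taken from the OLM code: retryInterval stands in for defaultServerVersionInterval and is shortened so the sketch runs quickly, and runOnce and waitClosed are illustrative helpers.

```go
package main

import (
	"fmt"
	"time"
)

// retryInterval stands in for defaultServerVersionInterval. It is shortened
// here (an assumption for illustration) so the sketch finishes quickly.
const retryInterval = 50 * time.Millisecond

// waitClosed reports whether ch is closed (or receives a value) before the
// timeout elapses.
func waitClosed(ch <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-ch:
		return true
	case <-time.After(timeout):
		return false
	}
}

// runOnce simulates an operator whose startup includes one retry interval
// before its done channel is closed.
func runOnce() <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		time.Sleep(retryInterval) // one failed attempt followed by one wait
	}()
	return done
}

func main() {
	// The timeout is expressed in terms of the interval, so it stays valid
	// if the interval constant changes.
	ok := waitClosed(runOnce(), 2*retryInterval)
	fmt.Println("closed before timeout:", ok)
}
```

Tying the timeout to the constant keeps the test and the retry logic from drifting apart if the interval is tuned later.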
select {
case <-time.After(defaultProbeInterval):
case <-stopCh:
	return
}
Do we want to try initialization indefinitely?
If this fails, OLM is unable to monitor and update the packageserver cluster operator. After chatting with some people, we decided to keep retrying rather than terminate the OLM process (which would restart the pod), since a restart could look worse in metrics.
The other retry was added, but is restricted because it impacts unit tests.
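For readers following the thread, here is a minimal sketch of the "retry until success or shutdown" pattern being discussed. It is not the PR's code: retryUntilStopped, errUnavailable, and the flaky initializer are hypothetical names, and the interval is shortened for the demo.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnavailable is a hypothetical error standing in for a failed API call.
var errUnavailable = errors.New("api server unavailable")

// retryUntilStopped calls initFn repeatedly, waiting interval between failed
// attempts, until initFn succeeds or stopCh is closed. It returns the number
// of attempts made and whether initFn eventually succeeded.
func retryUntilStopped(initFn func() error, interval time.Duration, stopCh <-chan struct{}) (int, bool) {
	attempts := 0
	for {
		attempts++
		if err := initFn(); err == nil {
			return attempts, true
		}
		select {
		case <-time.After(interval):
			// Interval elapsed; loop around and try again.
		case <-stopCh:
			// Shutdown requested; give up without terminating the process.
			return attempts, false
		}
	}
}

func main() {
	calls := 0
	// Hypothetical initializer that fails twice before succeeding.
	flakyInit := func() error {
		calls++
		if calls < 3 {
			return errUnavailable
		}
		return nil
	}
	stopCh := make(chan struct{})
	attempts, ok := retryUntilStopped(flakyInit, 10*time.Millisecond, stopCh)
	fmt.Printf("succeeded=%v after %d attempts\n", ok, attempts)
}
```

The loop has no attempt limit, so the only way out besides success is the stop channel, which matches the decision above to keep retrying rather than exit.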
v, err := o.serverVersion.ServerVersion()
if err == nil {
	o.logger.Infof("connection established. cluster-version: %v", v)
	return
}
select {
case <-time.After(defaultServerVersionInterval):
case <-ctx.Done():
	return
}
v, err = o.serverVersion.ServerVersion()
So if I'm reading this correctly, we attempt to get the server version twice. If the first attempt fails, we retry after the defaultServerVersionInterval. If we ever succeed, we return the serverVersion. Could you share your reasoning why we make two attempts here but infinite attempts above?
Unit tests... the unit tests actually wait for this initialization to complete (success or failure), which is guaranteed to fail.
(I would've liked to have had an infinite wait, but then the unit tests time out)
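To illustrate the trade-off described here, the following is a sketch of a bounded retry that gives up after a fixed number of attempts, so a caller that waits for initialization (such as a unit test with no API server) still returns. The names fetchWithRetries, errNoServer, and demo are hypothetical, and the interval is shortened; the PR itself simply calls ServerVersion a second time after one wait.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNoServer is a hypothetical error for an unreachable API server.
var errNoServer = errors.New("no api server")

// fetchWithRetries calls fetch up to maxAttempts times, waiting interval
// between failed attempts. It returns early if ctx is cancelled while waiting.
func fetchWithRetries(ctx context.Context, fetch func() (string, error), maxAttempts int, interval time.Duration) (string, error) {
	if maxAttempts < 1 {
		maxAttempts = 1 // always make at least one attempt
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		v, err := fetch()
		if err == nil {
			return v, nil
		}
		lastErr = err
		if attempt == maxAttempts {
			break // no point waiting after the final attempt
		}
		select {
		case <-time.After(interval):
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return "", lastErr
}

// demo runs fetchWithRetries against a fetcher that always fails, similar to
// a unit test environment. It returns the attempt count and the final error.
func demo(maxAttempts int) (int, error) {
	calls := 0
	fetch := func() (string, error) {
		calls++
		return "", errNoServer
	}
	_, err := fetchWithRetries(context.Background(), fetch, maxAttempts, time.Millisecond)
	return calls, err
}

func main() {
	calls, err := demo(2)
	fmt.Printf("attempts=%d err=%v\n", calls, err)
}
```

With maxAttempts set to 2 the worst case is one interval of waiting, which is why the test timeout discussed earlier only needs to cover a small multiple of the interval.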
/lgtm
When the API server is flaky (e.g. during a cluster install), it is possible for some of the OLM initialization to fail. When this happens, OLM gets into a bad state (e.g. a monitoring goroutine terminates) and can't recover without a restart.

There were at least two places I found where a retry mechanism is needed to handle initialization errors. This was as far as I peeled the onion. It's not an exponential backoff retry, but a 1 minute retry interval should be sufficient (no other backoffs are exponential).

The ServerVersion only retries once with a minute in between. This required fixing a unit test to take the retry into account.

Signed-off-by: Todd Short <todd.short@me.com>
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: ankitathomas, tmshort
The full list of commands accepted by this bot can be found here.
Description of the change:
When the API server is flaky (e.g. during a cluster install), it is possible for some of the OLM initialization to fail. When this happens, OLM gets into a bad state (e.g. a monitoring goroutine terminates) and can't recover without a restart.
There were at least two places I found where a retry mechanism is needed to handle initialization errors. This was as far as I peeled the onion. It's not an exponential backoff retry, but a 1 minute retry interval should be sufficient (no other backoffs are exponential).
Motivation for the change:
Downstream bug report.
Architectural changes:
None.
Testing remarks: