OCPBUGS-36301: parallelize member health checks #1286
Conversation
https://issues.redhat.com/browse/OCPBUGS-36301

Currently, member health is checked in serial with a 30s timeout per member. Three of the four GetMemberHealth callers also had their own default 30s timeout for the entire process. Because of this, a slow check on one member could exhaust the timeout for the entire GetMemberHealth function, causing later-checked members to report as unhealthy even though they were fine.

With this commit, I am dropping the internal 30s timeout from GetMemberHealth and instead letting the caller set the timeout. The code now also checks the health of all members in parallel, so a single slow member no longer affects the health reporting of the other members.

I also added a timeout to the context used in IsMemberHealthy, which calls GetMemberHealth. Neither Trevor nor I were sure why a default timeout wasn't present there, though one was present at all other call sites.
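For illustration, a minimal sketch of the parallel shape described above. The type definitions and function bodies here are hypothetical stand-ins, not the operator's actual code; only the names `GetMemberHealth` and `checkSingleMemberHealth` come from this PR:

```go
// Sketch only: simplified stand-ins for the operator's real types and
// checks, illustrating the parallel pattern this PR describes.
package health

import "context"

type Member struct{ Name string }

type healthCheckResult struct {
	Member  Member
	Healthy bool
	Err     error
}

// GetMemberHealth checks every member concurrently. It no longer sets
// its own 30s deadline; the caller decides the budget via ctx.
func GetMemberHealth(ctx context.Context, members []Member) []healthCheckResult {
	// Buffered to len(members): every goroutine can send its result
	// without blocking, even if it finishes after we stop waiting.
	resChan := make(chan healthCheckResult, len(members))
	for _, m := range members {
		go func(m Member) {
			resChan <- checkSingleMemberHealth(ctx, m)
		}(m)
	}

	// One receive per launched goroutine; the checks run concurrently,
	// so one slow member no longer consumes the others' timeout budget.
	results := make([]healthCheckResult, 0, len(members))
	for range members {
		results = append(results, <-resChan)
	}
	return results
}

// checkSingleMemberHealth is a placeholder; the real check would call
// the member's etcd endpoint and honor ctx's deadline.
func checkSingleMemberHealth(ctx context.Context, m Member) healthCheckResult {
	select {
	case <-ctx.Done():
		return healthCheckResult{Member: m, Err: ctx.Err()}
	default:
		return healthCheckResult{Member: m, Healthy: true}
	}
}
```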
@AlexVulaj: This pull request references Jira Issue OCPBUGS-36301, which is invalid. Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made. The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@AlexVulaj: This pull request references Jira Issue OCPBUGS-36301, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

Requesting review from QA contact.
/label cherry-pick-approved
```go
resChan <- checkSingleMemberHealth(ctxTimeout, member)
// closing here to avoid late replies to panic on resChan,
// the result will be considered a timeout anyway
close(resChan)
```
we kinda still have to close this channel, don't we?
fyi, that's a panic I've fixed recently:
https://issues.redhat.com//browse/OCPBUGS-27959
Looking at the pre-#1190 code, the panics were from timelines like:

1. `checkSingleMemberHealth` goroutine launched with its own 30s `Context` duration.
2. `select` waited on a result from `resChan` or a new 30s `time.After`.
3. When the `select` `time.After` won, it appended a `30s timeout waiting for member...` to `memberHealth`, and `close`d `resChan`.
4. A millisecond or two later, `checkSingleMemberHealth` would hit its 30s `Context` timeout in the `Get` call, create its own `health check failed: ...` result, and push it into `resChan`.
5. But `resChan` was closed in step 3! Panic!
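A minimal, self-contained sketch of that race, with toy millisecond timings standing in for the real 30s windows (not the operator code):

```go
// Toy reproduction of the race above. Running this panics with
// "send on closed channel".
package main

import "time"

func main() {
	resChan := make(chan string)

	// Steps 1-2: the worker, finishing just after the receiver's timeout.
	go func() {
		time.Sleep(35 * time.Millisecond)
		resChan <- "health check failed: ..." // step 4: send hits a closed channel
	}()

	select {
	case <-resChan:
		// healthy path, not taken here
	case <-time.After(30 * time.Millisecond):
		// Step 3: record "30s timeout waiting for member..." and close.
		close(resChan)
	}

	time.Sleep(50 * time.Millisecond) // keep main alive long enough to see the panic
}
```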
With #1190, you dropped the `close` from step 3 and moved it to step 4, so no more panic.
But from Go's Range and Close tour:

> Channels aren't like files; you don't usually need to close them. Closing is only necessary when the receiver must be told there are no more values coming, such as to terminate a range loop.
And with this pull, we no longer have the receiver-side `select` or timeout. With this pull, the receiver will block until it has a result back from each launched `checkSingleMemberHealth` goroutine, and it's up to those goroutines to respect the `Context` timeout. So there is no chance of `GetMemberHealth` `close`-ing the channel before a `checkSingleMemberHealth` goroutine writes, because we no longer have an explicit `close` at all, and the channel is just garbage-collected as it goes out of scope, like all local Go variables.
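To make the quoted rule concrete, a toy contrast, nothing operator-specific:

```go
// close only matters when the receiver ranges over the channel; a
// receiver that takes a known count of values never needs it.
package main

import "fmt"

func main() {
	// range: close is required, or the loop would block forever.
	ranged := make(chan int)
	go func() {
		for i := 0; i < 3; i++ {
			ranged <- i
		}
		close(ranged) // tells range there are no more values
	}()
	for v := range ranged {
		fmt.Println("ranged:", v)
	}

	// Counted receive, the shape this pull uses: no close; the channel
	// is garbage-collected once nothing references it.
	counted := make(chan int, 3)
	for i := 0; i < 3; i++ {
		counted <- i
	}
	for i := 0; i < 3; i++ {
		fmt.Println("counted:", <-counted)
	}
}
```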
Thanks for the explanation, indeed kicking the `select` out and just looping over all the values is enough :)

/lgtm
/cherry-pick release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: AlexVulaj, geliu2016, tjungblu. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest-required

/retest-required
I think the …
This is now tracked in ETCD-637. In the meantime, possibly worth an /override?
/override ci/prow/e2e-aws-ovn-etcd-scaling no doubt :)
@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-etcd-scaling
@AlexVulaj: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.

Full PR test history. Your PR dashboard.
unrelated failure /override ci/prow/e2e-aws-ovn-serial
@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-serial
Merged commit aabb6d6 into openshift:master.
@AlexVulaj: Jira Issue OCPBUGS-36301: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-36301 has been moved to the MODIFIED state.
@tjungblu: new pull request created: #1290
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-etcd-operator-container-v4.17.0-202407031527.p0.gaabb6d6.assembly.stream.el9 for distgit cluster-etcd-operator.