Adding retries for gRPC calls #1248
Hi @akirillov. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test
/lgtm
Thanks for your contribution! 🎉 👍
@akirillov Thank you for doing this!
Maybe we could make this change directly in v1beta1, since the v1alpha3 release was already cut?
What do you think, @johnugeorge @gaocegege?
@@ -35,6 +38,11 @@ const (
// which is used to run healthz check using grpc probe.
DefaultGRPCService = "manager.v1alpha3.Suggestion"

// DefaultGRPCRetryCount is the the maximum number of retries for gRPC calls
Should we have DefaultGRPCRetryAttempts in the comment here?
Good catch, thanks. Fixed that.
bd117e3 to 00a834c
Thanks for the review, @gaocegege, @andreyvelich! Speaking of adding the changes to both alpha and beta, I've noticed that even though the image tags in this PR kubeflow/manifests#1239 were updated to the 0.9.0 release, the images are still using
Is it intentional?
Yes, the current Katib release for Kubeflow 1.1 is v1alpha3 with
Thanks for the clarification, @andreyvelich. So what's the decision, shall I revert the changes for
@akirillov What do you think about adding this change to both the v1alpha3 and v1beta1 versions? But if users run into this problem, they just need to update the Katib controller image to the latest v1alpha3 version and should then be able to use Istio 1.5+. They don't need to deploy the v1beta1 version. Let's see what others say.
Thanks, @andreyvelich, this PR adds it to both versions, and I think it is beneficial to have it that way to avoid waiting for
@akirillov Got it. What do you think, @gaocegege @johnugeorge?
I have no objection to it. If users (akirillov's team) need it, we can do it. /cc @johnugeorge
Sure, I will update master manifests when we merge this PR.
/lgtm
/assign @johnugeorge
SGTM
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: johnugeorge The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
This PR adds retries for gRPC calls in `suggestionclient.go`. One of the specific problems resolved by retries is a timing issue in the Katib controller pod when running on Istio 1.5+: when the Experiment deployment is ready, the Envoy sidecar health check doesn't treat it as a healthy upstream yet, and the `ValidateAlgorithmSettings` call from the controller fails with `no healthy upstream`.
In general, retries improve resilience and help avoid transient failures, instead of marking the experiment as failed after the first and only attempt.