Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding retries for gRPC calls #1248

Conversation

akirillov
Copy link
Contributor

@akirillov akirillov commented Jun 29, 2020

What this PR does / why we need it:
This PR adds retries for gRPC calls in suggestionclient.go. One of the specific problems resolved by retries is the timing issue in Katib controller pod when running on Istio 1.5+: when Experiment deployment is ready, Envoy sidecar health check doesn't treat it as a healthy upstream yet and ValidateAlgorithmSettings call from the controller fails with no healthy upstream.

In general, retries can improve resilience and help to avoid transient failures instead of marking experiment as failed after the first and the only attempt.

@kubeflow-bot
Copy link

This change is Reviewable

@k8s-ci-robot
Copy link

Hi @akirillov. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gaocegege
Copy link
Member

/ok-to-test

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thanks for your contribution! 🎉 👍

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akirillov Thank you for doing this!
Maybe we could make this change directly in v1beta1, since v1alpha3 release was already cut?
What do you think @johnugeorge @gaocegege ?

@@ -35,6 +38,11 @@ const (
// which is used to run healthz check using grpc probe.
DefaultGRPCService = "manager.v1alpha3.Suggestion"

// DefaultGRPCRetryCount is the the maximum number of retries for gRPC calls
Copy link
Member

@andreyvelich andreyvelich Jun 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have DefaultGRPCRetryAttempts in comment here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks. Fixed that.

@akirillov akirillov force-pushed the add-grrpc-retry-to-suggestion-controller branch from bd117e3 to 00a834c Compare June 30, 2020 18:35
@k8s-ci-robot k8s-ci-robot removed the lgtm label Jun 30, 2020
@akirillov
Copy link
Contributor Author

thanks for the review, @gaocegege, @andreyvelich! Speaking of adding the changes to both alpha and beta, I've noticed that even though the image tags in this PR kubeflow/manifests#1239 were updated to 0.9.0 release, the images are still using v1alpha3 version e.g.

images:
  - name: gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller
    newTag: 917164a

Is it intentional?

@andreyvelich
Copy link
Member

thanks for the review, @gaocegege, @andreyvelich! Speaking of adding the changes to both alpha and beta, I've noticed that even though the image tags in this PR kubeflow/manifests#1239 were updated to 0.9.0 release, the images are still using v1alpha3 version e.g.

images:
  - name: gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller
    newTag: 917164a

Is it intentional?

Yes, the current Katib release for Kubeflow 1.1 is v1alpha3 with 917164a commit. This is the cut release: https://github.com/kubeflow/katib/releases/tag/v0.9.0.
Currently, we make all new changes only for v1beta1 version and we will try to release v1beta1 in the next Kubeflow release.
We update files for v1alpha3 version only with the urgent bug fixes.

@akirillov
Copy link
Contributor Author

Thanks for the clarification, @andreyvelich. So what's the decision, shall I revert the changes for v1alpha3 and keep them only in v1beta1? Until v1beta1 is released, the fix can be helpful for the users running Katib v1alpha3 on Istio 1.5+ because, without it, all created Experiments are marked as failed after ValidateAlgorithmSetting gRPC call fails with 503 and no retry.

@andreyvelich
Copy link
Member

andreyvelich commented Jul 1, 2020

@akirillov What do you think about adding this change to v1alpha3 and v1beta1 version.
I am not sure if we can include this change to the Katib v0.9.0, which deployed with Kubeflow 1.1, because it is already cut and manifest repository is updated.

But if users come with this problem, they just need to update Katib controller image to the latest v1alpha3 version and should be able to use Istio 1.5 +. They don't need to deploy v1beta1 version.

Let's see what others say.
What do you think @gaocegege @johnugeorge ?

@akirillov
Copy link
Contributor Author

thanks, @andreyvelich, this PR adds it to both versions and I think it is beneficial to have it that way to avoid waiting for v1beta1 release. This is the use case my team is dealing with and it would be great to have v1alpha3 image with the fix published to avoid living on the custom build from a fork until v1beta1 is out.

@andreyvelich
Copy link
Member

@akirillov Got it.

What do you think @gaocegege @johnugeorge ?
Can we just update image tags for the master branch in manifest repository?
I believe, we have kfdef config, which is related on master branch.

@gaocegege
Copy link
Member

I have no objection to it. If users (akirillov 's team) need it we can do it.

/cc @johnugeorge

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will update master manifests when we merge this PR.
/lgtm
/assign @johnugeorge

@johnugeorge
Copy link
Member

SGTM
/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit c3b38d8 into kubeflow:master Jul 6, 2020
sperlingxx pushed a commit to sperlingxx/katib that referenced this pull request Jul 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants