
Maximum number of retries aws sdk attempts for recoverable exceptions. #3965

Merged
merged 11 commits into hashicorp:master from mohsen0:feature/#3942-retry-logic
Feb 16, 2018

Conversation

mohsen0
Contributor

@mohsen0 mohsen0 commented Feb 13, 2018

Potentially this could be a configuration option too; the AWS SDK's DefaultRetryer has backoff logic for a set of recoverable exceptions.

Related issue: #3942

https://github.com/hashicorp/vault/blob/master/vendor/github.com/aws/aws-sdk-go/aws/client/client.go#L59
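
For reference, a minimal sketch (not this PR's actual diff) of how the SDK's retry count can be raised above the client default of 3 when building the session a backend uses; the region, the IAM call, and the value 10 are placeholders:

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/iam"
)

func main() {
	// DefaultRetryer retries recoverable errors (throttling, 5xx) with
	// exponential backoff, up to MaxRetries times, before returning the
	// error to the caller.
	cfg := aws.NewConfig().
		WithRegion("us-east-1"). // placeholder region
		WithMaxRetries(10)       // the value under discussion in this PR

	sess, err := session.NewSession(cfg)
	if err != nil {
		fmt.Println("session error:", err)
		return
	}

	svc := iam.New(sess)
	if _, err := svc.ListUsers(&iam.ListUsersInput{}); err != nil {
		fmt.Println("gave up after retries:", err)
	}
}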

@mohsen0 mohsen0 changed the title from "#3942 Maximum number of retries aws sdk attempts for recoverable exceptions." to "Maximum number of retries aws sdk attempts for recoverable exceptions." Feb 13, 2018
@jefferai
Member

The SDK default is 3. 30 seems like a lot...

@jefferai
Member

I'm actually now a bit confused by the purpose of this PR. The original issue was for Vault to attempt to retry failed requests. But the code you linked to for the client suggests that it's already using a client default of 3 retries. The fact that it's still failing probably means it is an error that actually should be propagated back.

@mohsen0
Contributor Author

mohsen0 commented Feb 13, 2018

True, but there is an option in the SDK to increase the number. Yeah, 30 is a lot; maybe 10 tries?

@mohsen0
Contributor Author

mohsen0 commented Feb 13, 2018

Changed it to 10 tries.

@jefferai
Member

If you turn this into something that can be configured in both the credential and secrets backend I will consider it, but I'm not going to merge something that sets the value to an arbitrary non-default static number, especially given that this can already be done via env vars.

@mohsen0
Contributor Author

mohsen0 commented Feb 13, 2018

Thanks, I will look into that.
But before that, what variables are you referring to?

@jefferai
Member

I didn't refer to variables -- can you point out what you're asking about?

@mohsen0
Contributor Author

mohsen0 commented Feb 13, 2018

" especially given that this can already be done via env vars."

@jefferai
Member

Sorry, brain fart. I think I mixed up the aws.UseServiceDefaultRetries with aws.UseEnvDefaultRetries.

@mohsen0
Contributor Author

mohsen0 commented Feb 14, 2018

Hi, 👋 Can you have a look?

@jefferai
Member

If the user does not set the parameter, its default will be zero, but that actually disables retries. I think you want to use aws.UseServiceDefaultRetries as the default value for the field.
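
A minimal sketch of that suggestion; the configuredMaxRetries helper is hypothetical and stands in for however the backend would read the operator's setting (it is not part of the SDK or of this PR):

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
)

// configuredMaxRetries stands in for reading the operator's setting from the
// mount's config; it is illustrative only.
func configuredMaxRetries() (int, bool) {
	return 0, false // pretend the operator left the field unset
}

func main() {
	// aws.UseServiceDefaultRetries (-1) means "not set by the user": the SDK
	// keeps its own per-service default instead of treating the zero value
	// as "retries disabled".
	maxRetries := aws.UseServiceDefaultRetries
	if v, ok := configuredMaxRetries(); ok {
		maxRetries = v
	}

	cfg := aws.NewConfig().WithMaxRetries(maxRetries)
	fmt.Println("MaxRetries:", aws.IntValue(cfg.MaxRetries)) // -1 when unset
}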

@mohsen0
Contributor Author

mohsen0 commented Feb 14, 2018

On another note, I noticed that kops (the Kubernetes tool for AWS) hardcoded this value to 13, using the same AWS SDK.
But now I understand that this solution is more elegant.
https://github.com/kubernetes/kops/blob/88984d4b477dbb59e124f0626e1cc23d9b5b0aef/upup/pkg/fi/cloudup/awsup/aws_cloud.go#L55

@mohsen0
Contributor Author

mohsen0 commented Feb 14, 2018

Updated the PR with the default value.

@joelthompson
Contributor

Adding my own $0.02 here, I'd like to echo what @jefferai said about 30 being excessive -- you'll really do yourself a favor if you track down what's causing all the excessive API traffic. Many things will be having these issues, not just Vault (and I speak from personal experience).

Also, having a retry count of 30 is probably counter-productive -- many people will just complain that Vault is "slow" under certain circumstances and either file impossible-to-track-down bugs or just stop using it. So, in most cases, it's probably better to actually fail out to let people know what's going on so they can correct it, rather than just constantly increase the retry count. 3 might be a bit low, but 30 is probably too high.

@mohsen0
Contributor Author

mohsen0 commented Feb 14, 2018

I agree; I was just testing the waters with 30 :) 30 is a lot. I have already looked into other possible causes too, but practically speaking, with 3 retries the backoff functionality in the SDK cannot do much.

Regarding the sleep: the AWS secrets backend and auth method both talk to the AWS API, and I think a reasonable person will agree that matters in terms of responsiveness.

Other than that, this is the first time I have heard about Vault being slow.
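
To put rough numbers on the point about 3 retries: the DefaultRetryer linked above sleeps on the order of 2^retryCount * rand(minTime, 2*minTime) milliseconds between attempts, with minTime around 30ms (500ms when throttled). A back-of-envelope sketch that approximates that formula rather than reproducing the SDK's exact code:

package main

import (
	"fmt"
	"time"
)

// worstCaseBackoff approximates the upper bound of the DefaultRetryer's
// sleep before retry n (0-indexed): roughly 2^n * 2*minTime.
func worstCaseBackoff(n int, minTime time.Duration) time.Duration {
	return time.Duration(1<<uint(n)) * 2 * minTime
}

func main() {
	minTime := 30 * time.Millisecond // non-throttled case
	for _, maxRetries := range []int{3, 10} {
		var total time.Duration
		for n := 0; n < maxRetries; n++ {
			total += worstCaseBackoff(n, minTime)
		}
		fmt.Printf("max %2d retries: at most ~%v of total backoff\n", maxRetries, total)
	}
}

Under those assumptions, 3 retries add well under half a second of backoff in total, while 10 retries can add on the order of a minute.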

@mohsen0
Contributor Author

mohsen0 commented Feb 15, 2018

Please let me know if anything else needs to be done on my end.
Regarding the build failure: I started my branch from the v0.9.3 tag and ran the tests for each package, but I cannot figure out why the build is failing; it seems to be unrelated to this change.

$ make test TEST=./builtin/logical/aws
==> Checking that code complies with gofmt requirements...
==> Checking that build is using go version >= 1.9.1...
go generate
ok  	github.com/hashicorp/vault/builtin/logical/aws	0.022s
$ make test TEST=./builtin/credential/aws
==> Checking that code complies with gofmt requirements...
==> Checking that build is using go version >= 1.9.1...
go generate
ok  	github.com/hashicorp/vault/builtin/credential/aws	0.531s

@jefferai jefferai added this to the 0.9.4 milestone Feb 15, 2018
@mohsen0
Contributor Author

mohsen0 commented Feb 16, 2018

Thanks for the help; all the checks have passed.

@jefferai
Member

Thanks!

@jefferai jefferai merged commit 9db39c8 into hashicorp:master Feb 16, 2018
@mohsen0 mohsen0 deleted the feature/#3942-retry-logic branch February 16, 2018 16:41
@mohsen0
Contributor Author

mohsen0 commented Feb 16, 2018

When do you think we will have 0.9.4?

@jefferai
Member

Soon.
