Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry mechanism for put call #18137

Merged
merged 6 commits into from
Aug 17, 2023
Merged

Add retry mechanism for put call #18137

merged 6 commits into from
Aug 17, 2023

Conversation

Juanadelacuesta
Copy link
Member

The SDK for the locking could benefit from a retry mechanism for calls made to the nomad server, this PR introduces this mechanism on top of the put call already present on the api.

Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Added some questions and think it would be good to have some more tests to cover functions such as the calculateDelay function.

api/retry.go Outdated
for attempt := uint64(0); attempt < rc.MaxRetries; attempt++ {
iDelay = rc.calculateDelay(attempt)

t := time.NewTimer(iDelay)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also will need a defer t.Stop() here, otherwise we could leak timers if the context is closed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing that might make sense is to instantiate the timer outside the loop and then call Reset on it when needed. Our helper.SafeTimer would be nice here but we don't want to pull that into this package.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, reuse the time, will do!

api/retry.go Outdated
return 0
}

new := rc.DelayBase << (attempt - 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this variable as it collides with the builtin function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

api/retry.go Outdated
Comment on lines 23 to 29
type retryOptions struct {
MaxRetries uint64 // Optional, defaults to 3
MaxBetweenCalls time.Duration // Optional, defaults to 0, meaning no time cap
MaxToLastCall time.Duration // Optional, defaults to 0, meaning no time cap
FixedDelay time.Duration // Optional, defaults to 0, meaning Delay is exponential, starting at 1sec
DelayBase time.Duration // Optional, defaults to 1sec
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The struct isn't exported but the parameters are, do we want to alter the parameters? This also makes me wonder if this is something we would want to make available to external users of the API package?

Could we also add some added descriptions on the parameters? I had to read the code to understand MaxBetweenCalls the first time.

return wm, err
}

func isCallRetriable(statusCode int) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing as this function is available to the whole API package, I wonder if it needs a more descriptive name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the best I could come up with jejej any ideas?

api/retry.go Outdated
var err error
var wm *WriteMeta

iDelay := time.Duration(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iDelay for initial delay? If so, we use it for all delays, so maybe delayDuration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was for iteration delay, but lets change it for attempt delay :)

api/retry.go Outdated
Comment on lines 31 to 61
func newRetryClient(c client, opts retryOptions) *retryClient {
rc := &retryClient{
c: c,
retryOptions: retryOptions{
MaxRetries: 3,
DelayBase: time.Second,
},
}

if opts.DelayBase != 0 {
rc.DelayBase = opts.DelayBase
}

if opts.MaxRetries != 0 {
rc.MaxRetries = opts.MaxRetries
}

if opts.MaxBetweenCalls != 0 {
rc.MaxBetweenCalls = opts.MaxBetweenCalls
}

if opts.MaxToLastCall != 0 {
rc.MaxToLastCall = opts.MaxToLastCall
}

if opts.FixedDelay != 0 {
rc.FixedDelay = opts.FixedDelay
}

return rc
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retryClient isn't exported, so it's only available for Nomad-internal uses. I think that's probably fine for now (until we're happy with our own uses of it). But from the standpoint of API design, maybe it would make sense to have all this functionality in the standard Client by adding it to the Config struct as an unexported field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not sure I know what you mean by adding as a Config

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we added the retryOptions to api.Config then we wouldn't need to have a special retryClient struct, and the standard api.Client could use the retry options whenever it's been configured that way.

type Config struct {
  // ... there's a bunch of existing field here
  retry retryOptions
}

And then later if we want to, we could make the retry options public for all api package consumers (ex. could the Terraform provider use this?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this last night and there's no real reason to hold this PR up for this though. The new code you've added isn't exported, so it's not part of the public API and will be safe to update later on without breaking backwards compatibility. Let's ship and do any changes like this later if we want.

api/retry.go Outdated
for attempt := uint64(0); attempt < rc.MaxRetries; attempt++ {
iDelay = rc.calculateDelay(attempt)

t := time.NewTimer(iDelay)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing that might make sense is to instantiate the timer outside the loop and then call Reset on it when needed. Our helper.SafeTimer would be nice here but we don't want to pull that into this package.

Comment on lines +115 to +117
return statusCode > http.StatusInternalServerError &&
statusCode < http.StatusNetworkAuthenticationRequired ||
statusCode == http.StatusTooManyRequests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately there are still a bunch of places in the HTTP server where we return 500 when we should be returning 400. 😊 But I don't think we can fix all of those in this PR. So long as we're not exporting this for now I think that's fine.

api/retry.go Outdated
type retryOptions struct {
MaxRetries uint64 // Optional, defaults to 3
MaxBetweenCalls time.Duration // Optional, defaults to 0, meaning no time cap
MaxToLastCall time.Duration // Optional, defaults to 0, meaning no time cap
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we really need this if we're passing context.Context to retryPut. The caller can cancel the work however they see fit there, and it would be easy to misconfigure this to be shorter than any context timeout.

Copy link
Member Author

@Juanadelacuesta Juanadelacuesta Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of the lock lease on this, how it doesn't make sense to retry a call after the TTL passed, the servir will expire the lock by then anyways.

api/retry.go Outdated
return 0
}

newDelay := c.config.retryOptions.delayBase << (attempt - 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will overflow quite quickly similarly to what is shown here: #18200

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. @Juanadelacuesta note that we don't want to pull in the helper into this api package, so we'll need to duplicate the logic we added in #18200 here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we set a limit to the max number of attempts?

api/retry.go Outdated
Comment on lines 124 to 128
// Ensure that a big attempt number or a big delayBase number will not cause
// a negative delay by overflowing the delay increase.
if math.MaxInt64/c.config.retryOptions.delayBase.Nanoseconds() < (1 << attempt) {
return defaultMaxBackoffDelay
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work to protect against overflow in the bitshift, because it does the bitshift in order to do the check. Without a low cap on the number of attempts it'd be easy to cause this to overflow. See https://go.dev/play/p/DztXnyHL9hk for the results.

Instead, we should be tracking the current delay in the caller and returning early if that delay exceeds the max backoff (or default max backoff, if no max backoff is set).

@Juanadelacuesta Juanadelacuesta merged commit adfcb1b into f-gh-17449 Aug 17, 2023
23 checks passed
@Juanadelacuesta Juanadelacuesta deleted the f-gh-17449-sdk branch August 17, 2023 11:37
Juanadelacuesta added a commit that referenced this pull request Sep 12, 2023
* func: add retry mechanism for put call

* style: remove debg print

* fix: move timer outside of main loop, add descriptions

* func: make the retryPut a method of the API client directly

* func: add verification to avoid negative delays

* func: change the delay capping mechanism to checking for an attempt that wont cause the delay to overflow
Juanadelacuesta added a commit that referenced this pull request Sep 13, 2023
* func: add retry mechanism for put call

* style: remove debg print

* fix: move timer outside of main loop, add descriptions

* func: make the retryPut a method of the API client directly

* func: add verification to avoid negative delays

* func: change the delay capping mechanism to checking for an attempt that wont cause the delay to overflow
Juanadelacuesta added a commit that referenced this pull request Sep 15, 2023
* func: add retry mechanism for put call

* style: remove debg print

* fix: move timer outside of main loop, add descriptions

* func: make the retryPut a method of the API client directly

* func: add verification to avoid negative delays

* func: change the delay capping mechanism to checking for an attempt that wont cause the delay to overflow
Juanadelacuesta added a commit that referenced this pull request Sep 21, 2023
* func: add retry mechanism for put call

* style: remove debg print

* fix: move timer outside of main loop, add descriptions

* func: make the retryPut a method of the API client directly

* func: add verification to avoid negative delays

* func: change the delay capping mechanism to checking for an attempt that wont cause the delay to overflow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants