Implementation for retrying requests #57
Conversation
manicminer commented Jun 2, 2021 (edited)
- Attempt to work around eventual consistency issues, at the cost of some performance and potentially unnecessary requests.
- Request input structs can now contain a ConsistencyFailureFunc, which determines whether a request should be retried on failure.
- ConsistencyFailureFuncs receive a copy of the response and the parsed OData with which to make such decisions.
- A canned RetryOn404ConsistencyFailureFunc simply retries requests that receive a 404 response.
- This mitigates some eventual consistency issues, such as manipulating or referencing a newly created object, but doesn't help when listing objects.
- There's an example of a custom func in the ServicePrincipalsClient{}.Create() method, which indicates a retry when the referenced application isn't found.
@romainDavaze @tsologub Do you have any thoughts on this approach? Is there a risk to replaying requests that I haven't realized? In the end I figured it was probably better to enable this retry mechanism by default, but it can be disabled explicitly. It's unlikely the SDK can shield apps from all consistency issues (e.g. the List methods can often be missing newly created objects), but hopefully most of the common delay-related errors can be handled.
This seems to be a good way to implement the retry mechanism. To me, the only risk is that the end user might be a bit thrown off by the delay of function calls in some cases. I think it would be wise to add a note somewhere in the documentation explaining this behavior: why we do this and how it can be turned off. This will introduce more delay in SDK calls, which is not ideal, but I'd rather have this than something quicker that has a chance to fail.

I will gladly take a look on Monday (June 7). Currently on a short vacation.

Great job! I like it, especially turning this feature on by default. 👍 Yeah, it can be slow, but a user can turn it off explicitly. Btw, I found other places where the retries can be useful as well, e.g.
msgraph/client.go (outdated)

```go
		return nil, status, nil, fmt.Errorf("reading request body: %v", err)
	}
}

var attempts, backoff, multiplier int64
for attempts = 0; attempts < requestAttempts; attempts++ {
```
It is slightly hard to understand for how long the retries will be performed. Correct me if I am wrong:

- we have 10 attempts.
- every time we have a consistency error, the backoff is set to 2.
- in between attempts, we sleep for 2 seconds.

i.e., at most, we are retrying for ~20 seconds, right?

Some time ago, I stumbled upon this video, where the author talks about 1-2 minutes for replication to propagate fully. We are provisioning applications and service principals daily, and I can say it is statistically about 2 minutes, i.e. we stopped receiving errors when we increased retries to the 2-minute mark.

I am afraid that ~20 seconds won't be enough. Or, we can start with something (20 seconds) and see how it goes.
I've been trying to strike a balance between performance and consistency. Ultimately the API doesn't guarantee consistency, so you should always check defensively, and 20 seconds seems to smooth over replication most of the time, i.e. in the best conditions. But if you think the retries should be a bit more persistent, that's totally fine; we can extend it. I agree around 2 minutes is a reasonable milestone; anecdotally it's fairly rare for it to be longer than that, IME. We should probably add some backoff in that case (but still be fairly aggressive).
Thanks! I noticed a few methods I missed, I'll add those 👍
- Retry in more methods
- Use exponential backoff for retrying due to consistency failure
- Capped at just over 2 minutes
Thanks all for the time to review 🙌