Honouring retries from 429 response #247
@mcquiggd. Yes, that was pseudo-code-ish (have re-labelled). Thanks for identifying the issues! Let's see if we can sort it out. I'd also been meaning to ask @jmansar (who originally requested the feature, #177) or @tcsatheesh (who was also interested in using it with DocumentDb) if they wanted to expand on my blogged (pseudo-) code: both, feel free to chip in! More coming shortly. |
@mcquiggd Having spent a couple of hours on this, I am not convinced we have adequate support within Polly yet for the DocumentDB RetryAfter use case. I can see various ways forward to support this better in Polly, but need more time to consider. A possible workaround for now with Polly (using microsoft.azure.documentdb.core) might be as below:
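(A sketch only - untested; the 429 filter, retry count and the commented usage are illustrative.)

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Polly;

// Sketch: retry on 429, bypassing Polly's own sleep and instead awaiting the
// server-advised RetryAfter interval inside the onRetryAsync delegate.
var retryPolicy = Policy
    .Handle<DocumentClientException>(e => (int?)e.StatusCode == 429)
    .RetryAsync(
        retryCount: 3,
        onRetryAsync: async (exception, attempt) =>
        {
            var dce = (DocumentClientException)exception;
            await Task.Delay(dce.RetryAfter);   // honour the server-specified interval
        });

// usage (illustrative):
// await retryPolicy.ExecuteAsync(() => client.CreateDocumentAsync(collectionUri, document));
```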
Clearly this is sub-optimal: we are bypassing Polly's in-built sleep, and sleeping in the onRetryAsync delegate instead. I haven't had time to test the above tonight and would be interested in feedback on whether it works / gets you closer - that would help guide how we wrap this up into a more off-the-shelf/ready-to-go solution! Thanks! |
Note: the Microsoft Patterns & Practices team recommend the in-built retry for DocumentDb - is this relevant? There is also more in this documentation. Would be great to hear if this covers it, or if there's a reason it's still worth us pursuing this use case in Polly - that will help guide where to place effort. Many thanks! EDIT: I guess this paragraph in the linked documentation suggests a reason why handling this in Polly could still be useful ... |
Many thanks for taking the time to look into this - it's appreciated. We had sort of come to the same point. My thoughts so far:
Still thinking about all this... as you mention in your edit, I feel there is a role for Polly to play in this. I need to ponder it for a while and try some ideas over a cup of tea ... ;-) |
24 hours is a long time in technology - now we have Azure Cosmos DB... 🥇 From the documentation:
Having looked at the DocumentClient source code, I am currently testing an approach that disables the built-in retries, and uses Polly instead. Will report back if I make progress. |
Thanks for the thoughtful commentary @mcquiggd ^^ (two up). Agree on all counts! Re your point 3 - wrapping a bunch of policies around a call as a small abstraction - check out Polly's PolicyWrap. And +1 to your idea.
We would love to hear more about this later, if it grows and you would like to share. We would love to e.g. blog/guest-blog about it, or share samples perhaps (duly credited) via a Polly.Contrib. (Contact me on Polly slack later if you want to discuss.)

One issue with the in-built retries for the various Azure services is that they are all different APIs - good in the sense that each is probably a good fit/strategy for the particular service, but a slightly fragmented API experience. Another issue with combining Polly policies with Azure's in-built retries is that, with Polly-wraps-Azure-retry, you can effectively only have the retry innermost (out of the set of policies). However, you might want to have certain policy types inside a retry in some configurations.

Re Topaz and the Enterprise Application Blocks: yes, it feels like it is well on the way out. The Microsoft patterns and practices team are now moving away from recommending Topaz, in favour of Polly (I've provided input on the Polly samples). |
Well, I have made a little progress. Firstly, I have taken the DocumentDB Benchmark project (available from the DocumentDB GitHub repo, under samples), copied it and converted it to .Net Core, so we have both .Net 4.x and .Net Core 1.x versions in the same solution. I am using this as a testbed for my 'experiments', with the latest DocumentDB emulator set to use a single, non-partitioned collection of 400 Request Units capacity, the 'slowest'. I deliberately set the number of threads and the number of documents to insert to cause throttling. With the DocumentDB client retry enabled, throttling was handled automatically. Next, I disabled the DocumentDB client retries; of course, exceptions occurred. Bear in mind that these settings are 'global' if you follow the sound advice of using a single client instance, which brings a lot of performance benefit, but is also a little inflexible...
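For reference, switching off the client's built-in throttling retries looks roughly like this (endpoint/key placeholders are illustrative):

```csharp
using System;
using Microsoft.Azure.Documents.Client;

// Turn off the built-in 429 retries so Polly can own the retry behaviour.
// Note: this applies to the whole (shared) DocumentClient instance.
var connectionPolicy = new ConnectionPolicy
{
    RetryOptions = new RetryOptions
    {
        MaxRetryAttemptsOnThrottledRequests = 0,
        MaxRetryWaitTimeInSeconds = 0
    }
};

var client = new DocumentClient(new Uri(endpointUrl), authorizationKey, connectionPolicy);
```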
Then, I created a Polly Retry Policy, as follows:
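(A simplified sketch - the Handle clause and retry count are illustrative, and this assumes the WaitAndRetryAsync overload whose sleepDurationProvider receives the Context.)

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Polly;

// Sketch: onRetryAsync stores the server's RetryAfter in the Context;
// sleepDurationProvider reads it back, falling back to a small default
// when nothing has been captured yet (e.g. on the first retry attempt).
var documentDbRetryPolicy = Policy
    .Handle<DocumentClientException>(e => (int?)e.StatusCode == 429)
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: (attempt, context) =>
            context.ContainsKey("RetryAfter")
                ? (TimeSpan)context["RetryAfter"]
                : TimeSpan.FromMilliseconds(100),
        onRetryAsync: (exception, sleep, attempt, context) =>
        {
            // capture the server-advised interval for use by the next attempt
            context["RetryAfter"] = ((DocumentClientException)exception).RetryAfter;
            return Task.CompletedTask;
        });
```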
So, basically using Context to pass the retry value from the onRetryAsync to the sleepDurationProvider, and using purely Polly to handle waits etc. And now we can create different policies / settings for different 'operations', using the same client instance, potentially passing the 'operation name' in Context from the call to Execute the Policy... if my understanding of the Retry life-cycle is correct. I have noticed that retry attempt 1 does not seem to include a server retry-after value, but I had limited time today to verify. This Policy is then called in the multi-threaded benchmarking app as below:
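(Sketch - client, collectionUri and document are placeholders from the benchmark code.)

```csharp
// Sketch: pass per-operation data (here the document id) to the policy's delegates via Context.
var context = new Polly.Context("InsertDocument");
context["DocumentId"] = document.Id;

await documentDbRetryPolicy.ExecuteAsync(
    ctx => client.CreateDocumentAsync(collectionUri, document),
    context);
```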
The result: Polly handles all the retries, observing the returned retry interval from the server. I'll continue to work on this (it can definitely be tidied up; I have a lot of distractions around at the moment), and upload the solution to GitHub in case others wish to contribute. Perhaps you can spot any obvious mistakes - e.g. is it valid to create the Policy as static readonly in a multithreaded environment? I want to create some more advanced examples for the different scenarios available with Polly, along the lines of your last post... will pop onto Slack at some point... |
Made some changes this morning, including storing the id of the Document to be inserted via Context when ExecuteAsync is called; that functions as expected, with the data subsequently being available within the Policy.

A1. I changed the Policy declaration to be simply private - i.e. not static, not readonly - and observed no change in behaviour.
A2. I have also set the parameter continueOnCapturedContext on ExecuteAsync to true - this does not appear to have any impact on results.
A3. I have verified that on retry attempt 1, there is no RetryAfter available in the Context, although one is returned from the server.
A4. On investigation, this is due to the sleepDurationProvider being called before the onRetryAsync delegate, which is not what I expected from the Retry lifecycle.
A5. I don't see an override of the sleepDurationProvider that provides access to the Exception / handled object - which would allow simplified logic in this instance. Any thoughts...? |
@mcquiggd Great to see this all coming together! 👍

A1. All Polly policies are fully thread-safe, so you're absolutely fine re-using an instance across multiple calls (if relevant to any decisions around this).

A2. Whether you want continueOnCapturedContext: true is about your threading/SynchronizationContext requirements; it shouldn't change the retry behaviour.

A3/A4. Your analysis is exactly correct. Sleep duration is calculated before the onRetry/onRetryAsync delegate runs, so a RetryAfter captured into Context in onRetryAsync is only available to the sleepDurationProvider on the following attempt. Foresaw your A3 problem as soon as you posted this, btw, but ... busy day this end ... you beat me to the follow-up 😀 and figured it out, so great 👍 . This exact issue is why my example the day before was more circuitous!

(A5: comments in a mo.) |
A5. So, we could do this (add sleepDurationProvider overloads which also take the handled exception - or handled result - as an input parameter). You'd then have access to, e.g., DocumentClientException.RetryAfter directly when calculating the sleep duration. (Great to have this conversation and see the DocumentDb use case in detail: thanks for sharing!) |
That would work very nicely - and I was going to suggest it: the existing number of overloads was a little intimidating at first, and of course changing the execution behaviour and breaking backwards compatibility is a non-starter. |
Added some random thoughts:

B1. A Policy can handle multiple exception types and/or handled results - how do the delegates then know which they have been given?
B2. As we are using an initial generic Handle<TException> clause, could that type be flowed through to the other delegates (sleepDurationProvider, onRetry) in a strongly-typed way?
B3. If we are able to pass the handled exception/result into the sleepDurationProvider, the logic in this use case could be simplified considerably. |
B1/B2 Handled results: When a policy is a strongly-typed generic Policy<TResult>, the handled result is already strongly typed - delegates receive it via DelegateResult<TResult>, so there is no ambiguity there. It would mean creating a new generic-typed set of sleepDurationProvider overloads for the result case. (thoughts on other categories coming separately) |
B1/B2 Handling multiple exception types: Yes, a single policy can handle multiple unrelated exception types (say, DocumentClientException and some other, entirely unrelated exception type). That power does make providing a strongly-typed exception parameter to delegates such as sleepDurationProvider difficult - the only common type across all the handle clauses is Exception. The only option (which would be a major API change) would be moving the exception-type specification into the type of the policy itself, which would lose the ability to handle several unrelated types in one policy. At this stage, I'm left thinking that the need (sometimes) to disambiguate which exception you have received is a necessary corollary of the power to handle multiple exception types, encapsulated nicely in a single Policy. Open to thoughts, though, from anyone, if we're missing some clever alternative? (This discussion excludes handled results - covered in the previous comment.)

EDIT: Another option that does exist, with existing Polly, is to specify separate policies for each exception type; you can use the same policy type more than once in a wrap. Where the policies are split this way, each policy's delegates know exactly which exception type they are dealing with.
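For example (illustrative exception types, retry counts and sleep durations):

```csharp
using System;
using Microsoft.Azure.Documents;
using Polly;

// Illustrative: one retry policy per exception type, combined with PolicyWrap,
// so each policy's delegates know exactly which exception type they caught.
var documentDbRetry = Policy
    .Handle<DocumentClientException>(e => (int?)e.StatusCode == 429)
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(100 * attempt));

var timeoutRetry = Policy
    .Handle<TimeoutException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(1));

var resilienceStrategy = Policy.WrapAsync(documentDbRetry, timeoutRetry);
```
|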
Re the case where the handled exception arrives wrapped inside an AggregateException: it would probably be possible to introduce handle clauses [*] along the lines of HandleInnerOfAggregate<TException>() and/or HandleAnyInnerOfAggregate<TException>() (placeholder names - neither exists in Polly today). These could extract a matching (single; or any) exception from within the AggregateException and treat it as the handled exception. Do Polly users think this would be useful? The alternative - changing the behaviour of the existing Handle<TException>() clauses to also match exceptions inside an AggregateException - would be a breaking change.

[*] Long, intention-revealing names intentionally used - other suggestions welcome! |
As an aside @mcquiggd, do we know that the DocumentDB client actually throws both DocumentClientException directly and DocumentClientException wrapped inside AggregateException? I saw this dual approach in a code example when researching, but it could have started a spurious hare ... |
Indeed - Stephen Cleary states here that awaiting a faulted Task rethrows the original exception, whereas blocking on it (e.g. via .Wait() or .Result) throws an AggregateException wrapping it.
So, there is a need to handle both possibilities. This is, I believe, how the Transient Fault Handler works. I will look at the code for it tomorrow, and will drop a message to a person I know on the DocumentDB team to see if they have any insight...

I have some time this week to devote to this - I will try to create an example of what I envisaged for my own application: basically the same sort of approach as the Transient Fault Handler that decorates the DocumentDB client, but with Polly policies. These would be specified at client instantiation as 'global' policies, and optionally overridden by operation-specific policies passed when calling a method. So basically there would be a library of extension methods which wrap each type of client and return a 'Resilient' instance.

If we cannot find an elegant way to enhance the existing Polly approach to AggregateExceptions without breaking backwards compatibility, I would probably look at adding a try/catch block around the actual DocumentDB client call, extracting the DocumentClientException and rethrowing that to the ExecuteAsync method of the Policy, within the 'Resilient' decorator. For my project I also intend to add such features as concurrency handling (e.g. handling etag mismatch in DocumentDB) where appropriate.

That's just a summary of my initial thoughts... at this point I prefer to try to write some code to see how it 'feels', and adjust accordingly... I'll share it so we can discuss it... David |
Hi @mcquiggd. Long familiar with async, TPL and their exception behaviour - I should have been (apologies) more precise in my previous question. The question in my mind was whether any single execution through the DocumentDB client might fail with either a DocumentClientException or an AggregateException wrapping one, depending on circumstances - i.e. whether one policy would need to cater for both. Agree though that it's somewhat theoretical until we see how your code patterns pan out - cranking out some real code 👍 |
Yep, got it - that's why I wanted to look at the Transient Fault Handler code and DocumentClient code, to see under what circumstances (and why) there are AggregateExceptions. I put the background in for others reading the thread :) It would be a perfectly reasonable theory that the async methods on DocumentClient throw AggregateExceptions, and the non-async methods throw DocumentClientExceptions. But as I say, now it's time to code / examine code and see how things work ;) |
@mcquiggd Polly v5.6.0 (uploading to nuget in the next day or so) will include the ability for the sleepDurationProvider to receive the handled exception (or handled result). This should make it even easier to use the RetryAfter returned with DocumentDB/Cosmos DB 429 responses to set the retry interval. |
Hi @reisenberger, is this code valid? If not, do you have a valid solution available somewhere? Thanks for your help
|
@ranouf We added overloads where the sleepDurationProvider also receives the handled exception (or handled result), so you can read DocumentClientException.RetryAfter directly when calculating the sleep duration. |
Hi, thanks @reisenberger. For the next person looking for sample code:
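(A simplified sketch using the overload whose sleepDurationProvider receives the handled exception; the 429 filter, retry count and commented usage are illustrative.)

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Polly;

public static class DocumentDbPolicies
{
    // Retry on 429, sleeping for exactly the interval the server advises
    // via DocumentClientException.RetryAfter.
    public static IAsyncPolicy TooManyRequestsRetry { get; } = Policy
        .Handle<DocumentClientException>(e => (int?)e.StatusCode == 429)
        .WaitAndRetryAsync(
            retryCount: 5,
            sleepDurationProvider: (attempt, exception, context) =>
                ((DocumentClientException)exception).RetryAfter,
            onRetryAsync: (exception, timeSpan, attempt, context) => Task.CompletedTask);
}

// usage (illustrative):
// await DocumentDbPolicies.TooManyRequestsRetry.ExecuteAsync(
//     () => client.CreateDocumentAsync(collectionUri, document));
```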
Let me know if you see something to improve :) |
Hi @ranouf, I'm new to Polly and am struggling with the sync version of your above code. Here's my first attempt, but it's not clear to me what I should use for ???:
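```csharp
// Sketch of my attempt (sync) - the ??? marks the parameter I don't know what to supply for.
var retryPolicy = Policy
    .Handle<DocumentClientException>(e => (int?)e.StatusCode == 429)
    .WaitAndRetry(
        retryCount: 5,
        sleepDurationProvider: (attempt, exception, context) =>
            ((DocumentClientException)exception).RetryAfter,
        onRetry: ???);   // <-- what should go here?
```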
It seems to be asking for the delegate being retried - how can that be accessed there? Also, on a side note, do you (or anyone watching this thread) know if 449s (Transient Errors) include a RetryAfter value? Many thanks,
@briansboyd In the original:
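```csharp
// the do-nothing onRetryAsync parameter from the sample above
onRetryAsync: (exception, timeSpan, attempt, context) => Task.CompletedTask
```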
just denotes an async function which does nothing. To create an equivalent do-nothing delegate for a sync function, you might use:
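```csharp
// an equivalent do-nothing onRetry for the sync overload - it's an Action, so there's no return value
onRetry: (exception, timeSpan, attempt, context) => { }
```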
More realistically, you might use the onRetry delegate to log that the retry is happening. Out of interest: is there a sync API on the Azure CosmosDB SDK which is throwing these 429 DocumentClientExceptions? |
Edit: heh, started replying hours ago & got really distracted.. glad i got the same answer.. @briansboyd i guess you don't actually have to do anything in the onRetry if you don't want to. as for the synchronous version it's an action, and not a func, so it expects no return.. i think? it's a shame that the only synchronous signature that lets you access the exception in the sleepDurationProvider also requires an onRetry, but there is already an absolute slew of overloads and it's really quite confusing.. so you could just do:
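```csharp
// sketch - same 429 filter as above, with a do-nothing onRetry
var retryPolicy = Policy
    .Handle<DocumentClientException>(e => (int?)e.StatusCode == 429)
    .WaitAndRetry(
        retryCount: 5,
        sleepDurationProvider: (attempt, exception, context) =>
            ((DocumentClientException)exception).RetryAfter,
        onRetry: (exception, timeSpan, attempt, context) => { });
```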
i tend to log what i'm doing though 🙂
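```csharp
// e.g. a logging onRetry (the logging call itself is illustrative)
onRetry: (exception, timeSpan, attempt, context) =>
    Console.WriteLine($"429 received; retry {attempt} after {timeSpan.TotalMilliseconds}ms")
```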
|
Thanks @reisenberger and @m1nkeh for your responses. I had guessed along those lines - an empty, do-nothing onRetry.
Although I haven't explicitly seen this error being thrown (with the RetryAfter property set), I'm asking about the Create*Query(...) set of methods. The documentation here doesn't distinguish between sync and async, so I'm hopeful this property will be populated; of course, it would be great to hear otherwise. Given that 429s and 449s are the exceptions to retry for Cosmos DB, I'm surprised that 449 wasn't included in this discussion. Maybe Microsoft handles this and wraps 449s into DocumentClientExceptions with a RetryAfter included? I'll ask our Microsoft support contact this question and post back the answer. Cheers,
Here's the response from Microsoft's Cosmos DB Product Group: "The short answer is yes and it is wrapped. The 449 errors will be wrapped in a document client exception with the RetryAfter property set. All exceptions from Cosmos DB are wrapped in a DocumentClientException. The 449 errors are internally retried by the SDK. If the users would like to perform more retries on their side, similar to how they handle 429, there is a RetryAfter property on the DocumentClientException that can be used to implement that backoff." Looks to be good news all around. Cheers,
Thanks @briansboyd . Re the sync/async question, it may be worth noting that sync and async executions go through separate policy configurations and methods - WaitAndRetry with Execute(...) for sync delegates, and WaitAndRetryAsync with ExecuteAsync(...) for async delegates - so the form you configure needs to match how you are calling Cosmos DB.
(Apologies if this is stating something known already - just for clarity.) |
Thanks @reisenberger for the clarity! I had missed that point. Just wanna also say thanks for the quality of your responses throughout this thread - they helped me better understand how Polly works. What I learned from studying this thread, and the call out to Microsoft Support, was that 30 secs of retry is fine for our use case. Any more than that and we need to just kick down and pay for more RUs :) |
I am attempting to capture a 429 response, and the advised retry interval from the HTTP response, as outlined in the blog post here.
I am using .Net Core, with version 5.1 (current latest) of Polly.
I have started trying to get a simple HTTPResponse captured, but ultimately I want to use Polly to handle retries for the .Net Core DocumentDB client.
The example code in the blog post is as follows:
Unfortunately, this particular item seems to be more pseudocode than a usable example - there is no such named parameter as sleepDurationProvider for RetryAsync. So, I have tried WaitAndRetry instead. However, it is becoming confusing, as there are so many overloads with different signatures, their availability depends on the complete signature of the Policy, and some of the examples in the documentation do not compile.
So, I am feeling as if I am just hacking things together in an inappropriate way ;)
I would really appreciate it if someone can point me in the right direction to get this feature working, as intended.