-
Notifications
You must be signed in to change notification settings - Fork 494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ClientRetryPolicy
: Adds Cross Regional Retry Logic on 429/3092
and 410/1022
#4677
ClientRetryPolicy
: Adds Cross Regional Retry Logic on 429/3092
and 410/1022
#4677
Conversation
ClientRetryPolicy
: Adds Cross Regional Retry Logic on 429
/ 3092
and 410
/ 1022
ClientRetryPolicy
: Adds Cross Regional Retry Logic on 429/3092
and 410/1022
Since you are updating the direct package in this PR it might be good to include a summary of the changes in the PR description as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - Thanks!
@@ -3,7 +3,7 @@ | |||
<ClientOfficialVersion>3.43.0</ClientOfficialVersion> | |||
<ClientPreviewVersion>3.44.0</ClientPreviewVersion> | |||
<ClientPreviewSuffixVersion>preview.0</ClientPreviewSuffixVersion> | |||
<DirectVersion>3.35.0</DirectVersion> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
410/1022 is directly coming from direct package right? No changes are needed.
How about a separate PR just with that scope?
In-general that's the pattern we are following where direct revision with features it brings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. However, if we are splitting this PR into two (410/1022 and 429/3092), then the Direct PR needed to go first since this work (329/3092) has dependency on the sub status codes, which is originated from the Direct package.
// Today, the only scenario where we would receive a ServiceUnavailableException from the Throttling Retry Policy | ||
// is when we get 410 (Gone) with sub status code 3092 (System Resource Not Available). Note that this is applicable | ||
// for write requests targeted to a multiple master account. In such case, the 410/3092 will get converted into 503. | ||
if (throttleRetryResult.ExceptionToThrow is ServiceUnavailableException) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious: Alternative is to do 429/3092 check before throttlingPolicy here explicitly, that seems direct and simpler right?
Also exception creation is expensive and throttle do last longer and overall impact will be higher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error combination we are trying to handle here is 429/3092
which itself is a scenario of a throttled response. Therefore, IMO ideal place to handle such cases would be the ResourceThrottleRetryPolicy
. This will avoid any unnecessary status combination check that should not belong to the ClientRetryPolocy
.
On creating and throwing the exception part, I will double check if this can be optimized, but from the scenario handling perspective, I feel ResourceThrottleRetryPolicy
is the ideal place.
@FabianMeiswinkel : Any thoughts on this ?
@@ -529,11 +529,12 @@ public bool ShouldRefreshEndpoints(out bool canRefreshInBackground) | |||
|
|||
public bool CanUseMultipleWriteLocations(DocumentServiceRequest request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its used in other scenario like evaluating for sessionToken sending or not.
SDK region count might be stale and it should not be used to make decision of not passing sesisonToken.
These are two different usecases and should be distinguished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally the background account refresh (InitializeAccountPropertiesAndStartBackgroundRefresh()
) should take care of the staleness isn't it ? Any way, will refactor the code to create another method just to safeguard the existing code path.
Pull request was closed
Pull Request Template
Description
This PR is covering two CRI scenarios in general:
The behavior changes are described below in detail:
Background: In general
410/1022
is used as a signal to indicate that a region is not part of the dynamic quorum - meaning in global strong with dynamic quorum all reads from the secondary region kicked out of the quorum will return410/1022
.Today when a read region loses the read lease it returns
410/LeaseNotFound
- the client will then retry locally for up-to60
seconds (for global strong) and up-to30
seconds (for others) (catch-all 410 retry like for 410/0 or connectivity related 410) for any.Instead the
410/LeaseNotFound(1022)
should be mapped immediately to a503
to allow the cross-regional retry to happen as quickly as possible. Retrying in the next preferred region like usual for reads is acceptable. This has been taken care in theCosmos.Direct
release3.36.0
and this PR is upgrading theCosmos.Direct
version to3.36.0
to reflect the change.For
429/3092 (Request Throttled/ SystemResourceUnavailable)
, we would only want to short-circuit this to retry cross-region quickly when we can actually retry cross region (not a write in single master) - and for reads only when it is returned from all replica in a region. The latter is extremely unlikely - so, I would probably compromise on a pragmatic approach and simply short-circuit by mapping429/3092
to a 503 for write operations when multi-master is enabled (and there is >1 write regions).That way we get the cross-region retry in the scenario where this is most critical.
Type of change
Please delete options that are not relevant.
Closing issues
To automatically close an issue: closes #4656, #4390