gRFC A6: Retries #12
Conversation
Just a few cosmetic issues to deal with here.
A6.md
Outdated
* Author(s): [Noah Eisen](https://github.com/ncteisen) and [Eric Gribkoff](https://github.com/ericgribkoff)
* Approver: a11r
* Status: Draft
* Implemented in: All Languages
I think this line is intended to indicate where this proposal is already implemented. Given that we haven't yet implemented this in any language, I think this field should be blank.
Done
A6.md
Outdated
The maximum number of retry attempts per original RPC. This can be specified in the retry policy as follows:

```
'max_retry_attempts': 3
```
The example for the hedging policy below shows the parameters wrapped in a 'hedging_policy' block. For consistency, the retry examples should be wrapped in a 'retry_policy' block.
Done
A6.md
Outdated
1. Retry policy: Retry in *n* ms. If this attempt also fails, the retry delay will reset to the initial backoff for the following retry (if applicable)
2. Hedging policy: Send next hedged request in *n* ms. Subsequent hedged requests will resume at `n + hedging_delay`

![State Diagram](A6_graphics/StateDiagram.png)
It would probably be a good idea to include the SVG version of these diagrams as well as the PNG, so that we can edit them in the future if necessary.
@ncteisen - this request is buried on github after I pushed the latest commit. Do you have SVG versions of the diagrams we can add?
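The pushback semantics quoted in the hunk above (honor the server-specified delay of *n* ms, then reset to the initial backoff for the following retry) can be modeled with a small sketch. This is illustrative only: the parameter names mirror the doc's config keys, but the `RetryState` class and its methods are invented for the example, and the jittered `random(0, backoff)` delay is an assumption about the backoff formula.

```python
import random


class RetryState:
    """Hypothetical model of retry scheduling with server pushback."""

    def __init__(self, initial_backoff_ms=100, multiplier=2, max_backoff_ms=1000):
        self.initial_backoff_ms = initial_backoff_ms
        self.multiplier = multiplier
        self.max_backoff_ms = max_backoff_ms
        self.backoff_ms = initial_backoff_ms  # current backoff ceiling

    def next_delay_ms(self, pushback_ms=None):
        """Delay before the next retry attempt.

        A server pushback of n ms is honored verbatim, and (per the
        quoted text) the backoff resets to its initial value for the
        retry that follows the pushback-delayed one.
        """
        if pushback_ms is not None:
            self.backoff_ms = self.initial_backoff_ms  # reset for following retry
            return pushback_ms
        delay = random.uniform(0, self.backoff_ms)  # jittered exponential backoff
        self.backoff_ms = min(self.backoff_ms * self.multiplier, self.max_backoff_ms)
        return delay
```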
A6.md
Outdated
#### Exposed Retry Metadata
Both client and server application logic will have access to data about retries via gRPC metadata. Upon seeing an RPC from the client, the server will know if it was a retry, and moreover, it will know the number of previously made attempts. Likewise, the client will receive the number of retry attempts made when receiving the results of an RPC.

The new header name for exposing the metadata will be `"x-grpc-retry-attempts"`, giving clients and servers access to the attempt count.
I assume the value of this field will be a human-readable integer? If so, please explicitly state that.
Done.
A6.md
Outdated
When gRPC receives a non-OK response status from a server, this status is checked against the set of retryable status codes to determine if a retry attempt should be made.

```
'retryable_status_codes': {UNAVAILABLE}
```
For consistency, let's make all examples in this doc be the exact JSON format that will be used in the service config.
Done
A6.md
Outdated
This is an example of a retry policy and its associated configuration. It implements exponential backoff with a maximum of three retry attempts, only retrying RPCs when an `UNAVAILABLE` status code is received.

```
'retry_policy': {
```
The default conversion from protobuf will change a protobuf field of the form "foo_bar" to a JSON string of the form "fooBar". So let's use that form for all fields in this doc (e.g., "retry_policy" should be "retryPolicy").
In fact, before finalizing this, I would encourage you to write the protobuf definition that will be used internally and then create a proto and see how it converts to JSON, to make sure it's compatible with what we're specifying here.
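For reference, the default protobuf-to-JSON conversion mentioned above maps snake_case field names to lowerCamelCase. A quick sketch of that mapping (an illustrative helper, not part of any gRPC or protobuf API):

```python
def proto_field_to_json_name(field):
    """Default proto3 JSON mapping for field names: "foo_bar" -> "fooBar"."""
    head, *rest = field.split("_")
    return head + "".join(part.capitalize() for part in rest)
```

So `retry_policy` becomes `retryPolicy` and `max_retry_attempts` becomes `maxRetryAttempts` in the JSON form of the service config.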
Unrelated to this PR, but I think it'd be great to have a link to a reference protobuf definition which reflects the JSON in the main service config doc, in the form of an appendix at the bottom. I think a lot of people will use it due to DNS limitations, myself included.
The (mostly finalized) JSON format of the service config with retry, hedging, and throttling policies is now included. I think it's a good idea to include the protobuf definition as well once its finalized.
@lukaszx0, can you please chime in on the comment thread for #5 (https://groups.google.com/d/topic/grpc-io/DkweyrWEXxU/discussion) and describe the limitations that would prevent you from using DNS for service configs? I'd like more information on that, but I'd like to keep this discussion in that other thread, since it's not really directly related to the retry design. Thanks!
@markdroth sure, I had it opened in tab but didn't get to it.
A6.md
Outdated
```
{
  'loadBalancingPolicy': string,

  # Throttling parameters for retry attempts and hedged RPCs.
```
Please use `//` instead of `#` for comments in JSON.
Done (aside: https://github.com/grpc/grpc/blob/master/doc/service_config.md uses `#` for the comments)
A6.md
Outdated
```
  # Integer
  'multiplier': number
},
# Array of strings. Supported values are the gRPC status codes, e.g.,
```
I assume that the status codes should be encoded as JSON strings? If so, let's state that explicitly.
Done.
A6.md
Outdated
```
  # Integer
  'hedging_delay_ms': number,
  # Array of strings.
  # Supported values are the gRPC status codes, e.g.,
```
Same here.
Done.
## Proposal

### Overview
Oh, one more thing: This section should probably include a link to #5.
Done
I took a first pass and studied it a bit. In general looks great, very well written! Kudos!
A6.md
Outdated
```
'retryable_status_codes': {UNAVAILABLE}
```

In general, only status codes that indicate the service did not process the request should be retried. However, in the case of idempotent methods, a more aggressive set of parameters can be specified, as there is no harm in the server processing the request more than once.
Idempotence is mentioned here, but it's not clear where it comes from. I assume it is expected to be set in the method descriptor - I know that there is an API for that in Java, not sure about other languages - if so, maybe clarify it a bit? How about "... in the case of idempotent methods (which can be set in the method descriptor)..."
Updated the doc to clarify. We are actually intentionally not defining a mechanism for service owners to specify that a method is idempotent. An idempotent flag would really just mean "retry on these additional status codes", and it's cleaner to just let the service owner explicitly set the status codes rather than have an idempotent flag.
The use of idempotent here in the doc was just to give an example of when it's safe to expand the set of retryable status codes, and the example has been expanded to make that clearer. The idempotent bit in the Java API is not being used by this retry policy design.
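This "an idempotent flag would really just mean retry on additional status codes" point can be made concrete with a toy sketch. The status-code sets below are hypothetical examples, not recommendations from the design:

```python
# Conservative policy: only retry when the server was never reached.
CONSERVATIVE_CODES = {"UNAVAILABLE"}

# A hypothetical policy for an idempotent delete method, where the
# service owner has decided these codes are also safe to retry.
IDEMPOTENT_DELETE_CODES = {"UNAVAILABLE", "INTERNAL", "DEADLINE_EXCEEDED"}


def should_retry(status_code, retryable_status_codes):
    """A failed RPC is retried only if its status is in the policy's set."""
    return status_code in retryable_status_codes
```

There is no idempotent switch anywhere; the semantics live entirely in which codes the service owner lists.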
A6.md
Outdated
```
}
```

When a method has `hedging_params` set, the first RPC is sent immediately, as with a standard non-hedged call. After `hedging_delay` has elapsed without a successful response, the second RPC will be issued. If neither RPC has received a response after `hedging_delay` has elapsed again, a third RPC is sent, and so on, up to `max_requests`. In the above configuration, after 1ms there would be one outstanding hedged RPC, after 501ms there would be two outstanding hedged RPCs, and after 1001ms there would be three outstanding hedged RPCs.
The quoted key names are inconsistent with the example above the paragraph (`hedging_params` => `hedging_policy`, `hedging_delay` => `hedging_delay_ms`).

Besides that, I wonder if this paragraph should mention the presence of `deadline`, which can already be set for a call? It is mentioned above in "Maximum Number of Retries" and below in the "Configuration Language" examples for hedging, but not here, and since we're within the "Detailed Design" section, I think the semantics of hedging in the presence of a call deadline should also be described.
Done and done, thanks for pointing that out.
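The hedging timeline in the quoted paragraph (one outstanding RPC after 1ms, two after 501ms, three after 1001ms, implying a 500ms hedging delay) can be checked against a small model. The function is invented for illustration and assumes no hedged RPC has yet received a response:

```python
def outstanding_hedges(t_ms, max_requests, hedging_delay_ms):
    """Hedged RPCs in flight t_ms after the call starts.

    Assumes the first RPC is sent immediately, another fires each time
    hedging_delay elapses, and none has received a response yet.
    """
    if t_ms < 0:
        return 0
    return min(t_ms // hedging_delay_ms + 1, max_requests)
```

With `max_requests = 3` and `hedging_delay_ms = 500` this reproduces the 1 / 2 / 3 outstanding counts from the paragraph, and caps at `max_requests` thereafter.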
A6.md
Outdated
Throttling may only be specified per server name, rather than per method or per service.

For each server name, gRPC maintains a `token_count` which is initially set to `max_tokens`. Every outgoing RPC (regardless of service or method invoked) will affect `token_count` as follows:
nit: to avoid confusion for future readers, maybe worth being more verbose by saying "... gRPC client maintains..."
Done
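As a concrete model of the client-side throttle, the sketch below assumes a `max_tokens` / `token_ratio` scheme — failed RPCs cost one token, successes refund `token_ratio`, and retries or hedges are permitted only while `token_count` exceeds half of `max_tokens`. That reading of the throttling text is an assumption here, and the class name is illustrative:

```python
class RetryThrottler:
    """Illustrative per-server-name retry throttle."""

    def __init__(self, max_tokens, token_ratio):
        self.max_tokens = max_tokens
        self.token_ratio = token_ratio
        self.token_count = max_tokens  # initially set to max_tokens

    def on_success(self):
        # Successful RPCs slowly refill the bucket, capped at max_tokens.
        self.token_count = min(self.token_count + self.token_ratio,
                               self.max_tokens)

    def on_failure(self):
        # Qualifying failures each cost one token.
        self.token_count = max(self.token_count - 1, 0)

    def retries_allowed(self):
        # Retries/hedges are cut off once the count drops to the threshold.
        return self.token_count > self.max_tokens / 2
```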
A6.md
Outdated
Neither retry attempts nor hedged RPCs block when `token_count` is less than or equal to the threshold. Retry attempts are canceled and the failure returned to the client application. The hedged request is canceled, and if there are no other already-sent hedged RPCs, the failure is returned to the client application.

The only RPCs that are counted as failures for the throttling policy are RPCs that fail with a status code that qualifies as a retryable or non-fatal status code (see [here](#retryable-status-codes) and [here](#hedging-policy)), or that receive a pushback response indicating not to retry. This avoids conflating server failure with responses to malformed requests (such as the `INVALID_ARGUMENT` status code).
nit: I think we could improve UX for future readers by removing the "here" links. How about:

... are RPCs that fail with a status code that qualifies as a retryable ([see retryable status codes](#retryable-status-codes)) or non-fatal status code (see [hedging policy](#hedging-policy)) ...

or maybe even just:

... are RPCs that fail with a status code that qualifies as a [retryable](#retryable-status-codes) or [non-fatal status code](#hedging-policy) ...
Done
#### Where Retries Occur

The retry policy will be implemented between the channel and the load balancing policy. That way every retry gets a chance to be sent out on a different subchannel than the one it originally failed on.
I immediately wanted to ask about metadata which the LB could use, but found the answer later. Maybe worth linking to it? Something like "... every retry gets a chance to be sent out on a different subchannel based on the retry metadata, which can be used by the load balancer ..."
This is a good point, but currently the metadata only says if the RPC is a retry. We are considering adding information about the previously used subchannel(s), based on comments from the gRFC email thread. If/once that is added, then adding the link you suggest (based on the retry metadata...) makes sense.
A6.md
Outdated
#### Buffered RPCs

The gRPC library will have a configurable amount of available memory, `retry_buffer_size`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit`. These limits are configured by the client, rather than coming from the service config.
I assume both keys' values are in bytes. That should probably be explicitly stated, at least in docs, or maybe in the key names? Or maybe it's the number of messages to buffer? (If so, regardless of size, that would mean it is effectively unbounded.)
Added `in_bytes` to the name.
Clients cannot override the retry policy set by the service config. However, retry support can be disabled entirely within the gRPC client library. This is designed to enable existing libraries that wrap gRPC with their own retry implementation (such as Veneer Toolkit) to avoid having retries take place both at the gRPC layer and within their own libraries.

Eventually, retry logic should be taken out of the wrapping libraries and only exist in gRPC. But allowing retries to be disabled entirely will make that rollout process easier.
Does that mean that support for disabling retries in the implementations will be temporary, or will it always be supported?
I hesitate to say always...but it will always be supported, at least for now :-) We want clients in resource-constrained environments to also be able to disable retries, so this is intended as a feature of the API, not a temporary stopgap.
A6.md
Outdated
#### Pushback
Servers may explicitly push back by setting metadata in their response to the client. The pushback can either tell the client to retry after a given delay or not to retry at all. If the client has already exhausted its `max_retry_attempts`, the call will not be retried even if the server says to retry after a given delay.

A new metadata key, `"x-grpc-retry-pushback-ms"`, will be added to support server pushback. If the value for pushback is set to -1, then it will be seen as the server asking the client not to retry at all.
Before I got to "Throttling Configuration" below, it wasn't really clear to me what the scope of pushback is on the server side. I think it's worth mentioning here, since this section is supposed to describe it in detail, not the examples one below.
I'm not sure what you mean by scope on the server side. Pushback is intended to give service owners a means to programmatically influence retry attempts, whereas everything else (including throttling) is implemented automatically on the client side. Pushback is only related to throttling configuration in that pushback saying not to retry will be counted as a failure.
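Client-side handling of the pushback header described above (`"x-grpc-retry-pushback-ms"`, with -1 meaning don't retry) might look like the sketch below. The function shape and return convention are illustrative, not any real gRPC API:

```python
def pushback_decision(metadata):
    """Interpret server pushback metadata.

    Returns ("no-pushback", None) when the header is absent,
    ("dont-retry", None) when the server asks the client not to retry
    at all (value -1, per the doc), or ("retry-after", delay_ms) for a
    non-negative server-specified delay.
    """
    raw = metadata.get("x-grpc-retry-pushback-ms")
    if raw is None:
        return ("no-pushback", None)
    delay_ms = int(raw)
    if delay_ms < 0:
        return ("dont-retry", None)
    return ("retry-after", delay_ms)
```

Note that even a ("retry-after", n) result is subject to `max_retry_attempts`: an exhausted call is not retried regardless of pushback.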
A6.md
Outdated
The only RPCs that are counted as failures for the throttling policy are RPCs that fail with a status code that qualifies as a retryable or non-fatal status code (see [here](#retryable-status-codes) and [here](#hedging-policy)), or that receive a pushback response indicating not to retry. This avoids conflating server failure with responses to malformed requests (such as the `INVALID_ARGUMENT` status code).

#### Pushback
Servers may explicitly push back by setting metadata in their response to the client. The pushback can either tell the client to retry after a given delay or not to retry at all. If the client has already exhausted its `max_retry_attempts`, the call will not be retried even if the server says to retry after a given delay.
I assume by the mention of `max_retry_attempts` that Pushback applies only to the backoff policy. Probably should be stated explicitly?
Pushback actually applies to hedging too. This was hidden in the Summary of Retry and Hedging Logic section; now it's explicitly described here as well.
Addressing lukaszx0's comments
Addressing markdroth's comments
A6.md
Outdated
After the RPC response has been returned to the client application layer, the RPC is removed from the buffer.

New RPCs which do not fit in the available buffer space (either due to the total available buffer space, or due to the per-RPC limit) will not be retryable. For client-side streaming RPCs, if additional outgoing messages would cause the buffered size to exceed the per-RPC limit, the entire message will be removed from the buffer.
New RPCs which do not fit in the available buffer space (either due to the total available buffer space, or due to the per-RPC limit) will not be retryable.
Does that mean that they'll fail immediately? Does that mean they will make original request and in case of failure won't qualify for retry and fail after first initial request? I assume it's the former, maybe worth clarifying?
Actually, it is the latter. We don't intend the buffer to limit the overall number or size of outgoing RPCs, so we still send the original request and essentially behave exactly as if there were no retry policy specified for RPCs sent when the buffer is full.
Updated the doc to state this explicitly.
Ah, I see. Makes sense, thanks for the clarification.
Just a few more minor comments. On the whole, this looks great!
A6.md
Outdated
@@ -52,7 +52,7 @@ Many teams have implemented their own retry logic wrapped around gRPC like [Vene

### Overview

gRPC will support two configurable retry policies. The service configuration may choose from a retry policy (retry failed RPCs) or a hedging policy (aggressively send the same RPC multiple times in parallel). An individual RPC may be governed by a retry policy or a hedge policy, but not both.
gRPC will support two configurable retry policies. The [service configuration](https://github.com/grpc/proposal/pull/5) may choose from a retry policy (retry failed RPCs) or a hedging policy (aggressively send the same RPC multiple times in parallel). An individual RPC may be governed by a retry policy or a hedge policy, but not both.
I suggest changing the link for the text "service configuration" to point to https://github.com/grpc/grpc/blob/master/doc/service_config.md, and adding the following after it:
(which may soon be published via DNS)
A6.md
Outdated
@@ -214,15 +217,15 @@ To clarify the second scenario, we define an *outgoing message* as everything th

#### Buffered RPCs

The gRPC library will have a configurable amount of available memory, `retry_buffer_size`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit`. These limits are configured by the client, rather than coming from the service config.
The gRPC library will have a configurable amount of available memory, `retry_buffer_size_in_bytes`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit_in_bytes`. These limits are configured by the client, rather than coming from the service config.
I suggest changing the wording to make it clear that this part of the spec is optional and may differ based on how the client is implemented. Suggested text (feel free to wordsmith):
The gRPC client library may support application-configured limits for the amount of memory used for retries. It is suggested that the client limit both the total amount of memory used to buffer retryable or hedged RPCs as well as the amount of memory used by any one RPC (to prevent a single large RPC from using the whole buffer and preventing retries of subsequent smaller RPCs).
A6.md
Outdated
@@ -214,15 +217,15 @@ To clarify the second scenario, we define an *outgoing message* as everything th

#### Buffered RPCs
Suggest changing this heading to something like "Memory Management".
"Memory Management (Buffering)" maybe?
A6.md
Outdated
![State Diagram](A6_graphics/StateDiagram.png)

[Link to diagram on draw.io](https://drive.google.com/a/google.com/file/d/0Bwcp6fL6ikUvOGJWcExrNFhMZ0k/view?usp=sharing)
This won't fly. I'm guessing you linked to your work drive at Google. I think it would be nice if you could export SVGs and add them to the repo alongside the PNGs (looks like draw.io lets you do that).
Oops!! Thanks for the catch. I switched all of the diagrams to SVG format :)
Cool, didn't know GH will render it. Nice, thank you!
Um, apparently it won't render them inline - https://github.com/ncteisen/proposal/blob/3b24372020f0acc7017bbf2e1e559dcdc9279561/A6.md 😞
I do think it would be better to include the SVG files in this PR instead of linking to an external source. Otherwise, there's no guarantee that someone won't delete the external source without realizing that it's referenced from this document.
Haha oops again, reverted back to PNGs, and now there are links to the SVG files too.
Looks good! Thank you :)
Right, having implemented most of what is proposed here as generic gRPC-Go interceptors (https://github.com/mwitkow/go-grpc-middleware/tree/master/retry) that we use in production, a couple of things:
a) all of this logic can be implemented as interceptor APIs in Go, Java and C++.
b) retry logic would benefit a lot from knowing whether the method is idempotent or not. I understand that this is supposed to be handled by "service configs", but realistically they're really hard to use. Few people would put their retry logic in DNS TXT entries, and even fewer people operate the gRPC LB protocol. Can we consider adding a `.proto` option (annotation) to Method definitions?
c) One thing I found useful "in the wild" is the ability to limit the Deadline of the retriable call. For example, the "parent" RPC call (user invoked) has a deadline of 5s, but each retriable call only 1s. This allows you to skip a "deadlining" server and retry against one that works.
In general, only status codes that indicate the service did not process the request should be retried. However, a more aggressive set of parameters can be specified when the service owner knows that a method is idempotent, or safe to be processed more than once by the server. For example, if an RPC to a delete user method fails with an `INTERNAL` error code, it's possible that the user was already deleted before the failure occurred. If the method is idempotent then retrying the call should have no adverse effects. However, it is entirely possible to implement a non-idempotent delete method that has adverse side effects if called multiple times, and such a method should not be retried unless the error codes guarantee the original RPC was not processed. It is up to the service owner to pick the correct set of retryable status codes given the semantics of their service. gRPC retries do not provide a mechanism to specifically mark a method as idempotent.

#### Hedging Policy
Can we rename hedging to something else? When I was first reading this I thought it meant only one thing: send X RPCs immediately and return the call once the first one completes, cancelling the others.
Hedging seems like the appropriate name. It can mean what you say (with a hedging delay of 0ms) but also supports sending the RPCs on a fixed delay.
A6.md
Outdated
#### Memory Management (Buffering)

The gRPC client library will support application-configured limits for the amount of memory used for retries. It is suggested that the client set `retry_buffer_size_in_bytes` to limit the total amount of memory used to buffer retryable or hedged RPCs. The client should also set `per_rpc_buffer_limit_in_bytes` to limit the amount of memory used by any one RPC (to prevent a single large RPC from using the whole buffer and preventing retries of subsequent smaller RPCs). These limits are configured by the client, rather than coming from the service config.
I believe that this could be made significantly simpler by specifying a number of buffered messages per RPC. This makes it easier for users of the library to reason about it: they know that 5 unary RPCs will make it because retries buffer 10.
This would work if we assume a fixed size for each RPC. Since this also applies to streaming RPCs that are effectively unbounded, we need to specify the buffer size in memory rather than number of RPCs.
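The byte-based accounting being described can be sketched as follows. This is an illustration only, assuming the per-RPC limit semantics discussed in the proposal; the class and method names are hypothetical.

```python
class RpcBuffer:
    """Tracks buffered bytes for one RPC against a per-RPC limit.

    Once the limit is exceeded, the RPC can no longer be retried or
    hedged, so its buffer is freed and the call is committed to the
    current attempt.
    """

    def __init__(self, per_rpc_buffer_limit_bytes):
        self.limit = per_rpc_buffer_limit_bytes
        self.buffered = 0
        self.retryable = True

    def on_outgoing_message(self, message_bytes):
        if not self.retryable:
            return  # already committed; nothing more to buffer
        self.buffered += len(message_bytes)
        if self.buffered > self.limit:
            self.retryable = False
            self.buffered = 0  # free the buffer
```

A streaming RPC can push the buffer past the limit at any point in the stream, which is why the limit is expressed in bytes rather than in a count of RPCs.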
#### Disabling Retries
Clients cannot override the retry policy set by the service config. However, retry support can be disabled entirely within the gRPC client library. This is designed to enable existing libraries that wrap gRPC with their own retry implementation (such as Veneer Toolkit) to avoid retries taking place both at the gRPC layer and within their own libraries.
This should be tunable per-call.
Our intent is to limit the client-side API surface as much as possible. We have to allow disabling retries altogether, due to existing wrapper libraries and clients in resource-constrained environments, but otherwise we want to keep retry policy the domain of the service owner rather than individual clients.
Hi Michal, Thank you for the comments. We are trying to keep high-level discussion on the email thread (see here) but my responses to your points (b) and (c) are below.
There are two concerns here. One is that saying a method is idempotent really just means "retry on status codes X, Y, and Z". If we pick a canonical set of idempotent status codes, we are forcing every service owner to obey these semantics if they want to use retries. However, the gRPC status code mechanism is very flexible (even allowing services to return UNAVAILABLE arbitrarily), so we'd prefer to force service owners to consider the semantics of their application and pick a concrete set of status codes rather than just flipping an "idempotent" switch.

The second concern is around the ease of use of the service config. The intent is for the service config to be a universally useful mechanism, and we want to avoid just baking everything into the proto. Concerns about the delivery mechanism for the service config shouldn't invalidate its use for encoding retry policy, and may be something we have to tackle separately.
This is covered by our hedging policy. There doesn't seem to be any reason to cancel the first RPC in your scenario, as it may be just about to complete on the server and cancellation implies the work done so far is wasted. Instead, hedging allows you to send the first request, wait one second, send a second request, and accept the results of whichever completes first. |
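The "take the first response, don't cancel the original" behavior described in this reply can be sketched with standard-library threads. This is a toy model, not the gRPC implementation: attempts are plain callables, and slower attempts are simply left to run.

```python
import concurrent.futures
import time

def hedged_call(attempt_fns, hedging_delay):
    """Issue hedged attempts on a fixed delay; return the first result.

    Earlier attempts are not cancelled when a later one is sent, since
    they may be just about to complete on the server.
    """
    with concurrent.futures.ThreadPoolExecutor(len(attempt_fns)) as pool:
        futures = []
        for fn in attempt_fns:
            futures.append(pool.submit(fn))
            # Wait up to one hedging delay; if anything finished, use it.
            done, _ = concurrent.futures.wait(futures, timeout=hedging_delay)
            if done:
                return next(iter(done)).result()
        # All attempts are in flight; accept whichever completes first.
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()
```

In the commenter's scenario (send, wait one second, send again), the second attempt's result is returned if it finishes first, while the first attempt is allowed to run to completion.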
A6.md
Outdated
The gRPC library will have a configurable amount of available memory, `retry_buffer_size_in_bytes`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit_in_bytes`. These limits are configured by the client, rather than coming from the service config.
The gRPC client library will support application-configured limits for the amount of memory used for retries. It is suggested that the client set `retry_buffer_size_in_bytes` to limit the total amount of memory used to buffer retryable or hedged RPCs. The client should also set `per_rpc_buffer_limit_in_bytes` to limit the amount of memory used by any one RPC (to prevent a single large RPC from using the whole buffer and preventing retries of subsequent smaller RPCs). These limits are configured by the client, rather than coming from the service config.
I think this section is still a bit too specific (i.e., "will" instead of "may", and specifying specific names for the two pools). Having discussed the C-core implementation with @ctiller, I think that we will not actually wind up supporting a separate total pool for retries; instead, that will come out of the existing pool used for all of gRPC. However, we will still provide a per-RPC limit, since we want to avoid one large RPC from starving all other RPCs of space for retries.
I think this section should be re-worded to describe the types of constraints that an implementation may want to consider, but it should not mandate any particular mechanism or configuration parameters.
* Clarify that initial metadata commits an RPC
* Remove gRPC wire format change section
* gRPC servers should delay sending headers until first message
* Hedged RPCs will pass 'do not call' list to local LB
A6.md
Outdated
@@ -140,7 +140,7 @@ If all instances of a hedged RPC fail, there are no additional retry attempts. E

If server pushback that specifies not to retry is received in response to a hedged request, no further hedged requests should be issued for the call.
Hedged requests should be sent to distinct backends, if possible.
Hedged requests should be sent to distinct backends, if possible. To facilitate this, the gRPC client will maintain a list of previously used backend addresses for each hedged RPC. This list will be passed to the gRPC client's local load-balancing policy. Previously used addresses should be skipped when the load-balancing policy picks a server for the hedged RPC. If all available backend addresses have already been used, the load-balancing policy may pick any backend. This means that service owners who specify more hedged requests than they have available backends may still see hedged requests sent to the same backend.
I think the language here should be less prescriptive and more about what functionality can be obtained if desired. For example, instead of saying "Previously used addresses should be skipped", we can say something like "The load balancing policy may use this information to send retries to an address that was not previously used if possible". In addition, the last couple of sentences can be changed to indicate that the load-balancing policy will need to decide what to do in the case where there are no addresses that have not already been tried.
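The less prescriptive behavior suggested here can be sketched as a picker that merely *prefers* unused addresses. This is illustrative only; the function name and the fallback choice are hypothetical, and a real load-balancing policy decides for itself what to do when every address has been tried.

```python
def pick_backend(addresses, previously_used):
    """Pick a backend for a hedged attempt, preferring addresses this
    call has not used yet; fall back to any address otherwise."""
    for addr in addresses:
        if addr not in previously_used:
            return addr
    # Every backend has already served an attempt of this call; the
    # policy may pick any of them (here, simply the first).
    return addresses[0]
```

With two backends and three hedged attempts, for example, the third attempt necessarily revisits a backend, which is the situation the last sentences of the quoted text describe.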
Thanks for the detailed design, and thanks to everyone for the thoughtful feedback. I am approving and merging this PR now. Any follow-up discussion can continue on the thread.
gRFC for retries
Authors: ncteisen and ericgribkoff
grpc.io discussion