Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement client interceptors for grpc-native-core #59

Merged
merged 4 commits into from
Feb 28, 2018

Conversation

drobertduke
Copy link
Contributor

A NodeJS implementation of client-side interceptors, as described in
the proposal: grpc/proposal#14

@murgatroid99
Copy link
Member

Can you give a high-level explanation of this implementation? How do the different classes interact? What is the common code path for a call using one or more interceptors?

@drobertduke
Copy link
Contributor Author

At a high level, this PR inserts an InterceptingCall between the JS client and the C-core.

Before:
client.js => call.cc startBatch

After:
client.js => client_interceptors.js => call.cc startBatch

client_interceptors.js presents the API interceptors must be written to, and the mechanism for executing a chain of interceptors. The interceptors are chained together dynamically, allowing them to be provided either during client construction or during any RPC invocation.

All interception work is completed before submitting a batch to startBatch. A unary call with a single FooInterceptor would be broken out like this like this:
Outbound

client.js => FooInterceptor.start
          => FooInterceptor.sendMessage
          => FooInterceptor.halfClose    => call.cc startBatch

Inbound

call.cc startBatch => FooInterceptor.onReceiveMetadata
                   => FooInterceptor.onReceiveMessage
                   => FooInterceptor.onReceiveStatus   => consumer's callback

Logic previously in client.js for creating the client_batch and the callback provided to startBatch had to be moved to execute after all interceptor method chains have fired. This logic (and the Client*Stream logic) has been consolidated/de-duplicated and moved to client_batches.js.

A typical code path with two interceptors (InterceptorFoo and InterceptorBar) would look like this. This is the outbound path for a unary call performing the start operation:

Client.makeUnaryRequest client.js
InterceptingCall.start client_interceptors.js
InterceptingCall._callNext client_interceptors.js
InterceptorFoo.start interceptor_foo.js
next (a closure in InterceptingCall.start) client_interceptors.js
InterceptingCall.start client_interceptors.js
InterceptingCall._callNext client_interceptors.js
InterceptorBar.start interceptor_bar.js
next (a closure in InterceptingCall.start) client_interceptors.js
InterceptingCall.start client_interceptors.js
InterceptingCall._callNext client_interceptors.js
InterceptingCall.start (from _getOutboundBatchingInterceptor) client_interceptors.js
_joinOutboundOperations client_interceptors.js
_joinOpsGroup client_interceptors.js
_joinOpsCustom client_interceptors.js

At this point the start path has been fully completed on the JS side and the result recorded in the BatchRegistry. We unwind the stack back to Client.makeUnaryRequest and repeat the process for sendMessage, and a third time for halfClose. Once all required outbound operations have been recorded in the BatchRegistry, _joinOpsCustom will proceed to deliver the full outbound batch to call.cc's startBatch via this path:

_joinOpsCustom client_interceptors.js
_joinOperations client_interceptors.js
joiner (a closure in _getOutboundBatchingInterceptor) client_interceptors.js
join_outbound (from _registerUnaryBatch) client_batches.js
startBatch call.cc

Inbound operations take a similar path but the interceptor order is reversed, we call the inbound join functions, and it ends with delivering RPC errors/results to the client's consumer.

Copy link
Member

@murgatroid99 murgatroid99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all of the logic in client_batch.js may be more trouble than it's worth. The primary purpose of using a batch instead of sending each op separately is as an optimization to coalesce events internally. But now that interceptors can introduce asynchronous operations that delay ops individually, coalescing ops into a batch that is started when the last op is ready may actually negate that advantage. So, I think it would be better to simply always call startBatch with each individual op when the interceptor makes it ready. If that turns out that that is a significant performance hit, I think we can make up for it by optimizing it differently, such as by using a UV prepare handle to coalece all ops that are started within a single event loop cycle.

Client.prototype.makeUnaryRequest = function(method, serialize, deserialize,
argument, metadata, options,
callback) {
Client.prototype.makeUnaryRequest = function(method_descriptor, argument,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is part of the public API; we can't change the method signature. The same goes for the other methods on the Client prototype. However, I don't think that's a major problem; you can pass that method descriptor information in the options argument.

* @constructor
*/
function BatchDefinition(batch_ops, trigger_ops, outbound_handler,
inbound_handler) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document the method signatures of outbound_handler and inbound_handler.

@@ -0,0 +1,572 @@
/**
* @license
* Copyright 2015 gRPC authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright notices on new files should be 2018 (or 2017, since the file was actually written last year).

* interceptor (if it applies to the method)
* @constructor
*/
var InterceptorProvider = function(get_interceptor) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need an object type here just to wrap a single function? Why not just define the InterceptorProvider type as a function with a signature that matches get_interceptor's signature?

var batch_registry = new BatchRegistry();
var intercepting_call = client_interceptors.getInterceptingCall(
call_constructor, interceptors, batch_registry, args.options);
var emitter = new ClientUnaryCall(intercepting_call);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an InterceptingCall is being passed to the ClientUnaryCall here, then the ClientUnaryCall documentation should be change to reflect that. The same goes for the other surface API call types.

@Crevil
Copy link
Contributor

Crevil commented Jan 17, 2018

How is progress going on this? Anything I can do to help get this merged?

@thelinuxfoundation
Copy link

Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement.

After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.

Regards,
CLA GitHub bot

@drobertduke
Copy link
Contributor Author

@Crevil Re-writing this to eliminate the batching of operations is going to take some time, and I'll also need to assess the performance impact for our use cases. I hope to have the requested changes addressed in the next two weeks.

@Crevil
Copy link
Contributor

Crevil commented Jan 19, 2018

Sound great @drobertduke . We have been looking forward to this change and I would just check up if we could assist in getting it forward.

@drobertduke
Copy link
Contributor Author

Update: I hope to have a new version of this PR posted next week

@murgatroid99
Copy link
Member

In the meantime, it looks like you actually created this PR before we switched to the CNCF CLA, so you'll have to sign that before this can be merged.

@thelinuxfoundation
Copy link

Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement.

After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.

Regards,
CLA GitHub bot

@drobertduke
Copy link
Contributor Author

I've re-written the PR to send operations as they become available instead of trying to match the same batches the client was making originally. We've deployed this new code to production in a limited capacity and do not see a performance penalty from this change vs earlier versions of this PR.

The new interaction pattern with call.startBatch leaves some questions about how a singular callback should be used when multiple batches are fired. I'll comment on this in the code.

I've signed the CLA now, thanks for the reminder.

var batch = {
[grpc.opType.SEND_MESSAGE]: serialize(message),
};
call.startBatch(batch, function () {});
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should I reconcile multiple batches (each with their own errors potentially) vs a single callback provided by the consumer? I've skipped the callback in early batches in favor of using it when receiving a status as seen in the client streaming RPC but will this obscure useful errors from the consumer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that what you are doing is correct. I think that every instance of multiple operations in a batch where the single callback actually matters includes a status operation. And I'm pretty sure that the status operation is always guaranteed to complete eventually. And I'm also pretty sure that if there is an error with another operation, it is always because of some circumstance that would cause a non-OK status. So in the end, you will want to surface the status.

One thing you need to be careful of here to avoid spurious errors is that you have to start the SEND_MESSAGE batch before you start the SEND_CLOSE_FROM_CLIENT batch, because the latter indicates that no more messages will be sent and the write will fail if you do them in the wrong order. I haven't seen yet if you are ensuring that ordering elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-streaming operations are ordered in client.js, here's the unary case: https://github.com/grpc/grpc-node/pull/59/files#diff-86af358fb1304514580c24212de4f182R486

This doesn't guarantee against a badly-writtten interceptor forcing an incorrect order but I assume we don't need to guard against that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you consider an interceptor that does any asynchronous operations to be badly written? Because I'm pretty sure that an asychronous write interceptor is sufficient to cause those operations to be performed out of order if there is no guard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will admit, perhaps there was more value to the batch processing code you had than I previously realized. But I think there are other good solutions to this problem. For example, if the individual InterceptingCall methods have an additional callback parameter or return a promise, you can propagate "the corresponding batch has been started" back up the call stack to ensure that the batches are started in order. Another option could be to share more information across the interceptor stack. If there is a pending start call that hasn't gone through all of the interceptors, don't send any messages, and if there are any pending sendMessage operations in the interceptor stack, don't start the halfClose batch.

In general, the guarantees we need here are weaker than what you get from batches. The start operations have to happen at the same time as or before the sendMessage operations, which have to happen at the same time as or before the halfClose operations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually brings up another question for me: if a sendMessage interceptor on a streaming request takes an asnychronous action before passing the message to the next layer, is there any protection against messages getting passed down out of order?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, the sendMessageWithContext actually answers that question, and I think it also provides a solution to the other problem. I am still worried about start, and not just because of unary calls. For any call, sending a message before sending the metadata will not do the right thing, and may crash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's desirable to prevent an interceptor from changing the order of streaming messages. Interceptors can modify messages, swap a message out for a new message, or short circuit the operation entirely to skip a message. Building an ordered list of messages as they are passed in by the client's consumer and then enforcing that order when starting batches would make certain use cases impossible, like circuit breaking on streaming messages.

re: enforcing the order of operations in non-streaming requests, I was expecting interceptor authors to do that work if they want to make a continuation asynchronous. Adding a mechanism to the framework to delay an operation until a required operation is ready could reduce the burden on interceptor authors with async continuations. But it could also lead to difficult-to-debug scenarios with silent failures:

  1. An interceptor author writes a bad interceptor which unexpectedly fails to call the SEND_MESSAGE continuation n% of the time.
  2. An operation-order-enforcement mechanism in the framework allows the SEND_METADATA operation to be sent but doesn't send CLOSE n% of the time because it is waiting on a SEND_MESSAGE that never comes.
  3. n% of all calls pile up without error until they time out or memory runs out.

The same bad interceptor would cause immediate failures with useful errors if the framework is not waiting on an order-enforcement mechanism.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of sendMessageWithContext already enforces the order of messages by preventing the top-level stream from sending down each message until the previous message has been fully processed.

And from my point of view, it feels very surprising to users that this interceptor API is clearly written to support asynchronous operations, but if they try to actually use an asynchronous operation, it will screw up some requests unless they do significant extra work to explicitly track their own pending operations. On top of that, in the current library it is not possible for the user to call the API in a way that results in out-of-order calls to the C core. I think that is a valuable guarantee and I would be very hesitant to change it.

@drobertduke
Copy link
Contributor Author

I've added a mechanism to define 'dependencies' for individual operations, check whether those dependencies are met and defer the batch if needed. The use case only applies to RPCs with synchronous requests (unary and serverStreaming).

I'm guessing this implementation is fairly expensive. Some of the lodash calls can be optimized away but I'm leaving them in for now for clarity. I'll work on producing some benchmarks.

@drobertduke
Copy link
Contributor Author

I ran benchmarks on an m4 EC2 instance for three versions of grpc-native:

  1. The "base" version (the 044102d commit on master this PR is currently based on)
  2. This PR without protections against out-of-order operations (f644c06)
  3. This PR with those protections (d83d354)

Each benchmark is the time it takes to issue and get responses to 50k simple unary RPC requests (with no interceptors) to a local server running in another process. I ran the benchmark three times for each version.

BASE - 044102d

real	0m58.352s
user	0m13.152s
sys	0m0.916s

real	0m58.275s
user	0m13.300s
sys	0m0.784s

real	0m57.998s
user	0m13.360s
sys	0m0.820s

PR without order protection - f644c06

real	0m52.190s
user	0m17.204s
sys	0m1.480s

real	0m51.821s
user	0m16.808s
sys	0m1.596s

real	0m51.974s
user	0m17.164s
sys	0m1.528s

PR with order protection - d83d354

real	1m6.632s
user	0m19.756s
sys	0m3.140s

real	1m6.389s
user	0m20.468s
sys	0m2.768s

real	1m6.010s
user	0m19.656s
sys	0m3.180s

Looking at the median timing for each version we get these results (lower is better):

real median benchmarks
PR without order protection: 89.2% of base
PR with order protection: 113.9% of base

I was surprised by the performance improvement of the PR without order protection over the base. Looking at the user timings we can see that more work is being done by the PR-versions of the client as I would expect. It looks like breaking up the request operations is significantly increasing efficiency, at least in this benchmark scenario.

user median benchmarks
PR without order protection: 132.6% of base
PR with order protection: 148.5% of base

I won't have time to work on performance improvements (reducing sort() and lodash calls) for about a week. If you want to merge this one as-is I can post a follow up PR later with those improvements.

* @param {string} path
* @param {grpc.Client~CallOptions=} options Options object.
*/
exports.getCall = function(channel, path, options) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this function is only used in client_interceptors.js, so it should be defined there.

@murgatroid99
Copy link
Member

OK, I can't say I've totally wrapped my head around all of the code here, but at this point I can't find any other problems, so I'm going to merge it. Thank you for your contribution, and your patience in working through the contribution process. Assuming nothing goes wrong, this will be published in the 1.11 release.

@ebenoist
Copy link

Is there a timeline for when the 1.11 release is expected? Happy to test some of this out in our organization as well if that helps.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants