Propagate networking connection error handling into NodeConnection error handling #224
We must add tests that break the connection.
The issue template is "bug" here, but should this be a development spec?
As I've started to spec this issue out a bit more, I remembered that I'd actually already made a comment about error handling at the `NodeConnection` level.
@CMCDragonkai I initially created this issue in August with respect to this comment from you: https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/198#note_644652543 Is propagating the errors to a generic handler in the nodes domain what we want here?
You should be able to catch all networking exceptions in the nodes domain.
Regarding client connections: that case is about losing connection to the PK agent, whereas this issue is about losing connection between PK agents, node to node, and this requires integrating with the networking domain. Losing connection to the client service and client server does not use the networking domain; it's just a straightforward GRPC error that we need to catch and convert into a domain exception.
The PK CLI commands are going to be using these client connections. There should be a good human-readable message for this kind of exception when it reaches the user. New issue in #276.
So the problem here is that the GRPC connection can break when the networking domain has an error, and this may also happen in #276. But there would be no way of handling this error event automatically; instead, I believe an exception would only occur at the point of calling another client method. This is obviously not suitable for our purposes. We need to investigate the source code, issues and docs for grpc-js. If we cannot handle the error event directly, the alternative is to do interval pings, which are unary calls, and these should trigger an exception if the connection has been broken. We already do something similar inside the network domain. Also in reference to #225, we have to make a choice about how to handle this. Check this: https://github.com/grpc/grpc-node/search?q=connection+drop+handle&type=issues
Recall our TCP connection between the GRPC client and our network proxies. We'd discussed earlier that the networking components of Polykey can be treated as a completely separate component of the system: they have no knowledge of GRPC, etc. However, the GRPC connections still run on top of them. This also means that errors in the networking components won't automatically surface at the GRPC level.
Leaving some notes here regarding management of channel or subchannel state in the GRPC client.

Client Channel Options

It is possible to override the channel implementation through the client options. The `ClientOptions` type is:

```ts
export declare type ClientOptions = Partial<ChannelOptions> & {
channelOverride?: Channel;
channelFactoryOverride?: (address: string, credentials: ChannelCredentials, options: ClientOptions) => Channel;
interceptors?: Interceptor[];
interceptor_providers?: InterceptorProvider[];
callInvocationTransformer?: CallInvocationTransformer;
};
```

See the `channelOverride` and `channelFactoryOverride` properties. See how the client constructor selects the channel:

```ts
if (options.channelOverride) {
this[CHANNEL_SYMBOL] = options.channelOverride;
} else if (options.channelFactoryOverride) {
const channelFactoryOverride = options.channelFactoryOverride;
delete options.channelFactoryOverride;
this[CHANNEL_SYMBOL] = channelFactoryOverride(
address,
credentials,
options
);
} else {
this[CHANNEL_SYMBOL] = new ChannelImplementation(
address,
credentials,
options
);
}
```

Because this allows us to change the entire channel implementation, we could potentially change how we do the network domain entirely. That is, we could fix #234 by making our own channel implementation.

Channel Methods for Acquiring Connectivity State

Internally in GRPC, there's a `Channel` interface. The interface contains 2 interesting methods:

```ts
/**
* Get the channel's current connectivity state. This method is here mainly
* because it is in the existing internal Channel class, and there isn't
* another good place to put it.
* @param tryToConnect If true, the channel will start connecting if it is
* idle. Otherwise, idle channels will only start connecting when a
* call starts.
*/
getConnectivityState(tryToConnect: boolean): ConnectivityState;
/**
* Watch for connectivity state changes. This is also here mainly because
* it is in the existing external Channel class.
* @param currentState The state to watch for transitions from. This should
* always be populated by calling getConnectivityState immediately
* before.
* @param deadline A deadline for waiting for a state change
* @param callback Called with no error when a state change, or with an
* error if the deadline passes without a state change.
*/
watchConnectivityState(
currentState: ConnectivityState,
deadline: Date | number,
callback: (error?: Error) => void
): void;
```

The connectivity state is an enum:
We could potentially hook into these connectivity state changes and use this to understand what the state of the connection is, and therefore react to it in `NodeConnection`. In fact, the client's channel exposes these methods directly.

Subchannel Pool

Then each channel has a subchannel pool, and each subchannel has its own connectivity state. It is still unclear to me how subchannel connectivity state relates to the connectivity state of the channel, but it appears the state of the subchannels does affect the state of the channel.

Subchannel

Each subchannel has its own connectivity state, and its own underlying HTTP/2 session. Now I noticed that the session is `null` exactly when the subchannel state is `IDLE` or `TRANSIENT_FAILURE`:

```ts
export class Subchannel {
/**
* The subchannel's current connectivity state. Invariant: `session` === `null`
* if and only if `connectivityState` is IDLE or TRANSIENT_FAILURE.
*/
private connectivityState: ConnectivityState = ConnectivityState.IDLE;
/**
* The underlying http2 session used to make requests.
*/
private session: http2.ClientHttp2Session | null = null;
// ...
}
```
For now we won't replace the network domain just yet; we'll leave that to #234. Even without implementing our own channel implementation, one can interact with the channel by exposing it from the client. But I'd prefer to abstract that and have our own callback mechanisms built into `NodeConnection`. Then `NodeConnection` itself can react when the connection dies. Recommend cloning the https://github.com/grpc/grpc-node/tree/master/packages/grpc-js repo and taking a look when tackling this @joshuakarp.
Further info: https://stackoverflow.com/questions/63749113/grpc-call-channel-connection-and-http-2-lifecycle. Note that GRPC can support hostname resolution too, and this impacts how the channels/subchannels/load balancers work. However, I'm not sure we would want to allow GRPC to do this work, since we need to ensure that our proxies are working properly. If there are multiple IPs for a given hostname, we would want separate node connections, not a single load-balanced node connection. Each IP there is considered not a point of presence of the same node, but a completely separate node.
Based on the work here https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213#note_731431339 I realised that channel state readiness should be tied to the first subchannel's state readiness. That means if the channel is ready, then the first subchannel should also be ready. I've added extra commentary about this there.
Writing up a quick summary from memory of what I've learned about the case of a failed connection. An `ErrorGRPCClientTimeout` error will arise when failing to connect; this is caused by our own timeout on connection establishment. If a GRPC call is made after the connection drops, we end up with a second kind of error. If the connection fails during a call, then we receive a third. Currently all these errors will be thrown when using the client. From the channel perspective, there are 5 states.
We start in the `IDLE` state.
So the 3 errors listed above originate from...
With the recent changes in #310 there are two ways connection errors are handled. The first is where we handle the errors thrown by the client when starting a connection or doing a GRPC call: when starting a connection, if it fails to establish, the error is thrown to the caller. The second is that the connection's channel is also watched asynchronously for the connection dying, which then destroys and removes the `NodeConnection`.
I think with the combination of these two mechanisms we should have the failure cases covered.
I've made a primitive start to mapping the various transitions for the connection states. The more I create this, the more I'm inclined to give the states proper titles too (as opposed to numbering them). Obviously though, this is still a WIP.
We'll be able to use the completed diagram to also check that we have adequate testing for each path of a `NodeConnection`'s lifecycle.
In the process of writing up a state diagram in #224, there's an interesting difference in behaviour with connection shutdown and corresponding client status that I'm trying to make sense of and elucidate through the diagram. This follows on from the following comments:
and #310 (comment)
The above 2 comments refer to the transitions when we move to connection shutdown.
So as a result of this:
From discussions this afternoon, there are a few extra states and transitions that need to be verified. Some first remarks:
The states and transitions to explore:
@tegefaulkes mentioned that he previously experimented with the TTL at the GRPC level, but it seemed to take more than 6 minutes for the client's state to transition to a failure state.
Got a final first draft of the connection state diagram:
Some notes:
@tegefaulkes made a comment about the bottom-left state in the diagram. Another comment was about one of the remaining transitions.
So in actual fact, if we assume that we have already seen the earlier transition, this behaviour is consistent. At the moment anyway, our internal TTL expires connections before this becomes an issue.
Based on feedback, a somewhat simpler version:
@tegefaulkes please try splitting the test modules for now, especially the nested `describe` blocks. Do some basic renaming; further renaming can be done later after the merge.
#332 has been created as a continuation of verifying the connection state diagram.
Specification

When errors are encountered in the `network` domain for a particular connection, we expect these errors to propagate up to the higher-level `NodeConnection`. In this process, an active `NodeConnection` object that experiences a lower-level failure from the `network` domain must also be destroyed and garbage collected, as the underlying connection is no longer valid.

As per Roger's comment here #224 (comment), the `network` domain can be seen to be operating in a separate space to the more procedural elements of the source code. The network components are singleton instances that operate and handle errors in isolation. So, how do we propagate these errors to the higher-level `NodeConnection`?

Additional context

- Previous comment about error handling at the `NodeConnection` level

Tasks

- Errors in the `network` domain: how do these arise? Where do they currently bubble up to?
- Investigate `grpc-js` for how to handle connection errors. Are there existing functions that could provide error handlers?
- Otherwise, fall back to interval pings (unary calls from within the `network` domain). We expect that these exceptions may only arise after a call is made, so these internal pings will be a means of assessing the connection status (i.e. no GRPC call error, connection is still active; GRPC call error, connection is no longer active, so destroy the `NodeConnection` object)