
Propagate networking connection error handling into NodeConnection error handling #224

Closed
joshuakarp opened this issue Aug 9, 2021 · 27 comments · Fixed by #310
Labels
bug Something isn't working development Standard development epic Big issue with multiple subissues r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices

Comments

@joshuakarp
Contributor

joshuakarp commented Aug 9, 2021

Specification

When errors are encountered in the network domain for a particular connection, we expect these errors to propagate up to the higher-level NodeConnection. In this process, an active NodeConnection object that experiences a lower-level failure from the network domain must also be destroyed and garbage collected, as the underlying connection is no longer valid.

As per Roger's comment here #224 (comment), the network domain can be seen as operating in a separate space from the more procedural elements of the source code. The network components are singleton instances that operate and handle errors in isolation. So, how do we propagate these errors to the higher-level NodeConnection?

NodeConnection error propagation (Cropped)

So the problem here is that the NodeManager or NodeConnection uses GRPCClientAgent which extends from GRPCClient.

The GRPCClient is what encapsulates the underlying Client object provided by grpc-js.

Currently, in the source code for GRPCClient, there's no provision for asynchronous handlers in the case where the GRPC connection breaks.

The GRPC connection can break when the networking domain has an error, or this may also happen in #276.

But there would be no way of handling this error event automatically. Instead, I believe an exception would only occur at the point of calling another client method. This is obviously not suitable for our NodeManager, as we need to handle when connections are dead and garbage collect them.

We need to investigate the source code, issues and docs for grpc-js https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/src/client.ts to see where we can add a handler for these GRPC client problems.

If we cannot do this, the alternative is to do interval pings, which are unary calls; these should trigger an exception if the connection has been broken. We already do this inside the network domain at a low level. But this would be a GRPC-level ping, which uses a full GRPC message. This would trigger exceptions to occur as well.

Also in reference to #225, we have to make a choice of whether NodeManager or NodeConnection is the primary way to access nodes. If users are using NodeConnection, then it would be expected that they call NodeManager to reconnect if there are problems. It seems NodeManager is the better choice, but this makes it a different architecture from how we are doing it in VaultManager. Which is fine, because vaults do not represent objects over a network. Networks are always flakier. So we don't expect one to lose a connection to a vault. But we do expect to lose connections to nodes. Which is why it may be better to have NodeManager handle all the complexity.

Check this https://github.com/grpc/grpc-node/search?q=connection+drop+handle&type=issues

Additional context

Tasks

  1. Investigate the current network connection errors that arise in the lower-level network domain: how do these arise? Where do they currently bubble up to?
  2. Investigate existing code, issues, and docs for grpc-js for how to handle connection errors. Are there existing functions that could provide error handlers?
  3. If no handlers exist at the GRPC level, look into how to perform internal pings (very similar to the keep-alive packets in the lower level network domain). We expect that these exceptions may only arise after a call is made, so these internal pings will be a means of assessing the connection status (i.e. no GRPC call error, connection is still active; GRPC call error, connection is no longer active, so destroy NodeConnection object)
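
As a rough illustration of task 3 (not existing code; ping() and destroyConnection() are hypothetical stand-ins for an agent-service unary call and the nodes-domain cleanup), the ping-based check could look something like:

// Hypothetical client surface: any unary call can serve as the ping.
type PingableClient = { ping(): Promise<void> };

async function checkConnection(
  client: PingableClient,
  destroyConnection: () => Promise<void>,
): Promise<boolean> {
  try {
    await client.ping(); // no GRPC call error: connection is still active
    return true;
  } catch {
    await destroyConnection(); // GRPC call error: connection is no longer active
    return false;
  }
}

// Poll on an interval rather than reacting to channel events.
function startPingLoop(
  client: PingableClient,
  destroyConnection: () => Promise<void>,
  intervalMs: number = 10_000,
): ReturnType<typeof setInterval> {
  return setInterval(() => void checkConnection(client, destroyConnection), intervalMs);
}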
@joshuakarp joshuakarp added the bug Something isn't working label Aug 9, 2021
@CMCDragonkai CMCDragonkai added development Standard development and removed bug Something isn't working labels Aug 24, 2021
@CMCDragonkai CMCDragonkai changed the title Handle errors for agent connection (losing connection, failing to connect) Test error handling at the GRPC client/server sites for connection problems in the networking domain (losing connection, failing to connect) Aug 24, 2021
@CMCDragonkai
Member

We must add tests that break the network domain connections while unary calls and streaming calls are taking place, and observe and test that behaviour. I'm expecting some sort of exception to take place on the next await/promise at the GRPC client stub and service handler simultaneously. But this should be tested and properly specced out.
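
For illustration, a jest-style sketch of such a test could look like the following; setupConnectedAgents, breakNetworkConnection and someUnaryCall are hypothetical stand-ins for the real fixtures and service methods:

// Hypothetical fixture: wires up a GRPCClientAgent through the proxies and exposes
// a way to sever the underlying network-domain connection; the unary call name is
// also made up.
declare function setupConnectedAgents(): Promise<{
  client: { someUnaryCall(msg: string): Promise<string> };
  breakNetworkConnection(): Promise<void>;
}>;

test('unary call rejects after the network connection is broken', async () => {
  const { client, breakNetworkConnection } = await setupConnectedAgents();
  const pending = client.someUnaryCall('hello'); // start a unary call
  await breakNetworkConnection(); // sever the network-domain connection mid-call
  // The next await on the call should reject with some GRPC-level error rather
  // than hang; pinning down the exact error class is part of the spec work.
  await expect(pending).rejects.toThrow();
});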

@CMCDragonkai
Member

The issue template here is set to bug, but should this be a development spec?

@CMCDragonkai CMCDragonkai added the epic Big issue with multiple subissues label Oct 26, 2021
@CMCDragonkai CMCDragonkai changed the title Test error handling at the GRPC client/server sites for connection problems in the networking domain (losing connection, failing to connect) Propagate networking connection error handling into NodeConnection error handling Oct 26, 2021
@joshuakarp
Contributor Author

joshuakarp commented Oct 27, 2021

As I've started to spec this issue out a bit more, I remembered that I'd actually already made a comment about error handling at the NodeConnection level on our issue on restructuring the nodes domain, where I talk about incorporating handlers at the nodes level: #225 (comment)

@joshuakarp
Contributor Author

@CMCDragonkai I initially created this issue in August with respect to this comment from you: https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/198#note_644652543

Is propagating the errors to a generic AgentConnectionError actually what this issue should be doing?

@CMCDragonkai
Member

You should be able to catch all networking exceptions in the nodes domain.

@CMCDragonkai
Member

CMCDragonkai commented Oct 27, 2021

Regarding AgentConnectionError, in that comment, I meant that there should be a general error for losing connection when calling from the PK Client CLI or GUI to the PK Agent. So it's about losing connection to the client service.

Whereas this issue is about losing the connection between PK agents, node to node. And this requires integrating with the networking domain.

Losing connection to the client service and client server does not use the networking domain. It's just a straightforward GRPC error that we need to catch and convert into a domain exception in src/client/errors.ts. That's a separate problem. This should be a separate issue. Note that this may already be covered by src/grpc/errors.ts:

class ErrorGRPCClientTimeout extends ErrorGRPC {}

class ErrorGRPCConnection extends ErrorGRPC {}

The PK CLI commands are going to be using the PolykeyClient which uses src/client. Losing connection to the agent should have a domain exception in src/client/errors. So the above GRPC exceptions should be converted.

There should be a good human readable message for this kind of exception when pk commands are executed on the terminal.

New issue in #276.
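
For illustration only, the conversion described above might look roughly like this; ErrorClientConnectionFailed is a hypothetical name for the client-domain error, while the two GRPC errors are the ones quoted above (assumed importable from src/grpc/errors):

import { ErrorGRPCClientTimeout, ErrorGRPCConnection } from './grpc/errors';

// Hypothetical client-domain error; the real name in src/client/errors.ts may differ.
class ErrorClientConnectionFailed extends Error {}

// Wrap a client-service call so the generic GRPC exceptions above are converted
// into a domain exception with a human-readable message.
async function callClientService<T>(call: () => Promise<T>): Promise<T> {
  try {
    return await call();
  } catch (e) {
    if (e instanceof ErrorGRPCClientTimeout || e instanceof ErrorGRPCConnection) {
      throw new ErrorClientConnectionFailed(
        'Unable to connect to the Polykey agent; is the agent running?',
      );
    }
    throw e;
  }
}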

@CMCDragonkai
Member

CMCDragonkai commented Oct 27, 2021

So the problem here is that the NodeManager or NodeConnection uses GRPCClientAgent which extends from GRPCClient.

The GRPCClient is what encapsulates the underlying Client object provided by grpc-js.

Currently, in the source code for GRPCClient, there's no provision for asynchronous handlers in the case where the GRPC connection breaks.

The GRPC connection can break when the networking domain has an error, or this may also happen in #276.

But there would be no way of handling this error event automatically. Instead, I believe an exception would only occur at the point of calling another client method. This is obviously not suitable for our NodeManager, as we need to handle when connections are dead and garbage collect them.

We need to investigate the source code, issues and docs for grpc-js https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/src/client.ts to see where we can add a handler for these GRPC client problems.

If we cannot do this, the alternative is to do interval pings, which are unary calls; these should trigger an exception if the connection has been broken. We already do this inside the network domain at a low level. But this would be a GRPC-level ping, which uses a full GRPC message. This would trigger exceptions to occur as well.

Also in reference to #225, we have to make a choice of whether NodeManager or NodeConnection is the primary way to access nodes. If users are using NodeConnection, then it would be expected that they call NodeManager to reconnect if there are problems. It seems NodeManager is the better choice, but this makes it a different architecture from how we are doing it in VaultManager. Which is fine, because vaults do not represent objects over a network. Networks are always flakier. So we don't expect one to lose a connection to a vault. But we do expect to lose connections to nodes. Which is why it may be better to have NodeManager handle all the complexity.

Check this https://github.com/grpc/grpc-node/search?q=connection+drop+handle&type=issues

@joshuakarp
Contributor Author

Recall our TCP connection between the ForwardProxy and GRPCClient, as per:

Network diagram

image

We'd discussed earlier that the networking components of Polykey can be treated as a completely separate component of the system. They have no knowledge of GRPC, etc.

However, the GRPCClient and ForwardProxy are connected through "a legitimate fat free 100% natural TCP connection" (Roger's words). This means that an explicit close through FIN/ACK is propagated from the ForwardProxy to the GRPCClient, and the GRPCClient can close the connection. This is different from an exception arising (e.g. a dropped connection), which cannot be propagated.

This also means that the NodeConnection needs to be garbage collected when this explicit close occurs. Roger mentioned that there should be a callback that grpc-js exposes in order to handle closing.

@CMCDragonkai
Member

CMCDragonkai commented Nov 12, 2021

Leaving some notes here regarding management of channel or subchannel state in GRPC client.

Client Channel Options

It is possible to override the Channel used by GRPC. Remember: 1 Client = 1 Channel. 1 Channel can be a subchannel pool, and each subchannel is 1 TCP connection using HTTP2 frames.

The options parameter of Client is:

export declare type ClientOptions = Partial<ChannelOptions> & {
    channelOverride?: Channel;
    channelFactoryOverride?: (address: string, credentials: ChannelCredentials, options: ClientOptions) => Channel;
    interceptors?: Interceptor[];
    interceptor_providers?: InterceptorProvider[];
    callInvocationTransformer?: CallInvocationTransformer;
};

See the channelOverride and channelFactoryOverride options. I believe one of these can be set to override the channel object used by the client. Normally the client creates the channel object internally, but this allows us to customise the channel object beforehand.

See:

    if (options.channelOverride) {
      this[CHANNEL_SYMBOL] = options.channelOverride;
    } else if (options.channelFactoryOverride) {
      const channelFactoryOverride = options.channelFactoryOverride;
      delete options.channelFactoryOverride;
      this[CHANNEL_SYMBOL] = channelFactoryOverride(
        address,
        credentials,
        options
      );
    } else {
      this[CHANNEL_SYMBOL] = new ChannelImplementation(
        address,
        credentials,
        options
      );
    }

Because this allows us to change the entire channel implementation, we could potentially change how we do the network domain entirely. That is, we could fix #234 by making our own Channel implementation use a reliable UDP transport instead of proxying over the forward and reverse proxies. This would likely be a lot more efficient, and we could concentrate our TLS logic in one place too.
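
For reference, a minimal sketch of where channelFactoryOverride plugs in, based on the ClientOptions type above; makeCustomChannel is a hypothetical factory for a custom Channel implementation (writing one is a large job and not shown here):

import type {
  Channel,
  ChannelCredentials,
  ClientOptions,
} from '@grpc/grpc-js';
import { Client, credentials } from '@grpc/grpc-js';

// Hypothetical factory for a custom Channel implementation (e.g. one backed by a
// reliable UDP transport, as floated for #234).
declare const makeCustomChannel: (
  address: string,
  creds: ChannelCredentials,
  options: ClientOptions,
) => Channel;

const options: ClientOptions = {
  channelFactoryOverride: makeCustomChannel,
};

// GRPCClientAgent's generated client ultimately extends this base Client, so the
// same options could be threaded through its constructor.
const client = new Client(
  'proxy.polykey.example:1234', // hypothetical proxy address
  credentials.createInsecure(),
  options,
);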

Channel Methods for Acquiring Connectivity State

Internally in GRPC, there's a ChannelImplementation class that implements the Channel interface.

The interface contains 2 interesting methods:

  /**
   * Get the channel's current connectivity state. This method is here mainly
   * because it is in the existing internal Channel class, and there isn't
   * another good place to put it.
   * @param tryToConnect If true, the channel will start connecting if it is
   *     idle. Otherwise, idle channels will only start connecting when a
   *     call starts.
   */
  getConnectivityState(tryToConnect: boolean): ConnectivityState;
  /**
   * Watch for connectivity state changes. This is also here mainly because
   * it is in the existing external Channel class.
   * @param currentState The state to watch for transitions from. This should
   *     always be populated by calling getConnectivityState immediately
   *     before.
   * @param deadline A deadline for waiting for a state change
   * @param callback Called with no error when a state change, or with an
   *     error if the deadline passes without a state change.
   */
  watchConnectivityState(
    currentState: ConnectivityState,
    deadline: Date | number,
    callback: (error?: Error) => void
  ): void;

The connectivity state is an enum:

export declare enum ConnectivityState {
    IDLE = 0,
    CONNECTING = 1,
    READY = 2,
    TRANSIENT_FAILURE = 3,
    SHUTDOWN = 4
}

We could potentially hook into these connectivity state changes and use them to understand what the state of the connection is, and therefore react to it in NodeConnection and perform garbage collection.

In fact the waitForReady method of the client makes use of these 2 functions in order to know whether the client is ready or not. See https://github.com/grpc/grpc-node/blob/ff387c9cc733abb2941c722ca86b454b82a4906d/packages/grpc-js/src/client.ts#L169
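
A rough sketch (not the implementation that eventually landed in #310) of driving NodeConnection cleanup off these two methods; the channel would come from client.getChannel(), and onDead stands in for whatever destroys the NodeConnection and removes it from the connection map:

import type { Channel } from '@grpc/grpc-js';

// ConnectivityState values from the enum quoted above; plain numbers are used here
// to keep the sketch self-contained.
const IDLE = 0;
const READY = 2;
const SHUTDOWN = 4;

function watchChannel(channel: Channel, onDead: () => Promise<void>): void {
  let seenReady = false;
  const watch = (tryToConnect: boolean): void => {
    const lastState = channel.getConnectivityState(tryToConnect);
    // Wait up to a minute for a transition away from lastState, then re-check.
    channel.watchConnectivityState(lastState, Date.now() + 60_000, () => {
      const state = channel.getConnectivityState(false) as number;
      if (state === READY) {
        seenReady = true;
      } else if (state === SHUTDOWN || (seenReady && state === IDLE)) {
        // Shut down, or dropped back to IDLE after having been READY: treat the
        // underlying connection as dead and let the nodes domain clean it up.
        void onDead();
        return;
      }
      watch(false);
    });
  };
  watch(true); // true: start connecting if the channel is currently idle
}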

Subchannel Pool

Then each channel has a subchannel pool. And each subchannel has its own connectivity state. It is still unclear to me how the subchannel connectivity state relates to the connectivity state of the channel. But it appears the state of the subchannels does affect the state of the channel.

Subchannel

Each subchannel has its own connectivity state, and its own Http2Session. This is interesting because the session is only null when the state is IDLE or TRANSIENT_FAILURE.

Now I noticed that even when the channel state is READY, the subchannel state may actually be IDLE. Furthermore, the only place where the channel state is updated is the updateState private method. And this is used by the load balancer. Which might mean that the load balancer uses the subchannel pool state to drive the channel state.

export class Subchannel {
  /**
   * The subchannel's current connectivity state. Invariant: `session` === `null`
   * if and only if `connectivityState` is IDLE or TRANSIENT_FAILURE.
   */
  private connectivityState: ConnectivityState = ConnectivityState.IDLE;
  /**
   * The underlying http2 session used to make requests.
   */
  private session: http2.ClientHttp2Session | null = null;

@CMCDragonkai
Member

For now we won't replace the network domain just yet. We'll leave that to #234.

Even without implementing our own channel implementation, one can interact with the channel by exposing client.getChannel() on our GRPCClient.

But I'd prefer to abstract that and have our own callback mechanisms built into GRPCClient, where internally it just calls client.getChannel() and works with the channel directly. This is because the Channel is quite complex and external users shouldn't really be using it. That way GRPC internals are encapsulated in the grpc domain.

Then nodes domain just has to make use of new methods on the GRPCClient.

Recommend cloning the https://github.com/grpc/grpc-node/tree/master/packages/grpc-js repo and taking a look when tackling this @joshuakarp.

@CMCDragonkai CMCDragonkai added the bug Something isn't working label Nov 12, 2021
@CMCDragonkai
Member

Further info: https://stackoverflow.com/questions/63749113/grpc-call-channel-connection-and-http-2-lifecycle.

Note that GRPC can support hostname resolution too, and this impacts how the channels/subchannels/load balancers work. However I'm not sure if we would want to allow GRPC to do this work since we need to ensure that our proxies are working properly.

If there are multiple IPs for a given hostname, we would want separate node connections, not a single load-balanced node connection. Each IP there is considered not a point of presence for the same node, but a completely separate node.

@CMCDragonkai
Member

CMCDragonkai commented Nov 12, 2021

Based on the work here https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213#note_731431339 I realised that channel state readiness should be tied with the first subchannel state readiness.

That means if the channel is ready, then the first subchannel should also be ready. I've added extra commentary to the grpc/utils.ts getClientSession utility function there. Will be available when merged.

@tegefaulkes
Contributor

Writing up a quick summary from memory of what I've learned.

In the case of a failed connection, an ErrorGRPCClientTimeout error will arise when failing to connect. This is caused by the timeout at await waitForReady(timeout) in src/grpc/GRPCClient.ts:130.

If a GRPC call is made after the connection drops, we end up with an ErrorGRPCClientCall with message 14 UNAVAILABLE: No connection established. However, NodeConnection is watching the channel, so it will very likely be destroyed before the next GRPC call is made, resulting in an ErrorAgentClientDestroyed error instead.

If the connection fails during a call, then we receive an ErrorGRPCClientCall with message 14 UNAVAILABLE: Connection dropped.

Currently all these errors will be thrown when using NodeConnectionManager.withConnection(). Otherwise connection problems are handled asynchronously and silently, or an ErrorGRPCClientTimeout may happen if you use NodeConnectionManager.createConnection(). The ErrorGRPCClientCall 14 UNAVAILABLE errors are too generic and may need more specific errors created for them.

From the channel perspective, there are 5 states.

  • IDLE - no active connection; we start here.
  • CONNECTING - in the process of connecting.
  • READY - currently active connection.
  • TRANSIENT_FAILURE - temporary failure; we don't care about this since it will transition away from it automatically.
  • SHUTDOWN - the GRPC client has been destroyed; only initiated by the client when connection negotiation fails or client.close() is called.

We start in the IDLE state; when connecting we transition to CONNECTING, and then READY if the connection succeeds. If we transition back to IDLE we can assume the connection is inactive. Currently, if we transition to IDLE after having seen a READY state once, we assume the connection is dead and proceed to destroy the connection. However, we could add an extra step where we try to reconnect when we see IDLE, and if we fail to connect then we can assume the connection is dead and proceed to destroy it. If we see a SHUTDOWN state then the connection was closed by the client and we can proceed to destroy.
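
For clarity, the decision policy described above can be written down separately from the channel watching itself; this is only a sketch of the policy, with the suggested reconnect-on-IDLE step noted but not implemented:

type ChannelState = 'IDLE' | 'CONNECTING' | 'READY' | 'TRANSIENT_FAILURE' | 'SHUTDOWN';

// Given whether READY has been observed at least once and the newly observed
// channel state, decide whether the NodeConnection should be destroyed.
function shouldDestroy(seenReady: boolean, state: ChannelState): boolean {
  if (state === 'SHUTDOWN') return true; // closed by the client, or negotiation failed
  if (state === 'IDLE') {
    // Dropping back to IDLE after having been READY means the connection is dead.
    // The suggested refinement is to attempt a reconnect here first and only
    // destroy if that also fails; that step is not shown.
    return seenReady;
  }
  // CONNECTING and READY need no action; TRANSIENT_FAILURE transitions away
  // automatically, so it is ignored as well.
  return false;
}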

@tegefaulkes
Contributor

tegefaulkes commented Jan 31, 2022

So, the 3 errors listed above originate from...

  • ErrorGRPCClientTimeout is from GRPCClient when attempting to connect and it times out at waitForReady.
  • ErrorGRPCClientCall originates from src/grpc/utils/utils.ts:166 toError, when the error comes through the GRPC call.

With the recent changes in #310 there are two ways connection errors are handled. The first is where we handle the errors thrown by the client when starting a connection or doing a GRPC call. When starting a connection, if the connection fails to establish, the NodeConnection throws and is never added to the connection map. Likewise, when a network error happens during a withConnection, the NodeConnection is destroyed and removed from the connection map.

The connection's channel is also watched asynchronously for the connection dying, which then destroys and removes the NodeConnection as well. This happens passively.
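
A condensed sketch of the first (synchronous) handling path; the types are hypothetical and the real NodeConnectionManager.withConnection involves more (locking, timeouts, and so on):

type NodeId = string;

interface NodeConnection {
  destroy(): Promise<void>;
}

class ConnectionMapSketch {
  protected connections: Map<NodeId, NodeConnection> = new Map();

  public async withConnection<T>(
    nodeId: NodeId,
    getConnection: () => Promise<NodeConnection>,
    f: (conn: NodeConnection) => Promise<T>,
  ): Promise<T> {
    // If establishing the connection throws, nothing is ever added to the map.
    const conn = this.connections.get(nodeId) ?? (await getConnection());
    this.connections.set(nodeId, conn);
    try {
      return await f(conn);
    } catch (e) {
      // In the real code only connection/network errors trigger this teardown;
      // here any failure destroys the connection and removes it from the map.
      await conn.destroy();
      this.connections.delete(nodeId);
      throw e;
    }
  }
}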

@tegefaulkes
Contributor

I think that with the combination of withConnection and the channel watcher, this is addressed in #310.

@joshuakarp
Contributor Author

joshuakarp commented Feb 1, 2022

I've made a primitive start to mapping the various transitions for NodeConnection state. Still unfinished, but gives an idea of the structure of the diagram, as per discussion from #310 (comment). I'll keep comments concerning its progress in this issue, to keep #310 as clear as possible.

image

The more I create this, the more I'm inclined to give the states proper titles too (as opposed to numbering them). Obviously though, this is still a WIP.

@startuml
' Inactive connection
state "NodeConnection 1" as NC1 {
  NC1 : running: false
  NC1 : destroyed: false
  NC1 : status: starting
  state "Client" as Client1 {
    Client1 : IDLE
  }
}

NC1 --> NC2 : conn.start()

' Connecting to target
state "NodeConnection 2" as NC2 {
  NC2 : running: false
  NC2 : destroyed: false
  NC2 : status: starting
  state "Client" as Client2 {
    Client2 : CONNECTING
  }
}
 
NC2 --> NC3 : successfully connected to target

' Active connection
state "NodeConnection 3" as NC3 {
  NC3 : running: true
  NC3 : destroyed: false
  NC3 : status: starting
  state "Client" as Client3 {
    Client3 : READY
  }
}

NC2 --> NC4 : failed to connect to target
NC2 --> NC5 : timed out | failed to validate certificate

' Inactive connection
state "NodeConnection 4" as NC4 {
  NC4 : running: true
  NC4 : destroyed: false
  NC4 : status: starting
  state "Client" as Client4 {
    Client4 : READY
  }
}

' Failed certificate validation
state "NodeConnection 5" as NC5 {
  NC5 : running: true
  NC5 : destroyed: false
  NC5 : status: starting
  state "Client" as Client5 {
    Client5 : TRANSIENT_FAILURE
  }
}

@enduml

@joshuakarp
Contributor Author

Will be able to use the completed diagram to also check that we have adequate testing for each path of a NodeConnection lifecycle. The logging outputs in NodeConnection.test.ts are especially useful to check the transitions between client states here.

@joshuakarp
Contributor Author

In the process of writing up a state diagram in #224, there's an interesting difference in behaviour with connection shutdown and corresponding client status that I'm trying to make sense of and elucidate through the diagram. This follows on from the following comments:

#310 (comment)

Found another bug: after connecting to a PolykeyAgent, when the PolykeyAgent stops, the NodeConnection is not closing properly.

INFO:nodeConnection:Creating NodeConnection
INFO:GRPCClientAgent:Creating GRPCClientAgent connecting to 0.0.0.0:45568
INFO:connectionState:State is now IDLE
INFO:connectionState:State is now CONNECTING
INFO:connectionState:State is now READY
INFO:GRPCClientAgent:Created GRPCClientAgent connecting to 0.0.0.0:45568
INFO:nodeConnection:Created NodeConnection
// Stopping PolykeyAgent here
INFO:connectionState:State is now IDLE
WARN:ConnectionReverse 127.0.0.1:11110:Server Error: ErrorConnectionEndTimeout
WARN:ConnectionReverse 127.0.0.1:11110:Reverse Error: ErrorConnectionEndTimeout

Not sure what's happening here. The connection is trying to close gracefully but the client never sees the connection ending.

and #310 (comment)

When the server is stopped we do see the FIN ACK packets closing the connection. This corresponds with the channel transitioning to IDLE.

IDLE
CONNECTING
READY
// server shutdown, 
// server initiates end.
IDLE

The above 2 comments refer to the transitions where we move to IDLE after a connection shutdown. However, there's also a transition where we move to SHUTDOWN. I want to make this process clear. We have 2 different tests that demonstrate these 2 scenarios:

  1. connects to its target (via direct connection)
// create NodeConnection (creates client)
Client: IDLE
Client: CONNECTING
// connection created
Client: READY
// explicit call to conn.destroy() (destroys client)
Client: SHUTDOWN
// connection and client destroyed, channel closed

Log output:

Important part:

 INFO:NodeConnection Test:Creating NodeConnection
 INFO:GRPCClientAgent:Creating GRPCClientAgent connecting to 127.0.0.2:11111
 INFO:connectionState:State is now IDLE
 INFO:connectionState:State is now CONNECTING
 INFO:NodeConnection Test:Handling connection from 127.0.0.1:11110
 INFO:ConnectionReverse 127.0.0.1:11110:Starting Connection Reverse
 INFO:ConnectionReverse 127.0.0.1:11110:Started Connection Reverse
 INFO:ConnectionReverse 127.0.0.1:11110:Composing Connection Reverse
 INFO:connectionState:State is now READY
 INFO:GRPCClientAgent:Created GRPCClientAgent connecting to 127.0.0.2:11111
 INFO:NodeConnection Test:Created NodeConnection
 INFO:ConnectionReverse 127.0.0.1:11110:Composed Connection Reverse
 INFO:NodeConnection Test:Handled connection from 127.0.0.1:11110
 INFO:NodeConnection Test:Destroying NodeConnection
 INFO:GRPCClientAgent:Destroying GRPCClientAgent connected to 127.0.0.2:11111
 INFO:connectionState:State is now SHUTDOWN
 INFO:GRPCClientAgent:Destroyed GRPCClientAgent connected to 127.0.0.2:11111
 INFO:NodeConnection Test:Destroyed NodeConnection
 INFO:connectionState:Finished watching, Channel has been shut down
 DEBUG:ConnectionReverse 127.0.0.1:11110:Receives tlsSocket ending
 DEBUG:ConnectionReverse 127.0.0.1:11110:Responds tlsSocket ending
 DEBUG:ConnectionReverse 127.0.0.1:11110:Responded tlsSocket ending
 INFO:ConnectionReverse 127.0.0.1:11110:Stopping Connection Reverse
 DEBUG:ConnectionReverse 127.0.0.1:11110:Sends serverSocket ending
 INFO:ConnectionReverse 127.0.0.1:11110:Stopped Connection Reverse
 WARN:ConnectionForward 127.0.0.2:11111:Forward Error: ErrorConnectionEndTimeout

Full log:

Determining test suites to run...
GLOBAL SETUP
Global Data Dir: /run/user/1000/polykey-test-global-UonOea
INFO:NodeConnection Test:Creating KeyManager
INFO:NodeConnection Test:Setting keys path to /run/user/1000/polykey-test-serverjxH6ga/serverKeys
INFO:NodeConnection Test:Starting KeyManager
INFO:NodeConnection Test:Checking /run/user/1000/polykey-test-serverjxH6ga/serverKeys/root.pub and /run/user/1000/polykey-test-serverjxH6ga/serverKeys/root.key
INFO:NodeConnection Test:Generating root key pair
INFO:NodeConnection Test:Writing /run/user/1000/polykey-test-serverjxH6ga/serverKeys/root.pub and /run/user/1000/polykey-test-serverjxH6ga/serverKeys/root.key
INFO:NodeConnection Test:Checking /run/user/1000/polykey-test-serverjxH6ga/serverKeys/root.crt
INFO:NodeConnection Test:Generating root certificate
INFO:NodeConnection Test:Writing /run/user/1000/polykey-test-serverjxH6ga/serverKeys/root.crt
INFO:NodeConnection Test:Checking /run/user/1000/polykey-test-serverjxH6ga/serverKeys/db.key
INFO:NodeConnection Test:Generating keys db key
INFO:NodeConnection Test:Writing /run/user/1000/polykey-test-serverjxH6ga/serverKeys/db.key
INFO:NodeConnection Test:Checking /run/user/1000/polykey-test-serverjxH6ga/serverKeys/vault.key
INFO:NodeConnection Test:Generating keys db key
INFO:NodeConnection Test:Writing /run/user/1000/polykey-test-serverjxH6ga/serverKeys/vault.key
INFO:NodeConnection Test:Started KeyManager
INFO:NodeConnection Test:Created KeyManager
INFO:NodeConnection Test:Creating DB
INFO:NodeConnection Test:Starting DB
INFO:NodeConnection Test:Setting DB path to /run/user/1000/polykey-test-serverjxH6ga/serverDb
INFO:NodeConnection Test:Started DB
INFO:NodeConnection Test:Created DB
INFO:NodeConnection Test:Creating ACL
INFO:NodeConnection Test:Starting ACL
INFO:NodeConnection Test:Started ACL
INFO:NodeConnection Test:Created ACL
INFO:NodeConnection Test:Creating Sigchain
INFO:NodeConnection Test:Starting Sigchain
INFO:NodeConnection Test:Started Sigchain
INFO:NodeConnection Test:Created Sigchain
INFO:NodeConnection Test:Creating GestaltGraph
INFO:NodeConnection Test:Starting GestaltGraph
INFO:NodeConnection Test:Started GestaltGraph
INFO:NodeConnection Test:Created GestaltGraph
INFO:NodeConnection Test:Creating Forward Proxy
INFO:NodeConnection Test:Created Forward Proxy
INFO:NodeConnection Test:Creating Reverse Proxy
INFO:NodeConnection Test:Created Reverse Proxy
INFO:NodeConnection Test:Creating NodeGraph
INFO:NodeConnection Test:Starting NodeGraph
INFO:NodeConnection Test:Started NodeGraph
INFO:NodeConnection Test:Created NodeGraph
INFO:NodeConnection Test:Creating NodeConnectionManager
INFO:NodeConnection Test:Created NodeConnectionManager
INFO:NodeConnection Test:Starting NodeConnectionManager
INFO:NodeConnection Test:Started NodeConnectionManager
INFO:NodeConnection Test:Creating NodeManager
INFO:NodeConnection Test:Created NodeManager
INFO:NodeConnection Test:Creating VaultManager
INFO:NodeConnection Test:Created VaultManager
INFO:NodeConnection Test:Starting VaultManager
INFO:DB:Creating DB
INFO:DB:Starting DB
INFO:DB:Setting DB path to /run/user/1000/polykey-test-serverjxH6ga/serverVaults
INFO:DB:Started DB
INFO:DB:Created DB
INFO:NodeConnection Test:Starting EncryptedFS
INFO:NodeConnection Test:Started EncryptedFS
INFO:NodeConnection Test:Started VaultManager
INFO:NodeConnection Test:Creating NotificationsManager
INFO:NodeConnection Test:Starting NotificationsManager
INFO:NodeConnection Test:Started NotificationsManager
INFO:NodeConnection Test:Created NotificationsManager
INFO:NodeConnection Test:Starting GRPCServer on 127.0.0.2:0
INFO:NodeConnection Test:Started GRPCServer on 127.0.0.2:35853
INFO:NodeConnection Test:Starting Reverse Proxy from 127.0.0.2:11111 to 127.0.0.2:35853
INFO:NodeConnection Test:Started Reverse Proxy from 127.0.0.2:11111 to 127.0.0.2:35853
INFO:NodeConnection Test:Creating KeyManager
INFO:NodeConnection Test:Setting keys path to /run/user/1000/polykey-test-clientWAbVX6/clientKeys
INFO:NodeConnection Test:Starting KeyManager
INFO:NodeConnection Test:Checking /run/user/1000/polykey-test-clientWAbVX6/clientKeys/root.pub and /run/user/1000/polykey-test-clientWAbVX6/clientKeys/root.key
INFO:NodeConnection Test:Generating root key pair
INFO:NodeConnection Test:Writing /run/user/1000/polykey-test-clientWAbVX6/clientKeys/root.pub and /run/user/1000/polykey-test-clientWAbVX6/clientKeys/root.key
INFO:NodeConnection Test:Checking /run/user/1000/polykey-test-clientWAbVX6/clientKeys/root.crt
INFO:NodeConnection Test:Generating root certificate
INFO:NodeConnection Test:Writing /run/user/1000/polykey-test-clientWAbVX6/clientKeys/root.crt
INFO:NodeConnection Test:Checking /run/user/1000/polykey-test-clientWAbVX6/clientKeys/db.key
INFO:NodeConnection Test:Generating keys db key
INFO:NodeConnection Test:Writing /run/user/1000/polykey-test-clientWAbVX6/clientKeys/db.key
INFO:NodeConnection Test:Checking /run/user/1000/polykey-test-clientWAbVX6/clientKeys/vault.key
INFO:NodeConnection Test:Generating keys db key
INFO:NodeConnection Test:Writing /run/user/1000/polykey-test-clientWAbVX6/clientKeys/vault.key
INFO:NodeConnection Test:Started KeyManager
INFO:NodeConnection Test:Created KeyManager
INFO:NodeConnection Test:Creating NodeConnection
INFO:GRPCClientAgent:Creating GRPCClientAgent connecting to 127.0.0.2:11111
INFO:connectionState:State is now IDLE
INFO:connectionState:State is now CONNECTING
INFO:NodeConnection Test:Handling connection from 127.0.0.1:11110
INFO:ConnectionReverse 127.0.0.1:11110:Starting Connection Reverse
INFO:ConnectionReverse 127.0.0.1:11110:Started Connection Reverse
INFO:ConnectionReverse 127.0.0.1:11110:Composing Connection Reverse
INFO:connectionState:State is now READY
INFO:GRPCClientAgent:Created GRPCClientAgent connecting to 127.0.0.2:11111
INFO:NodeConnection Test:Created NodeConnection
INFO:ConnectionReverse 127.0.0.1:11110:Composed Connection Reverse
INFO:NodeConnection Test:Handled connection from 127.0.0.1:11110
INFO:NodeConnection Test:Destroying NodeConnection
INFO:GRPCClientAgent:Destroying GRPCClientAgent connected to 127.0.0.2:11111
INFO:connectionState:State is now SHUTDOWN
INFO:GRPCClientAgent:Destroyed GRPCClientAgent connected to 127.0.0.2:11111
INFO:NodeConnection Test:Destroyed NodeConnection
INFO:connectionState:Finished watching, Channel has been shut down
DEBUG:ConnectionReverse 127.0.0.1:11110:Receives tlsSocket ending
DEBUG:ConnectionReverse 127.0.0.1:11110:Responds tlsSocket ending
DEBUG:ConnectionReverse 127.0.0.1:11110:Responded tlsSocket ending
INFO:ConnectionReverse 127.0.0.1:11110:Stopping Connection Reverse
DEBUG:ConnectionReverse 127.0.0.1:11110:Sends serverSocket ending
INFO:ConnectionReverse 127.0.0.1:11110:Stopped Connection Reverse
WARN:ConnectionForward 127.0.0.2:11111:Forward Error: ErrorConnectionEndTimeout
INFO:NodeConnection Test:Stopping KeyManager
INFO:NodeConnection Test:Stopped KeyManager
INFO:NodeConnection Test:Destroying KeyManager
INFO:NodeConnection Test:Destroyed KeyManager
INFO:NodeConnection Test:Stopping ACL
INFO:NodeConnection Test:Stopped ACL
INFO:NodeConnection Test:Destroying ACL
INFO:NodeConnection Test:Destroyed ACL
INFO:NodeConnection Test:Stopping Sigchain
INFO:NodeConnection Test:Stopped Sigchain
INFO:NodeConnection Test:Destroying Sigchain
INFO:NodeConnection Test:Destroyed Sigchain
INFO:NodeConnection Test:Stopping GestaltGraph
INFO:NodeConnection Test:Stopped GestaltGraph
INFO:NodeConnection Test:Destroying GestaltGraph
INFO:NodeConnection Test:Destroyed GestaltGraph
INFO:NodeConnection Test:Stopping VaultManager
INFO:NodeConnection Test:Stopping EncryptedFS
INFO:DB:Stopping DB
INFO:DB:Stopped DB
INFO:NodeConnection Test:Stopped EncryptedFS
INFO:NodeConnection Test:Stopped VaultManager
INFO:NodeConnection Test:Destroying VaultManager
INFO:NodeConnection Test:Destroying EncryptedFS
INFO:DB:Destroying DB
INFO:DB:Destroyed DB
INFO:NodeConnection Test:Destroyed EncryptedFS
INFO:NodeConnection Test:Removing vaults directory at '/run/user/1000/polykey-test-serverjxH6ga/serverVaults'
INFO:NodeConnection Test:Destroyed VaultManager
INFO:NodeConnection Test:Stopping NodeGraph
INFO:NodeConnection Test:Stopped NodeGraph
INFO:NodeConnection Test:Destroying NodeGraph
INFO:NodeConnection Test:Destroyed NodeGraph
INFO:NodeConnection Test:Stopping NodeConnectionManager
INFO:NodeConnection Test:Stopped NodeConnectionManager
INFO:NodeConnection Test:Stopping NotificationsManager
INFO:NodeConnection Test:Stopped NotificationsManager
INFO:NodeConnection Test:Destroying NotificationsManager
INFO:NodeConnection Test:Destroyed NotificationsManager
INFO:NodeConnection Test:Stopping GRPCServer
INFO:NodeConnection Test:Stopped GRPCServer
INFO:NodeConnection Test:Stopping Reverse Proxy
INFO:NodeConnection Test:Stopped Reverse Proxy
INFO:NodeConnection Test:Stopping KeyManager
INFO:NodeConnection Test:Stopped KeyManager
INFO:NodeConnection Test:Destroying KeyManager
INFO:NodeConnection Test:Destroyed KeyManager
INFO:NodeConnection Test:Stopping DB
INFO:NodeConnection Test:Stopped DB
INFO:NodeConnection Test:Destroying DB
INFO:NodeConnection Test:Destroyed DB
 PASS  tests/nodes/NodeConnection.test.ts (13.818 s)
  NodeConnection
    ✓ connects to its target (via direct connection) (1297 ms)
    ○ skipped session readiness
    ○ skipped grpcCall after connection drops
    ○ skipped fails to connect to target (times out)
    ○ skipped getRootCertChain
    ○ skipped getExpectedPublicKey
    should call `killSelf`
      ○ skipped if connection is closed based on bad certificate
      ○ skipped if connection is closed before TLS is established
      ○ skipped if the Agent is stopped.
      ○ skipped and throw if the server exit's during testUnaryFail
      ○ skipped and throw if the server kill's during testUnaryFail
      ○ skipped and throw if the server sigkill's during testUnaryFail
      ○ skipped and throw if the server exit's during testStreamFail
      ○ skipped and throw if the server kill's during testStreamFail
      ○ skipped and throw if the server sigkill's during testStreamFail

Test Suites: 1 passed, 1 total
Tests:       14 skipped, 1 passed, 15 total
Snapshots:   0 total
Time:        13.853 s
Ran all test suites matching /tests\/nodes\/NodeConnection.test.ts/i.
GLOBAL TEARDOWN
Destroying Global Data Dir: /run/user/1000/polykey-test-global-UonOea
  2. should call 'killSelf' - if the Agent is stopped
// create NodeConnection (creates client)
Client: IDLE
Client: CONNECTING
// connection created
Client: READY
// stop the remote keynode (server)
// server sends end through serverSocket + tlsSocket
Client: IDLE
// connection and client destroyed, channel closed

Log output:

Important part:

INFO:nodeConnection:Creating NodeConnection
INFO:GRPCClientAgent:Creating GRPCClientAgent connecting to 0.0.0.0:36569
INFO:connectionState:State is now IDLE
INFO:connectionState:State is now CONNECTING
INFO:ReverseProxy:Handling connection from 127.0.0.1:11110
INFO:ConnectionReverse 127.0.0.1:11110:Starting Connection Reverse
INFO:ConnectionReverse 127.0.0.1:11110:Started Connection Reverse
INFO:ConnectionReverse 127.0.0.1:11110:Composing Connection Reverse
INFO:connectionState:State is now READY
INFO:GRPCClientAgent:Created GRPCClientAgent connecting to 0.0.0.0:36569
INFO:nodeConnection:Created NodeConnection
INFO:PolykeyAgent:Stopping PolykeyAgent
INFO:Status:Begin Status STOPPING
INFO:Status:Writing Status to /run/user/1000/polykey-test-global-H63u2Q/PolykeyAgent3/status.json
INFO:ConnectionReverse 127.0.0.1:11110:Composed Connection Reverse
INFO:ReverseProxy:Handled connection from 127.0.0.1:11110
INFO:Status:Status is STOPPING
INFO:SessionManager:Stopping SessionManager
INFO:SessionManager:Stopped SessionManager
INFO:NotificationsManager:Stopping NotificationsManager
INFO:NotificationsManager:Stopped NotificationsManager
INFO:VaultManager:Stopping VaultManager
INFO:VaultManager:Stopping EncryptedFS
INFO:DB:Stopping DB
INFO:DB:Stopped DB
INFO:VaultManager:Stopped EncryptedFS
INFO:VaultManager:Stopped VaultManager
INFO:Discovery:Destroying Discovery
INFO:Discovery:Destroyed Discovery
INFO:NodeConnectionManager:Stopping NodeConnectionManager
INFO:NodeConnectionManager:Stopped NodeConnectionManager
INFO:NodeGraph:Stopping NodeGraph
INFO:NodeGraph:Stopped NodeGraph
INFO:ReverseProxy:Stopping Reverse Proxy
INFO:ConnectionReverse 127.0.0.1:11110:Stopping Connection Reverse
DEBUG:ConnectionReverse 127.0.0.1:11110:Sends serverSocket ending
DEBUG:ConnectionReverse 127.0.0.1:11110:Sends tlsSocket ending
INFO:connectionState:State is now IDLE
INFO:nodeConnection:Destroying NodeConnection
INFO:GRPCClientAgent:Destroying GRPCClientAgent connected to 0.0.0.0:36569
INFO:GRPCClientAgent:Destroyed GRPCClientAgent connected to 0.0.0.0:36569
INFO:nodeConnection:Destroyed NodeConnection
INFO:connectionState:Finished watching, Channel has been shut down
WARN:ConnectionReverse 127.0.0.1:11110:Server Error: ErrorConnectionEndTimeout
WARN:ConnectionReverse 127.0.0.1:11110:Reverse Error: ErrorConnectionEndTimeout
INFO:ConnectionReverse 127.0.0.1:11110:Stopped Connection Reverse

Full log:

TBC

So as a result of this:

  • if the server initiates a connection closure, then our connection transitions to IDLE
  • if the client specifically closes the connection, then we transition to SHUTDOWN

@joshuakarp
Contributor Author

joshuakarp commented Feb 8, 2022

From discussions this afternoon, there are a few extra states and transitions that need to be verified.

Some first remarks:

  • we have two "TTLs": the TTL at the GRPC level, and also the TTL at the NodeConnectionManager level (specified by connTimeoutTime)
  • a connection between client and server can be seen as a combination of two layers: the GRPC layer, and the TCP layer that GRPC sits on top of. e.g.:
    ┌──────┐    gRPC   ┌──────┐
    │      │◄─────────►│      │
    │Client│           │Server│
    │      │◄─────────►│      │
    └──────┘    TCP    └──────┘
    

The states and transitions to explore:

  1. With an active connection, wait until the TTL expires at the GRPC level. We expect the client to go IDLE, but nothing else should happen.
  2. With an active connection, and an IDLE client (i.e. an expired TTL), we expect the client to transition to READY after a GRPC call is made
    • note that this has already been verified from a READY client (see NodeConnectionManager.test.ts and should time out a connection)
  3. With an active connection, close both the GRPC and TCP layers (i.e. connection shutdown). We expect the client to go IDLE. The following:
     ┌──────┐    gRPC   ┌──────┐
     │      │◄─────────x│      │
     │Client│           │Server│
     │      │◄─────────x│      │
     └──────┘    TCP    └──────┘
    
    • I believe this is simply the case where the ConnectionReverse is closed (i.e. called when the server agent shuts down), and therefore is verified by the should call 'killSelf' - if the Agent is stopped test (see above comment)
  4. With an active connection, close only the TCP layer. We can invoke this by grabbing the socket itself (may require a monkeypatch), calling end() and destroy() on the socket, and giving a timeout of 1 second (see the sketch after this list). We expect the client to go IDLE as well. The following:
     ┌──────┐    gRPC   ┌──────┐
     │      │◄─────────►│      │
     │Client│           │Server│
     │      │◄─────────x│      │
     └──────┘    TCP    └──────┘
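
A sketch of scenario 4 above; getProxyTcpSocket is a hypothetical hook (in practice a monkeypatch on whatever object owns the socket) for reaching the raw TCP socket between the proxy and the GRPC endpoint:

import type { Socket } from 'net';

declare function getProxyTcpSocket(): Socket;

async function closeTcpLayerOnly(): Promise<void> {
  const socket = getProxyTcpSocket();
  socket.end(); // send FIN for a half-close
  socket.destroy(); // then tear the socket down entirely
  // Give the client a second to observe the closure; we then expect its channel
  // state to have transitioned to IDLE.
  await new Promise((resolve) => setTimeout(resolve, 1000));
}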
    

@joshuakarp
Contributor Author

@tegefaulkes mentioned that he previously experimented with the TTL at the GRPC level, but it seemed to take more than 6 minutes for the client's state to transition to IDLE on its own. He believes there is a way to set the timeout manually, but would need some further investigation.

@joshuakarp
Contributor Author

joshuakarp commented Feb 8, 2022

Got a final first draft of the connection state diagram:

image

@startuml
' Inactive connection
state "Inactive connection" as NC1 {
  NC1 : running: false
  NC1 : destroyed: false
  NC1 : status: starting
  state "Client" as Client1 {
    Client1 : IDLE
  }
}

NC1 --> NC2 : attempt connection

' Connecting to target
state "Connecting to target" as NC2 {
  NC2 : running: false
  NC2 : destroyed: false
  NC2 : status: starting
  state "Client" as Client2 {
    Client2 : CONNECTING
  }
}
 
NC2 --> NC3 : successfully connected to target

' Active connection
state "Active connection" as NC3 {
  NC3 : running: true
  NC3 : destroyed: false
  NC3 : status: starting
  state "Client" as Client3 {
    Client3 : READY
  }
}

NC3 --> NC8 : gRPC TTL expires
NC3 --> NC9 : server closes connection
NC3 --> NC5 : error during gRPC call

' Idle connection
state "Idle connection" as NC8 {
  NC8 : running: true
  NC8 : destroyed: false
  NC8 : status: starting
  state "Client" as Client8 {
    Client8 : IDLE
  }
}

NC8 --> NC3 : gRPC call invoked

' Dropped connection
state "Dropped connection" as NC9 {
  NC9 : running: true
  NC9 : destroyed: false
  NC9 : status: destroying
  state "Client" as Client9 {
    Client9 : IDLE
  }
}

NC2 --> NC5 : timed out | failed to validate certificate
NC9 --> NC10 : destroy client

' Failed certificate validation
state "Connection error" as NC5 {
  NC5 : running: true
  NC5 : destroyed: false
  NC5 : status: starting
  state "Client" as Client5 {
    Client5 : TRANSIENT_FAILURE
  }
}

NC3 --> NC6 : destroy connection | connection TTL expired
NC5 --> NC7 : destroy client

' Destroy NodeConnection
state "Destroying connection" as NC6 {
  NC6 : running: true
  NC6 : destroyed: false
  NC6 : status: destroying
  state "Client" as Client6 {
    Client6 : READY
  }
}

NC6 --> NC7 : destroy client

' Destroy client
state "Destroyed connection" as NC7 {
  NC7 : running: false
  NC7 : destroyed: true
  NC7 : status: destroying
  state "Client" as Client7 {
    Client7 : SHUTDOWN
  }
}

' Destroy client
state "Destroyed connection" as NC10 {
  NC10 : running: false
  NC10 : destroyed: true
  NC10 : status: destroying
  state "Client" as Client10 {
    Client10 : IDLE
  }
}
@enduml

Some notes:

  • we have 2 "destroyed connection" states, because we have 2 different states the client can be in when a connection is deemed to be destroyed (either IDLE or SHUTDOWN - see Propagate networking connection error handling into NodeConnection error handling #224 (comment)). Can potentially figure out a neater way of representing this in the diagram, but we can consider this as the first version of the connection state diagram
  • the transitions in the diagram are predominantly validated, but there are still some to validate:
    • Active connection --> Connection error (the documentation suggests that this transition should exist, but haven't seen anywhere in our code where it's reproduced)
    • the cycle between Active connection and Idle connection

@joshuakarp
Contributor Author

@tegefaulkes made the comment that for the bottom left Destroyed connection, the channel watcher shuts down before we show the actual SHUTDOWN of the client. So in actual fact, it's likely that we do go from IDLE to SHUTDOWN after the connection is closed from the server side. As a result, we can possibly synthesise the two Destroyed connection states in this way.

Another comment was whether the status field in each of the states was valid at all. NodeConnection is just a container class for the underlying client, so it doesn't make a huge amount of sense to consider whether the NodeConnection is "starting" or "destroying".

@joshuakarp
Contributor Author

So in actual fact, if we assume that we have already seen the CONNECTING and READY client states, then at the moment, any later transition to IDLE will kill off the connection. This is how it's currently implemented with the channel watcher.

At the moment anyway, our internal TTL in NodeConnectionManager is longer than the GRPC TTL (@tegefaulkes played with this and said it's something like 6 minutes), so it's not possible to reach the GRPC TTL and move from READY to IDLE. But, if we were to see that transition from READY to IDLE, it would kill the connection.

@joshuakarp
Contributor Author

Based on feedback, a somewhat simpler version:

image

@startuml
' Inactive connection
state "Inactive connection" as NC1 {
  NC1 : running: false
  NC1 : destroyed: false
  NC1 : status: starting
  state "Client" as Client1 {
    Client1 : IDLE
  }
}

NC1 --> NC2 : attempt connection

' Connecting to target
state "Connecting to target" as NC2 {
  NC2 : running: false
  NC2 : destroyed: false
  NC2 : status: starting
  state "Client" as Client2 {
    Client2 : CONNECTING
  }
}
 
NC2 --> NC3 : successfully connected to target

' Active connection
state "Active connection" as NC3 {
  NC3 : running: true
  NC3 : destroyed: false
  NC3 : status: starting
  state "Client" as Client3 {
    Client3 : READY
  }
}

NC3 --> NC3 : gRPC call performed
'NC3 --> NC8 : gRPC TTL expires
NC3 --> NC9 : server closes connection
NC3 --> NC5 : error during gRPC call

' Idle connection
'state "Idle connection" as NC8 {
'  NC8 : running: true
'  NC8 : destroyed: false
'  NC8 : status: starting
'  state "Client" as Client8 {
'    Client8 : IDLE
'  }
'}

'NC8 --> NC3 : gRPC call invoked

' Dropped connection
state "Dropped connection" as NC9 {
  NC9 : running: true
  NC9 : destroyed: false
  NC9 : status: destroying
  state "Client" as Client9 {
    Client9 : IDLE
  }
}

NC2 --> NC5 : timed out | failed to validate certificate
NC9 --> NC7 : channel watcher destroys\n client and shuts down

' Failed certificate validation
state "Connection error" as NC5 {
  NC5 : running: true
  NC5 : destroyed: false
  NC5 : status: starting
  state "Client" as Client5 {
    Client5 : TRANSIENT_FAILURE
  }
}

NC3 --> NC6 : destroy connection | connection TTL expired
NC5 --> NC7 : destroy client

' Destroy NodeConnection
state "Destroying connection" as NC6 {
  NC6 : running: true
  NC6 : destroyed: false
  NC6 : status: destroying
  state "Client" as Client6 {
    Client6 : READY
  }
}

NC6 --> NC7 : destroy client

' Destroy client
state "Destroyed connection" as NC7 {
  NC7 : running: false
  NC7 : destroyed: true
  NC7 : status: destroying
  state "Client" as Client7 {
    Client7 : SHUTDOWN
  }
}

' Destroy client
'state "Destroyed connection" as NC10 {
'  NC10 : running: false
'  NC10 : destroyed: true
'  NC10 : status: destroying
'  state "Client" as Client10 {
'    Client10 : IDLE
'  }
'}
@enduml

@CMCDragonkai
Member

So NodeConnectionManager.test.ts can be separated into smaller test modules. But for NodeConnection.test.ts, the naming of the test functions should be related to the above transition and state names, so it's easy for one to follow from the model to the actual tests to know if a transition is being tested.

@CMCDragonkai
Member

@tegefaulkes please try splitting the test modules atm, especially the nested describe. Do some basic renaming, and further renaming can be done later after the merge.

@CMCDragonkai
Member

#332 has been created as a continuation of verifying the READY to IDLE transition.

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 24, 2022