Apply `timedCancellable` across the board to control how long side-effects are allowed to complete #450

tegefaulkes · 2022-09-21T05:08:07Z

Specification

With the application of timedCancellable to functions within the nodes domain. We have encountered new connection errors randomly during tests. These error started after applying a deadline of 1500 ms to pingNode usage. We suspect the problem is coming from openConnectionForward after setting such a short deadline. These may be bugs that existed previously that was never revealed by the current testing. Currently there is 1 NodeConnectionManager test where we override the deadline for openConnectionForward.

The occur randomly and are as follows. the can be re-created by running all of the nodes domain tests concurrently.

Error: Socket is closed
    at UTP.Object.<anonymous>.UTP.send (/home/faulkes/matrixCode/polykey/js-polykey/node_modules/utp-native/index.js:121:50)
    at /home/faulkes/matrixCode/polykey/js-polykey/src/utils/utils.ts:178:9
    at new Promise (<anonymous>)
    at UTP.<anonymous> (/home/faulkes/matrixCode/polykey/js-polykey/src/utils/utils.ts:163:12)
    at constructor_.send (/home/faulkes/matrixCode/polykey/js-polykey/src/network/Connection.ts:79:11)
    at Timeout.<anonymous> (/home/faulkes/matrixCode/polykey/js-polykey/src/network/ConnectionReverse.ts:147:20)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)

Error: Unhandled error. (Error: UTP_ECONNRESET
    at createUTPError (/home/faulkes/matrixCode/polykey/js-polykey/node_modules/utp-native/lib/connection.js:238:15)
    at Connection.Object.<anonymous>.Connection._onerror (/home/faulkes/matrixCode/polykey/js-polykey/node_modules/utp-native/lib/connection.js:175:16) {
  code: 'UTP_ECONNRESET',
  errno: 1
})
    at new NodeError (node:internal/errors:372:5)
    at Connection.emit (node:events:516:17)
    at Connection.Object.<anonymous>.Connection._onclose (/home/faulkes/matrixCode/polykey/js-polykey/node_modules/utp-native/lib/connection.js:169:25)

Error: address already in use
    at UTP.Object.<anonymous>.UTP.bind (/home/faulkes/matrixCode/polykey/js-polykey/node_modules/utp-native/index.js:198:13)
    at /home/faulkes/matrixCode/polykey/js-polykey/src/utils/utils.ts:178:9
    at new Promise (<anonymous>)
    at UTP.<anonymous> (/home/faulkes/matrixCode/polykey/js-polykey/src/utils/utils.ts:163:12)
    at Object.<anonymous> (/home/faulkes/matrixCode/polykey/js-polykey/tests/nodes/NodeManager.test.ts:83:11)

Additional context

Related discussion starting at Integrate TaskManager into NodeGraph and Discovery #445 (comment)
Related Integrate TaskManager into NodeGraph and Discovery #445
Related Apply timedCancellable to Sigchain such as the deadline in nodes claim process #243
Related Add extra testing for nodes claims process #242

Tasks

Explore and find the root cause of the errors
Apply a fix for the errors

The text was updated successfully, but these errors were encountered:

tegefaulkes · 2022-09-28T07:41:04Z

This is going to be a pretty large change. Converting everything to TimedCancellable is pretty simple. there is just a lot of things to change and cover.

Nodes domain handles a bunch of connections functionality for agent-agent communication. These have been converted to TimedCancellable already but needs to be reviewed again.
Cancellablility needs to be integrated with the GRPC calls. I need to research how to do this, it should be possible to abort a connection. Generalizing this for all GRPC calls and error handling needs to be prototyped.
Server side deadlines and abortion needs to be implemented as a complement to the client side aborting and deadlines.
Proxy connections need to be updated to use the new style of cancellability.

tegefaulkes · 2022-09-28T07:47:18Z

I don't think the default timeout for the timedCancellable decorator can be set at any time. It has to be defined at import time. This means we can't configure a default time when starting the NodeConnectionManager. We also don't want to have hard coded timeouts everywhere so we may have to define the defaults in the config file.

CMCDragonkai · 2022-09-29T01:09:49Z

There are several places where timedCancellable and variants of this should be used. Basically anywhere where there is non-deterministic side-effects.

Pretty much all of them concern around network usage.

Deadlines applied to GRPC client
Deadlines applied to GRPC service handlers
Propagating deadline cancellation to long-running computations and network calls
Locking - this currently requires the Promise.race work arounds until Integrate js-contexts and js-async-cancellable js-async-locks#21

For long running computations, we're talking about generators and related iterators. In these cases, the cancellation of generators hasn't been entirely worked out yet, and instead it is expected that you would stop consuming the generator (if you don't ask for the next item, it would continue computing). The only case where cancellation applies is when you do ask for the next item, but you must await for it, in which case you'd like to have the ability to do p.cancel(). However atm, we don't have this ability yet, so instead, you may need to pass an overall signal to the generator function and for it to signal cancellation the traditional way.

Remember similar to async f functions, I believe async f* forces them to be regular promises and not our extended promises. But I don't remember if I confirmed this fact.

As for the network, pretty much all network operations should be cancellable or timed cancellable if there's a timeout. This requires a big major refactor of the Proxy code to support this. There are multiple "default" timeouts currently configured during class construction. If we use the decorators, the defaults has to be set statically and cannot take in instance parameters.

Alternatively you can also not use the decorators, and just directly use PromiseCancellable while taking in the signal. That would allow you to take the instance/constructor parameters as the defaults for the Proxy.

So we have to decide whether to use static parameters and the decorators, or whether to swap to just using the PromiseCancellable and taking in the signal directly.

It's important to understand that the decorators are for convenience, but they are not infinitely flexible, it trades off some flexibility for some better convenience.

CMCDragonkai · 2022-09-29T01:11:31Z

I don't think the default timeout for the timedCancellable decorator can be set at any time. It has to be defined at import time. This means we can't configure a default time when starting the NodeConnectionManager. We also don't want to have hard coded timeouts everywhere so we may have to define the defaults in the config file.

Yes, because if decorators are static, then the parameters they take are static too. This is of course only necessary for "default" deadlines. Whatever is passed into the method/function will override it. If we do use static deadlines, then it does make sense to propagate these static parameters directly from the config object.

I discussed that you can also use signals and promise cancellable directly in the method to avoid static parameterisation...

CMCDragonkai · 2022-09-29T01:13:21Z

I suggest we do a priority here.

Start with the network domain, we should refactor this to support timers and cancellability (this may mean we can remove the existing timer utility)
Refactor the tests to use fast check as well, and to change the timeouts and see how it behaves.
Then work onto the GRPC client and server side deadlines

Also additionally there's the identities domain, which also has network side effects from contacting identity providers, these should also start using timed cancellable too.

CMCDragonkai · 2022-09-29T01:15:17Z

I'm thinking there's sort of like 3 levels of defaults here:

Static defaults on the decorators themselves inherited from src/config.ts
Instance defaults coming from the construction of the class
Method parameters that are passed from the caller

I'm not sure if 2 is possible if we use the decorators, unless we change them up to look up parameters on the instance if the method parameter isn't passed in (that is is set to undefined). But we would need to pass in a string to indicate the name of the parameter to look up.

CMCDragonkai · 2022-09-29T01:31:39Z

I think it is possible to do all 3.

This is the function head:

function timedCancellable(
  lazy: boolean = false,
  delay: number = Infinity,
  errorTimeoutConstructor: new () => Error = contextsErrors.ErrorContextsTimedTimeOut,
)

It's possible to do something like:

function timedCancellable(
  lazy: boolean | string | symbol = false,
  delay: number | string | symbol = Infinity,
  errorTimeoutConstructor: new () => Error = contextsErrors.ErrorContextsTimedTimeOut,
)

And then do timedCancellable('lazyProp', delaySym). And later use the this[lazy] and this[delay] to acquire the parameters.

// If it was `undefined`, it would be set to `false`
if (typeof lazy !== 'boolean') {
  lazy = this[lazy];
}

// If it was `undefined`, it would be set to `Infinity`
if (typeof delay !== 'number') {
  delay = this[delay];
}

However this pattern may not be scalable to other decorators which have actual string/symbol parameters...

So it may be better to actually have a separate set of parameters to refer to the instance.

function timedCancellable(
  lazy?: boolean,
  delay?: number,
  errorTimeoutConstructor: new () => Error = contextsErrors.ErrorContextsTimedTimeOut,
  defaults?: {
    lazy: 'lazyProp';
    delay: 'delayProp';
  }
)

But I find this syntax way too verbose...

We could also use a more different type like arrow functions.

function timedCancellable(
  lazy: boolean | (this: any) => boolean = false,
  delay: number | (this: any) => number = Infinity,
  errorTimeoutConstructor: new () => Error = contextsErrors.ErrorContextsTimedTimeOut,
)

It'd be unlikely for a decorator parameter to be an arrow function, but if they needed to differentiate, they could also check the number of parameters... etc.

Notice the this is literally any here.

Using it would be like:

class X {
  lazyProp = true;

  @timedCancellable((this) => this.lazyProp)
  method() {

  }
}

CMCDragonkai · 2022-09-29T01:33:33Z

The last style is likely the more flexible.

CMCDragonkai · 2022-09-29T07:23:40Z

I've added related issues #243 and #242. They should be addressed together here.

If there are going to be multiple PRs here, this should be made an epic and with subissues and subPRs.

CMCDragonkai · 2022-09-29T07:48:06Z

Full work on this might take 1 sprint and a bit.

CMCDragonkai · 2022-09-29T07:50:47Z

The todos relating to cancellability should also be done in this issue. This issue is an epic so there are multiple subissues. @tegefaulkes you can add more issues into this epic as you see fit.

tegefaulkes · 2022-09-29T08:45:43Z

I think we can split up the work in order of:

Updating decorators to support dynamic timeouts
Updating Proxy
nodes domain connection handling
GRPC client calls timeouts and service handler timeouts. This part will be wide reaching due to the number of handlers.
Locking

tegefaulkes · 2022-09-29T09:36:08Z

It might be better to address the locking along side the other issues when I encounter any locks.

tegefaulkes · 2022-09-30T02:43:28Z

@CMCDragonkai pointed out that we want to reduce the punch packet interval. We can set it to something like 100 or 50 ms.

We can also send a whole bunch of packets at once when trying to connect. In my experience however if you commonucate too much with a closed port some firewalls will start to ignore you. With a good connection you only need to send 2 packets each way within a second to establish a connection through a NAT. For connections with packet drop you;ll need to send more to make up for the drop but you don't need a crazy amount.

tegefaulkes · 2022-10-05T01:20:21Z

#382 overlaps this epic but doesn't fully fit within the scope. I suggested that #382 be an epic with sub tasks, both epics can share the tasks for network and nodes domain.

tegefaulkes · 2022-10-05T01:24:16Z

I thought I attached more issues to this epic earlier, I need to fix that.

tegefaulkes · 2022-10-05T02:58:58Z

Need a new sub issue for Identity manager connections. This is also needed for discovery.

tegefaulkes · 2022-10-07T07:00:44Z

As discussed before, the creation of a NodeConnection with createNodeConnection doesn't convey the timeout to the proxy connection. The proxy connection is started automatically when it receives a connection from the NodeConnection. This means we can end up with an un-composed proxy connection if the NodeConnection creation times out.

We can fix this by having the NodeConnection start the ForwardConnection directly with the timeout. Then it can connect to the proxy and compose like normal.

This is not a critical change, currently if a uncomposed forward connection exists it will be cleaned up after a TTL period of no activity.

CMCDragonkai · 2022-10-07T07:05:38Z

Let's keep it the way it is... the proxy's internal connection logic is somewhat separated by the Node Connection and GRPC connection. This is due to how we are using HTTP connect protocol to connect these 2 disparate parts.

When we get around to working on QUIC or WebRTC #234 and replacing GRPC with jsonrpc in #249, then we won't need any workarounds, everything should just be tightly integrated.

So if it can still work with the TTLs, let's just continue doing this.

It does make sense that it's possible to create proxy connections without there being a corresponding node connection. That's the main issue here. And I think we can continue with this design.

tegefaulkes · 2022-10-07T08:35:16Z

Created issues #477 and #476 for discovery and identities.

CMCDragonkai · 2022-10-31T12:55:19Z

I'd want to introduce the ability to set { timer: 1000 } or some number parameter for convenience purposes.

CMCDragonkai · 2023-02-17T07:19:14Z

So this means when we move to JSON RPC, we can add another parameter to our params object, which is the timeout.

If you don't set timeout, then the handler may default to something else, or my default to infinite time.

We can have these 3 meanings:

timeout: undefined; // default
timeout: null;  // infinite
timeout: 1000; // 1000ms

It might be good idea to have timeout: null; available. Maybe only a select few calls can accept this. Because it is ripe for abuse.

tegefaulkes · 2023-02-20T02:12:47Z

I think given that we should only allow for the default timeout or a specified timeout that is less than the default. Any values larger than the default can be clipped to the default value. Or if we want to avoid the implicit behaviour we can throw an error for a bad value.

A slight problem with this is that we currently don't enforce any timeouts in the RPC system. And the only way to convey timeouts is through the metadata which the RPC system doesn't care about or even perceive as it's part of the middleware or user implementation.

unless we add timeout parameter to the message structure then it's up to the middleware to handle the timeout logic. But currently the middle ware can't affect any timeouts or handler cancellation. It's meant to short circuit and reject the current stream immediately.

If we want to allow the middle ware to affect cancellation then we need to pass it an abort controller and have it handle it's own abortion logic. That way we can create timeout or cancellation logic inside of the middleware. while we're at it we may as well provide the middleware with other contextual information such as the ConnectionInfo.

CMCDragonkai · 2023-02-20T04:00:11Z

I think it depends on a few things.

We don't want callers to DOS the agent.
Callers also should be setting a timeout because they don't want non-responsive agents to hold themselves up.
Therefore timeouts make sense for both sides.

The timeout is then a proxy for the amount of work that should be accepted by the server, and the expected deadline by the agent to complete.

The timeout should be negotiated but then clients can ask for a shorter timeout or for a longer timeout but only up the maximum timeout.

How do timeouts work for streams? Is it the timeout for the entire stream or only for a single message of the stream? How does grpc implement deadlines for their streams? I would prefer timeouts per message because it measures liveness and can be used for infinite streaming if that's what we are trying to implement.

Therefore each rpc method should have a max timeout on the server side.

The caller if not specifying a timeout can default to max. But can specify a lower timeout.

CMCDragonkai · 2023-02-20T04:03:00Z

I don't think the middleware should be dealing with timeouts.

The timeouts should be part of the message params just like other metadata.

The timeout info must be passed into the subsequent methods. There are some methods that will cost time but won't take a timeout. These are synchronous methods...

For all other async methods a timeout can be used to cancel any operation.

So technically this means the timeout isn't 100 percent accurate since it won't cancel sync operations on the server side.

On the client side though, the timeout would just cancel the call and ignore any data. The only way to deal with this is an exception on the server side which could cancel sync operations.

CMCDragonkai · 2023-04-24T06:43:02Z

The deadline mechanism works in the JSON RPC system now. It's just a matter of doing agent migration. But also ensuring that the subissues here are completed.

CMCDragonkai · 2023-07-10T04:22:40Z

Applying it to sigchain depends on agent migration.

Applying it to identity management only requires bridging it to the underlying IO which can be just the fetch function provided by cross-fetch.

tegefaulkes · 2024-10-18T01:47:47Z

This issue is very old and predates the network stack overhaul. I'm going to close it as not planned.

CMCDragonkai · 2024-10-18T01:49:41Z

This is still required. It's independent of networking. It's basically any kind of operation that is meant to be timed and cancellable. It's a general pattern. Of course new issues should be more specific.

tegefaulkes added the development Standard development label Sep 21, 2022

tegefaulkes self-assigned this Sep 21, 2022

tegefaulkes mentioned this issue Sep 21, 2022

Integrate TaskManager into NodeGraph and Discovery #445

Merged

15 tasks

CMCDragonkai changed the title ~~Fix errors due to short pingNode timeout~~ Apply timedCancellable across the board to control how long side-effects are allowed to complete Sep 26, 2022

This was referenced Sep 26, 2022

ci: merge staging to master #411

Merged

Integrating TaskManager into Discovery #451

Merged

CMCDragonkai added the epic Big issue with multiple subissues label Sep 29, 2022

CMCDragonkai mentioned this issue Sep 29, 2022

Use fast-check to test concurrency consistency invariants #382

Closed

tegefaulkes mentioned this issue Sep 30, 2022

Feature: network cancellability and deadlines #468

Merged

8 tasks

This was referenced Oct 7, 2022

Update IdentityManager to used timedCancellable #476

Open

Update Discovery domain to use timedCancellable #477

Closed

CMCDragonkai mentioned this issue Feb 17, 2023

Migrate from GRPC to JSON RPC (with binary variant for higher performance) #495

Closed

tegefaulkes mentioned this issue Mar 3, 2023

Add timeout support to the Agnostic RPC system #510

Closed

CMCDragonkai added the r&d:polykey:supporting activity Supporting core activity label Jul 10, 2023

tegefaulkes mentioned this issue Dec 13, 2023

CLI Beta Launch MatrixAI/Polykey-CLI#40

Closed

tegefaulkes removed their assignment Dec 13, 2023

CMCDragonkai removed the epic Big issue with multiple subissues label Aug 12, 2024

tegefaulkes closed this as not planned Won't fix, can't repro, duplicate, stale Oct 18, 2024

tegefaulkes mentioned this issue Oct 18, 2024

Review and convert long running methods to TimedCancellable #833

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apply `timedCancellable` across the board to control how long side-effects are allowed to complete #450

Apply `timedCancellable` across the board to control how long side-effects are allowed to complete #450

tegefaulkes commented Sep 21, 2022 •

edited by CMCDragonkai

Loading

tegefaulkes commented Sep 28, 2022

tegefaulkes commented Sep 28, 2022

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022 •

edited

Loading

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022 •

edited

Loading

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022

tegefaulkes commented Sep 29, 2022

tegefaulkes commented Sep 29, 2022

tegefaulkes commented Sep 30, 2022

tegefaulkes commented Oct 5, 2022

tegefaulkes commented Oct 5, 2022

tegefaulkes commented Oct 5, 2022 •

edited

Loading

tegefaulkes commented Oct 7, 2022

CMCDragonkai commented Oct 7, 2022

tegefaulkes commented Oct 7, 2022

CMCDragonkai commented Oct 31, 2022

CMCDragonkai commented Feb 17, 2023 •

edited

Loading

tegefaulkes commented Feb 20, 2023

CMCDragonkai commented Feb 20, 2023

CMCDragonkai commented Feb 20, 2023

CMCDragonkai commented Apr 24, 2023

CMCDragonkai commented Jul 10, 2023

tegefaulkes commented Oct 18, 2024

CMCDragonkai commented Oct 18, 2024

Apply timedCancellable across the board to control how long side-effects are allowed to complete #450

Apply timedCancellable across the board to control how long side-effects are allowed to complete #450

Comments

tegefaulkes commented Sep 21, 2022 • edited by CMCDragonkai Loading

Specification

Additional context

Tasks

tegefaulkes commented Sep 28, 2022

tegefaulkes commented Sep 28, 2022

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022 • edited Loading

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022 • edited Loading

CMCDragonkai commented Sep 29, 2022

CMCDragonkai commented Sep 29, 2022

tegefaulkes commented Sep 29, 2022

tegefaulkes commented Sep 29, 2022

tegefaulkes commented Sep 30, 2022

tegefaulkes commented Oct 5, 2022

tegefaulkes commented Oct 5, 2022

tegefaulkes commented Oct 5, 2022 • edited Loading

tegefaulkes commented Oct 7, 2022

CMCDragonkai commented Oct 7, 2022

tegefaulkes commented Oct 7, 2022

CMCDragonkai commented Oct 31, 2022

CMCDragonkai commented Feb 17, 2023 • edited Loading

tegefaulkes commented Feb 20, 2023

CMCDragonkai commented Feb 20, 2023

CMCDragonkai commented Feb 20, 2023

CMCDragonkai commented Apr 24, 2023

CMCDragonkai commented Jul 10, 2023

tegefaulkes commented Oct 18, 2024

CMCDragonkai commented Oct 18, 2024

Apply `timedCancellable` across the board to control how long side-effects are allowed to complete #450

Apply `timedCancellable` across the board to control how long side-effects are allowed to complete #450

tegefaulkes commented Sep 21, 2022 •

edited by CMCDragonkai

Loading

CMCDragonkai commented Sep 29, 2022 •

edited

Loading

CMCDragonkai commented Sep 29, 2022 •

edited

Loading

tegefaulkes commented Oct 5, 2022 •

edited

Loading

CMCDragonkai commented Feb 17, 2023 •

edited

Loading