Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Callbacks preventing garbage collection of objects #223

Closed
achingbrain opened this issue Mar 13, 2024 · 13 comments
Closed

Callbacks preventing garbage collection of objects #223

achingbrain opened this issue Mar 13, 2024 · 13 comments

Comments

@achingbrain
Copy link
Contributor

This is a follow-on to #215, I think the problem still exists, even after node-datachannel@0.5.4.

These references can keep threads alive:

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:63
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:67
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:71
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:75
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:79
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:85
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:95
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62   

The referenced lines in RTCPeerConnection are all callbacks registered with the C++ object, for example line 63:

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/.../node_modules/node-datachannel/polyfill/RTCPeerConnection.js:63

this.#peerConnection.onStateChange(() => {

If I change the .close method of RTCPeerConnection the problem goes away:

close() {
  // close all channels before shutting down
  this.#dataChannels.forEach((channel) => {
    channel.close();
    });

    this.#peerConnection.close();
    this.#peerConnection.destroy();  // <-- add this line
}

I think this is because doDestroy calls doResetCallbacks which releases references to the JS callbacks.

To replicate, clone js-libp2p, apply this diff:

diff --git a/packages/transport-webrtc/package.json b/packages/transport-webrtc/package.json
index b38cbc5aa..01f03204e 100644
--- a/packages/transport-webrtc/package.json
+++ b/packages/transport-webrtc/package.json
@@ -39,8 +39,8 @@
   "scripts": {
     "generate": "protons src/private-to-private/pb/message.proto src/pb/message.proto",
     "build": "aegir build",
-    "test": "aegir test -t node -t browser -- --exit",
-    "test:node": "aegir test -t node --cov -- --exit",
+    "test": "aegir test -t node -t browser",
+    "test:node": "aegir test -t node --cov",
     "test:chrome": "aegir test -t browser --cov",
     "test:firefox": "aegir test -t browser -- --browser firefox",
     "lint": "aegir lint",
@@ -65,7 +65,7 @@
     "it-stream-types": "^2.0.1",
     "multiformats": "^13.1.0",
     "multihashes": "^4.0.3",
-    "node-datachannel": "^0.5.3",
+    "node-datachannel": "^0.5.4",
     "p-defer": "^4.0.0",
     "p-event": "^6.0.0",
     "p-timeout": "^6.1.2",
@@ -73,7 +73,8 @@
     "race-signal": "^1.0.2",
     "react-native-webrtc": "^118.0.1",
     "uint8arraylist": "^2.4.8",
-    "uint8arrays": "^5.0.2"
+    "uint8arrays": "^5.0.2",
+    "why-is-node-running": "^2.2.2"
   },
   "devDependencies": {
     "@chainsafe/libp2p-yamux": "^6.0.2",
diff --git a/packages/transport-webrtc/test/basics.spec.ts b/packages/transport-webrtc/test/basics.spec.ts
index c54488a7a..c649a9312 100644
--- a/packages/transport-webrtc/test/basics.spec.ts
+++ b/packages/transport-webrtc/test/basics.spec.ts
@@ -1,5 +1,8 @@
 /* eslint-disable @typescript-eslint/no-unused-expressions */
 
+// @ts-expect-error no types
+import log from 'why-is-node-running'
+
 import { noise } from '@chainsafe/libp2p-noise'
 import { yamux } from '@chainsafe/libp2p-yamux'
 import { circuitRelayTransport } from '@libp2p/circuit-relay-v2'
@@ -19,6 +22,10 @@ import pRetry from 'p-retry'
 import { webRTC } from '../src/index.js'
 import type { Libp2p, Connection, Stream, StreamHandler } from '@libp2p/interface'
 
+setTimeout(() => {
+  log()
+}, 5000).unref()
+
 async function createNode (): Promise<Libp2p> {
   return createLibp2p({
     addresses: {

Then install & build and run the node tests:

% cd js-libp2p
% npm i && npm run build
% cd packages/transport-webrtc
% npm run test:node -- --grep 'should connect'

You should see one test run, a brief pause then a list of all the handles keeping the process running:

% npm run test:node -- --grep 'should connect'

> @libp2p/webrtc@4.0.21 test:node
> aegir test -t node --cov --grep should connect

build

> @libp2p/webrtc@4.0.21 build
> aegir build

[10:37:34] tsc [started]
[10:37:35] tsc [completed]
[10:37:35] esbuild [started]
[10:37:35] esbuild [completed]
test node.js
Warning: Cannot find any files matching pattern "test/node.{js,cjs,mjs}"
Warning: Cannot find any files matching pattern "test/**/*.spec.{js,cjs,mjs}"
Warning: Cannot find any files matching pattern "dist/test/node.{js,cjs,mjs}"


  webrtc basic
    ✔ should connect (84ms)


  1 passing (92ms)

There are 43 handle(s) keeping the process running

# FILEHANDLE
node:internal/async_hooks:202

... lots of output here

Ignore the FILEHANDLE and KEYPAIRGENREQUEST entries - it's the ThreadSafeCallback callback entries causing the problem.

@achingbrain
Copy link
Contributor Author

achingbrain commented Mar 13, 2024

It's worth noting that calling the exported cleanup function after the test run makes the process exit successfully.

From what I can see, cleanup comes from RTCWrapper which closes all currently open RTCPeerConnections, then resets all of their callbacks which lets the process exit.

Give that you can't re-open a RTCPeerConnection that's been closed, it seems reasonable to not expect any events to fire after it's been closed, so resetting all the callbacks on close (instead of just destroy) would be ok?

@murat-dogan
Copy link
Owner

Destroying or resetting callbacks will do the same.

But the point is that; if you call destroy or reset callbacks then you will not receive closed state change, which is not good.
That is why I suggested calling cleanup when you don't need the lib.

@achingbrain
Copy link
Contributor Author

Calling cleanup isn't a solution for a server-style application that may service hundreds of thousands or millions of connections during it's lifetime.

If I run this script with the --trace-gc node flag I can see that garbage collection occurs, but on my laptop it runs out of memory after a minute or so:

import { RTCPeerConnection } from 'node-datachannel/polyfill'

while (true) {
  let conn = new RTCPeerConnection()
  conn.close()
  conn = null
}

The WebRTC spec says that calling close on a RTCPeerConnection tears down any associated resources, so it should become garbage collectable after that.

If it doesn't then this is a memory leak.

@murat-dogan
Copy link
Owner

Hello,

If GC is running then it should call the destructor and reset all callbacks etc.
But the absent point is we are not assigning the underlying object to null.

this.#peerConnection = null;

Could you please check and try this?
#224

@achingbrain
Copy link
Contributor Author

achingbrain commented Mar 15, 2024

With #224 applied the above script still runs out of memory after a minute or so.

Interestingly I was trying to make it be kinder to the garbage collector by waiting for the connection to close before creating a new one with:

while (true) {
  const conn = new RTCPeerConnection()
  conn.close()

  await new Promise(resolve => {
    conn.addEventListener('connectionstatechange', () => {
      if (conn.connectionState === 'closed') {
        resolve()
      }
    })
  })
}

with this it runs for a few iterations then the process exits with code 13 and no output:

% node index.js
% echo $?
13

If I only create a single RTCPeerConnection #224 does let the process exit, though it takes a long time:

import { RTCPeerConnection } from 'node-datachannel/polyfill'

const conn = new RTCPeerConnection()
conn.addEventListener('connectionstatechange', () => {
  if (conn.connectionState === 'closed') {
    console.timeEnd('close')
  }
})

console.time('exit')
console.time('close')
conn.close()

process.addListener('exit', () => {
  console.timeEnd('exit')
  process.exit(0)
})

8 seconds to close!

% node index.js
close: 0.449ms
exit: 8.100s

Applying this diff to #224 makes the closing almost instant:

diff --git a/polyfill/RTCPeerConnection.js b/polyfill/RTCPeerConnection.js
index db3ae16..171c10e 100644
--- a/polyfill/RTCPeerConnection.js
+++ b/polyfill/RTCPeerConnection.js
@@ -61,6 +61,11 @@ export default class _RTCPeerConnection extends EventTarget {
 
         // forward peerConnection events
         this.#peerConnection.onStateChange(() => {
+            if (this.connectionState === 'closed') {
+                this.#peerConnection.destroy();
+                this.#peerConnection = null;
+            }
+
             this.dispatchEvent(new Event('connectionstatechange'));
         });
 
@@ -227,7 +232,6 @@ export default class _RTCPeerConnection extends EventTarget {
         });
 
         this.#peerConnection?.close();
-        this.#peerConnection = null;
     }
 
     createAnswer() {

E.g. instead of nulling this.#peerConnection in the .close() method, we wait for the connection state to change to 'closed', then reset the callbacks and remove the reference to the C++ object.

% node index.js
close: 0.436ms
exit: 5.802ms

5ms to close, much faster.


As an alternative to #224, this diff applied against master also solves the problem:

diff --git a/polyfill/RTCPeerConnection.js b/polyfill/RTCPeerConnection.js
index be9913b..08accf1 100644
--- a/polyfill/RTCPeerConnection.js
+++ b/polyfill/RTCPeerConnection.js
@@ -61,6 +61,10 @@ export default class _RTCPeerConnection extends EventTarget {
 
         // forward peerConnection events
         this.#peerConnection.onStateChange(() => {
+            if (this.connectionState === 'closed') {
+                this.#peerConnection.destroy();
+            }
+
             this.dispatchEvent(new Event('connectionstatechange'));
         });

Setting this.#peerConnection to null doesn't appear to be necessary, since the only reference to the C++ object is the internal field of the RTCPeerConnection instance, once that instance has no references it can be collected along with the underlying C++ object.

Really I think the root cause is the use of Napi::Persistent in ThreadSafeCallback - it's purpose is to prevent the wrapped object from being garbage collected so we have to tell it when we are done.

@murat-dogan
Copy link
Owner

@achingbrain Thanks for the detailed investigation.

To summarize things;

  • In all cases we need to find a way to call the underlying peer-connection object's destroy mechanism (destructor or destroy function)
  • If we set the underlying peer-connection object to null, GC runs the destructor ~8s later. If you don't create a connection, like in the example, in a loop this is a solution.
  • If we call the underlying peer-connection object destroy function directly on the close function, then we will not receive the closed event.
  • We can call the underlying peer-connection object's destroy function when we receive the closed event in the state-change callback. If we don't receive a closed event, then this will be a problem.

According to this info, we need to select an option.

@murat-dogan
Copy link
Owner

I tried with Chrome. If you call the close function of a peer-connection, the peer does not receive any closed event. The other peer receives a disconnected event.

So it seems we can call the destroy function directly within the close function.
What do you say?

<!DOCTYPE html>
<html lang="en-US">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>RTCPeerConnection Example</title>
    <script>
        let peer1 = new RTCPeerConnection();
        let peer2 = new RTCPeerConnection();

        peer1.onicecandidate = ({candidate}) => candidate && peer2.addIceCandidate(candidate);
        peer2.onicecandidate = ({candidate}) => candidate && peer1.addIceCandidate(candidate);

        peer1.onconnectionstatechange = () => console.log('Peer1 State:', peer1.connectionState);
        peer2.onconnectionstatechange = () => console.log('Peer2 State:', peer2.connectionState);

        let channel = peer1.createDataChannel("mychannel");
        channel.onopen = () => console.log('Data channel is open');
        channel.onmessage = ({data}) => console.log('Received Message:', data);

        peer2.ondatachannel = ({channel}) => {
            channel.onmessage = ({data}) => console.log('Received Message:', data);
            channel.onopen = () => {
                console.log('Data channel is open');
                channel.send('Hello from Peer2!');
            };
        };

        peer1.createOffer().then(offer => {
            return peer1.setLocalDescription(offer);
        }).then(() => {
            return peer2.setRemoteDescription(peer1.localDescription);
        }).then(() => {
            return peer2.createAnswer();
        }).then(answer => {
            return peer2.setLocalDescription(answer);
        }).then(() => {
            return peer1.setRemoteDescription(peer2.localDescription);
        }).catch(console.error);

        setTimeout(() => {
            peer1.close();
        }, 2000);
    </script>
</head>

<body>
</body>

</html>

@paullouisageneau
Copy link
Contributor

Indeed, with the Web API, RTCPeerConnection.close() is synchronous and does not fire any event (see specification). I think it makes sense to align the behavior and call the destroy function directly in the close function.

@murat-dogan
Copy link
Owner

I have just published v0.5.5
Could you please try it?

@achingbrain
Copy link
Contributor Author

Sorry for the delay in replying, I've been AFK for a little while.

If I run:

import { RTCPeerConnection } from 'node-datachannel/polyfill'

const conn = new RTCPeerConnection()
conn.addEventListener('connectionstatechange', () => {
  if (conn.connectionState === 'closed') {
    console.timeEnd('close')
  }
})

console.time('exit')
console.time('close')
conn.close()

process.addListener('exit', () => {
  console.timeEnd('exit')
  process.exit(0)
})

The process now exits almost immediately, so the initially reported issue is fixed, thanks for pushing that release. 🎉 🚀

However this script still causes the process to run out of memory after a minute or so:

import { RTCPeerConnection } from 'node-datachannel/polyfill'

while (true) {
  const conn = new RTCPeerConnection()
  conn.close()
}

...and this script causes the process to terminate unexpectedly with a non-zero error code (it should run forever due to no connectionstatechange event being emitted):

import { RTCPeerConnection } from 'node-datachannel/polyfill'

while (true) {
  const conn = new RTCPeerConnection()
  conn.close()

  await new Promise(resolve => {
    conn.addEventListener('connectionstatechange', () => {
      if (conn.connectionState === 'closed') {
        resolve()
      }
    })
  })
}

@murat-dogan
Copy link
Owner

// test.js
import { RTCPeerConnection } from 'node-datachannel/polyfill'

let i=0;
while (i++<100*1000) {
  const conn = new RTCPeerConnection()
  conn.close()
}

Run Result

murat@murat-ThinkBook:~/js/node-datachannel$ free -h
               total        used        free      shared  buff/cache   available
Mem:            13Gi       3.7Gi       5.4Gi        90Mi       4.5Gi       9.5Gi
Swap:           14Gi       322Mi        14Gi
murat@murat-ThinkBook:~/js/node-datachannel$ node test.js 
murat@murat-ThinkBook:~/js/node-datachannel$ free -h
               total        used        free      shared  buff/cache   available
Mem:            13Gi       3.6Gi       5.4Gi        92Mi       4.5Gi       9.5Gi
Swap:           14Gi       322Mi        14Gi

Same result as @achingbrain for while(true). But not sure if this is really a problem.
For me, it is related to GC

@murat-dogan
Copy link
Owner

I have just released v0.9.0
https://github.com/murat-dogan/node-datachannel/releases/tag/v0.9.0

Should we close also this issue?

@murat-dogan
Copy link
Owner

I am closing the issue. Please feel free to open if you need to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants