Unexpectedly high memory usage seemingly due to a "stuck" client #2870
Comments
Just commenting again here to keep our production stake pool relay, which still demonstrates the problem, running as-is for another 12 hours in case any information needs to be dug from it. I'll be watching for email on this issue, and @nfrisby you & other devs can also contact me on Telegram as @karknu did. Here is the software revision on that node:
#2880 is an example of how to fix the problem. Care should be taken to go through all other miniprotocols and make sure that any state they allocate is cleaned up even in the case of an exception.
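The cleanup discipline described above can be sketched with `Control.Exception.bracket`. This is a minimal standalone sketch, not the actual ouroboros-network code: `withConnState`, `allocState`, and `freeState` are hypothetical stand-ins for whatever state a miniprotocol handler allocates per connection.

```haskell
import Control.Exception (ErrorCall (..), bracket, throwIO, try)
import Data.IORef

-- Hypothetical per-connection state: a live-connection counter stands in
-- for whatever a miniprotocol handler allocates per client.
allocState :: IORef Int -> IO ()
allocState live = modifyIORef' live (+ 1)

freeState :: IORef Int -> IO ()
freeState live = modifyIORef' live (subtract 1)

-- bracket guarantees freeState runs even when the handler throws,
-- e.g. because the peer disconnects mid-protocol.
withConnState :: IORef Int -> IO a -> IO a
withConnState live handler =
  bracket (allocState live) (const (freeState live)) (const handler)

main :: IO ()
main = do
  live <- newIORef (0 :: Int)
  -- Simulate a client that dies abruptly inside the handler.
  _ <- try (withConnState live (throwIO (ErrorCall "peer vanished")))
         :: IO (Either ErrorCall ())
  n <- readIORef live
  print n  -- 0: the state was released despite the exception
```

Without the `bracket`, the exception would skip `freeState` and each abrupt disconnect would strand one unit of state, which is exactly the failure mode this issue describes.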
Three days after blocking incoming TCP connections from any nodes generating the
Fixed in #2880. |
We intend for this change to make it more difficult to repeat the mistake underlying Issue #2870.
This Issue arose from the debugging/triage efforts of Issue IntersectMBO/cardano-node#2235. A server has a space-leak that seems to be related to problematic clients. Both clients have followed the chain up to just before the first Allegra block. And both repeatedly disconnect and reconnect, without having made any progress. We have theories about why the clients are doing that (one is too old to understand Allegra; maybe the other is running OOM in the epoch boundary computation or similar), but it appears that they are inducing unacceptable resource usage in the server.
This Issue is to attempt to reproduce that interaction in a minimal controlled setup, and debug it.
From this perspective, the relevant facts are as follows.
We don't yet know what release the server we have the most anecdata from is running.
- The V_2 client is sending the same `FindIntersect` once per second. The V_5 client is sending it about once per 100/3 seconds and also fetching some blocks for 10 seconds each time before it disconnects. (It may be that only one of them is causing the memory leak.)
- Both peer-to-peer connections are killed once per `FindIntersect` (V_2 by the server, V_5 by the client). And the `FindIntersect` is always the same: the expected points for a client whose current chain ends at the block before Allegra started.
- There is also a space leak of “PINNED” memory, which is not related to the number of clients the server has. It is possible that that is the only space leak in play.
- And the GitHub Issue has logs of network traffic that show bursts of high outbound traffic (i.e. the repeated bulk sync?).
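Taking the quoted reconnect rates at face value (an assumption; the figures above are approximate), a back-of-envelope count shows how quickly these cycles accumulate, and hence why any per-connection state left behind on disconnect grows without bound:

```haskell
-- Back-of-envelope on the reconnect rates quoted above (assumed exact):
-- V_2 reconnects once per second, V_5 once per 100/3 seconds.  If each
-- cycle leaks even a little per-connection state, the leak grows
-- linearly with uptime.
main :: IO ()
main = do
  let hours    = 24 :: Double
      v2Cycles = hours * 3600            -- one cycle per second
      v5Cycles = hours * 3600 / (100/3)  -- one cycle per ~33.3 s
  putStrLn ("reconnect cycles per day: "
            ++ show (round (v2Cycles + v5Cycles) :: Int))
```

At roughly 89,000 cycles per day from just these two clients, even a few kilobytes of unreleased state per cycle adds up to hundreds of megabytes per day.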