feat: p2p stale connections #4189
PRO-862 P2P: detect stale sockets
Currently we create a ZMQ connection for every registered node on the SC (see also PRO-133). On Perseverance at least, it seems fairly common for registered nodes to be either unavailable to begin with or to become unavailable later, which seems to have caused PRO-858. We somewhat mitigated the problem by limiting message buffers, but it is still possible to "leak" memory if these buffers are never consumed by offline peers. It is probably also wasteful to have a ZMQ socket at all for these stale peers, since they must be constantly trying to reconnect internally (although they do use an exponential backoff strategy). I think it is worthwhile at least detecting whether we have such stale "connections" and reporting them, for example through Prometheus. It shouldn't be too difficult to do this using the current mechanism for monitoring ZMQ sockets. Once we have detection, it wouldn't be difficult to reset the sockets or even put them in an "Inactive" state so they don't consume any system resources. (Of course PRO-133 would help with stale connections too.)
PRO-133 CFE should peer only to those it needs to
You should only peer with other nodes that are active/outgoing in the epochs in which you are also active/outgoing. We also want to peer with anyone with a top bid in an ongoing auction, so that we are ready to do the keygen. TODO: Plan
Codecov Report
@@ Coverage Diff @@
## main #4189 +/- ##
======================================
Coverage 72% 72%
======================================
Files 379 380 +1
Lines 61667 61860 +193
Branches 61667 61860 +193
======================================
+ Hits 44242 44389 +147
- Misses 15134 15161 +27
- Partials 2291 2310 +19
... and 19 files with indirect coverage changes
Pull Request
Closes: PRO-862, PRO-133
Checklist
Please conduct a thorough self-review before opening the PR.
Summary
As discussed in PRO-133, we decided that rather than maintaining a separate list of nodes we need to connect to (a subset of all registered nodes), it is simpler and more robust to detect which connections are "stale" (nodes we have not needed to communicate with for some time, 1h in this case) and drop the socket for them until the connection is actually needed again (if ever). This means we are not wasting resources (file descriptors, send buffers, and whatever work is required to maintain a ZMQ connection).
When we first establish a connection, we set `last_activity` to "now". Every time we need to send a message to the node, we update `last_activity` again. Every minute we check whether `last_activity` of any non-stale node is at least 1h in the past, and if so, we drop the socket and mark the connection as "stale". If the connection is ever needed again (i.e. we need to send a message), the stale connection is upgraded to active and the message is sent afterwards. Added a test to check that sending to a stale socket works.
Note that I removed the `set_immediate(true)` option since it would prevent the first few messages to a stale socket from reaching their destination. The option is not needed after this PR anyway: creating a send buffer is no longer a concern now that we have a way to deal with stale connections (and their buffers).
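The bookkeeping described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: `Socket`, `ConnectionState`, `Peers` and their methods are hypothetical names, the socket is a placeholder rather than a real ZMQ socket, and the current time is passed in explicitly so the logic can be exercised without waiting.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical placeholder for a ZMQ socket (not a type from the PR).
#[allow(dead_code)]
#[derive(Debug)]
struct Socket {
    peer: String,
}

/// A connection either holds a socket or has been dropped as stale.
#[allow(dead_code)]
#[derive(Debug)]
enum ConnectionState {
    Active { socket: Socket, last_activity: Instant },
    Stale,
}

struct Peers {
    /// How long a connection may sit unused before it is dropped (1h in the PR).
    max_idle: Duration,
    connections: HashMap<String, ConnectionState>,
}

impl Peers {
    fn new(max_idle: Duration) -> Self {
        Peers { max_idle, connections: HashMap::new() }
    }

    /// Establishing a connection sets `last_activity` to "now".
    fn connect(&mut self, peer: &str, now: Instant) {
        let socket = Socket { peer: peer.to_string() };
        self.connections.insert(
            peer.to_string(),
            ConnectionState::Active { socket, last_activity: now },
        );
    }

    /// Sending refreshes `last_activity`; a stale connection is first
    /// upgraded back to active. Returns false for an unknown peer.
    fn send(&mut self, peer: &str, _message: &[u8], now: Instant) -> bool {
        let state = match self.connections.get_mut(peer) {
            Some(state) => state,
            None => return false,
        };
        match state {
            ConnectionState::Active { last_activity, .. } => *last_activity = now,
            ConnectionState::Stale => {
                let socket = Socket { peer: peer.to_string() };
                *state = ConnectionState::Active { socket, last_activity: now };
            }
        }
        // A real implementation would write `_message` to the socket here.
        true
    }

    /// Periodic check: drop sockets idle for at least `max_idle` and
    /// return how many connections were marked stale.
    fn drop_stale(&mut self, now: Instant) -> usize {
        let max_idle = self.max_idle;
        let mut dropped = 0;
        for state in self.connections.values_mut() {
            let idle_too_long = match &*state {
                ConnectionState::Active { last_activity, .. } => {
                    now.duration_since(*last_activity) >= max_idle
                }
                ConnectionState::Stale => false,
            };
            if idle_too_long {
                // Replacing the state drops the placeholder socket.
                *state = ConnectionState::Stale;
                dropped += 1;
            }
        }
        dropped
    }

    fn is_stale(&self, peer: &str) -> bool {
        matches!(self.connections.get(peer), Some(ConnectionState::Stale))
    }
}
```

In this sketch a caller would invoke `drop_stale` from a one-minute timer and `send` whenever a message is due; keeping the stale entry in the map (rather than removing it) is what lets a later `send` reconnect on demand.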