feat: p2p stale connections #4189

msgmaxim · 2023-11-01T04:58:48Z

Pull Request

Checklist

Please conduct a thorough self-review before opening the PR.

I am confident that the code works.
I have updated documentation where appropriate.

Summary

As discussed in PRO-133, we decided that rather than maintaining a separate list of nodes we need to connect to (a subset of all registered nodes), it is simpler and more robust to simply detect which connections are "stale" (the nodes that we didn't have a need to comunicate for some time, 1h in this case) and drop the socket for them until the connection is actually needed again (if ever). This means we are not wasting resources (file descriptors, send buffers and whatever work is required to maintain a zmq connection).

When we first establish a connection, we set last_activity to "now". Every time we need to send a message to the node we, update last_activity again. Every minute we check if for any non-stale node last_activity is 1h in the past, and if so, we drop the socket and mark it as "stale". If the connection is ever needed again (i.e. we need to send a message), the stale connection is upgraded to active and the message is sent afterwards. Added a test to check that sending to a stale socket works.

Note that I removed set_immediate(true) option since it would prevent the first few messages to a stale socket from reaching destination. The option is not needed after this PR anyway since it is no longer a concern that we create a send buffer now that we have a way to deal with stale connections (and their buffers).

linear · 2023-11-01T04:58:50Z

PRO-862 P2P: detect stale sockets

Currently we create a ZMQ connection for every registered node on the SC (see PRO-133 also). On perseverance at least, it seems that it is not uncommon that registered nodes either unavailable to begin with or become unavailable later, which seemed to have caused PRO-858 . We somewhat mitigated the problem by limiting message buffers, but it is still currently possible to "leak" memory if these buffers are never consumed by offline peers. It is probably also wasteful to have a ZMQ socket at all for these stale peers, since they must be constantly trying to reconnect internally (although they do use an exponential backoff strategy).

I think it is worthwhile at least detecting if we have such stale "connections" and reporting them through prometheus, for example. Shouldn't be too difficult to do this using the current mechanism for monitoring ZMQ sockets. Once we have detection, it wouldn't be difficult to reset the sockets or even put then in "Inactive" state so they don't consume any system resources (Of course PRO-133 would help with stale connections too.)

PRO-133 CFE should peer only to those it needs to.

You should only peer with other nodes that are active/outgoing in the epochs you are also active/outgoing.

We also want to peer with anyone with a top bid in an ongoing auction. So we are ready to do the keygen.

TODO: Plan

codecov · 2023-11-01T05:08:25Z

Codecov Report

Merging #4189 (427b5db) into main (2bfaea3) will increase coverage by 0%.
Report is 3 commits behind head on main.
The diff coverage is 74%.

@@          Coverage Diff           @@
##            main   #4189    +/-   ##
======================================
  Coverage     72%     72%            
======================================
  Files        379     380     +1     
  Lines      61667   61860   +193     
  Branches   61667   61860   +193     
======================================
+ Hits       44242   44389   +147     
- Misses     15134   15161    +27     
- Partials    2291    2310    +19

Files	Coverage Δ
engine/src/p2p/core/socket.rs	`93% <ø> (-3%)`	⬇️
engine/src/p2p/core/tests.rs	`100% <100%> (ø)`
engine/src/p2p/core.rs	`76% <65%> (-2%)`	⬇️

... and 19 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

engine/src/p2p/core.rs

feat: p2p stale connections

9eb0e64

msgmaxim requested review from kylezs and j4m1ef0rd November 1, 2023 04:58

chore: fix clippy

335eb13

kylezs reviewed Nov 1, 2023

View reviewed changes

engine/src/p2p/core.rs Show resolved Hide resolved

engine/src/p2p/core.rs Show resolved Hide resolved

j4m1ef0rd approved these changes Nov 2, 2023

View reviewed changes

engine/src/p2p/core.rs Show resolved Hide resolved

engine/src/p2p/core.rs Outdated Show resolved Hide resolved

engine/src/p2p/core.rs Show resolved Hide resolved

msgmaxim added 3 commits November 2, 2023 12:17

chore: address review

0fde7e6

feat: increase send buffer to 100

5b97dcf

chore: remove stale (and broken) test

427b5db

kylezs approved these changes Nov 2, 2023

View reviewed changes

msgmaxim merged commit e10d14a into main Nov 3, 2023

msgmaxim deleted the feat/p2p-track-activity branch November 3, 2023 01:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: p2p stale connections #4189

feat: p2p stale connections #4189

msgmaxim commented Nov 1, 2023

linear bot commented Nov 1, 2023

codecov bot commented Nov 1, 2023 •

edited

Loading

feat: p2p stale connections #4189

feat: p2p stale connections #4189

Conversation

msgmaxim commented Nov 1, 2023

Pull Request

Checklist

Summary

linear bot commented Nov 1, 2023

codecov bot commented Nov 1, 2023 • edited Loading

Codecov Report

codecov bot commented Nov 1, 2023 •

edited

Loading