Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: p2p stale connections #4189

Merged
merged 5 commits into from
Nov 3, 2023
Merged

feat: p2p stale connections #4189

merged 5 commits into from
Nov 3, 2023

Conversation

msgmaxim
Copy link
Contributor

@msgmaxim msgmaxim commented Nov 1, 2023

Pull Request

Closes: PRO-862, PRO-133

Checklist

Please conduct a thorough self-review before opening the PR.

  • I am confident that the code works.
  • I have updated documentation where appropriate.

Summary

As discussed in PRO-133, we decided that rather than maintaining a separate list of nodes we need to connect to (a subset of all registered nodes), it is simpler and more robust to simply detect which connections are "stale" (the nodes that we didn't have a need to comunicate for some time, 1h in this case) and drop the socket for them until the connection is actually needed again (if ever). This means we are not wasting resources (file descriptors, send buffers and whatever work is required to maintain a zmq connection).

When we first establish a connection, we set last_activity to "now". Every time we need to send a message to the node we, update last_activity again. Every minute we check if for any non-stale node last_activity is 1h in the past, and if so, we drop the socket and mark it as "stale". If the connection is ever needed again (i.e. we need to send a message), the stale connection is upgraded to active and the message is sent afterwards. Added a test to check that sending to a stale socket works.

Note that I removed set_immediate(true) option since it would prevent the first few messages to a stale socket from reaching destination. The option is not needed after this PR anyway since it is no longer a concern that we create a send buffer now that we have a way to deal with stale connections (and their buffers).

@msgmaxim msgmaxim requested review from kylezs and j4m1ef0rd November 1, 2023 04:58
Copy link

linear bot commented Nov 1, 2023

PRO-862 P2P: detect stale sockets

Currently we create a ZMQ connection for every registered node on the SC (see PRO-133 also). On perseverance at least, it seems that it is not uncommon that registered nodes either unavailable to begin with or become unavailable later, which seemed to have caused PRO-858 . We somewhat mitigated the problem by limiting message buffers, but it is still currently possible to "leak" memory if these buffers are never consumed by offline peers. It is probably also wasteful to have a ZMQ socket at all for these stale peers, since they must be constantly trying to reconnect internally (although they do use an exponential backoff strategy).

I think it is worthwhile at least detecting if we have such stale "connections" and reporting them through prometheus, for example. Shouldn't be too difficult to do this using the current mechanism for monitoring ZMQ sockets. Once we have detection, it wouldn't be difficult to reset the sockets or even put then in "Inactive" state so they don't consume any system resources (Of course PRO-133 would help with stale connections too.)

PRO-133 CFE should peer only to those it needs to.

You should only peer with other nodes that are active/outgoing in the epochs you are also active/outgoing.

We also want to peer with anyone with a top bid in an ongoing auction. So we are ready to do the keygen.

TODO: Plan

Copy link

codecov bot commented Nov 1, 2023

Codecov Report

Merging #4189 (427b5db) into main (2bfaea3) will increase coverage by 0%.
Report is 3 commits behind head on main.
The diff coverage is 74%.

@@          Coverage Diff           @@
##            main   #4189    +/-   ##
======================================
  Coverage     72%     72%            
======================================
  Files        379     380     +1     
  Lines      61667   61860   +193     
  Branches   61667   61860   +193     
======================================
+ Hits       44242   44389   +147     
- Misses     15134   15161    +27     
- Partials    2291    2310    +19     
Files Coverage Δ
engine/src/p2p/core/socket.rs 93% <ø> (-3%) ⬇️
engine/src/p2p/core/tests.rs 100% <100%> (ø)
engine/src/p2p/core.rs 76% <65%> (-2%) ⬇️

... and 19 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

engine/src/p2p/core.rs Show resolved Hide resolved
engine/src/p2p/core.rs Show resolved Hide resolved
engine/src/p2p/core.rs Show resolved Hide resolved
engine/src/p2p/core.rs Outdated Show resolved Hide resolved
engine/src/p2p/core.rs Show resolved Hide resolved
@msgmaxim msgmaxim merged commit e10d14a into main Nov 3, 2023
@msgmaxim msgmaxim deleted the feat/p2p-track-activity branch November 3, 2023 01:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants