Consider improving network survey schema #4169
Taking a step back, what are the goals of the survey? I can think of several possible goals:
I'm personally most interested in network stats and graph properties. From a decentralisation perspective, it would be extremely valuable to understand the extent of the network, the number of validators and watchers, their versions, and the impact of those groups on the communications and stability of the overlay. For example, I would like to be able to answer questions such as
Perhaps this is not a goal best served by the survey mechanism. Alternatives could include recursive crawlers or IP scanners. Would love to hear thoughts on this.
One of the major benefits of using a survey mechanism built into stellar-core is the ability to reach nodes behind NATs. Recursive crawlers / IP scanners will miss NATed nodes that do not accept inbound connections. A quick look at prior survey results shows many nodes on the network have no inbound peers. Of course we don't know exactly why that is, but if it's largely due to NATs then recursive crawlers / IP scanners may paint a misleading picture of the network. Moreover, if individuals are running nodes on residential ISPs then it's very likely they're behind some kind of NAT, especially with ISP-level CGNAT becoming more common. Whether or not missing these nodes is important speaks to your question of what goal the survey is trying to achieve. I think measuring network health and decentralization is an important goal, and it would require reaching as many nodes as possible.
I think these metrics become useful when they're defined over time slices. If we can see how much data every node ingested over the same window of time, we can start reasoning about the differences between nodes much more effectively.
Another idea is to break request types down by the underlying object the metric is measured over. So far, all of the metrics we're talking about concern either a node or a connection.

When we perform a survey request, the node responds with information about itself, as well as information about a subset of its connections. If that subset isn't the full set of connections, we query the node again to (hopefully) get the remaining peers. This causes the node to send the information about itself again! If we always want all of the survey information but we also want to minimize the data sent, we could send one request for node data and separate request(s) for connection data.

To clear up the difference, here's what I'm thinking of adding for each type of data, based on this thread and other conversations. Note that most of the existing metrics in the ticket description are per-connection. Additional per-connection data:
Additional per-node data:
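To make the split concrete, here is a rough C++ sketch of what separate node-level and connection-level request/response types could look like. Every name and field below is a hypothetical illustration, not the actual stellar-core XDR.

```cpp
// Hypothetical sketch of split survey types; not the actual stellar-core XDR.
#include <cstdint>
#include <string>
#include <vector>

// Per-node data: sent once per surveyed node, regardless of how many
// follow-up requests are needed to enumerate its connections.
struct NodeSurveyData
{
    std::string versionString;    // stellar-core version of the node
    uint32_t numInboundPeers;     // current inbound connection count
    uint32_t numOutboundPeers;    // current outbound connection count
    uint64_t bytesReadInSlice;    // bytes ingested during the survey window
    uint64_t bytesWrittenInSlice; // bytes sent during the survey window
};

// Per-connection data: reported for each peer of the surveyed node.
struct ConnectionSurveyData
{
    std::string peerId;           // public key of the remote peer
    uint64_t messagesRead;
    uint64_t messagesWritten;
    uint64_t bytesRead;
    uint64_t bytesWritten;
};

// Separate request types, so that asking for the rest of a long peer list
// does not force the node to resend its per-node data.
struct NodeSurveyRequest
{
    std::string surveyedNodeId;   // node whose self-reported data we want
};

struct ConnectionSurveyRequest
{
    std::string surveyedNodeId;   // node whose peer list we want
    uint32_t peerListOffset;      // where to resume when the list is paged
};

struct ConnectionSurveyResponse
{
    std::vector<ConnectionSurveyData> peers; // one batch of the peer list
    bool moreAvailable;           // true if another request is needed
};
```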
We should probably split out the work on that front:
Note that anything that depends on clocks being synchronized will require estimating the clock skew somehow, or the data will be noisy.
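For context, the usual way to estimate skew without a dedicated time service is an NTP-style round-trip measurement. A minimal sketch, assuming roughly symmetric network delay (this is not something stellar-core does today):

```cpp
// NTP-style clock offset estimate, shown for illustration only.
#include <chrono>

using Millis = std::chrono::milliseconds;

// t1: request sent (local clock), t2: request received (remote clock),
// t3: response sent (remote clock), t4: response received (local clock).
// Returns the remote clock's offset relative to the local clock, assuming
// the network delay is roughly the same in both directions.
Millis estimateClockOffset(Millis t1, Millis t2, Millis t3, Millis t4)
{
    return ((t2 - t1) + (t3 - t4)) / 2;
}
```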
Does stellar-core do any clock synchronization? It looks like stellar-core used to synchronize with NTP, but that functionality was removed. In poking around I didn't see whether we later added a different method for synchronization. If there is no synchronization whatsoever then clock skews could be quite large. Another idea is to support surveys over time slices by broadcasting a "start-survey-recording" message that nodes use to start accumulating metrics for the slice.
The NTP code was only there to warn the operator (and was buggy/not secure), so we removed it. It also didn't do anything about the system clock that we use basically everywhere else (for example, we use the local clock to quantize metrics), because we actually need a steady clock there. The new "start-survey-recording" message could be an interesting idea:
What would nodes that don't have the nonce do (new nodes, for example)?
Got it, thanks for the clarification!
Yep!
Yes, we'd need some TTL to prevent survey requests/data from potentially growing unbounded.
Actually, I was thinking
Good point. There should be some way to extend the TTL.
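A loose sketch of what a TTL with an extension hook might look like; this is an assumed design for illustration, not existing stellar-core behavior:

```cpp
// Illustrative TTL handling for survey recording state; assumed design only.
#include <chrono>

using Clock = std::chrono::steady_clock;

struct SurveyRecordingState
{
    Clock::time_point expiry;

    bool isExpired(Clock::time_point now) const
    {
        return now >= expiry;
    }

    // Extending the TTL lets a long-running survey keep its accumulators
    // alive while still bounding how long stale state can linger.
    void extendTtl(std::chrono::minutes extension, Clock::time_point now)
    {
        expiry = now + extension;
    }
};
```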
The main point of time slicing is that it makes the data easier to compare between nodes. Given that, I think nodes without the nonce (either because they're new or because they missed the start message) could either send a partial response or no response at all. I like the partial response solution better than the no-response solution because it helps differentiate between unresponsive nodes and (likely) new nodes.
Yeah, makes sense... actually, nodes can just respond with the existing survey response if they don't have the accumulator; that way we share the logic with the old clients that don't understand accumulators/time slicing.
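Putting the pieces together, here is a hypothetical sketch of that fallback: a node answers with time-sliced data when it has an accumulator for the survey nonce, and with the existing (non-sliced) data otherwise. Every type and name here is an assumption for illustration, not the real stellar-core API.

```cpp
#include <cstdint>
#include <unordered_map>

// Traffic totals, either for a time slice or for the node's lifetime.
struct TrafficMetrics
{
    uint64_t bytesRead = 0;
    uint64_t bytesWritten = 0;
};

struct SurveyResponse
{
    bool isTimeSliced = false;   // true when built from a slice accumulator
    TrafficMetrics metrics;
};

class SurveyDataManager
{
  public:
    // Called when a "start-survey-recording" broadcast with `nonce` is seen.
    void startRecording(uint32_t nonce)
    {
        mAccumulators[nonce] = TrafficMetrics{};
    }

    // Called as traffic flows while a recording is active.
    void recordTraffic(uint32_t nonce, uint64_t read, uint64_t written)
    {
        auto it = mAccumulators.find(nonce);
        if (it != mAccumulators.end())
        {
            it->second.bytesRead += read;
            it->second.bytesWritten += written;
        }
    }

    // Build a response for a survey request carrying `nonce`.
    SurveyResponse buildResponse(uint32_t nonce,
                                 TrafficMetrics const& lifetimeTotals) const
    {
        SurveyResponse resp;
        auto it = mAccumulators.find(nonce);
        if (it != mAccumulators.end())
        {
            // We saw the start message: report the time-sliced data.
            resp.isTimeSliced = true;
            resp.metrics = it->second;
        }
        else
        {
            // New node or missed the start message: fall back to the existing
            // survey response so old and new clients share the same logic.
            resp.metrics = lifetimeTotals;
        }
        return resp;
    }

  private:
    std::unordered_map<uint32_t, TrafficMetrics> mAccumulators;
};
```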
Currently, peers respond to the network survey with the following information:
In practice, some of these metrics aren't particularly useful (for example, it's a bit hard to reason about the absolute number of messages/bytes given that the rate fluctuates significantly over time). In addition, we might be missing some key information about nodes on the network:
We might want to introduce different request types (node health metrics are quite a bit different from connectivity stats, for example, plus we should keep response sizes sane).

Tagging as discussion to get the conversation started.