Consider improving network survey schema #4169
Taking a step back, what are the goals of the survey? I can think of several possible goals:
I'm personally most interested in network stats and graph properties. From a decentralisation perspective, it would be extremely valuable to understand the extent of the network, the number of validators and watchers, their versions, and the impact of those groups on the communications and stability of the overlay. For example, I would like to be able to answer questions such as
Perhaps this is not a goal best served by the survey mechanism. Alternatives could include recursive crawlers or IP scanners. Would love to hear thoughts on this.
One of the major benefits of using a survey mechanism built into stellar-core is the ability to reach nodes behind NATs. Recursive crawlers / IP scanners will miss NATed nodes that do not accept inbound connections. A quick look at prior survey results shows many nodes on the network have no inbound peers. Of course we don't know exactly why that is, but if it's largely due to NATs then recursive crawlers / IP scanners may paint a misleading picture of the network. Moreover, if individuals are running nodes on residential ISPs then it's very likely they're behind some kind of NAT, especially with ISP-level CGNAT becoming more common. Whether or not missing these nodes is important speaks to your question of what goal the survey is trying to achieve. I think measuring network health and decentralization is an important goal, and it would require reaching as many nodes as possible.
I think these metrics become useful when they're defined over time slices. If we can see how much data every node ingested over the same window of time, we can start reasoning about the differences between nodes much more effectively.
Another idea is to break request types down by the underlying object the metric is measured over. So far, all of the metrics we're talking about concern either a node or a connection.

When we perform a survey request, the node responds with information about itself, as well as information about a subset of its connections. If that subset isn't the full set of connections, we query the node again to (hopefully) get the remaining peers. This causes the node to send the information about itself again! If we always want all of the survey information but we also want to minimize the data sent, we could send one request for node data and separate request(s) for connection data.

To clear up the difference, here's what I'm thinking of adding for each type of data, based on this thread and other conversations. Note that most of the existing metrics in the ticket description are per-connection. Additional per-connection data:
Additional per-node data:
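To make the split concrete, here is a rough C++ sketch of what separate node-level and connection-level request/response types could look like. Every name and field below is a hypothetical illustration, not the actual stellar-core XDR.

```cpp
// Hypothetical sketch of split survey types; not the actual stellar-core XDR.
#include <cstdint>
#include <string>
#include <vector>

// Per-node data: sent once per surveyed node, regardless of how many
// follow-up requests are needed to enumerate its connections.
struct NodeSurveyData
{
    std::string versionString;    // stellar-core version of the node
    uint32_t numInboundPeers;     // current inbound connection count
    uint32_t numOutboundPeers;    // current outbound connection count
    uint64_t bytesReadInSlice;    // bytes ingested during the survey window
    uint64_t bytesWrittenInSlice; // bytes sent during the survey window
};

// Per-connection data: reported for each peer of the surveyed node.
struct ConnectionSurveyData
{
    std::string peerId;           // public key of the remote peer
    uint64_t messagesRead;
    uint64_t messagesWritten;
    uint64_t bytesRead;
    uint64_t bytesWritten;
};

// Separate request types, so that asking for the rest of a long peer list
// does not force the node to resend its per-node data.
struct NodeSurveyRequest
{
    std::string surveyedNodeId;   // node whose self-reported data we want
};

struct ConnectionSurveyRequest
{
    std::string surveyedNodeId;   // node whose peer list we want
    uint32_t peerListOffset;      // where to resume when the list is paged
};

struct ConnectionSurveyResponse
{
    std::vector<ConnectionSurveyData> peers; // one batch of the peer list
    bool moreAvailable;           // true if another request is needed
};
```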
We should probably split out the work on that front:
Note that anything that depends on clocks being synchronized will require estimating the clock skew somehow, or the data will be noisy.
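For context, the usual way to estimate skew without a dedicated time service is an NTP-style round-trip measurement. A minimal sketch, assuming roughly symmetric network delay (this is not something stellar-core does today):

```cpp
// NTP-style clock offset estimate, shown for illustration only.
#include <chrono>

using Millis = std::chrono::milliseconds;

// t1: request sent (local clock), t2: request received (remote clock),
// t3: response sent (remote clock), t4: response received (local clock).
// Returns the remote clock's offset relative to the local clock, assuming
// the network delay is roughly the same in both directions.
Millis estimateClockOffset(Millis t1, Millis t2, Millis t3, Millis t4)
{
    return ((t2 - t1) + (t3 - t4)) / 2;
}
```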
Does stellar-core do any clock synchronization? It looks like stellar-core used to synchronize with NTP, but that functionality was removed. In poking around I didn't see whether we later added a different method for synchronization. If there is no synchronization whatsoever then clock skews could be quite large. Another idea is to support surveys over time slices by broadcasting a "start-survey-recording" message that nodes use to start accumulating metrics for the slice.
The NTP code was only there to warn the operator (and was buggy/not secure), so we removed it. It also didn't do anything about the system clock that we use basically everywhere else (for example, we use the local clock to quantize metrics), because we actually need a steady clock there. The new "start-survey-recording" message could be an interesting idea:
What would nodes that don't have the nonce do (new nodes, for example)?
Got it, thanks for the clarification!
Yep!
Yes, we'd need some TTL to prevent survey requests/data from potentially growing unbounded.
Actually, I was thinking
Good point. There should be some way to extend the TTL.
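A loose sketch of what a TTL with an extension hook might look like; this is an assumed design for illustration, not existing stellar-core behavior:

```cpp
// Illustrative TTL handling for survey recording state; assumed design only.
#include <chrono>

using Clock = std::chrono::steady_clock;

struct SurveyRecordingState
{
    Clock::time_point expiry;

    bool isExpired(Clock::time_point now) const
    {
        return now >= expiry;
    }

    // Extending the TTL lets a long-running survey keep its accumulators
    // alive while still bounding how long stale state can linger.
    void extendTtl(std::chrono::minutes extension, Clock::time_point now)
    {
        expiry = now + extension;
    }
};
```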
The main point of time slicing is that it makes the data easier to compare between nodes. Given that, I think nodes without the nonce (either because they're new or because they missed the start message) could either send a partial response or no response at all. I like the partial response solution better than the no-response solution because it helps differentiate between unresponsive nodes and (likely) new nodes.
Yeah, makes sense... actually, nodes can just respond with the existing survey response if they don't have the accumulator; that way we share the logic with the old clients that don't understand accumulators/time slicing.
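Putting the pieces together, here is a hypothetical sketch of that fallback: a node answers with time-sliced data when it has an accumulator for the survey nonce, and with the existing (non-sliced) data otherwise. Every type and name here is an assumption for illustration, not the real stellar-core API.

```cpp
#include <cstdint>
#include <unordered_map>

// Traffic totals, either for a time slice or for the node's lifetime.
struct TrafficMetrics
{
    uint64_t bytesRead = 0;
    uint64_t bytesWritten = 0;
};

struct SurveyResponse
{
    bool isTimeSliced = false;   // true when built from a slice accumulator
    TrafficMetrics metrics;
};

class SurveyDataManager
{
  public:
    // Called when a "start-survey-recording" broadcast with `nonce` is seen.
    void startRecording(uint32_t nonce)
    {
        mAccumulators[nonce] = TrafficMetrics{};
    }

    // Called as traffic flows while a recording is active.
    void recordTraffic(uint32_t nonce, uint64_t read, uint64_t written)
    {
        auto it = mAccumulators.find(nonce);
        if (it != mAccumulators.end())
        {
            it->second.bytesRead += read;
            it->second.bytesWritten += written;
        }
    }

    // Build a response for a survey request carrying `nonce`.
    SurveyResponse buildResponse(uint32_t nonce,
                                 TrafficMetrics const& lifetimeTotals) const
    {
        SurveyResponse resp;
        auto it = mAccumulators.find(nonce);
        if (it != mAccumulators.end())
        {
            // We saw the start message: report the time-sliced data.
            resp.isTimeSliced = true;
            resp.metrics = it->second;
        }
        else
        {
            // New node or missed the start message: fall back to the existing
            // survey response so old and new clients share the same logic.
            resp.metrics = lifetimeTotals;
        }
        return resp;
    }

  private:
    std::unordered_map<uint32_t, TrafficMetrics> mAccumulators;
};
```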
Currently, peers respond to the network survey with the following information:
In practice, some of these metrics aren't particularly useful (for example, it's a bit hard to reason about the absolute number of messages/bytes given that the rate fluctuates significantly over time). In addition, we might be missing some key information about nodes on the network:
We might want to introduce different request types (node health metrics are quite a bit different from connectivity stats, for example, plus we should keep response sizes sane).

Tagging as discussion to get the conversation started.