Per topic stats of Whisper nodes #1642

Closed · oskarth opened this issue Oct 14, 2019 · 23 comments

oskarth commented Oct 14, 2019

Problem

I want to see envelope statistics for Whisper topics so I can reason about Whisper scalability and address issues such as status-im/status-mobile#9081.

Implementation

(Run as a full bloom filter node to pick up all topics).

Desired statistics. For a node I want to see aggregate stats, i.e. over time, for:

  1. How many envelopes are there in each topic?
  2. For ingress traffic, what does the breakdown look like in terms of:
  • hash duplicates from a single peer
  • hash duplicates from multiple peers (i.e. likely a function of the number of peers connected)
  • invalid envelopes, valid, expired/futured, pow too low (both old and new calc, ideally), no bloom filter match
  • how big the envelopes are
  3. For egress traffic, how many envelopes are we sending to each peer:
  • with what reason (e.g. bloom filter X => send envelopes matching these topics)

Acceptance Criteria

A text dump or some URL where I can go on a node and see the numbers above over some time period, e.g. minute/hour/day.

Notes

Here's an example of what some of this might look like, just from my own hacky testing:

Hash duplicates (with 24 Whisper nodes connected; ideally this would have the topic in it too):

oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | awk '{print $8}' | sort | uniq -c 
      1 
      8 hash=0002B6BA794DD457D55D67AD68BB8C49C98791AFEF466534FC97E28062F763FB
      8 hash=00637F3882A2EFEC89CF40249DC59FDC9A049B78D42EDB6E116B0D76BE0AA523
      4 hash=0102C0C42044FDCAC2D64CAF7EF35AA759BEA21703EE8BA7AEFFD176E2280089
     24 hash=2D567B7E97FA2510A1299EA84F140B19DA0B2012BE431845B433BA238A22282C
     22 hash=40D7D3BCC784CC9D663446A9DFB06D55533F80799261FDD30E30CC70853572CE
     21 hash=707DA8C56605C57C930CE910F2700E379C350C845B5DAE20A9F8A6DBA4F59B2B
     24 hash=AC8C3ABE198ABE3BF286E80E25B0CFF08B3573F42F61499FB258F519C1CF9F18
     24 hash=C4C3D64886ED31A387B7AE57C904D702AEE78036E9446B30E964A149134B0D56
     24 hash=D4A1D17641BD08E58589B120E7F8F399D23DA1AF1BA5BD3FED295CD852BC17DA

Envelopes ingress stats:

`WRN 2019-10-14 13:05:55+09:00 Message Ingress Stats                      tid=9527 added=6 allowed=58 disallowed=281 disallowedBloom=0 disallowedPow=281 disallowedSize=0 duplicate=27 invalid=23 noHandshake=0`

Topic stats:

oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat bar.log | awk '{print $8}' | sort | uniq -c
      1 
      8 topicStr=12000000
     24 topicStr=5C6C9B56
     20 topicStr=9C22FF5F
    852 topicStr=F8946AAC

^ @jakubgs @adambabik

@cammellos

Some of these should be readily available from mailservers, for example:

> How many envelopes are there in each topic?

oskarth commented Oct 14, 2019

@cammellos what's the best way of accessing those? Has anything materially changed since https://discuss.status.im/t/performance-of-mailservers/1215?

 1043120 | \x5c6c9b56 (What is this? There's roughly one envelope published each second)
  351359 | \xf8946aac (Discovery topic)
    7892 | \x98ac139b
    4441 | \x13f4f374

@cammellos

I haven't checked recently, but it should be the same as before. @jakubgs should be able to point you in the right direction (you need to SSH into the machine and connect to the DB running in Docker).

jakubgs commented Oct 14, 2019

If you want SSH access to the cluster hosts you'll need to make a PR for infra-role-bootstrap with your SSH key, like this one: https://github.com/status-im/infra-role-bootstrap/pull/12

If you just want temporary access to one of the hosts to fetch some data just send me your public key on Discord.

oskarth commented Oct 15, 2019

How difficult would it be to get these into a dashboard?

Ideally we'd not just see a unique count of envelopes, but have it reflect the actual envelopes sent (then split by duplicates and non-duplicates).

jakubgs commented Nov 4, 2019

Note to self from Oskar:

> ideally whatever is needed to falsify/increase likelihood of assumptions here: https://htmlpreview.github.io/?https://github.com/vacp2p/research/blob/master/whisper_scalability/report.html

jakubgs commented Nov 6, 2019

Some metrics already exist:
https://github.com/status-im/whisper/blob/master/whisperv6/metrics.go
But they need some work:

  • Port them to the new Prometheus library we are using
  • Vet which metrics we need, and rename some so they make sense
  • Add additional metrics to match the requirements here

One thing that probably can't be done via Prometheus is the per-topic metrics themselves: topic hashes would be too high-cardinality to put in metric labels. A better solution would be some kind of separate HTTP endpoint with those metrics per topic hash, but I don't want to mess up the whisper master branch, so it would probably require a separate branch with such an endpoint exposed.
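To make that concrete, here is a minimal sketch of what such a side endpoint could look like. Everything here (type names, the in-memory counter map, the port) is hypothetical, not the actual whisper code:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
)

// topicStats keeps hypothetical per-topic envelope counts in memory.
// Topic hashes are too high-cardinality for Prometheus labels, so we
// expose them via a plain HTTP endpoint instead.
type topicStats struct {
	mu     sync.Mutex
	counts map[string]uint64 // topic hash (hex) -> envelope count
}

func (s *topicStats) inc(topic string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[topic]++
}

// ServeHTTP dumps the current per-topic counts as JSON.
func (s *topicStats) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	s.mu.Lock()
	defer s.mu.Unlock()
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(s.counts)
}

func main() {
	stats := &topicStats{counts: make(map[string]uint64)}
	stats.inc("F8946AAC") // example topic hash from the logs above
	http.Handle("/topics", stats)
	log.Fatal(http.ListenAndServe(":8080", nil)) // hypothetical port
}
```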

jakubgs commented Nov 6, 2019

It might be possible to get this data without modifying the whisper repo, because there's an event feed:
https://github.com/status-im/whisper/blob/39d4d0a14f/whisperv6/whisper.go#L110
I can subscribe to the feed with a channel:
https://github.com/status-im/whisper/blob/39d4d0a14f/whisperv6/whisper.go#L178-L182
But the EnvelopeEvent lacks the topic, so that would have to be added:
https://github.com/status-im/whisper/blob/39d4d0a14f/whisperv6/events.go#L38-L45
And the event I'm interested in is EventEnvelopeReceived:
https://github.com/status-im/whisper/blob/39d4d0a14f/whisperv6/events.go#L19
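Putting those pieces together, the subscription could look roughly like the sketch below. It assumes the status-im/whisper whisperv6 package and the SubscribeEnvelopeEvents API linked above; since the event has no topic field yet, it can only aggregate per peer:

```go
package main

import (
	"github.com/status-im/whisper/whisperv6"
)

// watchEnvelopes counts received envelopes per peer via the event feed.
// Sketch only: assumes a running *whisperv6.Whisper instance.
func watchEnvelopes(w *whisperv6.Whisper) {
	events := make(chan whisperv6.EnvelopeEvent, 100)
	sub := w.SubscribeEnvelopeEvents(events)
	defer sub.Unsubscribe()

	perPeer := make(map[string]uint64)
	for {
		select {
		case ev := <-events:
			if ev.Event == whisperv6.EventEnvelopeReceived {
				// Topic is not available on the event yet (see above),
				// so we can only break counts down by peer for now.
				perPeer[ev.Peer.String()]++
			}
		case <-sub.Err():
			return
		}
	}
}
```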

jakubgs commented Nov 7, 2019

There's already a bot that could be modified to do this, but it doesn't use status-protocol-go:
https://github.com/status-im/statusd-bots/blob/master/cmd/pubchats
So I might as well write it from scratch.

jakubgs commented Nov 7, 2019

Regarding the metrics themselves: I discussed them with Adam and we came up with these comments:

  • hash duplicate from a single peer - Unlikely edge case, but maybe worth checking for.
  • hash duplicate from multiple peers - Possible and likely, worth monitoring.
  • invalid envelopes - Not really possible, unless a bad actor exists that sends them.
  • valid - Sure.
  • expired/futured - Possible.
  • pow too low - Possible.
  • no bloom filter match - Makes no sense with a full bloom filter node.
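For the bounded part of this breakdown, something like the following could work with the new library. This is a minimal sketch using prometheus/client_golang; the metric name and label values are hypothetical, not the ones from the actual PR:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Ingress envelopes broken down by validation outcome. The label has a
// small fixed set of values, so cardinality stays low — unlike per-topic
// labels, which is why topics need a separate endpoint.
var envelopesIngress = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "whisper_envelopes_ingress_total", // hypothetical name
		Help: "Ingress envelopes by validation outcome.",
	},
	[]string{"outcome"}, // valid | expired | low_pow | no_bloom_match | invalid | duplicate
)

func init() {
	prometheus.MustRegister(envelopesIngress)
}

// Hypothetical call site: envelopesIngress.WithLabelValues("low_pow").Inc()
```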

oskarth commented Nov 8, 2019

> hash duplicate from a single peer - Unlikely edge case, but maybe worth checking for.
> invalid envelopes - Not really possible, unless a bad actor exists that sends them.
> no bloom filter match - Makes no sense with a full bloom filter node

Which is why we want to check them: we want them to be zero, and if they aren't, that indicates an issue.

The bloom filter one is also not unlikely to change; plus, the instrumentation can be used for light nodes as well as for testing Waku mode. It's likely the biggest bandwidth blocker in terms of false positives, so we want stats there.

jakubgs commented Nov 8, 2019

> The bloom filter one is also not unlikely to change, plus instrumentation can be used for light nodes as well as testing Waku mode. It's likely the biggest blocker bandwidth wise in terms of false positives, so we want stats there.

But how can an envelope not match a full bloom filter? That makes no sense.

jakubgs commented Nov 17, 2019

Here is my understanding of the metrics: I'm currently in the process of rewriting the existing ones and adding new ones using the new Prometheus library we used in #1648. I should have a PR tomorrow.

jakubgs commented Nov 19, 2019

First version of PR for new Whisper metrics: status-im/whisper#38

@adambabik

> hash duplicate from a single peer - Unlikely edge case, but maybe worth checking for.

This should be allowed. Consider a scenario with two nodes A and B:

  1. Node A broadcasts a message M to peer B,
  2. Node A restarts for some reason,
  3. Node A connects again to some old peers and to a new peer C, which sends it the message M,
  4. Node A receives the message M,
  5. Node A broadcasts the message M to peer B again, because it was restarted and does not remember having done that already.

So it's a completely valid case; we can't consider it immediately an error and disconnect. Alternatively, there could be some threshold which, depending on the TTL of the message and the number of duplicates, would finally disconnect such a peer.
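If we did want such a threshold, one purely illustrative way to express it (the names and the formula are made up, not from any implementation) is to scale the allowed duplicates with the message's TTL, since long-lived messages are more likely to be legitimately re-sent after a restart:

```go
package main

// dupTracker counts hash duplicates per peer and suggests disconnecting
// only once a TTL-scaled threshold is crossed. Illustrative sketch only.
type dupTracker struct {
	dups map[string]int // peer ID -> duplicate count
}

func (t *dupTracker) onDuplicate(peerID string, ttlSeconds int) (disconnect bool) {
	if t.dups == nil {
		t.dups = make(map[string]int)
	}
	t.dups[peerID]++
	// Allow more duplicates for longer-lived messages: a restart within
	// the TTL window makes an honest re-send more likely.
	threshold := 5 + ttlSeconds/10 // made-up formula
	return t.dups[peerID] > threshold
}
```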

oskarth commented Nov 20, 2019

I wouldn't go so far as to call it a completely valid case. We might choose to introduce some slack to make it easier for implementations if it is 100% necessary, but in principle it is bad behaviour. Node A should keep track of the messages it has already sent to B, and I suppose this should/could be persisted across restarts? Probably a spec issue. cc @kdeme

EDIT: In any case, for this specific issue, we should still track it!

cammellos commented Nov 20, 2019

> hash duplicate from a single peer - Unlikely edge case, but maybe worth checking for.

This is just at-least-once vs at-most-once; which semantics do we want to go for?

at-least-once seems best suited if there's some sort of acknowledgement between nodes, i.e. "I have received this RLP packet".

If it's at-least-once, then the occasional re-sending should not be considered bad behaviour, and it's indeed a valid case.
If instead we want to go for at-most-once, it should be considered bad behaviour, though packet loss might occur.

@adambabik

I think it's a huge requirement for a Whisper node. Currently, it can be totally stateless. If we require it to track what was sent to whom between restarts, it can't be stateless anymore.

Also, this requirement only makes sense with high PoW. High PoW + at-most-once gives you protection against flooding. Low PoW + at-most-once gives nothing (or very little), because creating messages is cheap, so it's almost the same as at-least-once.

oskarth commented Nov 20, 2019

While a useful discussion, I think this is off-topic for the OP. We still want to track it in terms of metrics.

For specs, I suggest we open an issue and continue the discussion there. Right now I believe it is not specified, and no implementation disconnects peers because of it (though it is suggested in the code: https://github.com/status-im/nim-eth/blob/master/eth/p2p/rlpx_protocols/whisper_protocol.nim#L179-L191).

kdeme commented Nov 20, 2019

The same point made by @adambabik applies to nim-eth: after a restart or a disconnect/reconnect, the list of already sent/received messages for that peer is gone.

In nim-eth, nodes don't disconnect peers on this "bad" behaviour, or in fact on any other type (for now). I'm not even sure disconnecting would practically create huge issues (disconnecting from the first duplicate, that is; a threshold of duplicates would be better, I guess): one peer fewer to be connected with, in this possibly(?) rare case? The network should be resilient to this, I'd hope. But yes, the best way to find out is to measure it.

@status-github-bot

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

oskarth commented Aug 6, 2021

This has largely been addressed, and more work on this is happening with nim-waku benchmarking.

cc @jm-clius fyi

oskarth closed this as completed Aug 6, 2021