Per topic stats of Whisper nodes #1642

Closed · oskarth opened this issue Oct 14, 2019 · 23 comments

oskarth commented Oct 14, 2019

Problem

I want to see envelope statistics for Whisper topics so I can reason about Whisper scalability and address issues such as status-im/status-mobile#9081.

Implementation

(Run as a full bloom filter node to pick up all topics).

Desired statistics. For a node I want to see aggregate stats, i.e. over time, for:

  1. How many envelopes are there in each topic?
  2. For ingress traffic, what does the breakdown look like in terms of:
  • hash duplicates from a single peer
  • hash duplicates from multiple peers (i.e. likely a function of the number of peers connected)
  • invalid envelopes, valid, expired/futured, pow too low (both old and new calc, ideally), no bloom filter match
  • how big the envelopes are
  3. For egress traffic, how many envelopes are we sending to each peer:
  • with what reason (e.g. bloom filter X => send envelopes matching these topics)

Acceptance Criteria

A text dump or some URL where I can go on a node and see the numbers above over some time period, e.g. minute/hour/day.

Notes

Here's an example of what some of this might look like, just from my own hacky testing:

Hash duplicates (with 24 Whisper nodes connected; ideally this would have the topic in it too):

oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | awk '{print $8}' | sort | uniq -c 
      1 
      8 hash=0002B6BA794DD457D55D67AD68BB8C49C98791AFEF466534FC97E28062F763FB
      8 hash=00637F3882A2EFEC89CF40249DC59FDC9A049B78D42EDB6E116B0D76BE0AA523
      4 hash=0102C0C42044FDCAC2D64CAF7EF35AA759BEA21703EE8BA7AEFFD176E2280089
     24 hash=2D567B7E97FA2510A1299EA84F140B19DA0B2012BE431845B433BA238A22282C
     22 hash=40D7D3BCC784CC9D663446A9DFB06D55533F80799261FDD30E30CC70853572CE
     21 hash=707DA8C56605C57C930CE910F2700E379C350C845B5DAE20A9F8A6DBA4F59B2B
     24 hash=AC8C3ABE198ABE3BF286E80E25B0CFF08B3573F42F61499FB258F519C1CF9F18
     24 hash=C4C3D64886ED31A387B7AE57C904D702AEE78036E9446B30E964A149134B0D56
     24 hash=D4A1D17641BD08E58589B120E7F8F399D23DA1AF1BA5BD3FED295CD852BC17DA

Envelopes ingress stats:

`WRN 2019-10-14 13:05:55+09:00 Message Ingress Stats                      tid=9527 added=6 allowed=58 disallowed=281 disallowedBloom=0 disallowedPow=281 disallowedSize=0 duplicate=27 invalid=23 noHandshake=0`

Topic stats:

oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat bar.log | awk '{print $8}' | sort | uniq -c
      1 
      8 topicStr=12000000
     24 topicStr=5C6C9B56
     20 topicStr=9C22FF5F
    852 topicStr=F8946AAC

^ @jakubgs @adambabik

@cammellos

Some of these should be readily available from mailservers, for example:

> How many envelopes are there in each topic?

oskarth commented Oct 14, 2019

@cammellos what's the best way of accessing those? Has anything materially changed since https://discuss.status.im/t/performance-of-mailservers/1215?

 1043120 | \x5c6c9b56 (What is this? There's roughly one envelope published each second)
  351359 | \xf8946aac (Discovery topic)
    7892 | \x98ac139b
    4441 | \x13f4f374

@cammellos

I haven't checked recently, but it should be the same as before. @jakubgs should be able to point you in the right direction (you need to SSH into the machine and connect to the DB running in Docker).

jakubgs commented Oct 14, 2019

If you want SSH access to the cluster hosts you'll need to make a PR for infra-role-bootstrap with your SSH key, like this one: https://github.com/status-im/infra-role-bootstrap/pull/12

If you just want temporary access to one of the hosts to fetch some data just send me your public key on Discord.

oskarth commented Oct 15, 2019

How difficult would it be to get these into a dashboard?

Ideally we'd not just see a unique count of envelopes, but have it reflect the actual envelopes sent (then split by duplicates and non-duplicates).

jakubgs commented Nov 4, 2019

Note to self from Oskar:

> ideally whatever is needed to falsify/increase likelihood of assumptions here: https://htmlpreview.github.io/?https://github.com/vacp2p/research/blob/master/whisper_scalability/report.html

jakubgs commented Nov 6, 2019

Some metrics already exist:
https://github.com/status-im/whisper/blob/master/whisperv6/metrics.go
But they need some work:

  • Port them to the new Prometheus library we are using
  • Vet which metrics we need, and rename some so they make sense
  • Add additional metrics to match the requirements here

One thing that probably can't be done via Prometheus is the per-topic metrics themselves: topic hashes would be too high-cardinality to put in metric labels. A better solution would be some kind of separate HTTP endpoint with those metrics per topic hash, but I don't want to mess up the whisper master branch, so it would probably require a separate branch with such an endpoint exposed.
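To make that concrete, here is a minimal sketch of what such a side endpoint could look like. Everything here (type names, the in-memory counter map, the port) is hypothetical, not the actual whisper code:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
)

// topicStats keeps hypothetical per-topic envelope counts in memory.
// Topic hashes are too high-cardinality for Prometheus labels, so we
// expose them via a plain HTTP endpoint instead.
type topicStats struct {
	mu     sync.Mutex
	counts map[string]uint64 // topic hash (hex) -> envelope count
}

func (s *topicStats) inc(topic string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[topic]++
}

// ServeHTTP dumps the current per-topic counts as JSON.
func (s *topicStats) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	s.mu.Lock()
	defer s.mu.Unlock()
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(s.counts)
}

func main() {
	stats := &topicStats{counts: make(map[string]uint64)}
	stats.inc("F8946AAC") // example topic hash from the logs above
	http.Handle("/topics", stats)
	log.Fatal(http.ListenAndServe(":8080", nil)) // hypothetical port
}
```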

jakubgs commented Nov 6, 2019

It might be possible to get this data without modifying the whisper repo, because there's an event feed:
https://github.com/status-im/whisper/blob/39d4d0a14f/whisperv6/whisper.go#L110
I can subscribe to the feed with a channel:
https://github.com/status-im/whisper/blob/39d4d0a14f/whisperv6/whisper.go#L178-L182
But the EnvelopeEvent lacks the topic, so that would have to be added:
https://github.com/status-im/whisper/blob/39d4d0a14f/whisperv6/events.go#L38-L45
And the event I'm interested in is EventEnvelopeReceived:
https://github.com/status-im/whisper/blob/39d4d0a14f/whisperv6/events.go#L19
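Putting those pieces together, the subscription could look roughly like the sketch below. It assumes the status-im/whisper whisperv6 package and the SubscribeEnvelopeEvents API linked above; since the event has no topic field yet, it can only aggregate per peer:

```go
package main

import (
	"github.com/status-im/whisper/whisperv6"
)

// watchEnvelopes counts received envelopes per peer via the event feed.
// Sketch only: assumes a running *whisperv6.Whisper instance.
func watchEnvelopes(w *whisperv6.Whisper) {
	events := make(chan whisperv6.EnvelopeEvent, 100)
	sub := w.SubscribeEnvelopeEvents(events)
	defer sub.Unsubscribe()

	perPeer := make(map[string]uint64)
	for {
		select {
		case ev := <-events:
			if ev.Event == whisperv6.EventEnvelopeReceived {
				// Topic is not available on the event yet (see above),
				// so we can only break counts down by peer for now.
				perPeer[ev.Peer.String()]++
			}
		case <-sub.Err():
			return
		}
	}
}
```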

jakubgs commented Nov 7, 2019

There's already a bot that could be modified to do this, but it doesn't use status-protocol-go:
https://github.com/status-im/statusd-bots/blob/master/cmd/pubchats
So I might as well write it from scratch.

jakubgs commented Nov 7, 2019

Regarding the metrics themselves: I discussed them with Adam and we came up with these comments:

  • hash duplicate from a single peer - Unlikely edge case, but maybe worth checking for.
  • hash duplicate from multiple peers - Possible and likely, worth monitoring.
  • invalid envelopes - Not really possible, unless a bad actor exists that sends them.
  • valid - Sure.
  • expired/futured - Possible.
  • pow too low - Possible.
  • no bloom filter match - Makes no sense with a full bloom filter node.
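For the bounded part of this breakdown, something like the following could work with the new library. This is a minimal sketch using prometheus/client_golang; the metric name and label values are hypothetical, not the ones from the actual PR:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Ingress envelopes broken down by validation outcome. The label has a
// small fixed set of values, so cardinality stays low — unlike per-topic
// labels, which is why topics need a separate endpoint.
var envelopesIngress = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "whisper_envelopes_ingress_total", // hypothetical name
		Help: "Ingress envelopes by validation outcome.",
	},
	[]string{"outcome"}, // valid | expired | low_pow | no_bloom_match | invalid | duplicate
)

func init() {
	prometheus.MustRegister(envelopesIngress)
}

// Hypothetical call site: envelopesIngress.WithLabelValues("low_pow").Inc()
```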

oskarth commented Nov 8, 2019

> hash duplicate from a single peer - Unlikely edge case, but maybe worth checking for.
> invalid envelopes - Not really possible, unless a bad actor exists that sends them.
> no bloom filter match - Makes no sense with a full bloom filter node

Which is why we want to check them: we want them to be zero, and if they aren't, that indicates an issue.

The bloom filter one is also not unlikely to change; plus, the instrumentation can be used for light nodes as well as for testing Waku mode. It's likely the biggest bandwidth blocker in terms of false positives, so we want stats there.

jakubgs commented Nov 8, 2019

> The bloom filter one is also not unlikely to change, plus instrumentation can be used for light nodes as well as testing Waku mode. It's likely the biggest blocker bandwidth wise in terms of false positives, so we want stats there.

But how can an envelope not match a full bloom filter? That makes no sense.

jakubgs commented Nov 17, 2019

Here is my understanding of the metrics: I'm currently in the process of rewriting the existing ones and adding new ones using the new Prometheus library we used in #1648. I should have a PR tomorrow.

jakubgs commented Nov 19, 2019

First version of PR for new Whisper metrics: status-im/whisper#38

@adambabik

> hash duplicate from a single peer - Unlikely edge case, but maybe worth checking for.

This should be allowed. Consider a scenario with two nodes A and B:

  1. Node A broadcasts a message M to peer B,
  2. Node A restarts for some reason,
  3. Node A connects again to some old peers and to a new peer C, which sends it the message M,
  4. Node A receives the message M,
  5. Node A broadcasts the message M to peer B again, because it was restarted and does not remember having done that already.

So it's a completely valid case; we can't consider it immediately an error and disconnect. Alternatively, there could be some threshold which, depending on the TTL of the message and the number of duplicates, would finally disconnect such a peer.
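If we did want such a threshold, one purely illustrative way to express it (the names and the formula are made up, not from any implementation) is to scale the allowed duplicates with the message's TTL, since long-lived messages are more likely to be legitimately re-sent after a restart:

```go
package main

// dupTracker counts hash duplicates per peer and suggests disconnecting
// only once a TTL-scaled threshold is crossed. Illustrative sketch only.
type dupTracker struct {
	dups map[string]int // peer ID -> duplicate count
}

func (t *dupTracker) onDuplicate(peerID string, ttlSeconds int) (disconnect bool) {
	if t.dups == nil {
		t.dups = make(map[string]int)
	}
	t.dups[peerID]++
	// Allow more duplicates for longer-lived messages: a restart within
	// the TTL window makes an honest re-send more likely.
	threshold := 5 + ttlSeconds/10 // made-up formula
	return t.dups[peerID] > threshold
}
```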

oskarth commented Nov 20, 2019

I wouldn't go so far as to call it a completely valid case. We might choose to introduce some slack to make it easier for implementations if it is 100% necessary, but in principle it is bad behaviour. Node A should keep track of the messages it has already sent to B, and I suppose this should/could be persisted across restarts? Probably a spec issue. cc @kdeme

EDIT: In any case, for this specific issue, we should still track it!

cammellos commented Nov 20, 2019

> hash duplicate from a single peer - Unlikely edge case, but maybe worth checking for.

This is just at-least-once vs at-most-once; which semantics do we want to go for?

at-least-once seems best suited if there's some sort of acknowledgement between nodes, i.e. "I have received this RLP packet".

If it's at-least-once, then the occasional re-sending should not be considered bad behaviour, and it's indeed a valid case.
If instead we want to go for at-most-once, it should be considered bad behaviour, though packet loss might occur.

@adambabik

I think it's a huge requirement for a Whisper node. Currently, it can be totally stateless. If we require it to track what was sent to whom between restarts, it can't be stateless anymore.

Also, this requirement only makes sense with high PoW. High PoW + at-most-once gives you protection against flooding. Low PoW + at-most-once gives nothing (or very little), because creating messages is cheap, so it's almost the same as at-least-once.

oskarth commented Nov 20, 2019

While a useful discussion, I think this is off-topic for the OP. We still want to track it in terms of metrics.

For specs, I suggest we open an issue and continue the discussion there. Right now I believe it is not specified, and no implementation disconnects peers because of it (though it is suggested in the code: https://github.com/status-im/nim-eth/blob/master/eth/p2p/rlpx_protocols/whisper_protocol.nim#L179-L191).

kdeme commented Nov 20, 2019

The same point made by @adambabik applies to nim-eth: after a restart or a disconnect/reconnect, the list of already sent/received messages for that peer is gone.

In nim-eth, nodes don't disconnect peers on this "bad" behaviour, or in fact on any other type (for now). I'm not even sure disconnecting would practically create huge issues (disconnecting from the first duplicate, that is; a threshold of duplicates would be better, I guess): one peer fewer to be connected with, in this possibly(?) rare case? The network should be resilient to this, I'd hope. But yes, the best way to find out is to measure it.

@status-github-bot

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

oskarth commented Aug 6, 2021

This has largely been addressed, and more work on this is happening with nim-waku benchmarking.

cc @jm-clius fyi

oskarth closed this as completed Aug 6, 2021