feat(lightpush): introduce ReliabilityMonitor and allow `send` retries #2130

danisharora099 · 2024-09-12T08:19:07Z

Problem

Based on #2075 and #2069,

there is tight coupling between the reliability logic and the LightPush SDK class
if a lightpush.send() request fails, we renew the peer directly

Solution

Based on the above problem statement, we want to do two things:

extend the reliability-related logic and move it out of LightPush into its own class (SenderMessageMonitor)
attempt retries on the failed peers for lightpush.send() before we actually renew the peer

Notes

Resolves feat(lightpush): add retries to failing peers #2069
Resolves feat: message verification and retry #2075 partially
To be merged after feat(filter): reliability monitor as a separate class to handle reliability logic #2117 (with rebase)

Contribution checklist:

covered by unit tests;
covered by e2e test;
add ! in title if breaks public API;

github-actions · 2024-09-12T09:03:10Z

size-limit report 📦

| Path | Size | Loading time (3g) | Running time (snapdragon) | Total time |
| --- | --- | --- | --- | --- |
| Waku node | 83.56 KB (+0.44% 🔺) | 1.7 s (+0.44% 🔺) | 1.8 s (-42.57% 🔽) | 3.5 s |
| Waku Simple Light Node | 135.17 KB (+0.2% 🔺) | 2.8 s (+0.2% 🔺) | 1.8 s (-51.36% 🔽) | 4.5 s |
| ECIES encryption | 22.94 KB (0%) | 459 ms (0%) | 1.4 s (+136.49% 🔺) | 1.9 s |
| Symmetric encryption | 22.39 KB (0%) | 448 ms (0%) | 1.6 s (+96.63% 🔺) | 2.1 s |
| DNS discovery | 72.28 KB (0%) | 1.5 s (0%) | 1.8 s (-2.96% 🔽) | 3.3 s |
| Peer Exchange discovery | 73.79 KB (0%) | 1.5 s (0%) | 3.4 s (+117.46% 🔺) | 4.9 s |
| Local Peer Cache Discovery | 67.63 KB (0%) | 1.4 s (0%) | 2.2 s (+37.41% 🔺) | 3.6 s |
| Privacy preserving protocols | 74.79 KB (-0.05% 🔽) | 1.5 s (-0.05% 🔽) | 3.9 s (+63.96% 🔺) | 5.4 s |
| Waku Filter | 78.52 KB (+0.25% 🔺) | 1.6 s (+0.25% 🔺) | 2.8 s (+56.2% 🔺) | 4.4 s |
| Waku LightPush | 76.79 KB (+1.68% 🔺) | 1.6 s (+1.68% 🔺) | 1.8 s (-26.8% 🔽) | 3.3 s |
| History retrieval protocols | 76.02 KB (+0.1% 🔺) | 1.6 s (+0.1% 🔺) | 1.2 s (-42.03% 🔽) | 2.7 s |
| Deterministic Message Hashing | 7.38 KB (0%) | 148 ms (0%) | 480 ms (+48.41% 🔺) | 628 ms |

packages/sdk/src/index.ts

packages/sdk/src/protocols/reliability_monitor_manager.ts

packages/tests/tests/light-push/peer_management.spec.ts

danisharora099 · 2024-09-13T08:42:28Z

thanks for the great reviews @weboko ! helpful!

weboko · 2024-09-13T11:18:59Z

packages/sdk/src/protocols/lightpush/index.ts

@@ -89,16 +97,23 @@ class LightPushSDK extends BaseProtocolSDK implements ILightPushSDK {
          successes.push(success);
        }
        if (failure) {
+          failures.push(failure);


leaving for some context, will make an issue for it later so that we can discuss

in status-go where retry is already implemented they do things differently and we will probably do something similar too

when a message is sent it is queued, then it is retried, and once it definitely fails, that is communicated to the consumer through something similar to an event API

cc @waku-org/js-waku
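
A rough sketch of the queue-then-notify flow described above, for discussion only. It assumes a send function that resolves to true on success. `RetryingSendQueue`, `onFailure`, and `enqueue` are hypothetical names and are not taken from status-go or js-waku.

```typescript
// Hypothetical sketch: these names are illustrative, not js-waku or status-go APIs.
type SendFn = () => Promise<boolean>; // resolves to true when the push succeeded
type FailureListener = (messageId: string, attempts: number) => void;

class RetryingSendQueue {
  private readonly listeners = new Set<FailureListener>();
  // Promise chain that serialises queued messages in FIFO order.
  private tail: Promise<void> = Promise.resolve();

  constructor(private readonly maxAttempts: number = 3) {}

  // Consumers learn about definite failures here, not from the enqueue call.
  onFailure(listener: FailureListener): void {
    this.listeners.add(listener);
  }

  // Queue a message; the returned promise settles once the message
  // was either delivered or given up on.
  enqueue(messageId: string, send: SendFn): Promise<void> {
    this.tail = this.tail.then(() => this.deliver(messageId, send));
    return this.tail;
  }

  private async deliver(messageId: string, send: SendFn): Promise<void> {
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      try {
        if (await send()) return; // delivered, nothing to report
      } catch {
        // A thrown error counts as one failed attempt.
      }
    }
    // Every attempt failed: only now is the consumer notified.
    this.listeners.forEach((listener) => listener(messageId, this.maxAttempts));
  }
}
```

With this shape the caller does not inspect per-peer failures at all; it subscribes once with `onFailure` and is only told about messages that exhausted their attempts.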

weboko · 2024-09-13T11:20:07Z

packages/sdk/src/protocols/lightpush/index.ts

+            const peer = this.connectedPeers.find((connectedPeer) =>
+              connectedPeer.id.equals(failure.peerId)
+            );
+            if (peer) {


I think here we should retry to any peer available - otherwise we lose the message if the peer got dropped (renewed, went offline or just a networking issue)

well, this block is executed WHEN a particular peer fails to send a lightpush request.

example: we have 3 connected peers, and one fails to send it; this is the block that will reattempt delivery for that peer and then renew it (instead of just renewing)

we aren't losing the message in any case

the reliability monitor later (even if the reattempts fail) resends the lightpush request after renewal

hm, I think we should align here on what is desired behavior.

considering what we care about is successfully pushing a message at this stage (i.e. no errors while sending)
then it is enough for us to have at least 1 successful push - in that case no retries are needed
if all pushes fail - then just retry, but not necessarily to the same peer, just any peer
and if during all of this any peer fails 3 times - we renew it

I probably summarized it for myself only, but just clarifying
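
The behaviour summarized above could look roughly like this. It is a sketch under assumptions: peers are plain strings, `push` and `renew` are hypothetical callbacks, and failure counts live only for the duration of one call.

```typescript
// Hypothetical types and names, for illustration only.
type PeerId = string;
type PushFn = (peer: PeerId) => Promise<boolean>;

const MAX_FAILURES_PER_PEER = 3;

async function pushUntilOneSucceeds(
  peers: PeerId[],
  push: PushFn,
  renew: (peer: PeerId) => Promise<PeerId>,
  maxRounds: number = 5
): Promise<boolean> {
  const failures = new Map<PeerId, number>();
  let pool = [...peers];

  for (let round = 0; round < maxRounds && pool.length > 0; round++) {
    // A rejected push is treated the same as an unsuccessful one.
    const results = await Promise.all(
      pool.map((peer) => push(peer).catch(() => false))
    );
    if (results.some(Boolean)) return true; // one successful push is enough

    // Every peer failed this round: count the failure and renew any
    // peer that has now failed MAX_FAILURES_PER_PEER times.
    const next: PeerId[] = [];
    for (let i = 0; i < pool.length; i++) {
      const peer = pool[i];
      const count = (failures.get(peer) ?? 0) + 1;
      if (count >= MAX_FAILURES_PER_PEER) {
        failures.delete(peer);
        next.push(await renew(peer));
      } else {
        failures.set(peer, count);
        next.push(peer);
      }
    }
    pool = next; // the next round goes to whichever peers are now available
  }
  return false;
}
```

A real implementation would likely keep the failure counts across calls, so that a peer failing on three separate messages is also renewed.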

> considering what we care about is successfully pushing a message at this stage (i.e. no errors while sending)
> then it is enough for us to have at least 1 successful push - in that case no retries are needed

Well, technically, yes. I agree. I can't really think of a case where we would indeed need redundancy if one of the peers can assure us that they indeed relayed the message further into GossipSub. However, for now, without LightPush V2, it's not trivial to get that. Thus, having redundancy + retries is good for now and we can revisit later if our apps perform well. (ref: https://discord.com/channels/1110799176264056863/1284121724433993758/1285014955023798342 as well)

decoupled into follow up #2140

weboko · 2024-09-13T11:21:58Z

packages/sdk/src/protocols/lightpush/index.ts

+                `);
+              void this.reliabilityMonitor.attemptRetriesOrRenew(
+                failure.peerId,
+                () => this.protocol.send(encoder, message, peer)


aren't we risking getting into recursion here? apologies if I missed that code part

lightPush fails -> retry initiated -> lightPush fails -> ...

hm that's a good point.
technically, recursion would've been a neater solution here but this is not it

here: we detect that one peer has failed to send the lightpush request, and reattempt a few times. if it keeps failing, we renew the peer, and attempt (only once) to send the lightpush request through the new peer.
here it would be neater to introduce recursion maybe, but seems like overkill for now TBH.

we can find peace in the fact that:

we already use multiple peers to send it first time

even if one of the peers fails, we will reattempt

if even that fails, we will do renewal and use the new peer to send it

for it to fail, ALL peers would have to literally just not entertain our requests

in a case where we disregard that, introducing recursion here would be a neat solution.
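
A minimal sketch of the flow described above (bounded reattempts on the failing peer, then renewal, then a single send through the new peer). It is written as a loop rather than recursion and is not the code in this PR: `SenderMonitorSketch`, the `renewPeer` callback, and a `send` function that takes the target peer are assumptions made for illustration.

```typescript
// Hypothetical sketch; names and signatures are assumptions, not the PR's code.
type PeerId = string;

class SenderMonitorSketch {
  // Number of reattempts already made per failing peer.
  private readonly attempts = new Map<PeerId, number>();

  constructor(
    private readonly renewPeer: (peer: PeerId) => Promise<PeerId>,
    private readonly maxAttempts: number = 3
  ) {}

  async attemptRetriesOrRenew(
    failedPeer: PeerId,
    send: (peer: PeerId) => Promise<boolean>
  ): Promise<boolean> {
    // Step 1: reattempt on the same peer, at most maxAttempts times.
    while ((this.attempts.get(failedPeer) ?? 0) < this.maxAttempts) {
      this.attempts.set(failedPeer, (this.attempts.get(failedPeer) ?? 0) + 1);
      if (await send(failedPeer).catch(() => false)) {
        this.attempts.delete(failedPeer);
        return true;
      }
    }

    // Step 2: the peer kept failing, so renew it and send exactly once
    // through the replacement. No further retries are started from here.
    const newPeer = await this.renewPeer(failedPeer);
    this.attempts.delete(failedPeer);
    this.attempts.set(newPeer, 0);
    return send(newPeer).catch(() => false);
  }
}
```

Because both steps are bounded, one call makes at most `maxAttempts + 1` send requests, which is the property the recursion question above is really about.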

continue https://github.com/waku-org/js-waku/pull/2130/files#r1759073607

packages/tests/tests/light-push/peer_management.spec.ts

weboko · 2024-09-13T15:29:01Z

packages/sdk/src/reliability_monitor/sender.ts

+
+        this.attempts.delete(peerIdStr);
+        this.attempts.set(newPeer.id.toString(), 0);
+        await protocolSend();


prev - https://github.com/waku-org/js-waku/pull/2130/files#r1758702483

here we call .send in both branches, leaving no exit from the recursion (i.e. it seems to me it will be called over and over)

so I think here we should do .send to newPeer instead, because the next time attemptRetriesOrRenew is called from here it will be for peerIdStr, and at that point this.attempts will not have it, so it will just continue

I believe this is the reason why you needed to implement .stop operation on the manager entity

this.protocol.send() is different from SDK's lightpush.send()

SDK.send() calls this.protocol.send() internally

> so I think here we should do .send to newPeer instead because next time from here attemptRetriesOrRenew

We are indeed doing this.protocol.send(peer) by binding the protocolSend() function call:

void this.reliabilityMonitor.attemptRetriesOrRenew( failure.peerId, () => this.protocol.send(encoder, message, peer)

oh right, thanks for clarifying
then it is not an issue - we can have a recursion here
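
To illustrate the distinction drawn above, here is a toy model of the two layers. `ProtocolSketch`, `SdkSketch`, and `retryThreeTimes` are hypothetical stand-ins, not js-waku classes; the point is only that the retry closure calls the protocol-level send directly, so SDK-level failure handling is not re-entered.

```typescript
// Hypothetical stand-ins that only mimic the two layers discussed above.
type PeerId = string;
interface SendResult { peer: PeerId; ok: boolean; }
type RetryFn = (
  peer: PeerId,
  protocolSend: () => Promise<SendResult>
) => Promise<void>;

// Protocol layer: one request to one peer, with no retry logic of its own.
class ProtocolSketch {
  readonly log: string[] = [];
  constructor(private readonly healthyPeers: Set<PeerId>) {}

  async send(message: string, peer: PeerId): Promise<SendResult> {
    this.log.push(`${message} -> ${peer}`);
    return { peer, ok: this.healthyPeers.has(peer) };
  }
}

// SDK layer: fans out to all peers and hands failures to the retry function.
class SdkSketch {
  sdkSendCalls = 0;
  constructor(
    private readonly protocol: ProtocolSketch,
    private readonly peers: PeerId[],
    private readonly retry: RetryFn
  ) {}

  async send(message: string): Promise<SendResult[]> {
    this.sdkSendCalls++;
    const results = await Promise.all(
      this.peers.map((peer) => this.protocol.send(message, peer))
    );
    // The closure is bound to protocol.send and to one peer, so a retry
    // never re-enters SdkSketch.send and cannot trigger further retries.
    // Awaited here for determinism; the diff above fires it with `void`.
    await Promise.all(
      results
        .filter((result) => !result.ok)
        .map((result) =>
          this.retry(result.peer, () => this.protocol.send(message, result.peer))
        )
    );
    return results;
  }
}

// A bounded retry: at most three calls to the bound protocol-level send.
const retryThreeTimes: RetryFn = async (_peer, protocolSend) => {
  for (let attempt = 0; attempt < 3; attempt++) {
    if ((await protocolSend()).ok) return;
  }
};
```

With one healthy and one failing peer, a single SDK-level `send` results in two initial protocol-level requests plus three bounded retries, and the SDK-level method itself runs only once.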

weboko · 2024-09-16T09:55:50Z

This is the last discussion I want to align on - #2130 (comment)
Other than that looks good!

danisharora099 changed the base branch from master to feat/filter-reliability-split September 12, 2024 08:19

danisharora099 force-pushed the feat/lightpush-reliability-monitor branch from 9bde3f7 to 73e350a Compare September 12, 2024 08:51

danisharora099 marked this pull request as ready for review September 12, 2024 11:15

danisharora099 force-pushed the feat/filter-reliability-split branch from 5ae69f0 to 2eddaf5 Compare September 12, 2024 11:41

danisharora099 force-pushed the feat/lightpush-reliability-monitor branch from 3c82393 to e51e369 Compare September 12, 2024 11:42

weboko requested a review from a team September 12, 2024 14:52

weboko reviewed Sep 12, 2024

packages/sdk/src/index.ts (outdated, resolved)

weboko reviewed Sep 12, 2024

packages/sdk/src/protocols/reliability_monitor_manager.ts (outdated, resolved)

weboko reviewed Sep 12, 2024

packages/tests/tests/light-push/peer_management.spec.ts (resolved)

danisharora099 requested a review from a team September 13, 2024 08:38

danisharora099 requested a review from weboko September 13, 2024 08:42

danisharora099 force-pushed the feat/lightpush-reliability-monitor branch from 0bd895b to 58b2c13 Compare September 13, 2024 08:50

Base automatically changed from feat/filter-reliability-split to master September 13, 2024 09:27

danisharora099 added 4 commits September 13, 2024 15:20

chore: restructure reliabiltiy monitors

0fe4869

feat: setup sender monitor

c0e6e05

chore: update tests

6b58f08

chore: minor fixes

89c7bae

danisharora099 force-pushed the feat/lightpush-reliability-monitor branch 2 times, most recently from 381a84e to 89c7bae Compare September 13, 2024 10:56

weboko reviewed Sep 13, 2024

packages/tests/tests/light-push/peer_management.spec.ts (resolved)

chore: comment for doc

9b56e13

danisharora099 requested a review from weboko September 13, 2024 11:41

weboko reviewed Sep 13, 2024

weboko approved these changes Sep 16, 2024

danisharora099 merged commit 7a6247c into master Sep 17, 2024
10 of 11 checks passed

danisharora099 deleted the feat/lightpush-reliability-monitor branch September 17, 2024 06:05

weboko mentioned this pull request Sep 16, 2024

chore: release master #2135

Merged
