feat(lightpush): introduce ReliabilityMonitor and allow send retries #2130

Merged
danisharora099 merged 5 commits into master from feat/lightpush-reliability-monitor on Sep 17, 2024

Conversation

danisharora099
Collaborator

@danisharora099 danisharora099 commented Sep 12, 2024

Problem

Based on #2075 and #2069,

  • there is tight coupling between the reliability logic and the LightPush SDK class
  • if a lightpush.send() request fails, we renew the peer directly

Solution

Based on the above problem statement, we want to do two things:

  • extract the reliability-related logic from LightPush into its own class (SenderMessageMonitor)
  • attempt retries on the failed peers for lightpush.send() before we actually renew the peer

Notes

Contribution checklist:

  • covered by unit tests;
  • covered by e2e test;
  • add ! in title if breaks public API;

@danisharora099 danisharora099 changed the base branch from master to feat/filter-reliability-split September 12, 2024 08:19
@danisharora099 danisharora099 force-pushed the feat/lightpush-reliability-monitor branch from 9bde3f7 to 73e350a Compare September 12, 2024 08:51

github-actions bot commented Sep 12, 2024

size-limit report 📦

| Path | Size | Loading time (3g) | Running time (snapdragon) | Total time |
|---|---|---|---|---|
| Waku node | 83.56 KB (+0.44% 🔺) | 1.7 s (+0.44% 🔺) | 1.8 s (-42.57% 🔽) | 3.5 s |
| Waku Simple Light Node | 135.17 KB (+0.2% 🔺) | 2.8 s (+0.2% 🔺) | 1.8 s (-51.36% 🔽) | 4.5 s |
| ECIES encryption | 22.94 KB (0%) | 459 ms (0%) | 1.4 s (+136.49% 🔺) | 1.9 s |
| Symmetric encryption | 22.39 KB (0%) | 448 ms (0%) | 1.6 s (+96.63% 🔺) | 2.1 s |
| DNS discovery | 72.28 KB (0%) | 1.5 s (0%) | 1.8 s (-2.96% 🔽) | 3.3 s |
| Peer Exchange discovery | 73.79 KB (0%) | 1.5 s (0%) | 3.4 s (+117.46% 🔺) | 4.9 s |
| Local Peer Cache Discovery | 67.63 KB (0%) | 1.4 s (0%) | 2.2 s (+37.41% 🔺) | 3.6 s |
| Privacy preserving protocols | 74.79 KB (-0.05% 🔽) | 1.5 s (-0.05% 🔽) | 3.9 s (+63.96% 🔺) | 5.4 s |
| Waku Filter | 78.52 KB (+0.25% 🔺) | 1.6 s (+0.25% 🔺) | 2.8 s (+56.2% 🔺) | 4.4 s |
| Waku LightPush | 76.79 KB (+1.68% 🔺) | 1.6 s (+1.68% 🔺) | 1.8 s (-26.8% 🔽) | 3.3 s |
| History retrieval protocols | 76.02 KB (+0.1% 🔺) | 1.6 s (+0.1% 🔺) | 1.2 s (-42.03% 🔽) | 2.7 s |
| Deterministic Message Hashing | 7.38 KB (0%) | 148 ms (0%) | 480 ms (+48.41% 🔺) | 628 ms |

@danisharora099 danisharora099 marked this pull request as ready for review September 12, 2024 11:15
@danisharora099 danisharora099 force-pushed the feat/filter-reliability-split branch from 5ae69f0 to 2eddaf5 Compare September 12, 2024 11:41
@danisharora099 danisharora099 force-pushed the feat/lightpush-reliability-monitor branch from 3c82393 to e51e369 Compare September 12, 2024 11:42
@weboko weboko requested a review from a team September 12, 2024 14:52
packages/sdk/src/index.ts Outdated
@danisharora099 danisharora099 requested a review from a team September 13, 2024 08:38
@danisharora099
Collaborator Author

thanks for the great reviews @weboko! helpful!

@danisharora099 danisharora099 force-pushed the feat/lightpush-reliability-monitor branch from 0bd895b to 58b2c13 Compare September 13, 2024 08:50
Base automatically changed from feat/filter-reliability-split to master September 13, 2024 09:27
@danisharora099 danisharora099 force-pushed the feat/lightpush-reliability-monitor branch 2 times, most recently from 381a84e to 89c7bae Compare September 13, 2024 10:56
@@ -89,16 +97,23 @@ class LightPushSDK extends BaseProtocolSDK implements ILightPushSDK {
successes.push(success);
}
if (failure) {
failures.push(failure);
Collaborator

leaving for some context, will make an issue for it later so that we can discuss

in status-go where retry is already implemented they do things differently and we will probably do something similar too

when a message is sent it is queued, then it is retried, and once it has definitely failed, that is communicated to the consumer through something similar to an event API

cc @waku-org/js-waku
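
For context, the queue-then-retry flow described in this comment could look roughly like the sketch below. It is neither status-go's nor js-waku's implementation: `RetryQueue`, `SendFn`, `onFailed`, and the default of 3 attempts are hypothetical names and values chosen only for illustration.

```typescript
// Hypothetical sketch: queue a message, retry it, and report a definitive
// failure to the consumer through a listener (an event-like API).
type SendFn = (payload: string) => Promise<boolean>; // true = sent without error
type FailureListener = (payload: string, attempts: number) => void;

class RetryQueue {
  private readonly queue: string[] = [];
  private readonly listeners: FailureListener[] = [];
  private readonly send: SendFn;
  private readonly maxAttempts: number;

  // maxAttempts defaults to 3; this is an assumed limit for the sketch.
  constructor(send: SendFn, maxAttempts = 3) {
    this.send = send;
    this.maxAttempts = maxAttempts;
  }

  // Register a consumer callback for messages that definitively failed.
  onFailed(listener: FailureListener): void {
    this.listeners.push(listener);
  }

  enqueue(payload: string): void {
    this.queue.push(payload);
  }

  // Drain the queue, trying each message up to maxAttempts times.
  async flush(): Promise<void> {
    for (;;) {
      const payload = this.queue.shift();
      if (payload === undefined) return;
      let sent = false;
      let attempts = 0;
      while (!sent && attempts < this.maxAttempts) {
        attempts++;
        // A rejected promise counts as a failed attempt.
        sent = await this.send(payload).catch(() => false);
      }
      if (!sent) {
        for (const listener of this.listeners) listener(payload, attempts);
      }
    }
  }
}
```

The consumer only hears about a message once all attempts are exhausted, which matches the "once it has definitely failed" part of the description.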

const peer = this.connectedPeers.find((connectedPeer) =>
connectedPeer.id.equals(failure.peerId)
);
if (peer) {
Collaborator

I think here we should retry to any peer available - otherwise we lose the message if the peer got dropped (renewed, went offline or just a networking issue)

Collaborator Author

well, this block is executed WHEN a particular peer fails to send a lightpush request.

example: we have 3 connected peers and one fails to send; this is the block that reattempts delivery through that peer before renewing it (instead of just renewing)

we aren't losing the message in any case

even if the reattempts fail, the reliability monitor later resends the lightpush request after renewal

Collaborator

hm, I think we should align here on what the desired behavior is.

considering that what we care about at this stage is successfully pushing a message (i.e. no errors while sending):

  • it is enough for us to have at least 1 successful push - in that case no retries are needed
  • if all pushes fail - then just retry, but not necessarily to the same peer, just any peer
  • and if during all of this any peer fails 3 times - we renew it

I probably summarized it for myself only, but just clarifying
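
The policy summarized in this comment can be written down as a small pure function. This is only a sketch of the proposal under discussion, not what the PR merged; `decide`, `SendOutcome`, `Decision`, and `failureCounts` are hypothetical names, and whether a success should reset a peer's failure count is left open because the comment does not say.

```typescript
// Hypothetical sketch of the proposed policy; all names are illustrative.
type PeerId = string;

interface SendOutcome {
  peerId: PeerId;
  ok: boolean; // true = pushed without error
}

interface Decision {
  retryNeeded: boolean; // true only when no push succeeded
  peersToRenew: PeerId[]; // peers that reached the failure limit
}

const MAX_FAILURES = 3; // "any peer fails 3 times" from the comment

function decide(
  outcomes: SendOutcome[],
  failureCounts: Map<PeerId, number>
): Decision {
  const peersToRenew: PeerId[] = [];
  for (const { peerId, ok } of outcomes) {
    if (ok) continue;
    const failures = (failureCounts.get(peerId) ?? 0) + 1;
    if (failures >= MAX_FAILURES) {
      peersToRenew.push(peerId);
      failureCounts.delete(peerId); // the replacement peer starts fresh
    } else {
      failureCounts.set(peerId, failures);
    }
  }
  // One successful push is enough; retry (to any peer) only if all failed.
  const anySuccess = outcomes.some((outcome) => outcome.ok);
  return { retryNeeded: !anySuccess, peersToRenew };
}
```

Keeping the decision separate from the actual sending makes the rule easy to test without a network, and leaves "which peer to retry with" to the caller.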

Collaborator Author

> considering what we care is successfully pushing a message at this stage (i.e no errors while sending)
> then it is enough for us to have at least 1 successful push - in that case no retries needed

Well, technically, yes. I agree. I can't really think of a case where we would indeed need redundancy if one of the peers can assure us that they indeed relayed the message further into GossipSub. However, for now, without LightPush V2, it's not trivial to get that. Thus, having redundancy + retries is good for now and we can revisit later if our apps perform well. (ref: https://discord.com/channels/1110799176264056863/1284121724433993758/1285014955023798342 as well)

Collaborator

decoupled into follow up #2140

`);
void this.reliabilityMonitor.attemptRetriesOrRenew(
failure.peerId,
() => this.protocol.send(encoder, message, peer)
Collaborator

aren't we risking getting into recursion here? apologies if I missed that code part

lightPush fails -> retry initiated -> lightPush fails -> ...

Collaborator Author

@danisharora099 danisharora099 Sep 13, 2024

hm that's a good point.
technically, recursion would've been a neater solution here but this is not it

here: we detect that one peer has failed to send the lightpush request, and reattempt a few times. if it keeps failing, we renew the peer, and attempt (only once) to send the lightpush request through that peer.
it might be neater to introduce recursion here, but it seems like overkill for now TBH.

we can find peace in the fact that:

  • we already use multiple peers to send it first time
  • even if one of the peers fails, we will reattempt
    • if even that fails, we will do renewal and use the new peer to send it

for it to fail, ALL peers would have to literally just not entertain our requests

in a case where we disregard that, introducing recursion here would be a neat solution.
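
To make the flow in this comment concrete, here is a minimal sketch of a bounded retry-then-renew helper. It is not the merged code: the class name `RetryOrRenewMonitor`, the `renewPeer` callback, the string peer ids, and the limit of 3 attempts are assumptions. Only the method name `attemptRetriesOrRenew` and the `attempts` map come from the diff in this PR.

```typescript
// Hypothetical sketch of retry-then-renew; not the code merged in this PR.
type PeerId = string;
type SendResult = { ok: boolean };

class RetryOrRenewMonitor {
  private readonly attempts = new Map<PeerId, number>();
  private readonly renewPeer: (peerId: PeerId) => Promise<PeerId>;
  private readonly maxAttempts: number;

  // renewPeer and maxAttempts are assumptions made for this sketch.
  constructor(
    renewPeer: (peerId: PeerId) => Promise<PeerId>,
    maxAttempts = 3
  ) {
    this.renewPeer = renewPeer;
    this.maxAttempts = maxAttempts;
  }

  async attemptRetriesOrRenew(
    peerId: PeerId,
    protocolSend: () => Promise<SendResult>
  ): Promise<boolean> {
    // Bounded loop instead of recursion: at most maxAttempts retries.
    while ((this.attempts.get(peerId) ?? 0) < this.maxAttempts) {
      this.attempts.set(peerId, (this.attempts.get(peerId) ?? 0) + 1);
      const result = await protocolSend();
      if (result.ok) {
        this.attempts.delete(peerId);
        return true;
      }
    }
    // Retries exhausted: renew the peer, then send exactly once more.
    const newPeerId = await this.renewPeer(peerId);
    this.attempts.delete(peerId);
    this.attempts.set(newPeerId, 0);
    return (await protocolSend()).ok;
  }
}
```

Because the loop is bounded and the post-renewal send happens once, the helper always terminates. Which peer that final send targets depends entirely on how the caller binds `protocolSend`.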

Collaborator

this.attempts.delete(peerIdStr);
this.attempts.set(newPeer.id.toString(), 0);
await protocolSend();
Collaborator

prev - https://github.com/waku-org/js-waku/pull/2130/files#r1758702483

here we call .send in both branches, which leaves no exit from the recursion (i.e. it seems to me it will be called over and over)

so I think here we should do .send to newPeer instead, because the next time attemptRetriesOrRenew is called from here it will be for peerIdStr, and at that point this.attempts will not have it, so it will just continue

I believe this is the reason why you needed to implement the .stop operation on the manager entity

Collaborator Author

@danisharora099 danisharora099 Sep 16, 2024

this.protocol.send() is different from SDK's lightpush.send()

SDK.send() calls this.protocol.send() internally

> so I think here we should do .send to newPeer instead because next time from here attemptRetriesOrRenew

We are indeed doing this.protocol.send(peer) by binding the protocolSend() function call:

void this.reliabilityMonitor.attemptRetriesOrRenew(
  failure.peerId,
  () => this.protocol.send(encoder, message, peer)
);
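
The distinction made in this reply can be shown with a toy model: the retry callback wraps the protocol-level send, so retries never re-enter the SDK-level `send()` that scheduled them. `FakeProtocol`, `FakeSdk`, `retry`, and `pendingRetry` are hypothetical stand-ins, not js-waku classes.

```typescript
// Hypothetical sketch: retries call the protocol-level send only, so the
// SDK-level send() is never re-entered.
type Peer = { id: string };

class FakeProtocol {
  calls = 0;
  // Stand-in for the protocol-level send; always fails to simulate a bad peer.
  async send(_message: string, _peer: Peer): Promise<boolean> {
    this.calls++;
    return false;
  }
}

// Call protocolSend up to `times` times, stopping at the first success.
async function retry(
  protocolSend: () => Promise<boolean>,
  times: number
): Promise<void> {
  for (let i = 0; i < times; i++) {
    if (await protocolSend()) return;
  }
}

class FakeSdk {
  sdkCalls = 0;
  pendingRetry: Promise<void> = Promise.resolve();
  private readonly protocol: FakeProtocol;
  private readonly peer: Peer;

  constructor(protocol: FakeProtocol, peer: Peer) {
    this.protocol = protocol;
    this.peer = peer;
  }

  // Stand-in for the SDK-level send, which calls the protocol internally.
  async send(message: string): Promise<boolean> {
    this.sdkCalls++;
    const ok = await this.protocol.send(message, this.peer);
    if (!ok) {
      // Bind the protocol call (not this.send) into the retry callback.
      this.pendingRetry = retry(
        () => this.protocol.send(message, this.peer),
        2
      );
    }
    return ok;
  }
}
```

In this model one SDK-level call can trigger several protocol-level calls, but the count of SDK-level calls stays at one, which is why the callback on its own does not create unbounded recursion.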

Collaborator

oh right, thanks for clarifying
then it is not an issue - we can have a recursion here

@weboko
Collaborator

weboko commented Sep 16, 2024

This is the last discussion I want to align on - #2130 (comment)
Other than that looks good!

@danisharora099 danisharora099 merged commit 7a6247c into master Sep 17, 2024
10 of 11 checks passed
@danisharora099 danisharora099 deleted the feat/lightpush-reliability-monitor branch September 17, 2024 06:05
@weboko weboko mentioned this pull request Sep 16, 2024
Development

Successfully merging this pull request may close these issues.

  • feat: message verification and retry
  • feat(lightpush): add retries to failing peers
2 participants