feat(lightpush): introduce ReliabilityMonitor and allow send retries #2130
Changes from 4 commits: 0fe4869, c0e6e05, 6b58f08, 89c7bae, 9b56e13
Changes to `LightPushSDK`:

```diff
@@ -13,13 +13,17 @@ import {
 } from "@waku/interfaces";
 import { ensurePubsubTopicIsConfigured, Logger } from "@waku/utils";

-import { BaseProtocolSDK } from "./base_protocol.js";
+import { ReliabilityMonitorManager } from "../../reliability_monitor/index.js";
+import { SenderReliabilityMonitor } from "../../reliability_monitor/sender.js";
+import { BaseProtocolSDK } from "../base_protocol.js";

 const log = new Logger("sdk:light-push");

 class LightPushSDK extends BaseProtocolSDK implements ILightPushSDK {
   public readonly protocol: LightPushCore;

+  private readonly reliabilityMonitor: SenderReliabilityMonitor;
+
   public constructor(
     connectionManager: ConnectionManager,
     libp2p: Libp2p,
```
```diff
@@ -33,6 +37,10 @@ class LightPushSDK extends BaseProtocolSDK implements ILightPushSDK {
       }
     );

+    this.reliabilityMonitor = ReliabilityMonitorManager.createSenderMonitor(
+      this.renewPeer.bind(this)
+    );
+
     this.protocol = this.core as LightPushCore;
   }
```
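Worth noting why `this.renewPeer.bind(this)` is passed rather than the bare method: the monitor stores the function and calls it later, and a detached class method loses its `this` at that point. A minimal sketch of the pitfall (hypothetical names, not from this PR):

```typescript
class Sdk {
  private label = "sdk";

  public renewPeer(): string {
    // Relies on `this` being the Sdk instance when invoked.
    return this.label;
  }
}

const sdk = new Sdk();

const unbound = sdk.renewPeer;
// unbound() throws in strict mode: `this` is undefined once the
// method is detached from its instance.

const bound = sdk.renewPeer.bind(sdk);
bound(); // "sdk" — safe to hand off for deferred invocation
```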
```diff
@@ -89,16 +97,23 @@ class LightPushSDK extends BaseProtocolSDK implements ILightPushSDK {
           successes.push(success);
         }
         if (failure) {
-          failures.push(failure);
           if (failure.peerId) {
-            try {
-              await this.renewPeer(failure.peerId);
-              log.info("Renewed peer", failure.peerId.toString());
-            } catch (error) {
-              log.error("Failed to renew peer", error);
-            }
+            const peer = this.connectedPeers.find((connectedPeer) =>
+              connectedPeer.id.equals(failure.peerId)
+            );
+            if (peer) {
+              log.info(`
+                Failed to send message to peer ${failure.peerId}.
+                Retrying the message with the same peer in the background.
+                If this fails, the peer will be renewed.
+              `);
+              void this.reliabilityMonitor.attemptRetriesOrRenew(
+                failure.peerId,
+                () => this.protocol.send(encoder, message, peer)
+              );
+            }
           }
+          failures.push(failure);
         }
       } else {
         log.error("Failed unexpectedly while sending:", result.reason);
```

Review thread on the `if (peer)` block:

> I think here we should retry with any available peer — otherwise we lose the message if the peer got dropped (renewed, went offline, or hit a networking issue).

> Well, this block is executed when a particular peer fails to send a lightpush request. Example: we have 3 connected peers and one fails to send; this is the block that reattempts delivery for that peer before renewing it (instead of just renewing). We aren't losing the message in any case — the reliability monitor later (even if the reattempts fail) resends the lightpush request after renewal.

> Hm, I think we should align here on the desired behavior, considering that what we care about at this stage is successfully pushing a message (i.e. no errors while sending). I probably summarized it for myself only, but just clarifying.

> Well, technically, yes, I agree. I can't really think of a case where we would need redundancy if one of the peers can assure us that it indeed relayed the message further into GossipSub. However, for now, without LightPush V2, it's not trivial to get that. Thus, having redundancy + retries is good for now, and we can revisit later if our apps perform well. (ref: https://discord.com/channels/1110799176264056863/1284121724433993758/1285014955023798342 as well)

> Decoupled into follow-up #2140.

Review thread on the `attemptRetriesOrRenew` call:

> Aren't we risking getting into recursion here? Apologies if I missed that code part. lightPush fails → retry initiated → lightPush fails → ...

> Hm, that's a good point. Here we detect that one peer failed to send the lightpush request and reattempt a few times; if it keeps failing, we renew the peer and attempt (only once) to send the lightpush request through that peer. We can find peace in the fact that, for it to fail entirely, ALL peers would have to literally just not entertain our requests. In a case where we disregard that, introducing recursion here would be a neat solution.
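For context, a sketch of how this surfaces at the application level (assuming the public js-waku API shape; the content topic and payload are illustrative):

```typescript
import { createEncoder, createLightNode } from "@waku/sdk";

// Hypothetical content topic for illustration.
const encoder = createEncoder({ contentTopic: "/example/1/demo/proto" });

const node = await createLightNode({ defaultBootstrap: true });
await node.start();

// send() resolves with per-peer successes/failures; with this PR,
// each failed peer is retried in the background by the
// SenderReliabilityMonitor and renewed if retries keep failing.
const result = await node.lightPush.send(encoder, {
  payload: new TextEncoder().encode("hello")
});
console.log(`delivered to ${result.successes.length} peer(s)`);
```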
New file — `ReliabilityMonitorManager` (`reliability_monitor/index.ts`, +70 lines):

```typescript
import type { Peer, PeerId } from "@libp2p/interface";
import {
  ContentTopic,
  CoreProtocolResult,
  PubsubTopic
} from "@waku/interfaces";

import { ReceiverReliabilityMonitor } from "./receiver.js";
import { SenderReliabilityMonitor } from "./sender.js";

export class ReliabilityMonitorManager {
  private static receiverMonitors: Map<
    PubsubTopic,
    ReceiverReliabilityMonitor
  > = new Map();
  private static senderMonitor: SenderReliabilityMonitor | undefined;

  public static createReceiverMonitor(
    pubsubTopic: PubsubTopic,
    getPeers: () => Peer[],
    renewPeer: (peerId: PeerId) => Promise<Peer>,
    getContentTopics: () => ContentTopic[],
    protocolSubscribe: (
      pubsubTopic: PubsubTopic,
      peer: Peer,
      contentTopics: ContentTopic[]
    ) => Promise<CoreProtocolResult>
  ): ReceiverReliabilityMonitor {
    if (ReliabilityMonitorManager.receiverMonitors.has(pubsubTopic)) {
      return ReliabilityMonitorManager.receiverMonitors.get(pubsubTopic)!;
    }

    const monitor = new ReceiverReliabilityMonitor(
      pubsubTopic,
      getPeers,
      renewPeer,
      getContentTopics,
      protocolSubscribe
    );
    ReliabilityMonitorManager.receiverMonitors.set(pubsubTopic, monitor);
    return monitor;
  }

  public static createSenderMonitor(
    renewPeer: (peerId: PeerId) => Promise<Peer>
  ): SenderReliabilityMonitor {
    if (!ReliabilityMonitorManager.senderMonitor) {
      ReliabilityMonitorManager.senderMonitor = new SenderReliabilityMonitor(
        renewPeer
      );
    }
    return ReliabilityMonitorManager.senderMonitor;
  }

  private constructor() {}

  public static stop(pubsubTopic: PubsubTopic): void {
    this.receiverMonitors.delete(pubsubTopic);
    this.senderMonitor = undefined;
  }

  public static stopAll(): void {
    for (const [pubsubTopic, monitor] of this.receiverMonitors) {
      monitor.setMaxMissedMessagesThreshold(undefined);
      monitor.setMaxPingFailures(undefined);
      this.receiverMonitors.delete(pubsubTopic);
      this.senderMonitor = undefined;
    }
  }
}
```
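Since `createReceiverMonitor` and `createSenderMonitor` memoize their instances (per pubsub topic and globally, respectively), callers share monitors rather than constructing their own. A small sketch of that singleton behavior (the `renewPeer` stub is illustrative only; the import path mirrors the PR's module layout):

```typescript
import type { Peer, PeerId } from "@libp2p/interface";
import { ReliabilityMonitorManager } from "./reliability_monitor/index.js";

// Illustrative stub; in the SDK this is LightPushSDK's bound renewPeer.
const renewPeer = async (peerId: PeerId): Promise<Peer> => {
  throw new Error(`renewal of ${peerId.toString()} not implemented in stub`);
};

const a = ReliabilityMonitorManager.createSenderMonitor(renewPeer);
const b = ReliabilityMonitorManager.createSenderMonitor(renewPeer);
console.log(a === b); // true — the manager hands back the same monitor
```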
New file — `SenderReliabilityMonitor` (`reliability_monitor/sender.ts`, +57 lines):

```typescript
import type { Peer, PeerId } from "@libp2p/interface";
import { CoreProtocolResult, PeerIdStr } from "@waku/interfaces";
import { Logger } from "@waku/utils";

const log = new Logger("sdk:sender:reliability_monitor");

const DEFAULT_MAX_ATTEMPTS_BEFORE_RENEWAL = 3;

export class SenderReliabilityMonitor {
  private attempts: Map<PeerIdStr, number> = new Map();
  private readonly maxAttemptsBeforeRenewal =
    DEFAULT_MAX_ATTEMPTS_BEFORE_RENEWAL;

  public constructor(private renewPeer: (peerId: PeerId) => Promise<Peer>) {}

  public async attemptRetriesOrRenew(
    peerId: PeerId,
    protocolSend: () => Promise<CoreProtocolResult>
  ): Promise<void> {
    const peerIdStr = peerId.toString();
    const currentAttempts = this.attempts.get(peerIdStr) || 0;
    this.attempts.set(peerIdStr, currentAttempts + 1);

    if (currentAttempts + 1 < this.maxAttemptsBeforeRenewal) {
      try {
        const result = await protocolSend();
        if (result.success) {
          log.info(`Successfully sent message after retry to ${peerIdStr}`);
          this.attempts.delete(peerIdStr);
        } else {
          log.error(
            `Failed to send message after retry to ${peerIdStr}: ${result.failure}`
          );
          await this.attemptRetriesOrRenew(peerId, protocolSend);
        }
      } catch (error) {
        log.error(
          `Failed to send message after retry to ${peerIdStr}: ${error}`
        );
        await this.attemptRetriesOrRenew(peerId, protocolSend);
      }
    } else {
      try {
        const newPeer = await this.renewPeer(peerId);
        log.info(
          `Renewed peer ${peerId.toString()} to ${newPeer.id.toString()}`
        );

        this.attempts.delete(peerIdStr);
        this.attempts.set(newPeer.id.toString(), 0);
        await protocolSend();
      } catch (error) {
        log.error(`Failed to renew peer ${peerId.toString()}: ${error}`);
      }
    }
  }
}
```

Review thread on the `await protocolSend();` after renewal:

> prev: https://github.com/waku-org/js-waku/pull/2130/files#r1758702483 — here we call …, so I think here we should do …. I believe this is the reason why you needed to implement ….

> We are indeed doing:
>
> ```typescript
> void this.reliabilityMonitor.attemptRetriesOrRenew(
>   failure.peerId,
>   () => this.protocol.send(encoder, message, peer)
> );
> ```

> Oh right, thanks for clarifying.
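To make the retry bound concrete, here is a minimal self-contained harness (simplified types, not the SDK's) that mirrors the counting logic above: a persistently failing `protocolSend` gets two retry sends, then one renewal plus a single post-renewal send, so the recursion terminates:

```typescript
type Result = { success: boolean };

const attempts = new Map<string, number>();
const MAX_ATTEMPTS_BEFORE_RENEWAL = 3;

async function attemptRetriesOrRenew(
  peerId: string,
  protocolSend: () => Promise<Result>,
  renewPeer: (peerId: string) => Promise<string>
): Promise<void> {
  const current = attempts.get(peerId) ?? 0;
  attempts.set(peerId, current + 1);

  if (current + 1 < MAX_ATTEMPTS_BEFORE_RENEWAL) {
    const result = await protocolSend();
    if (result.success) {
      attempts.delete(peerId);
    } else {
      // Bounded: the per-peer counter above grows on every pass.
      await attemptRetriesOrRenew(peerId, protocolSend, renewPeer);
    }
  } else {
    const newPeer = await renewPeer(peerId);
    attempts.delete(peerId);
    attempts.set(newPeer, 0);
    await protocolSend(); // single post-renewal attempt, no recursion
  }
}

// A peer that always fails: two retry sends, then renewal + one send.
let sends = 0;
void attemptRetriesOrRenew(
  "peer-a",
  async () => ((sends++), { success: false }),
  async () => "peer-b"
).then(() => console.log(sends)); // 3
```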
Closing review comment:

> Leaving this for some context; I will make an issue for it later so that we can discuss. In status-go, where retry is already implemented, they do things differently, and we will probably do something similar too: when a message is sent it is queued, then retried, and once it definitely fails, that is communicated to the consumer through something similar to an event API.
>
> cc @waku-org/js-waku
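A rough sketch of what that status-go-style flow could look like in TypeScript — all names here (`SendQueue`, `message-sent`, `message-failed`, the retry policy) are hypothetical, not an existing js-waku or status-go API:

```typescript
import { EventEmitter } from "events";

// Hypothetical event-based retry queue: messages are enqueued, retried
// a bounded number of times, and terminal failure is surfaced to the
// consumer as an event instead of being handled inline.
type SendFn = () => Promise<{ success: boolean }>;

class SendQueue extends EventEmitter {
  public constructor(private readonly maxRetries = 3) {
    super();
  }

  public enqueue(messageId: string, send: SendFn): void {
    void this.trySend(messageId, send, 0);
  }

  private async trySend(id: string, send: SendFn, attempt: number): Promise<void> {
    const result = await send().catch(() => ({ success: false }));
    if (result.success) {
      this.emit("message-sent", id);
    } else if (attempt + 1 < this.maxRetries) {
      await this.trySend(id, send, attempt + 1);
    } else {
      // Definitely failed: hand the decision back to the consumer.
      this.emit("message-failed", id);
    }
  }
}

// Usage: the application decides what a terminal failure means.
const queue = new SendQueue();
queue.on("message-failed", (id) => console.error(`give up on ${id}`));
queue.enqueue("msg-1", async () => ({ success: false }));
```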