Adds exponential backoff to re-spawning new streams for supposedly dead peers #483
Conversation
We need a mechanism to prune this new data structure once peers get disconnected; otherwise it could become a memory leak.
We also should reset the backoff timer after a while. We can track the time when the last request happened, and if we are more than some threshold past that time, we can reset the timer and just return the base timeout.
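A minimal sketch of the reset-after-a-while idea (pruning is discussed further down); the names mirror those in the snippets quoted below, the string key stands in for peer.ID, and the constants are illustrative rather than the final values:
package pubsub_sketch

import (
    "sync"
    "time"
)

const (
    MinBackoffDelay   = 100 * time.Millisecond
    MaxBackoffDelay   = 10 * time.Second
    TimeToLive        = time.Minute // reset threshold for stale entries
    BackoffMultiplier = 2
)

type backoffHistory struct {
    duration  time.Duration
    lastTried time.Time
}

type backoff struct {
    mu   sync.Mutex
    info map[string]*backoffHistory // keyed by peer ID
}

func (b *backoff) updateAndGet(id string) time.Duration {
    b.mu.Lock()
    defer b.mu.Unlock()

    h, ok := b.info[id]
    if !ok || time.Since(h.lastTried) > TimeToLive {
        // unknown peer, or the last attempt is older than the TTL:
        // start again from the base delay.
        h = &backoffHistory{duration: MinBackoffDelay}
    } else if h.duration < MaxBackoffDelay {
        h.duration = time.Duration(BackoffMultiplier) * h.duration
        if h.duration > MaxBackoffDelay {
            h.duration = MaxBackoffDelay
        }
    }

    h.lastTried = time.Now()
    b.info[id] = h
    return h.duration
}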
Also, I'm not sure if we need to handle canceling a pending reconnect if the peer becomes disconnected? Maybe before calling handleNewPeer we need to first check if the peer is still connected? But I'm not sure if this is needed.
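If that check turns out to be needed, it could be a small guard along these lines, reusing the Connectedness call that appears later in this PR (the surrounding reconnect path is assumed, not quoted):
// before re-spawning the stream, make sure the peer is still around
if p.host.Network().Connectedness(pid) != network.Connected {
    // the peer disconnected while the backoff was pending; drop the reconnect
    return
}
p.handleNewPeer(ctx, pid, outgoing)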
backoff.go (outdated):
if len(b.info) > b.ct {
    b.cleanup()
}
nit: if we have a lot of peers (more than b.ct), then we will end up running b.cleanup() on every call to this function (which requires looping through the entire map), even when there is nothing to clean up. Perhaps we need a better way to handle this?
For example, if we maintain an additional heap data structure sorted by expiration time, then at every call to the function we can just pop the expired entries from the heap one by one and remove them from the map. This would require us to track the explicit expiration time in each backoffHistory, rather than lastTried.
Will leave it to @vyzo to decide whether this is necessary.
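An illustrative sketch of that heap-based cleanup, assuming backoffHistory tracks an explicit expiresAt; all names here are hypothetical and the string key stands in for peer.ID:
package pubsub_sketch

import (
    "container/heap"
    "time"
)

type backoffHistory struct {
    duration  time.Duration
    expiresAt time.Time // explicit expiration instead of lastTried
}

type expiryEntry struct {
    id        string
    expiresAt time.Time
}

// expiryHeap is a min-heap ordered by expiration time.
type expiryHeap []expiryEntry

func (h expiryHeap) Len() int            { return len(h) }
func (h expiryHeap) Less(i, j int) bool  { return h[i].expiresAt.Before(h[j].expiresAt) }
func (h expiryHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *expiryHeap) Push(x interface{}) { *h = append(*h, x.(expiryEntry)) }
func (h *expiryHeap) Pop() interface{} {
    old := *h
    n := len(old)
    x := old[n-1]
    *h = old[:n-1]
    return x
}

// cleanup pops only the entries that have actually expired, so it does no work
// when nothing is expired, instead of scanning the whole map on every call.
func cleanup(h *expiryHeap, info map[string]*backoffHistory, now time.Time) {
    for h.Len() > 0 && (*h)[0].expiresAt.Before(now) {
        e := heap.Pop(h).(expiryEntry)
        cur, ok := info[e.id]
        if !ok || cur.expiresAt.After(now) {
            // already removed, or refreshed since this heap entry was pushed
            continue
        }
        delete(info, e.id)
    }
}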
Let's see if it is actually a problem in practice.
Actually it might be problematic; let's run a background goroutine that does it periodically (say once a minute).
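Such a background loop could look roughly like this, using a ticker as also suggested later in the review; the field names and the TTL-based cleanup are illustrative:
package pubsub_sketch

import (
    "context"
    "sync"
    "time"
)

type backoff struct {
    mu   sync.Mutex
    info map[string]time.Time // peer ID -> time of the last attempt
    ttl  time.Duration
}

// cleanup drops entries that have not been touched for longer than the TTL.
func (b *backoff) cleanup() {
    b.mu.Lock()
    defer b.mu.Unlock()
    for id, lastTried := range b.info {
        if time.Since(lastTried) > b.ttl {
            delete(b.info, id)
        }
    }
}

// cleanupLoop runs cleanup once a minute until the pubsub context is done.
func (b *backoff) cleanupLoop(ctx context.Context) {
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            b.cleanup()
        }
    }
}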
Added in a9f4edf
looks mostly good, just a couple of things:
- let's add some jitter in the backoff
- let's do the cleanup in a background goroutine.
backoff.go (outdated):
} else if h.duration < MinBackoffDelay {
    h.duration = MinBackoffDelay
} else if h.duration < MaxBackoffDelay {
    h.duration = time.Duration(BackoffMultiplier * h.duration)
can we add some jitter?
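For instance, something along these lines; the constant name and the millisecond-scale jitter are assumptions, not the merged code:
package pubsub_sketch

import (
    "math/rand"
    "time"
)

// MaxBackoffJitterCoff bounds the random jitter added to each delay.
const MaxBackoffJitterCoff = 100

// addJitter spreads reconnect attempts out so they don't fire in lockstep.
func addJitter(d time.Duration) time.Duration {
    return d + time.Duration(rand.Intn(MaxBackoffJitterCoff))*time.Millisecond
}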
Looks pretty good, but see the comment about giving up eventually.
I am concerned about the attackability of the code without this failsafe, if an attacker figured out a way to trigger it.
comm.go (outdated):
@@ -121,6 +122,16 @@ func (p *PubSub) handleNewPeer(ctx context.Context, pid peer.ID, outgoing <-chan
 }
 }
 
+func (p *PubSub) handleNewPeerWithBackoff(ctx context.Context, pid peer.ID, outgoing <-chan *RPC) {
+    delay := p.deadPeerBackoff.updateAndGet(pid)
I think we need to add a failure mode if we have backed off too much and simply give up; say we try up to 10 times, and then updateAndGet returns an error and we close the channel and forget the peer.
How does that sound?
maybe 10 is even too much; 3-4 attempts should be enough.
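A sketch of that failure mode with the cap set to 4, per the comment above; the names and error text are illustrative:
package pubsub_sketch

import (
    "fmt"
    "sync"
    "time"
)

const (
    MinBackoffDelay    = 100 * time.Millisecond
    MaxBackoffAttempts = 4 // give up after this many attempts
)

type backoffHistory struct {
    duration time.Duration
    attempts int
}

type backoff struct {
    mu   sync.Mutex
    info map[string]*backoffHistory
}

// updateAndGet returns the next delay, or an error once the peer has used up
// its allowed attempts, so the caller can close the channel and forget it.
func (b *backoff) updateAndGet(id string) (time.Duration, error) {
    b.mu.Lock()
    defer b.mu.Unlock()

    h, ok := b.info[id]
    if !ok {
        h = &backoffHistory{duration: MinBackoffDelay}
        b.info[id] = h
    }
    if h.attempts >= MaxBackoffAttempts {
        return 0, fmt.Errorf("peer %s has reached the maximum allowed backoff attempts", id)
    }
    h.attempts++

    delay := h.duration
    h.duration *= 2 // exponential growth; capping omitted for brevity
    return delay, nil
}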
Done
Looks pretty good, some cosmetic suggestions for cleaner code.
backoff.go (outdated):
defer b.mu.Unlock()

h, ok := b.info[id]
if !ok || time.Since(h.lastTried) > TimeToLive {
let's write this if/else sequence with a switch; it will be nicer.
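Rewritten as a switch, the logic from the quoted snippets could read roughly as follows (the final fields and cap handling may differ):
h, ok := b.info[id]
switch {
case !ok || time.Since(h.lastTried) > TimeToLive:
    // unknown peer or stale entry: start from the base delay
    h = &backoffHistory{duration: MinBackoffDelay}
case h.duration < MinBackoffDelay:
    h.duration = MinBackoffDelay
case h.duration < MaxBackoffDelay:
    h.duration = time.Duration(BackoffMultiplier * h.duration)
    if h.duration > MaxBackoffDelay {
        h.duration = MaxBackoffDelay
    }
}
// ... h.lastTried is then updated and h stored back into b.info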
backoff.go (outdated):
}
}

h.lastTried = time.Now()
let's get the time after checking the max attempts; that will avoid the gettimeofday call in that case.
Please read my reply to the comment below, as this part has changed.
comm.go (outdated):
delay, valid := p.deadPeerBackoff.updateAndGet(pid)
if !valid {
    return fmt.Errorf("backoff attempts to %s expired after reaching maximum allowed", pid)
}
let's return the error directly from updateAndGet instead of a bool; it makes for simpler code.
I decoupled updating the backoff history from checking the number of backoff attempts. updateAndGet now only increments the backoff attempts, without returning any boolean or error indicating that the maximum attempts have been reached. Instead, I introduced a separate peerExceededBackoffThreshold method that checks whether the backoff attempts of a peer exceed the defined threshold. The reasons for this change are:
- We want to close the channel and forget the peer if the backoff attempts go beyond a threshold. updateAndGet is called on a goroutine, while the messages channel and peers map live on a separate goroutine. If we let updateAndGet return a backoff-attempts-exceeded error and have the goroutine it runs on close the messages channel and forget the peer in the peers map, we get a race condition between the two goroutines, and the code becomes vulnerable to a panic when the deferred function executes, as it would try to close an already closed channel, i.e., the messages channel that was closed from updateAndGet.
- With this decoupling, we invoke peerExceededBackoffThreshold on the parent goroutine, so we give up on backing off right away if the peer exceeds the backoff threshold, without opening any channel or spawning any child goroutine for updateAndGet. Moreover, the peer is forgotten by the subsequent lines. Hence there is no exposure to race conditions, and fewer resources are allocated and deallocated.
Please let me know how that sounds.
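A sketch of the decoupled check, matching the usage visible in the pubsub.go diff below and reusing the backoff fields from the earlier sketches; the threshold constant is illustrative:
// peerExceededBackoffThreshold only reads the attempt counter, so the parent
// goroutine can decide to give up before opening a channel or spawning the
// delayed-reconnect goroutine.
func (b *backoff) peerExceededBackoffThreshold(id string) bool {
    b.mu.Lock()
    defer b.mu.Unlock()

    h, ok := b.info[id]
    if !ok {
        return false // no history yet, nothing exceeded
    }
    return h.attempts > MaxBackoffAttempts
}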
sounds better than what we had :)
pubsub.go (outdated):
@@ -683,19 +683,15 @@ func (p *PubSub) handleDeadPeers() {
 close(ch)
 
-if p.host.Network().Connectedness(pid) == network.Connected {
+if p.host.Network().Connectedness(pid) == network.Connected &&
+    !p.deadPeerBackoff.peerExceededBackoffThreshold(pid) {
we might want to (debug) log this.
yes, this looks better and we avoid spawning the goroutine if we are going to stop.
One minor point is that the check/get is not atomic; I would suggest taking the base delay before spawning the goroutine, with an error return if it is exceeded. Then check the error and log if it is exceeded (debug, no need to spam), and then spawn with the delay we got.
Does that sound reasonable?
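That ordering could look roughly like this in the parent path (handleDeadPeers); the debug logger call and the shape of the delayed-reconnect goroutine are assumptions:
// take the delay (and any "too many attempts" error) before spawning anything
delay, err := p.deadPeerBackoff.updateAndGet(pid)
if err != nil {
    log.Debugf("not reconnecting to %s: %s", pid, err)
    return
}

go func() {
    // wait out the backoff, then re-spawn the stream as usual
    time.Sleep(delay)
    p.handleNewPeer(ctx, pid, outgoing)
}()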
OK, I am pretty happy with this.
Let's use a ticker object in the bg loop and it is ready to merge.
LGTM. Thank you!
Addresses #482 by implementing an exponential backoff mechanism for re-spawning new streams to peers that were presumed dead.
closes #482