Adds exponential backoff to re-spawning new streams for supposedly dead peers #483

Merged
merged 20 commits on May 30, 2022

Conversation

@yhassanzadeh13 (Contributor) commented Apr 6, 2022

Addresses #482 by implementing an exponential backoff mechanism for re-spawning new streams to peers that were presumed dead.

closes #482
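
For context, a minimal sketch of the kind of bookkeeping this adds, using names that appear in the diff excerpts quoted later in the thread (backoffHistory, MinBackoffDelay, MaxBackoffDelay, BackoffMultiplier, updateAndGet); the concrete values, fields, and update rule here are illustrative rather than the PR's final code.

package pubsub

import (
    "sync"
    "time"

    "github.com/libp2p/go-libp2p-core/peer"
)

// Illustrative values; the PR defines its own constants.
const (
    MinBackoffDelay   = 100 * time.Millisecond
    MaxBackoffDelay   = 10 * time.Second
    BackoffMultiplier = 2
)

// backoffHistory remembers how long we last waited for a given peer and when.
type backoffHistory struct {
    duration  time.Duration
    lastTried time.Time
}

// backoff tracks per-peer backoff state for respawning writer streams.
type backoff struct {
    mu   sync.Mutex
    info map[peer.ID]*backoffHistory
}

func newBackoff() *backoff {
    return &backoff{info: make(map[peer.ID]*backoffHistory)}
}

// updateAndGet grows the stored delay for the peer exponentially (bounded by
// MaxBackoffDelay) and returns the delay to wait before re-opening a stream.
func (b *backoff) updateAndGet(id peer.ID) time.Duration {
    b.mu.Lock()
    defer b.mu.Unlock()

    h, ok := b.info[id]
    if !ok {
        h = &backoffHistory{duration: MinBackoffDelay}
        b.info[id] = h
    } else if h.duration < MaxBackoffDelay {
        h.duration = time.Duration(BackoffMultiplier) * h.duration
        if h.duration > MaxBackoffDelay {
            h.duration = MaxBackoffDelay
        }
    }
    h.lastTried = time.Now()
    return h.duration
}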

@synzhu (Contributor) left a comment

We need some mechanism to prune this new data structure once peers get disconnected, otherwise it could become a memory leak.

We should also reset the backoff timer after a while. We can track the time when the last request happened, and if we are more than some threshold past that time, then we can reset the timer and just return the base timeout.
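
A sketch of that reset-after-a-while idea, assuming the TimeToLive constant and the lastTried field that appear in the backoff.go excerpts further down; the concrete value is illustrative.

const TimeToLive = 10 * time.Minute // illustrative value

// Inside updateAndGet: if we have never seen this peer, or the last attempt is
// older than TimeToLive, start again from the base delay instead of growing it.
h, ok := b.info[id]
if !ok || time.Since(h.lastTried) > TimeToLive {
    h = &backoffHistory{duration: MinBackoffDelay}
    b.info[id] = h
}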

@synzhu (Contributor) left a comment

Also, I'm not sure whether we need to handle canceling a pending reconnect if the peer becomes disconnected.

Maybe we need to check whether the peer is still connected before calling handleNewPeer? I'm not sure this is needed, though.

@yhassanzadeh13 marked this pull request as ready for review April 6, 2022 22:28
backoff.go Outdated
Comment on lines 58 to 60
if len(b.info) > b.ct {
    b.cleanup()
}
@synzhu (Contributor) commented Apr 6, 2022

nit: if we have a lot of peers (more than b.ct), then we will end up running b.cleanup() on every call to this function (which requires looping through the entire map), even when there is nothing to clean up. Perhaps we need a better way to handle this?

For example, if we maintain an additional heap data structure sorted by expiration time, then at every call to the function we can just pop the expired entries from the heap one by one and remove them from the map. This would require us to track the explicit expiration time in each backoffHistory, rather than lastTried.

Will leave to @vyzo to decide whether this is necessary.
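
For reference, a self-contained sketch of the heap-based alternative floated here: entries ordered by an explicit expiration time, with expired ones popped lazily on each call. This is not what the PR ended up doing (a periodic cleanup goroutine was chosen instead, see below); all names are illustrative.

package main

import (
    "container/heap"
    "time"
)

// expiryEntry pairs a peer ID (a plain string here) with an explicit expiration time.
type expiryEntry struct {
    id     string
    expiry time.Time
}

// expiryHeap is a min-heap ordered by expiration time.
type expiryHeap []expiryEntry

func (h expiryHeap) Len() int            { return len(h) }
func (h expiryHeap) Less(i, j int) bool  { return h[i].expiry.Before(h[j].expiry) }
func (h expiryHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *expiryHeap) Push(x interface{}) { *h = append(*h, x.(expiryEntry)) }
func (h *expiryHeap) Pop() interface{} {
    old := *h
    n := len(old)
    x := old[n-1]
    *h = old[:n-1]
    return x
}

// popExpired lazily removes every expired entry from both the heap and the map,
// so each call does work proportional only to the number of expired entries.
func popExpired(h *expiryHeap, info map[string]time.Duration, now time.Time) {
    for h.Len() > 0 && (*h)[0].expiry.Before(now) {
        e := heap.Pop(h).(expiryEntry)
        delete(info, e.id)
    }
}

func main() {
    h := &expiryHeap{}
    heap.Init(h)
    info := map[string]time.Duration{"peerA": time.Second}
    heap.Push(h, expiryEntry{id: "peerA", expiry: time.Now().Add(-time.Minute)})
    popExpired(h, info, time.Now()) // "peerA" is expired and gets pruned
}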

@vyzo (Collaborator):

Let's see if it is actually a problem in practice.

@vyzo (Collaborator):

Actually, it might be problematic; let's run a background goroutine that does the cleanup periodically (say, once a minute).
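
A sketch of that periodic cleanup (and of the ticker object requested near the end of the review): a background goroutine that sweeps stale entries once a minute until the pubsub context is cancelled. The interval and the surrounding details are illustrative; only the cleanup name comes from the excerpt above.

// cleanupLoop periodically prunes stale backoff entries until ctx is cancelled.
func (b *backoff) cleanupLoop(ctx context.Context) {
    ticker := time.NewTicker(time.Minute) // interval illustrative
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            b.cleanup() // drops entries whose lastTried is older than TimeToLive
        case <-ctx.Done():
            return
        }
    }
}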

@yhassanzadeh13 (Contributor, author):

Added in a9f4edf

@vyzo (Collaborator) left a comment

Looks mostly good, just a couple of things:

  • let's add some jitter to the backoff
  • let's do the cleanup in a background goroutine.

backoff.go Outdated
} else if h.duration < MinBackoffDelay {
    h.duration = MinBackoffDelay
} else if h.duration < MaxBackoffDelay {
    h.duration = time.Duration(BackoffMultiplier * h.duration)
@vyzo (Collaborator):

can we add some jitter?
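
One hedged way to do it, as a standalone helper; the jitter bound and the use of math/rand are illustrative rather than the PR's final choice.

// addJitter returns d plus a uniformly random offset of up to maxJitterMs
// milliseconds (using math/rand), so peers backing off at the same moment do
// not all retry in lockstep.
func addJitter(d time.Duration, maxJitterMs int) time.Duration {
    return d + time.Duration(rand.Intn(maxJitterMs))*time.Millisecond
}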

@yhassanzadeh13 requested a review from vyzo May 9, 2022 18:24
@vyzo (Collaborator) left a comment

Looks pretty good, but see comment about giving up eventually.

I am concerned about the attackability of the code without this failsafe, should an attacker figure out a way to trigger it.

comm.go Outdated
@@ -121,6 +122,16 @@ func (p *PubSub) handleNewPeer(ctx context.Context, pid peer.ID, outgoing <-chan
}
}

func (p *PubSub) handleNewPeerWithBackoff(ctx context.Context, pid peer.ID, outgoing <-chan *RPC) {
delay := p.deadPeerBackoff.updateAndGet(pid)
@vyzo (Collaborator):

I think we need to add a failure mode where, if we have backed off too much, we simply give up; say we try up to 10 times, then updateAndGet returns an error and we close the channel and forget the peer.

How does that sound?
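
A sketch of that failure mode, extending the backoff bookkeeping with an attempt counter; MaxBackoffAttempts and the attempts field are illustrative names (the thread below eventually settles on a slightly different shape).

const MaxBackoffAttempts = 4 // illustrative cap

// Inside updateAndGet, assuming an attempts counter is added to backoffHistory:
if h.attempts >= MaxBackoffAttempts {
    return 0, fmt.Errorf("peer %s exceeded %d backoff attempts, giving up", id, MaxBackoffAttempts)
}
h.attempts++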

@vyzo (Collaborator):

maybe 10 is even too much, 3-4 attempts should be enough.

@yhassanzadeh13 (Contributor, author):

Done

@vyzo (Collaborator) left a comment

Looks pretty good, some cosmetic suggestions for cleaner code.

backoff.go Outdated
defer b.mu.Unlock()

h, ok := b.info[id]
if !ok || time.Since(h.lastTried) > TimeToLive {
@vyzo (Collaborator):

let's write this if/else sequence with a switch, will be nicer.
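
The quoted chain rewritten as a switch, folding in the TTL case shown in the other backoff.go excerpt above; the body of the first case is an illustrative guess.

h, ok := b.info[id]
switch {
case !ok || time.Since(h.lastTried) > TimeToLive:
    // first attempt for this peer, or the previous one is stale: start over
    h = &backoffHistory{duration: 0}
case h.duration < MinBackoffDelay:
    h.duration = MinBackoffDelay
case h.duration < MaxBackoffDelay:
    h.duration = time.Duration(BackoffMultiplier * h.duration)
}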

backoff.go Outdated
}
}

h.lastTried = time.Now()
@vyzo (Collaborator):

let's get the time after checking the max attempts; that will avoid the gettimeofday call in that case.

@yhassanzadeh13 (Contributor, author) commented May 26, 2022

Please read my reply to the comment below, as this part has changed.

comm.go Outdated
Comment on lines 127 to 130
delay, valid := p.deadPeerBackoff.updateAndGet(pid)
if !valid {
    return fmt.Errorf("backoff attempts to %s expired after reaching maximum allowed", pid)
}
@vyzo (Collaborator):

let's return the error directly from updateAndGet instead of a bool; it makes for simpler code.

@yhassanzadeh13 (Contributor, author):

I decoupled updating the backoff history from checking the number of backoff attempts. updateAndGet now only increments the backoff attempts, without returning a boolean or error indicating that the maximum has been reached. Instead, I introduced a separate peerExceededBackoffThreshold method that checks whether a peer's backoff attempts exceed the defined threshold. The reasons for this change are:

  1. We want to close the channel and forget the peer if its backoff attempts go beyond the threshold. updateAndGet runs on one goroutine, while the messages channel and the peers map live on another. If updateAndGet returned a backoff-attempts-exceeded error and its goroutine then tried to close the messages channel and remove the peer from the peers map, we would have a race condition between the two goroutines; the code would also be vulnerable to a panic when the defer function runs, since it would try to close an already closed channel, i.e., the messages channel that was closed on the updateAndGet side.

  2. With this decoupling, we invoke peerExceededBackoffThreshold on the parent goroutine, so if a peer exceeds the backoff threshold we give up right away, without opening a channel or spawning a child goroutine for updateAndGet. Moreover, the peer is forgotten by the subsequent lines. Hence there is no exposure to race conditions, and fewer resource allocations and deallocations.

Please let me know how that sounds.
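
For reference, a sketch of the decoupled flow described above, following the pubsub.go diff quoted further down in this thread; the channel size, the use of p.ctx, and the surrounding handleDeadPeers details are illustrative assumptions, not the merged code.

// Inside handleDeadPeers, after closing the peer's old channel (sketch):
if p.host.Network().Connectedness(pid) == network.Connected &&
    !p.deadPeerBackoff.peerExceededBackoffThreshold(pid) {
    // still connected and not past the backoff threshold: respawn the writer
    messages := make(chan *RPC, 32) // queue size illustrative
    p.peers[pid] = messages
    go p.handleNewPeerWithBackoff(p.ctx, pid, messages)
}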

@vyzo (Collaborator):

sounds better than what we had :)

@yhassanzadeh13 requested a review from vyzo May 26, 2022 20:08
pubsub.go Outdated
@@ -683,19 +683,15 @@ func (p *PubSub) handleDeadPeers() {

close(ch)

-	if p.host.Network().Connectedness(pid) == network.Connected {
+	if p.host.Network().Connectedness(pid) == network.Connected &&
+		!p.deadPeerBackoff.peerExceededBackoffThreshold(pid) {
@vyzo (Collaborator):

we might want to (debug) log this.

@vyzo (Collaborator) left a comment

Yes, this looks better, and we avoid spawning the goroutine if we are going to stop.

One minor point is that the check/get is not atomic; I would suggest taking the base delay before spawning the goroutine, with an error return if it is exceeded.
Then check the error and log if it is exceeded (debug, no need to spam), and then spawn with the delay we got.

Does that sound reasonable?
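
A hedged sketch of what that could look like at the call site; the error-returning updateAndGet signature, the extra delay argument to handleNewPeerWithBackoff, and the logging call are assumptions for illustration, not necessarily the merged code.

// Sketch: read the delay (or the give-up error) once, before spawning anything.
delay, err := p.deadPeerBackoff.updateAndGet(pid)
if err != nil {
    // exceeded the backoff budget: log at debug level and forget the peer
    log.Debugf("backoff: not respawning writer for %s: %s", pid, err)
    continue
}
messages := make(chan *RPC, 32) // queue size illustrative
p.peers[pid] = messages
go p.handleNewPeerWithBackoff(p.ctx, pid, delay, messages)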

@yhassanzadeh13 requested a review from vyzo May 27, 2022 23:35
@vyzo (Collaborator) left a comment

OK, I am pretty happy with this.
Let's use a ticker object in the background loop, and then it is ready to merge.

@vyzo (Collaborator) left a comment

LGTM. Thank you!


Successfully merging this pull request may close these issues.

Backoff / retry for writer stream respawn
3 participants