Adds exponential backoff to re-spawning new streams for supposedly dead peers #483
Conversation
We need a mechanism to prune this new data structure once peers get disconnected; otherwise it could become a memory leak.
We also should reset the backoff timer after a while. We can track the time when the last request happened, and if we are more than some threshold past that time, we can reset the timer and just return the base timeout.
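A minimal sketch of the reset-after-a-while idea (pruning is discussed further down); the names mirror those in the snippets quoted below, the string key stands in for peer.ID, and the constants are illustrative rather than the final values:
package pubsub_sketch

import (
    "sync"
    "time"
)

const (
    MinBackoffDelay   = 100 * time.Millisecond
    MaxBackoffDelay   = 10 * time.Second
    TimeToLive        = time.Minute // reset threshold for stale entries
    BackoffMultiplier = 2
)

type backoffHistory struct {
    duration  time.Duration
    lastTried time.Time
}

type backoff struct {
    mu   sync.Mutex
    info map[string]*backoffHistory // keyed by peer ID
}

func (b *backoff) updateAndGet(id string) time.Duration {
    b.mu.Lock()
    defer b.mu.Unlock()

    h, ok := b.info[id]
    if !ok || time.Since(h.lastTried) > TimeToLive {
        // unknown peer, or the last attempt is older than the TTL:
        // start again from the base delay.
        h = &backoffHistory{duration: MinBackoffDelay}
    } else if h.duration < MaxBackoffDelay {
        h.duration = time.Duration(BackoffMultiplier) * h.duration
        if h.duration > MaxBackoffDelay {
            h.duration = MaxBackoffDelay
        }
    }

    h.lastTried = time.Now()
    b.info[id] = h
    return h.duration
}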
Also, I'm not sure if we need to handle canceling a pending reconnect if the peer becomes disconnected? Maybe before calling handleNewPeer we need to first check if the peer is still connected? But I'm not sure if this is needed.
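If that check turns out to be needed, it could be a small guard along these lines, reusing the Connectedness call that appears later in this PR (the surrounding reconnect path is assumed, not quoted):
// before re-spawning the stream, make sure the peer is still around
if p.host.Network().Connectedness(pid) != network.Connected {
    // the peer disconnected while the backoff was pending; drop the reconnect
    return
}
p.handleNewPeer(ctx, pid, outgoing)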
backoff.go (outdated):
if len(b.info) > b.ct {
    b.cleanup()
}
nit: if we have a lot of peers (more than b.ct), then we will end up running b.cleanup() on every call to this function (which requires looping through the entire map), even when there is nothing to clean up. Perhaps we need a better way to handle this?
For example, if we maintain an additional heap data structure sorted by expiration time, then at every call to the function we can just pop the expired entries from the heap one by one and remove them from the map. This would require us to track the explicit expiration time in each backoffHistory, rather than lastTried.
Will leave it to @vyzo to decide whether this is necessary.
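An illustrative sketch of that heap-based cleanup, assuming backoffHistory tracks an explicit expiresAt; all names here are hypothetical and the string key stands in for peer.ID:
package pubsub_sketch

import (
    "container/heap"
    "time"
)

type backoffHistory struct {
    duration  time.Duration
    expiresAt time.Time // explicit expiration instead of lastTried
}

type expiryEntry struct {
    id        string
    expiresAt time.Time
}

// expiryHeap is a min-heap ordered by expiration time.
type expiryHeap []expiryEntry

func (h expiryHeap) Len() int            { return len(h) }
func (h expiryHeap) Less(i, j int) bool  { return h[i].expiresAt.Before(h[j].expiresAt) }
func (h expiryHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *expiryHeap) Push(x interface{}) { *h = append(*h, x.(expiryEntry)) }
func (h *expiryHeap) Pop() interface{} {
    old := *h
    n := len(old)
    x := old[n-1]
    *h = old[:n-1]
    return x
}

// cleanup pops only the entries that have actually expired, so it does no work
// when nothing is expired, instead of scanning the whole map on every call.
func cleanup(h *expiryHeap, info map[string]*backoffHistory, now time.Time) {
    for h.Len() > 0 && (*h)[0].expiresAt.Before(now) {
        e := heap.Pop(h).(expiryEntry)
        cur, ok := info[e.id]
        if !ok || cur.expiresAt.After(now) {
            // already removed, or refreshed since this heap entry was pushed
            continue
        }
        delete(info, e.id)
    }
}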
Let's see if it is actually a problem in practice.
Actually it might be problematic; let's run a background goroutine that does it periodically (say once a minute).
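Such a background loop could look roughly like this, using a ticker as also suggested later in the review; the field names and the TTL-based cleanup are illustrative:
package pubsub_sketch

import (
    "context"
    "sync"
    "time"
)

type backoff struct {
    mu   sync.Mutex
    info map[string]time.Time // peer ID -> time of the last attempt
    ttl  time.Duration
}

// cleanup drops entries that have not been touched for longer than the TTL.
func (b *backoff) cleanup() {
    b.mu.Lock()
    defer b.mu.Unlock()
    for id, lastTried := range b.info {
        if time.Since(lastTried) > b.ttl {
            delete(b.info, id)
        }
    }
}

// cleanupLoop runs cleanup once a minute until the pubsub context is done.
func (b *backoff) cleanupLoop(ctx context.Context) {
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            b.cleanup()
        }
    }
}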
Added in a9f4edf
looks mostly good, just a couple of things:
- let's add some jitter in the backoff
- let's do the cleanup in a background goroutine.
backoff.go (outdated):
} else if h.duration < MinBackoffDelay {
    h.duration = MinBackoffDelay
} else if h.duration < MaxBackoffDelay {
    h.duration = time.Duration(BackoffMultiplier * h.duration)
can we add some jitter?
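For instance, something along these lines; the constant name and the millisecond-scale jitter are assumptions, not the merged code:
package pubsub_sketch

import (
    "math/rand"
    "time"
)

// MaxBackoffJitterCoff bounds the random jitter added to each delay.
const MaxBackoffJitterCoff = 100

// addJitter spreads reconnect attempts out so they don't fire in lockstep.
func addJitter(d time.Duration) time.Duration {
    return d + time.Duration(rand.Intn(MaxBackoffJitterCoff))*time.Millisecond
}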
Looks pretty good, but see the comment about giving up eventually.
I am concerned about the attackability of the code without this failsafe, if an attacker figured out a way to trigger it.
comm.go (outdated):
@@ -121,6 +122,16 @@ func (p *PubSub) handleNewPeer(ctx context.Context, pid peer.ID, outgoing <-chan
 }
 }
 
+func (p *PubSub) handleNewPeerWithBackoff(ctx context.Context, pid peer.ID, outgoing <-chan *RPC) {
+    delay := p.deadPeerBackoff.updateAndGet(pid)
I think we need to add a failure mode if we have backed off too much and simply give up; say we try up to 10 times, and then updateAndGet returns an error and we close the channel and forget the peer.
How does that sound?
maybe 10 is even too much; 3-4 attempts should be enough.
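A sketch of that failure mode with the cap set to 4, per the comment above; the names and error text are illustrative:
package pubsub_sketch

import (
    "fmt"
    "sync"
    "time"
)

const (
    MinBackoffDelay    = 100 * time.Millisecond
    MaxBackoffAttempts = 4 // give up after this many attempts
)

type backoffHistory struct {
    duration time.Duration
    attempts int
}

type backoff struct {
    mu   sync.Mutex
    info map[string]*backoffHistory
}

// updateAndGet returns the next delay, or an error once the peer has used up
// its allowed attempts, so the caller can close the channel and forget it.
func (b *backoff) updateAndGet(id string) (time.Duration, error) {
    b.mu.Lock()
    defer b.mu.Unlock()

    h, ok := b.info[id]
    if !ok {
        h = &backoffHistory{duration: MinBackoffDelay}
        b.info[id] = h
    }
    if h.attempts >= MaxBackoffAttempts {
        return 0, fmt.Errorf("peer %s has reached the maximum allowed backoff attempts", id)
    }
    h.attempts++

    delay := h.duration
    h.duration *= 2 // exponential growth; capping omitted for brevity
    return delay, nil
}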
Done
Looks pretty good, some cosmetic suggestions for cleaner code.
backoff.go (outdated):
defer b.mu.Unlock()

h, ok := b.info[id]
if !ok || time.Since(h.lastTried) > TimeToLive {
let's write this if/else sequence with a switch; it will be nicer.
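Rewritten as a switch, the logic from the quoted snippets could read roughly as follows (the final fields and cap handling may differ):
h, ok := b.info[id]
switch {
case !ok || time.Since(h.lastTried) > TimeToLive:
    // unknown peer or stale entry: start from the base delay
    h = &backoffHistory{duration: MinBackoffDelay}
case h.duration < MinBackoffDelay:
    h.duration = MinBackoffDelay
case h.duration < MaxBackoffDelay:
    h.duration = time.Duration(BackoffMultiplier * h.duration)
    if h.duration > MaxBackoffDelay {
        h.duration = MaxBackoffDelay
    }
}
// ... h.lastTried is then updated and h stored back into b.info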
backoff.go (outdated):
}
}

h.lastTried = time.Now()
let's get the time after checking the max attempts; that will avoid the gettimeofday call in that case.
Please read my reply to the comment below, as this part has changed.
comm.go (outdated):
delay, valid := p.deadPeerBackoff.updateAndGet(pid)
if !valid {
    return fmt.Errorf("backoff attempts to %s expired after reaching maximum allowed", pid)
}
let's return the error directly from updateAndGet instead of a bool; it makes for simpler code.
I decoupled updating the backoff history from checking the number of backoff attempts. updateAndGet now only increments the backoff attempts, without returning any boolean or error indicating that the maximum attempts have been reached. Instead, I introduced a separate peerExceededBackoffThreshold method that checks whether the backoff attempts of a peer exceed the defined threshold. The reasons for this change are:
- We want to close the channel and forget the peer if the backoff attempts go beyond a threshold. updateAndGet is called on a goroutine, while the messages channel and peers map live on a separate goroutine. If we let updateAndGet return a backoff-attempts-exceeded error and have the goroutine it runs on close the messages channel and forget the peer in the peers map, we get a race condition between the two goroutines, and the code becomes vulnerable to a panic when the deferred function executes, as it would try to close an already closed channel, i.e., the messages channel that was closed from updateAndGet.
- With this decoupling, we invoke peerExceededBackoffThreshold on the parent goroutine, so we give up on backing off right away if the peer exceeds the backoff threshold, without opening any channel or spawning any child goroutine for updateAndGet. Moreover, the peer is forgotten by the subsequent lines. Hence there is no exposure to race conditions, and fewer resources are allocated and deallocated.
Please let me know how that sounds.
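A sketch of the decoupled check, matching the usage visible in the pubsub.go diff below and reusing the backoff fields from the earlier sketches; the threshold constant is illustrative:
// peerExceededBackoffThreshold only reads the attempt counter, so the parent
// goroutine can decide to give up before opening a channel or spawning the
// delayed-reconnect goroutine.
func (b *backoff) peerExceededBackoffThreshold(id string) bool {
    b.mu.Lock()
    defer b.mu.Unlock()

    h, ok := b.info[id]
    if !ok {
        return false // no history yet, nothing exceeded
    }
    return h.attempts > MaxBackoffAttempts
}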
sounds better than what we had :)
pubsub.go (outdated):
@@ -683,19 +683,15 @@ func (p *PubSub) handleDeadPeers() {
 close(ch)
 
-if p.host.Network().Connectedness(pid) == network.Connected {
+if p.host.Network().Connectedness(pid) == network.Connected &&
+    !p.deadPeerBackoff.peerExceededBackoffThreshold(pid) {
we might want to (debug) log this.
yes, this looks better and we avoid spawning the goroutine if we are going to stop.
One minor point is that the check/get is not atomic; I would suggest taking the base delay before spawning the goroutine, with an error return if it is exceeded. Then check the error and log if it is exceeded (debug, no need to spam), and then spawn with the delay we got.
Does that sound reasonable?
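That ordering could look roughly like this in the parent path (handleDeadPeers); the debug logger call and the shape of the delayed-reconnect goroutine are assumptions:
// take the delay (and any "too many attempts" error) before spawning anything
delay, err := p.deadPeerBackoff.updateAndGet(pid)
if err != nil {
    log.Debugf("not reconnecting to %s: %s", pid, err)
    return
}

go func() {
    // wait out the backoff, then re-spawn the stream as usual
    time.Sleep(delay)
    p.handleNewPeer(ctx, pid, outgoing)
}()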
OK, I am pretty happy with this.
Let's use a ticker object in the bg loop and it is ready to merge.
LGTM. Thank you!
Addresses #482 by implementing an exponential backoff mechanism for re-spawning new streams to peers that were presumed dead.
closes #482