
Remove entries from wantlists when their related requests are cancelled #3182

Merged: 2 commits merged into version/0.4.3-rc4 on Sep 4, 2016

Conversation

whyrusleeping (Member)

This PR has been a long time coming. A recent race condition fix has led to us actually rebroadcasting our wantlist every ten seconds as we claim to do (instead of just pretending to). As a result, the gateways have started catching on fire. The reason is that since we never remove elements from the wantlist, we now rebroadcast thousands of wantlist entries every ten seconds. That's kinda bad.

The solution is to reference-count wantlist entries and remove them when no active requests reference them any longer (a rough sketch of the idea follows the list below).

A few notable sub-fixes happened to make this work:

  • Wantlist entries are now stored as pointers; the reference count wasn't actually being updated before.
  • GetBlocks now tracks which blocks it has successfully fetched so it can cancel the rest as needed.
  • Removed contexts from wantlist entries; they weren't useful.
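
For illustration, here is a rough sketch of the reference-counting approach; the types and method bodies are simplified stand-ins, not the actual go-ipfs wantlist package:

// Simplified sketch of reference-counted wantlist entries.
// Entries are stored as pointers so the refcount is shared, and an entry
// is dropped once no active request references it (the real wantlist also
// needs locking, omitted here).
type Entry struct {
  Key      key.Key
  Priority int
  RefCnt   int
}

type Wantlist struct {
  set map[key.Key]*Entry
}

func (w *Wantlist) Add(k key.Key, priority int) {
  if e, ok := w.set[k]; ok {
    e.RefCnt++
    return
  }
  w.set[k] = &Entry{Key: k, Priority: priority, RefCnt: 1}
}

func (w *Wantlist) Remove(k key.Key) {
  e, ok := w.set[k]
  if !ok {
    return
  }
  e.RefCnt--
  if e.RefCnt <= 0 {
    delete(w.set, k)
  }
}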

License: MIT
Signed-off-by: Jeromy <why@ipfs.io>
@whyrusleeping added the status/in-progress and need/review labels and removed the status/in-progress label on Sep 2, 2016
@whyrusleeping added this to the ipfs-0.4.3-rc4 milestone on Sep 3, 2016
@@ -75,6 +75,7 @@ func (pm *WantManager) WantBlocks(ctx context.Context, ks []key.Key) {
}

func (pm *WantManager) CancelWants(ks []key.Key) {
log.Infof("cancel wants: %s", ks)

Member:

It probably shouldn't be info.

whyrusleeping (Member Author):

Ehh, I think it's best to keep it at Info for now. It's one of the few things that's accurately reporting what's going on in this ipfs node (hey, we're requesting a block! oh, now we're not looking for that block anymore).

@Kubuxu (Member) commented Sep 3, 2016

SGTM; the 2nd commit above is failing tests, so it needs to be squashed or fixed.

License: MIT
Signed-off-by: Jeromy <why@ipfs.io>
@whyrusleeping (Member Author)

Squashed.

@whyrusleeping merged commit d6092eb into version/0.4.3-rc4 on Sep 4, 2016
@whyrusleeping deleted the fix/bitswap/want-cancel branch on September 4, 2016 14:05
@jbenet (Member) commented Sep 5, 2016

> actually rebroadcasting our wantlist every ten seconds as we claim to do

A long time ago we agreed not to send wantlists so often, and instead to endeavor to send correct diffs. Sending a full wantlist should be a rare event: once on connect, and then maybe every 10m to be safe. But again, diffs alone should work, because they travel over a reliable channel; if a diff is lost, the reliable channel would break, triggering a reconnection and thus a fresh wantlist. I think re-sending wantlists should in general not be necessary; it's only a safety measure.

A good metric to collect is how often a received wantlist differs from our view of it (i.e. when it was actually useful to have sent a wantlist update).
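
To make the diff idea concrete, here is a minimal sketch of what a diff-style wantlist update might carry; the field names are assumptions loosely modeled on bitswap's message, not the exact wire format:

// Hypothetical, simplified shape of a diff-oriented wantlist update.
type wantlistEntry struct {
  Key      key.Key
  Priority int
  Cancel   bool // "remove this key from your view of my wantlist"
}

type wantlistUpdate struct {
  Entries []wantlistEntry
  Full    bool // set only for the rare full resend, e.g. on (re)connect
}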

> we never remove elements from the wantlist

What do you mean by that? We should, and I thought we DID remove elements from the wantlist after they're retrieved or cancelled.

@Kubuxu (Member) commented Sep 5, 2016

> What do you mean by that? We should, and I thought we DID remove elements from the wantlist after they're retrieved or cancelled.

We weren't removing elements on context cancel.
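
In other words, the fix makes the request path release its wants when its context is cancelled. A minimal sketch of that pattern, using the WantBlocks/CancelWants signatures from the diff above but with an otherwise illustrative body (this is not the actual GetBlocks code):

// Illustrative only: track which keys have arrived and cancel the rest
// if the request's context is cancelled.
func fetch(ctx context.Context, pm *WantManager, ks []key.Key, received <-chan key.Key) {
  pm.WantBlocks(ctx, ks)

  remaining := make(map[key.Key]struct{}, len(ks))
  for _, k := range ks {
    remaining[k] = struct{}{}
  }

  for len(remaining) > 0 {
    select {
    case k := <-received:
      delete(remaining, k) // got this one
    case <-ctx.Done():
      // request cancelled: explicitly drop the wants we still hold
      left := make([]key.Key, 0, len(remaining))
      for k := range remaining {
        left = append(left, k)
      }
      pm.CancelWants(left)
      return
    }
  }
}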

@jbenet (Member) commented Sep 5, 2016

@Kubuxu thanks

@whyrusleeping (Member Author)

@jbenet My bad, this isn't actually sending wantlists over again... It used to be that code, but now its only purpose is to periodically search for new providers for content we want. So every ten seconds, for each key in our wantlist, we call FindProviders and connect to each of the providers found (which causes us to sync up wantlists).

Some options here:

  • Do this less frequently (every 30 seconds would likely help significantly)
  • Don't search for the entire wantlist every time; maybe select a subset each time? (a sketch of this is below)
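
A minimal sketch of the subset idea; randomSubset is a hypothetical helper (using math/rand), not existing code:

// Hypothetical helper: pick at most n random wantlist keys to search this tick.
func randomSubset(ks []key.Key, n int) []key.Key {
  if len(ks) <= n {
    return ks
  }
  out := make([]key.Key, len(ks))
  copy(out, ks)
  // partial Fisher-Yates shuffle: randomize the first n positions, keep those
  for i := 0; i < n; i++ {
    j := i + rand.Intn(len(out)-i)
    out[i], out[j] = out[j], out[i]
  }
  return out[:n]
}

Each tick would then call FindProviders only for randomSubset(wantlistKeys, n) for some small n.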

@daviddias (Member)

So, since we already look for the root hash of every request when the wanted blocks are first added, that should already give us a fair amount of DHT walking to connect to peers that will have the blocks (unless we are unlucky: we're already connected to the best peer, and that peer only has a really small subset of the blocks).

Doing something like: every 30 seconds, send a query for a random block from the wantlist. That would reduce the bandwidth dramatically; the problem is that we can't tell how much it will hurt our ability to search, but it is a good experiment. A well-peered node should not have these issues, so it might be good to use a smaller interval for short-lived nodes and then increase the interval as the node gains more and more connections (i.e. from 10 seconds up to 60 seconds). The outcome will change a lot once the network grows bigger; right now it is mostly easy to get well peered (except for the symmetric NAT traversal cases).
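
A sketch of the interval scaling described above; the numbers and the connection-count threshold are arbitrary placeholders, and nothing here exists in the codebase:

// Hypothetical: widen the provider-search interval as the node becomes
// better peered, from 10s for a fresh node up to 60s once well connected.
func providerSearchInterval(numConns int) time.Duration {
  const (
    minInterval = 10 * time.Second
    maxInterval = 60 * time.Second
    wellPeered  = 100 // connection count treated as "well peered" (arbitrary)
  )
  if numConns >= wellPeered {
    return maxInterval
  }
  // linear ramp between the two bounds
  return minInterval + (maxInterval-minInterval)*time.Duration(numConns)/wellPeered
}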

@jbenet (Member) commented Sep 9, 2016

> Do this less frequently (every 30 seconds would likely help significantly)

Yes, but the downside is also longer downloads. The "urgency of a request" is not currently captured, but could inform these decisions.

> Don't search for the entire wantlist every time; maybe select a subset each time?

Yes, +1 to this. Another option is to use "trickle" (exponential) backoffs; they work really well in networking in general:

> When nodes agree, they slow their communication rate exponentially, such that in a stable state nodes send at most a few packets per hour. Instead of flooding a network with packets, the algorithm controls the send rate so each node hears a small trickle of packets, just enough to stay consistent.

Trickle is used differently -- more for broadcasts to regain consistency in routing protocols. But here, "agree" would mean "I searched the DHT but found nothing". Each successive query like that could have an exponential backoff, with a maximum of, say, 30 min or 1 hr. That way, things that are "not around right now" don't place a strain on the network.

The way to implement this here is:

// Sketch: trickle-style exponential backoff for provider searches.
// Assumes a getProviders func in the surrounding bitswap code, plus the
// standard "time" package and the go-cid package.

type trickleCfg struct {
  Min time.Duration
  Max time.Duration
}

var bsGetProvsTrickle = trickleCfg{Min: time.Second * 5, Max: time.Hour}

// keep one of these per key in the wantlist
type trickleTimer struct {
  interval time.Duration // initialized to bsGetProvsTrickle.Min
  next     time.Time     // initialized to now + interval
}

func (t *trickleTimer) Reset(cfg trickleCfg) {
  t.interval = cfg.Min
  t.next = time.Now().Add(cfg.Min)
}

// algorithm for updating: on each tick, search only the keys whose timers
// have fired, and back each fired timer off exponentially (trickle).
func searchDueKeys(keyTimers map[cid.Cid]*trickleTimer) {
  now := time.Now()
  var toSearch []cid.Cid
  for c, timer := range keyTimers {
    if now.After(timer.next) {
      toSearch = append(toSearch, c)
      timer.next = timer.next.Add(timer.interval)
      timer.interval *= 2 // exponential backoff (trickle)
      if timer.interval > bsGetProvsTrickle.Max {
        timer.interval = bsGetProvsTrickle.Max
      }
    }
  }
  go getProviders(toSearch)
}

// IMPORTANT:
// - if a new request comes in, search for providers then AND
//   call timer.Reset(bsGetProvsTrickle)
// - don't need to bring it back down ever, because if we FIND
//   or cancel the key, then the timer goes away entirely.
