FindProviders needs a timeout #400

whyrusleeping · 2014-12-02T01:29:06Z

In some 'real world' testing I've been finding a large number of goroutines stuck waiting for 'FindProviders' requests to come back. Looking into this, there is no timeout set on those requests (which are sent off into goroutines), so they wait forever for a response from the service. This causes an ugly memory leak of goroutines. What do you guys think is an appropriate way to handle this?

btc · 2014-12-02T01:31:00Z

Would passing the context down into the function work? Or does the callstack hit a wall where a message is passed over a channel?

whyrusleeping · 2014-12-02T01:37:15Z

Well, we do have a context passed down, but the passed-in context is not guaranteed to ever be cancelled.

btc · 2014-12-02T01:45:09Z

The language won't prevent you from extending FindProviders with its own timeout, but early termination is not really its call to make. That's the caller's responsibility.

There are three types of contexts:

1. request-scoped contexts that only exist during a single request
2. system-scoped contexts that persist throughout the lifetime of the program (async workers respect these)
3. orphaned TODO/Background contexts. These have no rightful place in the system except for the top of main.
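To make the first two scopes concrete, here is a minimal sketch using the standard library `context` package (an assumption; the code under discussion may import a different context package). The function name `scopes` and its return values are hypothetical and exist only for illustration. It shows that ending a request leaves the system-scoped context alive, while shutting down the system cancels any request derived from it.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// scopes is a hypothetical helper. It builds a system-scoped context and
// request-scoped children, then reports how cancellation moves between them.
func scopes() (reqDone, sysAlive, childCancelled bool) {
	// System scope: created once near the top of main, cancelled at shutdown.
	system, shutdown := context.WithCancel(context.Background())
	defer shutdown()

	// Request scope: derived from the system scope and bounded per request.
	request, finish := context.WithTimeout(system, time.Minute)
	finish() // the request ends; only the request context is cancelled
	reqDone = request.Err() != nil
	sysAlive = system.Err() == nil

	// A second request is still open when the system shuts down.
	request2, finish2 := context.WithTimeout(system, time.Minute)
	defer finish2()
	shutdown()
	<-request2.Done() // cancellation propagates from parent to child
	childCancelled = request2.Err() == context.Canceled
	return
}

func main() {
	fmt.Println(scopes())
}
```

A context of the third kind would be a bare `context.Background()` or `context.TODO()` handed straight to a worker: nothing upstream can ever cancel it.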

Are we seeing instances of 2 and 3 along the callpath to this subsystem where they shouldn't be?

whyrusleeping · 2014-12-02T03:21:33Z

yeah, we are seeing 2 and 3 in the FindProviders calls

btc · 2014-12-02T09:04:00Z

> yeah, we are seeing 2 and 3 in the FindProviders calls

I didn't see 2 or 3 in the FindProviders/GetProviders calls. Maybe I'm misunderstanding what you're seeing. In any case, I took a look at the code and fixed a couple of things that stood out.

1. Functions called by FindProviderAsync were not respecting the context. These two commits address this by ensuring all functions inside of FPAsync respect it to the best of their abilities:

2. GetProviders was implemented using the promise pattern but with an unbuffered response channel. This causes the sender to hang when receivers aren't around to pick up the value. When receivers respect a context, this is a problem. A subtle detail, but one that is important for avoiding leaked goroutines. Addressed here:

dec6dad
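The hang described in point 2 can be reproduced in isolation. The following is a minimal sketch, not the code from the commit: `getProvReq`, `serveOne`, and `managerFinishes` are hypothetical names, and the standard library `context` package is assumed. The caller gives up before the reply exists, and the sketch reports whether the sending side can still complete its send.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// getProvReq is a hypothetical request type; the manager answers on resp.
type getProvReq struct {
	resp chan []string
}

// serveOne is a hypothetical manager that handles a single request. It waits
// for release so that the reply is only sent after the caller has given up.
func serveOne(reqs <-chan *getProvReq, release <-chan struct{}, done chan<- struct{}) {
	gp := <-reqs
	<-release
	gp.resp <- []string{"peerA"} // blocks forever if resp is unbuffered and unread
	close(done)
}

// managerFinishes abandons one request and reports whether the manager was
// still able to hand off its reply.
func managerFinishes(bufSize int) bool {
	reqs := make(chan *getProvReq)
	release := make(chan struct{})
	done := make(chan struct{})
	go serveOne(reqs, release, done)

	ctx, cancel := context.WithCancel(context.Background())
	gp := &getProvReq{resp: make(chan []string, bufSize)}
	reqs <- gp
	cancel() // the caller gives up before any reply exists

	select {
	case <-gp.resp:
	case <-ctx.Done(): // taken here: no reply has been sent yet
	}
	close(release)

	select {
	case <-done:
		return true
	case <-time.After(200 * time.Millisecond):
		return false // the manager goroutine is stuck on its send
	}
}

func main() {
	fmt.Println("unbuffered resp, manager finished:", managerFinishes(0)) // false
	fmt.Println("buffered resp, manager finished:", managerFinishes(1))   // true
}
```

With a buffer of one, the reply is parked in the channel and garbage collected along with the request, so the sender never depends on the receiver still being around.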

jbenet · 2014-12-05T16:44:07Z

> The language won't prevent you from extending FindProviders with its own timeout, but early termination is not really its call to make. That's the caller's responsibility.

Agreed fully. This is particularly important when wanting to work in high-latency-low-bandwidth networks (mobile everywhere outside the US).
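A minimal sketch of what "the caller's responsibility" could look like, assuming the standard library `context` package. `findProviders`, `lookupWithDeadline`, and `demo` are hypothetical stand-ins, not the real API: the lookup only promises to stop when its context is done, and each caller picks a deadline that suits its own network.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// findProviders is a hypothetical stand-in for a provider lookup. The reply
// arrives after delay; the function gives up as soon as ctx is done.
func findProviders(ctx context.Context, delay time.Duration) ([]string, error) {
	t := time.NewTimer(delay)
	defer t.Stop()
	select {
	case <-t.C:
		return []string{"peerA"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// lookupWithDeadline shows the caller, not findProviders, choosing the limit.
func lookupWithDeadline(parent context.Context, timeout, delay time.Duration) ([]string, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel() // release the timer even when the lookup returns early
	return findProviders(ctx, delay)
}

// demo runs a fast lookup and an unresponsive one. It returns the number of
// providers found in the fast case and whether the slow case hit the deadline.
func demo() (int, bool) {
	provs, _ := lookupWithDeadline(context.Background(), time.Second, 10*time.Millisecond)
	_, err := lookupWithDeadline(context.Background(), 50*time.Millisecond, time.Hour)
	return len(provs), errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(demo())
}
```

A caller on a slow mobile link could pass a much longer timeout to the same function without `findProviders` changing at all.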

> GetProviders was implemented using the promise pattern but with an unbuffered response channel.

Shouldn't this really be:

select {
case pm.getprovs <- gp:
  select {
  case resp := <-gp.resp:
    return resp
  case <-ctx.Done():
    return nil
  }

case <-ctx.Done():
  return nil
}

Or to make it more readable:

// send out async get providers request
select {
case <-ctx.Done():
  return nil
case pm.getprovs <- gp:
}

// receive a request response
select {
case <-ctx.Done():
  return nil

case resp := <-gp.resp:
  return resp
}

btc · 2014-12-05T22:32:01Z

> Shouldn't this really be

resp is buffered and is locally/stack-allocated with this function as the only sender, so it acts like a dropbox. sender will never block. that's the crux of the change. i wish i knew how to make this clearer. perhaps through some conventional name? perhaps with another comment?

jbenet · 2014-12-05T22:53:19Z

> resp is buffered and is locally/stack-allocated with this function as the only sender

no this function is the only reader. whatever is at the other side of pm.getprovs decides when to send into this channel. If it never sends anything, this goroutine blocks forever.

btc · 2014-12-06T01:00:08Z

> no this function is the only reader. whatever is at the other side of pm.getprovs decides when to send into this channel. If it never sends anything, this goroutine blocks forever.

my mistake. will fix using this one:

// send out async get providers request
select {
case <-ctx.Done():
  return nil
case pm.getprovs <- gp:
}

// receive a request response
select {
case <-ctx.Done():
  return nil

case resp := <-gp.resp:
  return resp
}
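For reference, a self-contained sketch of that two-step select in context. The names `pm.getprovs`, `gp`, and `gp.resp` come from the snippet above; the `getProv` and `providerManager` types, the `run` loop, and `demo` are assumptions made up for this example, and the standard library `context` package is assumed. One manager replies and one accepts requests but never answers, which is the case that would otherwise block the reader forever.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// getProv and providerManager are hypothetical shapes for this sketch.
type getProv struct {
	key  string
	resp chan []string
}

type providerManager struct {
	getprovs chan *getProv
}

// GetProviders guards both the send and the receive with ctx.
func (pm *providerManager) GetProviders(ctx context.Context, key string) []string {
	gp := &getProv{key: key, resp: make(chan []string, 1)} // buffered: the manager never blocks

	// send out async get providers request
	select {
	case <-ctx.Done():
		return nil
	case pm.getprovs <- gp:
	}

	// receive a request response
	select {
	case <-ctx.Done():
		return nil
	case resp := <-gp.resp:
		return resp
	}
}

// run is a hypothetical manager loop. With reply set to false it accepts
// requests but never answers them.
func (pm *providerManager) run(ctx context.Context, reply bool) {
	for {
		select {
		case <-ctx.Done():
			return
		case gp := <-pm.getprovs:
			if reply {
				gp.resp <- []string{"peerA"}
			}
		}
	}
}

// demo returns the number of providers from a responsive manager and whether
// a call against a silent manager returned nil instead of hanging.
func demo() (int, bool) {
	sys, stop := context.WithCancel(context.Background())
	defer stop() // stops both manager loops

	good := &providerManager{getprovs: make(chan *getProv)}
	go good.run(sys, true)
	silent := &providerManager{getprovs: make(chan *getProv)}
	go silent.run(sys, false)

	long, cancelLong := context.WithTimeout(sys, time.Second)
	defer cancelLong()
	found := good.GetProviders(long, "k")

	short, cancelShort := context.WithTimeout(sys, 50*time.Millisecond)
	defer cancelShort()
	gaveUp := silent.GetProviders(short, "k") == nil

	return len(found), gaveUp
}

func main() {
	fmt.Println(demo())
}
```

Returning nil on cancellation mirrors the snippet; a caller that needs to tell "no providers" apart from "gave up" would have to return an error as well.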

jbenet · 2015-01-05T17:08:17Z

Is this fixed on master? pls reopen if not.


whyrusleeping added the kind/bug A bug in existing code (including security flaws) label Dec 2, 2014

jbenet closed this as completed Jan 5, 2015

ariescodescream pushed a commit to ariescodescream/go-ipfs that referenced this issue Oct 23, 2021

Merge pull request ipfs#400 from libp2p/feat/disable-providers

2e6adb8

feat: allow disabling value and provider storage/messages
