Add stream pooling #285

Closed
wants to merge 7 commits into from

Conversation

anacrolix
Contributor

See #271.

@ghost ghost assigned anacrolix Mar 7, 2019
@ghost ghost added the status/in-progress label Mar 7, 2019
@raulk
Member

raulk commented Mar 11, 2019

What’s wrong with using sync.Pools?

@anacrolix
Contributor Author

sync.Pool is for relieving pressure on the GC. I'm not sure it's an appropriate use here, and it would probably require us to bind a finalizer to the poolStream object to reset the underlying stream. It's hard to say whether the lifetime granted between GCs would be appropriate; I'd lean toward no, although it would provide an easy way to prune unused streams. Maybe @Stebalien has more insight into whether this is appropriate or necessary.

I'm also not sure how it would interact with the poolStream.reader goroutine maintaining a reference to the poolStream: more than likely the poolStream would be evicted from the sync.Pool, but the finalizer would never be triggered. It can be done, but it might result in fiddly, unidiomatic code.
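
For illustration, here is a minimal sketch of that concern, with simplified stand-in names (poolStream here only approximates this PR's type, and the pool itself is hypothetical):

```go
package poolsketch

import (
	"io"
	"runtime"
	"sync"
)

// poolStream is a simplified stand-in for the pooled-stream wrapper.
type poolStream struct {
	s io.ReadWriteCloser // the underlying stream
}

// reader loops for the life of the stream; while it runs it keeps p
// reachable, so the finalizer set below cannot fire.
func (p *poolStream) reader() {
	buf := make([]byte, 1024)
	for {
		if _, err := p.s.Read(buf); err != nil {
			return
		}
	}
}

// streamPool would hold *poolStream values between requests.
var streamPool sync.Pool

// newPoolStream attaches a finalizer meant to reset/close the underlying
// stream once the wrapper becomes garbage. sync.Pool may silently drop the
// entry on a GC cycle, but the reader goroutine still references p, so the
// object never becomes garbage and the finalizer never runs.
func newPoolStream(s io.ReadWriteCloser) *poolStream {
	p := &poolStream{s: s}
	runtime.SetFinalizer(p, func(ps *poolStream) { ps.s.Close() })
	go p.reader()
	return p
}
```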

@anacrolix anacrolix requested a review from Stebalien March 12, 2019 02:41
@anacrolix anacrolix force-pushed the stream-pooling branch 3 times, most recently from 16c168f to dc67e63 on March 12, 2019 07:41
@anacrolix anacrolix marked this pull request as ready for review March 12, 2019 09:22
@anacrolix
Contributor Author

I'm not sure about onReady and onReaderErr; in particular, those things can be done synchronously in newPoolStream and sendRequest by rearranging some interfaces, but I don't know if it's really worth it unless the callbacks are jarring. If anyone has insight into unusual behaviours I should check for, such as the number of streams, or errors that I might retry rather than failing early, that would be helpful.

I've tested the crap out of it with the dht-tool.

@raulk raulk self-requested a review March 12, 2019 09:35

@Stebalien Stebalien (Member) left a comment

Can we profile this in go-ipfs? That is, compile go-ipfs with this change, try running a bunch of queries, see if it changes request times, look at the heap pprof profiles, etc. For example, we may need to:

  • Bound the number of pooled streams.
  • Garbage collect pooled streams, or at least reduce them to 1.

However, we're not going to really know unless we do some profiling.

@anacrolix
Contributor Author

@Stebalien all points are addressed save IPFS profiling. Can we put it on some nodes somewhere? Should we squash and merge it and give it some time to kick around?

@Stebalien
Member

We can just put it on a single node and run a bunch (100? bash?) of ipfs dht query ... commands. Do this with and without the patch and check:

  1. A goroutine dump. That is, run curl http://127.0.0.1:5001/debug/pprof/goroutine?debug=2 and then use https://github.com/whyrusleeping/stackparse to count goroutines.
  2. A heap dump: go tool pprof $(which ipfs) http://127.0.0.1:5001/debug/pprof/heap. Check how much memory is being used by, e.g., yamux (which would indicate that the streams are causing a problem).

This doesn't have to be an amazing scientific test. We need continuous benchmarks but, well, we don't have those yet so this is currently the best we can do. I'm just a bit worried that this will create a bunch of streams/goroutines and leave them around.

If we don't notice any issues, we can merge this and then throw it up on the gateways. However, that's more involved so it's good to get a rough idea of the impact first.

@anacrolix
Contributor Author

> We can just put it on a single node and run a bunch (100? bash?) of ipfs dht query ... commands. Do this with and without the patch and check:

Who might be best qualified to make a call on this? The reduction in latency, and in the lock contention that can occur with the existing code, should make this an appealing thing to test for anyone interested.

> I'm just a bit worried that this will create a bunch of streams/goroutines and leave them around.

This might happen mainly around "bursty" outbound queries to a single peer. I understand the main concern is the memory overhead of having streams open. Idle stream management (if it exists) and connection churn (any connection dropping should trigger all the pool-stream readers to return and purge the peer's pool) should deal with any long-term effects.

@Stebalien
Member

Just test it. It's not hard, and it's good practice. Really, the development cycle of any change like this should involve repeated sanity checks.

@anacrolix
Contributor Author

Okay, I've run it for some time with the old protocol taken out and with the race detector enabled. Stream count typically sits marginally higher than the number of swarm connections.

@raulk
Member

raulk commented Mar 14, 2019

@anacrolix re: sync.Pool – yeah, that makes sense. We need to control the lifecycle of streams, and sync.Pool doesn't offer a way to check out existing objects without creating new ones. Funny, as that is literally all we would've needed to cleanly finalise a pool of its own accord.

@raulk
Member

raulk commented Mar 14, 2019

> Stream count typically sits marginally higher than the number of swarm connections.

Can you elaborate on how you conducted the test? Are all swarm connections DHT peers? If so, does this indicate an approximately 1:1 mapping between peers and active streams, suggesting that pooling introduces little advantage? We should test whether that observation holds on heavily loaded nodes like gateways (we can mirror traffic into a dev instance). @scout – let's chat about setting this up.

@anacrolix
Contributor Author

The test was run per Steb's instructions. I hammered a node with requests and monitored the stream and connection counts. Without stream pooling, there appears to be a perfect 1:1 ratio. With it, there are slightly more streams, which would indicate that some requests are opening new streams to avoid waiting their turn on an existing one.

@Stebalien Stebalien requested a review from raulk March 28, 2019 18:15

@raulk raulk (Member) left a comment

This has shaped up really well, @anacrolix! Happy to merge this in quickly. I just have a few minor comments and questions.

@vyzo vyzo (Contributor) left a comment

this is a complex change, don't rush to merge!

@vyzo
Contributor

vyzo commented Apr 8, 2019

I'd like to second @Stebalien's comment about garbage collecting streams and shrinking the stream pool. We need a mechanism to scale down once we've spun up a bunch of streams.

@raulk
Member

raulk commented Apr 8, 2019

Actually, that’s a good point. This assumes that keeping streams open eternally is free, and that the footprint of lingering unused streams is negligible. That’s probably not the case.

The simplest solution is to model the per-peer pool as a stack, so that get and put are LIFO; track the last-use timestamp for each stream, and run a cleanup goroutine that visits all peers and closes any streams whose last use is older than X (e.g. 5 minutes).
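
For concreteness, a rough sketch of that suggestion; the names (peerPool, idleStream, cleanupLoop) and the use of io.Closer in place of the real stream type are assumptions for illustration, not this PR's code:

```go
package poolsketch

import (
	"io"
	"sync"
	"time"
)

// idleStream pairs a pooled stream with the time it was last used.
type idleStream struct {
	s        io.Closer // stands in for the real libp2p stream type
	lastUsed time.Time
}

// peerPool is one peer's pool of idle streams, used as a LIFO stack.
type peerPool struct {
	mu   sync.Mutex
	idle []idleStream
}

// put pushes a stream back onto the stack, stamping its last-use time.
func (p *peerPool) put(s io.Closer) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle = append(p.idle, idleStream{s: s, lastUsed: time.Now()})
}

// get pops the most recently used stream, if any (LIFO).
func (p *peerPool) get() (io.Closer, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.idle) == 0 {
		return nil, false
	}
	top := p.idle[len(p.idle)-1]
	p.idle = p.idle[:len(p.idle)-1]
	return top.s, true
}

// gc closes and drops any stream whose last use is older than maxIdle.
func (p *peerPool) gc(maxIdle time.Duration) {
	p.mu.Lock()
	defer p.mu.Unlock()
	keep := p.idle[:0]
	for _, is := range p.idle {
		if time.Since(is.lastUsed) > maxIdle {
			is.s.Close()
		} else {
			keep = append(keep, is)
		}
	}
	p.idle = keep
}

// cleanupLoop runs once per host; getPools returns a snapshot of every
// peer's pool (real code would guard that map with its own lock).
func cleanupLoop(getPools func() []*peerPool, maxIdle, interval time.Duration, done <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			for _, p := range getPools() {
				p.gc(maxIdle)
			}
		case <-done:
			return
		}
	}
}
```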

@anacrolix
Contributor Author

I believe stream pooling is the right way to go, but I'm going to wait for metrics and tracing to stabilize so it can be shown to be a definitive improvement. After a lot of testing on private instances, I'm finding that the existing solution is generally fine. Any changes we make will be to reduce the potential contention that occurs in very rare cases (beyond the 99th percentile).

@anacrolix
Contributor Author

> The simplest solution is to model the per-peer pool as a stack, so that get and put are LIFO; track the last-use timestamp for each stream, and run a cleanup goroutine that visits all peers and closes any streams whose last use is older than X (e.g. 5 minutes).

I think I'd just prefer to limit the size of the pool to a reasonable value, say 10. Metrics and tracing will help determine whether it has any effect. I could expose a metric for the number of stream pools, and another for the total number of streams.
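
As a sketch of that alternative, reusing the hypothetical peerPool and idleStream types from the sketch after @raulk's earlier comment (again, not this PR's code), put could simply close the stream once the pool is at the cap:

```go
// maxPooledStreamsPerPeer caps each peer's pool; 10 is the value floated above.
const maxPooledStreamsPerPeer = 10

// put keeps the returned stream only while the pool is under the cap;
// otherwise the stream is closed rather than left lingering idle.
func (p *peerPool) put(s io.Closer) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.idle) >= maxPooledStreamsPerPeer {
		s.Close()
		return
	}
	p.idle = append(p.idle, idleStream{s: s, lastUsed: time.Now()})
}
```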

@anacrolix
Contributor Author

I've rebuilt this branch without the Prometheus metrics I had throughout, and left a lot of the placeholders in place to tap into the metrics on master.

@vyzo
Contributor

vyzo commented Apr 15, 2019

@anacrolix a comment/request about code style: can you use whitespace (empty lines) more to separate logical sections of code in non-trivial functions? No one likes to read a dense block of code.

@raulk
Member

raulk commented Apr 15, 2019

@anacrolix even if there's no improvement below the 99th percentile, I think the code you're trying to merge is way cleaner than the status quo.

What kind of load are you modelling in your tests? I can see stream pooling bringing an improvement in DHT boosters and nodes with high traffic – possibly even IPFS. Maybe we are not testing in the right context?

Idea: we could make pooling an option, where pool size = 1 replaces the pool with a dumb single-instance container.

@anacrolix
Contributor Author

anacrolix commented Apr 17, 2019

Thanks @raulk. I suspect the reason I haven't seen more improvement is that I'm running it on a quiet node. I want to tidy up the metrics in the DHT to make running with this PR observable, and add metrics for the total number of peer stream pools and the number of streams. Two unresolved elements in this PR are cleaning up pools for peers we no longer have streams to, and the per-peer stream cap. There should be some overlap in stream handling with #322, which is relevant to this PR but which I haven't investigated yet.

@anacrolix
Contributor Author

CircleCI has a failure, probably due to an OOM with the race detector enabled, which is interesting and probably related to the stream handling and #322.

@anacrolix anacrolix removed their assignment Jun 4, 2020
@Stebalien Stebalien closed this Jul 21, 2021