Add stream pooling #285

Closed
wants to merge 7 commits into from

Conversation

anacrolix
Contributor

See #271.

@ghost ghost assigned anacrolix Mar 7, 2019
@ghost ghost added the status/in-progress label Mar 7, 2019
@raulk
Member

raulk commented Mar 11, 2019

What’s wrong with using sync.Pools?

@anacrolix
Contributor Author

sync.Pool is for relieving pressure on the GC. I'm not sure it's an appropriate use here, and it would probably require us to bind a finalizer to the poolStream object to reset the underlying stream. It's hard to say whether the lifetime granted between GCs would be appropriate; I'd lean toward no, although it would provide an easy way to prune unused streams. Maybe @Stebalien has more insight into whether this is appropriate or necessary.

I'm also not sure how it would interact with the poolStream.reader goroutine maintaining a reference to the poolStream: more than likely the poolStream would be evicted from the sync.Pool, but the finalizer would never be triggered. It can be done, but it might result in fiddly, unidiomatic code.
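
For illustration, here is a minimal sketch of that concern, with simplified stand-in names (poolStream here only approximates this PR's type, and the pool itself is hypothetical):

```go
package poolsketch

import (
	"io"
	"runtime"
	"sync"
)

// poolStream is a simplified stand-in for the pooled-stream wrapper.
type poolStream struct {
	s io.ReadWriteCloser // the underlying stream
}

// reader loops for the life of the stream; while it runs it keeps p
// reachable, so the finalizer set below cannot fire.
func (p *poolStream) reader() {
	buf := make([]byte, 1024)
	for {
		if _, err := p.s.Read(buf); err != nil {
			return
		}
	}
}

// streamPool would hold *poolStream values between requests.
var streamPool sync.Pool

// newPoolStream attaches a finalizer meant to reset/close the underlying
// stream once the wrapper becomes garbage. sync.Pool may silently drop the
// entry on a GC cycle, but the reader goroutine still references p, so the
// object never becomes garbage and the finalizer never runs.
func newPoolStream(s io.ReadWriteCloser) *poolStream {
	p := &poolStream{s: s}
	runtime.SetFinalizer(p, func(ps *poolStream) { ps.s.Close() })
	go p.reader()
	return p
}
```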

@anacrolix anacrolix requested a review from Stebalien March 12, 2019 02:41
@anacrolix anacrolix force-pushed the stream-pooling branch 3 times, most recently from 16c168f to dc67e63 on March 12, 2019 07:41
@anacrolix anacrolix marked this pull request as ready for review March 12, 2019 09:22
@anacrolix
Contributor Author

I'm not sure about onReady and onReaderErr; in particular, those things can be done synchronously in newPoolStream and sendRequest by rearranging some interfaces, but I don't know if it's really worth it unless the callbacks are jarring. If anyone has insight into unusual behaviours I should check for, such as the number of streams, or errors that I might retry rather than failing early, that would be helpful.

I've tested the crap out of it with the dht-tool.

@raulk raulk self-requested a review March 12, 2019 09:35

@Stebalien Stebalien (Member) left a comment

Can we profile this in go-ipfs? That is, compile go-ipfs with this change, try running a bunch of queries, see if it changes request times, look at the heap pprof profiles, etc. For example, we may need to:

  • Bound the number of pooled streams.
  • Garbage collect pooled streams, or at least reduce them to 1.

However, we're not going to really know unless we do some profiling.

@anacrolix
Contributor Author

@Stebalien all points are addressed save IPFS profiling. Can we put it on some nodes somewhere? Should we squash and merge it and give it some time to kick around?

@Stebalien
Member

We can just put it on a single node and run a bunch (100? bash?) of ipfs dht query ... commands. Do this with and without the patch and check:

  1. A goroutine dump. That is, run curl http://127.0.0.1:5001/debug/pprof/goroutine?debug=2 and then use https://github.com/whyrusleeping/stackparse to count goroutines.
  2. A heap dump: go tool pprof $(which ipfs) http://127.0.0.1:5001/debug/pprof/heap. Check how much memory is being used by, e.g., yamux (which would indicate that the streams are causing a problem).

This doesn't have to be an amazing scientific test. We need continuous benchmarks but, well, we don't have those yet so this is currently the best we can do. I'm just a bit worried that this will create a bunch of streams/goroutines and leave them around.

If we don't notice any issues, we can merge this and then throw it up on the gateways. However, that's more involved so it's good to get a rough idea of the impact first.

@anacrolix
Contributor Author

> We can just put it on a single node and run a bunch (100? bash?) of ipfs dht query ... commands. Do this with and without the patch and check:

Who might be best qualified to make a call on this? The reduction in latency, and in the lock contention that can occur with the existing code, should make this an appealing thing to test for anyone interested.

> I'm just a bit worried that this will create a bunch of streams/goroutines and leave them around.

This might happen mainly around "bursty" outbound queries to a single peer. I understand the main concern is the memory overhead of having streams open. Idle stream management (if it exists) and connection churn (any connection dropping should trigger all the pool-stream readers to return and purge the peer's pool) should deal with any long-term effects.

@Stebalien
Member

Just test it. It's not hard, and it's good practice. Really, the development cycle of any change like this should involve repeated sanity checks.

@anacrolix
Contributor Author

Okay, I've run it for some time with the old protocol taken out and with the race detector enabled. Stream count typically sits marginally higher than the number of swarm connections.

@raulk
Member

raulk commented Mar 14, 2019

@anacrolix re: sync.Pool – yeah, that makes sense. We need to control the lifecycle of streams, and sync.Pool doesn't offer a way to check out existing objects without creating new ones. Funny, as that is literally all we would've needed to cleanly finalise a pool of its own accord.

@raulk
Member

raulk commented Mar 14, 2019

> Stream count typically sits marginally higher than the number of swarm connections.

Can you elaborate on how you conducted the test? Are all swarm connections DHT peers? If so, does this indicate an approximately 1:1 mapping between peers and active streams, suggesting that pooling introduces little advantage? We should test whether that observation holds on heavily loaded nodes like gateways (we can mirror traffic into a dev instance). @scout – let's chat about setting this up.

@anacrolix
Contributor Author

The test was run per Steb's instructions. I hammered a node with requests and monitored the stream and connection counts. Without stream pooling, there appears to be a perfect 1:1 ratio. With it, there are slightly more streams, which would indicate that some requests are opening new streams to avoid waiting their turn on an existing one.

@Stebalien Stebalien requested a review from raulk March 28, 2019 18:15

@raulk raulk (Member) left a comment

This has shaped up really well, @anacrolix! Happy to merge this in quickly. I just have a few minor comments and questions.

@vyzo vyzo (Contributor) left a comment

this is a complex change, don't rush to merge!

@vyzo
Contributor

vyzo commented Apr 8, 2019

I'd like to second @Stebalien's comment about garbage collecting streams and shrinking the stream pool. We need a mechanism to scale down once we've spun up a bunch of streams.

@raulk
Member

raulk commented Apr 8, 2019

Actually, that’s a good point. This assumes that keeping streams open eternally is free, and that the footprint of lingering unused streams is negligible. That’s probably not the case.

The simplest solution is to model the per-peer pool as a stack, so that get and put are LIFO; track the last-use timestamp for each stream, and run a cleanup goroutine that visits all peers and closes any streams whose last use is older than X (e.g. 5 minutes).
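
For concreteness, a rough sketch of that suggestion; the names (peerPool, idleStream, cleanupLoop) and the use of io.Closer in place of the real stream type are assumptions for illustration, not this PR's code:

```go
package poolsketch

import (
	"io"
	"sync"
	"time"
)

// idleStream pairs a pooled stream with the time it was last used.
type idleStream struct {
	s        io.Closer // stands in for the real libp2p stream type
	lastUsed time.Time
}

// peerPool is one peer's pool of idle streams, used as a LIFO stack.
type peerPool struct {
	mu   sync.Mutex
	idle []idleStream
}

// put pushes a stream back onto the stack, stamping its last-use time.
func (p *peerPool) put(s io.Closer) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle = append(p.idle, idleStream{s: s, lastUsed: time.Now()})
}

// get pops the most recently used stream, if any (LIFO).
func (p *peerPool) get() (io.Closer, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.idle) == 0 {
		return nil, false
	}
	top := p.idle[len(p.idle)-1]
	p.idle = p.idle[:len(p.idle)-1]
	return top.s, true
}

// gc closes and drops any stream whose last use is older than maxIdle.
func (p *peerPool) gc(maxIdle time.Duration) {
	p.mu.Lock()
	defer p.mu.Unlock()
	keep := p.idle[:0]
	for _, is := range p.idle {
		if time.Since(is.lastUsed) > maxIdle {
			is.s.Close()
		} else {
			keep = append(keep, is)
		}
	}
	p.idle = keep
}

// cleanupLoop runs once per host; getPools returns a snapshot of every
// peer's pool (real code would guard that map with its own lock).
func cleanupLoop(getPools func() []*peerPool, maxIdle, interval time.Duration, done <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			for _, p := range getPools() {
				p.gc(maxIdle)
			}
		case <-done:
			return
		}
	}
}
```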

@anacrolix
Contributor Author

I believe stream pooling is the right way to go, but I'm going to wait for metrics and tracing to stabilize so it can be shown to be a definitive improvement. After a lot of testing on private instances, I'm finding that the existing solution is generally fine. Any changes we make will be to reduce the potential contention that occurs in very rare cases (beyond the 99th percentile).

@anacrolix
Contributor Author

> The simplest solution is to model the per-peer pool as a stack, so that get and put are LIFO; track the last-use timestamp for each stream, and run a cleanup goroutine that visits all peers and closes any streams whose last use is older than X (e.g. 5 minutes).

I think I'd just prefer to limit the size of the pool to a reasonable value, say 10. Metrics and tracing will help determine whether it has any effect. I could expose a metric for the number of stream pools, and another for the total number of streams.
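
As a sketch of that alternative, reusing the hypothetical peerPool and idleStream types from the sketch after @raulk's earlier comment (again, not this PR's code), put could simply close the stream once the pool is at the cap:

```go
// maxPooledStreamsPerPeer caps each peer's pool; 10 is the value floated above.
const maxPooledStreamsPerPeer = 10

// put keeps the returned stream only while the pool is under the cap;
// otherwise the stream is closed rather than left lingering idle.
func (p *peerPool) put(s io.Closer) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.idle) >= maxPooledStreamsPerPeer {
		s.Close()
		return
	}
	p.idle = append(p.idle, idleStream{s: s, lastUsed: time.Now()})
}
```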

@anacrolix
Contributor Author

I've rebuilt this branch without the Prometheus metrics I had throughout, and left a lot of the placeholders in place to tap into the metrics on master.

@vyzo
Contributor

vyzo commented Apr 15, 2019

@anacrolix a comment/request about code style: can you use whitespace (empty lines) more to separate logical sections of code in non-trivial functions? No one likes to read a dense block of code.

@raulk
Member

raulk commented Apr 15, 2019

@anacrolix even if there's no improvement below the 99th percentile, I think the code you're trying to merge is way cleaner than the status quo.

What kind of load are you modelling in your tests? I can see stream pooling bringing an improvement in DHT boosters and nodes with high traffic – possibly even IPFS. Maybe we are not testing in the right context?

Idea: we could make pooling an option, where pool size = 1 replaces the pool with a dumb single-instance container.

@anacrolix
Contributor Author

anacrolix commented Apr 17, 2019

Thanks @raulk. I suspect the reason I haven't seen more improvement is that I'm running it on a quiet node. I want to tidy up the metrics in the DHT to make running with this PR observable, and add metrics for the total number of peer stream pools and the number of streams. Two unresolved elements in this PR are cleaning up pools for peers we no longer have streams to, and the per-peer stream cap. There should be some overlap in stream handling with #322, which is relevant to this PR but which I haven't investigated yet.

@anacrolix
Contributor Author

CircleCI has a failure, probably due to an OOM with the race detector enabled, which is interesting and probably related to the stream handling and #322.

@anacrolix anacrolix removed their assignment Jun 4, 2020
@Stebalien Stebalien closed this Jul 21, 2021