Add stream pooling #285
Conversation
What's wrong with using `sync.Pool`?
I'm not sure how it would interact with the poolStream.reader goroutine maintaining a reference to the poolStream; more than likely the poolStream would be evicted from the sync.Pool and the finalizer would not be triggered. It can be done, but it might result in fiddly, unidiomatic code.
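For illustration, here is a minimal Go sketch of that concern, assuming a hypothetical `poolStream` type with a background `reader` goroutine; the names and the finalizer-based cleanup are illustrative, not the PR's actual code.

```go
package pool

import (
	"io"
	"runtime"
	"sync"
)

// poolStream wraps a network stream; a background reader goroutine watches it.
type poolStream struct {
	s io.ReadCloser
}

// reader holds a reference to ps until the stream errors out, so ps stays
// reachable even after a sync.Pool evicts it, and its finalizer never runs.
func (ps *poolStream) reader() {
	buf := make([]byte, 64)
	for {
		if _, err := ps.s.Read(buf); err != nil {
			return
		}
	}
}

// streamPool would hold *poolStream values between requests.
var streamPool sync.Pool

// newPoolStream is where the problem shows up: the finalizer is meant to close
// the stream once the pool drops ps and nothing else references it, but the
// reader goroutine keeps ps alive, so eviction alone never triggers it.
func newPoolStream(s io.ReadCloser) *poolStream {
	ps := &poolStream{s: s}
	runtime.SetFinalizer(ps, func(p *poolStream) { p.s.Close() })
	go ps.reader()
	return ps
}
```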
Force-pushed from 16c168f to dc67e63
I'm not sure about the […]. I've tested the crap out of it with the dht-tool.
Can we profile this in go-ipfs? That is, compile go-ipfs with this change, try running a bunch of queries, see if it changes request times, look at the heap pprof profiles, etc. For example, we may need to:
- Bound the number of pooled streams (a rough sketch follows below).
- Garbage collect pooled streams, or at least reduce them to one.
However, we're not going to really know unless we do some profiling.
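As a rough sketch of what bounding could look like (assuming streams are modelled as plain `io.ReadWriteCloser`; the real PR's types differ), a buffered channel per peer caps the number of idle streams:

```go
package pool

import "io"

// boundedPool keeps at most cap(ch) idle streams for a single peer.
type boundedPool struct {
	ch chan io.ReadWriteCloser
}

func newBoundedPool(size int) *boundedPool {
	return &boundedPool{ch: make(chan io.ReadWriteCloser, size)}
}

// get returns a pooled stream, or nil if none is idle (the caller then opens
// a new stream as it would today).
func (p *boundedPool) get() io.ReadWriteCloser {
	select {
	case s := <-p.ch:
		return s
	default:
		return nil
	}
}

// put returns a stream to the pool, or closes it if the pool is already full,
// bounding the number of idle streams held per peer.
func (p *boundedPool) put(s io.ReadWriteCloser) {
	select {
	case p.ch <- s:
	default:
		s.Close()
	}
}
```

With this shape, a burst only grows the pool up to its capacity; anything beyond that is closed on `put`, so the per-peer footprint stays bounded.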
Force-pushed from 5423d7b to 7669029
@Stebalien all points are addressed save IPFS profiling. Can we put it on some nodes somewhere? Should we squash and merge it and give it some time to kick around?
We can just put it on a single node and run a bunch (100? bash?) of queries.
This doesn't have to be an amazing scientific test. We need continuous benchmarks but, well, we don't have those yet, so this is currently the best we can do. I'm just a bit worried that this will create a bunch of streams/goroutines and leave them around. If we don't notice any issues, we can merge this and then throw it up on the gateways. However, that's more involved, so it's good to get a rough idea of the impact first.
Who would be best qualified to make a call on this? The reduction in latency, and in the lock contention that can occur with the existing code, should make this an appealing thing to test for anyone interested.
This might happen mainly around "bursty" outbound queries to a single peer. I understand the main concern is the memory overhead of keeping streams open. Idle stream management (if it exists) and connection churn (any connection dropping should trigger all the pool-stream readers to return and purge the peer's pool) should deal with any long-term effects.
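To make that purge-on-churn behaviour concrete, here is a hedged sketch; the `peerPool` type and its `remove` hook are hypothetical, not the PR's code:

```go
package pool

import "io"

// peerPool stands in for a per-peer stream pool; remove is whatever hook
// detaches a dead stream from that pool (hypothetical).
type peerPool struct {
	remove func(s io.ReadWriteCloser)
}

// watch blocks on the stream until the remote resets it or the connection
// drops, then detaches and closes it, so a dropped connection naturally
// drains that peer's pool without any separate bookkeeping.
func (pp *peerPool) watch(s io.ReadWriteCloser) {
	buf := make([]byte, 1)
	for {
		if _, err := s.Read(buf); err != nil {
			pp.remove(s)
			s.Close()
			return
		}
	}
}
```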
Just test it. It's not hard and it's good practice. Really, the development cycle of any change like this should involve repeated sanity checks.
Okay, I've run it for some time with the old protocol taken out and with the race detector enabled. Stream count typically sits marginally higher than the number of swarm connections.
Force-pushed from 7669029 to 4190960
@anacrolix, can you elaborate on how you conducted the test? Are all swarm connections DHT peers? If so, this indicates an approximately 1:1 mapping between peers and active streams, suggesting that pooling introduces little advantage. We should test whether that observation holds on heavily loaded nodes like gateways (we can mirror traffic into a dev instance). @scout – let's chat about setting this up.
The test was run per Steb's instructions. I hammered a node with requests and monitored the stream and connection counts. Without stream pooling, there appears to be a perfect 1:1 ratio. With it, there are slightly more streams, which would indicate some requests opening new streams to avoid waiting their turn to use an existing one.
This has shaped up really well, @anacrolix! Happy to merge this in quickly. I just have a few minor comments and questions.
this is a complex change, don't rush to merge!
I'd like to second @Stebalien's comment about garbage collecting streams and shrinking the stream pool.
Actually, that's a good point. This assumes that keeping streams open eternally is free and that the footprint of lingering unused streams is negligible. That's probably not the case. The simplest solution is to model the per-peer pool as a stack, so that get and put are LIFO, track the last-use timestamp on each stream, and run a cleanup goroutine that visits all peers and closes any streams older than X (e.g. 5 minutes).
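A minimal sketch of that shape, assuming streams as `io.ReadWriteCloser` and an illustrative idle cutoff; none of this is the PR's actual code:

```go
package pool

import (
	"io"
	"sync"
	"time"
)

type idleStream struct {
	s        io.ReadWriteCloser
	lastUsed time.Time
}

// lifoPool is a per-peer stack: put pushes and get pops the most recently
// used stream, so stale streams sink to the bottom and age out first.
type lifoPool struct {
	mu   sync.Mutex
	idle []idleStream
}

func (p *lifoPool) put(s io.ReadWriteCloser) {
	p.mu.Lock()
	p.idle = append(p.idle, idleStream{s: s, lastUsed: time.Now()})
	p.mu.Unlock()
}

func (p *lifoPool) get() io.ReadWriteCloser {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.idle); n > 0 {
		s := p.idle[n-1].s
		p.idle = p.idle[:n-1]
		return s
	}
	return nil
}

// evictIdle closes and drops any stream unused for longer than maxIdle.
func (p *lifoPool) evictIdle(maxIdle time.Duration) {
	p.mu.Lock()
	defer p.mu.Unlock()
	cutoff := time.Now().Add(-maxIdle)
	kept := p.idle[:0]
	for _, e := range p.idle {
		if e.lastUsed.Before(cutoff) {
			e.s.Close()
		} else {
			kept = append(kept, e)
		}
	}
	p.idle = kept
}
```

A single cleanup goroutine on a `time.Ticker` could then call `evictIdle(5 * time.Minute)` on every peer's pool.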
Force-pushed from 3c5594b to a5f3187
Force-pushed from a5f3187 to d9db398
I believe stream pooling is the right way to go, but I'm going to wait for metrics and tracing to stabilize so I can confirm it's a definitive improvement. After a lot of testing on private instances, I'm finding that in general the existing solution is fine. Any changes we make will be to reduce potential contention that occurs only in very rare cases (beyond the 99th percentile).
I think I'd just prefer to limit the size of the pool to a reasonable value, say 10. Metrics and tracing will help determine whether it has any effect. I could expose a metric for the number of stream pools, and another for the total number of streams.
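If we go that way, the two gauges could look roughly like this (the metric names are placeholders, not what the DHT currently exposes):

```go
package pool

import "github.com/prometheus/client_golang/prometheus"

var (
	// streamPools counts per-peer stream pools currently held.
	streamPools = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dht_stream_pools",
		Help: "Number of per-peer stream pools currently held.",
	})
	// pooledStreams counts streams across all pools.
	pooledStreams = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dht_pooled_streams",
		Help: "Total number of streams across all pools.",
	})
)

func init() {
	prometheus.MustRegister(streamPools, pooledStreams)
}
```

The pool code would then Inc/Dec these when pools are created or purged and when streams are added or removed.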
I've rebuilt this branch without the Prometheus metrics I had throughout, and left a lot of the placeholders in to tap into the metrics on master.
@anacrolix a comment/request about code style: can you use whitespace (empty lines) more to separate logical sections of code in non-trivial functions? No one likes to read a dense block of code.
@anacrolix even if there's no improvement below the 99th percentile, I think the code you're trying to merge is way cleaner than the status quo. What kind of load are you modelling in your tests? I can see stream pooling bringing an improvement in DHT boosters and nodes with high traffic – possibly even IPFS. Maybe we are not testing in the right context? Idea: we could make pooling an option, where pool size = 1 replaces the pool with a dumb single-instance container.
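A sketch of that option idea, with a hypothetical `WithStreamPoolSize` name (not an existing go-libp2p-kad-dht option):

```go
package pool

// config holds hypothetical pooling settings.
type config struct {
	streamPoolSize int
}

// Option mutates the config, following the usual functional-options pattern.
type Option func(*config)

// WithStreamPoolSize caps the number of idle streams kept per peer; a size of
// 1 degrades the pool to a dumb single-instance container, and 0 could mean
// no pooling at all.
func WithStreamPoolSize(n int) Option {
	return func(c *config) { c.streamPoolSize = n }
}
```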
Thanks @raulk. I suspect the reason I haven't seen more improvement is that I'm running it on a quiet node. I want to tidy up the metrics in the DHT to make running with this PR observable, adding metrics for the total number of peer stream pools and the number of streams. Two unresolved items in this PR are cleaning up pools for peers we no longer have streams to, and the stream cap per peer. There should be some overlap in stream handling with #322, which is relevant to this PR but I haven't investigated yet.
CircleCI has a failure, probably due to an OOM with the race detector enabled, which is interesting and probably related to the stream handling and #322.
See #271.