
Adaptive queue for staging dials #237

Merged: 11 commits into libp2p:master from feat/adaptive-dial-queue on Jan 30, 2019

Conversation

raulk (Member) commented Jan 29, 2019

Currently the DHT is performing dials outside of the Alpha concurrency limit. We are dialling all nodes that peers return in CloserPeers without limit. As a result, we end up flooding the swarm with dial jobs, which trips over the file descriptor limits, and brings dialling to a halt under some circumstances. Our current approach is also algorithmically incorrect, and leads to suboptimal query patterns.

This patch introduces an adaptive dial queue that spawns a dynamically sized set of goroutines to preemptively stage dials for later handoff to the DHT protocol for RPC. It identifies backpressure on both ends (dial consumers and dial producers), and takes compensating action by adjusting the worker pool.

We start with `DialQueueMinParallelism` workers (6), and scale up and down based on the demand for, and supply of, dialled peers.

The following events trigger scaling:

  • we scale up when we can't immediately return a successful dial to a new consumer.
  • we scale down when we've been idle for a while waiting for new dial attempts.
  • we scale down when we complete a dial and realise nobody was waiting for it.

Dialler throttling (e.g. when the FD limit is exceeded) is a concern: scaling up more workers to compensate would only add fuel to the fire. Since we have no deterministic way to detect throttling for now, we hard-limit concurrency to `DialQueueMaxParallelism` (20).

Testing this patch in a production mirror reduced dial backlog considerably, and showed the adaptiveness in action:

Jan 29 00:42:02 localhost ipfs[6248]: 2019-01-29 00:42:02.873850 DEBUG dht dial_queue.go:177: grew dial worker pool: 6 => 9
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.003489 DEBUG dht dial_queue.go:177: grew dial worker pool: 6 => 9
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.083072 DEBUG dht dial_queue.go:177: grew dial worker pool: 9 => 13
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.186067 DEBUG dht dial_queue.go:177: grew dial worker pool: 6 => 9
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.246148 DEBUG dht dial_queue.go:199: shrunk dial worker pool: 9 => 6
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.246174 DEBUG dht dial_queue.go:177: grew dial worker pool: 13 => 19
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.328308 DEBUG dht dial_queue.go:177: grew dial worker pool: 6 => 9
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.451401 DEBUG dht dial_queue.go:177: grew dial worker pool: 6 => 9
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.462604 DEBUG dht dial_queue.go:177: grew dial worker pool: 9 => 13
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.463808 DEBUG dht dial_queue.go:177: grew dial worker pool: 9 => 13
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.464417 DEBUG dht dial_queue.go:177: grew dial worker pool: 9 => 13
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.465973 DEBUG dht dial_queue.go:177: grew dial worker pool: 9 => 13
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.470865 DEBUG dht dial_queue.go:199: shrunk dial worker pool: 18 => 12
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.697266 DEBUG dht dial_queue.go:177: grew dial worker pool: 6 => 9
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.697847 DEBUG dht dial_queue.go:177: grew dial worker pool: 6 => 9
Jan 29 00:42:03 localhost ipfs[6248]: 2019-01-29 00:42:03.838040 DEBUG dht dial_queue.go:177: grew dial worker pool: 9 => 13
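
The grow/shrink arithmetic is visible in these transitions (6 => 9 => 13 => 19 going up; 9 => 6 and 18 => 12 coming down), which are consistent with growing by ×3/2 and shrinking by ×2/3 under integer truncation. Below is a minimal Go sketch of that mechanic, clamped to the bounds described above. It is not the actual dial_queue.go code: the scaling factors are inferred from the log, and the channel plumbing, idle timers, and worker goroutine lifecycle are omitted.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	dialQueueMinParallelism = 6  // starting worker count
	dialQueueMaxParallelism = 20 // hard cap, guards against dialler-throttling feedback loops
)

type dialWorkerPool struct {
	mu   sync.Mutex
	size int
}

// grow is invoked when a consumer asks for a dialled peer and none is
// ready: we need more producers. Factor ×3/2 is inferred from the log.
func (p *dialWorkerPool) grow() {
	p.mu.Lock()
	defer p.mu.Unlock()
	target := p.size * 3 / 2
	if target > dialQueueMaxParallelism {
		target = dialQueueMaxParallelism
	}
	if target > p.size {
		fmt.Printf("grew dial worker pool: %d => %d\n", p.size, target)
		p.size = target // the real queue spawns worker goroutines here
	}
}

// shrink is invoked when workers idle out waiting for dial attempts, or
// when a dial completes and nobody was waiting for it: we are
// overproducing. Factor ×2/3 is inferred from the log.
func (p *dialWorkerPool) shrink() {
	p.mu.Lock()
	defer p.mu.Unlock()
	target := p.size * 2 / 3
	if target < dialQueueMinParallelism {
		target = dialQueueMinParallelism
	}
	if target < p.size {
		fmt.Printf("shrunk dial worker pool: %d => %d\n", p.size, target)
		p.size = target // the real queue signals excess workers to exit here
	}
}

func main() {
	p := &dialWorkerPool{size: dialQueueMinParallelism}
	p.grow()   // 6 => 9
	p.grow()   // 9 => 13
	p.grow()   // 13 => 19
	p.shrink() // 19 => 12
	p.shrink() // 12 => 8
	p.shrink() // 8 => 6 (clamped to the min parallelism floor)
}
```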

ghost assigned raulk on Jan 29, 2019
ghost added the status/in-progress label on Jan 29, 2019
anacrolix (Contributor) left a comment

Very nice and clean.

raulk (Member, Author) commented Jan 29, 2019

Future optimisation: cancelling pending dials to worse nodes as we find closer nodes to the target.

EDIT: in practice, this is complex, because those theoretically better nodes may never respond, and we would've stopped making progress. The algorithm would have to compensate by backtracking and replaying those dials. Quite a dance.

raulk (Member, Author) commented Jan 29, 2019

Addressed the review comments, but I noticed a flaky test on CI along the way. I do deplore depending on time, but I cannot think of another way to test this.
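
One way to make such time-dependent assertions less flaky, though not time-free, is to poll for the expected state with a generous deadline instead of asserting after a fixed sleep. A hypothetical sketch; `waitUntil` is illustrative and not part of this repo:

```go
package dht

import (
	"testing"
	"time"
)

// waitUntil polls cond until it returns true or the deadline elapses.
// Polling tolerates scheduler jitter better than a fixed time.Sleep,
// though the test still ultimately depends on wall-clock time.
func waitUntil(t *testing.T, deadline time.Duration, cond func() bool) {
	t.Helper()
	end := time.Now().Add(deadline)
	for time.Now().Before(end) {
		if cond() {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
	t.Fatalf("condition not met within %s", deadline)
}
```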

raulk (Member, Author) commented Jan 30, 2019

@Stebalien – up for re-review. I ended up changing the waiting mechanism to a slice, like we discussed in comments.

raulk merged commit 7a255be into libp2p:master on Jan 30, 2019
ghost removed the status/in-progress label on Jan 30, 2019
raulk deleted the feat/adaptive-dial-queue branch on Jan 30, 2019 at 23:38