-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ETCM-168: De-restrict parallelism to speed up enrollment. #103
Conversation
56378b8
to
fda7f61
Compare
3a89b7f
to
3509a96
Compare
@@ -762,7 +762,7 @@ object DiscoveryService { | |||
for { | |||
_ <- Task(logger.debug(s"Bonding with ${neighbors.size} neighbors...")) | |||
bonded <- Task | |||
.parTraverseN(config.kademliaAlpha)(neighbors) { neighbor => | |||
.parTraverseUnordered(neighbors) { neighbor => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't that mean that if malicious node send us lets say 1000 neighbours, we will start 1000 parallel bonding tries and it could potentially exhaust our resources ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They can try to send us 1000 nodes, but a single Neighbors
message can only contain ~10 nodes, and we'll anyway discard anything beyond the first 16, so these will just be dropped.
What I thought is that we don't have that many resources tried up in bonding after all: we use the same single UDP socket to send and receive, so it's just how rapidly we have to react to incoming messages. But the impact of non-responding nodes with their 3 second timeout is really visible in mordor, takes minutes to get past the initialisation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm fair enough. Maybe one more question, if the enrollement really takes such long time on mordor (and on mainner it will probably take much longer), maybe we should start enrollment in background and return from DiscoveryService.apply
immediately ? Then caller will be able to get already found peers, event hough enrollment did not finish yet. ( of course then we would need to start random lookups only when enrollment finishes, to avoid several concurrent lookups in progress)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do that yeah, I tried to follow the KRouter
which waits until enrollment is done, however the semantics of enrollment are not black and white: even if it fails during initialisation, it will be re-attempted during lookups unless we have at least 16 closest nodes already to the target, so in case we have just 1 bootstrap node and it happens to be offline, we can still connect to it later. And yes, as you say the caller would be able to read discovered nodes earlier, before the whole enrollment is finished.
In Mantis this wouldn't be a major issue though because the default setting is to consider bootstrap nodes and previously persisted nodes as "already discovered" and serve them, while the enrollment runs in the background.
At the very least the initial self-lookup could be done in the background since subsequent lookups are also happening outside the control of the host application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added option to run in background in cf7358c with the default left as foreground because that's what integration tests assume.
…ed from __.it.test
0476daf
to
161cf98
Compare
Enable unrestricted parallelism when enrolling with bootstraps and bonding with neighbors to speed up the initialisation.
Also fixes the CI build config to run unit and integration tests separately.
mill __.test
was actually matchingmill scalanet.test.test
as well asmill scalanet.discovery.it.test
, so it ran both.