Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETCM-168: De-restrict parallelism to speed up enrollment. #103

Merged
merged 9 commits into from
Nov 9, 2020

Conversation

aakoshh
Copy link
Contributor

@aakoshh aakoshh commented Nov 6, 2020

Enable unrestricted parallelism when enrolling with bootstraps and bonding with neighbors to speed up the initialisation.

Also fixes the CI build config to run unit and integration tests separately. mill __.test was actually matching mill scalanet.test.test as well as mill scalanet.discovery.it.test, so it ran both.

@aakoshh aakoshh requested a review from KonradStaniec November 6, 2020 17:23
@aakoshh aakoshh force-pushed the ETCM-168-speed-up-enrollment branch from 3a89b7f to 3509a96 Compare November 6, 2020 18:03
@@ -762,7 +762,7 @@ object DiscoveryService {
for {
_ <- Task(logger.debug(s"Bonding with ${neighbors.size} neighbors..."))
bonded <- Task
.parTraverseN(config.kademliaAlpha)(neighbors) { neighbor =>
.parTraverseUnordered(neighbors) { neighbor =>
Copy link
Contributor

@KonradStaniec KonradStaniec Nov 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't that mean that if malicious node send us lets say 1000 neighbours, we will start 1000 parallel bonding tries and it could potentially exhaust our resources ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can try to send us 1000 nodes, but a single Neighbors message can only contain ~10 nodes, and we'll anyway discard anything beyond the first 16, so these will just be dropped.

What I thought is that we don't have that many resources tried up in bonding after all: we use the same single UDP socket to send and receive, so it's just how rapidly we have to react to incoming messages. But the impact of non-responding nodes with their 3 second timeout is really visible in mordor, takes minutes to get past the initialisation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm fair enough. Maybe one more question, if the enrollement really takes such long time on mordor (and on mainner it will probably take much longer), maybe we should start enrollment in background and return from DiscoveryService.apply immediately ? Then caller will be able to get already found peers, event hough enrollment did not finish yet. ( of course then we would need to start random lookups only when enrollment finishes, to avoid several concurrent lookups in progress)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that yeah, I tried to follow the KRouter which waits until enrollment is done, however the semantics of enrollment are not black and white: even if it fails during initialisation, it will be re-attempted during lookups unless we have at least 16 closest nodes already to the target, so in case we have just 1 bootstrap node and it happens to be offline, we can still connect to it later. And yes, as you say the caller would be able to read discovered nodes earlier, before the whole enrollment is finished.

In Mantis this wouldn't be a major issue though because the default setting is to consider bootstrap nodes and previously persisted nodes as "already discovered" and serve them, while the enrollment runs in the background.

At the very least the initial self-lookup could be done in the background since subsequent lookups are also happening outside the control of the host application.

Copy link
Contributor Author

@aakoshh aakoshh Nov 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added option to run in background in cf7358c with the default left as foreground because that's what integration tests assume.

@aakoshh aakoshh force-pushed the ETCM-168-speed-up-enrollment branch from 0476daf to 161cf98 Compare November 9, 2020 13:01
@aakoshh aakoshh merged commit 1113264 into develop Nov 9, 2020
@aakoshh aakoshh deleted the ETCM-168-speed-up-enrollment branch November 9, 2020 13:19
@aakoshh aakoshh mentioned this pull request Nov 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants