Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update routing table in a separate thread #3853

Closed
bowenwang1996 opened this issue Jan 25, 2021 · 17 comments · Fixed by #3882
Closed

Update routing table in a separate thread #3853

bowenwang1996 opened this issue Jan 25, 2021 · 17 comments · Fixed by #3882
Assignees
Labels
A-network Area: Network

Comments

@bowenwang1996
Copy link
Collaborator

Updating routing table could take a nontrivial amount of time when the routing table is large. It is better if the computation is done in a separate thread from PeerManagerActor so that we don't block it from performing other tasks. Also we can update it less frequently than once every second (once every 5s seems fine as well).

@bowenwang1996
Copy link
Collaborator Author

Also we rebuild the entire routing table from scratch every time it updates (even if it's just one edge), which is not optimal. @pmnoxx @mfornet

@pmnoxx
Copy link
Contributor

pmnoxx commented Feb 2, 2021

Why not do something like this: #3882

@pmnoxx
Copy link
Contributor

pmnoxx commented Feb 2, 2021

before

test calculate_distance_10_10  ... bench:     810,918 ns/iter (+/- 38,764)
test calculate_distance_10_100 ... 
bench: 508,540,532 ns/iter (+/- 5,596,531)
test calculate_distance_3_3    ... bench:      12,071 ns/iter (+/- 805)

after

test calculate_distance_10_10  ... bench:     202,464 ns/iter (+/- 32,640)
test calculate_distance_10_100 ... bench:  21,625,595 ns/iter (+/- 6,668,834)
test calculate_distance_3_3    ... bench:       5,712 ns/iter (+/- 634)

@pmnoxx
Copy link
Contributor

pmnoxx commented Feb 2, 2021

After further optimizations thanks to @matklad suggestion to use rustc_hash::FxHashMap

test calculate_distance_10_10  ... bench:     122,286 ns/iter (+/- 14,241)
test calculate_distance_10_100 ... bench:  13,631,962 ns/iter (+/- 2,390,379)
test calculate_distance_3_3    ... bench:       3,044 ns/iter (+/- 310)

@pmnoxx pmnoxx linked a pull request Feb 2, 2021 that will close this issue
@matklad
Copy link
Contributor

matklad commented Feb 2, 2021

big 👍 for the general "make it fast first, only then make is complicated". In rust-analyzer, we managed to reduce complexity a lot by just computing things effectively instead of gong for complicated caching/therading setups. zz

@pmnoxx
Copy link
Contributor

pmnoxx commented Feb 2, 2021

We also reduced memory usage on calculate_distance_10_100 from 16mb to 5mb

@pmnoxx
Copy link
Contributor

pmnoxx commented Feb 2, 2021

More work needs to be done, I saw at one point a huge pile of scheduled futures to recompute the graph. The code looks incorrect:

Feb 02 20:59:38.108  INFO near_performance_metrics::stats_enabled: Futures waiting for completion     
...
Feb 02 20:59:38.114  INFO near_performance_metrics::stats_enabled:     future chain/network/src/peer_manager.rs:575 2182  
...

@pmnoxx
Copy link
Contributor

pmnoxx commented Feb 4, 2021

After some more optimizations:

test calculate_distance_10_100  ... bench:   3,420,732 ns/iter (+/- 567,682)

@pmnoxx
Copy link
Contributor

pmnoxx commented Feb 4, 2021

After further tweaking

test calculate_distance_10_100 ... bench:   2,554,859 ns/iter (+/- 312,015)

@bowenwang1996
Copy link
Collaborator Author

@pmnoxx on mainnet I observed the following:

Mar 06 03:55:39.709  WARN delay_detector: LONG DELAY! Took 764.663579ms processing routing table update
Mar 06 03:55:39.710  WARN near_performance_metrics::actix_enabled: Slow function call run_later:31436 chain/network/src/peer_manager.rs:582 too
k: 764ms

The neard binary has #3882

@bowenwang1996 bowenwang1996 reopened this Mar 6, 2021
@pmnoxx
Copy link
Contributor

pmnoxx commented Mar 8, 2021

I saw cases where functions like processing routing table would take a long time, because other threads were busy. I think we should measure cpu time thread in addiction to real time.

https://docs.rs/cpu-time/1.0.0/cpu_time/struct.ThreadTime.html

@bowenwang1996
Copy link
Collaborator Author

@pmnoxx that is actually done in #3880

@pmnoxx
Copy link
Contributor

pmnoxx commented Mar 14, 2021

In #4099 I cleaned up the logic for scheduling routing table precomputation to happen at most once a second.

@bowenwang1996
Copy link
Collaborator Author

I cleaned up the logic for scheduling routing table precomputation to happen at most once a second.

How is that related to this issue? The issue is not that we do the computation multiple times, but rather that each computation takes a long time.

@pmnoxx
Copy link
Contributor

pmnoxx commented Mar 15, 2021

I cleaned up the logic for scheduling routing table precomputation to happen at most once a second.

How is that related to this issue? The issue is not that we do the computation multiple times, but rather that each computation takes a long time.

This is just a code cleanup. I noticed we had old code which was limiting how often we should do routing table computation. That code was never working properly, it's functionality was replaced by other check I added. Right now it's just a dead code.

@pmnoxx
Copy link
Contributor

pmnoxx commented Mar 19, 2021

@bowenwang1996 I added prints and I see that this code takes 3-4s on my local machine:

        self.edges_info.retain(|(peer0, peer1), edge| {
            if to_save.contains(peer0) || to_save.contains(peer1) {
                edges_in_component.push(edge.clone());
                false
            } else {
                true
            }
        });

We call this function once every 2 hours, so it's probably not a high priority to optimize it.

@bowenwang1996
Copy link
Collaborator Author

We call this function once every 2 hours, so it's probably not a high priority to optimize it.

Sure. Let's open a separate issue for this (if doesn't already exist)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-network Area: Network
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants