
near-vm-runner: move protocol-sensitive error schemas to near-primitives #9295

Merged
merged 8 commits into master from nagisa/rm-acct-id
Jul 14, 2023

Conversation

nagisa
Collaborator

@nagisa nagisa commented Jul 13, 2023

This allows us to drop the dependency on the `near-account-id` and `near-rpc-error-macro` crates, and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly, this also solves a long-term pain point in the contract runtime: we never felt confident modifying the errors output by the contract runtime, for fear of accidentally affecting the protocol output. Now that the schemas live outside of nearcore/runtime, there's also a neat rule of thumb: as far as errors are concerned, anything goes inside nearcore/runtime.

This was a very straightforward dependency to remove -- errors largely
don't need as much typing as APIs do, due to their tendency to be
handled by a simple print.

And if there's interest, the user of this can convert the account ID
back to a strongly typed `AccountId`.
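As a sketch of that conversion (the type and error names below are hypothetical stand-ins, not the actual `near-vm-runner` or `near-account-id` definitions): the error carries the account id as a plain `String`, and a consumer that wants strong typing parses it back at the boundary.

```rust
use std::str::FromStr;

// Hypothetical stand-in for the strongly typed id from `near-account-id`;
// the real crate's validation rules are more involved than this.
#[derive(Debug, PartialEq)]
struct AccountId(String);

impl FromStr for AccountId {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Minimal sanity check only: non-empty, lowercase alphanumerics,
        // plus the separators '.', '-' and '_'.
        let ok = !s.is_empty()
            && s.chars()
                .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || ".-_".contains(c));
        if ok {
            Ok(AccountId(s.to_string()))
        } else {
            Err(format!("invalid account id: {s}"))
        }
    }
}

// An error shaped like the runtime's output after this change: the account
// id is just a `String`, so no dependency on the id crate is needed.
struct MethodResolveError {
    account_id: String, // weakly typed on purpose
}

fn main() {
    let err = MethodResolveError { account_id: "alice.near".to_string() };
    // A consumer converts back to the strong type only if it cares:
    let strong: AccountId = err.account_id.parse().expect("valid id");
    assert_eq!(strong, AccountId("alice.near".to_string()));
}
```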
Now there'll be fewer concerns about changing the protocol with
accidental changes to the runtime, and most importantly we have a place
where we can develop the runtime's API (cf limited replayability) and
the protocol independently.
Since the external-facing parts of these errors are now in
near-primitives, it is up to that crate to also generate the relevant
schemas.
@nagisa nagisa marked this pull request as ready for review July 13, 2023 13:01
@nagisa nagisa requested a review from a team as a code owner July 13, 2023 13:01
@nagisa nagisa requested a review from jakmeier July 13, 2023 13:01
@Ekleog-NEAR Ekleog-NEAR changed the title near-vm: move protocol-sensitive error schemas to near-primitives near-vm-runner: move protocol-sensitive error schemas to near-primitives Jul 14, 2023
@Ekleog-NEAR
Collaborator

xref: #8197 #9301

LGTM, seems like the CI failure is probably due to a change in the order of the fields in the JSON? Though I'm not sure what our process for changing the JSON should be, and whether it'd be easier to keep the order unchanged than to fix the JSON.

@nagisa
Collaborator Author

nagisa commented Jul 14, 2023

The JSON ordering change is just an artifact of how the macro is implemented: it gathers data into a `BTreeMap` (thus sorted), but when it merges results between two crates it retains the insertion order.
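A minimal illustration of that behaviour (the schema keys below are made up for the example): each crate's entries come out of a `BTreeMap` sorted, but a concatenating merge keeps the per-crate runs in arrival order rather than re-sorting globally.

```rust
use std::collections::BTreeMap;

// One crate's schema entries, gathered into a BTreeMap, iterate in
// sorted key order.
fn crate_schema(entries: &[(&str, &str)]) -> Vec<(String, String)> {
    let map: BTreeMap<_, _> =
        entries.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect();
    map.into_iter().collect()
}

// A merge that just concatenates preserves each run's internal order but
// does not re-sort globally: keys from the second crate can sort before
// keys already emitted by the first.
fn merge(a: Vec<(String, String)>, b: Vec<(String, String)>) -> Vec<(String, String)> {
    let mut out = a;
    out.extend(b);
    out
}

fn main() {
    let primitives = crate_schema(&[("HostError", "{}"), ("VMError", "{}")]);
    let runtime = crate_schema(&[("ActionError", "{}")]);
    let merged = merge(primitives, runtime);
    let keys: Vec<_> = merged.iter().map(|(k, _)| k.as_str()).collect();
    // "ActionError" ends up last even though it sorts first:
    assert_eq!(keys, ["HostError", "VMError", "ActionError"]);
}
```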


@Ekleog-NEAR Ekleog-NEAR left a comment


Might be nice, at some point, to make the merging re-sort the json :)

@near-bulldozer near-bulldozer bot merged commit 0de35c4 into master Jul 14, 2023
@near-bulldozer near-bulldozer bot deleted the nagisa/rm-acct-id branch July 14, 2023 11:26
nikurt pushed a commit that referenced this pull request Jul 15, 2023
…ves (#9295)

This allows to drop a dependency on `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)
nikurt pushed a commit that referenced this pull request Jul 20, 2023
…ves (#9295)

This allows to drop a dependency on `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)
near-bulldozer bot added a commit that referenced this pull request Jul 24, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 

I tested it manually and some sort of resharding did happen. I have yet to fully understand what exactly happened and whether it's any good, as well as to add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for (it could be "download"), but either way it's very confusing. I'm not sure I fully appreciate the difference between state sync, catchup, and download, and I'm open to a better suggestion for how to rename those. 
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (#9295)

This allows to drop a dependency on `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened lints and the replacement of the `clippy::integer_arithmetic` lint with the more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?), so I simply allowed the lint for now, but somebody should definitely take a look at it in the future. cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard in the current epoch.
If a node prepares to track shard S in the next epoch E, then during epoch E-1 it downloads the shard's state and applies chunks in order. To apply chunks correctly, in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate the TTN cost as `max(read_ttn, write_ttn)` and therefore reported three numbers (read, write, combined).
Now we only need to report a single number.

The removed code (the read TTN estimation) also no longer worked: it didn't actually touch any trie nodes, so an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (#9279)

This exposes more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit on a single screen line, even on quite a wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the log spam since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective, but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(except for the hashes: I have them shortened locally, but I'm not including that change in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.
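For illustration, the compact elapsed-time stamp in the sample above (`1.075s`) could be produced by a helper along these lines; this is a sketch, not the actual `tracing-subscriber` formatter wiring.

```rust
use std::time::Duration;

// Format an elapsed duration since test start as seconds with millisecond
// precision, e.g. "1.075s" -- full dates and nanosecond precision are
// just noise in test logs.
fn fmt_elapsed(elapsed: Duration) -> String {
    format!("{}.{:03}s", elapsed.as_secs(), elapsed.subsec_millis())
}

fn main() {
    assert_eq!(fmt_elapsed(Duration::from_millis(1075)), "1.075s");
    assert_eq!(fmt_elapsed(Duration::from_millis(75)), "0.075s");
    // In a real subscriber this would back a `FormatTime` implementation
    // keyed off an `Instant` captured when the test starts.
}
```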

* nearcore: remove old deprecation notice about network.external_address (#9315)

Users have had enough time to update their config files to no longer
specify network.external_address. The comment says the warning should be
removed by the end of 2022, which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237. No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (#9313)

The base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run. This doesn't seem to happen every time, and it
hasn't really been triggered on macOS, only on Linux. But it makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed).

* fix(state-sync): Simplify storage format of state sync dump progress (#9289)

There is no reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`.
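To illustrate with a toy borsh-style encoder (not nearcore's actual serialization code): wrapping the value in `Option` costs an extra tag byte per record and forces every reader to handle a `None` that should never legitimately occur.

```rust
// Toy fixed-width encoder for the example; real storage uses borsh.
fn encode_u64(x: u64) -> Vec<u8> {
    x.to_le_bytes().to_vec()
}

// An Option wrapper adds a one-byte tag (0 = None, 1 = Some) in front of
// the payload, borsh-style.
fn encode_option_u64(x: Option<u64>) -> Vec<u8> {
    match x {
        None => vec![0],
        Some(v) => {
            let mut out = vec![1];
            out.extend(encode_u64(v));
            out
        }
    }
}

fn main() {
    let plain = encode_u64(42);
    let wrapped = encode_option_u64(Some(42));
    assert_eq!(plain.len(), 8);
    assert_eq!(wrapped.len(), 9); // one extra tag byte per stored record
}
```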

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes both. It also re-enables two tests that are now fixed.

* fix: fixed nayduck test state_sync_fail.py for nightly build (#9320)

In #9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from one shard layout version to the immediate next. 

The fix is to check which protocol version the binary supports and, depending on that, reshard either from V0 to V1 or from V1 to V2.
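The shape of the fix might look roughly like this (the version numbers, names, and threshold below are illustrative, not nearcore's actual constants):

```rust
// Hypothetical sketch of the test fix: pick the resharding pair based on
// the protocol version the binary supports. A split map is only valid
// between adjacent layout versions, so the test must never attempt to
// jump from V0 straight to V2.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ShardLayoutVersion {
    V0,
    V1,
    V2,
}

fn resharding_pair(max_supported_protocol: u32) -> (ShardLayoutVersion, ShardLayoutVersion) {
    const SIMPLE_NIGHTSHADE_V2: u32 = 135; // assumed nightly protocol version
    if max_supported_protocol >= SIMPLE_NIGHTSHADE_V2 {
        (ShardLayoutVersion::V1, ShardLayoutVersion::V2)
    } else {
        (ShardLayoutVersion::V0, ShardLayoutVersion::V1)
    }
}

fn main() {
    assert_eq!(resharding_pair(100), (ShardLayoutVersion::V0, ShardLayoutVersion::V1));
    assert_eq!(resharding_pair(200), (ShardLayoutVersion::V1, ShardLayoutVersion::V2));
}
```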

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core  of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

###  Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
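The route computation at the heart of these flows can be sketched as a breadth-first search over the known edges; the real RoutingTableV2 additionally handles edge signatures, expiry, and the demux batching described above.

```rust
use std::collections::{HashMap, VecDeque};

// Sketch of the distance-vector core: given the set of known edges,
// compute each node's hop distance from us. Broadcasting this vector is
// what lets peers pick a "next hop" toward any destination.
fn distance_vector(local: u32, edges: &[(u32, u32)]) -> HashMap<u32, u32> {
    let mut adj: HashMap<u32, Vec<u32>> = HashMap::new();
    for &(a, b) in edges {
        adj.entry(a).or_default().push(b);
        adj.entry(b).or_default().push(a);
    }
    let mut dist: HashMap<u32, u32> = HashMap::from([(local, 0)]);
    let mut queue = VecDeque::from([local]);
    while let Some(node) = queue.pop_front() {
        let d = dist[&node];
        if let Some(neighbors) = adj.get(&node) {
            for &next in neighbors {
                if !dist.contains_key(&next) {
                    dist.insert(next, d + 1);
                    queue.push_back(next);
                }
            }
        }
    }
    dist
}

fn main() {
    // Topology: 0 - 1 - 2, plus a direct edge 0 - 3.
    let dv = distance_vector(0, &[(0, 1), (1, 2), (0, 3)]);
    assert_eq!(dv[&2], 2);
    assert_eq!(dv[&3], 1);
    // A peer receiving this vector learns it can reach node 2 through us
    // in dv[&2] + 1 hops.
}
```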

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (#9277)

@frol I went through the related code and found that this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (#9250)

Recommend future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem. So this case shouldn't complicate work like #9121.

* refactor(loadtest): backwards compatible type hints (#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require a newer python for the locust tests (also to use `match`, see #9125), but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on GCP machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.

* feat(state-sync): Add config for number of downloads during catchup (#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` support enabled, which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using the `perf-state` tool.

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and 'to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing. (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open for a better suggestion how to rename those. 
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithtmetic` lint  with a more general `clippy::arithmentic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more that a few times per block, then if should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.
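The compact elapsed-time prefix in the sample lines above (`1.075s` instead of a full date, time, and nanoseconds) can be produced by a formatter along these lines — a hypothetical standalone sketch, not the actual tracing formatter in nearcore:

```python
def format_elapsed(elapsed_seconds: float) -> str:
    """Render elapsed time since test start as seconds with millisecond
    precision, e.g. 1.0752 -> '1.075s', matching the log lines above."""
    return f"{elapsed_seconds:.3f}s"

def format_line(elapsed_seconds: float, level: str, message: str) -> str:
    # Right-align the timestamp slightly, like the sample output.
    return f"{format_elapsed(elapsed_seconds):>7} {level} {message}"
```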

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237 . No fix is available yet.

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requests to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>
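The per-bucket report above groups samples by `block_read_count` and averages the observed latency within each bucket. A rough reconstruction of that aggregation (the sample fields are assumptions inferred from the output, not the tool's actual types):

```python
from collections import defaultdict

def summarize(samples):
    """samples: iterable of (block_read_count, observed_latency) pairs.
    Returns {block_read_count: (num_samples, avg_latency)}, i.e. one
    row per bucket as in the report above."""
    buckets = defaultdict(list)
    for reads, latency in samples:
        buckets[reads].append(latency)
    return {
        reads: (len(latencies), sum(latencies) / len(latencies))
        for reads, latencies in sorted(buckets.items())
    }
```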

* RoutingTable V2: Distance Vector Routing (#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

###  Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes the entire batch, expires any outdated routes (relying on too-old edges), then generates an updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
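The route computation behind these flows is, at its core, a shortest-path calculation over the known edge set. A simplified sketch (plain BFS over undirected edges, ignoring signatures, expiry, and batching) of how a node might derive distances and next hops:

```python
from collections import deque

def compute_routes(local_node, edges):
    """edges: iterable of (a, b) undirected peer connections.
    Returns {node: (distance, first_hop)} -- a rough analogue of the
    local DistanceVector / RoutingTableView pair described above."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    routes = {local_node: (0, None)}
    queue = deque([local_node])
    while queue:
        node = queue.popleft()
        dist, hop = routes[node]
        for peer in adj.get(node, ()):
            if peer not in routes:
                # The first hop is the direct peer the path goes through.
                routes[peer] = (dist + 1, peer if node == local_node else hop)
                queue.append(peer)
    return routes
```

On a topology change (peer connected/disconnected, new DistanceVector received), the table is rebuilt from the surviving edges and, if the local result changed, broadcast to peers.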

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute
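These tunables could be gathered into a config struct along the lines of the sketch below (the field names and the single-struct shape are assumptions for illustration, not nearcore's actual config):

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class RoutingConfig:
    # Minimum interval between routing table reconstructions.
    recompute_interval: timedelta = timedelta(seconds=1)
    # Edges older than this are considered expired.
    edge_expiry: timedelta = timedelta(minutes=30)
    # How often to refresh the nonces on edges.
    nonce_refresh_interval: timedelta = timedelta(minutes=10)
    # How often to check the routing table's local edges
    # against the connection pool.
    local_edge_check_interval: timedelta = timedelta(minutes=1)
```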

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* feat(state-sync): Add config for number of downloads during catchup (#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.
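The effect of the knob — capping how many downloads run concurrently during catchup — can be sketched with a semaphore-bounded worker pool (a hypothetical illustration of the idea; the real limit is applied inside the state-sync code):

```python
import threading

def download_parts(part_ids, fetch, max_concurrent_downloads=4):
    """Run fetch(part_id) for every part, with at most
    max_concurrent_downloads in flight at a time, so block validation
    keeps enough resources while the node downloads state."""
    gate = threading.Semaphore(max_concurrent_downloads)
    results = {}
    lock = threading.Lock()

    def worker(pid):
        with gate:                 # blocks while the limit is reached
            data = fetch(pid)
        with lock:
            results[pid] = data

    threads = [threading.Thread(target=worker, args=(p,)) for p in part_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```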

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
nikurt added a commit to nikurt/nearcore that referenced this pull request Jul 26, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature; I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with a bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I have yet to fully appreciate what exactly happened and whether it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could mean download). I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to a better suggestion for how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (near#9295)

This allows us to drop the dependency on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly, this also solves a long-term pain point in the contract runtime: we never felt confident modifying errors output from the contract runtime, for fear of possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime`, there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned).

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard in the current epoch.
If a node prepares to track shard S in the next epoch E, then it downloads the shard's state (as of epoch E-1) and applies chunks in order. To apply chunks correctly, in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (near#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate the TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported (read, write, combined).
Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (near#9279)

This exposes more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (near#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective, but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (near#9315)

Users have had enough time to update their config files to no longer
specify network.external_address. The comment dictates the warning
should be removed by the end of 2022, which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237 . No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (near#9313)

the base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run.  This doesn't seem to happen every time, and it
hasn't really been triggered on MacOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed)

* fix(state-sync): Simplify storage format of state sync dump progress (near#9289)

There is no reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`.
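For intuition on why the wrapper mattered: in borsh-style encodings, `Some(x)` carries an extra tag byte compared to storing `x` directly. A tiny illustration of the two formats (not the actual nearcore serialization code):

```python
def encode_option_u32(value):
    """Borsh-style Option<u32>: a 1-byte tag, then the payload if present."""
    if value is None:
        return b"\x00"
    return b"\x01" + value.to_bytes(4, "little")

def encode_u32(value):
    """The same value stored directly, without the Option wrapper."""
    return value.to_bytes(4, "little")
```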

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (near#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac, where multiprocessing.Process uses spawn rather than fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable, which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.

* fix: fixed nayduck test state_sync_fail.py for nightly build (near#9320)

In near#9274 I introduced the simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from one shard layout version to the immediately next one. 

The fix is to check which protocol version the binary supports and, depending on that, reshard from V0 to V1 or from V1 to V2.
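The fix can be pictured as selecting the resharding pair from the binary's capabilities (the layout names and the boolean flag below are illustrative placeholders, not the actual constants in the test):

```python
def pick_reshard_layouts(supports_v2: bool):
    """Return (from_layout, to_layout) so that resharding always steps
    to the immediately next shard layout version, as the shard split
    map requires."""
    if supports_v2:
        return ("V1", "V2")
    return ("V0", "V1")
```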

* feat: add database tool subcommand for State read perf testing (near#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requests to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

###  Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes the entire batch, expires any outdated routes (relying on too-old edges), then generates an updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (near#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (near#9250)

We recommend future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem, so this case shouldn't complicate work like near#9121.

* refactor(loadtest): backwards compatible type hints (near#9323)

`list[...]` in type hints only works for Python 3.9 and up.
For older Python versions, we should use `typing.List[...]`.

I first thought we should require newer Python for locust tests, also using `match` (see near#9125), but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new Python version.

This PR makes the code fully backward compatible again by simply using the `typing` module, which has been available since Python 3.5.
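The compatible form looks like this: `typing.List` works on every supported Python version, while a bare `list[str]` annotation raises a `TypeError` at function-definition time on Python < 3.9 (the helper below is hypothetical, for illustration only):

```python
from typing import List

# On Python < 3.9, writing `account_ids: list[str]` here would fail at
# import time with "TypeError: 'type' object is not subscriptable";
# typing.List[str] is the backward-compatible spelling.
def fund_account_ids(account_ids: List[str]) -> int:
    """Hypothetical helper: count the accounts to fund."""
    return len(account_ids)
```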

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (near#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using the `perf-state` tool.

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f
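The genesis overwrites in the repro steps above can be scripted; a minimal sketch (the genesis path is whatever the localnet setup produced, and the shard layout dict stands in for `$SHARD_LAYOUT`):

```python
import json


def patch_genesis(path: str, shard_layout: dict) -> None:
    """Overwrite the genesis fields listed in the repro instructions:
    .epoch_length, .use_production_config and .shard_layout."""
    with open(path) as f:
        genesis = json.load(f)
    genesis["epoch_length"] = 10
    genesis["use_production_config"] = True
    genesis["shard_layout"] = shard_layout
    with open(path, "w") as f:
        json.dump(genesis, f, indent=2)
```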

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be "download"). I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to a better suggestion for how to rename those. 
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S in the next epoch E, then it downloads the shard's state during the current epoch (E-1) and applies chunks in order. To apply chunks correctly, in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237. No fix is available yet.

* feat: add database tool subcommand for State read perf testing (near#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>
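The per-`block_read_count` breakdown printed by the tool above is a straightforward aggregation over the collected samples; a hedged sketch of the idea (field names and shapes are illustrative, not the tool's actual data structures):

```python
from collections import defaultdict
from statistics import mean


def summarize(samples):
    """samples: list of (block_read_count, observed_latency_ms) pairs.

    Returns {block_read_count: (n_samples, avg_latency_ms)}, mirroring
    the per-bucket lines in the state-perf output above.
    """
    buckets = defaultdict(list)
    for reads, latency in samples:
        buckets[reads].append(latency)
    return {reads: (len(vals), mean(vals)) for reads, vals in sorted(buckets.items())}
```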

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core  of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

###  Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes the entire batch, expires any outdated routes (those relying on too-old edges), then generates an updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
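The route computation at the heart of these flows is classic distance-vector relaxation; a minimal, hedged sketch of the idea (not the actual nearcore implementation, which batches updates and handles edge expiry and signatures):

```python
def next_hops(local, peers):
    """Compute a next-hop table from each direct peer's advertised
    distance vector, as in distance-vector routing.

    peers maps peer_id -> {destination: distance}; the direct link to
    each peer is assumed to cost 1.
    """
    best = {local: (0, None)}  # destination -> (distance, next_hop)
    for peer, vector in peers.items():
        for dest, dist in vector.items():
            candidate = dist + 1
            if dest not in best or candidate < best[dest][0]:
                best[dest] = (candidate, peer)
    return {dest: hop for dest, (_, hop) in best.items() if hop is not None}
```

When a peer's vector changes (or a connection drops), the table is recomputed and, if the local vector changed, rebroadcast, matching the event flows listed above.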

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
nikurt added a commit that referenced this pull request Aug 24, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be "download"). I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to a better suggestion for how to rename those. 
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (#9295)

This allows us to drop a dependency on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S in the next epoch E, then it downloads the shard's state during the current epoch (E-1) and applies chunks in order. To apply chunks correctly, in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate the TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported (read, write, combined).
Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (#9279)

This expose more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (#9315)

Users have had enough time to update their config files to no longer
specify network.external_address.  The comment dictates the warning
should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237. No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (#9313)

the base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run.  This doesn't seem to happen every time, and it
hasn't really been triggered on MacOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed)
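Since the order in which the init handlers run cannot be guaranteed, one defensive pattern is for later handlers to wait until the base handler has published the attribute; a sketch under that assumption (nothing here is taken from locust's API beyond the attribute name seen in the traceback):

```python
import time


def wait_for_attr(obj, name, timeout=5.0, poll=0.01):
    """Block until `obj.<name>` is set by another init handler, or raise.

    Mirrors the fix: later init functions wait for the base handler to
    publish environment.master_funding_account before using it.
    """
    deadline = time.monotonic() + timeout
    while not hasattr(obj, name):
        if time.monotonic() >= deadline:
            raise AttributeError(f"{name} was never initialized")
        time.sleep(poll)
    return getattr(obj, name)
```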

* fix(state-sync): Simplify storage format of state sync dump progress (#9289)

No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.
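The two fixes generalize to any spawn-based platform: guard the entry point, and pass shared state explicitly instead of relying on inherited globals. A minimal sketch:

```python
import multiprocessing


def worker(shared_counter):
    # Shared state must be passed as an argument; under the spawn start
    # method the child process does not inherit mutated module globals.
    with shared_counter.get_lock():
        shared_counter.value += 1


def main():
    counter = multiprocessing.Value("i", 0)
    p = multiprocessing.Process(target=worker, args=(counter,))
    p.start()
    p.join()
    return counter.value


# Guard the entry point so that spawned children, which re-import this
# module, do not re-execute the test's main code.
if __name__ == "__main__":
    main()
```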

* fix: fixed nayduck test state_sync_fail.py for nightly build (#9320)

In #9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from one shard layout version to the immediate next. 

The fix is to check which protocol version the binary supports and, depending on it, reshard from V0 to V1 or from V1 to V2.
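The version-dependent choice can be sketched as follows (the feature version number and return encoding are hypothetical placeholders, not the actual nearcore constants):

```python
def pick_resharding(binary_protocol_version, v2_feature_version=135):
    """Return (source, target) shard-layout versions for the test.

    Reshard V1 -> V2 if the binary supports SimpleNightshadeV2,
    otherwise V0 -> V1, so the split map always spans exactly one
    layout-version step. v2_feature_version is a placeholder value.
    """
    if binary_protocol_version >= v2_feature_version:
        return (1, 2)
    return (0, 1)
```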

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core  of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes the entire batch, expires any outdated routes (those relying on too-old edges), then generates an updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
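
The route computation sketched above (ingest edges, recompute routes, pick next hops) boils down to a shortest-path search. Below is a minimal Python sketch of first-hop computation via BFS; it is not the actual RoutingTableV2 code, which additionally tracks signed edges, nonces, and expiry:

```python
from collections import defaultdict, deque

def compute_next_hops(local, edges):
    """Given undirected (a, b) edges, return for every reachable peer the
    set of direct neighbors of `local` that lie on some shortest path to
    it -- the candidate "next hops" used when forwarding a RoutedMessage."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    dist = {local: 0}
    first_hops = {local: set()}
    queue = deque([local])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            # The first hop toward v is v itself when u is the local node,
            # otherwise it is inherited from u's own first hops.
            hops = {v} if u == local else first_hops[u]
            if v not in dist:
                dist[v] = dist[u] + 1
                first_hops[v] = set(hops)
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                # Another shortest path to v; merge candidate next hops.
                first_hops[v] |= hops
    return {peer: hs for peer, hs in first_hops.items() if peer != local}

# Diamond topology: two equally short routes from A to D.
routes = compute_next_hops("A", [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
assert routes["D"] == {"B", "C"}
```

When several next hops are available, a real router can pick any of them (e.g. round-robin) without changing route lengths.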

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute
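
For illustration, the parameters above can be grouped into a single config type along these lines (field names are made up for this sketch and do not match the actual Rust config):

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class RoutingParams:
    # Defaults mirror the values listed above.
    recompute_routes_interval: timedelta = timedelta(seconds=1)
    edge_expiry: timedelta = timedelta(minutes=30)
    nonce_refresh_interval: timedelta = timedelta(minutes=10)
    local_edge_consistency_check_interval: timedelta = timedelta(minutes=1)

assert RoutingParams().edge_expiry == timedelta(minutes=30)
```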

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (#9250)

Future readers are recommended to stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem, so this case shouldn't complicate work like #9121.

* refactor(loadtest): backwards compatible type hints (#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require newer python for locust tests, also using `match` (see #9125), but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.
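
For example, the backwards-compatible form annotates with `typing.List`/`typing.Dict` instead of the PEP 585 built-in generics (the function here is just an illustrative stand-in, not loadtest code):

```python
from typing import Dict, List

# `List[int]` works on python 3.5+; the equivalent built-in generic
# `list[int]` is only valid at runtime on python 3.9+.
def count_by_shard(assignments: List[int]) -> Dict[int, int]:
    """Count how many items were assigned to each shard id."""
    counts: Dict[int, int] = {}
    for shard_id in assignments:
        counts[shard_id] = counts.get(shard_id, 0) + 1
    return counts

assert count_by_shard([0, 1, 1, 3]) == {0: 1, 1: 2, 3: 1}
```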

* feat(state-sync): Add config for number of downloads during catchup (#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using the `perf-state` tool.

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature; I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.
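
As a rough illustration of how a ShardLayout V1 maps accounts to shards through boundary accounts: account ids are compared lexicographically against the sorted boundary list, so adding one boundary account splits exactly one existing shard. A sketch (the boundary strings and function name are illustrative, not the actual nearcore API):

```python
import bisect

def account_to_shard_id(account_id, boundary_accounts):
    """Shard i holds account ids in [boundary[i-1], boundary[i]), compared
    lexicographically; `boundary_accounts` must be sorted."""
    return bisect.bisect_right(boundary_accounts, account_id)

# Four shards from three boundaries; inserting one more boundary account
# yields five shards and splits exactly one of the old shards.
v1_boundaries = ["aurora", "aurora-0", "kkuuue2akv_1630967379.near"]
v2_boundaries = sorted(v1_boundaries + ["tge-lockup.sweat"])

assert account_to_shard_id("alice.near", v1_boundaries) == 0
assert account_to_shard_id("zebra.near", v1_boundaries) == 3
assert account_to_shard_id("zebra.near", v2_boundaries) == 4
# Accounts below the new boundary keep their old shard id.
assert account_to_shard_id("alice.near", v2_boundaries) == 0
```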

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be "download"). I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to better suggestions on how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and the replacement of the `clippy::integer_arithmetic` lint with the more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237 . No fix is available yet.

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
nikurt added a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature; I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be "download"). I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to better suggestions on how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (near#9295)

This allows us to drop the dependency on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and the replacement of the `clippy::integer_arithmetic` lint with the more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (near#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported (read, write, combined).
Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (near#9279)

This exposes more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (near#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (near#9315)

Users have had enough time to update their config files to no longer
specify network.external_address. The comment dictates that the warning
should be removed by the end of 2022, which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237 . No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (near#9313)

the base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run.  This doesn't seem to happen every time, and it
hasn't really been triggered on MacOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed).
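
A minimal locust-free sketch of the failure mode, with hypothetical names: if the dependent init handler happens to run before the base one, the attribute it reads does not exist yet, so it must either be ordered after the base init (as this PR does) or read the attribute defensively:

```python
class Environment:
    """Stand-in for locust's Environment object (minimal sketch)."""
    pass

init_handlers = []

def on_init(fn):
    # Mimics registering a locust `init` event listener.
    init_handlers.append(fn)
    return fn

@on_init
def init_social(environment):
    # Buggy ordering assumption: expects init_base to have run already.
    # Reading defensively avoids the AttributeError but leaves None.
    environment.social_funding = getattr(
        environment, "master_funding_account", None)

@on_init
def init_base(environment):
    environment.master_funding_account = "funding.near"  # hypothetical value

env = Environment()
for handler in init_handlers:  # unlucky order: dependent handler first
    handler(env)
assert env.social_funding is None  # base init ran too late
```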

* fix(state-sync): Simplify storage format of state sync dump progress (near#9289)

There was no reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`.

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (near#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes both. It also re-enables two tests which are now fixed.
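
A minimal, hypothetical sketch of the two spawn-friendly patterns (guarded entry point, shared state passed explicitly as an argument). The example pins the fork start method so it runs as-is on Linux; under spawn (the macOS default), the `__main__` guard and the explicit argument are exactly what keep it correct:

```python
import multiprocessing

def worker(counter):
    # shared state is passed explicitly as an argument; under spawn the
    # child re-imports the module and would NOT see parent globals
    with counter.get_lock():
        counter.value += 1

def main():
    # "fork" keeps this demo runnable as-is on Linux; macOS defaults to
    # "spawn", which is why the __main__ guard below matters there
    ctx = multiprocessing.get_context("fork")
    counter = ctx.Value("i", 0)
    procs = [ctx.Process(target=worker, args=(counter,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value

if __name__ == "__main__":
    # without this guard, spawn-based platforms re-execute the whole
    # module's top-level code in every child process
    print(main())
```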

* fix: fixed nayduck test state_sync_fail.py for nightly build (near#9320)

In near#9274 I introduced the simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it.

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout directly to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from one shard layout version to the immediate next.

The fix is to check which protocol version the binary supports and, depending on it, reshard from V0 to V1 or from V1 to V2.
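
The fix's decision can be sketched as follows; the constant and version numbers here are illustrative placeholders, not the real nearcore values:

```python
# Hypothetical sketch: pick the reshard pair based on the protocol
# version the binary supports. SIMPLE_NIGHTSHADE_V2 = 62 is an assumed
# placeholder, not the actual protocol-version constant.
SIMPLE_NIGHTSHADE_V2 = 62

def reshard_pair(latest_protocol_version):
    # shard split maps only support resharding to the immediately
    # next shard layout version, so always pick adjacent versions
    if latest_protocol_version >= SIMPLE_NIGHTSHADE_V2:
        return ("V1", "V2")
    return ("V0", "V1")

print(reshard_pair(61))  # ('V0', 'V1')
print(reshard_pair(62))  # ('V1', 'V2')
```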

* feat: add database tool subcommand for State read perf testing (near#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requests to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leafs
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (near#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (near#9250)

Recommends that future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem, so this case shouldn't complicate work like near#9121.

* refactor(loadtest): backwards compatible type hints (near#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require a newer Python for locust tests, also to allow using `match` (see near#9125), but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on GCP machines created by terraform templates and needed to patch the type hints to get the code running without installing a new Python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.
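
For illustration, a minimal sketch of the compatibility pattern (the function name is made up):

```python
from typing import List

def sorted_heights(heights: List[int]) -> List[int]:
    # typing.List works on Python 3.5+; the builtin generic list[int]
    # in annotations requires Python 3.9+ (or, on 3.7+,
    # `from __future__ import annotations`)
    return sorted(heights)

print(sorted_heights([3, 1, 2]))  # [1, 2, 3]
```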

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (near#9298)

This update brings a lot of changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled, which can now be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using the `perf-state` tool.

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to better suggestions on how to rename those.
- In the LocalnetCmd I added logic to generate a default LogConfig, to get rid of a pesky log message about this config missing when starting neard.
- In docs, renamed `SyncJobActor` to `SyncJobsActor`, which is the correct name.
- Allowing the `stable_hash` to be unused. It's only unused on macOS, so we need to keep it but shouldn't generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build also be part of PR-buildkite?
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and the replacement of the `clippy::integer_arithmetic` lint with the more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?), so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in the parent span.
- Removed the `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints the full date, time, and nanoseconds.
- Mini refactor of sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it.
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective, but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.


* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>