feat: dump state to s3 by multiple nodes #9049
Conversation
When we roll this out, can we somehow include some baseline template in the config file to give users a hint on how to enable the feature?
Is it common for multiple nodes to end up trying to upload the same part?
nearcore/src/state_sync.rs
let mut existing_nums = HashSet::new();
for name in file_names {
    let splitted: Vec<_> = name.split("_").collect();
    let part_id = splitted.get(2).unwrap().to_string().parse::<u64>()?;
Please move this to a function. String manipulation is error-prone, and moving it to a separate function:
- abstracts away the complexity
- lets us unit-test the complexity
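A minimal sketch of what such a helper could look like; the function names, and the assumption that the part id is the third underscore-separated token of the file name, are mine for illustration, not the PR's actual code:

```rust
use std::collections::HashSet;

/// Hypothetical helper: extracts the part id from a dumped-part file name.
/// Assumes the part id is the third underscore-separated token, matching the
/// original snippet's `splitted.get(2)`.
fn parse_part_id_from_file_name(name: &str) -> Option<u64> {
    name.split('_').nth(2)?.parse::<u64>().ok()
}

/// Collects the part ids that already exist, skipping names that don't parse.
fn existing_part_ids(file_names: &[String]) -> HashSet<u64> {
    file_names.iter().filter_map(|name| parse_part_id_from_file_name(name)).collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parses_well_formed_names() {
        // File name layout here is only an example.
        assert_eq!(parse_part_id_from_file_name("state_part_000042_of_000100"), Some(42));
    }

    #[test]
    fn rejects_malformed_names() {
        assert_eq!(parse_part_id_from_file_name("garbage"), None);
    }
}
```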
For now we don't expect users to dump state yet; we will start expecting that once the decentralized version is ready.
This will happen from time to time. We expect it to be less common at the start of an epoch's dump and to become more frequent as the number of parts left to dump for the epoch decreases.
@@ -446,31 +447,43 @@ trait StatePartReader {
fn get_state_part_reader(
This file can be refactored to remove StatePartReader and StatePartWriter, but it's fine to do that in a followup PR.
nearcore/src/state_sync.rs
async fn state_sync_dump(
    shard_id: ShardId,
    chain: Chain,
    epoch_manager: Arc<dyn EpochManagerAdapter>,
    shard_tracker: ShardTracker,
    runtime: Arc<dyn RuntimeAdapter>,
    chain_id: String,
    restart_dump_for_shards: Vec<ShardId>,
Please consider removing restart_dump_for_shards from ClientConfig and from Config (i.e. NearConfig::config).
On second thought, restart is still handy to have. In situations where the latest epoch is marked as all dumped by mistake, the node will never dump for that epoch again unless we wipe out data from rocksdb; restart_dump_for_shards solves this issue.
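A rough sketch of how that flag could be consumed at startup; the progress-store trait and function names here are hypothetical stand-ins, not the PR's actual API:

```rust
/// Mirrors nearcore's ShardId alias for the purpose of this sketch.
type ShardId = u64;

/// Hypothetical store holding the per-shard dump progress
/// (e.g. "all parts dumped" for the latest epoch).
trait DumpProgressStore {
    fn clear_progress(&mut self, shard_id: ShardId);
}

/// If an epoch was wrongly marked as fully dumped, wiping the stored progress
/// for the listed shards forces the dump loop to re-check S3 and resume dumping.
fn apply_restart_flags(store: &mut dyn DumpProgressStore, restart_dump_for_shards: &[ShardId]) {
    for &shard_id in restart_dump_for_shards {
        store.clear_progress(shard_id);
    }
}
```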
nearcore/src/state_sync.rs
    parts_to_dump.len().try_into().unwrap();
let mut cnt_parts_dumped = num_parts - cnt_parts_to_dump;
let timer = Instant::now();
while timer.elapsed().as_secs() <= 60 && !parts_to_dump.is_empty() {
Please define a constant instead of the magic value 60.
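For illustration, a named constant along those lines might look like this; the constant and helper names are assumptions, not the PR's code:

```rust
use std::time::{Duration, Instant};

/// Illustrative name; 60s matches the value discussed above. How long each
/// per-shard loop keeps dumping parts before re-checking S3 for parts that
/// other nodes may have uploaded in the meantime.
const PART_CHECK_PERIOD: Duration = Duration::from_secs(60);

/// The loop condition from the snippet above, rewritten against the constant.
fn keep_dumping(timer: &Instant, parts_to_dump_left: usize) -> bool {
    timer.elapsed() <= PART_CHECK_PERIOD && parts_to_dump_left > 0
}
```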
Great, thank you!
Force-pushed from 56f5a85 to 5051015.
To enable dumping with multiple nodes, do the following:
1. Add the state sync dump config to ~/.near/config.json on your node (a rough template follows below);
2. Run AWS_ACCESS_KEY_ID="xxxx" AWS_SECRET_ACCESS_KEY="xxx" ./target/release/neard run
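For reference, a baseline template of the kind asked about earlier in the thread might look roughly like this; the field names, nesting, and values are illustrative, not the authoritative schema:

```json
{
  "state_sync": {
    "dump": {
      "location": {
        "S3": {
          "bucket": "my-state-dump-bucket",
          "region": "us-west-1"
        }
      }
    }
  }
}
```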
How does the multi-node dump work?
Each node has one thread per shard. Every 60 seconds, each thread checks S3 for the parts still missing for its shard in the epoch it is dumping. After checking, it repeatedly draws one part_id without replacement from the missing parts, generates that part, and uploads it to S3 until the time is up. The thread repeats this process until all parts of that shard and epoch are dumped, then switches to the latest epoch that the chain head is in.
The nodes that are dumping state do not need to communicate with each other; they use S3 to check what has been dumped and what hasn't.
The frequency of checking S3 for missing parts is a trade-off between repeated dumps and S3 latency: the more often we check, the fewer repeated dumps happen, but the more the S3 check latency hurts the speed of the dump. We may end up using a frequency different from 60s.
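A hedged sketch of the per-shard loop described above; every helper name here is a hypothetical stand-in for the real code in nearcore/src/state_sync.rs:

```rust
use rand::seq::SliceRandom;
use std::time::{Duration, Instant};

/// How long to keep dumping before re-checking S3 (see the 60s discussion above).
const PART_CHECK_PERIOD: Duration = Duration::from_secs(60);

// Hypothetical helpers standing in for the real S3 / part-generation code.
fn list_missing_part_ids_on_s3(shard_id: u64, epoch_height: u64) -> Vec<u64> { unimplemented!() }
fn generate_and_upload_part(shard_id: u64, epoch_height: u64, part_id: u64) { unimplemented!() }
fn latest_epoch_height_at_head() -> u64 { unimplemented!() }

/// One shard's dump loop: check S3 for missing parts, then spend up to
/// PART_CHECK_PERIOD generating and uploading parts drawn at random without
/// replacement; repeat until the epoch is fully dumped, then move on to the
/// epoch the chain head is in.
fn dump_loop_for_shard(shard_id: u64) {
    loop {
        let epoch_height = latest_epoch_height_at_head();
        loop {
            let mut missing = list_missing_part_ids_on_s3(shard_id, epoch_height);
            if missing.is_empty() {
                break; // this epoch is fully dumped for this shard
            }
            // Shuffle, then pop: draws parts without replacement from our local view.
            missing.shuffle(&mut rand::thread_rng());
            let timer = Instant::now();
            while timer.elapsed() <= PART_CHECK_PERIOD {
                match missing.pop() {
                    Some(part_id) => generate_and_upload_part(shard_id, epoch_height, part_id),
                    None => break, // exhausted our local view; re-check S3
                }
            }
        }
        // All parts dumped for this epoch; loop again for the latest epoch at the head.
    }
}
```

Note that coordination happens only through S3: each node's view of the missing parts can be slightly stale, which is exactly why duplicate uploads become more likely as the pool of remaining parts shrinks.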