State sync improvements - current progress & plan
Our goal is to drastically improve the performance of state sync (so that we can both scale to 10 shards and allow chunk producers to track only a single shard).
The work consists of 4 separate pieces:
- improving the speed of state sync part creation
- improving the reliability of fetching these parts
- improving the speed of applying these parts
- making it work when the chain is under load
Improving speed of part creation (in progress)
TL;DR - the current code creates the parts by iterating over trie nodes, which is slow, as these are basically random accesses.
The idea is to iterate over flat storage instead, which should be faster, as it is a linear scan.
The small-scale experiment has finished successfully (we're able to generate parts from flat storage that match the ones generated from the Trie).
Currently working on setting up the large-scale experiment to measure the performance.
@Longarithm: let's track progress on #8899.
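To make the flat-storage idea concrete, here is a minimal sketch using a plain ordered map as a stand-in for flat storage; `build_part` and the key-range part boundaries are illustrative assumptions, not the actual nearcore API:

```rust
use std::collections::BTreeMap;

// Stand-in for flat storage: an ordered key -> value mapping, as laid out on
// disk. Scanning a contiguous key range here is sequential I/O.
type FlatStorage = BTreeMap<Vec<u8>, Vec<u8>>;

// Collect the key-value pairs for one state part by scanning the part's key
// range linearly, instead of walking trie nodes (random access). The parts
// produced this way must still match the ones generated from the Trie.
fn build_part(
    flat: &FlatStorage,
    part_start: Vec<u8>,
    part_end: Vec<u8>,
) -> Vec<(Vec<u8>, Vec<u8>)> {
    flat.range(part_start..part_end)
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}
```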
Reliability of fetching the state parts (not started)
Currently the parts are sent via our Tier2 network (as RoutedMessages). This can put a very large load on the network, especially on the uplinks of some peers.
The idea is to add additional sources (as alternatives) from which nodes can download the parts.
We're thinking about doing this with S3: some Pagoda nodes would upload the current parts to S3, and other nodes would be free to download them from there.
Each part can be verified independently, so there is no additional trust needed.
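A minimal sketch of that download path, assuming a hypothetical bucket layout and a `verify_state_part` placeholder standing in for the real proof check (the fetch uses `reqwest` with the `blocking` feature):

```rust
// Fetch one state part from an S3-style HTTP endpoint and verify it before
// use. The bucket layout and `verify_state_part` are assumptions for
// illustration, not the actual nearcore code.
fn fetch_and_verify_part(
    bucket_url: &str,
    epoch_id: &str,
    shard_id: u64,
    part_id: u64,
    state_root: &[u8; 32],
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    // Hypothetical object layout: <bucket>/<epoch>/shard_<id>/part_<id>
    let url = format!("{bucket_url}/{epoch_id}/shard_{shard_id}/part_{part_id}");
    let bytes = reqwest::blocking::get(&url)?.bytes()?.to_vec();

    // Each part is independently verifiable against the state root, so the
    // mirror does not need to be trusted.
    if !verify_state_part(state_root, part_id, &bytes) {
        return Err("state part failed verification".into());
    }
    Ok(bytes)
}

// Placeholder for the real check (validating the part's trie nodes against
// the state root).
fn verify_state_part(_root: &[u8; 32], _part_id: u64, _bytes: &[u8]) -> bool {
    true
}
```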
Improving application speed (not started)
Currently the state sync parts are applied only after all of them have been fetched, and the application is single-threaded. We can drastically improve this by applying parts in parallel, as soon as each one is received.
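A rough sketch of that pipeline, using `rayon`'s `par_bridge` to fan work out to a thread pool; `download_part` and `apply_part` are hypothetical stand-ins for the real fetch and apply logic:

```rust
use rayon::iter::{ParallelBridge, ParallelIterator};
use std::sync::mpsc;
use std::thread;

fn sync_state(num_parts: u64) {
    let (tx, rx) = mpsc::channel::<(u64, Vec<u8>)>();

    // Fetcher thread: hands each part off the moment it finishes downloading,
    // instead of waiting for the whole set.
    thread::spawn(move || {
        for part_id in 0..num_parts {
            tx.send((part_id, download_part(part_id))).unwrap();
        }
    });

    // Worker pool: applies parts concurrently as they arrive. Parts can be
    // applied independently, so arrival order does not matter.
    rx.into_iter().par_bridge().for_each(|(part_id, bytes)| {
        apply_part(part_id, &bytes);
    });
}

fn download_part(_part_id: u64) -> Vec<u8> { Vec::new() } // placeholder
fn apply_part(_part_id: u64, _bytes: &[u8]) {}            // placeholder
```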
Making it work when chain is under load (in progress)
Currently, the node downloads the state (which might take a couple of hours), and afterwards it has to run a 'catchup' - that is, apply all the transactions that happened while it was downloading the state.
This means that if the network is full, nodes are under large time pressure to download the state ASAP. (If the epoch is 12h and you spend 7h downloading the state, you have 5h left to apply all 12h worth of transactions for this epoch - so you'd have to process transactions at ~2.4x speed, which might not be possible if the network is under load.)
To fix this, we're experimenting with ShardShadowing after StateSync.
The idea is the following (see the sketch after this list):
- the node does the state sync
- then it does the shard-shadow (basically receiving and applying the state deltas)
- and then it switches to transaction application (catchup)
The assumption is that the state deltas can be downloaded and applied a lot faster than the catchup blocks.
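As a sketch, the three stages could look like this; all function names are hypothetical stand-ins, not the actual nearcore code:

```rust
fn sync_shard_under_load(shard_id: u64) {
    // 1. State sync: download the state snapshot as of some past block
    //    (this may take hours on a large shard).
    let sync_block = state_sync(shard_id);

    // 2. Shard shadowing: fetch and apply the per-block state deltas produced
    //    while we were downloading - assumed much cheaper than re-executing
    //    the transactions themselves.
    let shadow_head = apply_state_deltas(shard_id, sync_block);

    // 3. Catchup: close the small remaining gap by applying transactions
    //    block by block, as today.
    catchup(shard_id, shadow_head);
}

// Placeholders returning block heights, for illustration only.
fn state_sync(_shard_id: u64) -> u64 { 0 }
fn apply_state_deltas(_shard_id: u64, from_block: u64) -> u64 { from_block }
fn catchup(_shard_id: u64, _from_block: u64) {}
```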