State sync improvements - current progress & plan
Our goal is to drastically improve the performance of state sync (so that we can both scale to 10 shards and allow chunk producers to track only a single shard).
The work consists of 4 separate pieces:
- improving the speed of state sync part creation
- improving the reliability of fetching these parts
- improving the speed of applying these parts
- making it work when the chain is under load
Improving speed of part creation (in progress)
TL;DR - the current code creates the parts by iterating over trie nodes, which is slow, as these are basically random accesses.
The idea is to iterate over flat storage instead, which should be faster, as it is a linear scan.
The small-scale experiment has finished successfully (we're able to generate parts from flat storage that match the ones generated from the Trie).
Currently working on setting up the large-scale experiment to measure the performance.
@Longarithm: let's track progress on #8899.
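To make the flat-storage idea concrete, here is a minimal sketch using a plain ordered map as a stand-in for flat storage; `build_part` and the key-range part boundaries are illustrative assumptions, not the actual nearcore API:

```rust
use std::collections::BTreeMap;

// Stand-in for flat storage: an ordered key -> value mapping, as laid out on
// disk. Scanning a contiguous key range here is sequential I/O.
type FlatStorage = BTreeMap<Vec<u8>, Vec<u8>>;

// Collect the key-value pairs for one state part by scanning the part's key
// range linearly, instead of walking trie nodes (random access). The parts
// produced this way must still match the ones generated from the Trie.
fn build_part(
    flat: &FlatStorage,
    part_start: Vec<u8>,
    part_end: Vec<u8>,
) -> Vec<(Vec<u8>, Vec<u8>)> {
    flat.range(part_start..part_end)
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}
```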
Reliability of fetching the state parts (not started)
Currently the parts are sent via our Tier2 network (as RoutedMessages). This can put a very large load on the network, especially on the uplinks of some peers.
The idea is to add additional sources (as alternatives) from which nodes can download the parts.
We're thinking about doing this with S3: some Pagoda nodes would upload the current parts to S3, and other nodes would be free to download them from there.
Each part can be verified independently, so there is no additional trust needed.
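A minimal sketch of that download path, assuming a hypothetical bucket layout and a `verify_state_part` placeholder standing in for the real proof check (the fetch uses `reqwest` with the `blocking` feature):

```rust
// Fetch one state part from an S3-style HTTP endpoint and verify it before
// use. The bucket layout and `verify_state_part` are assumptions for
// illustration, not the actual nearcore code.
fn fetch_and_verify_part(
    bucket_url: &str,
    epoch_id: &str,
    shard_id: u64,
    part_id: u64,
    state_root: &[u8; 32],
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    // Hypothetical object layout: <bucket>/<epoch>/shard_<id>/part_<id>
    let url = format!("{bucket_url}/{epoch_id}/shard_{shard_id}/part_{part_id}");
    let bytes = reqwest::blocking::get(&url)?.bytes()?.to_vec();

    // Each part is independently verifiable against the state root, so the
    // mirror does not need to be trusted.
    if !verify_state_part(state_root, part_id, &bytes) {
        return Err("state part failed verification".into());
    }
    Ok(bytes)
}

// Placeholder for the real check (validating the part's trie nodes against
// the state root).
fn verify_state_part(_root: &[u8; 32], _part_id: u64, _bytes: &[u8]) -> bool {
    true
}
```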
Improving application speed (not started)
Currently the state sync parts are applied only after all of them have been fetched, and the application is single-threaded. We can drastically improve this by applying parts in parallel, as soon as each one is received.
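A rough sketch of that pipeline, using `rayon`'s `par_bridge` to fan work out to a thread pool; `download_part` and `apply_part` are hypothetical stand-ins for the real fetch and apply logic:

```rust
use rayon::iter::{ParallelBridge, ParallelIterator};
use std::sync::mpsc;
use std::thread;

fn sync_state(num_parts: u64) {
    let (tx, rx) = mpsc::channel::<(u64, Vec<u8>)>();

    // Fetcher thread: hands each part off the moment it finishes downloading,
    // instead of waiting for the whole set.
    thread::spawn(move || {
        for part_id in 0..num_parts {
            tx.send((part_id, download_part(part_id))).unwrap();
        }
    });

    // Worker pool: applies parts concurrently as they arrive. Parts can be
    // applied independently, so arrival order does not matter.
    rx.into_iter().par_bridge().for_each(|(part_id, bytes)| {
        apply_part(part_id, &bytes);
    });
}

fn download_part(_part_id: u64) -> Vec<u8> { Vec::new() } // placeholder
fn apply_part(_part_id: u64, _bytes: &[u8]) {}            // placeholder
```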
Making it work when chain is under load (in progress)
Currently, the node downloads the state (which might take a couple of hours), and afterwards it has to run a 'catchup' - that is, apply all the transactions that happened while it was downloading the state.
This means that if the network is full, nodes are under large time pressure to download the state ASAP. (If the epoch is 12h and you spend 7h downloading the state, you have 5h left to apply all 12h worth of transactions for this epoch - so you'd have to process transactions at ~2.4x speed, which might not be possible if the network is under load.)
To fix this, we're experimenting with ShardShadowing after StateSync.
The idea is the following (see the sketch after this list):
- the node does the state sync
- then it does the shard-shadow (basically receiving and applying the state deltas)
- and then it switches to transaction application (catchup)
The assumption is that the state deltas can be downloaded and applied a lot faster than the catchup blocks.
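As a sketch, the three stages could look like this; all function names are hypothetical stand-ins, not the actual nearcore code:

```rust
fn sync_shard_under_load(shard_id: u64) {
    // 1. State sync: download the state snapshot as of some past block
    //    (this may take hours on a large shard).
    let sync_block = state_sync(shard_id);

    // 2. Shard shadowing: fetch and apply the per-block state deltas produced
    //    while we were downloading - assumed much cheaper than re-executing
    //    the transactions themselves.
    let shadow_head = apply_state_deltas(shard_id, sync_block);

    // 3. Catchup: close the small remaining gap by applying transactions
    //    block by block, as today.
    catchup(shard_id, shadow_head);
}

// Placeholders returning block heights, for illustration only.
fn state_sync(_shard_id: u64) -> u64 { 0 }
fn apply_state_deltas(_shard_id: u64, from_block: u64) -> u64 { from_block }
fn catchup(_shard_id: u64, _from_block: u64) {}
```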