Commit 910cf6c: Resharding V3 - state witness, implementation (#577)
Longarithm authored Nov 22, 2024 (1 parent: 080ac3c)

Adding State Witness section and filling some empty sections.

Showing 1 changed file: neps/nep-0568.md (110 additions, 25 deletions)

### Stateless Validation

As only a fraction of nodes track the split shard, the transition from the state root of the parent shard
to the new state roots of the children shards must be proven to the other validators.
Otherwise, the chunk producers for the split shard could collude and provide invalid state roots,
which might compromise the protocol, for example by minting tokens out of thin air.

The design allows this state transition to be generated and checked in time negligible compared to the time it takes to apply a chunk.
As shown in the [State Storage - MemTrie](#state-storage---memtrie) section above, the generation and verification logic consists of a constant number of trie lookups.
More specifically, we implement a `retain_split_shard(boundary_account, RetainMode::{Left, Right})` method for the trie, which leaves only the keys in the trie that
belong to the left or right child shard.
Internally, it calls a `retain_multi_range(intervals)` method, where `intervals` is a vector of trie key intervals to retain.
Each interval corresponds to a unique trie key type prefix byte (`Account`, `AccessKey`, etc.) and defines either an interval from the empty key to the `boundary_account` key for the left shard, or from `boundary_account` to infinity for the right shard.
`retain_multi_range` is recursive. Based on the trie key prefix covered by the current node, it does one of the following (sketched in code after the list):

* returns the node unchanged, if its subtree is fully contained within some interval;
* returns an "empty" node, if its subtree lies outside all intervals;
* otherwise, descends into all children and constructs a new node from the children returned by the recursive calls.
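
A minimal sketch of this recursion, using simplified stand-ins for the node and interval types (the actual nearcore implementation operates on memtrie and partial-trie node abstractions and also records the proof along the way):

```rust
// Sketch only: simplified stand-ins for the nearcore memtrie types.

/// Keys in `[start, end)`; `end == None` means "to infinity".
struct Interval {
    start: Vec<u8>,
    end: Option<Vec<u8>>,
}

enum Coverage {
    Inside,  // subtree fully contained in some interval
    Outside, // subtree outside all intervals
    Partial, // subtree straddles an interval boundary
}

enum TrieNode {
    Empty,
    Leaf { tail: Vec<u8>, value: Vec<u8> }, // full key = prefix ++ tail
    Branch { children: Vec<(u8, TrieNode)> },
}

/// Exclusive upper bound of all keys starting with `prefix` (`None` = unbounded).
fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut ub = prefix.to_vec();
    while let Some(last) = ub.last_mut() {
        if *last < u8::MAX {
            *last += 1;
            return Some(ub);
        }
        ub.pop();
    }
    None
}

/// Classify the key range covered by `prefix` against the retain intervals.
fn coverage(prefix: &[u8], intervals: &[Interval]) -> Coverage {
    let ub = prefix_upper_bound(prefix);
    let inside = |iv: &Interval| {
        prefix >= iv.start.as_slice()
            && match (&ub, &iv.end) {
                (_, None) => true,
                (Some(ub), Some(end)) => ub <= end,
                (None, Some(_)) => false,
            }
    };
    let outside = |iv: &Interval| {
        matches!(&ub, Some(ub) if ub.as_slice() <= iv.start.as_slice())
            || matches!(&iv.end, Some(end) if prefix >= end.as_slice())
    };
    if intervals.iter().any(inside) {
        Coverage::Inside
    } else if intervals.iter().all(outside) {
        Coverage::Outside
    } else {
        Coverage::Partial
    }
}

fn key_retained(key: &[u8], intervals: &[Interval]) -> bool {
    intervals.iter().any(|iv| {
        key >= iv.start.as_slice()
            && iv.end.as_ref().map_or(true, |end| key < end.as_slice())
    })
}

fn retain_multi_range(node: TrieNode, prefix: &mut Vec<u8>, intervals: &[Interval]) -> TrieNode {
    match coverage(prefix, intervals) {
        Coverage::Inside => node,             // return the node back unchanged
        Coverage::Outside => TrieNode::Empty, // prune the whole subtree
        Coverage::Partial => match node {
            TrieNode::Empty => TrieNode::Empty,
            TrieNode::Leaf { tail, value } => {
                // Keep a boundary leaf iff its full key is retained.
                let mut key = prefix.clone();
                key.extend_from_slice(&tail);
                if key_retained(&key, intervals) { TrieNode::Leaf { tail, value } } else { TrieNode::Empty }
            }
            TrieNode::Branch { children } => {
                // Descend into all children; rebuild the node from the results.
                let children: Vec<_> = children
                    .into_iter()
                    .map(|(byte, child)| {
                        prefix.push(byte);
                        let child = retain_multi_range(child, prefix, intervals);
                        prefix.pop();
                        (byte, child)
                    })
                    .filter(|(_, child)| !matches!(child, TrieNode::Empty))
                    .collect();
                if children.is_empty() { TrieNode::Empty } else { TrieNode::Branch { children } }
            }
        },
    }
}
```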

The implementation is agnostic to the trie storage used for retrieving nodes; it applies to both the memtrie and partial storage (state proof), as illustrated below:

* calling it on the memtrie generates a proof and the new state root;
* calling it on partial storage generates the new state root. If the method doesn't fail with an error saying that a node wasn't found in the proof, the proof was sufficient, and it only remains to compare the generated state root with the one proposed by the chunk producer.
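
A schematic of the two call sites, with stand-in types and signatures (illustrative only, not the exact nearcore API):

```rust
// Illustrative stand-ins; not the actual nearcore types or signatures.
type AccountId = String;
type StateRoot = [u8; 32];
struct PartialState; // the proof: the set of trie nodes touched by the walk
struct MemTrie;
struct PartialTrie;
enum RetainMode { Left, Right }
struct MissingProofNode; // hit when the walk needs a node absent from the proof

impl MemTrie {
    /// Chunk producer side: one walk yields the child's state root and, by
    /// recording every visited node, the resharding proof.
    fn retain_split_shard(&self, _boundary: &AccountId, _mode: RetainMode)
        -> (StateRoot, PartialState) {
        unimplemented!("stand-in")
    }
}

impl PartialTrie {
    /// Validator side: the same walk over the proof alone; if no node is
    /// missing, the resulting root is compared with the proposed one.
    fn retain_split_shard(&self, _boundary: &AccountId, _mode: RetainMode)
        -> Result<StateRoot, MissingProofNode> {
        unimplemented!("stand-in")
    }
}
```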

### State Witness

The resharding state transition becomes one of the `implicit_transitions` in `ChunkStateWitness`. It must be validated between processing the last chunk (potentially missing) in the old epoch and the first chunk (potentially missing) in the new epoch. The `ChunkStateTransition` fields also map nicely onto the resharding state transition: in `block_hash` we store the hash of the last block of the parent shard, in `base_state` we store the resharding proof, and in `post_state_root` we store the proposed state root.
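
Schematically, the resharding entry carries exactly these three pieces (a simplified stand-in; consult the nearcore source for the authoritative `ChunkStateTransition` definition):

```rust
// Simplified stand-in mirroring the field mapping described above.
type CryptoHash = [u8; 32];
struct PartialState; // trie nodes recorded while running retain_split_shard

struct ChunkStateTransition {
    /// For resharding: the hash of the last block of the parent shard.
    block_hash: CryptoHash,
    /// For resharding: the resharding proof.
    base_state: PartialState,
    /// For resharding: the proposed state root of the child shard.
    post_state_root: CryptoHash,
}
```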

Note that this leads to **two** state transitions corresponding to the same block hash. On the chunk producer side, the first transition is stored for the `(block_hash, parent_shard_uid)` pair and the second for the `(block_hash, child_shard_uid)` pair.

The chunk validator has all the blocks, so it can independently determine whether an implicit transition corresponds to applying a missing chunk or to resharding. This is implemented in `get_state_witness_block_range`, which iterates from `state_witness.chunk_header.prev_block_hash()` back to the block that includes the last chunk for the (parent) shard, if such a block exists.

Then, in `validate_chunk_state_witness`, if an implicit transition corresponds to resharding, the chunk validator calls `retain_split_shard` and thereby validates the state transition from the parent to the child shard.

### State Sync

Changes to the state sync protocol aren't typically considered protocol changes requiring a version bump, since state sync is concerned with downloading state that isn't present locally, rather than with the rules for executing blocks and chunks. But it might still be helpful to outline some planned changes to state sync intended to make the resharding implementation easier to work with.

## Reference Implementation

### Overview
<!-- markdownlint-disable MD029 -->

1. Any node tracking the shard must determine whether it should split the shard in the last block before the epoch in which the resharding is scheduled to happen.

```pseudocode
should_split_shard(block, shard_id):
    shard_layout = epoch_manager.shard_layout(block.epoch_id())
    next_shard_layout = epoch_manager.shard_layout(block.next_epoch_id())
    if epoch_manager.is_next_block_epoch_start(block) &&
            shard_layout != next_shard_layout &&
            next_shard_layout.shard_split_map().contains(shard_id):
        return Some(next_shard_layout.split_shard_event(shard_id))
    return None
```

2. This logic is triggered during block postprocessing, meaning the block is valid and is being persisted to disk.

```pseudocode
on chain.postprocess_block(block):
    if let Some(split_shard_event) = should_split_shard(block, shard_id):
        next_shard_layout = epoch_manager.shard_layout(block.next_epoch_id())
        resharding_manager.split_shard(split_shard_event, next_shard_layout)
```

3. The event triggers changes in all state storage components.

```pseudocode
on resharding_manager.split_shard(split_shard_event, next_shard_layout):
    set State mapping
    start FlatState resharding
    process MemTrie resharding:
        freeze MemTrie, create HybridMemTries
        for each child shard:
            mem_tries[parent_shard].retain_split_shard(boundary_account)
```

4. `retain_split_shard` leaves only the keys in the trie that belong to the left or right child shard.
It retains the trie key intervals for the left or right child as described above, and the proof is generated simultaneously.
In the end, we get the new state root, a hybrid memtrie corresponding to the child shard, and the proof.
The proof is saved as the state transition for the pair `(block, new_shard_uid)`.

5. The proof is sent as one of the implicit transitions in `ChunkStateWitness`.

6. On the chunk validation path, the chunk validator determines whether resharding is
part of the state transition, using the same `should_split_shard` condition.

7. It calls `Trie(state_transition_proof).retain_split_shard(boundary_account)`,
which succeeds if the proof is sufficient and generates the new state root.

8. Finally, it checks that the new state root matches the state root proposed in `ChunkStateWitness`.
If the whole `ChunkStateWitness` is valid, the chunk validator sends an endorsement, which also endorses the resharding. A sketch of this validation path follows the list.
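
A compact sketch of steps 6-8, under the assumption of stand-in types and names (these are not the exact nearcore call sites):

```rust
// Illustrative sketch of steps 6-8; all names are stand-ins.
type AccountId = String;
type CryptoHash = [u8; 32];
struct PartialState; // the resharding proof from the state witness
enum RetainMode { Left, Right }

enum InvalidWitness { InsufficientProof, StateRootMismatch }

struct Trie; // a trie backed only by the nodes present in the proof
impl Trie {
    fn from_partial_state(_proof: &PartialState) -> Self { Trie }
    /// Fails with `InsufficientProof` if the walk needs a missing node.
    fn retain_split_shard(&self, _boundary: &AccountId, _mode: RetainMode)
        -> Result<CryptoHash, InvalidWitness> {
        unimplemented!("stand-in")
    }
}

/// Step 7: replay the retain over the proof. Step 8: compare state roots.
fn validate_resharding_transition(
    proof: &PartialState,
    proposed_root: CryptoHash,
    boundary_account: &AccountId,
    mode: RetainMode,
) -> Result<(), InvalidWitness> {
    let trie = Trie::from_partial_state(proof);
    let computed_root = trie.retain_split_shard(boundary_account, mode)?;
    if computed_root == proposed_root {
        Ok(()) // the chunk endorsement will also endorse the resharding
    } else {
        Err(InvalidWitness::StateRootMismatch)
    }
}
```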

### State Storage - MemTrie

The current implementation of MemTrie uses a pool of memory (`STArena`) to allocate and deallocate nodes, and uses internal pointers into this pool to reference child nodes. MemTries, unlike the State representation of the Trie, do not work with node hashes but with internal memory pointers directly. Additionally, MemTries are not thread-safe, and one MemTrie exists per shard.
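
To illustrate the difference, a branch node referencing children by arena position rather than by hash might look like this (simplified stand-ins, not the actual `STArena` layout):

```rust
// Simplified stand-ins; the real STArena node layout is more compact.

/// Position of a node inside the shard-local memory pool.
#[derive(Clone, Copy)]
struct ArenaPos(usize);

/// A memtrie branch node: children are raw positions in the arena rather
/// than node hashes, so following a child is a direct pointer chase.
struct MemTrieNode {
    children: [Option<ArenaPos>; 16],
}

/// One arena exists per shard; it is not thread-safe and is accessed from
/// a single thread only.
struct STArena {
    nodes: Vec<MemTrieNode>,
}

impl STArena {
    /// Dereference an internal pointer; no hash computation or lookup by
    /// hash is involved, unlike in the State-column trie.
    fn node(&self, pos: ArenaPos) -> &MemTrieNode {
        &self.nodes[pos.0]
    }
}
```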

Elements inherited by both children:

Elements inherited only by the lowest index child:

* `BUFFERED_RECEIPT_INDICES`
* `BUFFERED_RECEIPT`

#### Bring children shards up to date with the chain's head

## Security Implications

### Fork Handling

In theory, more than one candidate block can finish the last epoch with the old shard layout. For previous implementations this didn't matter, because the resharding decision was made at the beginning of the previous epoch. Now the decision is made at the epoch boundary, so the new implementation handles this case as well.

### Proof Validation

With single shard tracking, nodes can't independently validate new state roots after resharding, because they don't have the state of the shard being split. That's why we generate resharding proofs, whose generation and validation could become a new weak point. However, `retain_split_shard` is equivalent to a constant number of lookups in the trie, so its overhead is negligible. Even if a proof is invalid, it only means that `retain_split_shard` fails early, just like other invalid state transitions.

## Alternatives

In the solution space that keeps the blockchain stateful, we also considered an alternative that handles resharding through the `Receipts` mechanism. The workflow would be to:

* create an empty `target_shard`,
* require `source_shard` chunk producers to create special `ReshardingReceipt(source_shard, target_shard, data)` receipts, where `data` would be an interval of key-value pairs in `source_shard` along with a proof,
* have `target_shard` trackers and validators process the receipts, validate the proofs, and insert the key-value pairs into the new shard.

However, `data` would occupy most of the state witness capacity and would introduce the overhead of proving every single interval in `source_shard`. Moreover, syncing the target shard "dynamically" like this would also require some form of catchup, which makes it much less feasible than the chosen approach.

Another question is whether we should tie resharding to epoch boundaries at all. Decoupling the two would let us move from the resharding decision to its completion much faster. But for that, we would need to:

* agree whether we should reshard in the middle of an epoch, or allow a "fast epoch completion", which has yet to be implemented,
* keep chunk producers tracking "spare shards", ready to receive items from split shards,
* on a resharding event, implement a special form of state sync, through which the source and target chunk producers would agree on the new state roots offline,
* have the new state roots validated by chunk validators in the same fashion.

While this is much closer to Dynamic Resharding (below), it requires many more changes to the protocol. The chosen design also works well as an intermediate step toward it, if needed.

## Future possibilities


## Consequences


### Positive

* The protocol can execute resharding even while only a fraction of nodes track the split shard.
* State for new shard layouts is computed in a matter of minutes instead of hours, which greatly improves ecosystem stability during resharding. As before, from the point of view of NEAR users the resharding is instantaneous.

### Neutral

N/A

### Negative

* The storage components need to handle the additional complexity of controlling the shard layout change.

### Backwards Compatibility

Resharding is backwards compatible with existing protocol logic.

## Unresolved Issues (Optional)

