Add section about partial witness distribution #550

Merged 3 commits, Jun 18, 2024
75 changes: 60 additions & 15 deletions in neps/nep-0509.md

It is basically a triple of `(ChunkHash, AccountId, Signature)`.
Receiving this message means that a specific chunk validator account endorsed the chunk with a specific chunk hash.
Ideally, a chunk validator would send the chunk endorsement just to the next block producer at the same height for which the chunk was produced.
However, the block at that height can be skipped, and block producers at heights h+1, h+2, ... will have to pick up the chunk.
To address that, we send `ChunkEndorsement` to all block producers at heights from `h` to `h+d-1`. We pick `d=5` as more than 5 skipped blocks in a row are very unlikely to occur.

On the block producer side, chunk endorsements are collected and stored in `ChunkEndorsementTracker`.
A small **caveat** is that *sometimes* a chunk endorsement may be received before the chunk header, which is required to verify that the sender is indeed a validator of the chunk.
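The fan-out rule above can be sketched as follows. This is an illustrative helper, not the actual nearcore API; the function name and `D` constant are assumptions mirroring the `d=5` chosen in the text.

```rust
// Which block producer heights receive a `ChunkEndorsement` for a chunk
// produced at height `h`. Sending to heights h..h+d-1 lets the endorsement
// survive up to d-1 skipped blocks in a row.
const D: u64 = 5;

fn endorsement_target_heights(h: u64) -> Vec<u64> {
    (h..h + D).collect()
}

fn main() {
    // A chunk produced at height 100 is endorsed to producers at 100..=104.
    assert_eq!(
        endorsement_target_heights(100),
        vec![100, 101, 102, 103, 104]
    );
}
```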
After that, for each height in the epoch, [EpochInfo::sample_chunk_validators](h

This way, everyone tracking block headers can compute chunk validator assignment for each height and shard.

### Size limits

`ChunkStateWitness` is a relatively large message. Given the large number of receivers as well, its size must be strictly limited.
If the `ChunkStateWitness` for some state transition gets so uncontrollably large that it can never be handled by the majority of validators, then its shard gets stuck.

We try to limit the size of the `ChunkStateWitness` to 16 MiB. All the limits are described [in this section](https://github.com/near/nearcore/blob/b34db1e2281fbfe1d99a36b4a90df3fc7f5d00cb/docs/misc/state_witness_size_limits.md).
Additionally, we have a limit on the number of currently stored partial state witnesses and chunk endorsements, because malicious chunk validators can spam these as well.

### Partial state witness

## State witness size limits

A number of new limits will be introduced in order to keep the size of `ChunkStateWitness` reasonable.
`ChunkStateWitness` contains all the incoming transactions and receipts that will be processed during chunk application, and in theory a single receipt could be tens of megabytes in size. Distributing a `ChunkStateWitness` this large to all chunk validators would be troublesome, so we limit the size and number of transactions, receipts, etc. The limits aim to keep the total uncompressed size of `ChunkStateWitness` under 16 MiB.

There are two types of size limits:
* Hard limit - The size must be below this limit, anything else is considered invalid. This is usually used in the context of having limits for a single item.
* Soft limit - Things are added until the limit is exceeded, after that things stop being added. The last added thing is allowed to slightly exceed the limit. This is used in the context of having limits for a list of items.
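The two limit types can be sketched as follows. This is an illustrative sketch, not nearcore code; the function names are hypothetical.

```rust
// Hard limit: a single item at or above the limit is invalid and rejected.
fn passes_hard_limit(item_size: usize, limit: usize) -> bool {
    item_size < limit
}

// Soft limit: items are added until the running total exceeds the limit;
// the last added item is allowed to slightly overshoot it.
fn take_under_soft_limit(sizes: &[usize], limit: usize) -> Vec<usize> {
    let mut total = 0;
    let mut taken = Vec::new();
    for &s in sizes {
        if total > limit {
            break; // stop only after the limit has already been exceeded
        }
        taken.push(s);
        total += s;
    }
    taken
}

fn main() {
    assert!(!passes_hard_limit(5, 4));
    // 3 + 3 = 6 exceeds the soft limit of 5, so the third item is not added.
    assert_eq!(take_under_soft_limit(&[3, 3, 3], 5), vec![3, 3]);
}
```

Note how the soft-limit total can end up above the limit by at most one item's size, which is exactly why the worst-case storage proof below can exceed its nominal limit.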

The limits are:
* `max_transaction_size - 1.5 MiB`
* All transactions must be below 1.5 MiB, otherwise they'll be considered invalid and rejected.
* Previously was 4 MiB, now reduced to 1.5 MiB
* `max_receipt_size - 4 MiB`:
* All receipts must be below 4 MiB, otherwise they'll be considered invalid and rejected.
* Previously there was no limit on receipt size. Set to 4 MiB, might be reduced to 1.5 MiB in the future to match the transaction limit.
* `combined_transactions_size_limit - 4 MiB`
* Hard limit on total size of transactions from this and previous chunk. `ChunkStateWitness` contains transactions from two chunks, this limit applies to the sum of their sizes.
* `new_transactions_validation_state_size_soft_limit - 500 KiB`
* Validating new transactions generates storage proof (recorded trie nodes), which has to be limited. Once transaction validation generates more storage proof than this limit, the chunk producer stops adding new transactions to the chunk.
* `per_receipt_storage_proof_size_limit - 4 MB`
* Executing a receipt generates storage proof. A single receipt is allowed to generate at most 4 MB of storage proof. This is a hard limit, receipts which generate more than that will fail.
* `main_storage_proof_size_soft_limit - 3 MB`
* This is a limit on the total size of storage proof generated by receipts in one chunk. Once receipts generate more storage proof than this limit, the chunk producer stops processing receipts and moves the rest to the delayed queue.
* It's a soft limit, which means that the total size of storage proof could reach 7 MB (2.99 MB + one receipt which generates 4 MB of storage proof)
If it turns out that some limits weren't respected, the validators will generate
When a shard is missing some chunks, the following chunk on that shard will receive receipts from multiple blocks. This could lead to large `source_receipt_proofs` so a mechanism is added to reduce the impact. If there are two or more missing chunks in a row,
the shard is considered fully congested and no new receipts will be sent to it (unless it's the `allowed_shard` to avoid deadlocks).

## ChunkStateWitness distribution

For chunk production, the chunk producer is required to distribute the chunk state witness to all the chunk validators. The chunk validators then validate the chunk and send their chunk endorsements to the block producer. Chunk state witness distribution is on the latency-critical path of block production.

As we saw in the section above, the maximum size of the state witness can be ~16 MiB. If the chunk producer were to send the chunk state witness to all the chunk validators it would add a massive bandwidth requirement for the chunk producer. To ease and distribute the network requirements across all the chunk producers, we have a distribution mechanism similar to what we have for chunks in the shards manager. We divide the chunk state witness into a number of parts, and let the chunk validators distribute the parts among themselves, and later reconstruct the chunk state witness.

### Distribution mechanism

A chunk producer divides the state witness into a set of `N` parts where `N` is the number of chunk validators. The parts or partial witnesses are represented as [PartialEncodedStateWitness](https://github.com/near/nearcore/blob/66d3b134343d9f35f6e0b437ebbdbef3e4aa1de3/core/primitives/src/stateless_validation.rs#L40). Each chunk validator is the owner of one part. The chunk producer uses the [PartialEncodedStateWitnessMessage](https://github.com/near/nearcore/blob/66d3b134343d9f35f6e0b437ebbdbef3e4aa1de3/chain/network/src/state_witness.rs#L11) to send each part to their respective owners. The chunk validator part owners, on receiving the `PartialEncodedStateWitnessMessage`, forward this part to all other chunk validators via the [PartialEncodedStateWitnessForwardMessage](https://github.com/near/nearcore/blob/66d3b134343d9f35f6e0b437ebbdbef3e4aa1de3/chain/network/src/state_witness.rs#L15). Each validator then uses the partial witnesses received to reconstruct the full chunk state witness.
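The two-hop message count behind this scheme can be sketched with a hypothetical helper (illustrative only, not nearcore code):

```rust
// Hop 1: the chunk producer sends part i to chunk validator i (the part's owner)
//        via `PartialEncodedStateWitnessMessage`.
// Hop 2: each owner forwards its part to every other chunk validator
//        via `PartialEncodedStateWitnessForwardMessage`.
fn messages_sent(n_validators: usize) -> (usize, usize) {
    let hop1 = n_validators; // producer -> each owner
    let hop2 = n_validators * (n_validators - 1); // each owner -> all others
    (hop1, hop2)
}

fn main() {
    // With 16 chunk validators: 16 direct sends, 240 forwards.
    assert_eq!(messages_sent(16), (16, 240));
}
```

The producer only pays for hop 1; hop 2's cost is spread evenly across the validators, which is the point of the scheme.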

We have a separate [PartialWitnessActor](https://github.com/near/nearcore/blob/66d3b134343d9f35f6e0b437ebbdbef3e4aa1de3/chain/client/src/stateless_validation/partial_witness/partial_witness_actor.rs#L32) actor/module that is responsible for dividing the state witness into parts, distributing the parts, handling both the partial encoded state witness message and the forward message, validating and storing the parts, and reconstructing the state witness from the parts and sending it to the chunk validation module.

### Building redundancy using Reed Solomon Erasure encoding

During the distribution mechanism, it's possible that some of the chunk validators are malicious, offline, or have a high network latency. Since chunk witness distribution is on the critical path for block production, we safeguard the distribution mechanism by building in redundancy using the Reed Solomon Erasure encoding.

With Reed Solomon Erasure encoding, we can divide the chunk state witness into `N` total parts with `D` data parts. We can reconstruct the whole state witness as long as we have `D` of the `N` parts. The ratio of data parts `r = D/N` is something we can play around with.

While reducing `r`, i.e. reducing the number of data parts required to reconstruct the state witness, does allow for a more robust distribution mechanism, it comes with the cost of bloating the overall size of the parts we need to distribute. If `S` is the size of the state witness, after Reed Solomon encoding the total size `S'` of all parts becomes `S' = S/r` or `S' = S * N / D`.
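The size inflation formula can be checked directly (illustrative sketch; the function name is hypothetical and integer division is used for simplicity):

```rust
// Total size of all erasure-coded parts: S' = S * N / D (i.e. S / r).
fn encoded_total_size(s_bytes: u64, data_parts: u64, total_parts: u64) -> u64 {
    s_bytes * total_parts / data_parts
}

fn main() {
    let s = 16 * 1024 * 1024; // 16 MiB witness
    // N = 10 parts, D = 6 data parts, so r = 0.6: parts total ~26.7 MiB.
    let s_prime = encoded_total_size(s, 6, 10);
    assert_eq!(s_prime, s * 10 / 6);
    assert!(s_prime > s); // redundancy always inflates the total
}
```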

For the first release of stateless validation, we've kept the ratio at `0.6`, meaning that ~2/3rd of all chunk validators need to be online for the chunk state witness distribution mechanism to work smoothly.

One thing to note here is that the redundancy and upkeep requirement of 2/3rd is based on the _number_ of chunk validators and not the _stake_ of chunk validators.

### PartialEncodedStateWitness structure

The partial encoded state witness has the following fields:
- `(epoch_id, shard_id, height_created)`: These three fields together uniquely determine the chunk associated with the partial witness. Since the chunk and chunk header distribution mechanism is independent of the partial witness, we rely on this triplet to uniquely identify which chunk a part is associated with.
- `part_ord`: The index or id of the part in the array of partial witnesses.
- `part`: The data associated with the part.
- `encoded_length`: The total length of the state witness. This is required in the Reed Solomon decoding process to reconstruct the state witness.
- `signature`: Each part is signed by the chunk producer. This way the validity of the partial witness can be verified by the chunk validators receiving the parts.
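The fields above can be sketched as a struct. This mirrors the description in the text, not the exact nearcore definition; the concrete field types are assumptions.

```rust
// Rough shape of `PartialEncodedStateWitness` as described above.
#[allow(dead_code)]
struct PartialEncodedStateWitness {
    // Together these three uniquely identify the chunk the part belongs to.
    epoch_id: [u8; 32],
    shard_id: u64,
    height_created: u64,
    part_ord: usize,     // index of this part among the N parts
    part: Vec<u8>,       // the erasure-coded data of this part
    encoded_length: u64, // total witness length, needed for RS decoding
    signature: Vec<u8>,  // chunk producer's signature over the part
}

fn main() {
    let p = PartialEncodedStateWitness {
        epoch_id: [0u8; 32],
        shard_id: 3,
        height_created: 100,
        part_ord: 0,
        part: vec![0u8; 1024],
        encoded_length: 16 * 1024 * 1024,
        signature: Vec::new(),
    };
    assert_eq!(p.part.len(), 1024);
}
```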

The `PartialEncodedStateWitnessTracker` module is responsible for the storage and decoding of partial witnesses. This module has an LRU cache that stores all the partial witnesses with the `(epoch_id, shard_id, height_created)` triplet as the key. We reconstruct the state witness as soon as we have `D` of the `N` parts and forward the state witness to the validation module.

### Network tradeoffs

To get a sense of the network requirements for validators with and without the partial state witness distribution mechanism, we can do some quick back-of-the-envelope calculations. Let `N` be the number of chunk validators, `S` the size of the chunk state witness, and `r` the ratio of data parts to total parts for Reed Solomon Erasure encoding.

Without the partial state witness distribution, each chunk producer would have to send the state witness to all chunk validators, which would require a bandwidth `B` of `B = N * S`. For the worst case of ~16 validators and ~16 MiB of state witness size, this can be a burst requirement of 2 Gbps.

Partial state witness distribution takes this load off the chunk producer and distributes it evenly among all the chunk validators. However, we get an additional factor of `1/r` of extra data being transferred for redundancy. Each partial witness has a size of `P = S' / N` or `P = S / r / N`. The chunk producer and validators each need a bandwidth `B` of `B = P * N` or `B = S / r` to forward their owned part to all `N` chunk validators. For the worst case of ~16 MiB of state witness size and an encoding ratio of `0.6`, this works out to ~214 Mbps, which is much more reasonable.
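The back-of-the-envelope numbers can be reproduced as follows (illustrative sketch; sizes in MiB and bandwidth in binary megabits per second, assuming the data must be sent within one second):

```rust
// Burst bandwidth in Mib/s if `size_mib` MiB must go out within one second.
fn burst_mbps(size_mib: f64) -> f64 {
    size_mib * 8.0
}

fn main() {
    let s = 16.0; // ~16 MiB witness
    let n = 16.0; // ~16 chunk validators
    let r = 0.6;  // ratio of data parts to total parts

    // Without partial distribution: the producer sends the full witness to all.
    let naive = burst_mbps(s * n);
    assert!((naive - 2048.0).abs() < 1.0); // ~2 Gbps burst for the producer

    // With partial distribution: each node sends B = S / r in total.
    let partial = burst_mbps(s / r);
    assert!((partial - 213.3).abs() < 0.5); // ~214 Mbps per node
}
```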

### Future work

In the Reed Solomon Erasure encoding section we discussed that the chunk state witness distribution mechanism relies on 2/3rd of the _number_ of chunk validators being available and non-malicious, not 2/3rd of the _total stake_ of the chunk validators. This can cause a potential issue where more than 1/3rd of the chunk validators, holding a small enough fraction of stake, are unavailable and stall chunk production. In the future we would like to address this problem.

## Validator Role Change
Currently, there are two different types of validators and their responsibilities are as follows:
| | Top ~50% validators | Remaining validators (Chunk only producers) |
| --- | --- | --- |
| chunk production | Y | Y |
| block validation | Y | N |


### Protocol upgrade

The good property of the approach taken is that protocol upgrade happens almost seamlessly.