From a5f7f06e1110c61f68d4a13269e635eda23fd31e Mon Sep 17 00:00:00 2001
From: Longarithm
Date: Thu, 13 Jun 2024 15:54:25 +0400
Subject: [PATCH] restruct

---
 neps/nep-0509.md | 231 ++++++++++++++++++++++++-----------------
 1 file changed, 119 insertions(+), 112 deletions(-)

diff --git a/neps/nep-0509.md b/neps/nep-0509.md
index 1d39f8c75..2f0013324 100644
--- a/neps/nep-0509.md
+++ b/neps/nep-0509.md
@@ -117,25 +117,123 @@ Let's formalise a proposed change to the validator roles and responsibilities, w
 * Do not require tracking any shard
 * Must collectively have a majority of all the validator stake, to ensure the security of chunk validation.
 
-See the Validator Structure Change section for more details.
+See the Validator Structure Change section below for more details.
 
-## Chunk Validator Shuffling
+## Validator Structure Change
 
-Chunk validators will be randomly assigned to validate shards, for each block (or as we may decide later, for multiple blocks in a row, if required for performance reasons). A chunk validator may be assigned multiple shards at once, if it has sufficient stake.
+### Roles
+Currently, there are two different types of validators. Their responsibilities are defined as in the following pseudocode:
 
-Each chunk validator's stake is divided into "mandates". There are full and partial mandates. The amount of stake for a full mandate is a fixed parameter determined by the stake distribution of all validators, and any remaining amount smaller than a full mandate is a partial mandate. A chunk validator therefore has zero or more full mandates plus up to one partial mandate. The list of full mandates and the list of partial mandates are then separately shuffled and partitioned equally (as in, no more than one mandate in difference between any two shards) across the shards. Any mandate assigned to a shard means that the chunk validator who owns the mandate is assigned to validate that shard. Because a chunk validator may have multiple mandates, it may be assigned multiple shards to validate.
+```python
+if index(validator) < 100:
+    roles(validator).append("block producer")
+roles(validator).append("chunk producer")
+```
+
+The validators are ordered by non-increasing stake in the considered epoch. Here and below, by "block production" we mean both production and validation.
+
+With stateless validation, this structure must change for several reasons:
+* Chunk production is the most resource-consuming activity.
+* Only chunk production needs the state in memory, while the other responsibilities can be completed by acquiring a state witness.
+* Chunk production does not have to be performed by all validators.
+
+Hence, to make the transition seamless, we change the role of nodes outside the top 100 to only validate chunks:
+
+```python
+if index(validator) < 100:
+    roles(validator).append("chunk producer")
+    roles(validator).append("block producer")
+roles(validator).append("chunk validator")
+```
+
+The more stake a validator has, the **heavier** the work it gets assigned, because we assume that validators with higher stakes have more powerful hardware.
+With stateless validation, the relative heaviness of the work changes: compared to the current order "block production" > "chunk production", the new order is "chunk production" > "block production" > "chunk validation".
+
+Shards are split equally among chunk producers: with the 6 shards Mainnet has as of 12 Jun 2024, each shard would have ~16 chunk producers assigned.
+
+In the future, as the number of shards `S` increases, we can generalise the assignment by saying that each shard should have `X` chunk producers, provided we have at least `X * S` validators. In that case, the pseudocode for the role assignment would look as follows:
+
+```python
+if index(validator) < X * S:
+    roles(validator).append("chunk producer")
+if index(validator) < 100:
+    roles(validator).append("block producer")
+roles(validator).append("chunk validator")
+```
+
+### Rewards
+
+The reward for each validator is defined as `total_epoch_reward * validator_relative_stake * work_quality_ratio`, where:
+* `total_epoch_reward` is selected so that total inflation of the token is 5% per annum;
+* `validator_relative_stake = validator_stake / total_epoch_stake`;
+* `work_quality_ratio` is a measure of work quality from 0 to 1.
+
+So, the actual reward never exceeds the total reward, and when everyone does perfect work, they are equal.
+For the purposes of this NEP, it is enough to assume that `work_quality_ratio = avg_{role}({role}_quality_ratio)`.
+So, if a node is both a block and a chunk producer, we compute the quality for each role separately and then take their average.
+
+When an epoch is finalized, the headers of all blocks in it uniquely determine who was expected to produce each block and chunk.
+Thus, if we define the quality ratio for a block producer as `produced_blocks/expected_blocks`, everyone is able to compute it.
+Similarly, `produced_chunks/expected_chunks` is the quality for a chunk producer.
+It is more accurate to say `included_chunks/expected_chunks`, because inclusion of a chunk in a block is the final decision of a block producer, and it is that decision which defines success here.
+
+Ideally, we would compute the quality for a chunk validator as `produced_endorsements/expected_endorsements`. Unfortunately, we won't do that in Stage 0 because:
+* The mask of endorsements is not part of the block header, and adding it would be a significant change;
+* A block producer doesn't have to wait for all endorsements to be collected, so it could be unfair to say that an endorsement was not produced when the block producer simply went ahead.
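To make the reward formula above concrete, here is a minimal Python sketch of the computation (the function and parameter names are illustrative, not nearcore code):

```python
# Illustrative sketch of:
#   reward = total_epoch_reward * validator_relative_stake * work_quality_ratio,
# where work_quality_ratio is the average of the per-role quality ratios.

def validator_reward(total_epoch_reward: float,
                     validator_stake: int,
                     total_epoch_stake: int,
                     role_quality_ratios: list[float]) -> float:
    validator_relative_stake = validator_stake / total_epoch_stake
    work_quality_ratio = sum(role_quality_ratios) / len(role_quality_ratios)
    return total_epoch_reward * validator_relative_stake * work_quality_ratio

# A node with 2% of the stake that is both a block and a chunk producer,
# producing 9 of 10 expected blocks with all 10 expected chunks included:
reward = validator_reward(1_000_000, 2, 100, [9 / 10, 10 / 10])
# reward == 1_000_000 * 0.02 * 0.95 == 19000.0
```

Note that a validator's reward is capped at its stake-proportional share of `total_epoch_reward` and reaches that cap only at perfect quality, matching the statement above.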
+
+So for now, we decided to compute the quality for a chunk validator as the ratio `included_chunks/expected_chunks`, where we iterate over the chunks the node was expected to validate.
+The obvious drawback here is that if chunks are not produced at all, chunk validators are impacted as well. We plan to address this in future releases.
+
+### Kickouts
+
+In addition to that, if a node's performance is too poor, we want a mechanism to kick it out of the validator list, to ensure healthy protocol performance and validator rotation.
+Currently, we have a threshold for each role, and if for some role the corresponding `{role}_quality_ratio` is lower than the threshold, the node is kicked out.
+
+In pseudocode:
+
+```python
+if validator is block producer and block_producer_quality_ratio < 0.8:
+    kick out validator
+if validator is chunk producer and chunk_producer_quality_ratio < 0.8:
+    kick out validator
+```
+
+For chunk validators, we apply exactly the same formula. However, because:
+* the formula doesn't count endorsements explicitly, and
+* for chunk producers it merely strengthens the chunk production condition without adding value,
+
+we apply it only to nodes which **only validate chunks**. So, we add this line:
+
+```python
+if validator is only chunk validator and chunk_validator_quality_ratio < 0.8:
+    kick out validator
+```
+
+As we pointed out above, the current `chunk_validator_quality_ratio` formula is problematic.
+Here it causes an even bigger issue: if chunk producers don't produce chunks, chunk validators will be kicked out as well, which impacts network stability.
+This is another reason to come up with a better formula.
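Putting the kickout rules above together, a minimal executable sketch could look as follows (the helper `should_kick_out` and the uniform 0.8 threshold are illustrative assumptions, not the actual nearcore implementation):

```python
KICKOUT_THRESHOLD = 0.8  # assumed uniform threshold for every role

def should_kick_out(roles: set[str], quality: dict[str, float]) -> bool:
    """Apply the per-role kickout rules sketched above.

    `quality` maps a role to its {role}_quality_ratio for the epoch.
    The chunk-validator rule applies only to nodes with no other role.
    """
    if "block producer" in roles and quality["block producer"] < KICKOUT_THRESHOLD:
        return True
    if "chunk producer" in roles and quality["chunk producer"] < KICKOUT_THRESHOLD:
        return True
    if roles == {"chunk validator"} and quality["chunk validator"] < KICKOUT_THRESHOLD:
        return True
    return False

# A top-100 node is judged on its production quality only:
should_kick_out({"block producer", "chunk producer", "chunk validator"},
                {"block producer": 0.9, "chunk producer": 0.85, "chunk validator": 0.5})  # → False
# A chunk-only validator below the threshold is kicked out:
should_kick_out({"chunk validator"}, {"chunk validator": 0.7})  # → True
```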
+
+### Shard assignment
+
+As chunk production becomes the most important role, we need to ensure that every epoch has a significant number of healthy chunk producers.
+This is a **strong difference** from the current logic, where chunk-only producers generally have low stake and their performance doesn't impact overall performance.
+
+The most challenging part of becoming a chunk producer for a shard is downloading the most recent shard state within the previous epoch. This is called "state sync".
+Unfortunately, as of now, state sync is centralised on published snapshots, which remains a major point of failure until we have decentralised state sync.
+
+Because of that, we make an additional change: if a node was a chunk producer for some shard in the previous epoch, and it is a chunk producer in the current epoch, it will be assigned to the same shard.
+This way, we minimise the number of state syncs required at each epoch.
+
+The exact algorithm needs a thorough description to cover different edge cases, so we will just leave a link to the full explanation: https://github.com/near/nearcore/issues/11213#issuecomment-2111234940.
+
 ## Reference Implementation
 
 Here we carefully describe new structures and logic introduced, without going into too much technical details.
 
 ### ChunkStateWitness
 
-Here we describe the most important properties, the full structure is described on [GitHub](https://github.com/near/nearcore/blob/b8f08d9ded5b7cbae9d73883785902b76e4626fc/core/primitives/src/stateless_validation.rs#L247).
-
-Let's construct the structure together with explaining why every field is needed. Start from simple data:
+The full structure is described [here](https://github.com/near/nearcore/blob/b8f08d9ded5b7cbae9d73883785902b76e4626fc/core/primitives/src/stateless_validation.rs#L247).
+Let's construct it sequentially, explaining why every field is needed. Start from simple data:
 ```rust
 pub struct ChunkStateWitness {
     pub chunk_producer: AccountId,
@@ -258,122 +356,31 @@ When chunk header is received, all pending endorsements are checked for validity
 All endorsements received after that are validated right away.
 Finally, when block producer attempts to produce a block, in addition to checking chunk existence, it also checks that it has 2/3 endorsement stake for that chunk hash.
 
-To make chunk inclusion verifiable, we introduce another version of block body `BlockBodyV2` which has new field `chunk_endorsements` which is basically a `Vec>>`
-where element with indices `(s, i)` contains signature of i-th chunk validator for shard s if it was included and None otherwise.
+To make chunk inclusion verifiable, we introduce [another version](https://github.com/near/nearcore/blob/cf2caa3513f58da8be758d1c93b0900ffd5d51d2/core/primitives/src/block_body.rs#L30) of the block body, `BlockBodyV2`, which has a new field `chunk_endorsements`.
+It is basically a `Vec<Vec<Option<Signature>>>` where the element with indices `(s, i)` contains the signature of the i-th chunk validator for shard `s` if it was included, and `None` otherwise.
 Lastly, we add condition to block validation, such that if chunk `s` was included in the block, then block body must contain 2/3 endorsements for that shard.
 
-### Partial state witness distribution
+This logic is triggered in `ChunkInclusionTracker` by the method [get_chunk_headers_ready_for_inclusion](https://github.com/near/nearcore/blob/6184e5dac45afb10a920cfa5532ce6b3c088deee/chain/client/src/chunk_inclusion_tracker.rs#L146) and a couple of similar ones. The number of ready chunks is returned by [num_chunk_headers_ready_for_inclusion](https://github.com/near/nearcore/blob/6184e5dac45afb10a920cfa5532ce6b3c088deee/chain/client/src/chunk_inclusion_tracker.rs#L178).
-
-TODO
-
-## Validator Structure Change
-
-### Roles
-Currently, there are two different types of validators. Their responsibilities are defined as in the following pseudocode:
-
-```python
-if index(validator) < 100:
-    roles(validator).append("block producer")
-roles(validator).append("chunk producer")
-```
-
-The validators are ordered by non-increasing stake in the considered epoch. Here and below by "block production" we mean both production and validation.
-
-With stateless validation, this structure must change for several reasons:
-* Chunk production is the most resource consuming activity.
-* (Only) chunk production needs state in memory while other responsibilities can be completed via acquiring state witness
-* Chunk production does not have to be performed by all validators.
-
-Hence, to make transition seamless, we change the role of nodes out of top 100 to only validate chunks:
-
-```python
-if index(validator) < 100:
-    roles(validator).append("chunk producer")
-    roles(validator).append("block producer")
-roles(validator).append("chunk validator")
-```
-
-The more stake validator has, the more **heavy** work it will get assigned, because we assume that validators with higher stakes have more powerful hardware.
-With stateless validation, relative heaviness of the work changes. Comparing to the current order "block production" > "chunk production", the new order is "chunk production" > "block production" > "chunk validation".
-
-Shards are equally split among chunk producers: as in Mainnet on 12 Jun 2024 we have 6 shards, each shard would have ~16 chunk producers assigned.
-
-In the future, with increase in number of shards, we can generalise the assignment by saying that each shard should have `X` chunk producers assigned, if we have at least `X * S` validators. In such case, pseudocode for the role assignment would look as follows:
-
-```python
-if index(validator) < X * S:
-    roles(validator).append("chunk producer")
-if index(validator) < 100:
-    roles(validator).append("block producer")
-roles(validator).append("chunk validator")
-```
-
-### Rewards
-
-Reward for each validator is defined as `total_epoch_reward * validator_relative_stake * work_quality_ratio`, where:
-* `total_epoch_reward` is selected so that total inflation of the token is 5% per annum;
-* `validator_relative_stake = validator_stake / total_epoch_stake`;
-* `work_quality_ratio` is the measure of the work quality from 0 to 1.
-
-So, the actual reward never exceeds total reward, and when everyone does perfect work, they are equal.
-For the context of the NEP, it is enough to assume that `work_quality_ratio = avg_{role}({role}_quality_ratio)`.
-So, if node is both a block and chunk producer, we compute quality for each role separately and then take average of them.
-
-When epoch is finalized, all headers of blocks in it uniquely determine who was expected to produce each block and chunk.
-Thus, if we define quality ratio for block producer as `produced_blocks/expected_blocks`, everyone is able to compute it.
-Similarly, `produced_chunks/expected_chunks` is a quality for chunk producer.
-It is more accurate to say `included_chunks/expected_chunks`, because inclusion of chunk in block is a final decision of a block producer which defines success here.
-
-Ideally, we could compute quality for chunk validator as `produced_endorsements/expected_endorsements`. Unfortunately, we won't do it in Stage 0 because:
-* Mask of endorsements is not part of the block header, and it would be a significant change;
-* Block producer doesn't have to wait for all endorsements to be collected, so it could be unfair to say that endorsement was not produced if block producer just went ahead.
-
-So for now we decided to compute quality for chunk validator as ratio of `included_chunks/expected_chunks`, where we iterate over chunks which node was expected to validate.
-The obvious drawback here is that if chunks are not produced at all, chunk validators will also be impacted. We plan to address it in the future releases.
-
-### Kickouts
-
-In addition to that, if node performance is too poor, we want a mechanism to kick it out of the validator list, to ensure healthy protocol performance and validator rotation.
-Currently, we have a threshold for each role, and if for some role the same `{role}_quality_ratio` is lower than threshold, the node is kicked out.
-
-If we write this in pseudocode,
-
-```python
-if validator is block producer and block_producer_quality_ratio < 0.8:
-    kick out validator
-if validator is chunk producer and chunk_producer_quality_ratio < 0.8:
-    kick out validator
-```
+## Chunk Validator Shuffling
-
-For chunk validator, we apply absolutely the same formula. However, because:
-* the formula doesn't count endorsements explicitly
-* for chunk producers it kind of just makes chunk production condition stronger without adding value
+
+Chunk validators will be randomly assigned to validate shards, for each block (or as we may decide later, for multiple blocks in a row, if required for performance reasons). A chunk validator may be assigned multiple shards at once, if it has sufficient stake.
-
-we apply it to nodes which **only validate chunks**. So, we add this line:
+
+Each chunk validator's stake is divided into "mandates". There are full and partial mandates. The amount of stake for a full mandate is a fixed parameter determined by the stake distribution of all validators, and any remaining amount smaller than a full mandate is a partial mandate. A chunk validator therefore has zero or more full mandates plus up to one partial mandate. The list of full mandates and the list of partial mandates are then separately shuffled and partitioned equally (as in, no more than one mandate in difference between any two shards) across the shards. Any mandate assigned to a shard means that the chunk validator who owns the mandate is assigned to validate that shard. Because a chunk validator may have multiple mandates, it may be assigned multiple shards to validate.
-```python
-if validator is only chunk validator and chunk_validator_quality_ratio < 0.8:
-    kick out validator
-```
+
+For Stage 0, we set the **target number of mandates per shard** to 68, which was a [result of the latest research](https://near.zulipchat.com/#narrow/stream/407237-core.2Fstateless-validation/topic/validator.20seat.20assignment/near/435252304).
+With this number of mandates per shard and 6 shards, we predict the protocol to be secure for 40 years at 90% confidence.
+Based on the target number of mandates and the total chunk validator stake, we compute the price of a single full mandate using binary search [here](https://github.com/near/nearcore/blob/696190b150dd2347f9f042fa99b844b67c8001d8/core/primitives/src/validator_mandates/mod.rs#L76).
-
-As we pointed out above, current formula `chunk_validator_quality_ratio` is problematic.
+
+### Limits
-Here it brings even a bigger issue: if chunk producers don't produce chunks, chunk validators will be kicked out as well, which impacts network stability.
+
+One big issue which `ChunkStateWitness` introduces is that it is relatively large, and it must be distributed
-This is another reason to come up with the better formula.
+
-### Shard assignment
-
-As chunk producer becomes the most important role, we need to ensure that every epoch has significant amount of healthy chunk producers.
-This is a **strong difference** with current logic, where chunk-only producers generally have low stake and their performance doesn't impact overall performance.
-The most challenging part of becoming a chunk producer for a shard is to download most recent shard state within previous epoch. This is called "state sync".
-Unfortunately, as of now, state sync is centralised on published snapshots, which is a major point of failure, until we don't have decentralised state sync.
+
+### Partial state witness distribution
-
-Because of that, we make additional change: if node was a chunk producer for some shard in the previous epoch, and it is a chunk producer for current epoch, it will be assigned to the same shard.
+TODO
-This way, we minimise number of required state syncs at each epoch.
+
-
-The exact algorithm needs a thorough description to satisfy different edge cases, so we will just leave a link to full explanation: https://github.com/near/nearcore/issues/11213#issuecomment-2111234940.
-
-
 ## Security Implications
 
 [Explicitly outline any security concerns in relation to the NEP, and potential ways to resolve or mitigate them. At the very least, well-known relevant threats must be covered, e.g. person-in-the-middle, double-spend, XSS, CSRF, etc.]