
feat(chain): Chunk Only Producers #4193

Merged · 42 commits · Sep 8, 2021

Conversation

@birchmd (Contributor) commented Apr 2, 2021

Chunk-only producers are an important stepping stone towards sharding in mainnet. See https://gov.near.org/t/block-and-chunk-producer-selection-algorithm-in-simple-nightshade/66 for more details. Also see near/NEPs#167 for the spec this work is based on.

This PR does most of the work towards landing this feature. Much of the work was updating tons of tests, because they relied on the assumption that validators produce blocks/chunks in a cyclic order. That is no longer true, because the randomness is now applied on the fly at each height instead of when the proposals are processed.
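For intuition, here is a minimal sketch of the per-height idea. It is illustrative only: the Validator struct, the hashing scheme, and the weighting below are assumptions, not the nearcore implementation. Instead of fixing a cyclic schedule when proposals are processed, the producer at each height is drawn from a stake-weighted distribution seeded by the epoch's random value and the height.

// Illustrative sketch: pick a producer per height by hashing (epoch_seed, height)
// into a stake-weighted choice, rather than following a fixed cyclic schedule.
// Names and the hashing scheme are assumptions, not the nearcore implementation.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct Validator {
    account_id: &'static str,
    stake: u128,
}

fn producer_at_height<'a>(validators: &'a [Validator], epoch_seed: u64, height: u64) -> &'a Validator {
    // Deterministic per-height value derived from the epoch seed and the height.
    let mut hasher = DefaultHasher::new();
    (epoch_seed, height).hash(&mut hasher);
    let total_stake: u128 = validators.iter().map(|v| v.stake).sum();
    // Map the hash onto the cumulative stake distribution (stake-weighted pick).
    let mut point = (hasher.finish() as u128) % total_stake;
    for v in validators {
        if point < v.stake {
            return v;
        }
        point -= v.stake;
    }
    unreachable!("point is always below total_stake")
}

fn main() {
    let validators = [
        Validator { account_id: "alice", stake: 50 },
        Validator { account_id: "bob", stake: 30 },
        Validator { account_id: "carol", stake: 20 },
    ];
    for height in 0..5u64 {
        let producer = producer_at_height(&validators, 7, height);
        println!("height {} -> {}", height, producer.account_id);
    }
}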

This PR is not yet suitable for merging to master; missing items are listed below:

  • Nayduck failures (looks like some tests are failing -- http://nayduck.eastus.cloudapp.azure.com:3000/#/run/1452)
  • Writing a new pytest to see this feature working end-to-end. This PR adds some new tests, and fixes a lot of old tests, so it probably works, but it's always nice to see an integration test.

List of (possible) Nayduck failures to be addressed:

  • expensive nearcore test_rejoin test::test_4_20_kill1_two_shards
  • pytest sanity/one_val.py
  • pytest sanity/rpc_state_changes.py
  • pytest sanity/staking2.py
  • pytest sanity/staking_repro1.py
  • pytest sanity/state_sync2.py
  • pytest sanity/sync_chunks_from_archival.py
  • pytest stress/stress.py 3 3 3 0 staking transactions local_network packets_drop
  • pytest stress/stress.py 3 3 3 0 staking transactions node_restart packets_drop
  • pytest stress/stress.py 3 3 3 0 staking transactions node_restart wipe_data
  • pytest sanity/gc_after_sync.py
  • pytest sanity/gc_sync_after_sync.py swap_nodes


stale bot commented Jul 1, 2021

This PR has been automatically marked as stale because it has not had recent activity in the last 2 weeks.
It will be closed in 3 days if no further activity occurs.
Thank you for your contributions.

stale bot added the S-stale label on Jul 1, 2021
@SkidanovAlex (Collaborator) left a comment


Only reviewed the first half so far, some comments inline.

"Can't get Outcome ids by Block Hash"
));
}
if let Ok(Some(_)) = sv.store.get_ser::<ChunkExtra>(
Collaborator commented:

If I understand what this logic was for, won't the changed code fail to catch the situation where I'm supposed to have a chunk extra but do not?

Collaborator replied:

That is the difficult part. There is a test that fails because catch-up has not finished downloading the state, and as a result we don't actually have a chunk extra for the block. There is nothing wrong with that, and I am not sure what a better condition to check would be.
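For reference, a minimal sketch of the relaxed check being discussed, using stand-in types rather than the actual StoreValidator API: a missing ChunkExtra is tolerated, because catch-up may not have downloaded the shard state yet, while a present one is still validated.

// Stand-in types, not the nearcore StoreValidator API.
#[derive(Debug)]
struct ChunkExtra;
#[derive(Debug)]
struct StoreError;

fn validate(_chunk_extra: &ChunkExtra) -> Result<(), String> {
    Ok(()) // the real consistency checks would go here
}

fn check_chunk_extra(lookup: Result<Option<ChunkExtra>, StoreError>) -> Result<(), String> {
    match lookup {
        // Present: run the consistency checks against it.
        Ok(Some(chunk_extra)) => validate(&chunk_extra),
        // Absent: not treated as an error, because catch-up may not have
        // finished downloading the state for this shard yet.
        Ok(None) => Ok(()),
        // A storage failure is still reported.
        Err(e) => Err(format!("store error: {:?}", e)),
    }
}

fn main() {
    assert!(check_chunk_extra(Ok(Some(ChunkExtra))).is_ok());
    assert!(check_chunk_extra(Ok(None)).is_ok());
    assert!(check_chunk_extra(Err(StoreError)).is_err());
}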

@@ -1425,6 +1425,8 @@ impl Client {
let head = self.chain.head()?;
if let Some(next_epoch_id) = self.get_next_epoch_id_if_at_boundary(&head)? {
self.forward_tx(&next_epoch_id, tx)?;
} else {
Collaborator commented:

Why is this needed?

If I follow the logic, this line changes the semantics of the method quite drastically. It used to forward the tx only in the extremely rare case of being at the epoch boundary. Now it will practically double the number of transactions forwarded from validator nodes (since the condition above is almost always false).

Logically this seems like a good change (looking at the place where this is called in process_tx_internal, it appears that if an active validator received a tx directly, the tx would never be forwarded before this change, resulting in very long processing times). But if this is indeed your intent here, should the method be renamed?

Also, in process_tx_internal, would it make sense to call this method now instead of forward_tx in the case where we are not a validator?

Collaborator replied:

> looking at the place where this is called in process_tx_internal it appears that if an active validator received a tx directly, the tx will never be forwarded before this change, resulting in very long processing times

Yes that is exactly the intention. One python test failed because of this.

> But if this is indeed your intent here, the method needs to be renamed?

What do you suggest we rename it to?
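For context, here is a minimal sketch of the forwarding behavior under discussion. The Client struct, its fields, and the method name forward_tx_if_needed are placeholders; only forward_tx and get_next_epoch_id_if_at_boundary mirror names from the diff above.

// Sketch: always forward the tx within the current epoch, and forward to the
// next epoch's validators when the head is near the epoch boundary.
// Stand-in types; not the nearcore Client implementation.
struct EpochId(u64);
struct Tx;

struct Client {
    current_epoch_id: EpochId,
    near_epoch_boundary: bool,
}

impl Client {
    fn forward_tx(&self, epoch_id: &EpochId, _tx: &Tx) -> Result<(), String> {
        println!("forwarding tx to validators of epoch {}", epoch_id.0);
        Ok(())
    }

    fn get_next_epoch_id_if_at_boundary(&self) -> Result<Option<EpochId>, String> {
        Ok(if self.near_epoch_boundary {
            Some(EpochId(self.current_epoch_id.0 + 1))
        } else {
            None
        })
    }

    // Before this change the method only covered the boundary case; the `else`
    // branch makes the tx get forwarded within the current epoch as well.
    fn forward_tx_if_needed(&self, tx: &Tx) -> Result<(), String> {
        if let Some(next_epoch_id) = self.get_next_epoch_id_if_at_boundary()? {
            self.forward_tx(&next_epoch_id, tx)?;
        } else {
            self.forward_tx(&self.current_epoch_id, tx)?;
        }
        Ok(())
    }
}

fn main() {
    let client = Client { current_epoch_id: EpochId(10), near_epoch_boundary: false };
    client.forward_tx_if_needed(&Tx).unwrap();
}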

chain/epoch_manager/src/shard_assignment.rs (outdated; resolved)
/// best when the number of chunk producers is greater than
/// `num_shards * min_validators_per_shard`.
pub fn assign_shards<T: HasStake + Eq + Clone>(
chunk_producers: Vec<T>,
Collaborator commented:

Are chunk_producers sorted in any particular way (in particular, in descending order of stake)?
It would be good to document that here.
If they are not sorted by descending stake, it might be worth doing so?

chain/epoch_manager/src/shard_assignment.rs (resolved)
Ok(result)
}

fn assign_with_possible_repeats<T: HasStake + Eq, I: Iterator<Item = (usize, T)>>(
Collaborator commented:

This method seems overly complex.

random_shuffle(validators) // optionally
last_ord = 0
for shard_id in 0..num_shards {
    for _ in 0..min_validators_per_shard {
        result[shard_id].push(last_ord);
        last_ord = (last_ord + 1) % num_chunk_producers;
    }
}

Collaborator replied:

The difference is that this approach does not try to balance the stake across shards.
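To make the stake-balancing point concrete, here is a minimal sketch (not the nearcore assign_shards implementation; the Proposal type and the greedy strategy are assumptions): sort proposals by descending stake and greedily place each one on the shard with the least total stake so far.

// Illustrative sketch of stake-balanced shard assignment; not the actual
// assign_shards code. Proposal and the greedy strategy are assumptions.
#[derive(Clone, Debug)]
struct Proposal {
    account_id: String,
    stake: u128,
}

fn assign_balanced(mut chunk_producers: Vec<Proposal>, num_shards: usize) -> Vec<Vec<Proposal>> {
    // Largest stakes first, so they get spread across shards before smaller ones.
    chunk_producers.sort_by(|a, b| b.stake.cmp(&a.stake));
    let mut shards: Vec<Vec<Proposal>> = vec![Vec::new(); num_shards];
    let mut shard_stake = vec![0u128; num_shards];
    for cp in chunk_producers {
        // Greedy: place each producer on the shard with the lowest total stake so far.
        let shard_id = (0..num_shards)
            .min_by_key(|&i| shard_stake[i])
            .expect("num_shards must be > 0");
        shard_stake[shard_id] += cp.stake;
        shards[shard_id].push(cp);
    }
    shards
}

fn main() {
    let proposals = vec![
        Proposal { account_id: "a".into(), stake: 50 },
        Proposal { account_id: "b".into(), stake: 30 },
        Proposal { account_id: "c".into(), stake: 20 },
        Proposal { account_id: "d".into(), stake: 10 },
    ];
    println!("{:#?}", assign_balanced(proposals, 2));
}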

@SkidanovAlex (Collaborator) left a comment

Still haven't reviewed validator_selection.rs.


#[test]
fn test_sample_should_produce_correct_distribution() {
let weights = vec![5, 1, 1];
@SkidanovAlex (Collaborator) commented Aug 25, 2021

(UPDATE: I had an incorrect challenge here)

The test itself is pretty flaky (it fails with a relative error of 0.7%-0.8% in about one out of 5 executions). Though I ran it with rand::thread_rng instead of hashes, so maybe that changes the distribution.

@birchmd (Contributor, author) commented Aug 27, 2021

The test as written passes 100% of the time because it is deterministic; this is why I chose to simulate random seeds with repeated hashing instead of something like thread_rng.

You are right that this is a statistical process, so with a fixed number of samples there will always be some probability of failure. Really the statement I want to check is "with probability 1, repeated sampling from weighted_index will eventually converge to the correct distribution". The trouble of course is translating that into a reliable test that runs relatively quickly. I chose to pick an arbitrary, though fixed, sequence of seeds which happens to work because I thought it was pretty convincing.

An alternative to using hashes to uniformly and deterministically sample the space of inputs to give to weighted_index would be to take some random number generator from the rand library and give it a fixed seed. This should be equivalent (in terms of uniform randomness) to the hashes scheme, but if you find that more convincing it's an easy change to make.
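For illustration, a minimal sketch of the seeded-RNG alternative mentioned above, assuming the rand crate's StdRng, SeedableRng, and WeightedIndex; this is not the test added in this PR.

// Sketch of a deterministic distribution check using a fixed-seed RNG instead
// of repeated hashing. Illustration only, not the test in this PR.
use rand::distributions::{Distribution, WeightedIndex};
use rand::rngs::StdRng;
use rand::SeedableRng;

fn main() {
    let weights = vec![5u64, 1, 1];
    let dist = WeightedIndex::new(&weights).unwrap();
    // Fixed seed: every run draws the same sequence, so the assertion cannot flake.
    let mut rng = StdRng::seed_from_u64(42);

    let n = 1_000_000u64;
    let mut counts = vec![0u64; weights.len()];
    for _ in 0..n {
        counts[dist.sample(&mut rng)] += 1;
    }

    let total: u64 = weights.iter().sum();
    for (i, &w) in weights.iter().enumerate() {
        let expected = (n as f64) * (w as f64) / (total as f64);
        let relative_error = ((counts[i] as f64) - expected).abs() / expected;
        // Loose tolerance; deterministic, so it either always passes or always fails.
        assert!(relative_error < 0.01, "index {}: relative error {}", i, relative_error);
    }
    println!("counts: {:?}", counts);
}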

chain/epoch_manager/src/validator_selection.rs (outdated; resolved)
@bowenwang1996 (Collaborator) left a comment

Approving to unblock this PR and avoid having to deal with merge conflicts with master. If any further testing is needed, we can do it after this is merged as a nightly protocol feature.

@bowenwang1996 (Collaborator) commented:

@frol looks like it needs your approval 🙏

@frol (Collaborator) left a comment

The changes in Rosetta-RPC and primitives LGTM

@bowenwang1996 (Collaborator) commented:

@nikurt please review

@nikurt self-requested a review on September 8, 2021 at 07:25