feat(chain): Chunk Only Producers #4193
Conversation
…h height using the new algorithm
…ure_chunk_only_producers
This PR has been automatically marked as stale because it has not had recent activity in the last 2 weeks.
4801de4 to c102d4d
Only reviewed the first half so far, some comments inline.
```rust
        "Can't get Outcome ids by Block Hash"
    ));
}
if let Ok(Some(_)) = sv.store.get_ser::<ChunkExtra>(
```
If I understand what this logic was for, the changed code won't catch the situation where I'm supposed to have a chunk extra but do not?
That is the difficult part -- there is a test that fails because catch-up has not finished downloading the state, and as a result we don't actually have a chunk extra for the block. There is nothing wrong with that, and I am not sure what a better condition to check would be.
```rust
@@ -1425,6 +1425,8 @@ impl Client {
    let head = self.chain.head()?;
    if let Some(next_epoch_id) = self.get_next_epoch_id_if_at_boundary(&head)? {
        self.forward_tx(&next_epoch_id, tx)?;
    } else {
```
Why is this needed?
If I follow the logic, this line changes the semantics of the method quite drastically. It used to forward the tx only in extremely rare cases, on the epoch boundary. Now it will practically double the number of transactions forwarded from validator nodes (since the condition above is almost always false).
Logically this seems like a good change (looking at the place where this is called in `process_tx_internal`, it appears that before this change, if an active validator received a tx directly, the tx would never be forwarded, resulting in very long processing times). But if this is indeed your intent here, should the method be renamed?
Also, in `process_tx_internal`, would it make sense to call this method now instead of `forward_tx` in the case when we are not a validator?
> looking at the place where this is called in process_tx_internal it appears that if an active validator received a tx directly, the tx will never be forwarded before this change, resulting in very long processing times

Yes, that is exactly the intention. One Python test failed because of this.

> But if this is indeed your intent here, the method needs to be renamed?

What do you suggest we rename it to?
```rust
/// best when the number of chunk producers is greater than
/// `num_shards * min_validators_per_shard`.
pub fn assign_shards<T: HasStake + Eq + Clone>(
    chunk_producers: Vec<T>,
```
Are `chunk_producers` sorted in any particular way (in particular, in descending order of stake)? It would be good to document that here.
If they are not sorted by descending stake, it might be worth doing.
```rust
    Ok(result)
}

fn assign_with_possible_repeats<T: HasStake + Eq, I: Iterator<Item = (usize, T)>>(
```
This method seems overly complex. Something like this should suffice:

```
random_shuffle(validators) // optionally
last_ord = 0
for (shard_id in ...) {
    for validator_ord in 0..min_validators_per_shard {
        result[shard_id].push(last_ord);
        last_ord += 1;
        last_ord %= num_chunk_producers;
    }
}
```
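As an aside, the round-robin loop sketched above can be written as runnable Rust. This is an illustrative sketch only, not the PR's actual `assign_with_possible_repeats`; the parameter names (`num_chunk_producers`, `num_shards`, `min_validators_per_shard`) are borrowed from the discussion, and producers are represented by their indices:

```rust
// Sketch of the suggested round-robin assignment (hypothetical, not the
// PR's implementation). Producers are assigned to shards by index in a
// cycle, so a producer may serve multiple shards when there are fewer
// producers than `num_shards * min_validators_per_shard` slots.
fn round_robin_assign(
    num_chunk_producers: usize,
    num_shards: usize,
    min_validators_per_shard: usize,
) -> Vec<Vec<usize>> {
    let mut result = vec![Vec::new(); num_shards];
    let mut last_ord = 0;
    for shard_id in 0..num_shards {
        for _ in 0..min_validators_per_shard {
            result[shard_id].push(last_ord);
            last_ord = (last_ord + 1) % num_chunk_producers;
        }
    }
    result
}

fn main() {
    // 5 producers, 3 shards, 2 validators per shard: producer 0 wraps
    // around and serves both shard 0 and shard 2.
    let assignment = round_robin_assign(5, 3, 2);
    println!("{:?}", assignment); // [[0, 1], [2, 3], [4, 0]]
}
```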
The difference is that this approach does not try to balance the stake across shards.
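To illustrate what stake balancing means here, one common approach (a hedged sketch, not the algorithm this PR implements) is greedy "least-loaded" assignment: take producers in descending stake order and put each one on the shard whose total stake is currently smallest:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Hypothetical sketch of stake-aware assignment, for illustration only.
// Producers are processed in descending stake order; each goes to the
// shard with the lowest total stake so far (greedy LPT-style balancing).
fn balanced_assign(stakes: &[u128], num_shards: usize) -> Vec<Vec<usize>> {
    let mut order: Vec<usize> = (0..stakes.len()).collect();
    order.sort_by_key(|&i| Reverse(stakes[i])); // descending stake, stable on ties
    let mut shards = vec![Vec::new(); num_shards];
    // Min-heap of (total_stake, shard_id): the least-loaded shard pops first.
    let mut heap: BinaryHeap<Reverse<(u128, usize)>> =
        (0..num_shards).map(|s| Reverse((0, s))).collect();
    for i in order {
        let Reverse((total, shard)) = heap.pop().unwrap();
        shards[shard].push(i);
        heap.push(Reverse((total + stakes[i], shard)));
    }
    shards
}

fn main() {
    // Three large and three small stakes across two shards: the small
    // stakes pile onto one shard to offset the two largest on the other.
    let shards = balanced_assign(&[100, 90, 80, 10, 10, 10], 2);
    println!("{:?}", shards); // [[0, 3, 4, 5], [1, 2]]
}
```

Unlike the round-robin suggestion, this keeps per-shard stake totals close (130 vs 170 in the example) at the cost of losing the simple cyclic structure.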
Still haven't reviewed `validator_selection.rs`.
```rust
#[test]
fn test_sample_should_produce_correct_distribution() {
    let weights = vec![5, 1, 1];
```
(UPDATE: I had an incorrect challenge here)
The test itself is pretty flaky (it fails with a relative error of 0.7%-0.8% about one out of 5 executions). Though I ran it with `rand::thread_rng` instead of hashes, so maybe that changes the distribution.
The test as written passes 100% of the time because it is deterministic; this is why I chose to simulate random seeds with repeated hashing instead of something like `thread_rng`.
You are right that this is a statistical process, so with a fixed number of samples there will always be some probability of failure. Really, the statement I want to check is "with probability 1, repeated sampling from `weighted_index` will eventually converge to the correct distribution". The trouble, of course, is translating that into a reliable test that runs relatively quickly. I chose an arbitrary, though fixed, sequence of seeds which happens to work, because I thought it was pretty convincing.
An alternative to using hashes to uniformly and deterministically sample the space of inputs given to `weighted_index` would be to take some random number generator from the `rand` library and give it a fixed seed. This should be equivalent (in terms of uniform randomness) to the hashing scheme, but if you find that more convincing, it's an easy change to make.
13f2a01 to 52ae940
Approving to unblock this PR and avoid having to deal with merge conflicts with master. If any further testing is needed, we can do it after this is merged as a nightly protocol feature.
@frol looks like it needs your approval 🙏
The changes in Rosetta-RPC and primitives LGTM
@nikurt please review
Chunk-only producers are an important stepping stone towards sharding in mainnet. See https://gov.near.org/t/block-and-chunk-producer-selection-algorithm-in-simple-nightshade/66 for more details. Also see near/NEPs#167 for the spec this work is based on.
This PR does most of the work towards landing this feature. Much of the work was updating tons of tests, because they assumed that validators produce blocks/chunks in a cyclic order. That is no longer true, because the randomness is now applied on the fly at each height instead of when the proposals are processed.
This PR is not yet suitable for merging to master; missing items are listed below:
List of (possible) Nayduck failures to be addressed: