validator_discovery: pass PeerSet to the request #2372
Conversation
We also handle
Definitely! From the point of view of Polkadot, there should never be a peer that is "connected" or "disconnected". It's always "has or has not a validation substream" and "has or has not a collation substream".
LGTM
guide changes?
Needs a guide change for ConnectToValidators
What changes exactly? I'm currently investigating the adder-collator test failure.
Oh, I missed the change to the message definition earlier. Nvm
Still investigating, b94a870 doesn't fix the issue unfortunately.
…ub.com:paritytech/polkadot into ao-pass-peerset-to-connecttovalidators-request
* master: bump spec versions in kusama, polkadot and westend (#2391)
This should be OK to merge now. Please re-review.
e.insert(vec![response_sender]);
}
connect_to_relevant_validators(&mut state.connection_requests, ctx, relay_parent, &descriptor).await;
e.insert(vec![response_sender]);
I think we should not insert the sender unconditionally. If there are no validators to connect to, the receiver will be pending forever.
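To illustrate the concern, here is a minimal sketch (not the actual subsystem code; the map and names are made up) of what happens when a response sender is stored unconditionally but no connection request follows: nothing ever sends on the channel, so the receiving side stays pending.

```rust
use std::collections::HashMap;
use std::sync::mpsc;

// Hypothetical sketch: `awaiting` stands in for the per-relay-parent map of
// response senders. If we insert the sender but never issue a connection
// request, no one ever sends on the channel and the receiver stays pending.
fn main() {
    let mut awaiting: HashMap<&str, Vec<mpsc::Sender<&str>>> = HashMap::new();
    let (tx, rx) = mpsc::channel();

    // The sender is inserted unconditionally...
    awaiting.entry("relay_parent").or_insert_with(Vec::new).push(tx);

    // ...but no connection request was issued, so nothing arrives.
    assert_eq!(rx.try_recv(), Err(mpsc::TryRecvError::Empty));
    println!("receiver still pending");
}
```

The counterargument below is that the entry is cleaned up once the relay parent becomes irrelevant, so "pending forever" is bounded in practice.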
I'm not sure about this. AFAIK a runtime request can fail spuriously while we may already be connected to the relevant peers.
And we will clean it up when the relay parent becomes irrelevant.
The previous check did not check whether we were able to connect to validators, but whether there are any relevant validators.
I think there should always be relevant validators in theory.
However, the caller should in any case handle the situation where we issued the connection request but haven't received the PoV for a while. It does this by spawning a separate job in the background that does not block progress on other jobs:
polkadot/node/core/backing/src/lib.rs
Line 568 in 57eb9d1
// spawn background task.
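A minimal sketch of that pattern (hypothetical names; the real code in backing uses the subsystem's job spawner rather than a raw thread): the PoV wait runs in a background job with a timeout, while the main loop keeps making progress.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical sketch: wait for the PoV in a background job so that a slow
// or failed connection does not block other work.
fn main() {
    let (pov_tx, pov_rx) = mpsc::channel::<Vec<u8>>();

    // Background job: block on the PoV (with a timeout for this sketch).
    let handle = thread::spawn(move || {
        match pov_rx.recv_timeout(Duration::from_millis(200)) {
            Ok(pov) => println!("got PoV of {} bytes", pov.len()),
            Err(_) => println!("PoV never arrived; job gives up"),
        }
    });

    // The main loop continues with other jobs meanwhile.
    println!("main loop keeps working");
    pov_tx.send(vec![0u8; 16]).expect("receiver alive");
    handle.join().unwrap();
}
```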
What I'm saying is yes, I've changed the behavior here, but it is equally "correct" as the previous one.
Just wanted to double check, the changed code sure looks nicer ;-)
target: LOG_TARGET,
peer = ?peer,
"Peer sent us an invalid request",
);
Why do we want to remove this exactly?
This says "invalid request", whereas ReportPeer can also be used for increasing reputation, so the message is confusing. In addition, we log it at a higher level in the subsystems.
// We only need one connection request per (relay_parent, para_id)
// so here we take this shortcut to avoid calling `connect_to_validators`
// more than once.
if !connection_requests.contains_request(&relay_parent) {
I am having trouble understanding: what would happen if this function were called twice with the same relay_parent and two different descriptors?
If you check the code where this is called, you see that the connection_requests are per pov_hash.
Wait, is it? state.connection_requests doesn't know about pov_hash, and if we call it for two different pov_hashes with the same relay_parent, it will not issue the second request, will it?
Ahh, my comment was dumb 🤦
I had overlooked that the connection requests map is global to the state.
The assumption was probably that we are only assigned to one parachain per relay parent?
Do I understand it right that this is a legit problem and requires an issue?
Do I understand it right that this is a legit problem and requires an issue?
Don't know, the following assumption seems legit to me:
The assumption was probably that we are only assigned to one parachain per relay parent?
Although, the fewer assumptions we make, the better. So if that assumption is easy to get rid of, I think that would be a good idea.
Depends on whether this assumption is wrong:
The assumption was probably that we are only assigned to one parachain per relay parent?
If it is wrong, we'd probably need to change
id_map: HashMap<Hash, usize>,
to map from (Hash, ParaId) instead of Hash.
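A minimal sketch of the suggested change (with toy stand-ins for the real Hash and ParaId types): keying the map by the (relay_parent, para_id) pair means two parachains on the same relay parent each keep their own request, whereas a Hash-only key would let the second insert clobber the first.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real Hash and ParaId types.
type Hash = [u8; 4];
type ParaId = u32;

fn main() {
    // Keyed by (relay_parent, para_id) instead of relay_parent alone.
    let mut id_map: HashMap<(Hash, ParaId), usize> = HashMap::new();

    let relay_parent: Hash = [1, 2, 3, 4];
    id_map.insert((relay_parent, 100), 0);
    id_map.insert((relay_parent, 200), 1); // second para, same relay parent

    // With a Hash-only key the second insert would have replaced the first.
    assert_eq!(id_map.len(), 2);
}
```

(The merge log further down shows this was in fact done as a follow-up: "validator_discovery: cache by (Hash, ParaId) (#2402)".)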
I don't quite follow why Distribute, at least with the current code, needs to connect to anybody.
Distribute places the PoV into a mapping for the relay-parent and then we advertise the PoV hashes to anyone who connects, after we get their view-update and see that they're on the same chains as us. Then they'll request the PoV.
I think this issue might be solved also by having nodes send their current view to fresh connections in the network-bridge.
That said, I think that the change is not a problem and is good to have. But it might be solving things for the wrong reason 🤔
I just tried it and the integration test still fails; there is no mention of
That's not how the pov-distribution currently works. It sends PoVs, not hashes on distribute:
Are you saying we should change this? Anyway, this is unrelated to the problem here.
I think the reason why we don't send PoVs without this fix is that we need to explicitly open the
In case that helps, validation is also referred to as
The original bug is that the validation substream opens a few milliseconds before the collation substream. However, the bridge before this PR doesn't distinguish between "validation open" and "collation open". All it knows about and reports is "open". During the few milliseconds between validation opening and collation opening, the collation code detects this "open" from the bridge (well, it's a bit more complicated, but I'm not familiar with the code and don't remember the details) and tries to send a
After this original bug was fixed by properly distinguishing between collation and validation, the second bug is that we didn't open the validation substream with anyone anymore. The code before this PR, when it receives a
Yes, it looks like previously the collator issued
I think maybe there was an implicit expectation that if the peers are "connected", they also have
The problem is: nobody connects. The problem as I see it is that nobody issues a
I am not sure yet where there would be a better place to make sure we are actually connected to relevant validators. Making sure we are connected in distribute kind of makes sense: "Ok, we have something to distribute, maybe we should make sure we are connected to someone, so they can tell us whether they want it." Also, now the PoV distribution as a subsystem makes sure it is connected, which seems sensible.
I think what would be nice is issuing background connection requests periodically once we know who the relevant peers are (we have multiple layers of caching for the requests, so issuing one twice should not be a problem). This also goes hand in hand with the dormant peers idea.
Wouldn't this indicate a problem in statement distribution? Validators are meant to issue
I think I checked that, but it went nowhere. But now that you say it, it sounds awfully likely that something is wrong there. Hence, the issue. I am currently right in the middle of something (availability distribution), but I can have a look at #2400 tomorrow.
* master:
  Implement Approval Voting Subsystem (#2112)
  Introduce PerPeerSet utility that allows to segrate based on PeerSet (#2420)
  [CI] Move check_labels to github actions (#2415)
  runtime: set equivocation report longevity (#2404)
  Companion for #7936: Migrate pallet-balances to pallet attribute macro (#2331)
  Corrected Physical (#2414)
  validator_discovery: cache by (Hash, ParaId) (#2402)
  Enable wasmtime caching for PVF (companion for #8057) (#2387)
  Use construct_runtime in tests, remove default PalletInfo impl (#2409)
  validator_discovery: pass PeerSet to the request (#2372)
  guide: more robust approval counting procedure (#2378)
  Publish rococo on every push to `rococo-v1` branch (#2388)
  Bump trie-db from 0.22.2 to 0.22.3 (#2344)
  Send view to new peers (#2392)
Fixes #2242.