Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

validator_discovery: pass PeerSet to the request #2372

Merged
merged 24 commits into from
Feb 8, 2021

Conversation

ordian
Copy link
Member

@ordian ordian commented Feb 3, 2021

Fixes #2242.

@ordian ordian added B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Feb 3, 2021
@github-actions github-actions bot added the A3-in_progress Pull request is in progress. No review needed at this stage. label Feb 3, 2021
@ordian ordian marked this pull request as ready for review February 3, 2021 14:07
@github-actions github-actions bot added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Feb 3, 2021
@ordian
Copy link
Member Author

ordian commented Feb 3, 2021

We also handle on_peer_connected, on_peer_disconnected in validator discovery service. Not sure if we should take peer_set into account there as well 🤔

@tomaka
Copy link
Contributor

tomaka commented Feb 3, 2021

Definitely!
find_connected_validators should also get a PeerSet parameter, and connected_peers should be a HashMap<(PeerId, PeerSet), ...>.

From the point of view of Polkadot, there' should never be a peer that is "connected" or "disconnected". It's always "has or has not a validation sustream" and "has or has not a collation substream".

Copy link
Member

@eskimor eskimor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rphmeier
Copy link
Contributor

rphmeier commented Feb 3, 2021

guide changes?

Copy link
Contributor

@rphmeier rphmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a guide change for ConnectToValidators

@ordian
Copy link
Member Author

ordian commented Feb 3, 2021

What changes exactly?

I'm currently investigating the adder-collator test failure.

@rphmeier
Copy link
Contributor

rphmeier commented Feb 3, 2021

Oh, I missed the change to the message definition earlier. Nvm

@ordian
Copy link
Member Author

ordian commented Feb 3, 2021

I'm currently investigating the adder-collator test failure.

Still investigating, b94a870 doesn't fix the issue unfortunately.

* master:
  Diagnostics quality of life improvements (#2375)
  Generic request/response infrastructure for Polkadot (#2352)
  Fix bug in statement table (#2369)
* master:
  Unused variable can be ignored (#2381)
  End multiplexer stream once one of its inputs end. (#2380)
  Cargo build step needs pre requisites not mentioned (#2379)
ordian and others added 6 commits February 5, 2021 11:33
* master:
  node: pass local authorship info to the transaction pool (#2385)
  Explicit Para Lifecycle w/ Upgrades and Downgrades (#2354)
  Cancel Proxy Type (#2334)
  Cargo.lock: Update to libp2p-swarm v0.27.2 (#2384)
…ub.com:paritytech/polkadot into ao-pass-peerset-to-connecttovalidators-request
@ordian
Copy link
Member Author

ordian commented Feb 5, 2021

This should be OK to merge now. Please re-review.

e.insert(vec![response_sender]);
}
connect_to_relevant_validators(&mut state.connection_requests, ctx, relay_parent, &descriptor).await;
e.insert(vec![response_sender]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not insert the sender unconditionally. If there are no validators to connect to, the receiver will be pending forever.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this. AFAIK a runtime request can fail spuriously while we can be already connected to the relevant peers.
And we will clean it up when the relay parent becomes irrelevant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous check did not check whether we were able to connect to validators, but whether there are any relevant validators.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should always be relevant validators in theory.
However, the caller should anyway handle the case that we issued the connection request but haven't received the PoV for a while. And it does this by issuing a separate job for it in the background that should not block progress on other jobs:

// spawn background task.
.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm saying is yes, I've changed the behavior here, but it is equally "correct" as the previous one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to double check, the changed code sure looks nicer ;-)

target: LOG_TARGET,
peer = ?peer,
"Peer sent us an invalid request",
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to remove this exactly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says "invalid request", whereas ReportPeer can be also used for increasing reputation, so the message is confusing.
In addition to that, we log it in on a higher level in subsystems.

node/network/bridge/src/lib.rs Outdated Show resolved Hide resolved
node/network/bridge/src/validator_discovery.rs Outdated Show resolved Hide resolved
node/network/pov-distribution/src/lib.rs Outdated Show resolved Hide resolved
// We only need one connection request per (relay_parent, para_id)
// so here we take this shortcut to avoid calling `connect_to_validators`
// more than once.
if !connection_requests.contains_request(&relay_parent) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having trouble understanding, what would happen if this function were called twice with the same relay_parent and two different descriptors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Although this code wasn't introduced in this PR, the comment says it should be unique per (relay_parent, para_id), but the only handles relay_parent.

cc @montekki #1990

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you check the code where this is called, you see that the the connection_requests are per pov_hash.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, is it? state.connection_requests doesn't know about pov_hash and if we call it for two different pov_hashes with the same relay_parent, it will not issue the second request, or?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh my comment was dumb 🤦

I had overseen that connection requests is global to the state.

The assumption was probably that we are only assigned to one parachain per relay parent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand it right that this is a legit problem and requires an issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand it right that this is a legit problem and requires an issue?

Don't know, the following assumption seems legit to me:

The assumption was probably that we are only assigned to one parachain per relay parent?

Although, the less assumptions we make, the better. So if that assumption is easy to get rid of, I think that would be a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on whether this assumption is wrong

The assumption was probably that we are only assigned to one parachain per relay parent?

If it is wrong, we'd probably need to change

id_map: HashMap<Hash, usize>,

to map from (Hash, ParaId) instead of Hash.

Copy link
Contributor

@rphmeier rphmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite follow why Distribute, at least with the current code, needs to connect to anybody.

Distribute places the PoV into a mapping for the relay-parent and then we advertise the PoV hashes to anyone who connects, after we get their view-update and see that they're on the same chains as us. Then they'll request the PoV.

I think this issue might be solved also by having nodes send their current view to fresh connections in the network-bridge.

That said, I think that the change is not a problem and is good to have. But it might be solving things for the wrong reason 🤔

* master:
  Send view to new peers (#2392)
@ordian
Copy link
Member Author

ordian commented Feb 6, 2021

I think this issue might be solved also by having nodes send their current view to fresh connections in the network-bridge.

I just tried it and the integration test still fails, there is no mention of /polkadot/validation/1 protocol in the sub-libp2p logs.

Distribute places the PoV into a mapping for the relay-parent and then we advertise the PoV hashes to anyone who connects, after we get their view-update and see that they're on the same chains as us. Then they'll request the PoV.

That's not how the pov-distribution currently works. It sends PoVs, not hashes on distribute:

distribute_to_awaiting(

Are you're saying we should change this? Anyway, this is unrelated to the problem here.

But it might be solving things for the wrong reason thinking

I think the reason why we don't send PoVs w/o this fix, is that we need to explicitly open the /protocol/validation/1 substream, which is done in connect_to_validators request when we do here. @tomaka please correct me if I'm wrong.

@tomaka
Copy link
Contributor

tomaka commented Feb 6, 2021

I just tried it and the integration test still fails, there is no mention of /polkadot/validation/1 protocol in the sub-libp2p logs.

In case that helps, validation is also referred to as SetId(4) in connections/disconnections messages. Collation is SetId(3).

I think the reason why we don't send PoVs w/o this fix, is that we need to explicitly open the /protocol/validation/1 substream, which is done in connect_to_validators request when we do here. @tomaka please correct me if I'm wrong.

The original bug is that the validation substream opens a few milliseconds before the collation substream. However, the bridge before this PR doesn't distinguish between "validation open" and "collation open". All it knows about and reports is "open".

During the few milliseconds between validation opening and collation opening, the collation code detects this "open" from the bridge (well, it's a bit more complicated, but I'm not familiar with the code and don't remember the details) and tries to send a Distribute on the collation substream, which fails because it isn't open yet.

After this original bug was fixed by properly distinguishing between collation and validation, the second bug is that we didn't open the validation substream with anyone anymore. The code before this PR, when it receives a connect_to_validators(collation) simply connects to everyone both with collation and validation. After fixing this, my understanding (again, might be wrong) was that we realized that there isn't actually any code that does connect_to_validations(validation) anywhere.

@ordian ordian merged commit 57eb9d1 into master Feb 8, 2021
@ordian ordian deleted the ao-pass-peerset-to-connecttovalidators-request branch February 8, 2021 07:58
@ordian
Copy link
Member Author

ordian commented Feb 8, 2021

After this original bug was fixed by properly distinguishing between collation and validation, the second bug is that we didn't open the validation substream with anyone anymore. The code before this PR, when it receives a connect_to_validators(collation) simply connects to everyone both with collation and validation. After fixing this, my understanding (again, might be wrong) was that we realized that there isn't actually any code that does connect_to_validations(validation) anywhere.

Yes, it looks like previously, collator issued connect_to_validators(collation) and collate_to_validators(validation)` requests and validators were distributing the PoV to each other over the collator after receiving it from the collator.

I think maybe there was an implicit expectation that if the peers are "connected", they also have /polkadot/validation/1 substream open. But in reality we need to explicitly open this substream.

@eskimor
Copy link
Member

eskimor commented Feb 8, 2021

I don't quite follow why Distribute, at least with the current code, needs to connect to anybody.

Distribute places the PoV into a mapping for the relay-parent and then we advertise the PoV hashes to anyone who connects, after we get their view-update and see that they're on the same chains as us. Then they'll request the PoV.

The problem is: Nobody connects. The problem as I see it, is that nobody issues a FetchPoV, because no validator except the ones connected to the collator actually know anything about the PoV. The ones that know about it, know it because they got it via the collation protocol, so they do not need to fetch it. It seems FetchPoV is at least initially the only code path where we actually connect to the validation protocol. Without being connected we also won't get any view updates, which would tell us who wants what.

I am not sure yet, where there would be a better place to make sure we are actually connected to relevant validators. Making sure we are connected in distribute, kinda makes sense. "Ok we have something to distribute, maybe we should make sure we are connected to someone, so they can tell us whether they want it."

Also now, he PoV distribution as a subsystem makes sure it is connected, which seems sensible.

@ordian
Copy link
Member Author

ordian commented Feb 8, 2021

I think what would be nice is issuing background connection requests once in a while once we know who the relevant peers are (we have multiple layers of caching of the requests, so issuing it twice should not be a problem). This also goes hand in hand with the dormant peers idea.

@rphmeier
Copy link
Contributor

rphmeier commented Feb 8, 2021

The problem as I see it, is that nobody issues a FetchPoV, because no validator except the ones connected to the collator actually know anything about the PoV.

Wouldn't this indicate a problem in statement distribution? Validators are meant to issue FetchPoV based on what they've observed in Seconded statements.

@eskimor
Copy link
Member

eskimor commented Feb 8, 2021

Wouldn't this indicate a problem in statement distribution? Validators are meant to issue FetchPoV based on what they've observed in Seconded statements.

I think, I checked that, but ended nowhere. But now that you say it, it sounds awfully likely that something is wrong there. Hence, issue. I am currently right in the middle of something (availability distribution), but can have a look at #2400 tomorrow.

ordian added a commit that referenced this pull request Feb 11, 2021
* master:
  Implement Approval Voting Subsystem (#2112)
  Introduce PerPeerSet utility that allows to segrate based on PeerSet (#2420)
  [CI] Move check_labels to github actions (#2415)
  runtime: set equivocation report longevity (#2404)
  Companion for #7936: Migrate pallet-balances to pallet attribute macro (#2331)
  Corrected Physical (#2414)
  validator_discovery: cache by (Hash, ParaId) (#2402)
  Enable wasmtime caching for PVF (companion for #8057) (#2387)
  Use construct_runtime in tests, remove default PalletInfo impl (#2409)
  validator_discovery: pass PeerSet to the request (#2372)
  guide: more robust approval counting procedure (#2378)
  Publish rococo on every push to `rococo-v1` branch (#2388)
  Bump trie-db from 0.22.2 to 0.22.3 (#2344)
  Send view to new peers (#2392)
ordian added a commit that referenced this pull request Feb 11, 2021
* master:
  Implement Approval Voting Subsystem (#2112)
  Introduce PerPeerSet utility that allows to segrate based on PeerSet (#2420)
  [CI] Move check_labels to github actions (#2415)
  runtime: set equivocation report longevity (#2404)
  Companion for #7936: Migrate pallet-balances to pallet attribute macro (#2331)
  Corrected Physical (#2414)
  validator_discovery: cache by (Hash, ParaId) (#2402)
  Enable wasmtime caching for PVF (companion for #8057) (#2387)
  Use construct_runtime in tests, remove default PalletInfo impl (#2409)
  validator_discovery: pass PeerSet to the request (#2372)
  guide: more robust approval counting procedure (#2378)
  Publish rococo on every push to `rococo-v1` branch (#2388)
  Bump trie-db from 0.22.2 to 0.22.3 (#2344)
  Send view to new peers (#2392)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

validator discovery: deduplicate peersets
7 participants