Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

authorithy-discovery: Make changing of peer-id while active a bit more robust #3786

Merged
merged 58 commits into from
Jul 25, 2024

Conversation

alexggh
Copy link
Contributor

@alexggh alexggh commented Mar 21, 2024

In the case when nodes don't persist their node-key or they want to generate a new one while being in the active set, things go wrong because both the old addresses and the new ones will still be present in DHT, so because of the distributed nature of the DHT both will survive in the network untill the old ones expires which is 36 hours. Nodes in the network will randomly resolve the authorithy-id to the old address or the new one.

More details in: #3673

This PR proposes we mitigate this problem, by:

  1. Let the query for a DHT key retrieve more than one results(4), that is also bounded by the replication factor which is 20, currently we interrupt the querry on the first result.
    2. Modify the authority-discovery service to keep all the discovered addresses around for 24h since they last seen an address.
    3. Plumb through other subsystems where the assumption was that an authorithy-id will resolve only to one PeerId. Currently, the authorithy-discovery keeps just the last record it received from DHT and queries the DHT every 10 minutes. But they could always receive only the old address, only the new address or a flip-flop between them depending on what node wins the race to provide the record

  2. Extend the SignedAuthorityRecord with a signed creation_time.

  3. Modify authority discovery to keep track of nodes that sent us old record and once we are made aware of a new record update the nodes we know about with the new record.

  4. Update gossip-support to try resolve authorities more often than every session.

This would gives us a lot more chances for the nodes in the networks to also discover not only the old address of the node but also the new one and should improve the time it takes for a node to be properly connected in the network. The behaviour won't be deterministic because there is no guarantee the all nodes will see the new record at least once, since they could query only nodes that have the old one.

TODO

  • Add unittests for the new paths.
  • Make sure the implementation is backwards compatible
  • Evaluate if there are any bad consequence of letting the query continue rather than finish it at first record found.
  • Bake in versi the new changes.

…e robust

In the case when nodes don't persist their node-key or they want to
generate a new one while being in the active set, things go wrong
because both the old addresses and the new ones will still be present in
DHT, so because of the distributed nature of the DHT both will survive
in the network untill the old ones expires which is 36 hours.
Nodes in the network will randomly resolve the authorithy-id to the old
address or the new one.

More details in: #3673

This PR proposes we mitigate this problem, by:

1. Let the query for a DHT key retrieve all the results, that is usually
   bounded by the replication factor which is 20, currently we interrupt
   the querry on the first result.
2. Modify the authority-discovery service to keep all the discovered
   addresses around for 24h since they last seen an address.
3. Plumb through other subsystems where the assumption was that an
   authorithy-id will resolve only to one PeerId. Currently, the
   authorithy-discovery keeps just the last record it received from DHT
   and queries the DHT every 10 minutes. But they could always receive
   only the old address, only the new address or a flip-flop between
   them depending on what node wins the race to provide the record
4. Update gossip-support to try resolve authorities more often than
   every session.

This would gives us a lot more chances for the nodes in the networks to
also discover not only the old address of the node but also the new one
and should improve the time it takes for a node to be properly connected
in the network. The behaviour won't be deterministic because there is no
guarantee the all nodes will see the new record at least once, since
they could query only nodes that have the old one.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Copy link
Contributor

@dmitry-markin dmitry-markin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

substrate/client/network/src/litep2p/mod.rs Outdated Show resolved Hide resolved
substrate/client/authority-discovery/src/worker.rs Outdated Show resolved Hide resolved
substrate/client/authority-discovery/src/worker.rs Outdated Show resolved Hide resolved
alexggh and others added 4 commits July 18, 2024 10:45
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Copy link
Contributor

@lexnv lexnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Maybe could bound the total number of peers in the record info using an LRU with a decent amount. In theory to not update every outdated peer (whole discoverable network) from this node, but only a limited number of peers?

alexggh and others added 4 commits July 18, 2024 13:51
Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh
Copy link
Contributor Author

alexggh commented Jul 25, 2024

Validate this PR in versi for the past 2 days with both litep2p and libp2p backends, things seems to work as expected, merging it now.

@alexggh alexggh added this pull request to the merge queue Jul 25, 2024
Merged via the queue into master with commit 6720279 Jul 25, 2024
158 of 161 checks passed
@alexggh alexggh deleted the alexaggh/fix_change_node_id_at_restart branch July 25, 2024 16:55
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
This PR updates the litep2p crate to the latest version.

This fixes the build for developers that want to perform `cargo update`
on all their dependencies:
paritytech#4343, by porting the
latest changes.

The peer records were introduced to litep2p to be able to distinguish
and update peers with outdated records.
It is going to be properly used in substrate via:
paritytech#3786, however that is
pending the commit to merge on litep2p master:
paritytech/litep2p#96.

Closes: paritytech#4343

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
…e robust (paritytech#3786)

In the case when nodes don't persist their node-key or they want to
generate a new one while being in the active set, things go wrong
because both the old addresses and the new ones will still be present in
DHT, so because of the distributed nature of the DHT both will survive
in the network untill the old ones expires which is 36 hours. Nodes in
the network will randomly resolve the authorithy-id to the old address
or the new one.

More details in: paritytech#3673

This PR proposes we mitigate this problem, by:

1. Let the query for a DHT key retrieve more than one results(4), that
is also bounded by the replication factor which is 20, currently we
interrupt the querry on the first result.
~2. Modify the authority-discovery service to keep all the discovered
addresses around for 24h since they last seen an address.~
~3. Plumb through other subsystems where the assumption was that an
authorithy-id will resolve only to one PeerId. Currently, the
authorithy-discovery keeps just the last record it received from DHT and
queries the DHT every 10 minutes. But they could always receive only the
old address, only the new address or a flip-flop between them depending
on what node wins the race to provide the record~

2. Extend the `SignedAuthorityRecord` with a signed creation_time.
3. Modify authority discovery to keep track of nodes that sent us old
record and once we are made aware of a new record update the nodes we
know about with the new record.
4. Update gossip-support to try resolve authorities more often than
every session.

~This would gives us a lot more chances for the nodes in the networks to
also discover not only the old address of the node but also the new one
and should improve the time it takes for a node to be properly connected
in the network. The behaviour won't be deterministic because there is no
guarantee the all nodes will see the new record at least once, since
they could query only nodes that have the old one.~


## TODO
- [x] Add unittests for the new paths.
- [x] Make sure the implementation is backwards compatible
- [x] Evaluate if there are any bad consequence of letting the query
continue rather than finish it at first record found.
- [x] Bake in versi the new changes.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants