
Deterministic subnet peer discovery #5090

Conversation

ackintosh
Member

@ackintosh commented Jan 19, 2024

Issue Addressed

#3648

Proposed Changes

  • Added a CLI param --prefix-search-for-subnet to enable prefix search.
    • Running Lighthouse without the parameter behaves as before, defaulting to searching on a random node id.
  • Added PrefixMapping, which stores mappings of SubnetId to NodeIds.
    • Discovery calls PrefixMapping::get_target_nodes to get the target NodeIds when running a prefix search (a simplified sketch is shown below).
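
Roughly, the structure looks like the sketch below. This is a simplified illustration only (the type aliases and the method body are not the exact code in this PR); the point is the SubnetId-to-NodeIds mapping and the get_target_nodes entry point that Discovery calls.

// Simplified sketch only: type aliases and the method body are illustrative,
// not the exact code in this PR.
use std::collections::HashMap;

type SubnetId = u64;    // stand-in for Lighthouse's SubnetId
type NodeId = [u8; 32]; // stand-in for discv5's NodeId
type Epoch = u64;

pub struct PrefixMapping {
    // Epoch the current mapping was computed for.
    epoch: Epoch,
    // For each attestation subnet, the node ids to use as discovery query targets.
    mapping: HashMap<SubnetId, Vec<NodeId>>,
}

impl PrefixMapping {
    // Return the target node ids for `subnet_id`, recomputing the table whenever
    // the epoch changes (the recomputation itself is elided in this sketch).
    pub fn get_target_nodes(&mut self, epoch: Epoch, subnet_id: SubnetId) -> Vec<NodeId> {
        if epoch != self.epoch {
            // compute_prefix_mapping_for_epoch(epoch) would repopulate `self.mapping` here.
            self.epoch = epoch;
        }
        self.mapping.get(&subnet_id).cloned().unwrap_or_default()
    }
}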

Additional Info

This PR is based on #4959, which is Diva's branch, so I think we need to merge #4959 first.


I've measured the improvement in (attestation subnet) peer discovery brought about by this PR, based on the sent/received bytes and the number of nodes found during the discovery process.

# Random search
$ cargo run --bin lighthouse -- -l bn --enr-address {public ip address}
# Prefix search
$ cargo run --bin lighthouse -- -l bn --enr-address {public ip address} --prefix-search-for-subnet

I ran Lighthouse five times each for Random search and Prefix search, observing the sent/received bytes for discovery and the number of nodes found. Based on the results below, discovery with Prefix search is approximately 3x more efficient than Random search.

[Image: measurement results]

Note for self: Here are the changes used for the measurement.

@ackintosh
Member Author

Note: some of the conflicts come from the base branch #4959.

@dapplion added the blocked and ready-for-review labels on Jan 29, 2024
@AgeManning
Member

Hey @ackintosh - Nice work. We've merged the base branch now, but I have some questions about your results.

I don't fully understand the measurements you are making. Could you explain a bit further how the tests are done?

Are you doing discovery queries? Is that what the # number represents? Each query should only return at most 16 peers, but it looks like the prefix search is returning more?

Are the runs for prefix and non-prefix actually separate? When lighthouse shuts down, it retains the peers it found in its db. So if you start a fresh lighthouse node and run a bunch of searches, then restart it and run the same bunch, the second run will be faster because it has already seeded its local routing table.

I'm asking this because I expect that the prefix-search will actually have no benefit at the moment. We should only see improvements as nodes start to use the new deterministic subnet scheme. Because pretty much all nodes on the network haven't got this yet, the prefix search should be equivalent to a normal search.

let node_ids = ids
    .into_iter()
    .map(|id| {
        let raw_node_id: [u8; 32] = id.into();
@AgeManning
Member

The calculation for node-ids that comes from compute_prefix_mapping_for_epoch generates node-ids that contain the prefix, followed by 0's for the rest, i.e. XXXX0000000000000000000000...

The problem with using this as a target for searching in discovery is that it is predictable, and malicious actors can take advantage of discovery queries if they always know what the target is going to be.

When we do the searches, we should randomize the trailing bits so that we get a unique set of peers inside each prefix.

I think the best way might be to just store the prefix bytes in this mapping, so that every epoch we calculate the new prefix bytes and delete the old ones. Then, when we do the actual search, we generate a random target node-id and replace the first few bytes with the prefix we find here. This way, every time we do a search we have a random target.

@ackintosh force-pushed the ackintosh-deterministic-subnets-peer-discovery branch from b832cf0 to 6c37f88 on February 28, 2024
@ackintosh added the work-in-progress label and removed the blocked and ready-for-review labels on Mar 19, 2024
Comment on lines +67 to +74
// We generate a random NodeId and replace the first few bits with the prefix.
let random_node_id = U256::from(NodeId::random().raw());
let prefixed_node_id = U256::from(prefix) << (256 - subnet_prefix_bits);

// Replace the first few bits of the random NodeID with the prefix. The bits that we
// want to replace are identified by the `mask`.
let raw_node_id: [u8; 32] =
    (random_node_id ^ ((random_node_id ^ prefixed_node_id) & mask)).into();
@ackintosh
Member Author

@AgeManning Thanks for your review! I have updated PrefixMapping::get_target_nodes to return random node-ids. 6f614df

The method used to replace the first few bits might seem a bit hacky. I'm not sure if this is the best approach, but I'm open to other suggestions.
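
For clarity, the expression uses the standard bit-select identity a ^ ((a ^ b) & mask) == (b & mask) | (a & !mask): it takes the bits of b wherever the mask is set and the bits of a everywhere else. A tiny standalone example with u8 values (illustration only, not the PR code):

fn main() {
    let a: u8 = 0b1011_0110;    // "random" bits
    let b: u8 = 0b0100_0000;    // desired prefix bits
    let mask: u8 = 0b1110_0000; // selects the top 3 bits

    // Keep the bits of `b` where the mask is set, and the bits of `a` elsewhere.
    let combined = a ^ ((a ^ b) & mask);

    assert_eq!(combined, (b & mask) | (a & !mask));
    assert_eq!(combined, 0b0101_0110);
}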

@ackintosh
Member Author

ackintosh commented Apr 9, 2024

Are the runs for prefix and non-prefix actually separate? When lighthouse shuts down, it retains the peers it found in its db. So if you start a fresh lighthouse node and run a bunch of searches, then restart it and run the same bunch, the second run will be faster because it has already seeded its local routing table.

Oh, I didn't know that the peers are retained in the db upon shutdown. The first measurement I conducted might be incorrect, as I didn't clear the db before running the tests.


I've updated the measurement procedure and conducted a second measurement.

Preparation

Create a branch for the measurement based on this PR. The changes in this branch include:

  • Run queries only for Subnet::Attestation (skipping queries for Subnet::SyncCommittee and QueryType::FindPeers). This aims to obtain accurate metrics for sent/recv bytes specifically for the Subnet::Attestation queries.
  • Logging sent/recv bytes and the number of found nodes.
    • Here, "found nodes" refers to nodes subscribing to the specific subnet we queried.

Measurement procedure

I followed the procedure below, performing it 5 times each for both normal search and prefix search:

  1. Clear the datadir to start a fresh lighthouse node on each run.
  2. Start lighthouse with (or without) the --prefix-search-for-subnet option.
  3. Stop lighthouse after 60 seconds.
  4. Check the metrics logs added in the Preparation section.

Result

The results show that subnet peer discovery is approximately 7x more efficient with Prefix search (see efficiency in the result table).

Note:

  • #: The test was run 5 times each for both normal search and prefix search, so # ranges from 1 to 5.
  • bytes_sent / bytes_recv: The bytes sent and received by queries during the 60-second lighthouse run.
  • total found nodes: The number of nodes subscribing to the specific subnet and found by the queries during the lighthouse run.
  • efficiency: Calculated as total bytes / found nodes. This is used to compare the two search methods. A lower number indicates better efficiency, meaning lower bandwidth consumption to search for nodes.
[Image: result table showing bytes_sent, bytes_recv, total found nodes, and efficiency for each run]

Why the result indicates better efficiency at this time

I'm asking this because I expect that the prefix-search will actually have no benefit at the moment. We should only see improvements as nodes start to use the new deterministic subnet scheme. Because pretty much all nodes on the network haven't got this yet, the prefix search should be equivalent to a normal search.

The subnet-id computed by the new deterministic subnet scheme is often the same as the one computed by the current subnet scheme. We implemented the new subnet scheme in #4959, where expected_subnets in the unit test was only partially updated. I believe this is why we see improvements with the prefix search at this time.


If anything is unclear, please let me know. 🙏

@AgeManning
Member

Hey @ackintosh, nice work.

I still can't understand why we are seeing any improvements at all. If anything I think this indicates a bug in our new code or something wrong with the testing methodology.

Isn't the only difference that we are no longer searching for a random node-id, but prefixing it with some bits? In either case the nodes are uniformly distributed, and searching via a random node-id or prefixing a few bits should yield the exact same results.

It doesn't make sense to me why there is such a difference between the two.

@ackintosh
Member Author

@AgeManning As you mentioned, nodes are using the current subnet scheme (not the new scheme) as defined in the spec. I did another test to see how often the computed subnet_ids from the current and the new scheme match, using 10,000 random node_id and epoch pairs.

Approximately 67% of the subnet_ids matched. (Details of the test are below.)

If the computed subnet_ids were completely different, we would likely see no improvement. However, about 67% of the computed subnet_ids were the same, and I think this is why we are seeing the improvements from prefix search with the new subnet scheme. (This figure makes sense: both schemes add a per-node offset in [0, EPOCHS_PER_SUBNET_SUBSCRIPTION) to the epoch before dividing by EPOCHS_PER_SUBNET_SUBSCRIPTION, so they derive the same permutation_seed, and therefore the same subnets, whenever the two offsets fall into the same subscription period, which happens roughly two-thirds of the time for random offsets.)

If there is any misunderstanding on my part, I would appreciate it if you could point it out. 🙇


Here is the code I used for the test. Most of it is copied from the consensus-specs repository.

# ################
# p2p-interface.md
# ################

# Current scheme, copied from current `p2p-interface.md`.
def compute_subscribed_subnet_by_current_schema(node_id: NodeID, epoch: Epoch, index: int) -> SubnetID:
    node_id_prefix = node_id >> (NODE_ID_BITS - ATTESTATION_SUBNET_PREFIX_BITS)
    node_offset = node_id % EPOCHS_PER_SUBNET_SUBSCRIPTION
    permutation_seed = hash(uint_to_bytes(uint64((epoch + node_offset) // EPOCHS_PER_SUBNET_SUBSCRIPTION)))
    permutated_prefix = compute_shuffled_index(
        node_id_prefix,
        1 << ATTESTATION_SUBNET_PREFIX_BITS,
        permutation_seed,
    )
    return SubnetID((permutated_prefix + index) % ATTESTATION_SUBNET_COUNT)

# New scheme, copied from https://github.com/ethereum/consensus-specs/pull/3545
def compute_subscribed_subnet_by_new_schema(node_id: NodeID, epoch: Epoch, index: int) -> SubnetID:
    subnet_prefix = node_id >> (NODE_ID_BITS - ATTESTATION_SUBNET_PREFIX_BITS)
    shuffling_bit_size = (
        NODE_ID_BITS
        - ATTESTATION_SUBNET_PREFIX_BITS
        - ATTESTATION_SUBNET_SHUFFLING_PREFIX_BITS
    )
    shuffling_bits = (node_id >> shuffling_bit_size) % (1 << ATTESTATION_SUBNET_SHUFFLING_PREFIX_BITS)
    shuffling_multiplier = EPOCHS_PER_SUBNET_SUBSCRIPTION >> ATTESTATION_SUBNET_SHUFFLING_PREFIX_BITS
    epoch_transition = (
        (subnet_prefix + (shuffling_bits * shuffling_multiplier)) % EPOCHS_PER_SUBNET_SUBSCRIPTION
    )
    permutation_seed = hash(uint_to_bytes(uint64((epoch + epoch_transition) // EPOCHS_PER_SUBNET_SUBSCRIPTION)))
    permutated_prefix = compute_shuffled_index(
        subnet_prefix,
        1 << ATTESTATION_SUBNET_PREFIX_BITS,
        permutation_seed,
    )
    return SubnetID((permutated_prefix + index) % ATTESTATION_SUBNET_COUNT)
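
# NOTE: the plural wrappers below are assumed here so that the test further down
# runs as written; they mirror `compute_subscribed_subnets` in the spec, which
# collects the subnet for each index in range(SUBNETS_PER_NODE).
def compute_subscribed_subnets_by_current_schema(node_id: NodeID, epoch: Epoch) -> Sequence[SubnetID]:
    return [compute_subscribed_subnet_by_current_schema(node_id, epoch, index) for index in range(SUBNETS_PER_NODE)]

def compute_subscribed_subnets_by_new_schema(node_id: NodeID, epoch: Epoch) -> Sequence[SubnetID]:
    return [compute_subscribed_subnet_by_new_schema(node_id, epoch, index) for index in range(SUBNETS_PER_NODE)]
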
# ################
# test_validator_unittest.py
# ################

@with_phases([PHASE0])
@spec_test
@single_phase
def test_compute_subscribed_subnets_similarity(spec):
    for i in range(10000):
        rng = random.Random(i)
        res = check_if_computed_subnets_are_equal_between_current_and_new(spec, rng)
        print('res', res)

def check_if_computed_subnets_are_equal_between_current_and_new(spec, rng):
    node_id = rng.randint(0, 2**256 - 1)
    epoch = rng.randint(0, 2**64 - 1)

    subnets_current = spec.compute_subscribed_subnets_by_current_schema(node_id, epoch)
    subnets_new = spec.compute_subscribed_subnets_by_new_schema(node_id, epoch)

    return subnets_current == subnets_new

I ran the test in the consensus-specs repo as shown below:

cd consensus-specs/tests/core/pyspec

python \
  -s \
  -m pytest \
  --preset=mainnet \
  -k test_compute_subscribed_subnets_similarity \
  eth2spec/test/phase0/unittests/validator/test_validator_unittest.py

@AgeManning
Member

Closing in favour of #6146

@AgeManning closed this on Sep 8, 2024