
Deterministic subnet peer discovery #5090

Conversation

ackintosh
Member

@ackintosh commented Jan 19, 2024

Issue Addressed

#3648

Proposed Changes

  • Added a CLI param --prefix-search-for-subnet to enable prefix search.
    • Running Lighthouse without the parameter behaves as before, defaulting to searching on a random node id.
  • Added PrefixMapping, which stores mappings of SubnetId to NodeIds.
    • Discovery calls PrefixMapping::get_target_nodes to get the target NodeIds when running a prefix search (a simplified sketch is shown below).
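
Roughly, the structure looks like the sketch below. This is a simplified illustration only (the type aliases and the method body are not the exact code in this PR); the point is the SubnetId-to-NodeIds mapping and the get_target_nodes entry point that Discovery calls.

// Simplified sketch only: type aliases and the method body are illustrative,
// not the exact code in this PR.
use std::collections::HashMap;

type SubnetId = u64;    // stand-in for Lighthouse's SubnetId
type NodeId = [u8; 32]; // stand-in for discv5's NodeId
type Epoch = u64;

pub struct PrefixMapping {
    // Epoch the current mapping was computed for.
    epoch: Epoch,
    // For each attestation subnet, the node ids to use as discovery query targets.
    mapping: HashMap<SubnetId, Vec<NodeId>>,
}

impl PrefixMapping {
    // Return the target node ids for `subnet_id`, recomputing the table whenever
    // the epoch changes (the recomputation itself is elided in this sketch).
    pub fn get_target_nodes(&mut self, epoch: Epoch, subnet_id: SubnetId) -> Vec<NodeId> {
        if epoch != self.epoch {
            // compute_prefix_mapping_for_epoch(epoch) would repopulate `self.mapping` here.
            self.epoch = epoch;
        }
        self.mapping.get(&subnet_id).cloned().unwrap_or_default()
    }
}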

Additional Info

This PR is based on #4959, which is Diva's branch, so I think we need to merge #4959 first.


I've measured the improvement in (attestation subnet) peer discovery brought about by this PR, based on the sent/received bytes and the number of nodes found during the discovery process.

# Random search
$ cargo run --bin lighthouse -- -l bn --enr-address {public ip address}
# Prefix search
$ cargo run --bin lighthouse -- -l bn --enr-address {public ip address} --prefix-search-for-subnet

I ran Lighthouse five times each for Random search and Prefix search, observing the sent/received bytes for discovery and the number of nodes found. Based on the results below, discovery with Prefix search is approximately 3x more efficient than Random search.

[Image: measurement results]

Note for self: Here are the changes used for the measurement.

@ackintosh
Member Author

Note: some of the conflicts come from the base branch #4959.

@dapplion added the blocked and ready-for-review labels on Jan 29, 2024
@AgeManning
Member

Hey @ackintosh - Nice work. We've merged the base branch now, but I have some questions about your results.

I don't fully understand the measurements you are making. Could you explain a bit further how the tests are done?

Are you doing discovery queries? Is that what the # number represents? Each query should only return at most 16 peers, but it looks like the prefix search is returning more?

Are the runs for prefix and non-prefix actually separate? When lighthouse shuts down, it retains the peers it found in its db. So if you start a fresh lighthouse node and run a bunch of searches, then restart it and run the same bunch, the second run will be faster because it has already seeded its local routing table.

I'm asking this because I expect that the prefix-search will actually have no benefit at the moment. We should only see improvements as nodes start to use the new deterministic subnet scheme. Because pretty much all nodes on the network haven't got this yet, the prefix search should be equivalent to a normal search.

let node_ids = ids
    .into_iter()
    .map(|id| {
        let raw_node_id: [u8; 32] = id.into();
@AgeManning
Member

The calculation for node-ids that comes from compute_prefix_mapping_for_epoch generates node-ids that contain the prefix, followed by 0's for the rest, i.e. XXXX0000000000000000000000...

The problem with using this as a target for searching in discovery is that it is predictable, and malicious actors can take advantage of discovery queries if they always know what the target is going to be.

When we do the searches, we should randomize the trailing bits so that we get a unique set of peers inside each prefix.

I think the best way might be to just store the prefix bytes in this mapping, so that every epoch we calculate the new prefix bytes and delete the old ones. Then, when we do the actual search, we generate a random target node-id and replace the first few bytes with the prefix we find here. This way, every time we do a search we have a random target.

@ackintosh force-pushed the ackintosh-deterministic-subnets-peer-discovery branch from b832cf0 to 6c37f88 on February 28, 2024
@ackintosh added the work-in-progress label and removed the blocked and ready-for-review labels on Mar 19, 2024
Comment on lines +67 to +74
// We generate a random NodeId and replace the first few bits with the prefix.
let random_node_id = U256::from(NodeId::random().raw());
let prefixed_node_id = U256::from(prefix) << (256 - subnet_prefix_bits);

// Replace the first few bits of the random NodeID with the prefix. The bits that we
// want to replace are identified by the `mask`.
let raw_node_id: [u8; 32] =
    (random_node_id ^ ((random_node_id ^ prefixed_node_id) & mask)).into();
@ackintosh
Member Author

@AgeManning Thanks for your review! I have updated PrefixMapping::get_target_nodes to return random node-ids. 6f614df

The method used to replace the first few bits might seem a bit hacky. I'm not sure if this is the best approach, but I'm open to other suggestions.
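
For clarity, the expression uses the standard bit-select identity a ^ ((a ^ b) & mask) == (b & mask) | (a & !mask): it takes the bits of b wherever the mask is set and the bits of a everywhere else. A tiny standalone example with u8 values (illustration only, not the PR code):

fn main() {
    let a: u8 = 0b1011_0110;    // "random" bits
    let b: u8 = 0b0100_0000;    // desired prefix bits
    let mask: u8 = 0b1110_0000; // selects the top 3 bits

    // Keep the bits of `b` where the mask is set, and the bits of `a` elsewhere.
    let combined = a ^ ((a ^ b) & mask);

    assert_eq!(combined, (b & mask) | (a & !mask));
    assert_eq!(combined, 0b0101_0110);
}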

@ackintosh
Member Author

ackintosh commented Apr 9, 2024

Are the runs for prefix and non-prefix actually separate? When lighthouse shuts down, it retains the peers it found in its db. So if you start a fresh lighthouse node and run a bunch of searches, then restart it and run the same bunch, the second run will be faster because it has already seeded its local routing table.

Oh, I didn't know that the peers are retained in the db upon shutdown. The first measurement I conducted might be incorrect, as I didn't clear the db before running the tests.


I've updated the measurement procedure and conducted a second measurement.

Preparation

Create a branch for the measurement based on this PR. The changes in this branch include:

  • Run queries only for Subnet::Attestation (skipping queries for Subnet::SyncCommittee and QueryType::FindPeers). This aims to obtain accurate metrics for sent/recv bytes specifically for the Subnet::Attestation queries.
  • Logging sent/recv bytes and the number of found nodes.
    • Here, "found nodes" refers to nodes subscribing to the specific subnet we queried.

Measurement procedure

I followed the procedure below, performing it 5 times each for both normal search and prefix search:

  1. Clear the datadir to start a fresh lighthouse node on each run.
  2. Start lighthouse with (or without) the --prefix-search-for-subnet option.
  3. Stop lighthouse after 60 seconds.
  4. Check the metrics logs added in the Preparation section.

Result

The results show that subnet peer discovery is approximately 7x more efficient with Prefix search (see efficiency in the result table).

Note:

  • #: The test was run 5 times each for both normal search and prefix search, so # ranges from 1 to 5.
  • bytes_sent / bytes_recv: The bytes sent and received by queries during the 60-second lighthouse run.
  • total found nodes: The number of nodes subscribing to the specific subnet and found by the queries during the lighthouse run.
  • efficiency: Calculated as total bytes / found nodes. This is used to compare the two search methods. A lower number indicates better efficiency, meaning lower bandwidth consumption to search for nodes.
[Image: result table showing bytes_sent, bytes_recv, total found nodes, and efficiency for each run]

Why the result indicates better efficiency at this time

I'm asking this because I expect that the prefix-search will actually have no benefit at the moment. We should only see improvements as nodes start to use the new deterministic subnet scheme. Because pretty much all nodes on the network haven't got this yet, the prefix search should be equivalent to a normal search.

The subnet-id computed by the new deterministic subnet scheme is often the same as the one computed by the current subnet scheme. We implemented the new subnet scheme in #4959, where expected_subnets in the unit test was only partially updated. I believe this is why we see improvements with the prefix search at this time.


If anything is unclear, please let me know. 🙏

@AgeManning
Member

Hey @ackintosh, nice work.

I still can't understand why we are seeing any improvements at all. If anything I think this indicates a bug in our new code or something wrong with the testing methodology.

Isn't the only difference that we are no longer searching for a random node-id, but prefixing it with some bits? In either case the nodes are uniformly distributed, and searching via a random node-id or prefixing a few bits should yield the exact same results.

It doesn't make sense to me why there is such a difference between the two.

@ackintosh
Member Author

@AgeManning As you mentioned, nodes are using the current subnet scheme (not the new scheme) as defined in the spec. I did another test to see how often the computed subnet_ids from the current and the new scheme match, using 10,000 random node_id and epoch pairs.

Approximately 67% of the subnet_ids matched. (Details of the test are below.)

If the computed subnet_ids were completely different, we would likely see no improvement. However, about 67% of the computed subnet_ids were the same, and I think this is why we are seeing the improvements from prefix search with the new subnet scheme. (This figure makes sense: both schemes add a per-node offset in [0, EPOCHS_PER_SUBNET_SUBSCRIPTION) to the epoch before dividing by EPOCHS_PER_SUBNET_SUBSCRIPTION, so they derive the same permutation_seed, and therefore the same subnets, whenever the two offsets fall into the same subscription period, which happens roughly two-thirds of the time for random offsets.)

If there is any misunderstanding on my part, I would appreciate it if you could point it out. 🙇


Here is the code I used for the test. Most of it is copied from the consensus-specs repository.

# ################
# p2p-interface.md
# ################

# Current scheme, copied from current `p2p-interface.md`.
def compute_subscribed_subnet_by_current_schema(node_id: NodeID, epoch: Epoch, index: int) -> SubnetID:
    node_id_prefix = node_id >> (NODE_ID_BITS - ATTESTATION_SUBNET_PREFIX_BITS)
    node_offset = node_id % EPOCHS_PER_SUBNET_SUBSCRIPTION
    permutation_seed = hash(uint_to_bytes(uint64((epoch + node_offset) // EPOCHS_PER_SUBNET_SUBSCRIPTION)))
    permutated_prefix = compute_shuffled_index(
        node_id_prefix,
        1 << ATTESTATION_SUBNET_PREFIX_BITS,
        permutation_seed,
    )
    return SubnetID((permutated_prefix + index) % ATTESTATION_SUBNET_COUNT)

# New scheme, copied from https://github.com/ethereum/consensus-specs/pull/3545
def compute_subscribed_subnet_by_new_schema(node_id: NodeID, epoch: Epoch, index: int) -> SubnetID:
    subnet_prefix = node_id >> (NODE_ID_BITS - ATTESTATION_SUBNET_PREFIX_BITS)
    shuffling_bit_size = (
        NODE_ID_BITS
        - ATTESTATION_SUBNET_PREFIX_BITS
        - ATTESTATION_SUBNET_SHUFFLING_PREFIX_BITS
    )
    shuffling_bits = (node_id >> shuffling_bit_size) % (1 << ATTESTATION_SUBNET_SHUFFLING_PREFIX_BITS)
    shuffling_multiplier = EPOCHS_PER_SUBNET_SUBSCRIPTION >> ATTESTATION_SUBNET_SHUFFLING_PREFIX_BITS
    epoch_transition = (
        (subnet_prefix + (shuffling_bits * shuffling_multiplier)) % EPOCHS_PER_SUBNET_SUBSCRIPTION
    )
    permutation_seed = hash(uint_to_bytes(uint64((epoch + epoch_transition) // EPOCHS_PER_SUBNET_SUBSCRIPTION)))
    permutated_prefix = compute_shuffled_index(
        subnet_prefix,
        1 << ATTESTATION_SUBNET_PREFIX_BITS,
        permutation_seed,
    )
    return SubnetID((permutated_prefix + index) % ATTESTATION_SUBNET_COUNT)
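
# NOTE: the plural wrappers below are assumed here so that the test further down
# runs as written; they mirror `compute_subscribed_subnets` in the spec, which
# collects the subnet for each index in range(SUBNETS_PER_NODE).
def compute_subscribed_subnets_by_current_schema(node_id: NodeID, epoch: Epoch) -> Sequence[SubnetID]:
    return [compute_subscribed_subnet_by_current_schema(node_id, epoch, index) for index in range(SUBNETS_PER_NODE)]

def compute_subscribed_subnets_by_new_schema(node_id: NodeID, epoch: Epoch) -> Sequence[SubnetID]:
    return [compute_subscribed_subnet_by_new_schema(node_id, epoch, index) for index in range(SUBNETS_PER_NODE)]
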
# ################
# test_validator_unittest.py
# ################

@with_phases([PHASE0])
@spec_test
@single_phase
def test_compute_subscribed_subnets_similarity(spec):
    for i in range(10000):
        rng = random.Random(i)
        res = check_if_computed_subnets_are_equal_between_current_and_new(spec, rng)
        print('res', res)

def check_if_computed_subnets_are_equal_between_current_and_new(spec, rng):
    node_id = rng.randint(0, 2**256 - 1)
    epoch = rng.randint(0, 2**64 - 1)

    subnets_current = spec.compute_subscribed_subnets_by_current_schema(node_id, epoch)
    subnets_new = spec.compute_subscribed_subnets_by_new_schema(node_id, epoch)

    return subnets_current == subnets_new

I ran the test in the consensus-specs repo as shown below:

cd consensus-specs/tests/core/pyspec

python \
  -s \
  -m pytest \
  --preset=mainnet \
  -k test_compute_subscribed_subnets_similarity \
  eth2spec/test/phase0/unittests/validator/test_validator_unittest.py

@AgeManning
Member

Closing in favour of #6146

@AgeManning closed this on Sep 8, 2024