Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

swarm/src/lib: Prioritize Behaviour over Pool and Pool over Listeners #2627

Merged
merged 7 commits into from
May 5, 2022

Conversation

mxinden
Copy link
Member

@mxinden mxinden commented May 2, 2022

Description

Builds on top of #2625.

Have the main event loop (Swarm::poll_next_event) prioritize:

  1. Work on NetworkBehaviour over work on Pool, thus prioritizing
    local work over work coming from a remote.

  2. Work on Pool over work on ListenersStream, thus prioritizing work
    on existing connections over upgrading new incoming connections.

Refactors NetworkBehaviourAction, PoolEvent and ListenersEvent handling along the way through 9f0ca95 87b7cb3 00346de to simplify the actual change, namely 426023e. Thus best reviewed one commit at a time.

I will run this change on a larger machine to make sure it does not introduce any subtle errors.

Links to any relevant issues

Related to #2626.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates
    • I don't think this needs to be called out in the changelog.

Simplifies `PoolEvent`, no longer carrying a reference to an
`EstablishedConnection` or the `Pool`, but instead the `PeerId`,
`ConnectionId` and `ConnectedPoint` directly.
Have the main event loop (`Swarm::poll_next_event`) prioritize:

1. Work on `NetworkBehaviour` over work on `Pool`, thus prioritizing
   local work over work coming from a remote.

2. Work on `Pool` over work on `ListenersStream`, thus prioritizing work
   on existing connections over upgrading new incoming connections.
Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! :)

swarm/src/lib.rs Outdated Show resolved Hide resolved
@@ -1018,88 +1018,92 @@ where
// across a `Deref`.
let this = &mut *self;

// This loop polls the components below in a prioritized order.
//
// 1. [`NetworkBehaviour`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Unless you are in rustdoc land (///), I don't think this reference will do anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am hoping for this to eventually become a feature and if not, at least it is consistent when find-replacing. I would keep it as is unless you feel strongly about it @thomaseizinger.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No feeling strongly at all, just figured it might have been a mistake :)

swarm/src/lib.rs Outdated
Comment on lines 1022 to 1023
let mut listeners_not_ready = false;
let mut connections_not_ready = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for getting rid of these flags!

return Poll::Pending
}
// Poll the listener(s) for new connections.
match ListenersStream::poll(Pin::new(&mut this.listeners), cx) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ListenersStream would actually implement Stream then we could utilise poll_unpin here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, though we would have to handle the None case which can never happen. rust-lang/futures-rs#1857 would be very useful here.

@mxinden
Copy link
Member Author

mxinden commented May 3, 2022

As an example @winksaville 426023e is one of those changes on a complex poll loop method. Error prone indeed, though I don't see a better way for doing it.

Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
@elenaf9
Copy link
Contributor

elenaf9 commented May 4, 2022

  1. Work on Pool over work on ListenersStream, thus prioritizing work
    on existing connections over upgrading new incoming connections.

👍 Makes sense.

  1. Work on NetworkBehaviour over work on Pool, thus prioritizing
    local work over work coming from a remote.

Not sure if I agree with this. Many NetworkBehaviours internally have a state that is influenced by events in the pool (e.g. list of connected peers) and the behaviour might rely on the fact that this state is really the most recent one. Also when the behaviour now initiates a dial-attempt, with PeerCondition::Disconnected, it could happen that we initiate a dial attempt even though there is a connection to that peer pending in the pool, that has just not been polled to completion yet.
I think prioritization of local work <-> remote work has to happen within the NetworkBehaviour (e.g. do local work first before handling events reported by the ConnectionHandler).
Could you give an example where you think it makes more sense to poll the NetworkBehaviour before polling the Pool?

@mxinden
Copy link
Member Author

mxinden commented May 4, 2022

Many NetworkBehaviours internally have a state that is influenced by events in the pool (e.g. list of connected peers) and the behaviour might rely on the fact that this state is really the most recent one

As an aside a NetworkBehaviour must not depend on their view of the connection pool to be up-to-date, as they are updated in an eventual consistent way only.

Also when the behaviour now initiates a dial-attempt, with PeerCondition::Disconnected, it could happen that we initiate a dial attempt even though there is a connection to that peer pending in the pool, that has just not been polled to completion yet.

True, though I would argue that this can happen anyways, thus each NetworkBehaviour needs to be able to cope with the situation.

I think prioritization of local work <-> remote work has to happen within the NetworkBehaviour (e.g. do local work first before handling events reported by the ConnectionHandler).

But how would a NetworkBehaviour push back on events coming from the ConnectionHandler? Today they are injected via NetworkBehaviour::inject_event with no way for the NetworkBehaivour to refuse an event other than dropping it.

Could you give an example where you think it makes more sense to poll the NetworkBehaviour before polling the Pool?

Good question. I should have included an example in the pull request description.

Say we have a request-response style protocol. The remote is sending requests to the local node back-to-back. They are read from the socket in the corresponding ConnectionHandler, send to the Pool via the connection->pool channel. The Pool bubbles the events up the Swarm, which in turn forwards the request to the NetworkBehaviour.

Say that the NetworkBehaviour is busy handling a set of requests. The Swarm would thus poll the NetworkBehaviour and not the Pool. Given that the Pool is not polled, new requests are not taken out of the connection->pool channel. Thus the channel eventually fills up and the ConnectionHandler stops reading new requests from the socket. Not reading new requests from the socket results in TCPs back-pressure mechanism to kick-in, thus stopping the remote from sending new requests into the socket. Thus the remote is being forced to slow down in order to not overwhelm the local node.

Does the above example make sense @elenaf9?

@elenaf9
Copy link
Contributor

elenaf9 commented May 4, 2022

Thanks for the explanation @mxinden! Yes that makes sense, I agree with you then that NetworkBehaviour should be polled before Pool.

@mxinden
Copy link
Member Author

mxinden commented May 4, 2022

I think prioritization of local work <-> remote work has to happen within the NetworkBehaviour (e.g. do local work first before handling events reported by the ConnectionHandler).

But how would a NetworkBehaviour push back on events coming from the ConnectionHandler? Today they are injected via NetworkBehaviour::inject_event with no way for the NetworkBehaivour to refuse an event other than dropping it.

By the way, long term it would be cool for a NetworkBehaviour to be able to push back on events it can currently not handle. Though I have not been able to think of a way to do so.

Copy link
Contributor

@elenaf9 elenaf9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
Also very much in favor of splitting larger poll methods into smaller chunks.

@rkuhn
Copy link
Contributor

rkuhn commented May 4, 2022

By the way, long term it would be cool for a NetworkBehaviour to be able to push back on events it can currently not handle. Though I have not been able to think of a way to do so.

So you want Erlang-style selective receive from a Mailbox, or its close cousin join calculus.

Copy link
Contributor

@MarcoPolo MarcoPolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@mxinden mxinden merged commit afc5b8d into libp2p:master May 5, 2022
@mxinden
Copy link
Member Author

mxinden commented May 6, 2022

By the way, long term it would be cool for a NetworkBehaviour to be able to push back on events it can currently not handle. Though I have not been able to think of a way to do so.

So you want Erlang-style selective receive from a Mailbox

I guess so. Still have to give it more thought.

or its close cousin join calculus.

Will take a look. Thanks @rkuhn.

@elenaf9
Copy link
Contributor

elenaf9 commented May 10, 2022

One more thing that just came to my mind:
Because the behaviour is now polled before the pool, I think the promise stated in #2262 (comment) does not hold true anymore:

In general, NetworkBehaviour::poll is guaranteed to be called after NetworkBehaviour::inject_xxx is called

This is relevant for the very specific case that a behaviour internally holds a timer that is reset within those method calls. For almost all method calls this won't be a problem because a SwarmEvent is returned and afterwards everything is polled again.
Only for inject_event and inject_address_change this does not happen.
Looking through our existing behaviour this won't cause issues for any of them. But maybe this is reason enough to amend a changlog stating this change, in case that a user's behaviour relies on this.

@mxinden
Copy link
Member Author

mxinden commented May 11, 2022

Because the behaviour is now polled before the pool, I think the promise stated in #2262 (comment) does not hold true anymore:

In general, NetworkBehaviour::poll is guaranteed to be called after NetworkBehaviour::inject_xxx is called

I think it does:

Say that this.pool.poll resolves to a PoolEvent::ConnectionEvent.

rust-libp2p/swarm/src/lib.rs

Lines 1082 to 1092 in f04f6bb

// Poll the known peers.
match this.pool.poll(cx) {
Poll::Pending => {}
Poll::Ready(pool_event) => {
if let Some(swarm_event) = this.handle_pool_event(pool_event) {
return Poll::Ready(swarm_event);
}
continue;
}
};

Thus this.handle_pool_event(pool_event) would be called with the PoolEvent::ConnectionEvent.

The corresponding match arm would call inject_event on the NetworkBehaviour, potentially making the NetworkBehaviour reset a timer.

PoolEvent::ConnectionEvent { peer_id, id, event } => {
if self.banned_peer_connections.contains(&id) {
log::debug!("Ignoring event from banned peer: {} {:?}.", peer_id, id);
} else {
self.behaviour.inject_event(peer_id, id, event);
}
}

Given that the match arm does not return, None is returned at the end of the method.

Back in Swarm::poll_next_event the conditional in the if is false (i.e. not None) and thus the continue statement is hit.

rust-libp2p/swarm/src/lib.rs

Lines 1086 to 1090 in f04f6bb

if let Some(swarm_event) = this.handle_pool_event(pool_event) {
return Poll::Ready(swarm_event);
}
continue;

Thus execution jumps to the top of the loop where the NetworkBehaviour will be called again.

rust-libp2p/swarm/src/lib.rs

Lines 1031 to 1080 in f04f6bb

match this.pending_event.take() {
// Try to deliver the pending event emitted by the [`NetworkBehaviour`] in the previous
// iteration to the connection handler(s).
Some((peer_id, handler, event)) => match handler {
PendingNotifyHandler::One(conn_id) => {
match this.pool.get_established(conn_id) {
Some(mut conn) => match notify_one(&mut conn, event, cx) {
None => continue,
Some(event) => {
this.pending_event = Some((peer_id, handler, event));
}
},
None => continue,
}
}
PendingNotifyHandler::Any(ids) => {
match notify_any::<_, _, TBehaviour>(ids, &mut this.pool, event, cx) {
None => continue,
Some((event, ids)) => {
let handler = PendingNotifyHandler::Any(ids);
this.pending_event = Some((peer_id, handler, event));
}
}
}
},
// No pending event. Allow the [`NetworkBehaviour`] to make progress.
None => {
let behaviour_poll = {
let mut parameters = SwarmPollParameters {
local_peer_id: &this.local_peer_id,
supported_protocols: &this.supported_protocols,
listened_addrs: &this.listened_addrs,
external_addrs: &this.external_addrs,
};
this.behaviour.poll(cx, &mut parameters)
};
match behaviour_poll {
Poll::Pending => {}
Poll::Ready(behaviour_event) => {
if let Some(swarm_event) = this.handle_behaviour_event(behaviour_event)
{
return Poll::Ready(swarm_event);
}
continue;
}
}
}
}

@elenaf9 am I missing something?

@elenaf9
Copy link
Contributor

elenaf9 commented May 11, 2022

@elenaf9 am I missing something?

Nope, I was missing that there was the continue statement after this.handle_pool_event. Sorry for the noise.

@mxinden
Copy link
Member Author

mxinden commented May 11, 2022

@elenaf9 am I missing something?

Nope, I was missing that there was the continue statement after this.handle_pool_event. Sorry for the noise.

By now I feel comfortable with the either return or continue once progress (Poll::Ready) was made pattern. That said, I don't think this is intuitive when first seeing it. Thus happy for alternative suggestions @elenaf9.

@elenaf9 elenaf9 mentioned this pull request May 19, 2022
1 task
@mxinden mxinden mentioned this pull request May 29, 2022
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants