sc_network::NetworkService::write_notifications should have proper back-pressure #5481
While intrusive, I am in favor of making this call asynchronous. It would allow for proper backpressure. Whenever the queue is full for a given peer, we need to make sure to reduce the reputation of the peer and eventually disconnect, as otherwise a peer could halt all of Grandpa.
This sounds like a lot more work than the suggestion above it.
Agreed on the complexity this might cause.
From what I understand this is already partially the case. While probably a good property to have anyway, there is no feedback to Grandpa, so it has no clue that messages are being dropped or why they are being dropped. Idea (1) would give that feedback to Grandpa.
I think this could be an interim solution until (1) is in place.
Isn't it the case that some later messages can supersede former messages? If so, then rather than spitting out these notifications and expecting them all to be delivered, maybe instead have some sort of channel system so that grandpa can express which messages still actually need to be delivered at any given time. So e.g. if you send a notification with a channel 42 and then at some time later send another with the same channel 42, then networking won't bother trying to send the first.
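A rough sketch of what such a channel-keyed queue could look like (the names and structure here are illustrative assumptions, not an existing Substrate API):

```rust
use std::collections::HashMap;

// Hypothetical per-peer send state: keeping only the latest message per
// channel means an unsent message is silently superseded by a newer one
// on the same channel.
type ChannelId = u64;

#[derive(Default)]
struct ChannelledSendQueue {
    pending: HashMap<ChannelId, Vec<u8>>,
}

impl ChannelledSendQueue {
    /// Queue `msg` on `channel`, replacing any older message on the same
    /// channel that has not been sent yet.
    fn queue(&mut self, channel: ChannelId, msg: Vec<u8>) {
        self.pending.insert(channel, msg);
    }

    /// Take one pending message to hand to the network layer, if any.
    fn next_to_send(&mut self) -> Option<(ChannelId, Vec<u8>)> {
        let channel = *self.pending.keys().next()?;
        self.pending.remove(&channel).map(|msg| (channel, msg))
    }
}
```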
This just doesn't work - the grandpa paper & the asynchronous network model can provide more information. However, it is more acceptable to drop messages to a single peer. The issue is when messages are dropped completely and not delivered at all.
In some rare cases, but not for the majority of messages. This would specifically apply to COMMIT messages, which are fine to drop anyway.
Maybe I'm overestimating the difficulty of the change, but I really don't see this as easy. You can't just do this naively; however, I also don't personally know enough about GP's code to know what to change to properly handle that.
What timeout behavior are you thinking of? Grandpa relays messages using substrate-network-gossip and doesn't call write_notification itself. For network-gossip, the send-message API could internally join all the write-notification calls under a single future. The only exception is things like neighbor packets, where we really want them to be delivered to specific peers before we can proceed. How we handle these failures depends on what happens to a peer when they are unresponsive.
My last comment is mostly about how to handle failures to deliver messages. Since we don't handle this right now, what we mostly care about as an incremental change is failures to buffer messages. So the initial change should focus on that.
Just to have data, here is a graph of the maximum size that the buffer reached in the past 24 hours. (The first peak corresponds to an era change and the second to a session change, but that's irrelevant for this issue.)
My opinion is that
One major reason for introducing an application-side send buffer is that only the application knows how to prioritise buffered messages and drop obsolete messages, as I describe here. With this approach you'd then set the OS-level send buffer to a small value (enough to hold a single message). Making use of the OS-level TCP send buffer in a significant way is only really appropriate if the data you're sending has a strict linear ordering, which is the case when you are uploading a single large file or using SSH, but is not the case for p2p gossip protocols.
@infinity0 The application level should adapt to limited bandwidth by producing fewer messages in the first place, and not by generating a lot and then dropping some of them from the queue. E.g. if the peer can't handle GRANDPA traffic, start sending only every other round. Drop them if there are too many rounds skipped.
Grandpa, and many other applications, cannot avoid producing messages in the first place - these are broadcast messages generated independently of the receiving peers, with no knowledge of the specific peers that might be receiving them. The cleanest way to implement flow control, therefore, is to drop the messages that become obsolete before each peer can process them. There is no tenet of network engineering that says applications "should adapt to limited bandwidth by producing fewer messages in the first place"; I do not know why you think that is somehow better or "should".
Following up on a discussion in Riot. Question: what is the difficulty of introducing backpressure into finality-grandpa? Finality-grandpa depends on the network-gossip layer.
Instead of an asynchronous call, you can perform a synchronous call that writes synchronously into a bounded priority buffer, as I describe here. That also avoids the need to perform refactorings to cope with the added asynchrony.
Yes, introducing a bounded priority buffer would be a second option. That way the methods on the networking side could stay synchronous. There are a couple of problems that come to my mind, which is why we have not pursued this approach:
By caller, do you mean the producer (upper layer)? The consumer (network layer, sending side) is also going to be calling the queue to pop items off it. The producer should not need to check the status of the queue in a loop - it simply pushes into it synchronously, and synchronously performs any producer-specific logic needed to keep the queue within bounds. The consumer (the per-peer sending thread, AIUI) reads synchronously from the queue, blocking until it is non-empty.

The key point is that dropping is unavoidable in our scenario. Any solution that does not involve dropping messages is doomed to fail - it is physically not possible to send X+Y bandwidth when your peer is only receiving Y bandwidth, without dropping things. Making things asynchronous does not change that.

In a simpler scenario, such as a plain proxy where A is sending to you and you are forwarding this to B, you may use backpressure from B to in turn apply the same backpressure to A, and then you don't need to drop anything explicitly. But we are dealing with a complex p2p protocol with multiple senders and multiple recipients, so it is unrealistic to attempt to "pass through" the backpressure from our recipient peers to our sending peers. Dropping is far simpler.
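As a minimal sketch of the kind of synchronous bounded queue being described, assuming an application-supplied policy for obsolescence and priority (the trait and type names below are illustrative, not existing Substrate code):

```rust
use std::collections::VecDeque;

// Hypothetical application-defined policy: which queued messages become
// obsolete when a newer one arrives, and how to rank what remains.
trait SendPolicy<M> {
    fn supersedes(&self, newer: &M, older: &M) -> bool;
    fn priority(&self, msg: &M) -> u32; // higher = more urgent
}

struct BoundedSendQueue<M, P> {
    buf: VecDeque<M>,
    cap: usize,
    policy: P,
}

impl<M, P: SendPolicy<M>> BoundedSendQueue<M, P> {
    fn new(cap: usize, policy: P) -> Self {
        Self { buf: VecDeque::new(), cap, policy }
    }

    /// Producer side: synchronous, never blocks. Obsolete messages are dropped
    /// first; if the queue is still full, the lowest-priority message goes.
    fn push(&mut self, msg: M) {
        let policy = &self.policy;
        self.buf.retain(|old| !policy.supersedes(&msg, old));
        if self.buf.len() >= self.cap {
            let lowest = (0..self.buf.len())
                .min_by_key(|&i| self.policy.priority(&self.buf[i]));
            if let Some(pos) = lowest {
                let _ = self.buf.remove(pos);
            }
        }
        self.buf.push_back(msg);
    }

    /// Consumer side (the per-peer sending task): take the most urgent message.
    fn pop(&mut self) -> Option<M> {
        let most_urgent = (0..self.buf.len())
            .max_by_key(|&i| self.policy.priority(&self.buf[i]))?;
        self.buf.remove(most_urgent)
    }
}
```

The producer only ever calls `push`; the per-peer sending task calls `pop` and is the only place that touches the network.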
The network-gossip issue
The location where the messages are actually written out is within network-gossip. There is an implementation of a gossip `Validator` trait on the application side (GRANDPA's gossip validator, for example) that network-gossip drives.

The functions on this trait do not return futures, and furthermore accept an `&mut dyn ValidatorContext` through which messages are sent, either from inside those functions or in other places.

Approaches to the network-gossip issue

The first approach is to modify both the `Validator` and `ValidatorContext` traits so that sending can be awaited. This may be a more robust approach in the long term, but it also requires changes within the implementation of every gossip validator. However, there is also another approach: alter the practical implementation of the validator context so that, instead of writing notifications out immediately, it buffers them and applies its own dropping logic.
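A rough sketch of that second approach, using a simplified stand-in for the gossip context trait (the real sc-network-gossip trait has more methods and is generic over the block type; everything below is an illustrative assumption):

```rust
use std::collections::{HashMap, VecDeque};

// Simplified stand-in types for the sketch.
struct PeerId(u64);

trait GossipContext {
    fn send_message(&mut self, who: &PeerId, message: Vec<u8>);
}

/// A context that buffers outgoing messages per peer instead of writing them
/// to the network immediately. A separate background task drains these queues
/// and is the only place that would actually call `write_notification`.
struct BufferingContext {
    queues: HashMap<u64, VecDeque<Vec<u8>>>,
    cap: usize,
}

impl GossipContext for BufferingContext {
    fn send_message(&mut self, who: &PeerId, message: Vec<u8>) {
        let queue = self.queues.entry(who.0).or_default();
        // Crude bound: drop the oldest message when full. Real prioritisation
        // or obsolescence rules would go here instead.
        if queue.len() >= self.cap {
            queue.pop_front();
        }
        queue.push_back(message);
    }
}
```

The application-side gossip code keeps calling `send_message` synchronously; only the context implementation and the draining task change.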
Following up on #5481 (comment) and expanding on the question: why is it hard to make the call-sites async, or have them have access to something they can await?

Let's e.g. look at the finality-grandpa gossip validator (substrate/client/finality-grandpa/src/communication/gossip.rs, lines 1277 to 1283 in 0ab1c4f). Within its implementation of the gossip `Validator` trait, this function gets as a parameter a `&mut dyn ValidatorContext`, whose send methods are synchronous and do not return futures. When introducing backpressure into the sending path, those calls would need to become awaitable. Thus, in order to achieve our goal, we would (1) need to get rid of the synchronous `ValidatorContext` methods, or have them return futures, which ripples through all of their call-sites.
This is another benefit of the bounded priority queue approach, whether it is implemented in an async or sync way - you do not need to modify the existing application logic, only add to it, to specify:

1. which queued messages have become obsolete and can be dropped, and
2. the relative priority of the messages that remain.
In the meantime, Grandpa can still produce messages in the same way as it did before, i.e. in response to messages we receive, without worrying about who might be receiving the messages we send. By contrast, folding what is effectively the same logic as above (i.e. 1, 2) into the generation of Grandpa's responses to specific outgoing peers would be much more complex, since Grandpa would then have to be aware of which specific peers we are sending to, which it really does not need to be aware of.
Referencing #5481 (comment)
My above comment argues against this; see the second approach I mention.
And this again doesn't seem too bad.
The last open question raised by my comment is: what should the semantics of the send call be when the buffer is full? I argue that we should emulate whatever the network code currently does when the unbounded send of notifications overflows its internal buffer.
Can you expand on how that would work @rphmeier? How would that list be bounded? Who is making sure all items from the list are sent down to the asynchronous network?
I believe the GRANDPA gossip validator already maintains some network-related context for each peer.
It is a simple efficiency principle: don't waste CPU and memory. Adding something to a queue only to drop it later sounds like a waste if it can be avoided.
The logic for reifying the abstract broadcast protocol as a point-to-point protocol can be completely generic over what Grandpa does as a consensus protocol. It can, and should, be written in a way that works for any possible application protocol, and the interface I outlined involving the bounded priority buffer achieves this separation.
In general, you should prefer simplicity over efficiency, especially if the efficiency is theoretical - other overheads involved in the complexity of doing what you're suggesting can totally wipe out any potential efficiency gain. In this case, the messages must be generated at the earliest possible opportunity anyway, and must reside in memory until all peers have consumed them - so adding references to the message to the queue of each peer, then dropping them later, is not a heavy cost. Simply put, the efficiency gain in the alternative approach is not worth it.
It is pretty simple. Imagine that the gossip validator is driven by a loop like this:

```rust
async fn do_work_with_gossip_validator(val: &mut impl Validator, net: &mut NetworkHandle) {
loop {
let mut actions = Vec::new();
val.validate(net.recv().await, &mut actions); // synchronous.
for Action::SendMessage(to, msg) in actions {
// assume only one action type, in practice do a `match`.
net.send(to, msg).await;
}
}
}
```

So it's basically this, but with the real validator and network types. I do want to say that this is only one approach. And I also went into why your suggested approach (which I tend to agree with), of making all of the buffering synchronous on the application side, still has some open questions.
In practice we are going to have a buffer full of messages from different components. Take a look at the "network bridge" architecture of Polkadot's networking code: https://github.com/paritytech/polkadot/blob/ff708f3/roadmap/implementors-guide/src/node/utility/network-bridge.md

I don't see any way we can expose a synchronous API for a peer's sending buffer in a meaningful way to other subsystems if there is only one buffer per peer and this buffer needs to be shared across all application logic.

Alternatively, we can allow creating multiple buffers per peer and give one of those to each subsystem. Then each subsystem would be inspecting only the messages originating from its corner of the application-specific logic. I'm still not sure exactly what the API would be or whether this is user-friendly at all. Maybe something like this?

```rust
struct PeerSendingSet {
// a fixed-size buffer for each peer.
}
impl PeerSendingSet {
/// Push a message onto the buffer for a peer, returning the message if the buffer is full.
/// On an `Err` we'd expect a `peer_buffer_retain` call or `flush_until_space` call.
fn push_message(&mut self, peer: PeerId, msg: Vec<u8>) -> Result<(), Vec<u8>> {
...
}
/// Inspect each item from the buffer in order, retaining only those which pass the predicate.
// Annoyance: this struct stores encoded messages we'd have to decode all the time.
fn peer_buffer_retain(&mut self, peer: PeerId, pred: impl Fn(&[u8]) -> bool) { ... }
/// Asynchronously drive network flushes for all peers until
/// there is space for at least `n` messages in this specific peer's buffer.
/// If the buffer length is less than `n`, only wait until the buffer is empty.
// Annoyance: frame this in terms of bytes instead?
async fn flush_until_space(&mut self, peer: PeerId, n: usize) { ... }
/// Asynchronously works to flush messages. Once a message leaves the buffer and gets to the network, it
/// can be assumed to have been successfully sent.
///
/// This future is an infinite loop, so it should be `selected` with some other event producers.
async fn flush(&mut self) { ... }
}
```

So each subsystem gets one of these, with the ability to add and remove peer connections. It uses some kind of underlying network API under the hood. Used like this:

```rust
// For subsystems, we might obtain this from the network bridge.
let mut peer_channel_set = PeerSendingSet::new(buffer_size_for_this_protocol);
loop {
select! {
// unless we have something better to do, try to flush messages to the network all the time.
_ = peer_channel_set.flush() => {},
x = network_events.next() => match x {
NetworkEvent::PeerConnected(peer) => { peer_channel_set.add_peer(peer); }
NetworkEvent::PeerDisconnected(peer) => { peer_channel_set.remove_peer(peer); }
_ => { } // ignore incoming messages
}
x = logic_events.next() => match handle_logic(x) {
Action::SendMessage(peer, msg) => {
if let Err(msg) = peer_channel_set.push_message(peer, msg) {
// decide whether to `flush_until_space`, `peer_buffer_retain`, or drop
// the peer. I'm unsure how to handle this general situation where we
// don't want one peer to slow down our handling of all other peer
// connections. It seems like choosing peers to disconnect should happen
// at a lower level, in which case `flush_until_space` needs to resolve
// shortly after the peer is disconnected so we don't deadlock here.
}
}
}
}
}
```

This is set up so that we can ensure we are still trying to flush messages to other peers even while waiting on a slow/overloaded peer. The long comment I have about choosing when to move on and say "fuck it, this peer is too slow" is my main open question. I think this occurs in some way whether or not we take this "application does buffering" approach, and it will clearly need to vary from protocol to protocol.

So furthermore, on the bandwidth question: @infinity0, your initial proposal was to do bandwidth management based on the type of peer that each peer is. Whereas this buffering approach (which I am aware is my own tangled improvisation on top of misinterpretation, and not something that you are proposing) does bandwidth management based on which protocols each peer is part of and how much bandwidth should be expected for each peer in those protocols. This naturally becomes peer-type-based bandwidth management when we start to restrict the types of peers that are members of particular protocols.
With multiple subsystems, the simplest solution is to just send the messages over distinct streams - QUIC streams, or TCP connections, etc (yamux is faking it and we'd expect worse performance, but the API usage would "look similar" at least to the good versions). Each stream/subsystem would have its own outgoing buffer, managed by the subsystem according to its own logic. This should be the go-to option if the messages between the subsystems do not depend on each other and can be processed in any order by the recipient.

If the messages between the subsystems do depend on each other, and you really need to ensure that this ordering is preserved when sending (as opposed to having the recipient buffer out-of-order messages and re-create the correct order on their side), then there is no general solution and you'd need to pick a specific solution depending on how the subsystems compose together, and the dropping logic might have to look at all the outgoing buffers. However, for a blockchain protocol this might not be so bad in practise. Concrete example:
Suppose subsystem A's outgoing buffer for a peer contains (9, mainA) plus some (9, aux-*) messages, and subsystem B's buffer contains (9, mainB) plus some (9, aux-*) messages, where (9, XXX) means "some message at round 9". When subsystem A wants to send (10, mainA), indicating the start of round 10, it can drop all the (9, aux-*) messages from both buffers. On top of this, you'd need to define whether you should send (9, mainA) before or after (9, mainB), since the network sending logic needs to pick one when the receiving peer unblocks their connection.

I'd say that the distinct-streams solution is sufficient for a first implementation and we can see if it actually causes problems before attempting the second, more sophisticated solution.

In terms of a rust interface, since every application needs to specify its own drop/priority logic, IMO the very high-level API should just be something like:

```rust
trait PeerSendQueueWriter {
type Msg;
/// Push a message onto the buffer for a peer, dropping any obsolete messages to maintain the size bound.
/// If this is not possible (e.g. because all existing messages are critical to the application) then returns an Error.
/// In this case you should probably log an error, alert the node operator, and perhaps disconnect the peer.
fn push_message(&mut self, msg: Self::Msg) -> Result<(), SomeError>;
}
trait PeerSendQueueReader {
type Msg;
/// Pop the most urgent message from the buffer for a peer, blocking until one is available.
fn pop_message(&mut self) -> Self::Msg;
}
```

You need an associated abstract type `Msg` so that the drop/priority logic can work with decoded messages rather than raw bytes. In order to implement this trait, you might use something like the buffering structure sketched above.
Also, a simple naive implementation can simply drop the least-recently-pushed message (or even have the queue be unbounded, and never drop anything), and have the priority ordering be random (or select the least-recently-pushed message again). That can be a test to see that any refactoring was successful, before we think about how Grandpa should actually define the dropping/priorities.
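A sketch of such a naive baseline, assuming a shared queue behind a `Mutex` and `Condvar` (names and structure are illustrative, not proposed Substrate code):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

// Naive baseline: bounded FIFO, drop the least-recently-pushed (front) message
// on overflow, no real priority ordering. Useful only to validate the
// refactoring before Grandpa-specific drop/priority rules are written.
struct Shared<M> {
    queue: Mutex<VecDeque<M>>,
    ready: Condvar,
    cap: usize,
}

#[derive(Clone)]
struct NaivePeerSendQueue<M>(Arc<Shared<M>>);

impl<M> NaivePeerSendQueue<M> {
    fn new(cap: usize) -> Self {
        Self(Arc::new(Shared {
            queue: Mutex::new(VecDeque::new()),
            ready: Condvar::new(),
            cap,
        }))
    }

    /// Writer half: never fails, silently evicts the oldest message when full.
    fn push_message(&self, msg: M) {
        let mut queue = self.0.queue.lock().unwrap();
        if queue.len() >= self.0.cap {
            queue.pop_front();
        }
        queue.push_back(msg);
        self.0.ready.notify_one();
    }

    /// Reader half (the per-peer sending task): block until a message exists.
    fn pop_message(&self) -> M {
        let mut queue = self.0.queue.lock().unwrap();
        loop {
            if let Some(msg) = queue.pop_front() {
                return msg;
            }
            queue = self.0.ready.wait(queue).unwrap();
        }
    }
}
```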
I think flushing is a bit of a red herring in this scenario. Instead of flushing, we can just set our own TCP send buffers to be relatively small, possibly even 0 if that works. From the upper layers' perspective, I don't see an inherent difference between local TCP send buffers and all the in-flight network traffic already out there - they are all part of the same overall pipe, and flushing is not going to make the overall process any quicker.
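For illustration only (Substrate/libp2p manage their own sockets internally, so this is just showing the OS knob being referred to): shrinking the OS-level send buffer is a standard socket option, e.g. via the socket2 crate:

```rust
use socket2::{Domain, Protocol, Socket, Type};

// Sketch: ask the OS for a small TCP send buffer, so that most queued data
// lives in the application-level buffer where it can still be reprioritised
// or dropped. The kernel may round the value up to its minimum, so
// "possibly even 0" is best-effort.
fn small_send_buffer_socket() -> std::io::Result<Socket> {
    let socket = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
    socket.set_send_buffer_size(4 * 1024)?;
    Ok(socket)
}
```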
Yes, this all sounds fine; per-stream control implies per-peer control.
For the higher level use-cases that I am thinking of, there is some of this, but only a small amount. In particular, there is one protocol for neighbor view updates that all other network protocols will hinge on to some extent. Still, off the top of my head I don't see any obvious ways that reordering across subsystems could cause issues in the set of subsystems that we are planning to implement.

Having a separate low-level substream per peer makes sense to me, but I still don't understand how it would fit in with bandwidth control. What I've referred to as "flushing" is probably better described as "completing" the sends from application-space into libp2p space, as opposed to flushing on any kind of socket - basically just freeing up the source buffer that we use in some subsystem.

However, it might not be that easy to get new substreams for each peer on-demand, as that is what's required here. @mxinden and @tomaka could answer that better than me.
Since #6692, there is now an alternative API for sending out notifications. As such, I'm going to close this issue in favour of paritytech/polkadot-sdk#551.
We need an answer to the question: what happens if someone runs the code below?
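The pattern in question is, roughly, a tight loop of fire-and-forget notifications; a hypothetical sketch with placeholder types (not the real sc_network API):

```rust
// Hypothetical illustration of the problematic pattern: issuing many
// notifications in a tight loop with no feedback about whether the peer is
// keeping up. `Net`, `PeerId`, and the method are stand-ins.
struct PeerId;
struct Net;

impl Net {
    fn write_notification(&self, _target: &PeerId, _message: Vec<u8>) {
        // In the real service this enqueues onto a bounded per-peer buffer.
    }
}

fn flood(net: &Net, peer: &PeerId) {
    for i in 0..10_000u32 {
        net.write_notification(peer, i.to_le_bytes().to_vec());
    }
}
```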
Right now we would buffer up 256 of these notifications and then discard the 257th and above.
While it could be a sensible option (UDP-style), GrandPa in particular isn't robust to messages being silently dropped.
I can see a couple of solutions:

1. Turn `write_notification` into an asynchronous function whose future is ready once the remote has accepted our message. This is the way that this problem would normally be solved, but would require changes in GrandPa. (See the sketch after this list.)
2. `write_notification` returns an error if the queue is full. Note that this might be difficult to implement because the queue of messages is actually in a background task, but if that's the solution I will accommodate for it.
3. If the buffer is full, we drop all pending messages, then report on the API that we have disconnected and reconnected. GrandPa (and others) need to account for random disconnections anyway, and by making a "full buffer" look like a disconnection, we use the same code for both. This might however cause an infinite loop if we immediately send messages as a response to a reconnection.
4. Make GrandPa robust to messages being silently dropped.
5. Just increase the size of the buffer if we notice that the limit gets reached.
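For illustration, rough shapes of what options 1 and 2 could look like (all names and types below are placeholders, not the actual sc_network API):

```rust
// Hypothetical shapes for options 1 and 2; everything here is a placeholder.
struct PeerId;
struct QueueFull;
struct NotificationsOut;

impl NotificationsOut {
    /// Option 1: only resolve once the message has been accepted into the
    /// peer's queue, so a slow peer back-pressures the caller.
    async fn write_notification(&self, _target: &PeerId, _message: Vec<u8>) {
        // ... await until the per-peer queue has room ...
    }

    /// Option 2: stay synchronous, but report a full queue to the caller
    /// instead of silently dropping the 257th message.
    fn try_write_notification(&self, _target: &PeerId, _message: Vec<u8>) -> Result<(), QueueFull> {
        // ... return Err(QueueFull) when the bounded queue is at capacity ...
        Ok(())
    }
}
```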
cc @rphmeier @andresilva @mxinden