gossipsub: Lazy sending #850
That seems a bit scary to me; it might open a new category of attacks ("slow peers" that lower our score with legit peers, since we become slow). We could, however, cancel the send if we receive the duplicate while the message was queued.
Well, there are already cases where we run with
My worry isn't that we only send to Dmin peers (though that will also have an impact on the time it takes to reach the full network), but that other peers in our mesh will think that we are slow, and thus descore us.
There are 2 cases here:
Good point. I would still like to do this in two steps: #851, which just cancels outgoing messages, and a second PR that caps the in-flight requests to Dmin (with a timeout). I think cancelling in-flight messages will naturally send less stuff if you are the one being slow, since more stuff will be queued. But we can measure & verify that.
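A minimal sketch of the cancellation half in isolation: queued sends are tracked per (peer, message) and dropped if the peer turns out to already have the message. The async `send(peer, data)` coroutine is assumed to be supplied by the caller; none of these names come from nim-libp2p.

```python
import asyncio

class LazySendQueue:
    """Track queued sends and cancel one when a duplicate reveals
    that the target peer already has the message."""

    def __init__(self, send):
        self._send = send  # async send(peer, data) coroutine, supplied by caller
        self._pending: dict[tuple[str, str], asyncio.Task] = {}

    def queue(self, peer: str, msg_id: str, data: bytes) -> None:
        key = (peer, msg_id)
        task = asyncio.create_task(self._send(peer, data))
        # forget the entry once the send actually completed
        task.add_done_callback(lambda _: self._pending.pop(key, None))
        self._pending[key] = task

    def on_duplicate(self, peer: str, msg_id: str) -> None:
        # We received msg_id from this peer: it has the message, drop the send.
        task = self._pending.pop((peer, msg_id), None)
        if task is not None:
            task.cancel()
```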
Idea from @lchenut: if we expose the "buffered send size" from the muxer, we could start the staggered send with the peers that have the smallest buffered amount, which would alleviate some of the slowdown in the "peer is the bottleneck" cases.
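A sketch of that ordering, assuming the muxer exposed a per-peer count of buffered bytes (the `buffered_bytes` mapping is hypothetical):

```python
def staggered_order(peers: list[str], buffered_bytes: dict[str, int]) -> list[str]:
    """Order peers for a staggered send: least-backlogged first, so the
    first copies go out on connections that can actually drain them."""
    return sorted(peers, key=lambda peer: buffered_bytes.get(peer, 0))

# Example: peer "b" has an empty send buffer, so it gets the message first.
print(staggered_order(["a", "b", "c"], {"a": 4096, "b": 0, "c": 512}))
```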
This is not a bad idea either - though in effect, it relies on the same underlying mechanism, namely more fine-grained tracking of "sent", all the way up to the point where messages are still individual messages and not yet flattened to form a stream. The other key idea this relies on is that in gossipsub, we can differentiate between "best-effort" messages that don't have to be sent and those that are critical for the protocol to function. One other aspect I thought of: sending messages to non-mesh peers may have an "island-breaking" effect in case a mesh becomes an isolated island, similar to how IHAVE works as a bridging message.
My first attempts to implement this in simulations turned out tricky, because the OS buffer is big enough that it's really hard to hit the point where it fills up. Example: on some random fleet node, out of 2120 sockets, only 542 sockets have something in the send queue, with on average 4k bytes in it. Sure, once the blocks get bigger, the buffer will get a bit more busy, but realistically not by much. In my early simulation attempts, I used 0.5mb blocks broadcast with a 1mbit bandwidth cap per node (so too slow to be realistic), and for 80% of nodes, every send still fit in the buffer without blocking.
I'll try 3 variants:

2 - Add a fixed delay between peers:

```nim
const bandwidth = 10_000_000 div 8 # 10 mbit/s, expressed in bytes per second
await sleepAsync(milliseconds(message.len * 1000 div bandwidth))
```

3 - Reduce the TCP buffer size
Some simulation results.
Default: default gossipsub, no ping & no bandwidth limit (would ideally be 0 everywhere, but the CPU bottlenecks at some point). Sorry for the cryptic names.

Basically, Lazy Publish improves the minimum latency a lot (52%). That makes sense, since we send to a single node instead of all in parallel. Combining Lazy Publish & Full Lazy Push brings mean & max latencies down by 20%. I would have hoped for more reduction, since at this point the number of duplicates is close to 0; I'll try to investigate what is going on.
Well, this is true for the "first hop" - after that, peers start sending each other stuff redundantly, decreasing the utility of each send.
There are few duplicates in the first hops, and the first hops are the most important ones. Here is a simplified example that gives the right intuitions, I think. Let's say you have a network with 10k nodes, where each "step" of the network forwards the message to 8 peers, with targets picked at random (so with duplicates). Number of holders per step:

Now let's compare with a network with duplicates == 0:
While this takes a lot of shortcuts, advantaging both scenarios, we can see that they are not as different as one might expect. However, controlling the amount of parallel sends can give big benefits to spread time, as shown in my last message, which is why I'm now focusing on bringing this to life instead of trying to reduce duplicates.
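A rough sketch of this kind of comparison (my own simplified model, not the thread's exact one: 10k nodes, 8 forwards per holder per step, uniformly random targets in the duplicate case):

```python
N = 10_000  # network size
D = 8       # peers each holder forwards to per step

def steps_to_reach(frac: float, dedup: bool) -> int:
    """Steps until `frac` of the network holds the message."""
    holders, steps = 1.0, 0
    while holders < frac * N:
        sends = holders * D
        if dedup:
            # duplicates == 0: every send reaches a node without the message
            new = sends
        else:
            # random targets: each send lands on a given node with prob 1/N,
            # so a non-holder stays unreached with prob (1 - 1/N)^sends
            new = (N - holders) * (1 - (1 - 1 / N) ** sends)
        holders = min(N, holders + new)
        steps += 1
    return steps

print(steps_to_reach(0.95, dedup=True),
      steps_to_reach(0.95, dedup=False))  # -> 5 5: closer than one might expect
```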
Today I tested this branch: 9b11fa7, which limits the number of parallel sends based on an ACK. The ACK mechanism is also used to reduce the amount of duplicates (with limited success). Simulation parameters:
As visible on the graph, I originally got mixed results. Turns out, the TCP congestion algorithm was playing games with me. I didn't write it up this far, but I often had issues with the TCP congestion control. Basically, TCP is not optimized for bursty connections, especially with high latency: it needs a big stream of data to be able to guesstimate the available bandwidth. When only sending a big message (but not that big) every few seconds, it completely loses its footing. Switching to BBR (the default being CUBIC) helps a lot here. Unfortunately, BBR is AFAICT only available on linux, only with kernel >= 4.9, and requires a sysctl to enable. So basically:
Another interesting metric: 1 parallel send without BBR = 730ms, with BBR = 500ms. Also, BBR takes a bit of time to stabilize the connection; here is a graph of a single run: we can clearly see a slowdown in the first 30s. This simulation runs twice as fast as the real network (6-second slots), so in reality it would take up to 1m for a connection to open properly. This means that we must limit churn to some extent.
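Not the branch's actual ACK code, but a sketch of the general shape: a semaphore stands in for the ACK round-trip, so at most `max_parallel` sends are in flight and the next one starts only once a previous one has completed. The async `send` coroutine is assumed.

```python
import asyncio

async def capped_broadcast(peers, data, send, max_parallel=1):
    """Send `data` to every peer with at most `max_parallel` in flight."""
    sem = asyncio.Semaphore(max_parallel)

    async def one(peer):
        async with sem:          # wait for a free slot before sending
            await send(peer, data)

    await asyncio.gather(*(one(peer) for peer in peers))
```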
@Menduist your example on hops and duplicates is great!
I didn't come up with a formula, so I ran the simulation for a variety of parameters. The spreadsheet with results can be found here. Below is a sample slice of the data. According to these assumptions, for example, a message of size 600K has a theoretically minimal dissemination time of 8.8s in a network of 10K nodes.
Though this intuition does not account for "sending time"/"message size" and link saturation, which is critical to the problem at hand... i.e. it's not really latency we're targeting here, but rather bandwidth/duplication reduction.
Ok, so taking the same table, but adding a "percent of bandwidth wasted" column:
So what this "waste" does, is prevent us from getting faster "step". If you are sending 8 messages of 0.5mb over 25mbit connection, one step will take 1280ms. If you knew that 4 of them already have the message, you could send to just 4, and that step would take 640 ms instead. So, doing a table of "time per step":
The time with dups will always "send to 8 peers" at each step, so 1280ms every time. Reaching 95% of the network with dups thus takes 6400 ms vs 5080 ms without dups, a 20% improvement.
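Spelling out the step arithmetic, with the numbers straight from the comment above:

```python
MSG_MBIT = 0.5 * 8   # one 0.5 MB message, in megabits
LINK_MBIT = 25       # per-node upload bandwidth, Mbit/s

step_with_dups = 8 * MSG_MBIT / LINK_MBIT * 1000  # all 8 peers, every step
step_no_dups = 4 * MSG_MBIT / LINK_MBIT * 1000    # only the 4 peers missing it
print(step_with_dups, step_no_dups)               # -> 1280.0 640.0 (ms)
```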
I'm happy to call this a significant improvement in latency alone - i.e. gossip in general is hard to beat in terms of latency, so clawing back 20% due to increased message size effects is massive. The bandwidth reduction is also significant because, as soon as the block has been disseminated, we need the bandwidth for other things (attestations etc) - if we're still busy sending blocks at that time, the latency will accumulate.
@Menduist did you get a chance to try that option?
From my tests it seems to do its job great! Only the first transmission had that 'slow start' effect; all the following messages are delivered at their best speed. Here are results for emulated 25Mbps and 50ms delay (100ms RTT), 512K message, and 15 sec delay between messages:
@Nashatyrev as it happens, in your example you have a message size that is close to the BDP (25Mbps * 0.1s => BDP of 312KB). This is the worst-case regime in terms of what you are showing, i.e. in this regime slow start dominates transfer time. If you increase the BDP (or reduce the message size), then the transfer will be latency-bound and slow start will be less and less relevant. Conversely, if you increase the message size (or reduce the BDP), then the transfer will be bandwidth-bound and again slow start will be less relevant. It might be worth assessing how often we will be in the "message size ~ BDP" regime before attempting to disable slow start via protocol or config changes.
I don't think increasing the BDP will help us, on the contrary. Our issue is precisely that we want to send big messages sporadically (i.e. after the window went idle and was reset), hence the idea to ping frequently to avoid the window going idle. We are not going to disable slow start, but "slow start after idle", which is specified in this RFC: https://datatracker.ietf.org/doc/html/rfc2581#section-4.1
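The ping idea in sketch form, assuming an async `ping()` supplied by the caller: fire it more often than the connection's idle threshold, so the congestion window never sits idle long enough to be reset.

```python
import asyncio

async def keep_window_warm(ping, interval: float = 0.5):
    """Ping frequently enough that the TCP connection never goes idle,
    avoiding the RFC 2581 section 4.1 'slow start after idle' reset."""
    while True:
        await ping()
        await asyncio.sleep(interval)
```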
Well, we don't get to choose - BDP is a property of the path. What I'm saying is: if the message size is >> BDP, then transfer time is dominated by bandwidth, i.e. the slow start phase is negligible. On the other end, if the message is << BDP, then transfer time is dominated by latency, and once again slow start is negligible. When (as in the example) message size ~~ BDP, that's where slow start has the highest relative impact on transfer time.

The one measured example given did not have "big messages" (again, in the TCP congestion control realm, what matters is message size relative to BDP).

Slow start after idle is just slow start on an existing connection.
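For concreteness, the standard BDP arithmetic applied to the example path (not code from the thread):

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: how many bytes fit 'on the wire' at once."""
    return bandwidth_bps / 8 * rtt_s

bdp = bdp_bytes(25e6, 0.1)  # 25 Mbps link, 100 ms RTT
print(bdp)                  # -> 312500.0 bytes, ~312 KB
print(512 * 1024 / bdp)     # -> ~1.7: the 512K message sits right around the BDP
```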
When a message is small enough (~~ the initial window), the transfer is limited by latency rather than bandwidth. Below are some samples with unbounded bandwidth, a 1Mb message, and various latencies. See the difference:
I think @henridf's point was that a big enough message will take so long to be transmitted that the effect of slow start is swallowed by the transfer itself. I tried to compute this; here are the results I got:
(BDP ratio = message size / BDP.) So basically, once we start sending messages of 5mb over a 25mbps / 100ms RTT connection, slow start will have a very small effect.
Thanks for digging in @Menduist!! Fwiw, in @Nashatyrev's simulation (#850 (comment)), the message size/BDP ratio is 1.6, and we have 650ms vs 250ms with/without slow start (so a factor of 2.6).
Fixed a mistake and updated the original table; here it is with a 16kb initial window, which matches the 2.6 factor at the 1.6 ratio:
I think the entries at 0.04 and 0.4 are incorrect - the speed factor should go down for BDP ratios < 1. The max is at 1.
Can you help me understand the "Gain factor"? At a BDP ratio of 16, I'd expect the transfer times with / without slow start to be quite close to each other. Here, there's a 100% gain factor, which would appear to indicate that slow start makes the transfer twice as slow, which is very counter-intuitive. (TCP would be pretty badly broken if that were the case 😅)
Sorry, that's the speed factor (time to send with the initial window / time to send with a large window).
That's what I was arguing about in #850 (comment). My point is that for small messages (<< BDP) the latency is the major limiting factor, not the BDP (i.e. not the bandwidth). Or am I missing something here? @Menduist could you please share the formula you used for your table calculations?
To recap:

- 0 < msgSize < initialWindow: limited by latency
- initialWindow < msgSize < BDP: limited by latency and slow start
- msgSize >> BDP: limited by bandwidth
It was originally in an ugly spreadsheet; I moved it to a python script. There is probably an off-by-one hiding in there, but it should be in the right ballpark.
The code below can be used to simulate the results in the first table here: #850 (comment)
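The original script isn't preserved here; the following is a minimal reimplementation of the same kind of model (the congestion window doubles each RTT from a 16kb initial window until it covers the BDP, then the remainder goes at line rate). It lands in the right ballpark, though not on exactly the thread's table values, since the modelling details differ.

```python
def transfer_time(msg: float, bw: float, rtt: float, init_cwnd: float = 16e3) -> float:
    """Rough slow-start model. msg and init_cwnd in bytes, bw in bytes/s,
    rtt in seconds. One window is sent per round trip, doubling each time,
    until the window covers the BDP and the pipe stays full."""
    bdp = bw * rtt
    cwnd, sent, t = init_cwnd, 0.0, 0.0
    while sent < msg:
        if cwnd >= bdp:
            t += (msg - sent) / bw  # pipe is full: bandwidth-bound tail
            break
        sent += cwnd                # one window of data per round trip
        t += rtt
        cwnd *= 2
    return t

def speed_factor(msg: float, bw: float, rtt: float) -> float:
    """Slow-start time vs an ideal transfer with an already-open window
    (one RTT of latency plus serialization time)."""
    return transfer_time(msg, bw, rtt) / (rtt + msg / bw)

# 512K message over 25 Mbps / 100 ms RTT (message ~1.7x BDP):
print(speed_factor(512e3, 25e6 / 8, 0.1))  # ~1.9: slow start nearly doubles it
```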
Consider:
In the above, we can skip sending to peer D as well - to make this work:
Problems:
One option: send eagerly to Dmin peers, then use lazy sending for the rest.
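A sketch of that strategy, with the transport calls injected as parameters (nothing here is the library's actual API):

```python
from typing import Callable, Sequence

def publish(data: bytes,
            mesh_peers: Sequence[str],
            send_now: Callable[[str, bytes], None],
            queue_lazy: Callable[[str, bytes], None],
            d_min: int = 4) -> None:
    """Eagerly send to d_min peers; queue cancellable sends for the rest."""
    for peer in mesh_peers[:d_min]:
        send_now(peer, data)    # critical copies: go out immediately
    for peer in mesh_peers[d_min:]:
        queue_lazy(peer, data)  # best-effort: can be cancelled on duplicate
```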