
gossipsub: Lazy sending #850

Open
arnetheduck opened this issue Jan 23, 2023 · 34 comments
Labels
gossipsub, performance (performance and resource usage)

Comments

@arnetheduck
Contributor

Consider:

  • peer A sends a message to us which is sent to validation - peer B sends the same message to us and when the message validates, we skip sending it to peer B
  • however, we still send the data to peer C and D
  • sending data to peer C takes time - for example, streaming 1mb to peer C takes about 1s on a 10mbit line
  • peer D completes sending the same message to us

In the above, we can skip sending to peer D as well - to make this work (rough sketch after the list below):

  • track the completion of "sending" to peer C
  • when the future completes, check if sending to D is still relevant

Problems:

  • C might be slow which increases latency of sending to D
    • send to Dmin peers, then use lazy sending for the rest
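
A minimal sketch of the idea, assuming chronos and a hypothetical sendToPeer helper whose future completes when the send has actually finished (strings stand in for peer ids, receivedFrom is a hypothetical callback checking the current duplicate state):

import chronos

const dMin = 4

proc broadcast(msg: seq[byte], peers: seq[string],
               receivedFrom: proc(peer: string): bool) {.async.} =
  for i, peer in peers:
    # lazy phase: beyond the first dMin peers, re-check whether this peer
    # already delivered the same message to us while we were busy sending
    if i >= dMin and receivedFrom(peer):
      continue
    await sendToPeer(peer, msg)   # hypothetical; completes when the send finished
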
@Menduist
Contributor

That seems a bit scary to me, it might open a new category of attacks ("slow peers" that lower our score with legit peers, since we become slow ourselves)

We could however cancel the send if we receive the duplicate while the message was queued

@arnetheduck
Contributor Author

well, there are already cases where we run with Dmin peers - ie we consider that "ok" - we can further limit damage by adding a "wait-for-send-timeout" in the low 100s of ms (or stagger sending after Dmin) - the savings here are significant though: the peer A/B case that we already handle today gives a noticeable reduction in processing/bandwidth

@Menduist
Contributor

well, there are already cases where we run with Dmin peers - ie we consider that "ok"

My worry isn't that we only send to Dmin peers (though that will also have an impact on the time it takes to reach the full network), but that other peers which are in our mesh will think that we are slow, and thus descore us

@arnetheduck
Contributor Author

our mesh will think that we are slow,

there are 2 cases here:

  • we run out of bandwidth -> sending in parallel slows down everyone because of multiplexing effects -> worst case for all -> lazy sending allows some peers to receive the whole message faster -> better scoring overall
  • we hit the timeout before we saturate bandwidth -> the >Dmin peers receive the message later than they otherwise would have - we have to tune the timeout parameter such that we stay within safe "slowness" limits with respect to scoring

@Menduist
Contributor

lazy sending allows some peers to receive the whole message faster -> better scoring overall

Good point. I would still like to do this in two steps, #851 which just cancels outgoing messages, and a second PR that caps the inflight requests to dmin (with a timeout)

I think the cancel of inflight will naturally send less stuff if you are the one being slow, since more stuff will be queued. But we can measure & verify that

@Menduist
Contributor

Idea from @lchenut: if we expose the "buffered send size" from the muxer, we could start the staggered send to the peers with the smallest buffered amount, which would alleviate some of the slowdown in the "peer is the bottleneck" cases

@arnetheduck
Contributor Author

Idea from @lchenut: if we expose the "buffered send size" from the muxer,

this is not a bad idea either - though in effect, it relies on the same underlying mechanism, namely more fine-grained tracking of "sent", all the way up to the point where messages are still individual messages and not yet flattened into a stream. The other key idea this relies on is that in gossipsub, we can differentiate between "best-effort" messages that don't have to be sent and those that are critical for the protocol to function.

By and large, Dmin represents the minimum viable spread in the protocol necessary for basic operation, Dmax is sort of "medium" priority (send-after-Dmin) and anything above is really best effort and should be dropped if there are not enough resources available / pushed back in the queue with respect to other messages in general.
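
To make the tiering concrete, here is an illustrative classification (not actual nim-libp2p code), assuming peers are ordered as described:

type SendPriority = enum
  spCritical    # first Dmin mesh peers: needed for basic operation
  spMedium      # peers up to Dmax: send after the critical ones
  spBestEffort  # anything beyond Dmax / non-mesh: queue last or drop under pressure

proc classify(peerIndex, dMin, dMax: int): SendPriority =
  if peerIndex < dMin: spCritical
  elif peerIndex < dMax: spMedium
  else: spBestEffort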

One other aspect I thought of: sending messages to non-mesh peers may have an "island-breaking" effect in case a mesh becomes an isolated island, similar to how IHAVE works as a bridging message.

@arnetheduck arnetheduck added the performance (performance and resource usage) label Mar 8, 2023
@Menduist
Contributor

Menduist commented Mar 9, 2023

My first attempts to implement this in simulations turned out to be tricky, because the OS buffer is big enough that it's really hard to hit the point where send blocks

Example: on some random fleet node, out of 2120 sockets, only 542 have something in the send queue, with on average 4k bytes in it. AFAIK, send will only block once this buffer is full, which AFAICT is 4,194,304 bytes (on this specific node)

Sure, once the blocks get bigger, the buffer will get a bit more busy, but realistically, send will only block if the peer cannot keep up with the network.

In my early simulation attempts, I used 0.5mb blocks broadcast with a 1mbit bandwidth cap per node (so too slow to be realistic), and for 80% of nodes, every send was instantaneous
(of course, this simulation only does blocks, and not attestations & co, but I don't think it matters at this scale)

@Menduist
Contributor

I'll try 3 variants:
1 - Push full message to dLow, send IHAVE to the rest
The RTT it will take for the IWANT to come in will give some natural staggering, and also drastically reduce the amount of duplicates (should only get dLow in the end). Example here, in the process of assessing its perf
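
A rough sketch of what variant 1 looks like (hypothetical sendMessage/sendIHave helpers, assuming chronos):

import chronos

proc lazyPush(msgId: string, msg: seq[byte],
              meshPeers: seq[string], dLow: int) {.async.} =
  for i, peer in meshPeers:
    if i < dLow:
      await sendMessage(peer, msg)   # full push to the first dLow peers
    else:
      await sendIHave(peer, msgId)   # the rest only learns the id and can IWANT it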

2 - Add fixed delay between peers
We can try to take some assumption and hardcode a delay between the first dLow peers and the rest. For instance

const bandwidth = 10_000_000 div 8 # 10 mbps, expressed in bytes per second
await sleepAsync(milliseconds(message.len * 1000 div bandwidth))

3 - Reduce the TCP buffer size
I'm not sure about this, because I think the send queue will get flushed once the peer has ACKed the data, which effectively adds 0.5 RTT of delay for each buffer_size bytes we send. Also, it might impact the TCP perf of the whole application, and may be hard to do in a platform-independent way

@Menduist
Contributor

Some simulation results.
Using:

  • 200 nodes
  • 25 mbps of bandwidth per node
  • 0.5mb blocks in a single message
  • 100 ms of jitter per node (so 200ms ping on average)

[chart]

Default: default gossipsub, no ping & bandwidth limit (would ideally be 0 everywhere, but the CPU bottlenecks at some point)
Latency: default gossipsub, no bandwidth limit, ping applied
Lat & BW: default gossipsub, bandwidth & ping applied
Lat & BW & Lazy: GS with "Push full message to dLow, send IHAVE to the rest" (lazy push)
Lat & BW & FL: GS with Full Lazy push: "Push to 0 peers, send IHAVE to everyone" (except for initial publish)
Lat & BW & LP: GS with Lazy Publish: when publishing, sleep between each peer to give time for message to be sent
Lat & BW & LP & FL: GS with Lazy Publish and Full Lazy push
Lat & BW & LB & FL: GS with Lazy Broadcast (like lazy publish, but applies to relayed messages too) and Full Lazy Push

Sorry for the cryptic names

Basically, Lazy Publish improves the minimum latency a lot (52%). That makes sense, since we send to one node at a time instead of to all in parallel.
The issue with Lazy Publish is that it assumes some bandwidth limit, and will be detrimental if the node has more bandwidth than what was assumed.

Combining Lazy Publish & Full Lazy push brings mean & max latencies down by 20%. I would have hoped for a bigger reduction, since at this point the number of duplicates is close to 0 - I'll try to investigate what is going on.

@Menduist
Contributor

Just to confirm, I did some quick spreadsheet maths to see how much gain we should be seeing, and it seems more promising.
[chart]

The time to spread should be linear with the number of peers sent to in parallel, and going from 8 to 1 should reduce latency by 60%. I'll try to understand why the "real" simulations didn't get us there.

@arnetheduck
Contributor Author

The time to spread should be linear with the number of peers sent to in parallel,

well, this is true for the "first hop" - after that, peers start sending each other stuff redundantly decreasing the utility of each send

@Menduist
Contributor

Menduist commented Mar 27, 2023

well, this is true for the "first hop"

first hops, and the first hops are the most important ones

Here is a simplified example that gives a good intuition, I think. Let's say you have a network with 10k nodes where each "step" of the network forwards the message to D (8) peers:

Number of holders  Percent of holders  Sent to  Duplicates
                1               0.01%        8           0
                9               0.09%       64           0
               73               0.73%      531          16
              568               5.68%     3264         983
             3832              38.32%     5715       20396
             9547              95.47%      448       45271
             9995              99.95%        5        3579
            10000             100.00%        0          40

Number of holders = last number of holders + last sent to
Sent to = last sent to * 8 - Duplicates
Duplicates = for _ in 0..<lastSentTo * 8: duplicates += percentHolders; holders += 1 - percentHolders (very much pseudocode)

Now let's compare with a network with duplicates == 0:

Number of holders  Percent of holders  Sent to
                1               0.01%        8
                9               0.09%       64
               73               0.73%      512
              585               5.85%     4096
             4681              46.81%     5319
            10000             100.00%        0

Charting this for clarity:
[chart]

While this takes a lot of shortcuts, advantaging both scenarios, we can see that they are not as different as one might expect.
The duplicates case wastes bandwidth, which could be saved, or used for other messages. But the effect on spread time isn't as big as one might expect. That is also confirmed in more realistic scenarios (compare "Lat & BW" vs "Lat & BW & FL", for instance, since FL completely eradicates duplicates)

However, controlling the number of parallel sends can give big benefits to spread time, as shown in my last message, which is why I'm now focusing on bringing this to life, instead of trying to reduce duplicates

@Menduist
Contributor

Menduist commented Mar 28, 2023

Today I tested this branch: 9b11fa7 which will limit the number of parallel sends based on an ACK. The ACK mechanism is also used to reduce the amount of duplicates (with limited success)
The cost is 1 RTT (since we wait for an ACK)

Simulation parameters:

  • 200 nodes
  • 25 mbps of bandwidth per node
  • latency 20..80ms per node
  • Jitter 8ms per node
  • 0.5mb messages

As visible on the graph, I originally got mixed results. Turns out, the TCP congestion algorithm was playing more games with me

I haven't mentioned it so far, but I often had issues with the TCP congestion control. Basically, TCP is not optimized for bursty connections, especially with high latency. It needs a big stream of data to be able to guesstimate the available bandwidth. When only sending a big message (but not that big) every few seconds, it completely loses its footing.
Even worse, it does not take into account the "overall available bandwidth"; instead it just tries to guess the available bandwidth of each connection.

Switching to BBR (default being cubic), which is supposed to be more responsive in its window scaling, and take into account multiple connections for pacing, seems to not only improve the results of "raw gossipsub", but also augment the effect of the optimizations developed here

Unfortunately, this is AFAICT only available on linux, and only with kernel >=4.9, and requires a modprobe

So basically:

  • Switching to BBR reduces mean latency by 20%
  • Limiting to 2 parallel sends improves on top of BBR by 25%*
  • Which brings the total reduction to 32%

* According to my completely theoretical graph, this should be 50%, not 25%, which would bring total reduction to 60%. While I doubt we'll get there, I think it's possible to get closer to it

Other interesting metric, 1 parallel send without BBR = 730ms, with BBR = 500ms
Since the limit is 25mbps and the message size is 0.5mb, the perfect transfer time would be 160 ms + <80ms of ping.
So while BBR brings us closer to that, the bursty nature of the connection is still problematic.

Also, BBR takes a bit of time to stabilize the connection, here is a graph of a single run:
[chart]

We can clearly see a slowdown in the first 30s. This simulation runs twice as fast as the real network (6 seconds slots), so in reality, it would take up to 1m for a connection to open properly. This means that we must limit churn to some extent

@Nashatyrev

@Menduist your example on hops and duplicates is great!
It inspired me to think about a theoretically optimal pubsub algorithm and what results it could yield. This is just to get an intuition of how far we are from the optimum.

  • Let's assume for simplicity that any node could send a message to any node
  • Optimal pubsub has no duplicate messages, thus any node receives a message just once
  • Once it has received and validated a message, a node starts sequentially forwarding it to nodes which didn't receive it yet
  • In the ideal network it should take exactly messageSize / bandwidth time to send a message to one peer
  • After sending a message it 'flies through the network' for latency time until reaching the target peer

I didn't come up with a formula, so I ran the simulation for a variety of parameters (a rough sketch of the model is at the end of this comment). The spreadsheet with the results can be found here

Below is a sample slice of the data:
[chart]

So, according to these assumptions, a message of size 600K for example has a theoretical minimum dissemination time of 8.8s in a network of 10K nodes with 10Mbps bandwidth, 100ms connection latency and 10ms of message validation time
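
For reference, a rough re-implementation of this model as a small Nim sketch (my own approximation, not the spreadsheet's code; times in seconds, a node counts as a holder once it has validated the message):

import std/heapqueue

proc idealSpreadTime(nodes: int, msgSizeBits, bandwidthBps,
                     latency, validation: float): float =
  ## Every holder forwards sequentially to nodes that don't have the message;
  ## each send costs msgSize/bandwidth, delivery adds latency + validation.
  let sendTime = msgSizeBits / bandwidthBps
  var readyAt = initHeapQueue[float]()   # when each holder can start its next send
  readyAt.push(0.0)                      # the publisher
  var holders = 1
  while holders < nodes:
    let startAt = readyAt.pop()          # earliest free sender
    let doneSending = startAt + sendTime
    let delivered = doneSending + latency + validation
    inc holders
    result = delivered                   # deliveries happen in increasing order
    readyAt.push(doneSending)            # sender moves on to its next peer
    readyAt.push(delivered)              # new holder starts forwarding too

when isMainModule:
  # 600K message, 10Mbps, 100ms latency, 10ms validation, 10K nodes;
  # rough only - the model details differ, so it won't exactly match the spreadsheet
  echo idealSpreadTime(10_000, 600_000.0 * 8, 10_000_000.0, 0.1, 0.01)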

@arnetheduck
Contributor Author

Here is a simplified example that gives a good intuition, I think.

Though this intuition does not account for "sending time"/"message size" and link saturation, which are critical to the problem at hand - i.e. it's not really latency we're targeting here, but rather bandwidth/duplication reduction.

@Menduist
Contributor

Menduist commented Apr 3, 2023

Though this intuition does not account for "sending time"/"message size" and link saturation

Ok so taking the same table, but adding a "percent of bandwidth wasted" (1 - Sent to / (Sent to + Duplicates)), and a step counter for clarity:

Step  Number of holders  Percent of holders  Sent to  Duplicates  Waste
   1                  1               0.01%        8           0     0%
   2                  9               0.09%       64           0     0%
   3                 73               0.73%      531          16     2%
   4                568               5.68%     3264         983    23%
   5               3832              38.32%     5715       20396    78%
   6               9547              95.47%      448       45271    99%

So what this "waste" does is prevent us from getting faster "steps". If you are sending 8 messages of 0.5mb over a 25mbit connection, one step will take 1280ms. If you knew that 4 of the peers already have the message, you could send to just the other 4, and that step would take 640 ms instead.

So, doing a table of "time per step":

Step  Percent of holders  Time without dups  Time with dups
   1               0.01%                  0               0
   2               0.09%               1280            1280
   3               0.73%               2560            2560
   4               5.68%               3814            3840
   5              38.32%               4799            5120
   6              95.47%               5080            6400

Time with dups will always "send to 8 peers" at each step, so 1280 ms every time.
Without dups, it will only send to "Sent to" peers, so it will be faster (formula: last Time without dups + 1280 * (1 - previous step's Waste))
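
The same computation in code form (same assumptions as above: D = 8, 0.5mb messages, 25mbps, so one full step is 8 * 0.5 * 8 / 25 = 1.28s):

const fullStepMs = 1280.0
# waste measured at the previous step (from the table above)
let prevWaste = [0.0, 0.0, 0.0, 0.02, 0.23, 0.78]
var withDups, withoutDups = 0.0
for i in 1 ..< prevWaste.len:          # step 1 starts at t = 0 for both
  withDups += fullStepMs
  withoutDups += fullStepMs * (1 - prevWaste[i])
  echo "step ", i + 1, ": ", withDups, " ms with dups vs ", withoutDups, " ms without"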

So, reaching 95% of the network with dups takes 6400 ms vs 5080 ms without dups, or a 20% improvement.
Reality being a bit more complex than tables and a calculator, the real gain in my simulations is around 13%. Which again is not nothing, but is probably less than what people expect when thinking about completely removing duplicates

@arnetheduck
Contributor Author

So, reaching 95% of the network with dups takes 6400 ms vs 5080 ms without dups, or a 20% improvement.

I'm happy to call this a significant improvement in latency alone - ie gossip in general is hard to beat in terms of latency so clawing back 20% due to increased message size effects is massive.

The bandwidth reduction is also significant because as soon as the block has been disseminated, we need the bandwidth for other things (attestations etc) - if we're still busy sending blocks at that time, the latency will accumulate.

@Nashatyrev

Switching to BBR (default being cubic), which is supposed to be more responsive in its window scaling, and take into account multiple connections for pacing, seems to not only improve the results of "raw gossipsub", but also augment the effect of the optimizations developed here

@Menduist did you get a chance to try this option?

sysctl -w net.ipv4.tcp_slow_start_after_idle=0

From my tests it seems to do its job great! Only the first transmission had that 'slow start' effect; all the following messages are delivered at their best speed

Here are results for emulated 25Mbps and 50ms delay (100ms RTT), 512K message, and 15 sec delay between messages:

net.ipv4.tcp_slow_start_after_idle = 1 :

[08:58:19.019] [client] Write 512Kb (count: 1, total: 512Kb)
[08:58:19.688] [server] Read TOTAL count: 159, size: 512Kb in 540 ms)
[08:58:34.339] [client] Write 512Kb (count: 1, total: 512Kb)
[08:58:34.993] [server] Read TOTAL count: 166, size: 512Kb in 521 ms)
[08:58:49.342] [client] Write 512Kb (count: 1, total: 512Kb)
[08:58:49.996] [server] Read TOTAL count: 167, size: 512Kb in 522 ms)
[08:59:04.344] [client] Write 512Kb (count: 1, total: 512Kb)
[08:59:04.998] [server] Read TOTAL count: 166, size: 512Kb in 521 ms)
[08:59:19.360] [client] Write 512Kb (count: 1, total: 512Kb)
[08:59:20.015] [server] Read TOTAL count: 165, size: 512Kb in 522 ms)

net.ipv4.tcp_slow_start_after_idle = 0 :

[08:55:52.867] [client] Write 512Kb (count: 1, total: 512Kb)
[08:55:53.536] [server] Read TOTAL count: 157, size: 512Kb in 539 ms)
[08:56:08.192] [client] Write 512Kb (count: 1, total: 512Kb)
[08:56:08.447] [server] Read TOTAL count: 180, size: 512Kb in 170 ms)
[08:56:23.195] [client] Write 512Kb (count: 1, total: 512Kb)
[08:56:23.454] [server] Read TOTAL count: 177, size: 512Kb in 171 ms)
[08:56:38.197] [client] Write 512Kb (count: 1, total: 512Kb)
[08:56:38.451] [server] Read TOTAL count: 168, size: 512Kb in 172 ms)
[08:56:53.214] [client] Write 512Kb (count: 1, total: 512Kb)
[08:56:53.468] [server] Read TOTAL count: 170, size: 512Kb in 171 ms)

@Menduist
Contributor

Menduist commented Apr 24, 2023

Indeed, the main reason cubic was so bad is the slow start after idle. However, instead of changing this sysctl parameter, which can't be tuned per socket, we can simply ping every RTT to avoid the congestion control ever going idle.
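
Something along these lines (hypothetical pingPeer helper; any ping protocol over the same connection would do):

import chronos

proc keepCwndWarm(peer: string, rtt: Duration) {.async.} =
  # keep a trickle of traffic going so the kernel never marks the connection
  # idle and never resets cwnd via slow-start-after-idle
  while true:
    await pingPeer(peer)      # hypothetical: sends a small ping frame
    await sleepAsync(rtt)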

Some results of small scale sims (50 nodes, 25 mbps, 100 ms ping):
[chart]

For regular gossipsub, all of this doesn't really have an impact, since it's so inefficient anyway.
Gossipsub with 2 parallel sends max:
[chart]

Here, with the default cubic, the 2 parallel sends are about as bad as regular gossipsub. However, with the ping, we get close to BBR: 33% improvement (vs 41% for BBR). The ping doesn't really have an impact on BBR, however

@Menduist
Contributor

So, recap of the progress:
(50 nodes, 50mbps / node, 90..120ms of latency, 0.5mb messages)
[chart]

4 chunks

Split the message in 4 chunks

2 parallel

Each hop is limited to 2 sends in parallel. To do so, nodes wait for an ACK from the peer they are sending to.
They will also shuffle the order of the peers they send the data to.

This opens an attack where, if two peers never send an ACK, relaying is paused at this hop. Ideally, we would add a timeout, depending on the message size and estimated bandwidth, to avoid this attack

In cases with high latency, high bandwidth, and a single concurrent message, this technique may be detrimental, as we wait for the ACK to come in while doing nothing. Here, the timeout would also help.

Ideally, we would instead use some congestion algorithm to estimate our overall available bandwidth, and the bandwidth of each peer. However, this is rather complex to do, and this naive "ACK" system seems to perform well enough for a first version. Said congestion algorithm could ideally be implemented without changing the wire format, so each libp2p implementation would be free to at least implement the naive "ACK" one, and a better one if they so desire
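
A rough sketch of the relay loop under these rules (hypothetical sendAndWaitForAck helper, which is also where the timeout mentioned above would live; assuming chronos):

import chronos, std/[random, sequtils]

const maxParallelSends = 2

proc relay(msg: seq[byte], meshPeers: seq[string]) {.async.} =
  var peers = meshPeers
  shuffle(peers)                       # don't always serve the same peers first
  var inflight: seq[Future[void]]
  for peer in peers:
    if inflight.len >= maxParallelSends:
      discard await one(inflight)      # wait until at least one send is acked
      inflight.keepItIf(not it.finished)
    inflight.add sendAndWaitForAck(peer, msg)   # hypothetical helper
  await allFutures(inflight)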

Choke (Message)

When receiving a message, we send a "don't send" message to each peer in our mesh. Before sending a message to a peer, we check that we didn't receive such a message from that peer. This nicely reduces the number of duplicates, which, as explained above, helps with tail latency & bandwidth usage, but not so much with mean latency
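
Sketched as bookkeeping (illustrative only; strings stand in for message and peer ids, not the final wire format):

import std/[sets, tables]

var chokedBy: Table[string, HashSet[string]]   # msgId -> peers that said "don't send"

proc onDontSend(msgId, peer: string) =
  chokedBy.mgetOrPut(msgId, initHashSet[string]()).incl peer

proc shouldSend(msgId, peer: string): bool =
  msgId notin chokedBy or peer notin chokedBy[msgId]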

Ping loop

All of the above, except Default, use the "ping at every RTT" trick explained above


Brought together, here are the gains compared to Default:

           Default  4 chunks  4 chunks + 2 parallel  4 chunks + 2 parallel + choke
Max lat      0.00%    26.00%                 50.03%                         61.47%
Mean lat     0.00%    18.92%                 54.00%                         58.69%
Bandwidth    0.00%    12.40%                 20.80%                         52.79%

These optimizations require some gossipsub modifications:

  1. Adding an "ACK" for the 2 parallel send
  2. Adding a "ChokeMessage" / "Don't Send" for the Choke

2 can be done easily in a backward-compatible way: if a peer doesn't support it, it won't send or handle it, but it will work with compatible peers.

For 1, there are a few ways to do it:
1.1 Simply ping the peer after sending the message (with any ping protocol)
Assuming we are using a single underlying connection and a dumb enough multiplexer, the peer will process the ping after receiving the full message. However, yamux for instance, may interleave the ping with the full message.
But this wouldn't require any wire change to gossipsub itself

1.2 Ping the peer via a new gossipsub extension
Would avoid the issue with yamux interleaving. Gossipsub wouldn't rely on a third party protocol to be able to ping, which is also a bonus

1.3 Adding an explicit ACK for each message
Explicitly send "I received [msgid]". Hard to say what the pros/cons are compared to 1.2

If we go for 1.1, this could be fully backward compatible, which is a nice bonus


So, next steps:

  • Validate that the "2 parallel send" isn't too costly in its worst case scenario (high bandwidth & high latency, sybil attacks)
  • Figure out the best option between 1.1, 1.2 and 1.3
  • Open spec PR

@henridf

henridf commented Jun 27, 2023

From my tests it seems to do its job great! Only the first transmission had that 'slow start' effect; all the following messages are delivered at their best speed
Here are results for emulated 25Mbps and 50ms delay (100ms RTT), 512K message, and 15 sec delay between messages:

@Nashatyrev as it happens, in your example you have a message size that is close to BDP (25Mbps * 0.1s => BDP of 312KB). This is the worst-case regime in terms of what you are showing, i.e. in this regime slow start dominates transfer time. If you increase BDP (or reduce message size) then the transfer will be latency-bound and slow-start will be less and less relevant. Conversely, if you increase message size (or reduce BDP), then the transfer will be bandwidth-bound and again slow-start will be less relevant.

It might be worth assessing how often we will be in the "message size ~ BDP" regime before attempting to disable slow-start via protocol or config changes.

@Menduist
Contributor

I don't think increasing the BDP will help us, on the contrary.
If you increase latency, it will take longer (same amount of round trips, more milliseconds) to open the window to its full size
If you increase bandwidth, it will take longer to open the window to its full size (since this "full size" is bigger than before)
And a window not opened to its full size means we are not using our bandwidth to its full potential

Our issue is precisely that we want to send big messages sporadically (ie after the window went idle and reset), hence the idea to ping frequently to avoid the window going idle

We are not going to disable slow start, but the "slow start after idle", which is specified in this RFC: https://datatracker.ietf.org/doc/html/rfc2581#section-4.1

@henridf

henridf commented Jun 27, 2023

I don't think increasing the BDP will help us, on the contrary.

Well, we don't get to choose: BDP is a property of the path. What I'm saying is, if the message size is >> BDP, then transfer time is dominated by bandwidth - i.e. the slow start phase is negligible. On the other hand, if the message is << BDP, then transfer time is dominated by latency, and once again slow start is negligible.

When (as in the example) message size ~~ BDP, then that's where slow start has the highest relative impact on transfer time.

Our issue is precisely that we want to send big messages sporadically

The one measured example given did not have "big messages" (again, in the TCP congestion control realm what matters is message size wrt BDP)

We are not going to disable slow start, but the "slow start after idle"

Slow start after idle is just slow start on an existing connection.

@Nashatyrev

On the other hand, if the message is << BDP, then transfer time is dominated by latency, and once again slow start is negligible.

When a message is small enough (~~ initial cwnd size) then slow start is negligible.
When the message size is >> the initial cwnd size then slow start matters. While a message is being transferred, cwnd is expanding, and the speed of expansion depends on latency.
When cwnd is not collapsed back (due to slow start after idle), the next message is transferred faster through the already 'opened' cwnd.

Below are some samples with unbounded bandwidth, a 1Mb message and various latencies. See the difference between wave 0, when the initial cwnd was opening, and waves 1, 2, 3, 4 after it was opened.

halfPing clientCount wave firstRead firstDelivery avrgDelivery lastDelivery lastWrite lastWritten maxReadDelay    goodput
========|===========|====|=========|=============|============|============|=========|===========|============|==========
      10           1    0        30           161          161          161         0         110           17  50Mbits/s
      10           1    1        12            32           32           32         0           2            8 250Mbits/s
      10           1    2        12            19           19           19         0           2            1 421Mbits/s
      10           1    3        13            18           18           18         0           1            1 444Mbits/s
      10           1    4        11            19           19           19         0           1            1 421Mbits/s

      20           1    0        21           287          287          287         0         186           40  28Mbits/s
      20           1    1        20            47           47           47         0           0            2 170Mbits/s
      20           1    2        20            42           42           42         0           0            1 190Mbits/s
      20           1    3        21            41           41           41         0           1            1 195Mbits/s
      20           1    4        21            41           41           41         0           1            1 195Mbits/s

      50           1    0        51           713          713          713         0         463          100  11Mbits/s
      50           1    1        50           118          118          118         0           0            2  68Mbits/s
      50           1    2        50           108          108          108         0           0            1  74Mbits/s
      50           1    3        50           105          105          105         0           0            1  76Mbits/s
      50           1    4        50           103          103          103         0           0            1  78Mbits/s

     100           1    0       101          1426         1426         1426         0         926          200 5.6Mbits/s
     100           1    1       100           237          237          237         0           0            1  34Mbits/s
     100           1    2       100           217          217          217         0           0            1  37Mbits/s
     100           1    3       100           210          210          210         0           0            1  38Mbits/s
     100           1    4       100           208          208          208         0           0            1  38Mbits/s

@Menduist
Contributor

Menduist commented Jun 27, 2023

I think @henridf's point was that a big enough message will take so long to transmit that the effect of the slow window opening is swallowed by it

Tried to compute this, here are the results I got:

BDP ratio          0.04      0.40        1        2        4        8       16
Speed factor  12500.00%  1250.00%  500.00%  250.00%  150.00%  112.50%  106.25%

(BDP ratio = message size / BDP)
(Speed factor = time to send with small window / time to send with large window)
This is a random example with a random initial window, a different initial window would change the results, but not the trend

So basically, when we start sending messages which are 5mb over a 25mbps / 100ms RTT connection, slow start would have a very small effect
Or 100mb on a 500mbps/100ms connection

@henridf

henridf commented Jun 27, 2023

Thanks for digging in @Menduist !!

Fwiw, in @Nashatyrev simulation (#850 (comment)), the message size/BDP ratio is 1.6, and we have 650ms vs 250ms with/without slow start (so a factor of 2.6).

@Menduist
Contributor

Menduist commented Jun 27, 2023

Fixed a mistake, updated the original table, and here it is with a 16kb initial window, which matches the 2.6 factor at the 1.6 ratio

BDP ratio          0.04      0.40        1        2        4        8       16
Speed factor  10000.00%  1000.00%  400.00%  250.00%  150.00%  112.50%  100.00%

@henridf

henridf commented Jun 27, 2023

I think the entries at 0.04 and 0.4 are incorrect - the speed factor should go down for BDP < 1. The max is at 1.

@henridf

henridf commented Jun 27, 2023

Can you help me understand the "Gain factor"? At BDP ratio 16, I'd expect the transfer times with / without slow start to be quite close to each other. Here, there's a 100% gain factor, which would appear to indicate that slow start makes the transfer twice as slow, which is very counter-intuitive. (TCP would be pretty badly broken if that were the case 😅).

@Menduist
Contributor

Sorry that's speed factor (time to send with initial window / time to send with large window)
So they are basically equal

@Nashatyrev

I think the entries at 0.04 and 0.4 are incorrect - the speed factor should go down for BDP < 1. The max is at 1.

That's what I was arguing about in #850 (comment)

In my experiment BDP = 32Gbps * 0.2s = 800MB (for 100ms half-RTT)
Message size is 1MB
Measured speed factor is ~700%

My point is that for small messages ( << BDP) the latency is the major limiting factor but not the BDP (i.e. not the bandwidth)

Or am I missing something here?

@Menduist could you please share the formula you used for your table calculations?

@Menduist
Contributor

Menduist commented Jun 28, 2023

To recap:

0 < msgSize < initialWindow: limited by latency
initialWindow < msgSize < BDP * 40: limited by opening window
BDP * 40 < msgSize: limited by bandwidth (and by opening window, but the extra cost of the opening window is very small compared to the full transmit time)

could you please share the formula you used for you table calculations?

It was originally in an ugly spreadsheet, moved to a python script:
https://gist.github.com/Menduist/cda951879afda8d0bc5e8b5998a24337

There is probably an off by one hiding in there, but it should be in the right ballpark
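
For completeness, roughly the same kind of estimate as a Nim sketch (my own approximation, not the gist's formula: the window doubles every RTT from initialWindow, and the "already open" case starts at one full BDP, so the numbers won't exactly match the tables above):

proc transferTime(msgSize, bdp, initialWindow, rtt: float): float =
  # number of RTTs needed to push msgSize bytes, given the starting window
  var window = initialWindow
  var sent = 0.0
  while sent < msgSize:
    sent += min(window, bdp)     # at most one BDP worth of data per RTT
    window *= 2                  # slow start: the window doubles each RTT
    result += rtt

proc speedFactor(msgSize, bdp, initialWindow, rtt: float): float =
  # time with slow start vs time with an already-open window
  transferTime(msgSize, bdp, initialWindow, rtt) /
    transferTime(msgSize, bdp, bdp, rtt)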

@diegomrsantos
Contributor

The code below can be used to simulate the results in the first table here #850 (comment)

import std/random
randomize()

var numHolders = 1
var sentTo = 1
let nodes = 10_000

while numHolders < nodes:
  
  if sentTo == 0: break # stop when we can't send msgs to any new node
  
  var duplicates = 0
  var newSentTo = 0
  var percentHolders = 0.0
  let initialPercentHolders = numHolders / nodes
  for _ in 0..<sentTo * 8:
    percentHolders = (numHolders + newSentTo) / nodes
    if rand(1.0) <= percentHolders:
      duplicates += 1
    else:
      newSentTo += 1
  sentTo = newSentTo
  
  echo "numHolders: " & $numHolders
  numHolders += newSentTo
  
  echo "initialPercentHolders: " & $(initialPercentHolders * 100)
  echo "finalPercentHolders: " & $((numHolders / nodes) * 100)
  echo "newSentTo: " & $newSentTo
  echo "duplicates: " & $duplicates
  echo "\n"
  
