gossipsub: Lazy sending #850

arnetheduck · 2023-01-23T21:25:41Z

Consider:

peer A sends a message to us which is sent to validation - peer B sends the same message to us and when the message validates, we skip sending it to peer B
however, we still send the data to peer C and D
sending data to peed C takes time - for example, streaming 1mb to peer C takes 1s on a 10mbit line
peer D completes sending the same message to us

In the above, we can skip sending to peer D as well - to make this work:

track the completion of "sending" to peer C
when the future completes, check if sending to D is still relevant

Problems:

C might be slow which increases latency of sending to D
- send to Dmin peers, then use lazy sending for the rest

The text was updated successfully, but these errors were encountered:

Menduist · 2023-01-24T09:18:02Z

That seems a bit scary to me, might open a new category of attacks ("slow peers" which lowers our scores regarding legit peers since we become slow)

We could however cancel the send if we receive the duplicate while the message was queued

arnetheduck · 2023-01-24T09:23:48Z

well, there are already cases where we run with Dmin peers - ie we consider that "ok" - we can further limit damage by adding a "wait-for-send-timeout" in the low 100's of ms (or stagger sending after Dmin - the savings here are significant though, ie peer A/B case that we already do today gives a noticable reduction in processing/bandwidth

Menduist · 2023-01-24T09:34:19Z

well, there are already cases where we run with Dmin peers - ie we consider that "ok"

My worry isn't that we only send to Dmin peers (though that will also have an impact on the time it takes to reach the full network), but that other peers which are in our mesh will think that we are slow, and thus descore us

arnetheduck · 2023-01-24T10:51:05Z

our mesh will think that we are slow,

there are 2 cases here:

we run out of bandwidth -> sending in parallel slows down everyone because of multuplexing effects -> worst case for all -> lazy sending allows some peers to receive the whole message faster -> better scoring overall
we hit the timeout before we saturate bandwidth -> the >dmin peers receive the message later than they would have already - we have to tune the timeout parameter such that we are withing safe "Slowness" limits with respect to scoring

Menduist · 2023-01-24T11:06:09Z

lazy sending allows some peers to receive the whole message faster -> better scoring overall

Good point. I would still like to do this in two steps, #851 which just cancels outgoing messages, and a second PR that caps the inflight requests to dmin (with a timeout)

I think the cancel of inflight will naturally send less stuff if you are the one being slow, since more stuff will be queued. But we can measure & verify that

Menduist · 2023-02-10T10:37:55Z

Idea from @lchenut: if we expose the "buffered send size" from the muxer, we could start the staggered send to the peers with the smallest buffered amount, which would alleviate some of the slowdown in the "peer is the bottleneck" cases

arnetheduck · 2023-02-27T10:41:21Z

Idea from @lchenut: if we expose the "buffered send size" from the muxer,

this is not a bad idea either - though in effect, it relies on the same underlying mechanism, namely more fine-grained tracking of "sent" all the want up to the point where messages are still individual messages and not flattened to form a stream: the other key idea this relies on is that in gossipsub, we can differentiate between "best-effort" messages that don't have to be sent and those that are critical for the protocol to function.

By and large, Dmin represents the minimum viable spread in the protocol necessary for basic operation, Dmax is sort of "medium" priority (send-after-Dmin) and anything above is really best effort and should be dropped if there are not enough resources available / pushed back in the queue with respect to other messages in general.

One other aspect I thought of: sending messages to non-mesh peers may have an "island-breaking" in case a mesh becomes an isolated island, similar to how IHAVE works as a bridging message.

Menduist · 2023-03-09T17:14:21Z

My first attempts to implement this in simulations turns out tricky, because the OS buffer is big enough that it's really hard to hit the point where send blocks

Example, on some random fleet node, out of 2120 sockets, only 542 sockets have something in the send queue, with on average 4k bytes in it. AFAIK, send will only block once this buffer is full, which, AFAICT is 4.194.304 bytes (on this specific node)

Sure, once the blocks get bigger, the buffer will get a bit more busy, but realistically, send will only block if the peer cannot keep up with the network.

In my early simulations attempt, I used 0.5mb blocks broadcasted with 1mbit bandwidth cap per node (so too slow to be realistic), and for 80% of nodes, every sends were instantaneous
(of course, this simulation only does blocks, and not attestations & co, but I don't think it matters at this scale)

Menduist · 2023-03-10T13:21:41Z

I'll try 3 variants:
1 - Push full message to dLow, send IHAVE to the rest
The RTT it will take for the IWANT to come in will give some natural staggering, and also drastically reduce the amount of duplicates (should only get dLow in the end). Example here, in the process of assessing its perf

2 - Add fixed delay between peers
We can try to take some assumption and hardcode a delay between the first dLow peers and the rest. For instance

const bandwidth = 10_000_000 / 8 # 10 mbps
await sleepAsync(milliseconds(message.len * 1000 div bandwidth))

3 - Reduce the TCP buffer size
I'm not sure about this, because I think the send queue will get flushed once the peer ACKed the data, which effectively adds 0.5 RTT of delay for each buffer_size bytes we send. Also, might impact TCP perf of the whole application, and may be hard to do in a plaftorm-independent way

Menduist · 2023-03-22T14:54:18Z

Some simulation results.
Using:

200 nodes
25 mbps of bandwidth per node
0.5 blocks in a single message
100 ms of jitter per node (so 200ms ping on average)

Default: default gossipsub, no ping & bandwidth limit (would ideally be 0 everywhere, but the CPU bottlenecks at some point)
Latency: default gossipsub, no bandwidth limit, ping applied
Lat & BW: default gossipsub, bandwidth & ping applied
Lat & BW & Lazy: GS with "Push full message to dLow, send IHAVE to the rest" (lazy push)
Lat & BW & FL: GS with Full Lazy push: "Push to 0 peers, send IHAVE to everyone" (except for initial publish)
Lat & BW & LP: GS with Lazy Publish: when publishing, sleep between each peer to give time for message to be sent
Lat & BW & LP & FL: GS with Lazy Publish and Full Lazy push
Lat & BW & LB & FL: GS with Lazy Broadcast (like lazy publish, but applies to relayed messages too) and Full Lazy Push

Sorry for the cryptic names

Basically, Lazy Publish improves the minimum latency a lot (52%). That makes sense since we send to a single node, instead of all in parallel
The issue with Lazy Publish is that it assumes some bandwidth limit, and will be detrimental if the node has more bandwidth than what was assumed.

Combining Lazy Publish & Full Lazy push brings mean & max latencies down by 20%. I would have hoped more reduction, since at this point the number of duplicates is close to 0, I'll try to investigate what is going on.

Menduist · 2023-03-23T09:52:59Z

Just to confirm, I did a quick spreadsheet maths to see how much gain we should be seeing, and that seems more promising.

The time to spread should be linear with the amount of peer sent in parallel to, and going from 8 to 1 should reduce latency by 60%. I'll try to understand why "real" simulations didn't got us there then.

arnetheduck · 2023-03-27T11:14:03Z

The time to spread should be linear with the amount of peer sent in parallel to,

well, this is true for the "first hop" - after that, peers start sending each other stuff redundantly decreasing the utility of each send

Menduist · 2023-03-27T14:20:44Z

well, this is true for the "first hop"

first hops, and the first hops are the most important one

Here is a simplified example that gives the good intuitions, I think. Let's say you have a network with 10k nodes where each "step" of the network forwards the message to D (8) peers:

Number of holders	Percent of holders	Sent to	Duplicates
1	0.01%	8	0
9	0.09%	64	0
73	0.73%	531	16
568	5.68%	3264	983
3832	38.32%	5715	20396
9547	95.47%	448	45271
9995	99.95%	5	3579
10000	10000%	0	40

Number of holders = last number of holders + last sent to
Sent to = last sent to * 8 - Duplicates
Duplicates = for _ 0..<lastSentTo * 8: duplicates += percentHolder; holders += 1 - percentHolder (very pseudocode)

Now let's compare with a network with duplicates == 0:

Number of holders	Percent of holders	Sent to
1	0.01%	8
9	0.09%	64
73	0.73%	512
585	5.85%	4096
4681	46.81%	5319
10000	100%	0

Charting this for clarity:

While this takes a lot of shortcuts, advantaging both scenarios, we can see that they are not as different as one expect
The duplicates case wastes bandwidth, which could be saved, or used for other messages. But the effect on spread time isn't as big as one might expect. That is also confirmed in more realistic scenarios (compare "Lat & BW" VS "Lat & BW & FL", for instance, since FL completely eradicates duplicates)

However, controlling the amount of parallel sends can give big benefits to spread time, as shown in my last message, which is why I'm now focusing on bringing this to life, instead of trying to reduce duplicates

Menduist · 2023-03-28T14:30:25Z

Today I tested this branch: 9b11fa7 which will limit the number of parallel sends based on an ACK. The ACK mechanism is also used to reduce the amount of duplicates (with limited success)
The cost is 1 RTT (since we wait for an ACK)

Simulation parameters:

200 nodes
25 mbps of bandwidth per node
latency 20..80ms per node
Jitter 8ms per node
0.5mb messages

As visible on the graph, I originally got mixed results. Turns out, the TCP congestion algorithm was playing more games with me

I didn't write it out this far, but I often had issues with the TCP congestion control. Basically, TCP is not optimized for bursty connections, especially with high latency. It needs a big stream of data to be able to guesstimate the available bandwidth. When only sending a big message (but not that big) every few seconds, it completely looses its footings.
Even worst, it does not take into account the "overall available bandwidth", instead it just tries to guess the available bandwidth of each connection.

Switching to BBR (default being cubic), which is supposed to be more responsive in its window scaling, and take into account multiple connections for pacing, seems to not only improve the results of "raw gossipsub", but also augment the effect of the optimizations developed here

Unfortunately, this is AFAICT only available on linux, and only with kernel >=4.9, and requires a modprobe

So basically:

Switching to BBR reduces mean latency by 20%
Limiting to 2 parallel sends improves on top of BBR by 25%*
Which brings the total reduction to 32%

* According to my completely theoretical graph, this should be 50%, not 25%, which would bring total reduction to 60%. While I doubt we'll get there, I think it's possible to get closer to it

Other interesting metric, 1 parallel send without BBR = 730ms, with BBR = 500ms
Since the limit is 25mbps and the message size is 0.5mb, the perfect transfer time would be 160 ms + <80ms of ping.
So while BBR brings us closer to that, the bursty nature of the connection is still problematic.

Also, BBR takes a bit of time to stabilize the connection, here is a graph of a single run:

We can clearly see a slowdown in the first 30s. This simulation runs twice as fast as the real network (6 seconds slots), so in reality, it would take up to 1m for a connection to open properly. This means that we must limit churn to some extent

Nashatyrev · 2023-03-31T11:31:32Z

@Menduist your example on hops and duplicates is great!
It inspired me to think about theoretically optimal pubsub algorithm and what results it could yield. This is just to have intuition on how far we are from optimum.

Let's assume for simplicity that any node could send a message to any node
Optimal pubsub has no duplicate messages, thus any node receives a message just once
Once receiving and validating a message a node starts sequentially forwarding a message to nodes which didn't receive the message yet
In the ideal network it should take exactly messageSize / bandwidth time to send a message to one peer
After sending a message it 'flies through the network' for latency time until reaching the target peer

I didn't come up with a formula, thus did the simulation for a variety of parameters. The spreadsheet with results could be found here

Below is a sample slice of the data:

So according to these assumptions for example a message of size 600K have 8.8s theoretically minimal time of dissemination in the network of 10K nodes with 10Mbps bandwidth, 100ms connections latency and 10ms of message validation time

arnetheduck · 2023-04-01T07:16:23Z

Here is a simplified example that gives the good intuitions, I think.

Though this intuition does not account for "sending time"/"message size" and link saturation, which is critical to the problem at hand .. ie it's not really latency we're targeting here but rather bandwidth/duplication reduction.

Menduist · 2023-04-03T13:32:38Z

Though this intuition does not account for "sending time"/"message size" and link saturation

Ok so taking the same table, but adding a "percent of bandwidth wasted" (1 - Sent to / (Sent to + Duplicates)), and a step counter for clarity:

Step	Number of holders	Percent of holders	Sent to	Duplicates	Waste
1	1	0.01%	8	0	0%
2	9	0.09%	64	0	0%
3	73	0.73%	531	16	2%
4	568	5.68%	3264	983	23%
5	3832	38.32%	5715	20396	78%
6	9547	95.47%	448	45271	99%

So what this "waste" does, is prevent us from getting faster "step". If you are sending 8 messages of 0.5mb over 25mbit connection, one step will take 1280ms. If you knew that 4 of them already have the message, you could send to just 4, and that step would take 640 ms instead.

So, doing a table of "time per step":

Step	Percent of holders	Time without dups	Time with dups
1	0.01%	0	0
2	0.09%	1280	1280
3	0.73%	2560	2560
4	5.68%	3814	3840
5	38.32%	4799	5120
6	95.47%	5080	6400

Time with dups will always "send to 8 peers" at each step, so 1280 every time.
Without dups, it will send to only "Sent to", so it will be faster (formula: Last Time without dups + 1280 * (1 - Waste))

So, reaching 95% of the network with dups takes 6400 ms VS 5080 ms without dups, or 20% of improvement.
Reality being a bit more complex than tables and a calculator, the real gain from my simulations is around 13%. Which again, is not nothing, but is probably less than what people expect when thinking about completely removing duplicates

arnetheduck · 2023-04-11T09:51:55Z

So, reaching 95% of the network with dups takes 6400 ms VS 5080 ms without dups, or 20% of improvement.

I'm happy to call this a significant improvement in latency alone - ie gossip in general is hard to beat in terms of latency so clawing back 20% due to increased message size effects is massive.

The bandwidth reduction is also significant because as soon as the block has been disseminated, we need the bandwidth for other things (attestations etc) - if we're still busy sending blocks at that time, the latency will accumulate.

Nashatyrev · 2023-04-23T09:05:43Z

Switching to BBR (default being cubic), which is supposed to be more responsive in its window scaling, and take into account multiple connections for pacing, seems to not only improve the results of "raw gossipsub", but also augment the effect of the optimizations developed here

@Menduist didn't you have a chance to try that option?

sysctl -w net.ipv4.tcp_slow_start_after_idle=0

From my tests it seems to do its job great! Only the first transmission had that 'slow start' effect, all the following messages are delivered on their best

Here are results for emulated 25Mbps and 50ms delay (100ms RRT), 512K message, and 15 sec delay between messages:

net.ipv4.tcp_slow_start_after_idle = 1 :

[08:58:19.019] [client] Write 512Kb (count: 1, total: 512Kb)
[08:58:19.688] [server] Read TOTAL count: 159, size: 512Kb in 540 ms)
[08:58:34.339] [client] Write 512Kb (count: 1, total: 512Kb)
[08:58:34.993] [server] Read TOTAL count: 166, size: 512Kb in 521 ms)
[08:58:49.342] [client] Write 512Kb (count: 1, total: 512Kb)
[08:58:49.996] [server] Read TOTAL count: 167, size: 512Kb in 522 ms)
[08:59:04.344] [client] Write 512Kb (count: 1, total: 512Kb)
[08:59:04.998] [server] Read TOTAL count: 166, size: 512Kb in 521 ms)
[08:59:19.360] [client] Write 512Kb (count: 1, total: 512Kb)
[08:59:20.015] [server] Read TOTAL count: 165, size: 512Kb in 522 ms)

net.ipv4.tcp_slow_start_after_idle = 0 :

[08:55:52.867] [client] Write 512Kb (count: 1, total: 512Kb)
[08:55:53.536] [server] Read TOTAL count: 157, size: 512Kb in 539 ms)
[08:56:08.192] [client] Write 512Kb (count: 1, total: 512Kb)
[08:56:08.447] [server] Read TOTAL count: 180, size: 512Kb in 170 ms)
[08:56:23.195] [client] Write 512Kb (count: 1, total: 512Kb)
[08:56:23.454] [server] Read TOTAL count: 177, size: 512Kb in 171 ms)
[08:56:38.197] [client] Write 512Kb (count: 1, total: 512Kb)
[08:56:38.451] [server] Read TOTAL count: 168, size: 512Kb in 172 ms)
[08:56:53.214] [client] Write 512Kb (count: 1, total: 512Kb)
[08:56:53.468] [server] Read TOTAL count: 170, size: 512Kb in 171 ms)

Menduist · 2023-04-24T16:37:50Z

Indeed, the main reason cubic was so bad was because of the slow start after idle. However, instead of changing this sysctl parameter, which can't be tuned per socket, we can just ping every RTT to avoid getting the congestion control in idle state.

Some results of small scale sims (50 nodes, 25 mbps, 100 ms ping):

For regular gossipsub, all of this doesn't really have an impact, since it's so inefficient anyway.
Gossipsub with 2 parallel sends max:

Here, with the default cubic, the 2 parallel sends are about as bad as the regular gossipsub. However, with ping, we get close to BBR: 33% improvement (vs 41% for BBR). The ping doesn't really has an impact on BBR, however

Menduist · 2023-04-26T10:11:19Z

So, recap of the progress:
(50 nodes, 50mbps / node, 90..120ms of latency, 0.5mb messages)

4 chunks

Split the message in 4 chunks

2 parallel

Each hop with limit to 2 sends in parallel. To do so, they wait for an ACK from the peer they are sending to.
They will also shuffle the ordering of the peer to send the data to.

This opens an attack where if two peer never send an ACK, the relaying will be paused at this hop. Ideally, we would integrate a timeout, depending on the message size and estimated bandwidth to avoid this attack

In case where there is a high latency, high bandwidth, and a single concurrent message, this technique may be detrimental, as we wait for the ACK to come while doing nothing. Here, the timeout would also help.

Ideally, we would instead use some congestion algorithm to estimate our overall available bandwidth, and the bandwidth of each peer. However, this is rather complex to do, and this naive "ACK" system seem to perform well enough for a first version. Said congestion algorithm could ideally implemented without changing the wire format, so each libp2p implementation would be free to at least implement the "ACK naive" one, but implement a better one if they so desire

Choke (Message)

When receiving a message, we send a "don't send" message to each peer in our mesh. Before sending a message to a peer, we check that we didn't receive such a message from him. This reduces nicely the amount of duplicates, which as explained above, helps with tail latency & bandwidth usage, but not so much mean latency

Ping loop

All of the above, except Default, use the "ping at every RTT" trick explained above

Bringed together, here are the gains compared to Default:

	Default	4 chunks	4 chunks + 2 parallel	4 chunks + 2 parallel + choke
Max lat	0.00%	26.00%	50.03%	61.47%
Mean lat	0.00%	18.92%	54.00%	58.69%
Bandwidth	0.00%	12.40%	20.80%	52.79%

These optimizations require some gossipsub modifications:

Adding an "ACK" for the 2 parallel send
Adding a "ChokeMessage" / "Don't Send" for the Choke

2 can be done easily in a backward compatible way, if the peer doesn't support it, it won't send it / handle it, but it will work with compatible peers.

For 1, there are a few ways to do it:
1.1 Simply ping the peer after sending the message (with any ping protocol)
Assuming we are using a single underlying connection and a dumb enough multiplexer, the peer will process the ping after receiving the full message. However, yamux for instance, may interleave the ping with the full message.
But this wouldn't require any wire change to gossipsub itself

1.2 Ping the peer via a new gossipsub extension
Would avoid the issue with yamux interleaving. Gossipsub wouldn't rely on a third party protocol to be able to ping, which is also a bonus

1.3 Adding an explicit ACK for each message
Explicitly send "I received [msgid]". Hard to say what the pros/cons are compared to 1.2

If we go for 1.1, this could be fully backward compatible, which is a nice bonus

So, next steps:

Validate that the "2 parallel send" isn't too costly in its worst case scenario (high bandwidth & high latency, sybil attacks)
Figure out the best way between 1.1 1.2 1.3
Open spec PR

henridf · 2023-06-27T13:01:17Z

From my tests it seems to do its job great! Only the first transmission had that 'slow start' effect, all the following messages are delivered on their best
Here are results for emulated 25Mbps and 50ms delay (100ms RRT), 512K message, and 15 sec delay between messages:

@Nashatyrev as it happens, in your example you have a message size that is close to BDP (25Mbps * 0.1s => BDP of 312KB). This is the worst-case regime in terms of what you are showing, e.g. in this regime slow start dominates transfer time. If you increase BDP (or reduce message size) then the transfer will be latency-bound and slow-start will be less and less relevant. Conversely, if you increase message size (or reduce BDP), then the transfer will be bandwidth bound and again slow-start will be less relevant.

It might be worth assessing how often we will be in the "message size ~ BDP" regime before attempting to disable slow-start via protocol or config changes.

Menduist · 2023-06-27T13:36:17Z

I don't think increasing the BDP will help us, on the contrary.
If you increase latency, it will take longer (same amount of round trips, more milliseconds) to open the window to its full size
If you increase bandwidth, it will take longer to open the window to its full size (since this "full size" is bigger than before)
And a window not opened to its full size means we are not using our bandwidth to its full potential

Our issue is precisely that we want to send big messages sporadically (ie after the window went idle and reset), hence the idea to ping frequently to avoid the window going idle

We are not going to disable slow start, but the "slow start after idle", which is specified in this RFC: https://datatracker.ietf.org/doc/html/rfc2581#section-4.1

henridf · 2023-06-27T13:58:53Z

I don't think increasing the BDP will help us, on the contrary.

Well, we don't get to choose, BDP is a property of the path. What I'm saying is, if the message size is >> BDP, then transfer time is dominated by bandwidth - e.g the slow start phase is negligible. On the other end if message is << BDP, then transfer time is dominated by latency, and once again slow start is negligible.

When (as in the example) message size ~~ BDP, then that's where slow start has the highest relative impact on transfer time.

Our issue is precisely that we want to send big messages sporadically

The one measured example given did not have "big messages" (again, in the TCP congestion control realm what matters is message size wrt BDP)

We are not going to disable slow start, but the "slow start after idle"

Slow start after idle is just slow start on an existing connection.

Nashatyrev · 2023-06-27T14:53:06Z

On the other end if message is << BDP, then transfer time is dominated by latency, and once again slow start is negligible.

When a message is small enough (~~ initial cwnd size) then slow start is negligible.
When a message size >> initial cwnd size then slow start matters. In the process of transferring a message cwnd is expanding and the speed of expansion depends on latency.
When cwnd is not collapsed back (due to slow start after idle) the next message would be transferred faster with 'opened' cnwd.

Below there are some samples with unbounded bandwidth, 1Mb message and various latencies. See the difference between wave 0 when initial cwnd was opening and wave 1, 2, 3, 4 after it was opened.

halfPing clientCount wave firstRead firstDelivery avrgDelivery lastDelivery lastWrite lastWritten maxReadDelay    goodput
========|===========|====|=========|=============|============|============|=========|===========|============|==========
      10           1    0        30           161          161          161         0         110           17  50Mbits/s
      10           1    1        12            32           32           32         0           2            8 250Mbits/s
      10           1    2        12            19           19           19         0           2            1 421Mbits/s
      10           1    3        13            18           18           18         0           1            1 444Mbits/s
      10           1    4        11            19           19           19         0           1            1 421Mbits/s

      20           1    0        21           287          287          287         0         186           40  28Mbits/s
      20           1    1        20            47           47           47         0           0            2 170Mbits/s
      20           1    2        20            42           42           42         0           0            1 190Mbits/s
      20           1    3        21            41           41           41         0           1            1 195Mbits/s
      20           1    4        21            41           41           41         0           1            1 195Mbits/s

      50           1    0        51           713          713          713         0         463          100  11Mbits/s
      50           1    1        50           118          118          118         0           0            2  68Mbits/s
      50           1    2        50           108          108          108         0           0            1  74Mbits/s
      50           1    3        50           105          105          105         0           0            1  76Mbits/s
      50           1    4        50           103          103          103         0           0            1  78Mbits/s

     100           1    0       101          1426         1426         1426         0         926          200 5.6Mbits/s
     100           1    1       100           237          237          237         0           0            1  34Mbits/s
     100           1    2       100           217          217          217         0           0            1  37Mbits/s
     100           1    3       100           210          210          210         0           0            1  38Mbits/s
     100           1    4       100           208          208          208         0           0            1  38Mbits/s

Menduist · 2023-06-27T15:53:40Z

I think @henridf point was that a big enough message will take so long to be transmitted that the effect of the slow open is swallowed by this

Tried to compute this, here is the results I got:

BDP ratio	0.04	0.40	1	2	4	8	16
Speed factor	12500.00%	1250.00%	500.00%	250.00%	150.00%	112.50%	106.25%

(BDP ratio = message size / BDP)
(Speed factor = time to send with small window / time to send with large window)
This is a random example with a random initial window, a different initial window would change the results, but not the trend

So basically, when we start sending messages which are 5mb over a 25mbps / 100ms RTT connection, slow start would have a very small effect
Or 100mb on a 500mbps/100ms connection

henridf · 2023-06-27T15:58:24Z

Thanks for digging in @Menduist !!

Fwiw, in @Nashatyrev simulation (#850 (comment)), the message size/BDP ratio is 1.6, and we have 650ms vs 250ms with/without slow start (so a factor of 2.6).

Menduist · 2023-06-27T16:28:20Z

Fixed a mistake, updated the original table, and here it is with a 16kb initial window, which matchs up the 2.6 factor on 1.6 ratio

BDP ratio	0.04	0.40	1	2	4	8	16
Speed factor	10000.00%	1000.00%	400.00%	250.00%	150.00%	112.50%	100.00%

henridf · 2023-06-27T16:37:54Z

I think the entries at 0.04 and 0.4 are incorrect - the speed factor should go down for BDP < 1. The max is at 1.

henridf · 2023-06-27T16:40:43Z

Can you help me understand the "Gain factor"? At BDP ratio 16, I'd expect the transfer times with / without slow start to be quite close to each other. Here, there's a 100% gain factor, which would appear to indicate that slow start makes the transfer twice as slow, which is very counter-intuitive. (TCP would be pretty badly broken if that were the case 😅).

Menduist · 2023-06-27T17:42:36Z

Sorry that's speed factor (time to send with initial window / time to send with large window)
So they are basically equal

Nashatyrev · 2023-06-28T07:28:20Z

I think the entries at 0.04 and 0.4 are incorrect - the speed factor should go down for BDP < 1. The max is at 1.

That's what I was arguing about in #850 (comment)

In my experiment BDP = 32Gbps * 0.2s = 800MB (for 100ms half-RRT)
Message size is 1MB
Measured speed factor is ~700%

My point is that for small messages ( << BDP) the latency is the major limiting factor but not the BDP (i.e. not the bandwidth)

Or am I missing something here?

@Menduist could you please share the formula you used for you table calculations?

Menduist · 2023-06-28T09:52:23Z

To recap:

0 < msgSize < initialWindow: limited by latency
initialWindow < msgSize < BDP * 40: limited by opening window
BDP * 40 < msgSize: limited by bandwidth (and by opening window, but the extra cost of the opening window is very small compared to the full transmit time)

could you please share the formula you used for you table calculations?

It was originally in an ugly spreadsheet, moved to a python script:
https://gist.github.com/Menduist/cda951879afda8d0bc5e8b5998a24337

There is probably an off by one hiding in there, but it should be in the right ballpark

diegomrsantos · 2023-12-20T18:17:36Z

The code below can be used to simulate the results in the first table here #850 (comment)

import std/random
randomize()

var numHolders = 1
var sentTo = 1
let nodes = 10_000

while numHolders < nodes:
  
  if sentTo == 0: break # stop when we can't send msgs to any new node
  
  var duplicates = 0
  var newSentTo = 0
  var percentHolders = 0.0
  let initialPercentHolders = numHolders / nodes
  for _ in 0..<sentTo * 8:
    percentHolders = (numHolders + newSentTo) / nodes
    if rand(1.0) <= percentHolders:
      duplicates += 1
    else:
      newSentTo += 1
  sentTo = newSentTo
  
  echo "numHolders: " & $numHolders
  numHolders += newSentTo
  
  echo "initialPercentHolders: " & $(initialPercentHolders * 100)
  echo "finalPercentHolders: " & $((numHolders / nodes) * 100)
  echo "newSentTo: " & $newSentTo
  echo "duplicates: " & $duplicates
  echo "\n"

Menduist mentioned this issue Jan 24, 2023

GossipSub: cancel inflight msgs when receiving duplicate #851

Closed

Menduist mentioned this issue Feb 10, 2023

GossipSub: improve flood publish #854

Open

arnetheduck added the gossipsub label Feb 22, 2023

arnetheduck added the performance performance and resource ussage label Mar 8, 2023

arnetheduck mentioned this issue Jun 1, 2023

[GossipSub 1.2] IDONTWANT control message libp2p/specs#548

Merged

This was referenced Jun 9, 2023

GossipSub: Limit flood publishing #911

Merged

Add GossipSub ping #912

Merged

kaiserd mentioned this issue Jun 26, 2023

Roadmap(DR): Gossipsub Scaling vacp2p/research#192

Closed

9 tasks

This was referenced Jul 25, 2023

GossipSub: IDontWant #934

Merged

GossipSub: stagger sending using TCP internals #937

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gossipsub: Lazy sending #850

gossipsub: Lazy sending #850

arnetheduck commented Jan 23, 2023

Menduist commented Jan 24, 2023

arnetheduck commented Jan 24, 2023

Menduist commented Jan 24, 2023

arnetheduck commented Jan 24, 2023

Menduist commented Jan 24, 2023

Menduist commented Feb 10, 2023

arnetheduck commented Feb 27, 2023

Menduist commented Mar 9, 2023 •

edited

Loading

Menduist commented Mar 10, 2023

Menduist commented Mar 22, 2023

Menduist commented Mar 23, 2023

arnetheduck commented Mar 27, 2023

Menduist commented Mar 27, 2023 •

edited

Loading

Menduist commented Mar 28, 2023 •

edited

Loading

Nashatyrev commented Mar 31, 2023

arnetheduck commented Apr 1, 2023

Menduist commented Apr 3, 2023

arnetheduck commented Apr 11, 2023

Nashatyrev commented Apr 23, 2023

Menduist commented Apr 24, 2023 •

edited

Loading

Menduist commented Apr 26, 2023

henridf commented Jun 27, 2023

Menduist commented Jun 27, 2023

henridf commented Jun 27, 2023 •

edited

Loading

Nashatyrev commented Jun 27, 2023

Menduist commented Jun 27, 2023 •

edited

Loading

henridf commented Jun 27, 2023

Menduist commented Jun 27, 2023 •

edited

Loading

henridf commented Jun 27, 2023 •

edited

Loading

henridf commented Jun 27, 2023

Menduist commented Jun 27, 2023

Nashatyrev commented Jun 28, 2023

Menduist commented Jun 28, 2023 •

edited

Loading

diegomrsantos commented Dec 20, 2023

gossipsub: Lazy sending #850

gossipsub: Lazy sending #850

Comments

arnetheduck commented Jan 23, 2023

Menduist commented Jan 24, 2023

arnetheduck commented Jan 24, 2023

Menduist commented Jan 24, 2023

arnetheduck commented Jan 24, 2023

Menduist commented Jan 24, 2023

Menduist commented Feb 10, 2023

arnetheduck commented Feb 27, 2023

Menduist commented Mar 9, 2023 • edited Loading

Menduist commented Mar 10, 2023

Menduist commented Mar 22, 2023

Menduist commented Mar 23, 2023

arnetheduck commented Mar 27, 2023

Menduist commented Mar 27, 2023 • edited Loading

Menduist commented Mar 28, 2023 • edited Loading

Nashatyrev commented Mar 31, 2023

arnetheduck commented Apr 1, 2023

Menduist commented Apr 3, 2023

arnetheduck commented Apr 11, 2023

Nashatyrev commented Apr 23, 2023

Menduist commented Apr 24, 2023 • edited Loading

Menduist commented Apr 26, 2023

4 chunks

2 parallel

Choke (Message)

Ping loop

henridf commented Jun 27, 2023

Menduist commented Jun 27, 2023

henridf commented Jun 27, 2023 • edited Loading

Nashatyrev commented Jun 27, 2023

Menduist commented Jun 27, 2023 • edited Loading

henridf commented Jun 27, 2023

Menduist commented Jun 27, 2023 • edited Loading

henridf commented Jun 27, 2023 • edited Loading

henridf commented Jun 27, 2023

Menduist commented Jun 27, 2023

Nashatyrev commented Jun 28, 2023

Menduist commented Jun 28, 2023 • edited Loading

diegomrsantos commented Dec 20, 2023

Menduist commented Mar 9, 2023 •

edited

Loading

Menduist commented Mar 27, 2023 •

edited

Loading

Menduist commented Mar 28, 2023 •

edited

Loading

Menduist commented Apr 24, 2023 •

edited

Loading

henridf commented Jun 27, 2023 •

edited

Loading

Menduist commented Jun 27, 2023 •

edited

Loading

Menduist commented Jun 27, 2023 •

edited

Loading

henridf commented Jun 27, 2023 •

edited

Loading

Menduist commented Jun 28, 2023 •

edited

Loading