Add alternative - unfair but lower latency - send stream scheduling strategy #2002
Conversation
Force-pushed from b3af7a4 to d09bc62.
This change seems well motivated, thanks! I wonder if we could have the best of both worlds, and avoid yet another esoteric configuration knob, with a more heuristic approach. It sounds like your main problem is that, when sending many individually sub-MTU streams, streams that get split across packets are subject to a much higher maximum latency than you might otherwise see. What if we special-cased those? E.g. don't advance the round-robin state if:
Either should let us retain fairness for larger-than-MTU streams, while reducing the maximum latency for sub-MTU streams, without requiring users to understand any of these concerns.
For the general case, this sounds like a good idea, and I'm happy to implement it if you want. For our case specifically though, we have a proposal to 3x the max transaction size. We haven't really been able to evaluate its feasibility so far because latency was already pretty bad, but once we deploy this change, I think we could reasonably enable 3x, therefore spanning multiple MTUs. In that case, I'd still want all datagrams of a tx to follow each other sequentially.
Yeah, this is a valid point of course. How about I implement the heuristic to disable RR for sub-MTU streams, and then keep an option to always disable RR, but instead of exposing it as a config flag, make it an opt-in feature flag?
I can also always keep this out of tree, of course! I'd rather avoid having to maintain a vendor fork just for this, though.
Heuristics sound good to me, but given the limited complexity of the additional configuration, I feel like we could also accept the configuration upstream if it's well-motivated even in the presence of the heuristics. On the other hand, maybe we should avoid adding heuristics if they don't address the only actual use case we've seen?
This is compelling. Heuristics are a greater maintenance burden, and if they're not actually addressing the motivating case, why pay that cost? I'm happy to move ahead with a global config flag, as originally proposed. I do suspect there's a more flexible middle ground here somewhere (maybe the setting should be per-stream?) but we don't necessarily have to work that out here and now when there's a clear win on the table already.
Force-pushed from 55d6cff to af41f52.
Add methods to PendingStreams to avoid accessing PendingStreams::streams directly.
This adds TransportConfig::send_fairness(bool). When set to false, streams are still scheduled based on priority, but once a chunk of a stream has been written out, we'll try to complete the stream instead of trying to round-robin balance it among the streams with the same priority. This reduces fragmentation, protocol overhead and stream receive latency when sending many small streams. It also sends same-priority streams in the order they are opened. This - assuming little to no network packet reordering - allows receivers to advertise a large stream window but keep a smaller, sliding receive window.
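For reference, a minimal sketch of how this option can be applied. The helper function and its name are illustrative, not part of the PR; only `TransportConfig::send_fairness` comes from this change:

```rust
use std::sync::Arc;

use quinn::{ClientConfig, TransportConfig};

// Illustrative helper (not part of this PR): opt a client out of fair
// send scheduling. Streams are still ordered by priority, but once a
// stream starts being written it is driven to completion before the
// next same-priority stream is serviced.
fn with_unfair_scheduling(mut config: ClientConfig) -> ClientConfig {
    let mut transport = TransportConfig::default();
    transport.send_fairness(false);
    config.transport_config(Arc::new(transport));
    config
}
```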
Force-pushed from af41f52 to adc4fa1.
Thanks!
Beautiful, thanks!
@lijunwangs this change may be of interest to you as well -- I recall you had similar concerns.
Downstream, anza-xyz#3283 builds on this change:

Before quinn-rs/quinn#2002 we could get streams fragmented and out of order (stream concurrency). Now streams always arrive in order, so there's no reason anymore to spawn multiple tasks to read them.

Before, we could have:

[s1][s2][s3][s2 fin][s3 fin][s1 fin]

so spawning multiple tasks led to overall faster ingestion, since completing s1 didn't have to wait for all the other streams to arrive. Now we always have:

[s1 fin][s2 fin][s3 fin]

so there's no reason to spawn a task per stream: each task would be created, read all its stream's chunks, and exit before the next stream arrives. This change removes the per-stream task and instead uses the connection task to read all the streams, removing the CPU cost of creating tasks and the corresponding memory allocations.
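A rough sketch of the resulting receive path (not the actual anza-xyz code; the size limit and handler are placeholders):

```rust
// Sketch: with in-order stream completion there is no win from a task
// per stream, so a single loop on the connection task drains each
// stream to its fin before the next stream's data arrives.
async fn ingest(conn: quinn::Connection) -> Result<(), quinn::ConnectionError> {
    loop {
        let mut stream = conn.accept_uni().await?;
        // Placeholder size limit; bound this by the real max transaction size.
        match stream.read_to_end(64 * 1024).await {
            Ok(tx) => handle_transaction(tx), // hypothetical handler
            Err(e) => eprintln!("stream read failed: {e}"),
        }
    }
}

// Hypothetical stand-in for the application's ingestion logic.
fn handle_transaction(_bytes: Vec<u8>) {}
```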
Solana leader slots are 400ms. When sending and ingesting transactions, minimizing latency is therefore key. Quinn currently tries to implement fairness when writing out send streams, which is a good default, but not great for our use case.
We want to pack as many transactions into a datagram as possible, without any fragmentation except possibly at the very end, when a transaction doesn't fit in the remaining space. In that case, we want the rest of the transaction (through its fin) to come immediately after, to minimize latency. The current round-robin algorithm doesn't allow this, and in fact leads to very high latency if the stream receive window is large.
This PR tries to address this problem. It introduces a TransportConfig::send_fairness(bool) config. When set to false, streams are still scheduled based on priority, but once a chunk of a stream has been written out, we'll try to complete the stream instead of trying to round-robin balance it among the streams with the same priority. This gets rid of fragmentation, and effectively allows API clients to precisely control the order in which streams are written out.
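As a sketch of what that control looks like from the sending side (hypothetical client code, assuming a recent quinn where SendStream::finish is synchronous, and using anyhow for brevity):

```rust
// Hypothetical sender sketch: with send_fairness(false), same-priority
// streams are written out in the order they were opened, so each
// transaction's bytes and its fin go on the wire back to back.
async fn send_transactions(
    conn: &quinn::Connection,
    txs: Vec<Vec<u8>>,
) -> anyhow::Result<()> {
    for tx in txs {
        let mut stream = conn.open_uni().await?;
        stream.write_all(&tx).await?;
        stream.finish()?; // queue the fin immediately behind the data
    }
    Ok(())
}
```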
Here's a server log without the patch:

Take a look at stream 7462. It starts at 2024-10-07T13:51:43.960 (pn=1257), and it's completed at 2024-10-07T13:51:44.352 (pn=7175) together with a bunch of other segmented transactions (note this is on localhost, so over the internet it would be even worse).

Here's a log with the PR instead:
As you can see, the fin=false transactions get completed immediately in the next packet.