Add alternative - unfair but lower latency - send stream scheduling strategy #2002

Merged (2 commits) on Oct 12, 2024

Conversation

alessandrod (Contributor)

Solana leader slots are 400ms. When sending and ingesting transactions, minimizing latency is therefore key. Quinn currently tries to implement fairness when writing out send streams, which is a good default, but not great for our use case.

We want to pack as many transactions into a datagram as possible, with no fragmentation except at the very end, when a transaction doesn't fit in the remaining space. In that case, we want the rest of the transaction (the fin) to arrive immediately after, to minimize latency. The current round-robin algorithm doesn't allow this, and in fact leads to very high latency when the stream receive window is large.

This PR tries to address this problem. It introduces a TransportConfig::send_fairness(bool) config. When set to false, streams are still scheduled based on priority, but once a chunk of a stream has been written out, we try to complete that stream instead of round-robin balancing it among the streams with the same priority.

This gets rid of fragmentation, and effectively allows API clients to precisely control the order in which streams are written out.
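
To make the new knob concrete, here's a minimal sketch of opting out of fairness via quinn_proto's TransportConfig. Only send_fairness(false) comes from this PR; the concurrency limit is an illustrative extra, not part of the change.

```rust
use std::sync::Arc;

use quinn_proto::{TransportConfig, VarInt};

/// Build a transport config with send-stream fairness disabled.
fn low_latency_transport() -> Arc<TransportConfig> {
    let mut transport = TransportConfig::default();
    // Streams are still scheduled by priority, but once a chunk of a
    // stream is written out, the stream is drained until it finishes
    // (or blocks) instead of round-robining to its same-priority peers.
    transport.send_fairness(false);
    // Illustrative: allow many small unidirectional streams in flight.
    transport.max_concurrent_uni_streams(VarInt::from_u32(1024));
    Arc::new(transport)
}
```

The resulting Arc<TransportConfig> is then attached to a quinn ClientConfig or ServerConfig via transport_config().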

Here's a server log without the patch:

[2024-10-07T13:51:43.960817939Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7460 offset=0 len=255 fin=true
[2024-10-07T13:51:43.960824640Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:43.960829630Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7461 offset=0 len=255 fin=true
[2024-10-07T13:51:43.960836460Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:43.960841420Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7462 offset=0 len=110 fin=false
[2024-10-07T13:51:43.960849560Z TRACE quinn_proto::connection] got Data packet (1452 bytes) from 127.0.0.1:8014 using id f7b061311008438f
[2024-10-07T13:51:43.960859210Z TRACE quinn_proto::connection] recv; space=Data pn=1258
[2024-10-07T13:51:43.960865050Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:43.960869920Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7463 offset=0 len=255 fin=true

[snip]

[2024-10-07T13:51:44.352021247Z TRACE quinn_proto::connection] recv; space=Data pn=7175
[2024-10-07T13:51:44.352024037Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:44.352026447Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7420 offset=110 len=145 fin=true
[2024-10-07T13:51:44.352029877Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:44.352032267Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7426 offset=110 len=145 fin=true
[2024-10-07T13:51:44.352036647Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:44.352039017Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7432 offset=110 len=145 fin=true
[2024-10-07T13:51:44.352042398Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:44.352044778Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7438 offset=110 len=145 fin=true
[2024-10-07T13:51:44.352048378Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:44.352050758Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7444 offset=110 len=145 fin=true
[2024-10-07T13:51:44.352055058Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:44.352057518Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7450 offset=110 len=145 fin=true
[2024-10-07T13:51:44.352061078Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:44.352063438Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7456 offset=110 len=145 fin=true
[2024-10-07T13:51:44.352067018Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:44.352069528Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7462 offset=110 len=145 fin=true
[2024-10-07T13:51:44.352073708Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T13:51:44.352076208Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 7468 offset=110 len=145 fin=true
[2024-10-07T13:51:44.352079558Z TRACE quinn_proto::connection] frame; ty=STREAM

Take a look at stream 7462. It starts at 2024-10-07T13:51:43.960 (pn=1257) and completes at 2024-10-07T13:51:44.352 (pn=7175), together with a bunch of other segmented transactions (note this is on localhost, so over the internet it would be even worse).

Here's a log with the PR instead:

[2024-10-07T14:06:53.171353122Z TRACE quinn_proto::connection] recv; space=Data pn=1894
[2024-10-07T14:06:53.171356272Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171358852Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10179 offset=0 len=255 fin=true
[2024-10-07T14:06:53.171362262Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171365652Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10180 offset=0 len=255 fin=true
[2024-10-07T14:06:53.171369062Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171371622Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10181 offset=0 len=255 fin=true
[2024-10-07T14:06:53.171375192Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171378072Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10182 offset=0 len=255 fin=true
[2024-10-07T14:06:53.171382332Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171384892Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10183 offset=0 len=255 fin=true
[2024-10-07T14:06:53.171390133Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171392773Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10184 offset=0 len=110 fin=false
[2024-10-07T14:06:53.171397623Z TRACE quinn_proto::connection] got Data packet (1452 bytes) from 127.0.0.1:8004 using id 01dfdefc072f6a67
[2024-10-07T14:06:53.171401733Z TRACE quinn_proto::connection] recv; space=Data pn=1895
[2024-10-07T14:06:53.171404703Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171407263Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10184 offset=110 len=145 fin=true
[2024-10-07T14:06:53.171414553Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171417323Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10185 offset=0 len=255 fin=true
[2024-10-07T14:06:53.171420713Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171423283Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10186 offset=0 len=255 fin=true
[2024-10-07T14:06:53.171428433Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171432083Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10187 offset=0 len=255 fin=true
[2024-10-07T14:06:53.171435713Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171438304Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10188 offset=0 len=255 fin=true
[2024-10-07T14:06:53.171441674Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171445274Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10189 offset=0 len=218 fin=false
[2024-10-07T14:06:53.171449364Z TRACE quinn_proto::connection] got Data packet (1452 bytes) from 127.0.0.1:8004 using id 01dfdefc072f6a67
[2024-10-07T14:06:53.171453354Z TRACE quinn_proto::connection] recv; space=Data pn=1896
[2024-10-07T14:06:53.171456324Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171460384Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10189 offset=218 len=37 fin=true
[2024-10-07T14:06:53.171464434Z TRACE quinn_proto::connection] frame; ty=STREAM
[2024-10-07T14:06:53.171467174Z TRACE quinn_proto::connection] got stream frame id=client unidirectional stream 10190 offset=0 len=255 fin=true

As you can see, the fin=false transactions are completed immediately in the next packet.

@alessandrod force-pushed the send-unfair branch 2 times, most recently from b3af7a4 to d09bc62, on October 7, 2024 14:18
veeso added a commit to veeso/solana-quinn that referenced this pull request Oct 8, 2024
quinn-proto/src/connection/streams/mod.rs (review thread, outdated, resolved)
quinn-proto/src/connection/streams/state.rs (review thread, outdated, resolved)
@Ralith (Collaborator) commented Oct 9, 2024

This change seems well motivated, thanks!

I wonder if we could have the best of both worlds, and avoid yet another esoteric configuration knob, with a more heuristic approach. It sounds like your main problem is that, when sending many individually sub-MTU streams, streams that get split across packets are subject to a much higher maximum latency than you might otherwise see. What if we special-cased those? E.g. don't advance the round-robin state if:

  • a fragmented stream has less than an MTU of data remaining
  • alternatively, a fragmented stream was not the first stream in a packet

Either should let us retain fairness for larger-than-MTU streams, while reducing the maximum latency for sub-MTU streams, without requiring users to understand any of these concerns.
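
For illustration only, here is a rough sketch of the first heuristic described above (keep draining a fragmented stream that has less than an MTU left). The types and the scheduler hook are hypothetical, not Quinn internals, and this is not what the PR ends up implementing.

```rust
/// Hypothetical bookkeeping for a stream with pending send data.
struct PendingStream {
    bytes_remaining: usize,
    /// A previous packet already carried part of this stream.
    fragmented: bool,
}

/// Decide whether the scheduler should advance its round-robin cursor
/// to the next same-priority stream, or keep writing the current one
/// so that its fin lands in the very next packet.
fn advance_round_robin(stream: &PendingStream, mtu: usize) -> bool {
    !(stream.fragmented && stream.bytes_remaining < mtu)
}
```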

@alessandrod (Contributor, Author)

> This change seems well motivated, thanks!
>
> I wonder if we could have the best of both worlds, and avoid yet another esoteric configuration knob, with a more heuristic approach. It sounds like your main problem is that, when sending many individually sub-MTU streams, streams that get split across packets are subject to a much higher maximum latency than you might otherwise see. What if we special-cased those? E.g. don't advance the round-robin state if:
>
> * a fragmented stream has less than an MTU of data remaining
> * alternatively, a fragmented stream was not the first stream in a packet
>
> Either should let us retain fairness for larger-than-MTU streams, while reducing the maximum latency for sub-MTU streams, without requiring users to understand any of these concerns.

For the general case this sounds like a good idea; I'm happy to implement it if you want.

For our case specifically though, we have a proposal to 3x the max transaction size. We haven't really been able to evaluate its feasibility so far because latency was already pretty bad, but once we deploy this change I think we could reasonably enable 3x, with transactions then spanning multiple MTUs. In that case, I'd still want all datagrams of a tx to follow each other sequentially.

> without requiring users to understand any of these concerns

Yeah, this is a valid point ofc. How about I implement the heuristic to disable RR for streams smaller than an MTU, and then keep an option to always disable RR, but instead of exposing it as a config flag, make it an opt-in feature flag?

@alessandrod (Contributor, Author)

> keep an option to always disable RR but instead of exposing it as a flag, I make it an opt-in feature flag?

I can also always keep this out of tree ofc! I'd rather avoid having to maintain a vendor fork just for this though.

@djc (Member) commented Oct 9, 2024

Heuristics sound good to me, but given the limited complexity of the additional configuration I feel like we could also accept the configuration upstream if it's well-motivated even in the presence of the heuristics.

On the other hand, maybe we should avoid adding heuristics if they don't address the only actual use case we've seen?

@Ralith (Collaborator) commented Oct 11, 2024

> maybe we should avoid adding heuristics if they don't address the only actual use case we've seen?

This is compelling. Heuristics are a greater maintenance burden, and if they're not actually addressing the motivating case, why pay that cost?

I'm happy to move ahead with a global config flag, as originally proposed. I do suspect there's a more flexible middle ground here somewhere (maybe the setting should be per-stream?) but we don't necessarily have to work that out here and now when there's a clear win on the table already.

@alessandrod force-pushed the send-unfair branch 2 times, most recently from 55d6cff to af41f52, on October 11, 2024 08:30
Add methods to PendingStreams to avoid accessing PendingStreams::streams
directly.

This adds TransportConfig::send_fairness(bool). When set to false,
streams are still scheduled based on priority, but once a chunk of a
stream has been written out, we'll try to complete the stream instead of
trying to round-robin balance it among the streams with the same
priority.

This reduces fragmentation, protocol overhead and stream receive latency
when sending many small streams. It also sends same-priority streams in
the order they are opened. This - assuming little to no network packet
reordering - allows receivers to advertise a large stream window but
keep a smaller, sliding receive window.
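
As a rough sketch of what controlling the write order looks like on the client side, assuming fairness is disabled as above: connection setup and error handling are elided, and finish() is synchronous in recent quinn versions (older versions made it async), so adjust to the version in use.

```rust
use quinn::Connection;

/// Send each transaction on its own unidirectional stream. With
/// send_fairness(false), same-priority streams go out in the order
/// they are opened, so each transaction's fin follows immediately.
async fn send_transactions(
    conn: &Connection,
    txs: &[Vec<u8>],
) -> Result<(), Box<dyn std::error::Error>> {
    for tx in txs {
        let mut stream = conn.open_uni().await?;
        stream.write_all(tx).await?;
        stream.finish()?;
    }
    Ok(())
}
```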
@djc (Member) left a comment

Thanks!

@Ralith (Collaborator) left a comment

Beautiful, thanks!

@Ralith added this pull request to the merge queue on Oct 12, 2024
Merged via the queue into quinn-rs:main with commit 9d63e62 Oct 12, 2024
14 checks passed
@Ralith (Collaborator) commented Oct 12, 2024

@lijunwangs this change may be of interest to you as well -- I recall you had similar concerns.

alessandrod added a commit to alessandrod/solana that referenced this pull request Oct 23, 2024
Before quinn-rs/quinn#2002 we could get streams
fragmented and out of order (stream concurrency). Now streams always
come in order, so there's no reason anymore to spawn multiple tasks to
read them.

Before we could have:

[s1][s2][s3][s2 fin][s3 fin][s1 fin]

So spawning multiple tasks led to overall faster ingestion, since to
complete s1 we didn't have to wait for all the other streams to arrive.

Now we always have:

[s1 fin][s2 fin][s3 fin]

So there's no reason to spawn a task per stream: each task will be
created, read all its stream's chunks, exit, before the next stream
arrives.

This change removes the per-stream task and instead uses the connection
task to read all the streams. This removes the CPU cost of creating
tasks and the corresponding memory allocations.
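
For illustration, here is a sketch of the sequential reader the commit above describes. This is not the actual agave code; the size limit and the handler are placeholders.

```rust
use quinn::Connection;

/// Read every incoming unidirectional stream on the connection task.
/// Since streams now arrive in order and complete before the next one
/// starts, a single loop suffices and no per-stream task is spawned.
async fn read_streams(conn: Connection) -> Result<(), Box<dyn std::error::Error>> {
    loop {
        let mut stream = conn.accept_uni().await?;
        // Placeholder size limit: a single transaction is small.
        let tx = stream.read_to_end(4096).await?;
        handle_transaction(&tx);
    }
}

fn handle_transaction(_tx: &[u8]) {
    // Hand off to the rest of the ingestion pipeline (placeholder).
}
```
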
alessandrod added a commit to anza-xyz/agave that referenced this pull request Oct 28, 2024
ray-kast pushed a commit to abklabs/agave that referenced this pull request Nov 27, 2024