Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transmit directly from connection tasks #1729

Merged
merged 15 commits into from
Apr 26, 2024
Merged

Transmit directly from connection tasks #1729

merged 15 commits into from
Apr 26, 2024

Conversation

Ralith
Copy link
Collaborator

@Ralith Ralith commented Dec 19, 2023

This allows outgoing data to parallelize perfectly. Initial informal testing suggests a performance improvement for bulk data, likely due to reduced allocation and cross-task messaging. A larger performance benefit should be expected for endpoints hosting numerous connections.

I'd left the original userspace-multiplexed transmit strategy in place for a long time because I assumed the UDP socket had a mutex-guarded send buffer, so leaning into task-parallelism for sending would just lead to contention. This turns out to be complete nonsense, at least on Linux: outgoing UDP datagrams are buffered by the kernel with dynamically allocated memory, and the primitive NIC queuing operation is apparently scalable, as borne out by empirical testing. The following minimal test scales almost linearly with physical parallelism on Linux:

use std::{
    net::{SocketAddr, UdpSocket},
    thread,
    time::Instant,
};

fn main() {
    let sock = UdpSocket::bind("[::]:0").unwrap();
    let target: SocketAddr = "[::1]:1234".parse().unwrap();
    let start = Instant::now();
    const DATAGRAMS: usize = 1_000_000;
    thread::scope(|scope| {
        let threads = thread::available_parallelism().unwrap().get();
        dbg!(threads);
        let each = DATAGRAMS / threads;
        let sock = &sock;
        for _ in 0..threads {
            scope.spawn(move || {
                for _ in 0..each {
                    sock.send_to(b"hello, world", target).unwrap();
                }
            });
        }
    });
    let seconds = start.elapsed().as_secs_f32();
    println!("{} pps", DATAGRAMS as f32 / seconds);
}

It'd be interesting to see how this compares on other major platforms, but Linux's drastic improvement and the simplification of Quinn's internals are enough to satisfy me that we should go this way.

@Ralith
Copy link
Collaborator Author

Ralith commented Dec 19, 2023

Hmm, this needs a mechanism to broadcast wakeups when the UDP socket is backpressured...

edit: solved by duplicating (try_cloneing) the socket

Copy link
Member

@djc djc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks cool -- initial round of feedback trying to make sense of it all.

quinn/src/connection.rs Outdated Show resolved Hide resolved
quinn/src/connection.rs Outdated Show resolved Hide resolved
quinn/src/connection.rs Outdated Show resolved Hide resolved
quinn/src/endpoint.rs Outdated Show resolved Hide resolved
quinn/src/connection.rs Outdated Show resolved Hide resolved
quinn/src/endpoint.rs Outdated Show resolved Hide resolved
@Ralith Ralith force-pushed the parallel-transmit branch 10 times, most recently from e0f1c5b to e1fb663 Compare December 20, 2023 03:20
@Ralith
Copy link
Collaborator Author

Ralith commented Dec 20, 2023

I've replaced the original try_clone_box mechanism with the runtime's built-in concurrent I/O support, which exists in both tokio and async-std. This required a somewhat fiddly extension of the runtime abstraction layer, because for both runtimes, concurrent I/O is only possible if you use async fn interfaces, probably due to self-reference somewhere in the critical futures. Luckily, the fiddly bits are isolated almost exclusively within runtime.rs, with both implementations and callers remaining straightforward. See discussion at tokio-rs/tokio#6226 for context.

@flub
Copy link
Contributor

flub commented Dec 21, 2023

Hi, curious to try this out on various other platforms as well. How have you been running benchmarks to compare this?

I guess any perf loss due to losing sendmmsg is offset by the perf gain of not doing the userspace-multiplexing. sendmmsg does give some small benefit when you manage to fill the pipe to many peers in one syscall, but I'd have to spend more time figuring out how often quinn really manages to do that.

@Ralith
Copy link
Collaborator Author

Ralith commented Dec 21, 2023

I haven't benchmarked this very rigorously yet, since I don't have convenient access to a machine that's both running Linux (for our most optimized UDP backend) and not a laptop (subject to unpredictable throttling). We could use more data on that, if you're interested. I do plan to do some benchmarking on Windows, but we have neither sendmmsg nor GSO there just yet, so it might be a bit unfair to the status quo.

For a single connection, sendmmsg is strictly worse than GSO, and for multiple connections I think the near-linear speedup from parallelism will be a much bigger win in any case, so I'm not too concerned about regressions, though it's always possible we might miss something silly.

@Ralith Ralith force-pushed the parallel-transmit branch 2 times, most recently from 1525c4d to a71e5c0 Compare December 29, 2023 20:50
@Ralith
Copy link
Collaborator Author

Ralith commented Apr 19, 2024

Rebased.

Copy link
Member

@djc djc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

quinn/src/connection.rs Outdated Show resolved Hide resolved
quinn/src/endpoint.rs Outdated Show resolved Hide resolved
quinn-udp/src/fallback.rs Outdated Show resolved Hide resolved
quinn-proto/src/tests/util.rs Show resolved Hide resolved
`UdpState` was removed some time ago, and various other interface
details have changed. As a result, the build would fail on platforms
without native support.
As we no longer buffer multiple transmits in memory, this complexity
is unused. GSO is expected to account for most, if not all, of the
performance benefit.
Because we no longer buffer transmits for unpredictable periods,
there's no need to share ownership of their contents.
This didn't impact any tests, but was confusing.
We no longer need to share ownership of this memory, so we should use
a simpler type to reflect our simpler requirements.
@Ralith Ralith merged commit 46ac7d7 into main Apr 26, 2024
8 checks passed
@Ralith Ralith deleted the parallel-transmit branch April 26, 2024 01:21
mxinden added a commit to mxinden/neqo that referenced this pull request May 13, 2024
`neqo-bin` has been importing `quinn-udp` as a git reference, in order to
include quinn-rs/quinn#1765. The quinn project has since
released `quinn-udp` `v0.5.0`.

This commit upgrades `neqo-bin` to use `quinn-udp` `v0.5.0`.

`quinn-udp` now takes a data reference (`&[u8]`) instead of owned
data (`bytes::Bytes`) on its send path, thus no longer requiring `neqo-bin` to
convert, but simply pass a reference. See
quinn-rs/quinn#1729 (comment) for details.

`quinn-udp` has dropped `sendmmsg` support in the `v0.5.0`
release (quinn-rs/quinn@ee08826).
`neqo-bin` does not (yet) use `sendmmsg`. This might change in the
future (mozilla#1693).
github-merge-queue bot pushed a commit to mozilla/neqo that referenced this pull request May 13, 2024
* feat(bin): use quinn-udp crates.io release instead of git ref

`neqo-bin` has been importing `quinn-udp` as a git reference, in order to
include quinn-rs/quinn#1765. The quinn project has since
released `quinn-udp` `v0.5.0`.

This commit upgrades `neqo-bin` to use `quinn-udp` `v0.5.0`.

`quinn-udp` now takes a data reference (`&[u8]`) instead of owned
data (`bytes::Bytes`) on its send path, thus no longer requiring `neqo-bin` to
convert, but simply pass a reference. See
quinn-rs/quinn#1729 (comment) for details.

`quinn-udp` has dropped `sendmmsg` support in the `v0.5.0`
release (quinn-rs/quinn@ee08826).
`neqo-bin` does not (yet) use `sendmmsg`. This might change in the
future (#1693).

* remove impl From<Datagram> for Vec<u8>
@djc djc mentioned this pull request May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants