Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

conn, device, tun: implement vectorized I/O on Linux #64

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions conn/controlfns.go
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,13 @@ import (
"syscall"
)

// UDP socket read/write buffer size (7MB). The value of 7MB is chosen as it is
// the max supported by a default configuration of macOS. Some platforms will
// silently clamp the value to other maximums, such as linux clamping to
// net.core.{r,w}mem_max (see _linux.go for additional implementation that works
// around this limitation)
const socketBufferSize = 7 << 20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be platform defined? Seems odd to justify the use of 7MBs on the grounds of macOS and then apply it everywhere. Or is it the case where every other platform clamps below 7MB and never above, so 7MB represents a useful upperbound?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be platform defined anytime if we decide that's a good idea, if we demonstrate a scenario where it is worthwhile.

This value would ideally actually be tunable, it is essentially the main tuning knob around which you can address BDP. What we observed in testing is that we need a value substantially larger than the rather conservative default of hundreds of kilobytes in scenarios where there is either significant latency between peers, or significant bandwidth available.

The 7mb value while a bit strange courtesy of macOS limit, is not strongly worse than say, splitting this out and picking 8mb for Linux. We'd like to see more feedback from users regarding use cases here, in our testing this makes 100ms latency links perform much better (once whatever is inside also warms up, such as window expansion), as well as removes a significant bottleneck in reaching speeds in the 1-10gbps range. It is possible it is not sufficient for some 25gbps+ use cases, however other bottlenecks in the system need to be resolved first, so adding potentially substantially more memory usage now would be premature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just worry that we'll want to change it at some point and then inadvertently break macOS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just worry that we'll want to change it at some point and then inadvertently break macOS.

Break because we make the value too large and miss the comment? Would adding a darwin test checking the value be sufficient in this case? Happy to break it out per-platform, but maybe one test file is simpler? Or we could do both.


// controlFn is the callback function signature from net.ListenConfig.Control.
// It is used to apply platform specific configuration to the socket prior to
// bind.
Expand Down
15 changes: 15 additions & 0 deletions conn/controlfns_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,21 @@ import (
func init() {
controlFns = append(controlFns,

// Attempt to set the socket buffer size beyond net.core.{r,w}mem_max by
// using SO_*BUFFORCE. This requires CAP_NET_ADMIN, and is allowed here to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you envision tun being available but not SO_*BUFFORCE? Android where it's passed over a unix domain socket?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, for better and worse it's becoming a bit chaotic out there what feature sets overlap, largely due to security policy constraints. The two big aspects are Android and container runtimes, with there being much larger variation in configurations in the latter. In Android this call fails due to lack of CAP_NET_ADMIN, though I would have expected it to also be caught by the selinux configuration (as sadly the offload is) 🤷🏼

We might consider reporting the outcome, but to some extent I'm also waiting to see what happens here with the defaults once userspace QUIC deployments become more common. This will come up there once those folks also get through optimizations, perhaps it's time to change the defaults or some of the constraints.

I'd be happy to write some notes about this in a doc somewhere, if we have a suitable doc in a suitable place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to write some notes about this in a doc somewhere, if we have a suitable doc in a suitable place.

The right place for those types of implementation notes is commit messages. It seems like this snippet can be separated out into a standalone thing. If you think there's interesting context and such, just put it in its own commit and stick that in a commit message. (But in case if it's not that interesting and doesn't matter, don't bother - up to you.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The socket options for UDP snd/rcv buffer size have been broken out into their own commit, with these details included in the commit message aeed2ba.

// fail silently - the result of failure is lower performance on very fast
// links or high latency links.
func(network, address string, c syscall.RawConn) error {
return c.Control(func(fd uintptr) {
// Set up to *mem_max
_ = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUF, socketBufferSize)
_ = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_SNDBUF, socketBufferSize)
// Set beyond *mem_max if CAP_NET_ADMIN
_ = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUFFORCE, socketBufferSize)
_ = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_SNDBUFFORCE, socketBufferSize)
})
},

// Enable receiving of the packet information (IP_PKTINFO for IPv4,
// IPV6_PKTINFO for IPv6) that is used to implement sticky socket support.
func(network, address string, c syscall.RawConn) error {
Expand Down
7 changes: 7 additions & 0 deletions conn/controlfns_unix.go
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,13 @@ import (

func init() {
controlFns = append(controlFns,
func(network, address string, c syscall.RawConn) error {
return c.Control(func(fd uintptr) {
zx2c4 marked this conversation as resolved.
Show resolved Hide resolved
_ = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUF, socketBufferSize)
_ = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_SNDBUF, socketBufferSize)
})
},

func(network, address string, c syscall.RawConn) error {
var err error
if network == "udp6" {
Expand Down
23 changes: 23 additions & 0 deletions conn/controlfns_windows.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
/* SPDX-License-Identifier: MIT
*
* Copyright (C) 2017-2023 WireGuard LLC. All Rights Reserved.
*/

package conn

import (
"syscall"

"golang.org/x/sys/windows"
)

func init() {
controlFns = append(controlFns,
func(network, address string, c syscall.RawConn) error {
return c.Control(func(fd uintptr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually do anything on Windows, given that Windows uses WinRIO? It might, I just don't know myself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at least for API users who use StdNetBind, such as Tailscale.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SO_{RCV,SND}BUF seems like a new thing. Can you separate its use out into a different commit with its own justification?

(It can stay in this PR, but just have a separate commit with a message I can read to understand this.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The socket options for UDP snd/rcv buffer size have been broken out into their own commit, with these details included in the commit message aeed2ba.

_ = windows.SetsockoptInt(windows.Handle(fd), windows.SOL_SOCKET, windows.SO_RCVBUF, socketBufferSize)
_ = windows.SetsockoptInt(windows.Handle(fd), windows.SOL_SOCKET, windows.SO_SNDBUF, socketBufferSize)
})
},
)
}