gen_udp with Unix domain socket on Linux can block, leaking inet_reply messages into calling process #8989

Open
mjm opened this issue Oct 26, 2024 · 8 comments
Labels
bug Issue is reported as a bug team:PS Assigned to OTP team PS

mjm commented Oct 26, 2024

Describe the bug

When using the gen_udp module with Unix domain sockets, sending a packet can fail with EINTR, which the sendto implementation in inet does not seem to expect. The result is an {inet_reply, Port, Ref} message (no reply value) that goes unhandled by sendto and ends up in the calling process's mailbox.

To Reproduce

I've reduced this to a small reproduction in Elixir: https://gist.github.com/mjm/490abd286e526fceaeb0e373414e1214

It reproduces for me on Linux but not on macOS, so I used `docker run -it elixir /bin/bash` to get a Linux Elixir environment. Then you can paste the module from the gist into two iex sessions, and run `UdsBlockExample.test_listen()` in one and `UdsBlockExample.test_socket()` in the other.

`test_socket()` will raise an error saying that it received an unexpected inet_reply message.

Expected behavior

This example code should run without error, as inet_reply messages should not leak out of these calls.

In production, this is manifesting as some of our genservers suddenly receiving these unexpected messages after we switched to using Unix domain sockets for reporting telemetry to statsd.

Using the new socket inet_backend also causes this to work as expected.

Affected versions

In production we hit this on OTP 26.2.5 but it also reproduces on the latest OTP 27.

Additional context

The undesired messages come from this code path in the inet driver.

A comment a short bit above this suggests that EINTR should not happen for UDP, and that seems to be true, but it appears that it can happen for AF_UNIX datagram sockets, at least on Linux.

And here is where sendto is not handling this shape of message, which is what allows it to leak. The implementation of send above this has a case for handling 3-tuples, but sendto assumes that won't happen.

@mjm mjm added the bug Issue is reported as a bug label Oct 26, 2024
@IngelaAndin IngelaAndin added the team:PS Assigned to OTP team PS label Oct 28, 2024
@IngelaAndin IngelaAndin assigned IngelaAndin and bmk and unassigned IngelaAndin Oct 29, 2024
Contributor

bmk commented Oct 29, 2024

There is a comment in the code that explains why gen_udp has problems with this:

`/* "code" analysis is the same for both SCTP and UDP above,
 * although ERRNO_BLOCK | EINTR never happens for UDP
 */`

So, EINTR is "not supposed" to be possible.
Clearly, with a Unix domain socket, this can happen (on Linux)...

Contributor

bmk commented Oct 30, 2024

Should have asked this before, but what flavor and version of Linux did you test this with?

Contributor

frej commented Oct 30, 2024

> So, EINTR is "not supposed" to be possible.

EINTR is documented as a valid error for all of send, sendto, and sendmsg if a signal arrives, so the comment is wrong. Unless the VM traps it using a signalfd, that is :)

Contributor

bmk commented Oct 30, 2024

I mentioned the comment as an explanation of the behavior, not a justification.
Regardless, I have done some testing:

On FreeBSD (14.1), OpenIndiana (Hipster 2023.10), MacOS (14.4.1/23.4.0), NetBSD (9.0) the result is 'enoent'.

I have also tested this on the following versions of Linux without being able to reproduce the issue:
Ubuntu 22.04.5 (6.8.0-47-generic), Ubuntu 20.04.6 (5.4.0-196-generic),
Linux Mint 21 (5.15.0-122-generic), LMDE 5 (5.10.0-33-amd64),
SLES 12 (3.12.60-52.54-default), SLES 12-SP2 (4.4.74-92.35-default).

Here is a PR for testing:
https://github.com/bmk/otp/tree/bmk/kernel/20241030/gen_udp_blocking_send_on_local

Author

mjm commented Oct 30, 2024

> Should have asked this before, but what flavor and version of Linux did you test this with?

In production, we're running on Google Kubernetes Engine, so the nodes are running Container-Optimized OS cos-113-18244-151-27. When I was creating the reproduction example, I was running on Docker Desktop on macOS 4.34.3 (170107). I'm not sure what version of Linux that's using on the VM it manages.

In both contexts, sysctl net.unix.max_dgram_qlen appears to be 10. I think it being so low is why this happens.
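To see how a small receive queue affects a sender, here is a minimal Python sketch (not the reproduction from the gist, and not how the inet driver works internally). It binds an AF_UNIX datagram socket that nobody reads from, then sends to it from a non-blocking socket until the kernel refuses. The function name `count_sends_until_full`, the socket file name, and the iteration cap are all made up for illustration. On Linux I would expect the first failure to be EAGAIN after roughly `net.unix.max_dgram_qlen` datagrams, but the exact count and error depend on the kernel and on buffer sizes.

```python
import errno
import os
import socket
import tempfile


def count_sends_until_full(payload=b"x", limit=100_000):
    """Send datagrams to a bound AF_UNIX socket that is never read.

    Returns (number of successful sends, errno of the first failure),
    or (limit, None) if the hypothetical safety cap is reached first.
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.sock")  # illustrative file name
    receiver = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        receiver.bind(path)
        # Non-blocking: a full queue produces an error instead of a stall.
        sender.setblocking(False)
        sent = 0
        while sent < limit:
            try:
                sender.sendto(payload, path)
            except OSError as exc:
                return sent, exc.errno
            sent += 1
        return sent, None
    finally:
        sender.close()
        receiver.close()
        if os.path.exists(path):
            os.unlink(path)
        os.rmdir(tmpdir)


if __name__ == "__main__":
    sent, err = count_sends_until_full()
    name = errno.errorcode.get(err, "unknown") if err is not None else "none"
    print(f"{sent} datagrams accepted before the first failure ({name})")
```

Running this on hosts with different `max_dgram_qlen` values should show how much sooner a sender hits the limit when the queue is short, which is the situation a busy statsd socket could run into.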

Contributor

bmk commented Oct 31, 2024

Aha. On my machine:
$ sysctl net.unix.max_dgram_qlen
net.unix.max_dgram_qlen = 512

If you can, please test my branch, and see if that solves the problem.
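For comparing this setting across hosts or containers without the `sysctl` binary, a small Python helper can read the value from procfs. This is only a sketch: `read_max_dgram_qlen` is a made-up name, and it assumes a Linux system that exposes `/proc/sys/net/unix/max_dgram_qlen`. On other systems it returns `None`.

```python
from pathlib import Path


def read_max_dgram_qlen(path="/proc/sys/net/unix/max_dgram_qlen"):
    """Return the queue length limit as an int, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Not Linux, procfs not mounted, or unexpected file contents.
        return None


if __name__ == "__main__":
    print("net.unix.max_dgram_qlen =", read_max_dgram_qlen())
```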

Author

mjm commented Oct 31, 2024

Okay, today I'll see if I can get that built in a context where I've actually had the problem.

Author

mjm commented Oct 31, 2024

I was able to build your branch in a Docker container and test it alongside both 27.1.2 and 25.3.2.15. The former reproduces the bug, while the latter does not, because the special handling of EINTR doesn't exist yet in that version.

Your branch did not reproduce the problem!
