Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Receive: netlink receive: recvmsg: no buffer space available #103

Closed
danopia opened this issue Jul 13, 2020 · 12 comments
Closed

Receive: netlink receive: recvmsg: no buffer space available #103

danopia opened this issue Jul 13, 2020 · 12 comments

Comments

@danopia
Copy link

danopia commented Jul 13, 2020

Hello, I've been using this library to repeatedly 'replace' a table with ~50 rules, and one thing I've noticed is that certain patterns seem to trigger this cryptic message:

Receive: netlink receive: recvmsg: no buffer space available

The cryptic part is that it doesn't seem related to the overall size of the table, but rather the presence of certain rules (jumps, etc).

Even so, the flush seems to take effect anyway; the changes are reflected in the kernel after receiving the message.

I found some info on ENOBUFS here: https://netfilter.vger.kernel.narkive.com/cwzlgk8d/why-no-buffer-space-available and noticed that the golang netlink library has an option exposed to help the error: https://github.com/mdlayher/netlink/blob/c5f8ab79aa345dcfcf7f14d746659ca1b80a0ecc/conn.go#L437

I patched my local checkout of google/nftables/conn.go to include this line:

        conn.SetOption(netlink.NoENOBUFS, true)

after:

        defer conn.Close()

and it seems to fix the error.. I don't know enough at this point to understand the implications of using that option, or why it might not be desired, so I decided to start with a ticket instead of a PR.

Has anyone else seen this message?

@stapelberg
Copy link
Collaborator

@mdlayher any thoughts?

@mdlayher
Copy link
Contributor

Unfortunately I don't have any more info. I've never run into this error in my work and only chose to implement the option because it appeared in a list in the man pages.

@danopia
Copy link
Author

danopia commented Jul 14, 2020

Ok thanks, I'll see if I can capture and minify a repro for this. I assume without that there's basically nothing to look at 😄

I am running my code on two different Debian releases, 10 and sid, and see the error frequently on both. (I've modified my calling code to consider ENOBUFS as a successful Flush(), instead of patching NoENOBUFS on, which might be a more reasonable workaround. Really not sure though.)

@mdlayher
Copy link
Contributor

I took a look at the mailing list thread you mentioned above and it seems to indicate that userspace cannot keep up with the kernel's messages.

When I wrote my netlink code, simplicity and correctness were the primary goals. I know there are some outstanding work items related to OS thread locking and etc that can slow things down, but since I haven't run into them in my use cases, it's low priority work for me.

You could try DisableNSLockThread in https://godoc.org/github.com/mdlayher/netlink#Config but I have no other advice at the moment unfortunately.

@danopia
Copy link
Author

danopia commented Jul 14, 2020

I took a look at the mailing list thread you mentioned above and it seems to indicate that userspace cannot keep up with the kernel's messages.

That's the part that I don't really understand--it makes sense to fall behind when processing individual packets in userspace as the thread indicates, but I'm just sending ~100 rules to the kernel in one batch and waiting for an ACK batch back. It seems like it should be hard to fall behind on that? I suppose it's related to how the ACK includes handles for rules that are created.

You could try DisableNSLockThread in https://godoc.org/github.com/mdlayher/netlink#Config but I have no other advice at the moment unfortunately.

This is the case already:

nftables/conn.go

Lines 92 to 98 in 26bcabf

disableNSLockThread := false
if cc.NetNS == 0 {
// Since process is operating in main namespace, there is no reason to have NSLockThread lock,
// as it causes some performance penalties.
disableNSLockThread = true
}
return netlink.Dial(unix.NETLINK_NETFILTER, &netlink.Config{NetNS: cc.NetNS, DisableNSLockThread: disableNSLockThread})
and I've confirmed that it's passing true for me.


Here's a super simplified reproduction that just puts a bunch of accept rules into one table/chain: https://gist.github.com/danopia/437173efc0046962fafb746eccfe830e

On my laptop it reaches ENOBUFS at 138 accept rules. I also tried tcp dport 80 ip saddr 1.1.1.1 accept and ENOBUFS appeared at 104 rules. (So rule complexity is relevant but not as relevant as I'd expect.)

> go build .; and sudo setcap cap_net_admin+ep firewall-buf-repro; and ./firewall-buf-repro
2020/07/14 19:54:39 Making ever more rules...
2020/07/14 19:54:39 Trying 75 rules
2020/07/14 19:54:39 Trying 80 rules
2020/07/14 19:54:39 Trying 85 rules
2020/07/14 19:54:39 Trying 90 rules
2020/07/14 19:54:39 Trying 95 rules
2020/07/14 19:54:39 Trying 100 rules
2020/07/14 19:54:39 Trying 105 rules
2020/07/14 19:54:39 Trying 110 rules
2020/07/14 19:54:39 Trying 115 rules
2020/07/14 19:54:39 Trying 120 rules
2020/07/14 19:54:39 Trying 125 rules
2020/07/14 19:54:39 Trying 130 rules
2020/07/14 19:54:39 Trying 135 rules
2020/07/14 19:54:39 Trying 140 rules
2020/07/14 19:54:39 nftables error: Receive: netlink receive: recvmsg: no buffer space available

Does it make sense that this library is unable to acknowledge 150 accept rules in one flush? Is everyone who manages a couple hundred rules creating them piecemeal? I didn't think that this would be too many rules!

# one too many rules to receive Flush() confirmation on my laptop
> sudo nft list table demo-table
table ip demo-table {
        chain cluster-ips {
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
                accept
        }
}

@stapelberg
Copy link
Collaborator

Thanks for the repro program.

On my machine, it prints:

2020/07/22 23:26:08 Making ever more rules...
2020/07/22 23:26:08 Trying 75 rules
2020/07/22 23:26:09 Trying 80 rules
2020/07/22 23:26:09 Trying 85 rules
2020/07/22 23:26:09 Trying 90 rules
2020/07/22 23:26:09 Trying 95 rules
2020/07/22 23:26:09 Trying 100 rules
2020/07/22 23:26:09 Trying 105 rules
2020/07/22 23:26:09 Trying 110 rules
2020/07/22 23:26:09 Trying 115 rules
2020/07/22 23:26:09 Trying 120 rules
2020/07/22 23:26:09 Trying 125 rules
2020/07/22 23:26:09 Trying 130 rules
2020/07/22 23:26:09 Trying 135 rules
2020/07/22 23:26:09 Trying 140 rules
2020/07/22 23:26:09 nftables error: Receive: netlink receive: recvmsg: no buffer space available

My guess would be that due to the NLM_F_ACK flag, each request generates a reply, but the combination of replies exceeds the socket buffer size, and hence the kernel returns -ENOBUFS.

I’m not yet sure what the best way to fix this is:

  1. Do nothing at the library level, and instead rely on users not causing too large reply packets.
  2. Stop requesting acknowledgements to keep reply sizes small? Maybe this only reduces the likelihood of triggering the issue, in case there are replies which are always big.
  3. Reading https://patchwork.ozlabs.org/project/netdev/patch/20090323141858.6808.50788.stgit@Decadence/, I think setting the NoENOBUFS option results in dropping acknowledgements instead of returning -ENOBUFS. Instead of setting NoENOBUFS, we should probably stop requesting acknowledgements.

@mdlayher
Copy link
Contributor

Instead of setting NoENOBUFS, we should probably stop requesting acknowledgements.

Seems reasonable to me.

@danopia
Copy link
Author

danopia commented Jul 24, 2020

Thanks for looking into the netdev source. Since we're dealing with the kernel I suppose it makes sense that packets would get dropped instead of buffered when using NoENOBUFS.

I found the specific message for adding rules and noticed that it has both ECHO and ACK flags masked in:

flags = netlink.Request | netlink.Acknowledge | netlink.Create | unix.NLM_F_ECHO | unix.NLM_F_APPEND

So I decided to do a little bit of testing with the above accept repro:

  • With ECHO | ACK, my loop crashes at 140 rules.
  • With only ACK, it crashes at 280 rules.
  • With only ECHO, it also crashes at 280 rules.
  • With neither, it seems error free up to / past 1000 rules.

Then I adjusted the repro to include 8 verdicts per rule, accept accept accept accept accept accept accept accept:

  • With ECHO | ACK, it crashes at 105 rules
  • With only ACK, it crashes at 280 rules.
  • With only ECHO, it crashes at 170 rules.

So it seems like ACK isn't the only problem, especially as rule complexity is increased. ECHO is a larger contributor towards read buffer space. Frustratingly, I've noticed this unrelated work that depends on the ECHO responses: #88

Maybe there's need for configurability if the masking is being changed? So folks that want ECHO on each frame can keep it. I only care about knowing if the whole batch was accepted as a unit.

(I also toyed with adding calls like conn.SetReadBuffer(200*1000) in conn.go which directly changed the ENOBUFS boundary and could be made configurable, however rmem_max is still at play without resorting to SO_RCVBUFFORCE. So that's a separate hole to go down.)

@stapelberg
Copy link
Collaborator

Thanks for the details!

How does the nft(8) tool handle this issue?

@danopia
Copy link
Author

danopia commented Aug 8, 2020

When using nft --debug mnl -f ..., I see neither ECHO nor ACK flags on any of the packets. Thus it shouldn't have this problem at all.

When using nft --debug mnl --echo -f ..., I see only the ECHO flag added to all the packets. With this arg I actually see the nft(8) tool fail silently, after around 170 lines of tcp dport http ip saddr 1.1.1.1 accept repeated:

> sudo nft -f nft.txt
> echo $status
0

> sudo nft -f nft.txt --echo
> echo $status
1

I don't fully understand when nft(8) uses ACK. Passing --handle by itself seems to be a no-op. Passing --handle --echo starting using ACK for some packets, but the main rules packets still seem to only have ECHO masked in.

In summary, nft(8) doesn't handle this error at all, but it doesn't use ACK or ECHO by default either, so it generally avoids the issue overall.

@stapelberg
Copy link
Collaborator

Thanks for the details! Sounds like removing the ACK and ECHO flags is the best choice to make, then. Is there any downside to that I’m not seeing? Would you be interested in sending a PR that removes them?

@danopia
Copy link
Author

danopia commented Aug 25, 2021

I'm closing this as stale. I ignore the error in my program and have had no further issues since then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants