Not all response messages are received, causing the receive buffer to overflow #235
After sending a message, only a single call to `conn.Receive` is made, which can cause messages to queue up in the receive buffer if not all of them can be returned at once. This causes `ENOBUFS` as the kernel is unable to write any further messages. This commit introduces a check that ensures we call `conn.Receive` as often as needed to get the right number of responses: one for the acknowledgment and one for the echo. Resolves: google#235
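A minimal sketch of that idea, assuming direct access to the underlying `*netlink.Conn` from `github.com/mdlayher/netlink` (the helper name `receiveN` is made up for illustration and is not the actual patch):

```go
package sketch

import "github.com/mdlayher/netlink"

// receiveN keeps calling Receive until the expected number of response
// messages has been collected. A single Receive call may return only part
// of what the kernel queued for our request, so one call is not enough.
func receiveN(conn *netlink.Conn, want int) ([]netlink.Message, error) {
	var replies []netlink.Message
	for len(replies) < want {
		msgs, err := conn.Receive()
		if err != nil {
			return nil, err
		}
		replies = append(replies, msgs...)
	}
	return replies, nil
}
```

For a message sent with both acknowledgment and echo requested, `want` would be 2, matching the description above.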
Hi @maxmoehl, I believe this is closely related to what I stumbled upon once and tried to address in #191 with receive retries. See the following comment for more details: #191 (comment) I ended up opening a PR in netlink, as agreed in the original nftables PR #191, but it is not yet reviewed: mdlayher/netlink#200 (comment) Are you using an nftables lasting connection? Have you tried changing this to a standard connection to see how your program behaves?
Thanks for the references! I will take a closer look at those to better understand the issue. Yes, we are using a lasting connection; it seemed like a reasonable choice given that we know we will be sending 1000+ messages. I also saw that there was another issue (see the second comment in my PR); however, the change still improved the behaviour in some situations (though I have to admit that I never tested it properly beyond a few test runs).
I have looked further into the issues I previously sent you and found that the specific case I was investigating was resolved in #194, but please still check all the links to confirm whether they are related to your case. Since you have confirmed that you are using a lasting connection, I can only say at this point that a lasting conn behaves differently from a standard one, and it is possible that you have found a case which is not covered by receiveAckAware. To be completely sure, I would need to take some time to verify this. Can you maybe share a code snippet that reproduces the issue? In the meantime, you can switch to a standard connection, as I would expect it is not affected by this issue. Your best bet for fixing this would be to check how the original nftables tooling handles these messages.
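For reference, this is roughly what the two modes look like with google/nftables (a minimal sketch; the exact `New` and `AsLasting` signatures depend on the library version you are on):

```go
package main

import "github.com/google/nftables"

func main() {
	// Standard connection: a fresh netlink socket is opened for each
	// operation and closed afterwards, so its receive buffer is discarded.
	std, err := nftables.New()
	if err != nil {
		panic(err)
	}
	_ = std

	// Lasting connection: one socket is kept open across operations,
	// so any unread responses accumulate in its receive buffer.
	lasting, err := nftables.New(nftables.AsLasting())
	if err != nil {
		panic(err)
	}
	defer lasting.CloseLasting()
}
```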
I just tried running it with a non-lasting connection and it seems to be working fine (which makes sense, given that each connection has its own buffer which is discarded after one operation). However, as expected, the overhead is quite high: all benchmarks I currently have take about twice as long compared to the lasting connection. This could be tweaked by running more than one operation at a time and therefore increasing the number of rules written per flush, but at some point I will then run into the send buffer overflowing, because there isn't any protection against that either. I've put a working example here. These are the results I got:
For completeness: the tests have been run on an AWS VM with the following specs:
Regarding the change you made: it only applies to get operations, and so far I'm only seeing issues with write operations. But in the end it's a similar issue: the responses we should expect are not properly read. The netlink protocol has some features to ensure reliable transmission; if the responses are simply discarded, all of those guarantees are lost. It could even be that the kernel is requesting an ACK from the client; if we don't respond to that, it might keep sending the same message indefinitely, causing the receive buffer to overflow.
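To make the request/response contract concrete, here is a tiny illustration of the flags involved (this is just the generic netlink mechanism as exposed by `github.com/mdlayher/netlink`, not a claim about which flags this library actually sets on write operations):

```go
package sketch

import "github.com/mdlayher/netlink"

// A write request carrying both Acknowledge (NLM_F_ACK) and Echo (NLM_F_ECHO)
// asks the kernel for two kinds of replies: the echoed, processed message and
// an explicit acknowledgment. If the caller never reads them, they accumulate
// in the socket's receive buffer until ENOBUFS is hit.
const writeRequestFlags = netlink.Request | netlink.Acknowledge | netlink.Echo
```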
Hi @maxmoehl, I have tested your snippet and I believe I see where the issue is. From the output it can be seen that in your lasting connection there are multiple messages sent as a response, for example:
Note the sequences we are sending and receiving, and observe the last one in particular. In my observations, netlink responds with the following set of messages after we initiate a request for a new rule:
All of those messages are sent for each message requesting a new rule. I have patched the receive handling as follows:
I am not entirely sure whether this patch is complete. I can confirm that I can no longer reproduce the issue and that all tests are passing.
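For anyone reading along, a rough sketch of the kind of classification such a patch needs to do (my own illustration, not the actual change; `splitReplies` is a hypothetical name):

```go
package sketch

import "github.com/mdlayher/netlink"

// splitReplies separates the acknowledgments from the echoed payload messages
// returned for a single request. NLM_F_ACK replies arrive as Error-type
// messages whose payload carries the error code (0 means plain success).
func splitReplies(msgs []netlink.Message) (acks, payload []netlink.Message) {
	for _, m := range msgs {
		if m.Header.Type == netlink.Error {
			acks = append(acks, m)
			continue
		}
		payload = append(payload, m)
	}
	return acks, payload
}
```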
Yes, it works! Thanks!
Can you provide a PR for this change?
Fixes google#235

- Added support for messages carrying the overrun flag
- Changed the `conn.Receive` call to `receiveAckAware` in `Flush`
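As a rough illustration of the first bullet (again a sketch, not the code from this PR; `checkOverrun` is a hypothetical helper), detecting the overrun condition could look like this:

```go
package sketch

import (
	"errors"

	"github.com/mdlayher/netlink"
)

// checkOverrun reports whether the kernel signalled NLMSG_OVERRUN, i.e. that
// it had to drop data because a buffer filled up, so the caller knows the
// reply stream is incomplete.
func checkOverrun(msgs []netlink.Message) error {
	for _, m := range msgs {
		if m.Header.Type == netlink.Overrun {
			return errors.New("netlink: overrun message received, replies were dropped")
		}
	}
	return nil
}
```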
This is closely related to #103.
I was observing a similar issue: after writing about 240 rules I'd receive `conn.Receive: netlink receive: recvmsg: no buffer space available`. The rules I'm writing are always created in pairs, two `AddRule` calls are always followed by a `Flush`. Using `NLDEBUG=level=1` to produce debug output from the netlink library I was able to observe that my application executed 490 calls to `SendMessages` but only around 320 messages were received (this doesn't match with the 240 rules I see but I'll ignore that for now).

Looking at the receive logic it seems like it doesn't really care how many messages are read back:

nftables/conn.go, lines 234 to 241 in d27cc52

We may, or may not, get multiple messages with that single receive call. What should be done at this stage is to ensure we read as many messages as we expect.

The number of messages returned by each call to `Receive` seems to depend on a number of variables, but it looks like netlink is unable to return more than two (and often not even two) at a time. This causes messages to collect in the receive buffer as we call `Receive` only once per message sent.

To confirm my theory I've implemented a fix in #236. With the fix applied I was able to write 10,000 rules continuously without any changes to my logic.
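For context, the write pattern that triggers this looks roughly like the following (a reproduction sketch, not my actual code; the table name, chain name, and counter expression are placeholders, and the `New`/`AsLasting` signatures depend on the library version):

```go
package main

import (
	"fmt"

	"github.com/google/nftables"
	"github.com/google/nftables/expr"
)

func main() {
	// Lasting connection, as in the setup that eventually hits ENOBUFS.
	c, err := nftables.New(nftables.AsLasting())
	if err != nil {
		panic(err)
	}
	defer c.CloseLasting()

	table := c.AddTable(&nftables.Table{Family: nftables.TableFamilyIPv4, Name: "repro"})
	chain := c.AddChain(&nftables.Chain{Name: "repro-chain", Table: table})
	if err := c.Flush(); err != nil {
		panic(err)
	}

	// Two AddRule calls followed by a Flush, repeated many times.
	for i := 0; i < 5000; i++ {
		for j := 0; j < 2; j++ {
			c.AddRule(&nftables.Rule{
				Table: table,
				Chain: chain,
				Exprs: []expr.Any{&expr.Counter{}},
			})
		}
		if err := c.Flush(); err != nil {
			panic(fmt.Sprintf("flush %d: %v", i, err))
		}
	}
}
```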