Auditbeat: Fixes for system/socket dataset #19033
Pinging @elastic/siem (Team:SIEM)
A somewhat deeper explanation, for posterity's sake.

It looks like this happens when the kernel reuses a socket pointer after a missed inet_release call. Our old clean-up code failed to remove the socket from the socketLRU but overwrote the lookup value in the underlying sockets map. When the reaper code for cleaning up old sockets then ran, it referenced the orphaned record in the socketLRU along with its kernel pointer, which now pointed to a new socket. As a result, the reference in the socketLRU was never removed and was evaluated again and again via the Peek call in the for loop. The fix works because any time we expire a socket we now also explicitly remove it from the socketLRU (to which a socket reference is always added when the socket is created) and mark it as being in a closing state.

Does that sound about right? If so, would it be possible to add a simple test for this condition? Basically: flow --> missed inet_release --> reused kernel pointer (uint64) --> make sure the old socket reference is in a closing state.
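To make the failure mode described above concrete, here is a minimal sketch in Go. It is not the code from state.go: the socket and state types, the elems index, and the create, expire, and reap functions are all hypothetical names chosen for illustration. It only assumes what the comment describes, namely a map keyed by kernel pointer plus an age-ordered list, and an expire step that always removes the record from that list and marks it as closing.

```go
package main

import "container/list"

// socket is a simplified, hypothetical stand-in for the dataset's socket record.
type socket struct {
	ptr     uint64 // kernel pointer, used as the map key
	id      int    // tells apart records that share a reused pointer
	closing bool
}

// state pairs a pointer-keyed map with an age-ordered list, loosely
// mirroring the sockets map and the socketLRU discussed above.
type state struct {
	sockets map[uint64]*socket
	lru     *list.List                // front is the oldest record
	elems   map[*socket]*list.Element // hypothetical index into lru
}

func newState() *state {
	return &state{
		sockets: make(map[uint64]*socket),
		lru:     list.New(),
		elems:   make(map[*socket]*list.Element),
	}
}

// create registers a socket. If the pointer is already tracked (a missed
// inet_release followed by pointer reuse), the old record is expired first
// instead of being silently overwritten in the map.
func (s *state) create(ptr uint64, id int) *socket {
	if old, ok := s.sockets[ptr]; ok {
		s.expire(old)
	}
	sock := &socket{ptr: ptr, id: id}
	s.sockets[ptr] = sock
	s.elems[sock] = s.lru.PushBack(sock)
	return sock
}

// expire marks the record as closing and always removes it from the list.
// The map entry is deleted only if it still refers to this exact record.
func (s *state) expire(sock *socket) {
	sock.closing = true
	if e, ok := s.elems[sock]; ok {
		s.lru.Remove(e)
		delete(s.elems, sock)
	}
	if cur, ok := s.sockets[sock.ptr]; ok && cur == sock {
		delete(s.sockets, sock.ptr)
	}
}

// reap expires every tracked socket, oldest first, and returns the count.
// It terminates because expire always removes the front element.
func (s *state) reap() int {
	n := 0
	for e := s.lru.Front(); e != nil; e = s.lru.Front() {
		s.expire(e.Value.(*socket))
		n++
	}
	return n
}
```

The requested test scenario maps onto this sketch as: call create twice with the same pointer, check that the first record is in the closing state and that only one entry remains in the list, then confirm that reap returns instead of spinning on an orphaned entry.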
The feature was using socket.closeTime as a reference for expiration, but this timestamp was only set once the socket was closed or expired, so it caused all sockets to expire every closeTimeout.
@andrewstucki while adding the test I found yet another problem: it wasn't dealing with socket timeouts properly. See the new commit. Can you have another look?
painful 😬 thanks for tracking these down and adding the tests @adriansr
yep, the whole state.go should be rewritten from scratch :D
Fixes two problems with the system/socket dataset:
- A bug in the internal state of the socket dataset that led to an infinite loop on systems where the kernel aggressively reuses sockets (observed in kernel 2.6 / CentOS/RHEL 6.x).
- Socket expiration wasn't working as expected because it used an uninitialized timestamp: flows were expiring at every check.
Also fixes two other minor issues:
- A flow could be terminated twice by different code paths, leading to a wrong numFlows calculation and duplicated flows being indexed.
- Decoupled the status debug log and socket cleanup into separate goroutines so that logging is still performed under high-load situations.
(cherry picked from commit 665b67f)
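The double-termination issue mentioned in the commit message can be guarded against with an idempotent terminate step. The sketch below is a hypothetical illustration, not the change made in this PR: the flow and stats types, the done flag, and the published counter are invented names. It shows one common way to make sure that two code paths ending the same flow only update the counters and emit the flow once.

```go
package main

// flow is a hypothetical, reduced flow record.
type flow struct {
	done bool // set once the flow has been terminated
}

// stats tracks the number of live flows and how many were emitted.
type stats struct {
	numFlows  int
	published int
}

// terminate ends a flow at most once. It returns true only for the call
// that actually terminated the flow, so a second code path reaching the
// same flow neither decrements numFlows again nor emits a duplicate.
func (st *stats) terminate(f *flow) bool {
	if f.done {
		return false
	}
	f.done = true
	st.numFlows--
	st.published++
	return true
}
```

Calling terminate twice on the same flow leaves numFlows decremented by one and published incremented by one.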
@adriansr auditbeat version 7.17.0 (amd64), libbeat 7.17.0: system/socket peak CPU usage is 100%+, mean CPU usage is 40%+.
…9081)
Fixes two problems with the system/socket dataset:
- A bug in the internal state of the socket dataset that led to an infinite loop on systems where the kernel aggressively reuses sockets (observed in kernel 2.6 / CentOS/RHEL 6.x).
- Socket expiration wasn't working as expected because it used an uninitialized timestamp: flows were expiring at every check.
Also fixes two other minor issues:
- A flow could be terminated twice by different code paths, leading to a wrong numFlows calculation and duplicated flows being indexed.
- Decoupled the status debug log and socket cleanup into separate goroutines so that logging is still performed under high-load situations.
(cherry picked from commit 9555ff4)
What does this PR do?
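Fixes two problems with the system/socket dataset:
- A bug in the internal state of the socket dataset that led to an infinite loop on systems where the kernel aggressively reuses sockets (observed in kernel 2.6 / CentOS/RHEL 6.x).
- Socket expiration wasn't working as expected because it used an uninitialized timestamp: flows were expiring at every check.
Also fixes two other minor issues:
- A flow could be terminated twice by different code paths, leading to a wrong numFlows calculation and duplicated flows being indexed.
- Decoupled the status debug log and socket cleanup into separate goroutines so that logging is still performed under high-load situations.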
Why is it important?
It has been observed that the dataset would use 100% CPU and stop reporting events. During testing it was discovered that socket expiration, a new feature to prevent excessive memory usage, wasn't working as expected.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding changes to the default configuration files
[ ] I have added tests that prove my fix is effective or that my feature works
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
How to test this PR locally
The infinite loop is easy to trigger in RHEL 6.x by running: