[x-pack/filebeat/netflow] implement netflow with multiple workers #40122
Conversation
This pull request does not have a backport label. To fixup this pull request, you need to add the backport labels for the needed branches.
Force-pushed from 2121465 to 78c69b9
Pinging @elastic/sec-deployment-and-devices (Team:Security-Deployment and Devices)
run docs-build
This commit c0dbb72 fixes an issue with the unbounded growth of the LRU's underlying slice, which was brought to my attention offline by @aleksmaus; ty 🙏
Great write up and charts
run docs-build
@andrewkroh @aleksmaus any more feedback on this PR? 🙂 From my experimental analysis I can see the benefits of scaling: we can sustain 5000 packets/sec with 4 workers under certain hardware and network characteristics.
Unblocking. Would appreciate it if @andrewkroh takes a look as well.
ty @aleksmaus! I would appreciate it if @andrewkroh could take a look as well 🙂
Nothing new from me. Great work!
The fleet netflow package will need to be updated to expose the new setting.
Yes, I had that in mind, but thanks for the reminder @andrewkroh
Proposed commit message
This PR introduces scaling-up support for the Netflow input. To accommodate the parallel processing of template definitions/options and data records, which causes eventual consistency, the Netflow v9 and IPFIX decoders utilize a short-term LRU (Least Recently Used) cache. This cache temporarily stores events whose templates have not yet been processed, ensuring that these events can be properly handled and sent out once the corresponding template becomes available. Please also read the performance results below. TL;DR: with just 1 output worker there are no performance gains from scaling; gains can be seen with 4 and 8 output workers, but there is a plateau for higher worker counts.
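To make the buffering step concrete, here is a minimal, hypothetical sketch of such a short-term cache. The identifiers (`pendingStore`, `templateKey`, `drain`, etc.) are illustrative and do not mirror the PR's actual code; the sketch only shows the idea of holding data records until their template arrives, with bounded size and a TTL.

```go
package sketch

import (
	"sync"
	"time"
)

// templateKey identifies a template by observation domain and template ID (illustrative).
type templateKey struct {
	domainID   uint32
	templateID uint16
}

type pendingRecord struct {
	raw     []byte    // undecoded data record
	addedAt time.Time // used for TTL-based expiry
}

// pendingStore buffers records whose template has not been seen yet.
// The oldest key is evicted once maxKeys is exceeded and stale entries are
// dropped on drain, so memory use stays bounded.
type pendingStore struct {
	mu      sync.Mutex
	ttl     time.Duration
	maxKeys int
	byKey   map[templateKey][]pendingRecord
	order   []templateKey // oldest key first; eviction order
}

func newPendingStore(ttl time.Duration, maxKeys int) *pendingStore {
	return &pendingStore{
		ttl:     ttl,
		maxKeys: maxKeys,
		byKey:   make(map[templateKey][]pendingRecord, maxKeys),
	}
}

// add buffers a data record whose template is still unknown.
func (s *pendingStore) add(key templateKey, raw []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if _, seen := s.byKey[key]; !seen {
		if len(s.order) >= s.maxKeys {
			// Evict the oldest key in place so the backing array does not grow unbounded.
			oldest := s.order[0]
			copy(s.order, s.order[1:])
			s.order = s.order[:len(s.order)-1]
			delete(s.byKey, oldest)
		}
		s.order = append(s.order, key)
	}
	s.byKey[key] = append(s.byKey[key], pendingRecord{raw: raw, addedAt: time.Now()})
}

// drain returns (and removes) all still-fresh records waiting for key; it is
// called when the matching template finally arrives so the records can be
// decoded and published.
func (s *pendingStore) drain(key templateKey) [][]byte {
	s.mu.Lock()
	defer s.mu.Unlock()

	pending, ok := s.byKey[key]
	if !ok {
		return nil
	}
	delete(s.byKey, key)
	for i, k := range s.order {
		if k == key {
			s.order = append(s.order[:i], s.order[i+1:]...)
			break
		}
	}

	out := make([][]byte, 0, len(pending))
	cutoff := time.Now().Add(-s.ttl)
	for _, p := range pending {
		if p.addedAt.After(cutoff) {
			out = append(out, p.raw)
		}
	}
	return out
}
```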
Data Flow:
Performance results:
The performance of this PR was evaluated using a local Elasticsearch cluster started with `mage docker:composeUp` (to minimize network latencies) and with the OS UDP receive buffer increased to guarantee that no packets are dropped at that level.

The image below, titled "Netflow Performance [15000 packets/sec, 100 Input workers, scaling Output workers]", analyses the system's performance with a fixed number of 100 input workers while varying the number of output workers. The x-axis represents time over 16 seconds, and the y-axis shows the total flows published as reported by Elasticsearch. The scenarios include a mocked pipeline for maximum performance, several real pipeline configurations with 100 netflow workers paired with 1, 4, 8, 16, and 32 output workers, and the netflow implementation prior to this PR under similar conditions. The mocked pipeline, as expected, achieves the highest performance and serves as an upper benchmark. For the real pipeline with 1 output worker, the existing implementation and the scaling one introduced by this PR exhibit the same performance. Performance improves noticeably as the number of output workers increases up to 8, but beyond 8 output workers (e.g. 16 or 32) there are no apparent additional gains.
The next image, titled "Netflow Performance [15000 packets/sec, 32 Output workers, scaling Input workers]", illustrates the performance of the system under various input-worker configurations while keeping the number of output workers constant at 32. I conducted this experiment to validate that the performance plateau observed in the previous image for 8+ output workers is not caused by the input rate being maxed out. Once again, the x-axis represents time over 16 seconds, and the y-axis shows the total flows published as reported by Elasticsearch. The graph includes a mocked pipeline, which serves as the benchmark for maximum system performance with zero publish overhead, and three real pipeline configurations with 100, 200, and 300 netflow workers respectively. Scaling input workers does not provide any noticeable performance gains, suggesting that factors other than the number of netflow workers are limiting throughput.
The take-home point is that with just 1 output worker there are no performance gains from scaling. Gains can be seen with 4 and 8 output workers, but there is a plateau beyond that. More details about the reasoning behind this plateau can be found here.
Testing:
As shown in the pictures below, the effectiveness of the LRU cache introduced for the Netflow v9 and IPFIX decoders is tested with a new pcap file, 'ipfix_cisco.reversed.pcap', which holds the same packets as 'ipfix_cisco.pcap' but in reverse order; the data records come first and the template records last.
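For context, a reversed capture like this can be produced with a short gopacket program. The sketch below is illustrative only and not necessarily how 'ipfix_cisco.reversed.pcap' was actually generated; it reads every packet from the original file and writes them out back to front (timestamps are kept as-is).

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcapgo"
)

type packet struct {
	ci   gopacket.CaptureInfo
	data []byte
}

func main() {
	in, err := os.Open("ipfix_cisco.pcap")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	r, err := pcapgo.NewReader(in)
	if err != nil {
		log.Fatal(err)
	}

	// Collect all packets so they can be re-emitted in reverse order.
	var packets []packet
	for {
		data, ci, err := r.ReadPacketData()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		packets = append(packets, packet{ci: ci, data: data})
	}

	out, err := os.Create("ipfix_cisco.reversed.pcap")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	w := pcapgo.NewWriter(out)
	if err := w.WriteFileHeader(65535, r.LinkType()); err != nil {
		log.Fatal(err)
	}
	for i := len(packets) - 1; i >= 0; i-- {
		if err := w.WritePacket(packets[i].ci, packets[i].data); err != nil {
			log.Fatal(err)
		}
	}
}
```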
Checklist
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact
N/A
Author's Checklist
N/A
How to test this PR locally
Related issues
Use cases
N/A
Screenshots
N/A
Logs
N/A