[x-pack/filebeat/netflow] implement netflow with multiple workers #40122
Conversation
This pull request does not have a backport label. To fixup this pull request, you need to add the backport labels for the needed branches.
Force-pushed from 2121465 to 78c69b9
Pinging @elastic/sec-deployment-and-devices (Team:Security-Deployment and Devices)
run docs-build
This commit c0dbb72 fixes an issue with the unbounded growth of the LRU's underlying slice, which was brought to my attention offline by @aleksmaus; ty 🙏
Great write up and charts
run docs-build
@andrewkroh @aleksmaus any more feedback on this PR? 🙂 From my experimental analysis I can see the benefits of scaling: we can sustain 5000 packets/sec with 4 workers under certain hardware and network characteristics.
Unblocking. Would appreciate it if @andrewkroh takes a look as well.
ty @aleksmaus! I would appreciate it if @andrewkroh could take a look as well 🙂
Nothing new from me. Great work!
The fleet netflow package will need to be updated to expose the new setting.
Yes, I had that in mind, but thanks for the reminder @andrewkroh
Proposed commit message
This PR introduces scaling-up support for the Netflow input. To accommodate the parallel processing of template definitions/options and data records, which causes eventual consistency, the Netflow v9 and IPFIX decoders utilize a short-term LRU (Least Recently Used) cache. This cache temporarily stores events whose templates have not yet been processed, ensuring that these events can be properly handled and sent out once the corresponding template becomes available. Please also read the performance results below. TL;DR: with just 1 output worker there are no performance gains from scaling; gains can be seen with 4 and 8 output workers, but there is a plateau for higher worker counts.
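To make the buffering step concrete, here is a minimal, hypothetical sketch of such a short-term cache. The identifiers (`pendingStore`, `templateKey`, `drain`, etc.) are illustrative and do not mirror the PR's actual code; the sketch only shows the idea of holding data records until their template arrives, with bounded size and a TTL.

```go
package sketch

import (
	"sync"
	"time"
)

// templateKey identifies a template by observation domain and template ID (illustrative).
type templateKey struct {
	domainID   uint32
	templateID uint16
}

type pendingRecord struct {
	raw     []byte    // undecoded data record
	addedAt time.Time // used for TTL-based expiry
}

// pendingStore buffers records whose template has not been seen yet.
// The oldest key is evicted once maxKeys is exceeded and stale entries are
// dropped on drain, so memory use stays bounded.
type pendingStore struct {
	mu      sync.Mutex
	ttl     time.Duration
	maxKeys int
	byKey   map[templateKey][]pendingRecord
	order   []templateKey // oldest key first; eviction order
}

func newPendingStore(ttl time.Duration, maxKeys int) *pendingStore {
	return &pendingStore{
		ttl:     ttl,
		maxKeys: maxKeys,
		byKey:   make(map[templateKey][]pendingRecord, maxKeys),
	}
}

// add buffers a data record whose template is still unknown.
func (s *pendingStore) add(key templateKey, raw []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if _, seen := s.byKey[key]; !seen {
		if len(s.order) >= s.maxKeys {
			// Evict the oldest key in place so the backing array does not grow unbounded.
			oldest := s.order[0]
			copy(s.order, s.order[1:])
			s.order = s.order[:len(s.order)-1]
			delete(s.byKey, oldest)
		}
		s.order = append(s.order, key)
	}
	s.byKey[key] = append(s.byKey[key], pendingRecord{raw: raw, addedAt: time.Now()})
}

// drain returns (and removes) all still-fresh records waiting for key; it is
// called when the matching template finally arrives so the records can be
// decoded and published.
func (s *pendingStore) drain(key templateKey) [][]byte {
	s.mu.Lock()
	defer s.mu.Unlock()

	pending, ok := s.byKey[key]
	if !ok {
		return nil
	}
	delete(s.byKey, key)
	for i, k := range s.order {
		if k == key {
			s.order = append(s.order[:i], s.order[i+1:]...)
			break
		}
	}

	out := make([][]byte, 0, len(pending))
	cutoff := time.Now().Add(-s.ttl)
	for _, p := range pending {
		if p.addedAt.After(cutoff) {
			out = append(out, p.raw)
		}
	}
	return out
}
```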
Data Flow:
Performance results:
The performance of this PR was evaluated using a local Elasticsearch cluster started with `mage docker:composeUp` (to minimize network latencies) and with the OS UDP receive buffer increased to guarantee that no packets are dropped at that level.

The image below, titled "Netflow Performance [15000 packets/sec, 100 Input workers, scaling Output workers]", analyses the system's performance with a fixed number of 100 input workers while varying the number of output workers. The x-axis represents time over 16 seconds, and the y-axis shows the total flows published as reported by Elasticsearch. The scenarios include a mocked pipeline for maximum performance, several real pipeline configurations with 100 netflow workers paired with 1, 4, 8, 16, and 32 output workers, and the netflow implementation prior to this PR under similar conditions. The mocked pipeline, as expected, achieves the highest performance and serves as an upper benchmark. For the real pipeline with 1 output worker, the existing implementation and the scaling one introduced by this PR exhibit the same performance. Performance improves noticeably as the number of output workers increases up to 8, but beyond 8 output workers (e.g. 16 or 32) there are no apparent additional gains.
The next image, titled "Netflow Performance [15000 packets/sec, 32 Output workers, scaling Input workers]", illustrates the performance of the system under various input-worker configurations while keeping the number of output workers constant at 32. I conducted this experiment to validate that the performance plateau observed in the previous image for 8+ output workers is not caused by the input rate being maxed out. Once again, the x-axis represents time over 16 seconds, and the y-axis shows the total flows published as reported by Elasticsearch. The graph includes a mocked pipeline, which serves as the benchmark for maximum system performance with zero publish overhead, and three real pipeline configurations with 100, 200, and 300 netflow workers respectively. Scaling input workers does not provide any noticeable performance gains, suggesting that factors other than the number of netflow workers are limiting throughput.
The take-home point is that with just 1 output worker there are no performance gains from scaling. Gains can be seen with 4 and 8 output workers, but there is a plateau beyond that. More details about the reasoning behind this plateau can be found here.
Testing:
As shown in the pictures below, the effectiveness of the LRU cache introduced for the Netflow v9 and IPFIX decoders is tested with a new pcap file, 'ipfix_cisco.reversed.pcap', which holds the same packets as 'ipfix_cisco.pcap' but in reverse order; the data records come first and the template records last.
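For context, a reversed capture like this can be produced with a short gopacket program. The sketch below is illustrative only and not necessarily how 'ipfix_cisco.reversed.pcap' was actually generated; it reads every packet from the original file and writes them out back to front (timestamps are kept as-is).

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcapgo"
)

type packet struct {
	ci   gopacket.CaptureInfo
	data []byte
}

func main() {
	in, err := os.Open("ipfix_cisco.pcap")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	r, err := pcapgo.NewReader(in)
	if err != nil {
		log.Fatal(err)
	}

	// Collect all packets so they can be re-emitted in reverse order.
	var packets []packet
	for {
		data, ci, err := r.ReadPacketData()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		packets = append(packets, packet{ci: ci, data: data})
	}

	out, err := os.Create("ipfix_cisco.reversed.pcap")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	w := pcapgo.NewWriter(out)
	if err := w.WriteFileHeader(65535, r.LinkType()); err != nil {
		log.Fatal(err)
	}
	for i := len(packets) - 1; i >= 0; i-- {
		if err := w.WritePacket(packets[i].ci, packets[i].data); err != nil {
			log.Fatal(err)
		}
	}
}
```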
Checklist
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact
N/A
Author's Checklist
N/A
How to test this PR locally
Related issues
Use cases
N/A
Screenshots
N/A
Logs
N/A