investigate performance improvements for netbox enrichment #547

Open
mmguero opened this issue Jan 9, 2025 · 1 comment

@mmguero
Collaborator

mmguero commented Jan 9, 2025

The netbox enrichment code is by far the slowest part of the logstash pipeline. Here's the tail end of the per-filter stats for all the logstash filters, sorted by the final column. The columns are filter ID, events in, events out, and time spent in that filter in milliseconds:

$ docker compose exec logstash curl -XGET http://localhost:9600/_node/stats/pipelines | jq -r '.. | .filters? // empty | .[] | objects | select (.events.in > 0) | [.id, .events.in, .events.out, .events.duration_in_millis] | join (";")' | sort -n -t ';' -k4
...
...
...
cidr_detect_network_type_ipv4_source;22357;22357;1953
ruby_dns_freq_lookup;407;407;2098
ruby_zeek_remove_empty_values;13007;13007;2962
ruby_suricata_timestamp_calc;11469;11469;4816
cidr_add_tag_internal_destination;22397;22397;8875
cidr_add_tag_internal_source;22357;22357;14472
ruby_netbox_enrich_destination_ip_segment;8434;8403;858710
ruby_netbox_enrich_source_ip_segment;9759;9691;1227869
ruby_netbox_enrich_destination_ip_device;8403;8403;1391125
ruby_netbox_enrich_source_ip_device;9691;9600;1782989
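
For a sense of scale, dividing each filter's duration by its inbound event count gives a rough per-event cost. Here is a minimal Ruby sketch that uses only rows copied from the output above. The variable names are illustrative and none of this is part of Malcolm.

```ruby
# Rows copied from the stats output: id;events_in;events_out;duration_ms
rows = <<~STATS
  cidr_add_tag_internal_source;22357;22357;14472
  ruby_netbox_enrich_destination_ip_segment;8434;8403;858710
  ruby_netbox_enrich_source_ip_segment;9759;9691;1227869
  ruby_netbox_enrich_destination_ip_device;8403;8403;1391125
  ruby_netbox_enrich_source_ip_device;9691;9600;1782989
STATS

# Average milliseconds spent per inbound event, keyed by filter id.
per_event = {}
rows.each_line do |line|
  id, events_in, _events_out, duration_ms = line.strip.split(';')
  per_event[id] = (duration_ms.to_f / events_in.to_i).round(2)
end

per_event.sort_by { |_id, ms| ms }.each do |id, ms|
  puts format('%-45s %8.2f ms/event', id, ms)
end
```

By this measure the netbox filters cost on the order of 100 to 200 ms per event, versus well under 1 ms per event for the cidr tagging filter.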

You can see that the enrichment stuff is far and away the most costly: the four netbox filters each took between roughly 0.86 and 1.78 million milliseconds, versus about 14,500 for the next most expensive filter. Beyond some caching, I'm not doing much in the way of optimization. We should examine the netbox enrichment ruby filter code (linked above) and see if we can do some of the following:

  • examine cache settings... do they make sense? are we getting cache misses?
  • is there any sort of profiling code we can add to find the hot spots in the code?
  • are there particular features (autodiscovery, regular lookups, devices, services, etc.) that are more costly than others?
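
As a starting point for the profiling question, here is a minimal sketch of a per-method timer. `MethodTimer` and the fake lookup below are hypothetical names made up for illustration, not taken from the netbox enrichment script. The sketch is also not thread-safe, so with multiple pipeline workers the bookkeeping would need a mutex.

```ruby
# Hypothetical timing helper, NOT taken from the netbox enrichment script.
class MethodTimer
  Stat = Struct.new(:calls, :total_ms, :min_ms, :max_ms)

  def initialize
    @stats = Hash.new { |hash, name| hash[name] = Stat.new(0, 0.0, Float::INFINITY, 0.0) }
  end

  # Runs the block, records its wall-clock duration under `name`,
  # and returns the block's own return value.
  def measure(name)
    started = now_ms
    yield
  ensure
    record(name, now_ms - started)
  end

  # One row per measured name, most expensive (by total time) first.
  def report
    rows = @stats.map do |name, s|
      { name: name, calls: s.calls, avg_ms: s.total_ms / s.calls,
        min_ms: s.min_ms, max_ms: s.max_ms, total_ms: s.total_ms }
    end
    rows.sort_by { |row| -row[:total_ms] }
  end

  private

  def now_ms
    Process.clock_gettime(Process::CLOCK_MONOTONIC, :float_millisecond)
  end

  def record(name, elapsed_ms)
    s = @stats[name]
    s.calls += 1
    s.total_ms += elapsed_ms
    s.min_ms = elapsed_ms if elapsed_ms < s.min_ms
    s.max_ms = elapsed_ms if elapsed_ms > s.max_ms
  end
end

# Usage: wrap any suspect call; the sleep stands in for a slow lookup.
timer = MethodTimer.new
timer.measure(:fake_lookup) { sleep(0.01); :device }
timer.report.each { |row| puts row }
```

Wrapping each candidate method this way would show which features (autodiscovery, device lookups, and so on) dominate the total time, at the cost of two clock reads per call.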

All in all, it would probably be the biggest performance benefit we could get for Malcolm if we could improve the speed of that code without sacrificing functionality.

@mmguero mmguero added netbox Related to Malcolm's use of NetBox performance Related to speed/performance labels Jan 9, 2025
@mmguero mmguero added this to the z.staging milestone Jan 9, 2025
@mmguero mmguero added this to Malcolm Jan 9, 2025
@mmguero mmguero moved this to Todo (investigate) in Malcolm Jan 9, 2025
@mmguero mmguero modified the milestones: z.staging, v25.02.0 Jan 16, 2025
@jjrush jjrush self-assigned this Jan 22, 2025
@mmguero mmguero moved this from Todo (investigate) to In Progress in Malcolm Feb 10, 2025
@jjrush
Collaborator

jjrush commented Feb 17, 2025

Wrote some profiling and cache-tracking code for the netbox_enrich.rb script. Fed Malcolm 145 pcaps (11 GB) and got the output below.

Method Performance (cumulative)

| Method | Calls | Avg (ms) | Min (ms) | Max (ms) | Total (ms) | Outliers | Avg Outlier (ms) |
|---|---|---|---|---|---|---|---|
| filter | 25,045,705 | 0.32 | 0.01 | 2,599.61 | 8,061,228.42 | 28,007 (0.1%) | 207.84 |
| netbox_lookup | 758,274 | 1.91 | 0.00 | 1,477.37 | 1,447,055.80 | 5,211 (0.7%) | 216.26 |
| lookup_devices | 5,123 | 201.56 | 26.29 | 793.43 | 1,032,571.86 | 5,011 (97.8%) | 204.80 |
| lookup_prefixes | 5,139 | 64.31 | 29.46 | 565.91 | 330,512.69 | 96 (1.9%) | 278.54 |
| lookup_or_create_site | 25,055,884 | 0.01 | 0.00 | 432.59 | 320,552.84 | 82 (0.0%) | 221.92 |
| create_device_interface | 84 | 262.81 | 159.30 | 625.18 | 22,076.34 | 84 (100.0%) | 262.81 |
| lookup_manuf | 31 | 210.11 | 0.01 | 721.75 | 6,513.29 | 21 (67.7%) | 310.15 |
| autopopulate_prefixes | 14 | 87.76 | 57.76 | 129.51 | 1,228.64 | 4 (28.6%) | 114.07 |
| lookup_or_create_role | 67 | 7.10 | 0.01 | 49.10 | 475.82 | 0 (0.0%) | 0.00 |

Cache Performance

| Metric | Value | Percentage |
|---|---|---|
| Total Lookups | 25,803,938 | 100% |
| Hits | 25,045,705 | 97.1% |
| Misses | 758,233 | 2.9% |
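
Hit and miss figures like these could be gathered with a small counting wrapper around the cache. The sketch below is an assumption about how such tracking might look, not the actual tracking code. `CountingCache` is a hypothetical name, and it uses a plain Hash with no eviction or expiry, which the real cache may well have.

```ruby
# Hypothetical cache wrapper that counts hits and misses.
# Illustration only: plain Hash storage, no eviction, no TTL, not thread-safe.
class CountingCache
  attr_reader :hits, :misses

  def initialize
    @store = {}
    @hits = 0
    @misses = 0
  end

  # Returns the cached value for `key`, or computes it with the block on a miss.
  def fetch(key)
    if @store.key?(key)
      @hits += 1
      @store[key]
    else
      @misses += 1
      @store[key] = yield(key)
    end
  end

  # Hit rate as a percentage of all lookups so far.
  def hit_rate
    total = @hits + @misses
    total.zero? ? 0.0 : 100.0 * @hits / total
  end
end

# Usage: the block stands in for an expensive lookup keyed by IP address.
cache = CountingCache.new
%w[10.0.0.1 10.0.0.1 10.0.0.2 10.0.0.1].each do |ip|
  cache.fetch(ip) { |addr| "device-for-#{addr}" }
end
puts format('hits=%d misses=%d hit rate=%.1f%%', cache.hits, cache.misses, cache.hit_rate)

# The same formula applied to the table's counts.
puts format('%.1f%%', 100.0 * 25_045_705 / 25_803_938)
```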

From what I'm seeing, the cache is pretty reliably getting hit, and the methods that get called a lot perform well on average: filter and lookup_or_create_site (about 25 million calls each) average well under 1 ms, and netbox_lookup averages under 2 ms.

However, even the relatively low outlier percentage has a significant performance impact.
The 5,211 calls to netbox_lookup classified as outliers, multiplied by the average outlier runtime of 216.26 milliseconds, come to 1,126,930.86 milliseconds, or roughly 18.8 minutes.
The total execution time for netbox_lookup is 1,447,055.80 milliseconds, or roughly 24.1 minutes.
That means the outliers (0.7% of method calls) account for roughly 78% of the execution time.
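
The same arithmetic can be repeated for other rows of the Method Performance table. This is a small sketch using only figures copied from that table. The variable names are illustrative.

```ruby
# Figures copied from the Method Performance table:
# method => [outlier calls, avg outlier ms, total ms]
rows = {
  'netbox_lookup'   => [5_211, 216.26, 1_447_055.80],
  'lookup_devices'  => [5_011, 204.80, 1_032_571.86],
  'lookup_prefixes' => [96,    278.54,   330_512.69]
}

# For each method: time spent in outlier calls and its share of the method's total.
shares = rows.transform_values do |(outlier_calls, avg_outlier_ms, total_ms)|
  outlier_ms = outlier_calls * avg_outlier_ms
  {
    outlier_ms: outlier_ms.round(2),
    minutes: (outlier_ms / 60_000).round(2),
    share_pct: (100.0 * outlier_ms / total_ms).round(1)
  }
end

shares.each do |name, s|
  puts format('%-16s %14.2f ms %7.2f min %6.1f%% of total',
              name, s[:outlier_ms], s[:minutes], s[:share_pct])
end
```

For lookup_devices nearly every call is flagged as an outlier, so almost all of its time falls in this bucket, whereas for lookup_prefixes the outliers are a small share of the total.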
