
Memory leakage in SNMP input #4457

Closed
Anderen2 opened this issue Jul 23, 2018 · 12 comments · Fixed by #4938
@Anderen2

Relevant telegraf.conf:

# grep -R snmp /etc/telegraf/* | grep -Po "inputs\.snmp.+" | grep -Po "[\w\.]+" | sort | uniq -c | sort -h
      1 inputs.snmp.host.table
      1 inputs.snmp_legacy
      2 inputs.snmp.bulk
      3 inputs.snmp.get
      3 inputs.snmp.host
    382 inputs.snmp.tagpass
    940 inputs.snmp.tags
    942 inputs.snmp
   1094 inputs.snmp.table
   1759 inputs.snmp.field
  15220 inputs.snmp.table.field

Sample config: https://pastebin.com/x1qWYsYk

System info:

Telegraf v1.6.0 (git: release-1.6 9253f6f)
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Steps to reproduce:

  1. Run Telegraf with the SNMP input for a longer period of time
  2. Observe that memory usage increases over time (one way to track this is sketched after this list)
  3. Eventually the process consumes all the memory on the host and is killed by the OOM killer
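
A minimal sketch (not part of the original report) of one way to watch step 2 on Linux: poll the process's VmRSS from /proc/<pid>/status and log it at intervals. The 10-minute interval and the program/function names are illustrative choices, not anything Telegraf ships.

// rsswatch.go: periodically print a process's resident set size (Linux only).
// Usage: go run rsswatch.go <pid>
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
    "time"
)

// vmRSS returns the VmRSS value from /proc/<pid>/status.
func vmRSS(pid string) (string, error) {
    f, err := os.Open("/proc/" + pid + "/status")
    if err != nil {
        return "", err
    }
    defer f.Close()

    s := bufio.NewScanner(f)
    for s.Scan() {
        if strings.HasPrefix(s.Text(), "VmRSS:") {
            return strings.TrimSpace(strings.TrimPrefix(s.Text(), "VmRSS:")), nil
        }
    }
    return "", fmt.Errorf("VmRSS not found for pid %s", pid)
}

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintln(os.Stderr, "usage: rsswatch <pid>")
        os.Exit(1)
    }
    for {
        rss, err := vmRSS(os.Args[1])
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Printf("%s RSS=%s\n", time.Now().Format(time.RFC3339), rss)
        time.Sleep(10 * time.Minute)
    }
}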

Expected behavior:

Memory usage should stay consistent and not increase over time.

Actual behavior:

Telegraf's memory usage seems to increase over time.

Additional info:

Graph showing increasing memory usage until the process restarts.

Newly started process

(pprof) top10
Showing nodes accounting for 84.14MB, 96.46% of 87.22MB total
Showing top 10 nodes out of 89
      flat  flat%   sum%        cum   cum%
   59.57MB 68.30% 68.30%    64.09MB 73.48%  github.com/soniah/gosnmp.(*GoSNMP).Connect
    7.50MB  8.60% 76.90%     8.57MB  9.82%  github.com/influxdata/telegraf/plugins/inputs/snmp.Table.Build
    4.02MB  4.61% 81.51%     4.02MB  4.61%  math/rand.NewSource
       3MB  3.44% 84.95%        3MB  3.44%  runtime.malg
    2.50MB  2.87% 87.83%     2.50MB  2.87%  reflect.unsafe_NewArray
       2MB  2.29% 90.12%        2MB  2.29%  github.com/influxdata/telegraf/internal/config.buildInput
       2MB  2.29% 92.41%        2MB  2.29%  github.com/influxdata/telegraf/metric.New
    1.76MB  2.02% 94.43%     1.76MB  2.02%  compress/flate.NewWriter
       1MB  1.15% 95.58%        1MB  1.15%  crypto/x509/pkix.(*Name).FillFromRDNSequence
    0.77MB  0.88% 96.46%     0.77MB  0.88%  github.com/zensqlmonitor/go-mssqldb.init

After running for several days

(pprof) top10
Showing nodes accounting for 784.13MB, 96.14% of 815.64MB total
Dropped 67 nodes (cum <= 4.08MB)
Showing top 10 nodes out of 60
      flat  flat%   sum%        cum   cum%
  465.43MB 57.06% 57.06%   546.85MB 67.05%  github.com/influxdata/telegraf/plugins/inputs/snmp.Table.Build
  188.82MB 23.15% 80.21%   205.40MB 25.18%  github.com/soniah/gosnmp.(*GoSNMP).Connect
   26.50MB  3.25% 83.46%    26.50MB  3.25%  github.com/influxdata/telegraf/plugins/inputs/snmp.fieldConvert
   23.50MB  2.88% 86.34%    23.50MB  2.88%  strconv.formatBits
      23MB  2.82% 89.16%       23MB  2.82%  github.com/soniah/gosnmp.(*GoSNMP).decodeValue
   20.27MB  2.49% 91.65%    23.48MB  2.88%  compress/flate.NewWriter
   15.58MB  1.91% 93.56%    15.58MB  1.91%  math/rand.NewSource
       8MB  0.98% 94.54%    10.50MB  1.29%  github.com/influxdata/telegraf/metric.(*metric).AddField
    6.52MB   0.8% 95.34%     6.52MB   0.8%  reflect.unsafe_NewArray
    6.50MB   0.8% 96.14%     6.50MB   0.8%  runtime.malg

pprof png generated from http://localhost:6060/debug/pprof/heap?debug=1

Anderen2 changed the title from "Memory leakage in SNMP plugins" to "Memory leakage in SNMP input" on Jul 23, 2018
glinton added the bug and area/snmp labels on Jul 23, 2018
danielnelson self-assigned this on Jul 23, 2018
@danielnelson
Contributor

Can you run these? Letting it roughly double in memory usage from its baseline should be enough time before running them.

go tool pprof --inuse_space -top http://localhost:6060/debug/pprof/heap
go tool pprof --inuse_objects -top http://localhost:6060/debug/pprof/heap
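
For context only: the /debug/pprof/* URLs above come from Go's standard net/http/pprof package (Telegraf wires this endpoint up through its own configuration). A minimal, generic sketch of a Go program exposing the same endpoints, should anyone want to reproduce the profiling setup outside Telegraf:

// Minimal sketch of how a Go service exposes the /debug/pprof/* endpoints
// queried in this thread. The blank import registers the pprof handlers on
// http.DefaultServeMux; the listen address mirrors the URLs used above.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers
)

func main() {
    // Serve only on localhost, matching http://localhost:6060/debug/pprof/...
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}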

@Anderen2
Author

Hi @danielnelson!

I ran the commands against the heap dump from the process that had been running for several days.
Output pasted here: https://pastebin.com/4BWW1FQQ

@danielnelson
Contributor

Can you do this one too? curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'

@Anderen2
Author

Anderen2 commented Jul 25, 2018

Contents pasted here: https://pastebin.com/i5S7bm5X

Dumped at the same time/run as the heap profile above.

@danielnelson
Contributor

Some more follow up questions:

  • About how many SNMP agents total are you scraping?
  • What is the value of the Telegraf agent's metric_buffer_limit and metric_batch_size?

@danielnelson
Contributor

Can you also grab the full goroutine dump with curl 'http://localhost:6060/debug/pprof/goroutine?debug=2', hopefully it won't be too large to upload.

@Anderen2
Author

@danielnelson Here you go: https://gist.github.com/Anderen2/7c2c1bfbd2d0960ef73120ae06a80db2

That is not from the same run as the others though; Telegraf has been running for 3 days here.

@sfrode

sfrode commented Jul 30, 2018

Agent configuration:

[agent]
  interval               = "120s"
  round_interval         = false
  metric_batch_size      = 1000
  metric_buffer_limit    = 10000
  collection_jitter      = "9s"
  flush_interval         = "30s"
  flush_jitter           = "10s"
  precision              = ""

@danielnelson
Contributor

@Anderen2 I have a couple of theories about the cause. Can you tell me approximately how frequently you reload these Telegraf instances, either by sending a SIGHUP or through your init system (service telegraf reload)?

@sfrode

sfrode commented Aug 2, 2018

@danielnelson: Every hour.

@danielnelson
Contributor

I think it may be leaking memory on reload. If you zoom in on the memory-used metric for a single Telegraf instance and group by ~10m, do you see a stair-step pattern every hour?
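
To make the reload theory concrete, here is a purely illustrative Go sketch (not Telegraf's actual code; the real fix landed in #4938). The type and variable names are hypothetical. If each reload starts a fresh generation of SNMP inputs while the previous generation's connections are never closed or dereferenced, memory grows by one step per reload, which would produce exactly that stair-step pattern.

// Illustrative only: a common shape for a reload leak. Each "reload" builds a
// new generation of inputs that dial their own SNMP (UDP) sockets, but the
// previous generation is never stopped, so its connections and buffers stay
// reachable and the process grows a little on every reload.
package main

import "net"

// snmpInput is a hypothetical stand-in for an input plugin instance.
type snmpInput struct {
    agent string
    conn  net.Conn
}

func (s *snmpInput) start() error {
    c, err := net.Dial("udp", s.agent) // SNMP agents typically listen on UDP/161
    if err != nil {
        return err
    }
    s.conn = c
    return nil
}

// running is only ever appended to, never drained: the bug in this sketch.
var running []*snmpInput

func reload(agents []string) {
    // A correct reload would first Close() every conn in `running` and reset
    // the slice before starting the new generation.
    for _, a := range agents {
        in := &snmpInput{agent: a}
        if err := in.start(); err == nil {
            running = append(running, in)
        }
    }
}

func main() {
    agents := []string{"192.0.2.1:161", "192.0.2.2:161"}
    for i := 0; i < 24; i++ { // e.g. hourly reloads over a day
        reload(agents)
    }
}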

@sfrode

sfrode commented Aug 8, 2018

It's difficult to separate the suspected leak from the background noise in the graphs. However, we have rewritten some of our tooling to avoid reloading Telegraf every hour. That seems to have slowed down (or stopped) the growth rate, so your theory sounds reasonable.
