Memory leakage in SNMP input #4457

Anderen2 · 2018-07-23T17:24:28Z

Relevant telegraf.conf:

# grep -R snmp /etc/telegraf/* | grep -Po "inputs\.snmp.+" | grep -Po "[\w\.]+" | sort | uniq -c | sort -h
      1 inputs.snmp.host.table
      1 inputs.snmp_legacy
      2 inputs.snmp.bulk
      3 inputs.snmp.get
      3 inputs.snmp.host
    382 inputs.snmp.tagpass
    940 inputs.snmp.tags
    942 inputs.snmp
   1094 inputs.snmp.table
   1759 inputs.snmp.field
  15220 inputs.snmp.table.field

Sample config: https://pastebin.com/x1qWYsYk

System info:

Telegraf v1.6.0 (git: release-1.6 9253f6f)
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Steps to reproduce:

Run Telegraf with the SNMP input for longer durations of time
Check that the memory usage is increasing over time
Eventually the process will consume all the memory on the host and get killed by OOM

Expected behavior:

That the memory usage were consistent and not increasing over time.

Actual behavior:

Telegraf's memory usage seems to increase over time.

Additional info:

Graph showing increasing memory usage until the process restarts.

Newly started process

(pprof) top10
Showing nodes accounting for 84.14MB, 96.46% of 87.22MB total
Showing top 10 nodes out of 89
      flat  flat%   sum%        cum   cum%
   59.57MB 68.30% 68.30%    64.09MB 73.48%  github.com/soniah/gosnmp.(*GoSNMP).Connect
    7.50MB  8.60% 76.90%     8.57MB  9.82%  github.com/influxdata/telegraf/plugins/inputs/snmp.Table.Build
    4.02MB  4.61% 81.51%     4.02MB  4.61%  math/rand.NewSource
       3MB  3.44% 84.95%        3MB  3.44%  runtime.malg
    2.50MB  2.87% 87.83%     2.50MB  2.87%  reflect.unsafe_NewArray
       2MB  2.29% 90.12%        2MB  2.29%  github.com/influxdata/telegraf/internal/config.buildInput
       2MB  2.29% 92.41%        2MB  2.29%  github.com/influxdata/telegraf/metric.New
    1.76MB  2.02% 94.43%     1.76MB  2.02%  compress/flate.NewWriter
       1MB  1.15% 95.58%        1MB  1.15%  crypto/x509/pkix.(*Name).FillFromRDNSequence
    0.77MB  0.88% 96.46%     0.77MB  0.88%  github.com/zensqlmonitor/go-mssqldb.init

After running for several days

(pprof) top10
Showing nodes accounting for 784.13MB, 96.14% of 815.64MB total
Dropped 67 nodes (cum <= 4.08MB)
Showing top 10 nodes out of 60
      flat  flat%   sum%        cum   cum%
  465.43MB 57.06% 57.06%   546.85MB 67.05%  github.com/influxdata/telegraf/plugins/inputs/snmp.Table.Build
  188.82MB 23.15% 80.21%   205.40MB 25.18%  github.com/soniah/gosnmp.(*GoSNMP).Connect
   26.50MB  3.25% 83.46%    26.50MB  3.25%  github.com/influxdata/telegraf/plugins/inputs/snmp.fieldConvert
   23.50MB  2.88% 86.34%    23.50MB  2.88%  strconv.formatBits
      23MB  2.82% 89.16%       23MB  2.82%  github.com/soniah/gosnmp.(*GoSNMP).decodeValue
   20.27MB  2.49% 91.65%    23.48MB  2.88%  compress/flate.NewWriter
   15.58MB  1.91% 93.56%    15.58MB  1.91%  math/rand.NewSource
       8MB  0.98% 94.54%    10.50MB  1.29%  github.com/influxdata/telegraf/metric.(*metric).AddField
    6.52MB   0.8% 95.34%     6.52MB   0.8%  reflect.unsafe_NewArray
    6.50MB   0.8% 96.14%     6.50MB   0.8%  runtime.malg

pprof png generated from http://localhost:6060/debug/pprof/heap?debug=1

The text was updated successfully, but these errors were encountered:

danielnelson · 2018-07-24T02:58:20Z

Can you run these? Letting it double in memory usage from base should be enough time before running.

go tool pprof --inuse_space -top http://localhost:6060/debug/pprof/heap
go tool pprof --inuse_objects -top http://localhost:6060/debug/pprof/heap

Anderen2 · 2018-07-24T10:42:29Z

Hi @danielnelson!

I ran the commands towards the heapdump where the process had ran for several days.
Output pasted here: https://pastebin.com/4BWW1FQQ

danielnelson · 2018-07-24T21:23:19Z

Can you do this one too? curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'

Anderen2 · 2018-07-25T11:25:15Z

Contents pasted here: https://pastebin.com/i5S7bm5X

Dumped at same time/run as the heap from above

danielnelson · 2018-07-25T21:28:59Z

Some more follow up questions:

About how many SNMP agents total are you scraping?
What is the value of the Telegraf agent's metric_buffer_limit and metric_batch_size?

danielnelson · 2018-07-25T21:38:40Z

Can you also grab the full goroutine dump with curl 'http://localhost:6060/debug/pprof/goroutine?debug=2', hopefully it won't be too large to upload.

Anderen2 · 2018-07-27T11:45:53Z

@danielnelson Here you go: https://gist.github.com/Anderen2/7c2c1bfbd2d0960ef73120ae06a80db2

That is not from the same run as the others though, Telegraf have been running for 3 days here.

sfrode · 2018-07-30T11:50:43Z

Agent configuration:

[agent]
  interval               = "120s"
  round_interval         = false
  metric_batch_size      = 1000
  metric_buffer_limit    = 10000
  collection_jitter      = "9s"
  flush_interval         = "30s"
  flush_jitter           = "10s"
  precision              = ""

danielnelson · 2018-08-01T00:21:42Z

@Anderen2 I have a couple theories about the cause, can you tell me approximately how frequently you reload these Telegraf via sending a SIGHUP or through your init system service telegraf reload?

sfrode · 2018-08-02T11:15:06Z

@danielnelson: Every hour.

danielnelson · 2018-08-02T18:41:43Z

I think it may be leaking memory on reload. If you zoom in on memory used metric for a single Telegraf, and group by ~10m, do you see a stair step pattern every hour?

sfrode · 2018-08-08T07:41:28Z

It's difficult to separate the suspected leak from the background noise in the graphs. However, we have rewritten some of our stuff to avoid reloading Telegraf every hour. That seems to have slow down (or stopped) the growth rate, so your theory sounds reasonable.

Anderen2 changed the title ~~Memory leakage in SNMP plugins~~ Memory leakage in SNMP input Jul 23, 2018

glinton added bug unexpected problem or unintended behavior area/snmp labels Jul 23, 2018

danielnelson self-assigned this Jul 23, 2018

danielnelson added this to the 1.8.0 milestone Aug 9, 2018

danielnelson mentioned this issue Aug 31, 2018

'systemctl restart telegraf' failing (related to logparser ?) #4610

Closed

danielnelson modified the milestones: 1.8.0, 1.9.0 Sep 7, 2018

russorat modified the milestone: 1.9.0 Oct 8, 2018

danielnelson mentioned this issue Oct 10, 2018

strconv.ParseFloat: parsing "???": invalid syntax #4827

Closed

danielnelson mentioned this issue Oct 30, 2018

Remove outputs blocking inputs when output is slow #4938

Merged

7 tasks

danielnelson closed this as completed in #4938 Nov 5, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory leakage in SNMP input #4457

Memory leakage in SNMP input #4457

Anderen2 commented Jul 23, 2018

danielnelson commented Jul 24, 2018

Anderen2 commented Jul 24, 2018

danielnelson commented Jul 24, 2018

Anderen2 commented Jul 25, 2018 •

edited

Loading

danielnelson commented Jul 25, 2018

danielnelson commented Jul 25, 2018

Anderen2 commented Jul 27, 2018

sfrode commented Jul 30, 2018 •

edited

Loading

danielnelson commented Aug 1, 2018

sfrode commented Aug 2, 2018

danielnelson commented Aug 2, 2018

sfrode commented Aug 8, 2018

Memory leakage in SNMP input #4457

Memory leakage in SNMP input #4457

Comments

Anderen2 commented Jul 23, 2018

Relevant telegraf.conf:

System info:

Steps to reproduce:

Expected behavior:

Actual behavior:

Additional info:

danielnelson commented Jul 24, 2018

Anderen2 commented Jul 24, 2018

danielnelson commented Jul 24, 2018

Anderen2 commented Jul 25, 2018 • edited Loading

danielnelson commented Jul 25, 2018

danielnelson commented Jul 25, 2018

Anderen2 commented Jul 27, 2018

sfrode commented Jul 30, 2018 • edited Loading

danielnelson commented Aug 1, 2018

sfrode commented Aug 2, 2018

danielnelson commented Aug 2, 2018

sfrode commented Aug 8, 2018

Anderen2 commented Jul 25, 2018 •

edited

Loading

sfrode commented Jul 30, 2018 •

edited

Loading