Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGHUP signal handling broken and statsd plugin causes panic if SIGHUP received before statsd is initialized #5618

Closed
Nyyr1 opened this issue Mar 21, 2019 · 3 comments · Fixed by #5628
Labels
area/statsd bug unexpected problem or unintended behavior
Milestone

Comments

@Nyyr1
Copy link

Nyyr1 commented Mar 21, 2019

[global_tags]
[agent]
interval = "60s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "60s"
flush_jitter = "0s"
precision = ""
debug = true
quiet = false
logfile = ""
hostname = ""
omit_hostname = false
instance = "default"
omit_instance = false
[[outputs.ffc_file]]
files = ["/dev/null"]
data_format = "json"
[[outputs.zabbix]]
host = "monitoring-ffc-pr0.ffc.internal"
port = 10051
[[aggregators.zabbix_lld]]
server = "monitoring-ffc-pr0.ffc.internal"
port = 10051
period = "60s"
[[inputs.statsd]]
protocol = "udp"
max_tcp_connections = 250
tcp_keep_alive = false
service_address = ":8125"
delete_gauges = true
delete_counters = true
delete_sets = true
delete_timings = true
percentiles = [90]
metric_separator = "."
parse_data_dog_tags = false
allowed_pending_messages = 10000
percentile_limit = 1000

System info:

Linux 0242a5bfc72f 4.1.17-22.30.amzn1.x86_64 #1 SMP Fri Feb 5 23:44:22 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[Include Telegraf version, operating system name, and other relevant details]
Branched version, rebased to commit fa65a82.

Steps to reproduce:

  1. I added pull request for zabbix and zabbix_lld plugins, setting unreachable zabbix server in the configuration file.
  2. Modified it not to terminate when the Zabbix server is not reachable (just log it).
  3. I changed loglevel of some statsd messages to debug to avoid syslog spam due to invalid metrics received.
  4. Added custom tag (besides default "host") to internal/config:
  • if c.Agent.Instance == "" {
  •   c.Agent.Instance = "default"
    
  • }
  • if !c.Agent.OmitInstance {
  •   c.Tags["telegrafinstance"] = c.Agent.Instance
    
  • }
  1. Enabled statsd input plugin.
  2. Before input plugins are initialized, telegraf receives SIGHUP signal (service reload)

This is the resulting behavior log:

2019/03/21 13:39:45 D! Executing runAgent()
2019-03-21T13:39:45Z I! Starting Telegraf 1.10.0-rc1-52-gac258d6e
2019-03-21T13:39:45Z I! Loaded inputs: statsd
2019-03-21T13:39:45Z I! Loaded aggregators: zabbix_lld
2019-03-21T13:39:45Z I! Loaded processors:
2019-03-21T13:39:45Z I! Loaded outputs: ffc_file zabbix
2019-03-21T13:39:45Z I! Tags enabled: host=0242a5bfc72f telegrafinstance=default
2019-03-21T13:39:45Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"0242a5bfc72f", Flush Interval:1m0s, Instance:"default"
2019-03-21T13:39:45Z D! [agent] Connecting outputs
2019-03-21T13:39:45Z D! [agent] Attempting connection to output: ffc_file
2019-03-21T13:39:45Z D! [agent] Successfully connected to output: ffc_file
2019-03-21T13:39:45Z D! [agent] Attempting connection to output: zabbix
2019-03-21T13:39:51Z I! Reloading Telegraf config
2019-03-21T13:39:51Z D! Dequeued reload
2019-03-21T13:39:51Z D! Cancel completed
2019-03-21T13:41:52Z E! Connection to Zabbix server monitoring-ffc-pr0.ffc.internal:10051 failed!
2019-03-21T13:41:52Z D! [agent] Successfully connected to output: zabbix
2019-03-21T13:41:52Z D! [agent] Starting service inputs
2019-03-21T13:41:52Z I! Started the statsd service on :8125
2019-03-21T13:41:52Z D! [agent] Stopping service inputs
2019-03-21T13:41:52Z I! Stopping the statsd service
2019-03-21T13:41:52Z D! s.done:%#v 0xc4200a6120
2019-03-21T13:41:52Z D! s.UDPlistener:%#v
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1558dcc]

goroutine 167 [running]:
github.com/influxdata/telegraf/plugins/inputs/statsd.(*Statsd).Stop(0xc4203f4380)
/home/kitchen/go/src/github.com/influxdata/telegraf/plugins/inputs/statsd/statsd.go:859 +0x5fc
github.com/influxdata/telegraf/agent.(*Agent).stopServiceInputs(0xc42000e080)
/home/kitchen/go/src/github.com/influxdata/telegraf/agent/agent.go:638 +0x9e
github.com/influxdata/telegraf/agent.(*Agent).Run.func1(0xc42016dfd0, 0xc42000e080, 0x25d91c0, 0xc420170600, 0xbf1d030834e4f95c, 0x1da7938b97, 0x38d6e00, 0xc42009c300)
/home/kitchen/go/src/github.com/influxdata/telegraf/agent/agent.go:76 +0x138
created by github.com/influxdata/telegraf/agent.(*Agent).Run
/home/kitchen/go/src/github.com/influxdata/telegraf/agent/agent.go:67 +0x401

  1. I fixed the SIGSEV by modifying statsd plugin Stop() function:

    if s.isUDP() {

  •   if s.UDPlistener != nil {
      	s.UDPlistener.Close()
    
  •   }
    
    } else {
  •   if s.TCPlistener != nil {
      	s.TCPlistener.Close()
    
  •   }
    
  1. This resulted (in repeated experiment) with:

2019/03/21 13:47:53 D! Executing runAgent()
2019-03-21T13:47:53Z I! Starting Telegraf 1.10.0-rc1-52-gac258d6e
2019-03-21T13:47:53Z I! Loaded inputs: statsd
2019-03-21T13:47:53Z I! Loaded aggregators: zabbix_lld
2019-03-21T13:47:53Z I! Loaded processors:
2019-03-21T13:47:53Z I! Loaded outputs: zabbix ffc_file
2019-03-21T13:47:53Z I! Tags enabled: host=0242a5bfc72f telegrafinstance=default
2019-03-21T13:47:53Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"0242a5bfc72f", Flush Interval:1m0s, Instance:"default"
2019-03-21T13:47:53Z D! [agent] Connecting outputs
2019-03-21T13:47:53Z D! [agent] Attempting connection to output: zabbix
2019-03-21T13:47:57Z I! Reloading Telegraf config
2019-03-21T13:47:57Z D! Dequeued reload
2019-03-21T13:47:57Z D! Cancel completed
2019-03-21T13:50:00Z E! Connection to Zabbix server monitoring-ffc-pr0.ffc.internal:10051 failed!
2019-03-21T13:50:00Z D! [agent] Successfully connected to output: zabbix
2019-03-21T13:50:00Z D! [agent] Attempting connection to output: ffc_file
2019-03-21T13:50:00Z D! [agent] Successfully connected to output: ffc_file
2019-03-21T13:50:00Z D! [agent] Starting service inputs
2019-03-21T13:50:00Z I! Started the statsd service on :8125
2019-03-21T13:50:00Z I! Statsd UDP listener listening on: [::]:8125
2019-03-21T13:50:00Z D! [agent] Stopping service inputs
2019-03-21T13:50:00Z I! Stopping the statsd service
2019-03-21T13:50:00Z D! s.done:%#v 0xc4200a64e0
2019-03-21T13:50:00Z D! s.UDPlistener:%#v &{{0xc4201ac000}}
2019-03-21T13:50:00Z D! s.Unlock()
2019-03-21T13:50:00Z D! s.wg:%#v {{} [0 0 0 0 1 0 0 0 0 0 0 0] 0}
2019-03-21T13:50:00Z D! s.Lock()
2019-03-21T13:50:00Z D! s.in:%#v 0xc42030c2a0
2019-03-21T13:50:00Z I! Stopped Statsd listener service on :8125
2019-03-21T13:50:00Z D! [agent] Input channel closed
2019-03-21T13:52:00Z D! [outputs.ffc_file] buffer fullness: 0 / 10000 metrics.
2019-03-21T13:52:00Z D! [outputs.zabbix] buffer fullness: 0 / 10000 metrics.
2019-03-21T13:53:00Z D! [outputs.ffc_file] buffer fullness: 0 / 10000 metrics.
2019-03-21T13:53:00Z D! [outputs.zabbix] buffer fullness: 0 / 10000 metrics.
2019-03-21T13:54:00Z D! [outputs.ffc_file] buffer fullness: 0 / 10000 metrics.
2019-03-21T13:54:00Z D! [outputs.zabbix] buffer fullness: 0 / 10000 metrics.
2019-03-21T13:55:00Z D! [outputs.ffc_file] buffer fullness: 0 / 10000 metrics.
2019-03-21T13:55:00Z D! [outputs.zabbix] buffer fullness: 0 / 10000 metrics.

NOTE: Don't mind my added debug printf() points.

Telegraf keeps running without any input plugins running :-(

I have NO Go-lang experience so I doubt I would be able to fix it in affordable period of time.....

@danielnelson
Copy link
Contributor

Do you have your code pushed up on Github so I could take a look at it?

@danielnelson danielnelson added bug unexpected problem or unintended behavior need more info labels Mar 23, 2019
@Nyyr1
Copy link
Author

Nyyr1 commented Mar 25, 2019

http://leteckaposta.cz/316824773. I assume I do not have PUSH right and I am not that skilled with GIT, so I rather GZIPped it.

@danielnelson
Copy link
Contributor

I see why the zabbix plugin would continue running with your fix, the outputs continue to run until the inputs finish closing, but there is a race condition on shutting down the statsd plugin, so it may never stop.

#5628 Should resolve both issues.

As you imagined you wouldn't be able to push to the influxdata repo, most people create a fork on github that they can modify instead. Check it out sometime if you continue to use some modifications, it could be helpful for tracking differences.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/statsd bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants