-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(agent): Catch panics in inputs goroutine #14840
Conversation
Thanks so much for the pull request! |
!signed-cla |
@zmyzheng will Telegraf still exit if a plugin's |
The code in master branch will exit if a plugin's Gather() panics. With this PR, the Telegraf will not exit. |
So to clarify, the behavior we want to see if/when telegraf panics is that telegraf should also exit. As-is this appears to not do that. Here is the output of your branch with an added 'panic' in gather: ❯ ./telegraf --config config.toml
2024-02-21T21:10:41Z I! Loading config: config.toml
2024-02-21T21:10:41Z I! Starting Telegraf 1.30.0-b22818c8 brought to you by InfluxData the makers of InfluxDB
2024-02-21T21:10:41Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 61 outputs, 6 secret-stores
2024-02-21T21:10:41Z I! Loaded inputs: cpu
2024-02-21T21:10:41Z I! Loaded aggregators:
2024-02-21T21:10:41Z I! Loaded processors:
2024-02-21T21:10:41Z I! Loaded secretstores:
2024-02-21T21:10:41Z I! Loaded outputs: file
2024-02-21T21:10:41Z I! Tags enabled: host=ryzen
2024-02-21T21:10:41Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"ryzen", Flush Interval:10s
2024-02-21T21:10:50Z E! FATAL: [inputs.cpu] panicked: oops, Stack:
goroutine 97 [running]:
github.com/influxdata/telegraf/agent.panicRecover(0xc001f357a0)
/tmp/telegraf/agent/agent.go:1201 +0x73
panic({0x6c20bc0?, 0x8bef9d0?})
/usr/lib/go/src/runtime/panic.go:770 +0x132
github.com/influxdata/telegraf/plugins/inputs/cpu.(*CPUStats).Gather(0xc001fad400?, {0x8305ee0?, 0x0?})
/tmp/telegraf/plugins/inputs/cpu/cpu.go:41 +0x25
github.com/influxdata/telegraf/models.(*RunningInput).Gather(0xc001f357a0, {0x8cb2000, 0xc000fe0760})
/tmp/telegraf/models/running_input.go:149 +0x54
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
/tmp/telegraf/agent/agent.go:585 +0x5e
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce in goroutine 74
/tmp/telegraf/agent/agent.go:583 +0xf7
goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc001f47180?)
/usr/lib/go/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc001f4d028?)
/usr/lib/go/src/sync/waitgroup.go:116 +0x48
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc001f4d028, {0x8c7bad0, 0xc001f47130})
/tmp/telegraf/agent/agent.go:197 +0xa26
main.(*Telegraf).runAgent(0xc0020ef2c0, {0x8c7bad0, 0xc001f47130}, 0x7f099c6bc5b8?, 0x10?)
/tmp/telegraf/cmd/telegraf/telegraf.go:386 +0x176c
main.(*Telegraf).reloadLoop(0xc0020ef2c0)
/tmp/telegraf/cmd/telegraf/telegraf.go:173 +0x24c
main.(*Telegraf).Run(0xc0020ef2c0)
/tmp/telegraf/cmd/telegraf/telegraf_posix.go:14 +0x52
main.runApp.func1(0xc0020fc7c0)
/tmp/telegraf/cmd/telegraf/main.go:249 +0xc90
github.com/urfave/cli/v2.(*Command).Run(0xc00212c840, 0xc0020fc7c0, {0xc0000b5710, 0x3, 0x3})
/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/command.go:279 +0x97d
github.com/urfave/cli/v2.(*App).RunContext(0xc001fb7600, {0x8c7b8d8, 0xe3dfba0}, {0xc0000b5710, 0x3, 0x3})
/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:337 +0x58b
github.com/urfave/cli/v2.(*App).Run(...)
/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:311
main.runApp({0xc0000b5710, 0x3, 0x3}, {0x8bfe8c0, 0xc000130048}, {0x8c1ce60, 0xc001f4cab0}, {0x8c1ce88, 0xc002
2024-02-21T21:10:50Z E! PLEASE REPORT THIS PANIC ON GITHUB with stack trace, configuration, and OS information: https://github.com/influxdata/telegraf/issues/new/choose
2024-02-21T21:11:00Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:10Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:20Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:30Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:40Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s While I would like to have this error message, we have not been using it for quite some time. Instead, of trying to propigate the error up and exiting, we should instead leave as is? @srebhan thoughts? |
Sorry for the confusion. My preference is to keep Telegraf running if a plugin's Gather() function panics, but I'm fine to keep either Telegraf running or exit if panic happens. However, we should at least keep the panic message into the log file so that we each track and debug the issue. Otherwise, there is no way to track why Telegraf crashes. The current code in main branch does not keep the panic message to any log file. The code change made in this PR enables the panic message to be logged into the log file and then keeps the Telegraf running. If the prefered approach is to quit Telegraf after logging the panic message, I can change the log.Println to log.Fatal in this line. |
@zmyzheng the issue with keeping Telegraf running is that we do have a lot of unattended installations where nobody reads the logs until data is missing. When exiting telegraf, you will notice and you will be able to detect this e.g. using the health output or using a dead-man detection. If we keep Telegraf running, data might be missing for quite some time until someone notices... Our take on panics is: This should never happen! So we want to really fix this instead of working around those! This being said, I would rather keep the current behavior and remove the whole |
Hi @srebhan , thanks for sharing the insights. I agree we should let Telegraf exit if panic happens. I updated the code to exit Telegraf inside panicRecover() function by changing the log.Println to log.Fatal. This will make sure the panic message can be caputered in the log file before exiting without relying on other external tools. The reason I hope Telegraf can capture those panic messages without using "health output" or using "a dead-man detection" as you mentioned is because those tools are not alway available depending on the deployment environment. For example, our team deploy Telegraf into an Azure VM and bootstrap it with Azure VM Agent. We found Telegraf crashes occasionally because of a corner case bug we involved in a Telegraf plugin. However, Azure VM Agent does not have a very good way to capture the Telegraf panic message so we spent a lot of time to figure out where the issue is. With this experience, we think it will be much better if Telegraf itself can store the panic message into the log files before exiting so that we can check what issue has happened much easily. The updated PR can achieve this without changing anything else. |
I don't think this is right... With this we miss the chance to flush the outputs (even though I think we currently don't) because @powersj what do you think? |
My understanding is if a panic occurs today, those defer functions are not called anyway, right? Then this does not change that behavior. Instead, it provides an opportunity for the panic to show up in logs so users can figure out what is going on rather than losing the panic log. Which in turn gives us the opportunity to fix a panic versus not knowing where it might be occurring. Can you help confirm my understanding? |
If a panic occurs all defer functions along the call-path are called (see https://go.dev/blog/defer-panic-and-recover). So this is a change in behavior as |
How does calling defer functions and then immediately crashing actually benefit the user?
Again, I think you skipped past the part where this actually logs the crash for the user. If we are going to crash anyway I would much rather see the crash in the logs. You and I as maintainers cannot do anything without that crash log! If we are crashing I do not think a user will care or know about missing some defers ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well if you think we should not exit gracefully I can live with this... Just some one small comment @zmyzheng...
Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip. 📦 Click here to get additional PR build artifactsArtifact URLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zmyzheng!
Summary
The panicRecover(input) cannot capture the panic inside the Gather method in each plugin because the Gather() method is called in a separate go routine: this line. Adding the panicRecover(input) function inside the go routine just before done <- input.Gather(acc) can resolve this issue.
Checklist
Related issues
resolves #14826