fix(agent): Catch panics in inputs goroutine #14840

zmyzheng · 2024-02-16T23:05:09Z

Summary

The deferred panicRecover(input) call cannot catch a panic raised inside a plugin's Gather method, because Gather() is called in a separate goroutine: this line. Deferring panicRecover(input) inside that goroutine, just before done <- input.Gather(acc), resolves this issue.
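As a minimal sketch of why the placement matters: in Go, recover only stops a panic when it is called from a function deferred on the goroutine that panicked. The names gatherFunc and runGuarded below are hypothetical and not part of Telegraf, and the sketch returns the panic as an error instead of logging it, purely to keep the example small.

```go
package main

import "fmt"

// gatherFunc stands in for a plugin's Gather method (hypothetical type).
type gatherFunc func() error

// runGuarded calls gather in its own goroutine and defers the recover
// inside that goroutine, so a panic in gather is turned into an error.
func runGuarded(gather gatherFunc) error {
	done := make(chan error, 1)
	go func() {
		// recover only sees panics raised on the goroutine that runs the
		// deferred function, so it is deferred here and not in the caller.
		defer func() {
			if r := recover(); r != nil {
				done <- fmt.Errorf("gather panicked: %v", r)
			}
		}()
		done <- gather()
	}()
	return <-done
}
```

Moving the deferred function out of the goroutine and into runGuarded itself would leave the panic uncaught, which is the situation described above.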

Checklist

No AI generated code was used in this PR

Related issues

resolves #14826

telegraf-tiger · 2024-02-16T23:05:16Z

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

zmyzheng · 2024-02-16T23:07:49Z

!signed-cla

srebhan · 2024-02-21T20:35:29Z

@zmyzheng will Telegraf still exit if a plugin's Gather() function panics?

zmyzheng · 2024-02-21T20:56:41Z

The code on the master branch exits if a plugin's Gather() panics. With this PR, Telegraf will not exit.

powersj · 2024-02-21T21:15:35Z

> @zmyzheng will Telegraf still exit if a plugin's Gather() function panics?
>
> Yes

So to clarify, the behavior we want to see if/when Telegraf panics is that Telegraf should also exit. As-is, this does not appear to do that.

Here is the output of your branch with an added 'panic' in gather:

❯ ./telegraf --config config.toml
2024-02-21T21:10:41Z I! Loading config: config.toml
2024-02-21T21:10:41Z I! Starting Telegraf 1.30.0-b22818c8 brought to you by InfluxData the makers of InfluxDB
2024-02-21T21:10:41Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 61 outputs, 6 secret-stores
2024-02-21T21:10:41Z I! Loaded inputs: cpu
2024-02-21T21:10:41Z I! Loaded aggregators: 
2024-02-21T21:10:41Z I! Loaded processors: 
2024-02-21T21:10:41Z I! Loaded secretstores: 
2024-02-21T21:10:41Z I! Loaded outputs: file
2024-02-21T21:10:41Z I! Tags enabled: host=ryzen
2024-02-21T21:10:41Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"ryzen", Flush Interval:10s
2024-02-21T21:10:50Z E! FATAL: [inputs.cpu] panicked: oops, Stack:
goroutine 97 [running]:
github.com/influxdata/telegraf/agent.panicRecover(0xc001f357a0)
	/tmp/telegraf/agent/agent.go:1201 +0x73
panic({0x6c20bc0?, 0x8bef9d0?})
	/usr/lib/go/src/runtime/panic.go:770 +0x132
github.com/influxdata/telegraf/plugins/inputs/cpu.(*CPUStats).Gather(0xc001fad400?, {0x8305ee0?, 0x0?})
	/tmp/telegraf/plugins/inputs/cpu/cpu.go:41 +0x25
github.com/influxdata/telegraf/models.(*RunningInput).Gather(0xc001f357a0, {0x8cb2000, 0xc000fe0760})
	/tmp/telegraf/models/running_input.go:149 +0x54
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
	/tmp/telegraf/agent/agent.go:585 +0x5e
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce in goroutine 74
	/tmp/telegraf/agent/agent.go:583 +0xf7

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc001f47180?)
	/usr/lib/go/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc001f4d028?)
	/usr/lib/go/src/sync/waitgroup.go:116 +0x48
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc001f4d028, {0x8c7bad0, 0xc001f47130})
	/tmp/telegraf/agent/agent.go:197 +0xa26
main.(*Telegraf).runAgent(0xc0020ef2c0, {0x8c7bad0, 0xc001f47130}, 0x7f099c6bc5b8?, 0x10?)
	/tmp/telegraf/cmd/telegraf/telegraf.go:386 +0x176c
main.(*Telegraf).reloadLoop(0xc0020ef2c0)
	/tmp/telegraf/cmd/telegraf/telegraf.go:173 +0x24c
main.(*Telegraf).Run(0xc0020ef2c0)
	/tmp/telegraf/cmd/telegraf/telegraf_posix.go:14 +0x52
main.runApp.func1(0xc0020fc7c0)
	/tmp/telegraf/cmd/telegraf/main.go:249 +0xc90
github.com/urfave/cli/v2.(*Command).Run(0xc00212c840, 0xc0020fc7c0, {0xc0000b5710, 0x3, 0x3})
	/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/command.go:279 +0x97d
github.com/urfave/cli/v2.(*App).RunContext(0xc001fb7600, {0x8c7b8d8, 0xe3dfba0}, {0xc0000b5710, 0x3, 0x3})
	/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:337 +0x58b
github.com/urfave/cli/v2.(*App).Run(...)
	/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:311
main.runApp({0xc0000b5710, 0x3, 0x3}, {0x8bfe8c0, 0xc000130048}, {0x8c1ce60, 0xc001f4cab0}, {0x8c1ce88, 0xc002
2024-02-21T21:10:50Z E! PLEASE REPORT THIS PANIC ON GITHUB with stack trace, configuration, and OS information: https://github.com/influxdata/telegraf/issues/new/choose
2024-02-21T21:11:00Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:10Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:20Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:30Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:40Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s

While I would like to have this error message, we have not been using it for quite some time. Instead of trying to propagate the error up and exit, should we leave this as is? @srebhan thoughts?

zmyzheng · 2024-02-21T21:26:03Z

Sorry for the confusion. My preference is to keep Telegraf running if a plugin's Gather() function panics, but I'm fine with either keeping Telegraf running or exiting when a panic happens. However, we should at least write the panic message to the log file so that we can track and debug the issue. Otherwise, there is no way to find out why Telegraf crashed.

The current code on the main branch does not write the panic message to any log file. The change in this PR writes the panic message to the log file and then keeps Telegraf running. If the preferred approach is to quit Telegraf after logging the panic message, I can change the log.Println to log.Fatal in this line.

srebhan · 2024-02-22T07:50:46Z

@zmyzheng the issue with keeping Telegraf running is that we do have a lot of unattended installations where nobody reads the logs until data is missing. When exiting telegraf, you will notice and you will be able to detect this e.g. using the health output or using a dead-man detection. If we keep Telegraf running, data might be missing for quite some time until someone notices...

Our take on panics is: This should never happen! So we want to really fix this instead of working around those!

This being said, I would rather keep the current behavior and remove the whole panicRecover function as it is not used in the current implementation... What do you think @zmyzheng?

zmyzheng · 2024-02-22T18:28:09Z

Hi @srebhan, thanks for sharing the insights. I agree we should let Telegraf exit if a panic happens. I updated the code to exit Telegraf inside the panicRecover() function by changing the log.Println to log.Fatal. This makes sure the panic message is captured in the log file before exiting, without relying on other external tools.

The reason I hope Telegraf can capture those panic messages without using the "health output" or "a dead-man detection" you mentioned is that those tools are not always available, depending on the deployment environment. For example, our team deploys Telegraf to an Azure VM and bootstraps it with the Azure VM Agent. We found that Telegraf crashed occasionally because of a corner-case bug we ran into in a Telegraf plugin. However, the Azure VM Agent does not have a good way to capture the Telegraf panic message, so we spent a lot of time figuring out where the issue was. Given this experience, we think it would be much better if Telegraf itself wrote the panic message to the log files before exiting, so that we can check what happened much more easily. The updated PR achieves this without changing anything else.
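For illustration, a rough sketch of the "log the panic, then exit" pattern discussed here. This is not the actual panicRecover code from agent/agent.go: the names reportPanic and capture are hypothetical, and the writer and the exit function are passed in as parameters (in a real program they might be the log output and os.Exit) so the behaviour can be tried without ending the process. The message format follows the FATAL line in the log output earlier in this thread.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"runtime/debug"
)

// reportPanic is meant to be deferred at the top of a goroutine. If that
// goroutine is panicking, it writes the panic value and the stack trace to w
// and then calls exit. Both w and exit are assumptions made for this sketch.
func reportPanic(name string, w io.Writer, exit func(code int)) {
	if r := recover(); r != nil {
		fmt.Fprintf(w, "FATAL: [%s] panicked: %v, Stack:\n%s\n", name, r, debug.Stack())
		exit(1)
	}
}

// capture runs f with reportPanic deferred and returns what was written
// together with the exit code that would have been used (0 if f did not panic).
func capture(name string, f func()) (string, int) {
	var buf bytes.Buffer
	code := 0
	func() {
		defer reportPanic(name, &buf, func(c int) { code = c })
		f()
	}()
	return buf.String(), code
}
```

Calling capture("inputs.cpu", func() { panic("oops") }) yields a message that starts with "FATAL: [inputs.cpu] panicked: oops, Stack:" and an exit code of 1, so the panic is on record before the process would terminate.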

srebhan · 2024-02-23T10:05:25Z

I don't think this is right... With this we miss the chance to flush the outputs (even though I think we currently don't) because log.Fatal calls os.Exit and thus none of the deferred functions are executed. I would really rather let panic do its magic and do without the nice message than be clever and lose the deferred functions...

@powersj what do you think?

powersj · 2024-02-23T14:35:18Z

> I would really rather let panic do its magic

My understanding is if a panic occurs today, those defer functions are not called anyway, right? Then this does not change that behavior. Instead, it provides an opportunity for the panic to show up in logs so users can figure out what is going on rather than losing the panic log. Which in turn gives us the opportunity to fix a panic versus not knowing where it might be occurring.

Can you help confirm my understanding?

srebhan · 2024-02-26T10:29:07Z

If a panic occurs all defer functions along the call-path are called (see https://go.dev/blog/defer-panic-and-recover). So this is a change in behavior as os.Exit will not call any deferred function! I'm not sure if this makes a difference today. Instead of printing a nice message, we should rather check if we can change the agent code in a way that it flushes the outputs if an input panics using defer calls...
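A small sketch of the unwinding behaviour described in the linked blog post: while a panic unwinds, the deferred functions on the panicking goroutine's call path run in last-in, first-out order. The function names are hypothetical, and the recover at the top is only there so the recorded order can be returned. By contrast, os.Exit (which log.Fatal calls after printing) terminates the program immediately without running deferred functions.

```go
package main

// unwindOrder panics inside a nested call and returns the order in which the
// deferred functions ran while the panic unwound the stack.
func unwindOrder() (order []string) {
	defer func() {
		recover() // stop the panic so the recorded order can be returned
		order = append(order, "outer recover")
	}()
	defer func() { order = append(order, "outer defer") }()
	inner(&order)
	order = append(order, "not reached") // skipped because inner panics
	return order
}

// inner registers its own deferred function and then panics.
func inner(order *[]string) {
	defer func() { *order = append(*order, "inner defer") }()
	panic("oops")
}
```

Here unwindOrder returns "inner defer", "outer defer", "outer recover", in that order. Only the deferred functions of the goroutine that panics are run this way.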

powersj · 2024-02-26T14:06:29Z

> If a panic occurs all defer functions along the call-path are called

How does calling defer functions and then immediately crashing actually benefit the user?

> I'm not sure if this makes a difference today.

Again, I think you skipped past the part where this actually logs the crash for the user. If we are going to crash anyway I would much rather see the crash in the logs. You and I as maintainers cannot do anything without that crash log! If we are crashing I do not think a user will care or know about missing some defers ;)

srebhan

Well if you think we should not exit gracefully I can live with this... Just one small comment @zmyzheng...

zmyzheng · 2024-02-26T18:52:43Z

Thanks @srebhan, I removed the defer panicRecover(input) call in line 559 as you suggested.

telegraf-tiger · 2024-02-26T19:23:10Z

Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip.
Downloads for additional architectures and packages are available below.

☺️ This pull request doesn't significantly change the Telegraf binary size (less than 1%)

Artifact URLs

DEB	RPM	TAR GZ	ZIP
amd64.deb	aarch64.rpm	darwin_amd64.tar.gz	windows_amd64.zip
arm64.deb	armel.rpm	darwin_arm64.tar.gz	windows_arm64.zip
armel.deb	armv6hl.rpm	freebsd_amd64.tar.gz	windows_i386.zip
armhf.deb	i386.rpm	freebsd_armv7.tar.gz
i386.deb	ppc64le.rpm	freebsd_i386.tar.gz
mips.deb	riscv64.rpm	linux_amd64.tar.gz
mipsel.deb	s390x.rpm	linux_arm64.tar.gz
ppc64el.deb	x86_64.rpm	linux_armel.tar.gz
riscv64.deb		linux_armhf.tar.gz
s390x.deb		linux_i386.tar.gz
		linux_mips.tar.gz
		linux_mipsel.tar.gz
		linux_ppc64le.tar.gz
		linux_riscv64.tar.gz
		linux_s390x.tar.gz

srebhan

Thanks @zmyzheng!

zmyzheng added 2 commits February 16, 2024 15:00

fix(agent): add panicRecover func to capture the panics inside plugin executions

294541d

revert go.mod

b22818c

telegraf-tiger bot added the "fix" label (pr to fix corresponding bug) Feb 16, 2024

zmyzheng mentioned this pull request Feb 21, 2024

The panicRecover func cannot capture the panics during plugin executions. #14826

Closed

powersj assigned srebhan Feb 22, 2024

powersj added the "waiting for response" label (waiting for response from contributor) Feb 22, 2024

exit Telegraf inside panicRecover() function

3d4246c

telegraf-tiger bot removed the "waiting for response" label (waiting for response from contributor) Feb 22, 2024

powersj approved these changes Feb 22, 2024

powersj added the "ready for final review" label (This pull request has been reviewed and/or tested by multiple users and is ready for a final review.) Feb 22, 2024

srebhan reviewed Feb 26, 2024

srebhan changed the title from "fix(agent): add panicRecover func to capture the panics inside plugin executions" to "fix(agent): Catch panics in inputs goroutine" Feb 26, 2024

srebhan added the area/agent label Feb 26, 2024

srebhan and others added 2 commits February 26, 2024 18:30

Update agent/agent.go

22790f7

remove defer panicRecover(input) call in line 559 as this will never be called

cc50afa

srebhan approved these changes Feb 26, 2024

srebhan merged commit 6d523c9 into influxdata:master Feb 26, 2024
26 checks passed

github-actions bot added this to the v1.30.0 milestone Feb 26, 2024
