Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(agent): Catch panics in inputs goroutine #14840

Merged
merged 5 commits into from
Feb 26, 2024

Conversation

zmyzheng
Copy link
Contributor

Summary

The panicRecover(input) cannot capture the panic inside the Gather method in each plugin because the Gather() method is called in a separate go routine: this line. Adding the panicRecover(input) function inside the go routine just before done <- input.Gather(acc) can resolve this issue.

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #14826

@telegraf-tiger
Copy link
Contributor

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@telegraf-tiger telegraf-tiger bot added the fix pr to fix corresponding bug label Feb 16, 2024
@zmyzheng
Copy link
Contributor Author

!signed-cla

@srebhan
Copy link
Member

srebhan commented Feb 21, 2024

@zmyzheng will Telegraf still exit if a plugin's Gather() function panics?

@zmyzheng
Copy link
Contributor Author

zmyzheng commented Feb 21, 2024

@zmyzheng will Telegraf still exit if a plugin's Gather() function panics?

The code in master branch will exit if a plugin's Gather() panics. With this PR, the Telegraf will not exit.

@powersj
Copy link
Contributor

powersj commented Feb 21, 2024

@zmyzheng will Telegraf still exit if a plugin's Gather() function panics?

Yes

So to clarify, the behavior we want to see if/when telegraf panics is that telegraf should also exit. As-is this appears to not do that.

Here is the output of your branch with an added 'panic' in gather:

❯ ./telegraf --config config.toml
2024-02-21T21:10:41Z I! Loading config: config.toml
2024-02-21T21:10:41Z I! Starting Telegraf 1.30.0-b22818c8 brought to you by InfluxData the makers of InfluxDB
2024-02-21T21:10:41Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 61 outputs, 6 secret-stores
2024-02-21T21:10:41Z I! Loaded inputs: cpu
2024-02-21T21:10:41Z I! Loaded aggregators: 
2024-02-21T21:10:41Z I! Loaded processors: 
2024-02-21T21:10:41Z I! Loaded secretstores: 
2024-02-21T21:10:41Z I! Loaded outputs: file
2024-02-21T21:10:41Z I! Tags enabled: host=ryzen
2024-02-21T21:10:41Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"ryzen", Flush Interval:10s
2024-02-21T21:10:50Z E! FATAL: [inputs.cpu] panicked: oops, Stack:
goroutine 97 [running]:
github.com/influxdata/telegraf/agent.panicRecover(0xc001f357a0)
	/tmp/telegraf/agent/agent.go:1201 +0x73
panic({0x6c20bc0?, 0x8bef9d0?})
	/usr/lib/go/src/runtime/panic.go:770 +0x132
github.com/influxdata/telegraf/plugins/inputs/cpu.(*CPUStats).Gather(0xc001fad400?, {0x8305ee0?, 0x0?})
	/tmp/telegraf/plugins/inputs/cpu/cpu.go:41 +0x25
github.com/influxdata/telegraf/models.(*RunningInput).Gather(0xc001f357a0, {0x8cb2000, 0xc000fe0760})
	/tmp/telegraf/models/running_input.go:149 +0x54
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
	/tmp/telegraf/agent/agent.go:585 +0x5e
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce in goroutine 74
	/tmp/telegraf/agent/agent.go:583 +0xf7

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc001f47180?)
	/usr/lib/go/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc001f4d028?)
	/usr/lib/go/src/sync/waitgroup.go:116 +0x48
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc001f4d028, {0x8c7bad0, 0xc001f47130})
	/tmp/telegraf/agent/agent.go:197 +0xa26
main.(*Telegraf).runAgent(0xc0020ef2c0, {0x8c7bad0, 0xc001f47130}, 0x7f099c6bc5b8?, 0x10?)
	/tmp/telegraf/cmd/telegraf/telegraf.go:386 +0x176c
main.(*Telegraf).reloadLoop(0xc0020ef2c0)
	/tmp/telegraf/cmd/telegraf/telegraf.go:173 +0x24c
main.(*Telegraf).Run(0xc0020ef2c0)
	/tmp/telegraf/cmd/telegraf/telegraf_posix.go:14 +0x52
main.runApp.func1(0xc0020fc7c0)
	/tmp/telegraf/cmd/telegraf/main.go:249 +0xc90
github.com/urfave/cli/v2.(*Command).Run(0xc00212c840, 0xc0020fc7c0, {0xc0000b5710, 0x3, 0x3})
	/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/command.go:279 +0x97d
github.com/urfave/cli/v2.(*App).RunContext(0xc001fb7600, {0x8c7b8d8, 0xe3dfba0}, {0xc0000b5710, 0x3, 0x3})
	/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:337 +0x58b
github.com/urfave/cli/v2.(*App).Run(...)
	/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:311
main.runApp({0xc0000b5710, 0x3, 0x3}, {0x8bfe8c0, 0xc000130048}, {0x8c1ce60, 0xc001f4cab0}, {0x8c1ce88, 0xc002
2024-02-21T21:10:50Z E! PLEASE REPORT THIS PANIC ON GITHUB with stack trace, configuration, and OS information: https://github.com/influxdata/telegraf/issues/new/choose
2024-02-21T21:11:00Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:10Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:20Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:30Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:40Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s

While I would like to have this error message, we have not been using it for quite some time. Instead, of trying to propigate the error up and exiting, we should instead leave as is? @srebhan thoughts?

@zmyzheng
Copy link
Contributor Author

zmyzheng commented Feb 21, 2024

@zmyzheng will Telegraf still exit if a plugin's Gather() function panics?

Yes

So to clarify, the behavior we want to see if/when telegraf panics is that telegraf should also exit. As-is this appears to not do that.

Here is the output of your branch with an added 'panic' in gather:

❯ ./telegraf --config config.toml
2024-02-21T21:10:41Z I! Loading config: config.toml
2024-02-21T21:10:41Z I! Starting Telegraf 1.30.0-b22818c8 brought to you by InfluxData the makers of InfluxDB
2024-02-21T21:10:41Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 61 outputs, 6 secret-stores
2024-02-21T21:10:41Z I! Loaded inputs: cpu
2024-02-21T21:10:41Z I! Loaded aggregators: 
2024-02-21T21:10:41Z I! Loaded processors: 
2024-02-21T21:10:41Z I! Loaded secretstores: 
2024-02-21T21:10:41Z I! Loaded outputs: file
2024-02-21T21:10:41Z I! Tags enabled: host=ryzen
2024-02-21T21:10:41Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"ryzen", Flush Interval:10s
2024-02-21T21:10:50Z E! FATAL: [inputs.cpu] panicked: oops, Stack:
goroutine 97 [running]:
github.com/influxdata/telegraf/agent.panicRecover(0xc001f357a0)
	/tmp/telegraf/agent/agent.go:1201 +0x73
panic({0x6c20bc0?, 0x8bef9d0?})
	/usr/lib/go/src/runtime/panic.go:770 +0x132
github.com/influxdata/telegraf/plugins/inputs/cpu.(*CPUStats).Gather(0xc001fad400?, {0x8305ee0?, 0x0?})
	/tmp/telegraf/plugins/inputs/cpu/cpu.go:41 +0x25
github.com/influxdata/telegraf/models.(*RunningInput).Gather(0xc001f357a0, {0x8cb2000, 0xc000fe0760})
	/tmp/telegraf/models/running_input.go:149 +0x54
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
	/tmp/telegraf/agent/agent.go:585 +0x5e
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce in goroutine 74
	/tmp/telegraf/agent/agent.go:583 +0xf7

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc001f47180?)
	/usr/lib/go/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc001f4d028?)
	/usr/lib/go/src/sync/waitgroup.go:116 +0x48
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc001f4d028, {0x8c7bad0, 0xc001f47130})
	/tmp/telegraf/agent/agent.go:197 +0xa26
main.(*Telegraf).runAgent(0xc0020ef2c0, {0x8c7bad0, 0xc001f47130}, 0x7f099c6bc5b8?, 0x10?)
	/tmp/telegraf/cmd/telegraf/telegraf.go:386 +0x176c
main.(*Telegraf).reloadLoop(0xc0020ef2c0)
	/tmp/telegraf/cmd/telegraf/telegraf.go:173 +0x24c
main.(*Telegraf).Run(0xc0020ef2c0)
	/tmp/telegraf/cmd/telegraf/telegraf_posix.go:14 +0x52
main.runApp.func1(0xc0020fc7c0)
	/tmp/telegraf/cmd/telegraf/main.go:249 +0xc90
github.com/urfave/cli/v2.(*Command).Run(0xc00212c840, 0xc0020fc7c0, {0xc0000b5710, 0x3, 0x3})
	/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/command.go:279 +0x97d
github.com/urfave/cli/v2.(*App).RunContext(0xc001fb7600, {0x8c7b8d8, 0xe3dfba0}, {0xc0000b5710, 0x3, 0x3})
	/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:337 +0x58b
github.com/urfave/cli/v2.(*App).Run(...)
	/home/powersj/go/pkg/mod/github.com/urfave/cli/v2@v2.27.1/app.go:311
main.runApp({0xc0000b5710, 0x3, 0x3}, {0x8bfe8c0, 0xc000130048}, {0x8c1ce60, 0xc001f4cab0}, {0x8c1ce88, 0xc002
2024-02-21T21:10:50Z E! PLEASE REPORT THIS PANIC ON GITHUB with stack trace, configuration, and OS information: https://github.com/influxdata/telegraf/issues/new/choose
2024-02-21T21:11:00Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:10Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:20Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:30Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2024-02-21T21:11:40Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s

While I would like to have this error message, we have not been using it for quite some time. Instead, of trying to propigate the error up and exiting, we should instead leave as is? @srebhan thoughts?

Sorry for the confusion. My preference is to keep Telegraf running if a plugin's Gather() function panics, but I'm fine to keep either Telegraf running or exit if panic happens. However, we should at least keep the panic message into the log file so that we each track and debug the issue. Otherwise, there is no way to track why Telegraf crashes.

The current code in main branch does not keep the panic message to any log file. The code change made in this PR enables the panic message to be logged into the log file and then keeps the Telegraf running. If the prefered approach is to quit Telegraf after logging the panic message, I can change the log.Println to log.Fatal in this line.

@srebhan
Copy link
Member

srebhan commented Feb 22, 2024

@zmyzheng the issue with keeping Telegraf running is that we do have a lot of unattended installations where nobody reads the logs until data is missing. When exiting telegraf, you will notice and you will be able to detect this e.g. using the health output or using a dead-man detection. If we keep Telegraf running, data might be missing for quite some time until someone notices...

Our take on panics is: This should never happen! So we want to really fix this instead of working around those!

This being said, I would rather keep the current behavior and remove the whole panicRecover function as it is not used in the current implementation... What do you think @zmyzheng?

@powersj powersj added the waiting for response waiting for response from contributor label Feb 22, 2024
@zmyzheng
Copy link
Contributor Author

@zmyzheng the issue with keeping Telegraf running is that we do have a lot of unattended installations where nobody reads the logs until data is missing. When exiting telegraf, you will notice and you will be able to detect this e.g. using the health output or using a dead-man detection. If we keep Telegraf running, data might be missing for quite some time until someone notices...

Our take on panics is: This should never happen! So we want to really fix this instead of working around those!

This being said, I would rather keep the current behavior and remove the whole panicRecover function as it is not used in the current implementation... What do you think @zmyzheng?

Hi @srebhan , thanks for sharing the insights. I agree we should let Telegraf exit if panic happens. I updated the code to exit Telegraf inside panicRecover() function by changing the log.Println to log.Fatal. This will make sure the panic message can be caputered in the log file before exiting without relying on other external tools.

The reason I hope Telegraf can capture those panic messages without using "health output" or using "a dead-man detection" as you mentioned is because those tools are not alway available depending on the deployment environment. For example, our team deploy Telegraf into an Azure VM and bootstrap it with Azure VM Agent. We found Telegraf crashes occasionally because of a corner case bug we involved in a Telegraf plugin. However, Azure VM Agent does not have a very good way to capture the Telegraf panic message so we spent a lot of time to figure out where the issue is. With this experience, we think it will be much better if Telegraf itself can store the panic message into the log files before exiting so that we can check what issue has happened much easily. The updated PR can achieve this without changing anything else.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Feb 22, 2024
@powersj powersj added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Feb 22, 2024
@srebhan
Copy link
Member

srebhan commented Feb 23, 2024

I don't think this is right... With this we miss the chance to flush the outputs (even though I think we currently don't) because log.Fatal calls os.Exit and thus none of the defer functions is executed. I would really rather let panic do it's magic and do without the nice message than being clever and loose the deferred functions...

@powersj what do you think?

@powersj
Copy link
Contributor

powersj commented Feb 23, 2024

I would really rather let panic do it's magic

My understanding is if a panic occurs today, those defer functions are not called anyway, right? Then this does not change that behavior. Instead, it provides an opportunity for the panic to show up in logs so users can figure out what is going on rather than losing the panic log. Which in turn gives us the opportunity to fix a panic versus not knowing where it might be occurring.

Can you help confirm my understanding?

@srebhan
Copy link
Member

srebhan commented Feb 26, 2024

If a panic occurs all defer functions along the call-path are called (see https://go.dev/blog/defer-panic-and-recover). So this is a change in behavior as os.Exit will not call any deferred function! I'm not sure if this makes a difference today. Instead of printing a nice message, we should rather check if we can change the agent code in a way that it flushes the outputs if an input panics using defer calls...

@powersj
Copy link
Contributor

powersj commented Feb 26, 2024

If a panic occurs all defer functions along the call-path are called

How does calling defer functions and then immediately crashing actually benefit the user?

I'm not sure if this makes a difference today.

Again, I think you skipped past the part where this actually logs the crash for the user. If we are going to crash anyway I would much rather see the crash in the logs. You and I as maintainers cannot do anything without that crash log! If we are crashing I do not think a user will care or know about missing some defers ;)

Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well if you think we should not exit gracefully I can live with this... Just some one small comment @zmyzheng...

agent/agent.go Outdated Show resolved Hide resolved
@srebhan srebhan changed the title fix(agent): add panicRecover func to capture the panics inside plugin executions fix(agent): Catch panics in inputs goroutine Feb 26, 2024
@zmyzheng
Copy link
Contributor Author

zmyzheng commented Feb 26, 2024

Well if you think we should not exit gracefully I can live with this... Just some one small comment @zmyzheng...

Thanks @srebhan , I removed the defer panicRecover(input) call in line 559 as you suggested.

@telegraf-tiger
Copy link
Contributor

Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zmyzheng!

@srebhan srebhan merged commit 6d523c9 into influxdata:master Feb 26, 2024
26 checks passed
@github-actions github-actions bot added this to the v1.30.0 milestone Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/agent fix pr to fix corresponding bug ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The panicRecover func cannot capture the panics during plugin executions.
3 participants