Telegraf Generating Orphaned DBus Processes on RHEL Servers #13481

Closed
crflanigan opened this issue Jun 22, 2023 · 9 comments · Fixed by #13489
Labels
bug unexpected problem or unintended behavior

Comments

@crflanigan
Contributor

crflanigan commented Jun 22, 2023

Relevant telegraf.conf

The Telegraf configuration appears to be irrelevant, as the problem is related to the Telegraf Secret Store, which initializes when the agent starts regardless of whether you are using it.

Logs from Telegraf

The logs were unremarkable.

System info

Telegraf 1.25.2 - RHEL 6, 7, 8

Docker

No response

Steps to reproduce

Reproducing this has been tricky because it does not always occur. On impacted systems (hundreds or more), any of the following resolved the issue: reverting Telegraf to an earlier version, stopping the Telegraf service and removing the orphaned processes, or applying the workaround described below.

What we have seen:
Upgrading Telegraf from version 1.14 to 1.25.2 on RHEL servers seems to create an issue where DBus generates many orphaned processes. This eventually causes the system to hit the ceiling of available PIDs. Rolling back to 1.14 seems to clear the problem.

Example from one of our systems:

ps -ef|grep dbus|grep -v grep|wc -l
1459  

What we found:

The Telegraf Secret Store appears to depend on github.com/99designs/keyring. The dependency is loaded by plugins/secretstores/all/os.go, which points to telegraf/plugins/secretstores/os/os.go, which in turn imports the keyring package. That package's kwallet.go runs the following init() function:

func init() {
	if os.Getenv("DISABLE_KWALLET") == "1" {
		return
	}

	// silently fail if dbus isn't available
	_, err := dbus.SessionBus()
	if err != nil {
		return
	}

	// ... remainder of init() omitted in this excerpt
}

From here we found that we can bypass this DBus activity by setting the environment variable DISABLE_KWALLET=1 in the Telegraf startup script, though setting it in the terminal before launching Telegraf should also work.

Investigating deeper, it appears this behavior is a known issue for this package and has yet to be solved.

As an aside, it looks like the keyring application isn't actively being maintained, with the last release being in December of 2022.

Expected behavior

Telegraf works as expected.

Actual behavior

Telegraf inadvertently creates thousands of orphaned DBus processes, which eventually exhaust the available PIDs and cause system degradation.

Additional info

No response

@crflanigan crflanigan added the bug unexpected problem or unintended behavior label Jun 22, 2023
@crflanigan crflanigan changed the title Telegraf Generating Orphaned DBus PID's on RHEL Servers Telegraf Generating Orphaned DBus processes on RHEL Servers Jun 22, 2023
@crflanigan crflanigan changed the title Telegraf Generating Orphaned DBus processes on RHEL Servers Telegraf Generating Orphaned DBus Processes on RHEL Servers Jun 22, 2023
@jdstrand
Contributor

IMO, telegraf itself should unconditionally disable the kwallet integration. The integration, AIUI, was an unintentional side-effect of using this library.

@powersj
Contributor

powersj commented Jun 23, 2023

@crflanigan I have put up #13489. Can you download an artifact and verify this no longer crashes?

Thanks

@crflanigan
Contributor Author

@powersj Sure thing!
Thanks for the quick turn around!

@crflanigan
Contributor Author

Our initial testing shows that the issue does not occur with this fix. We will keep testing and let you know what (if anything) we find.
@powersj

@crflanigan
Contributor Author

@powersj
This fix looks good to us!

@powersj
Contributor

powersj commented Jun 27, 2023

Brilliant, thanks for the quick turn around on testing

@crflanigan
Contributor Author

You bet buddy!
Thanks for the awesome responsiveness!

@crflanigan
Contributor Author

@powersj @srebhan,

It looks like the issue is still occurring :(

Can this be re-opened, or should we create a new issue?

@powersj
Contributor

powersj commented Jul 17, 2023

Let's create a new issue, and if you could please get logs from 1.27.2, I would appreciate it.
