[inputs.systemd_units] Error in plugin: listing unit files failed: context deadline exceeded - after updating to 1.30.0 (git: HEAD@3c03ddcf) #14980

Closed
electrofloat opened this issue Mar 12, 2024 · 19 comments · Fixed by #14987
Labels
bug (unexpected problem or unintended behavior), regression (something that used to work, but is now broken)

Comments

@electrofloat

electrofloat commented Mar 12, 2024

Relevant telegraf.conf

[[inputs.systemd_units]]

Logs from Telegraf

Mar 12 18:44:42 arb telegraf[1945801]: 2024-03-12T17:44:42Z E! [inputs.systemd_units] Error in plugin: listing unit files failed: context deadline exceeded
Mar 12 18:44:52 arb telegraf[1945801]: 2024-03-12T17:44:52Z E! [inputs.systemd_units] Error in plugin: listing unit files failed: context deadline exceeded
Mar 12 18:45:01 arb telegraf[1945801]: 2024-03-12T17:45:01Z E! [inputs.systemd_units] Error in plugin: listing unit files failed: context deadline exceeded

on a different machine:

Mar 12 18:46:54 arc telegraf[2322492]: 2024-03-12T17:46:54Z E! [inputs.systemd_units] Error in plugin: listing unit states failed: Unit name serial-getty@.service is missing the instance name.

System info

Ubuntu 22.04

Docker

No response

Steps to reproduce

I've just updated to: Telegraf 1.30.0 (git: HEAD@3c03ddcf)

and after restarting telegraf, I'm getting the above error in the logs.

Expected behavior

No error.

Actual behavior

Error logs

Additional info

I had to revert back to: Telegraf 1.29.5 (git: HEAD@138d0d54)

@electrofloat electrofloat added the bug unexpected problem or unintended behavior label Mar 12, 2024
@electrofloat electrofloat changed the title from "telegraf[1945801]: 2024-03-12T17:43:52Z E! [inputs.systemd_units] Error in plugin: listing unit files failed: context deadline exceeded" to "[inputs.systemd_units] Error in plugin: listing unit files failed: context deadline exceeded - after updating to 1.30.0 (git: HEAD@3c03ddcf)" Mar 12, 2024
@srebhan srebhan self-assigned this Mar 12, 2024
@srebhan srebhan added the regression something that used to work, but is now broken label Mar 13, 2024
@srebhan
Member

srebhan commented Mar 13, 2024

@electrofloat can you please check the binary in #14987, available once CI has finished the tests? Let me know if this fixes the issue!

@electrofloat
Author

Yes. I've upgraded to this version (Telegraf 1.31.0-30d5d365 (git: pull/14987@30d5d365)) and it does not report the errors now. Tested it on both machines.

So it seems to be fixed.

@srebhan
Member

srebhan commented Mar 13, 2024

@electrofloat thanks for the quick testing!

@jjh74
Contributor

jjh74 commented Mar 14, 2024

The binary from #14987 also fixes the "Unit name ... is missing the instance name" errors for me (tested on RHEL 8/9 and AlmaLinux 8/9).
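
For readers unfamiliar with the "missing the instance name" error: systemd template units are named like serial-getty@.service, and only an instantiated name such as serial-getty@ttyS0.service refers to a concrete unit. The Python sketch below only illustrates that naming convention. The function names are hypothetical, and this is not the code from #14987.

```python
def is_bare_template(unit_name: str) -> bool:
    """Return True for a template unit name that has no instance part,
    e.g. 'serial-getty@.service'."""
    stem, dot, _suffix = unit_name.rpartition(".")
    # A bare template has '@' directly before the type suffix;
    # an instantiated unit has text between '@' and the suffix.
    return bool(dot) and stem.endswith("@")


def instantiate(template: str, instance: str) -> str:
    """Insert an instance name into a bare template unit name."""
    if not is_bare_template(template):
        raise ValueError(f"{template!r} is not a bare template unit name")
    stem, _, suffix = template.rpartition(".")
    return f"{stem}{instance}.{suffix}"


names = ["serial-getty@.service", "serial-getty@ttyS0.service", "ssh.service"]
bare_templates = [n for n in names if is_bare_template(n)]
print(bare_templates)
print(instantiate("serial-getty@.service", "ttyS0"))
```

A collector that asks systemd for the state of every listed unit file would presumably need to skip or specially handle the names this check flags, since a bare template has no state of its own.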

@electrofloat
Author

@DStrand1 You closed this as completed, but do we know when this will be released?

@powersj
Contributor

powersj commented Mar 14, 2024

@electrofloat
Author

April 1? But this is a regression in 1.30. How is this not fixed and released immediately?

@srebhan
Member

srebhan commented Mar 14, 2024

@electrofloat you can use a nightly build starting from tomorrow.

> How is this not fixed and released immediately?

This was fixed within two days! What do you expect? You have three options: use 1.29.5 until the release, use the binary from the PR, or use a nightly build starting tomorrow. We do not have the resources to bake a release for every single commit to master!

@electrofloat
Author

I expect a new release after a regression fix!

What you guys need to understand is that on a Debian-based system like Ubuntu, you install software by adding an apt line to a sources.list file (exactly as your docs describe for Ubuntu) and then using apt.

Now, to upgrade packages on Debian-based systems, you run a command like apt-get update && apt-get upgrade. Since there is now a new Telegraf release that is known to be BAD, every time I want to upgrade my packages I either have to remove the telegraf line from the sources list or put the package on hold. Both workarounds all but guarantee that I forget to undo them, and the user is stuck with an old package full of security holes.

This is the second time in a short window that a new release has broken previously working functionality.

So the problem is not that the patch was slow; the problem is that we have to wait 3 weeks for a release before we can upgrade our packages.

@powersj
Contributor

powersj commented Mar 14, 2024

> What you guys need to understand is that on a Debian-based system like Ubuntu, you install software by adding an apt line to a sources.list file (exactly as your docs describe for Ubuntu) and then using apt.

We are well aware of how package managers work. As you also mention, they provide mechanisms for you to avoid package versions with issues.

As Sven already said, we do not release a new version for every single fix, security issue, or regression. Throughout its history, Telegraf has used time-based releases with great success. When issues do arise, users can run a nightly, run a custom build, or revert to a previous version, whether they use our package repo, download tarballs, or use the official Docker images.

> This is the second time in a short window that a new release has broken previously working functionality.

Yes, it is, and I can tell you we hate when this happens; it literally keeps us up at night after a release. It is why we jump on these types of issues and make every attempt to resolve them ASAP. Additionally, when we land PRs, we consider the potential for regression. For a tool with literally millions of deployments across a wide range of architectures and operating systems, each of which can have huge numbers of varying environments and configurations, we cannot replicate every deployment or scenario.

We released the Docker image of 1.30 today, which means a lot more users may run into this or other issues. I would personally feel better about waiting till early next week to see if anything else has come up before we jump on another release.

@powersj
Contributor

powersj commented Mar 14, 2024

@electrofloat,

One more thing I wanted to mention: are you able to run the nightly build as a test? It would help both of us enormously if you could. That way you could catch issues before we cut a release and relay any problems you find.

I take it you have a large deployment so catching issues earlier would help both of us.

@electrofloat
Author

@powersj Unfortunately no. We have strict rules on what software we can install on prod machines, which only includes stable releases.

I also forgot to mention, though you probably know this already: Debian/Ubuntu has a feature called "phased updates", which apt itself has supported since 2.1.16 (Fri, 08 Jan 2021 22:01:50 +0100). With it, a new package update does not reach all users of the repo at once but rolls out in phases. In the event of a regression, the maintainers can immediately set the phasing back to 0%, which stops the update from being installed.

So maybe in the future you could utilize this feature too.

(As far as I remember, on the server side this only needs a new Phased-Update-Percentage field in the Packages file, as here: http://archive.ubuntu.com/ubuntu/dists/jammy-updates/main/binary-amd64/Packages.gz where you can see that some of the packages are currently phased at varying percentages. All the other 'magic' happens on the client side.)
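
To make the server-side part concrete, here is a minimal Python sketch that scans the text of a Packages index and reports which stanzas carry a Phased-Update-Percentage field. The sample stanzas and package names are invented for illustration, and the function name is hypothetical. It assumes the usual layout of blank-line-separated stanzas of "Key: value" fields.

```python
def phased_packages(packages_text: str) -> dict[str, int]:
    """Map package name -> Phased-Update-Percentage for every stanza
    that carries the field."""
    result = {}
    # Stanzas in a Packages index are separated by blank lines.
    for stanza in packages_text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            # Continuation lines start with whitespace; skip them.
            if line[:1].isspace() or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if "Package" in fields and "Phased-Update-Percentage" in fields:
            result[fields["Package"]] = int(fields["Phased-Update-Percentage"])
    return result


# Invented sample data, not taken from a real archive.
SAMPLE = """\
Package: example-a
Version: 1.0-1
Phased-Update-Percentage: 20
Description: first example
 continuation line: ignored

Package: example-b
Version: 2.0-1
Description: not phased
"""

print(phased_packages(SAMPLE))
```

Running the same function over the decompressed Packages.gz linked above should list whichever packages are being phased at that moment.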

@SebastianThorn
Contributor

SebastianThorn commented Apr 2, 2024

@powersj @srebhan
Hi! Sorry for hijacking the thread.

We run tons of Telegraf instances for different use cases, and can probably set up something that runs the nightly build if that would help you out.
How would you like us to report back to you?

I'll add this to our backlog.

@powersj
Contributor

powersj commented Apr 2, 2024

> can probably set up something that runs nightly if that would help you out.

It absolutely would!

> How would you like us to report back to you?

Any issues that you come across should be filed as issues in this repo.

@JamieSimon2

@powersj @srebhan First, thanks for your work on this bugfix. 🙏

I've evaluated https://repos.influxdata.com/rhel/7/x86_64/stable/telegraf-1.30.1-1.x86_64.rpm (appeared in repo yesterday) and still see this error. Are we too early? I notice that https://github.com/influxdata/telegraf still shows "Latest" as v1.30.0.

Example error:

2024-04-02T15:43:00Z E! [inputs.systemd_units] Error in plugin: listing unit files failed: Rejected send message, 2 matched rules; type="method_call", sender=":1.584082" (uid=1003 pid=387836 comm="/usr/bin/telegraf -config /etc/telegraf/telegraf.c") interface="org.freedesktop.systemd1.Manager" member="ListUnitFilesByPatterns" error name="(unset)" requested_reply="0" destination="org.freedesktop.systemd1" (uid=0 pid=1 comm="/usr/lib/systemd/systemd --switched-root --system ")

@srebhan
Member

srebhan commented Apr 2, 2024

@JamieSimon2 which systemd version is installed? Is the dbus interface running?

@JamieSimon2

JamieSimon2 commented Apr 2, 2024

@srebhan thanks for your quick response! (This is CentOS 7 😢 )

$ systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN
$ ps -ef  |grep dbus
dbus        1537       1  0 Mar26 ?        00:09:36 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation

@srebhan
Member

srebhan commented Apr 2, 2024

> systemd 219

This is the issue: a 9-year-old systemd... ;-)
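
For anyone who wants to check a fleet for hosts in the same situation, a small Python sketch that extracts the version number from `systemctl --version` output might look like this. The function names are hypothetical, and the minimum version used below is an assumption for illustration only; this thread does not state which systemd version the plugin requires.

```python
import re


def systemd_version(version_output: str) -> int:
    """Extract the numeric version from the first line of
    `systemctl --version` output, e.g. 'systemd 219' -> 219."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.match(r"systemd (\d+)", first_line)
    if match is None:
        raise ValueError(f"unexpected systemctl output: {first_line!r}")
    return int(match.group(1))


def meets_minimum(version_output: str, minimum: int) -> bool:
    """Return True if the reported systemd version is at least `minimum`."""
    return systemd_version(version_output) >= minimum


# Output as posted above for the CentOS 7 host (feature flags shortened).
OUTPUT = "systemd 219\n+PAM +AUDIT +SELINUX +IMA -APPARMOR"

# ASSUMPTION: 230 is a placeholder threshold, not confirmed in this thread.
ASSUMED_MINIMUM = 230

print(systemd_version(OUTPUT))
print(meets_minimum(OUTPUT, ASSUMED_MINIMUM))
```

On a live host the input could come from `subprocess.run(["systemctl", "--version"], capture_output=True, text=True).stdout` instead of the hard-coded string.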

@JamieSimon2 could you please open a new issue with the information above? We will discuss internally how we handle the situation...

@JamieSimon2

JamieSimon2 commented Apr 2, 2024

Acknowledged, thank you @srebhan !
Edit: #15093


6 participants