[inputs.systemd_units] Error in plugin: listing unit files failed: context deadline exceeded - after updating to 1.30.0 (git: HEAD@3c03ddcf) #14980

electrofloat · 2024-03-12T17:46:21Z

Relevant telegraf.conf

[[inputs.systemd_units]]

Logs from Telegraf

Mar 12 18:44:42 arb telegraf[1945801]: 2024-03-12T17:44:42Z E! [inputs.systemd_units] Error in plugin: listing unit files failed: context deadline exceeded
Mar 12 18:44:52 arb telegraf[1945801]: 2024-03-12T17:44:52Z E! [inputs.systemd_units] Error in plugin: listing unit files failed: context deadline exceeded
Mar 12 18:45:01 arb telegraf[1945801]: 2024-03-12T17:45:01Z E! [inputs.systemd_units] Error in plugin: listing unit files failed: context deadline exceeded

on a different machine:

Mar 12 18:46:54 arc telegraf[2322492]: 2024-03-12T17:46:54Z E! [inputs.systemd_units] Error in plugin: listing unit states failed: Unit name serial-getty@.service is missing the instance name.

System info

Ubuntu 22.04

Docker

No response

Steps to reproduce

I've just updated to: Telegraf 1.30.0 (git: HEAD@3c03ddcf)

and after restarting telegraf, I'm getting the above error in the logs.

Expected behavior

No error.

Actual behavior

Error logs

Additional info

I had to revert back to: Telegraf 1.29.5 (git: HEAD@138d0d54)


srebhan · 2024-03-13T19:40:26Z

@electrofloat can you please test the binary from #14987? It will be available once CI has finished the tests. Let me know if this fixes the issue!

electrofloat · 2024-03-13T20:20:09Z

Yes. I've upgraded to this version (Telegraf 1.31.0-30d5d365 (git: pull/14987@30d5d365)) and it does not report the errors now. Tested it on both machines.

So it seems to be fixed.

srebhan · 2024-03-13T20:22:38Z

@electrofloat thanks for the quick testing!

jjh74 · 2024-03-14T08:41:00Z

The binary from #14987 also fixes the "Unit name ... is missing the instance name" errors for me (tested on RHEL 8/9 and AlmaLinux 8/9).
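As background on the "missing the instance name" message: in systemd naming, serial-getty@.service is a template unit (nothing between the @ and the suffix), while serial-getty@ttyS0.service is an instance of that template. The following is a minimal Python sketch of that naming rule only. It is not the plugin's code, and the helper names `split_unit_name` and `is_template_unit` are hypothetical.

```python
from typing import Optional, Tuple


def split_unit_name(name: str) -> Tuple[str, Optional[str], str]:
    """Split a unit name into (prefix, instance, suffix).

    instance is None for a plain unit, "" for a template unit,
    and a non-empty string for an instantiated unit.
    """
    stem, dot, suffix = name.rpartition(".")
    if not dot or not stem or not suffix:
        raise ValueError(f"not a unit name: {name!r}")
    prefix, at, instance = stem.partition("@")
    return prefix, (instance if at else None), suffix


def is_template_unit(name: str) -> bool:
    """True for names like 'serial-getty@.service' (empty instance part)."""
    return split_unit_name(name)[1] == ""
```

A template name like this cannot be queried for a runtime state on its own, which is presumably why the request for its state was rejected with the message in the log.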

electrofloat · 2024-03-14T17:20:11Z

@DStrand1 You closed this as completed, but do we know when this will be released?

powersj · 2024-03-14T17:28:48Z

Hi,

It will be part of v1.30.1 on or around April 1:

https://github.com/influxdata/telegraf/blob/master/docs/FAQ.md#when-is-the-next-release-when-will-my-pr-or-fix-get-released

electrofloat · 2024-03-14T17:30:10Z

April 1? But this is a regression in 1.30. How is this not fixed and released immediately?

srebhan · 2024-03-14T18:20:44Z

@electrofloat you can use a nightly build starting from tomorrow.

> How is this not fixed and released immediately?

This was fixed within two days! What do you expect? You have three options: use 1.29.5 until the release, use the binary from the PR, or use a nightly build starting tomorrow. We do not have the resources to bake a release for every single commit to master!

electrofloat · 2024-03-14T18:54:25Z

I expect a new release after a regression fix!

What you guys need to understand is that on a Debian-based system like Ubuntu, you install software by adding an apt line to a sources.list file (exactly as your docs describe for Ubuntu) and then using apt.

Now, to upgrade packages on Debian-based systems, you run a command like apt-get update && apt-get upgrade. Since there is a new Telegraf release that is known to be bad, every time I want to upgrade my packages I either have to remove the Telegraf line from the sources list or put the package on hold. With both workarounds it is all but guaranteed that someone forgets to undo them, and the user ends up stuck with an old package full of security holes.

This is the second time in a short window that a new release has broken previously working functionality.

So the problem is not a slow patch; the problem is that we have to wait 3 weeks for a release before we can upgrade our package.

powersj · 2024-03-14T19:24:06Z

> What you guys need to understand is that on a Debian-based system like Ubuntu, you install software by adding an apt line to a sources.list file (exactly as your docs describe for Ubuntu) and then using apt.

We are well aware of how package managers work. As you also mention, they provide mechanisms for you to avoid package versions with issues.

As Sven already said, we do not release a new version for every single fix, security issue, or regression. Throughout its history, Telegraf has used time-based releases with great success. When issues do arise, users can run a nightly, use a custom build, or revert to a previous version, whether they use our package repo, download tarballs, or use the official Docker images.

> This is the second time in a short window that a new release has broken previously working functionality.

Yes, it is, and I can tell you we hate when this happens; it literally keeps us up at night after a release. It is why we jump on these types of issues and make every attempt to resolve them ASAP. Additionally, when we are landing PRs, we consider the potential for regression. For a tool with literally millions of deployments across a wide range of architectures and operating systems, each of which can have huge numbers of varying environments and configurations, we cannot replicate every deployment or scenario.

We released the Docker image of 1.30 today, which means a lot more users may run into this or other issues. I would personally feel better about waiting till early next week to see if anything else has come up before we jump on another release.

powersj · 2024-03-14T23:15:02Z

@electrofloat,

One more thing I wanted to mention: do you have the ability to run the nightly build as a test? It would help both us and you enormously if you could. That way you could catch issues before we do a release and relay any problems you run into.

I take it you have a large deployment so catching issues earlier would help both of us.

electrofloat · 2024-03-15T08:33:09Z

@powersj Unfortunately no. We have strict rules on what software we can install on prod machines, which only includes stable releases.

I also forgot to mention, though you probably know this already: Debian/Ubuntu has a feature called "phased updates", which apt itself has supported since 2.1.16 (Fri, 08 Jan 2021 22:01:50 +0100). It means a new package update does not reach all repo users at once, but in phases. In the event of a regression, the phasing can immediately be set back to 0%, which stops the update from being installed.

So maybe in the future you could utilize this feature too.

(As far as I remember, on the server side this only needs a new Phased-Update-Percentage field in the Packages file, as in http://archive.ubuntu.com/ubuntu/dists/jammy-updates/main/binary-amd64/Packages.gz, where you can see that some packages are currently phased at various percentages. All the rest of the 'magic' happens on the client side.)
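To make the Packages-file remark concrete, here is a small Python sketch that lists which stanzas of a Packages index carry a Phased-Update-Percentage field. It assumes the usual "Field: value" stanza layout with blank lines between packages; the function name `phased_packages` and the sample stanzas are made up for illustration and are not taken from any real archive.

```python
from typing import Dict


def phased_packages(packages_text: str) -> Dict[str, int]:
    """Map package name -> Phased-Update-Percentage for phased stanzas."""
    result: Dict[str, int] = {}
    for stanza in packages_text.strip().split("\n\n"):
        fields: Dict[str, str] = {}
        for line in stanza.splitlines():
            # Skip continuation lines (they start with whitespace)
            # and anything that is not a "Field: value" pair.
            if line[:1] in (" ", "\t") or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if "Package" in fields and "Phased-Update-Percentage" in fields:
            result[fields["Package"]] = int(fields["Phased-Update-Percentage"])
    return result


# Illustrative input only (hypothetical packages).
SAMPLE = """\
Package: example-a
Version: 1.0-1
Phased-Update-Percentage: 30

Package: example-b
Version: 2.0-1
Description: not phased
 continuation line: ignored
"""
```

Running `phased_packages` over the decompressed contents of the Packages.gz linked above would show which packages are being phased at the moment and at what percentage.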

SebastianThorn · 2024-04-02T12:26:21Z

@powersj @srebhan
Hi! Sorry for hijacking the thread.

We run tons of telegraf instances for different use-cases, and can probably set up something that runs nightly if that would help you out.
How would you like us to report back to you?

I'll add this to our backlog.

powersj · 2024-04-02T13:21:56Z

> can probably set up something that runs nightly if that would help you out.

It absolutely would!

> How would you like us to report back to you?

Any issues that you come across should be filed as issues in this repo.

JamieSimon2 · 2024-04-02T15:57:49Z

@powersj @srebhan First, thanks for your work on this bugfix. 🙏

I've evaluated https://repos.influxdata.com/rhel/7/x86_64/stable/telegraf-1.30.1-1.x86_64.rpm (appeared in repo yesterday) and still see this error. Are we too early? I notice that https://github.com/influxdata/telegraf still shows "Latest" as v1.30.0.

Example error:

2024-04-02T15:43:00Z E! [inputs.systemd_units] Error in plugin: listing unit files failed: Rejected send message, 2 matched rules; type="method_call", sender=":1.584082" (uid=1003 pid=387836 comm="/usr/bin/telegraf -config /etc/telegraf/telegraf.c") interface="org.freedesktop.systemd1.Manager" member="ListUnitFilesByPatterns" error name="(unset)" requested_reply="0" destination="org.freedesktop.systemd1" (uid=0 pid=1 comm="/usr/lib/systemd/systemd --switched-root --system ")

srebhan · 2024-04-02T16:08:33Z

@JamieSimon2 which systemd version is installed? Is the dbus interface running?

JamieSimon2 · 2024-04-02T16:13:08Z

@srebhan thanks for your quick response! (This is Centos7 😢 )

$ systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

$ ps -ef  |grep dbus
dbus        1537       1  0 Mar26 ?        00:09:36 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation

srebhan · 2024-04-02T17:04:44Z

> systemd 219

This is the issue: a nine-year-old systemd... ;-)

@JamieSimon2 could you please open a new issue with the information above? We will discuss internally how we handle the situation...
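For anyone who wants to check a fleet for hosts in the same situation, the version can be read from the first line of `systemctl --version`, as in the output above. This is only a sketch: `MIN_SYSTEMD` is an assumption about which systemd release first offered the ListUnitFilesByPatterns call and is not stated anywhere in this thread, so verify it against the systemd release notes before relying on it. Both helper names are hypothetical.

```python
import re

# ASSUMPTION: placeholder threshold, not confirmed in this thread.
MIN_SYSTEMD = 230


def parse_systemd_version(version_output: str) -> int:
    """Extract the major version from `systemctl --version` output."""
    match = re.match(r"systemd (\d+)", version_output)
    if match is None:
        raise ValueError("unexpected `systemctl --version` output")
    return int(match.group(1))


def is_too_old(version_output: str, minimum: int = MIN_SYSTEMD) -> bool:
    """True if the reported systemd version is below `minimum`."""
    return parse_systemd_version(version_output) < minimum
```

On a live host the input could come from `subprocess.run(["systemctl", "--version"], capture_output=True, text=True).stdout`.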

JamieSimon2 · 2024-04-02T17:12:21Z

Acknowledged, thank you @srebhan !
Edit: #15093

electrofloat added the "bug" label (unexpected problem or unintended behavior) Mar 12, 2024

srebhan self-assigned this Mar 12, 2024

powersj mentioned this issue Mar 13, 2024

[inputs.systemd_units] Error in plugin: listing unit states failed: Unit name ... is missing the instance name #14984

Closed

srebhan mentioned this issue Mar 13, 2024

fix(inputs.systemd_units): Handle disabled multi-instance units correctly #14987

Merged


srebhan added the "regression" label (something that used to work, but is now broken) Mar 13, 2024

DStrand1 closed this as completed in #14987 Mar 14, 2024

JamieSimon2 mentioned this issue Apr 2, 2024

inputs.systemd_units "Error in plugin: listing unit files failed" in v1.30.0 and v1.30.1 #15093

Closed
