Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Systemd does not see all shutdowns as failures #2716

Merged
merged 1 commit into from
May 8, 2017

Conversation

djsly
Copy link
Contributor

@djsly djsly commented Apr 25, 2017

Fixes: #2717

[root@<hostname> ~]# systemctl status telegraf -l
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2017-04-25 10:33:40 EDT; 14min ago
     Docs: https://github.com/influxdata/telegraf
 Main PID: 7857 (code=killed, signal=PIPE)

This occurs when we restart systemd-journald, which cause a crash/stop of telegraph seen as a STDPIPE

See: https://bugs.freedesktop.org/show_bug.cgi?id=84923#c9
With affected services like Consul: hashicorp/consul#1688
and Prometheus: https://github.com/voxpupuli/puppet-prometheus/pull/3/files

Required for all PRs:

  • CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
  • Sign CLA (if not already signed)

@danielnelson
Copy link
Contributor

Would it make sense to use RestartForceExitStatus=SIGPIPE to reduce side effects?

@djsly
Copy link
Contributor Author

djsly commented Apr 25, 2017

I could test to ensure it is doing the same work.
Note that I found those issues under influxData

influxdata/docs.influxdata.com-ARCHIVE#548
where influxDB / telegraf / kapacitor were reported to have the issue.

I can't seem to find the DOC changes though... And the provided fix is to put always
@rossmcdonald do you know where I can find the DOC referencing this issue ?

@djsly
Copy link
Contributor Author

djsly commented Apr 25, 2017

@danielnelson I have tested with RestartForceExitStatus=SIGPIPE it restart telegraf as well.
Update the PR

@danielnelson
Copy link
Contributor

danielnelson commented Apr 25, 2017

So, it looks like no code change was made in influxdb or kapacitor, and in influxdata/influxdb#7040 there was some concern that this option is no available on older systemd versions. Perhaps enough time has passed that this is no longer a concern?

@rkuchan Do you know if the documentation change happened?

@djsly
Copy link
Contributor Author

djsly commented Apr 25, 2017

I tested on CentOS 7.3 and the version of systemd is still problematic (centOS 7.1 is running the same version)

systemctl --version
systemd 219

I understand that the latest verison of systemd might be fixing the issue but until the latest releases of Os distribution start using it, it could be a pain to have to manage the unit file for each of the services...

@rkuchan
Copy link
Contributor

rkuchan commented Apr 25, 2017

@danielnelson - I can't find the doc change anywhere. Want me to reopen the issue?

@danielnelson
Copy link
Contributor

@rkuchan Yeah

@danielnelson
Copy link
Contributor

According to influxdata/influxdb#7040 (comment), the RestartForceExitStatus option was added in systemd 215. Do we know what happens on versions before 215 if we use this option?

@djsly
Copy link
Contributor Author

djsly commented Apr 25, 2017

@danielnelson: Unfortunately, I don't have a system with systemd --version 215 to test this on.

@danielnelson
Copy link
Contributor

Looks like Debian jessie has 215, but SLES 12 has only 210. We should fire up a VM and check if adding the option causes any issues, if it is okay I think we can merge this.

@rkuchan Actually lets wait and see how things go on SLES before adding to the docs.

@danielnelson
Copy link
Contributor

SLES 12 SP2 actually has systemd-228-117.12.x86_64, so I think we are good.

@djsly Can you add this to the 1.4 changelog? I will probably port this over to the 1.3 branch after 1.3.0 is released.

No change will be needed in the docs since this should solve the problem for all of our targets.

@djsly
Copy link
Contributor Author

djsly commented May 8, 2017

@danielnelson let me know if this is good. Thanks

@danielnelson
Copy link
Contributor

Can you put the changelog entry in the 1.4 section, I'll move it to 1.3.x later if it goes in. Looks like you need to rebase to get rid of all the unrelated commits.

@djsly djsly force-pushed the sigpipe-support branch 2 times, most recently from 628f3d4 to 7564f16 Compare May 8, 2017 17:59
@djsly
Copy link
Contributor Author

djsly commented May 8, 2017

@danielnelson Done. Sorry I guess when I pull I didnt get the latest CHANGELOG.
Everything should be good now.

@@ -8,7 +8,8 @@ EnvironmentFile=-/etc/default/telegraf
User=telegraf
ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d ${TELEGRAF_OPTS}
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
Restart=always
RestartForceExitStatus=SIGPIPE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want one or the other here, not both?
With Restart=always, RestartForceExitStatus doesn't have any effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, sorry , got confused doing merge conflict while multi tasking . thanks for catching this up!

````
[root@<hostname> ~]# systemctl status telegraf -l
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2017-04-25 10:33:40 EDT; 14min ago
     Docs: https://github.com/influxdata/telegraf
 Main PID: 7857 (code=killed, signal=PIPE)
````

This occurs when we restart `systemd-journald`, which cause a crash/stop of telegraph seen as a STDPIPE
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants