Reloads seem to reset the check attempt count. Also notifications go missing shortly after a reload. #6592
Comments
Ensure that the In terms of the notification 0 user problem, you should investigate further but not as part of this issue. |
Thanks for the input. The file is intact and readable. I'm using the command "systemctl reload icinga2". How can I ensure that the icinga2.state file is read? I searched the docs and the internet but sadly found no clue. |
bump |
No need for bumping issues, we'll reply once there's time to. Once the state file is read, the log will write something like this: |
Sorry for the bumping. I can't find the related log entries anywhere in the syslog or in the debuglog. I temporarily enabled the mainlog feature, did a reload (systemctl reload icinga2) and looked there too. Nothing. To have a reference I looked for those log entries in other GitHub issues. Should be something like this:
I have never seen them before in our environment. This is the syslog output during a reload:
The content of /lib/systemd/system/icinga2.service looks like this:
Is there a chance that the icinga2.state file is being ignored? |
We have a similar issue with reloads. We have a check that monitors whether the check results of all services are fresh by using the internal state of the objects (e.g. service objects So I highly suspect that the problem is not in reading the state file; rather, it seems that the state file is not updated on a clean shutdown (reload). If this is true and I got this right, it would mean Icinga2 currently loses up to 5 min of all state during every restart / reload. |
I can verify your observation. The icinga2.state file is not updated before/during a reload. |
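One way to confirm this observation without watching the file by hand is to compare the state file's modification time before and after a reload. The following is a minimal Python sketch, not part of Icinga 2; the helper name `state_file_updated` and the idea of passing the reload as a callable are assumptions made for illustration.

```python
import os
from typing import Callable


def state_file_updated(path: str, action: Callable[[], None]) -> bool:
    """Return True if the file at `path` has a newer mtime after `action` ran.

    `action` is a hypothetical callable that triggers the reload and
    blocks until the reload has finished.
    """
    before = os.stat(path).st_mtime_ns
    action()
    # stat() by path also works if the file was replaced via write + rename.
    after = os.stat(path).st_mtime_ns
    return after > before
```

For a real run, `action` could call `subprocess.run(["systemctl", "reload", "icinga2"], check=True)` and then wait long enough for the reload to complete, with `path` set to `/var/lib/icinga2/icinga2.state`.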
I think we have a similar issue. We have three different kinds of notifications:
This is what happened:
I am still looking for a way to reproduce this in a reliable way. It would be also interesting to be able to check how often something like this happens. So if anyone has an idea I am happy to hear it :) Some additional context that could be useful: This happened with Icinga2 2.9.1. (In the meantime we updated to 2.9.2.) According to the Icinga2 log it dumps the current state every 5 minutes, but not immediately before the reload, with the following line:
We do not have the following line anywhere in our logs:
|
@ekeih It's important to understand that the data (state of a check, last check result, ack, etc.) in the idodb (icingaweb2) is not necessarily the same as internally in icinga2. icinga2 does not read from the idodb; it only exports the state to it. After a restart, icinga2 depends on the state from the state file, which, if my observation is true, is not updated during a restart but only periodically. If it's true that icinga2 loses the state during reloads (it is reset back to the state at the time the state file was last updated, which happens every 5 min afaik), then you'll see the correct number of check attempts / hard state, etc. in icingaweb2, but icinga2 internally will not know that the service was already in a hard state, for example. What I don't know is whether IDO will remove e.g. an ACK when it is gone in icinga2's internal state (because of the issue we suspect here). This could partly explain the situation you described. |
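The suspected mechanism can be illustrated with a toy model. This is only a sketch of the hypothesis discussed in this comment, not Icinga 2's actual implementation; the class `ToyDaemon`, its attribute names, and the `dump_on_shutdown` switch are all invented for illustration.

```python
import copy


class ToyDaemon:
    """Toy model: in-memory state plus a snapshot that is persisted periodically."""

    def __init__(self) -> None:
        self.state = {"check_attempt": 1, "acknowledged": False}
        self.snapshot = copy.deepcopy(self.state)

    def periodic_dump(self) -> None:
        # Stands in for the periodic state file write.
        self.snapshot = copy.deepcopy(self.state)

    def reload(self, dump_on_shutdown: bool) -> None:
        # A reload restores in-memory state from whatever was persisted last.
        if dump_on_shutdown:
            self.periodic_dump()
        self.state = copy.deepcopy(self.snapshot)


daemon = ToyDaemon()
daemon.periodic_dump()
# Changes that happen after the last periodic dump:
daemon.state["check_attempt"] = 3
daemon.state["acknowledged"] = True
daemon.reload(dump_on_shutdown=False)
print(daemon.state)  # {'check_attempt': 1, 'acknowledged': False}
```

With `dump_on_shutdown=True` the same sequence keeps the attempt count and the acknowledgement, which is the behavior a stateful reload would be expected to show.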
I'd like to get an idea of what's going on, so it is essential to know the specifics of your platform. This includes the following:
does the timestamp change during reload?
Are there any temporary files written with a newer timestamp, e.g.
Can you strace the processes, and look specifically for file write operations done during the reload?
|
icinga2 --version
We use an HA setup, which consists of 2 physical machines.
sestatus
icinga2 feature list
Everything works fine so far. I found nothing unusual.
icinga2 daemon -C
A reload takes about 20 seconds.
watch -n 1 'ls -la /var/lib/icinga2/icinga2.state'
watch -n 1 'ls -la /var/lib/icinga2/icinga2.state*'
strace -o strace.icinga2.log -e trace=open -p
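To search a trace like this for write access to the state file, the log can be filtered for `open`/`openat` calls that use a write flag. The sketch below is a hedged Python example: the regular expression assumes strace's usual one-call-per-line output, and the helper name `state_file_writes` as well as the sample lines in the usage note are made up for illustration, not taken from a real trace in this thread.

```python
import re
from typing import Iterable, List

# Matches lines such as:
#   open("/some/path", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 5
#   openat(AT_FDCWD, "/some/path", O_RDONLY|O_CLOEXEC) = 3
OPEN_RE = re.compile(
    r'\bopen(?:at)?\((?:[^,"]+,\s*)?"(?P<path>[^"]*)",\s*(?P<flags>[A-Z_|0-9x]+)'
)
WRITE_FLAGS = ("O_WRONLY", "O_RDWR")


def state_file_writes(
    lines: Iterable[str], prefix: str = "/var/lib/icinga2/icinga2.state"
) -> List[str]:
    """Return the paths under `prefix` that were opened with a write flag."""
    hits = []
    for line in lines:
        match = OPEN_RE.search(line)
        if not match:
            continue
        flags = match.group("flags").split("|")
        if match.group("path").startswith(prefix) and any(f in WRITE_FLAGS for f in flags):
            hits.append(match.group("path"))
    return hits
```

Usage could look like `state_file_writes(open("strace.icinga2.log"))`. An empty result for the time window of the reload would support the suspicion that the state file is not written at that point.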
|
Interesting, thanks. I'm not able to reproduce this locally, I can see that the temporary state file is written and renamed at the exact reload time, including an open call in dtrace. Likely it is a race condition when the child process with the validation takes too much time, and kills the parent process for some reason. I'd still like to hear the setups from @marcofl and @ekeih to get a better picture. |
I think the problem could be that puppet (https://github.com/Icinga/puppet-icinga2/blob/master/manifests/service.pp#L36, https://github.com/Icinga/puppet-icinga2/blob/master/manifests/params.pp#L98) uses This would also explain why we see this issue after changing from our fork of the old icinga2 puppet module to the new one.
and nothing about this in the log (just the periodic writes)
|
Indeed. With |
So as @marcofl and @K0nne already pointed out a restart may work better than a reload. But interestingly Icinga2 log did not log
Update: I just realized that my strace during a reload prints out |
Thanks for the details, I wanted to get an overview if this is distribution agnostic, or maybe related to special signal handling with Systemd. One of our community members opened #6689 which includes a fix, similar to what I already had in mind from my analysis. Likely last week was too much stress, today I've found a reliable way to reproduce the issue. You can find the steps to reproduce and test protocol in #6689, a PR is coming soon. Please test the snapshot packages - I'll trigger them for el7 and Ubuntu 16 and let you know once available. |
Credits to @west0rmann finding the issue and providing the initial fix. fixes #6689 fixes #6592
Packages are available, please report from your tests :-) |
I can confirm that with the version 2.10.0+18.gc0398ed the state file is changed during a reload. What I don't see is the new log message implemented in https://github.com/Icinga/icinga2/pull/6691/files
|
Maybe the logs are flushed too late and then exit() kills the stream prior to writing it. Did you test the thing with sending an acknowledgement during the reload time too? |
Our issue with having old check results internally in icinga2 after a reload seems to be gone with this. I also did things like: disable active checks for a service -> reload -> active checks still disabled -> enable active checks -> reload -> wait for the next check -> a new check result was there. I also acked a critical service, and the ack was not gone after a reload. I tend to say that this fixed the issue, but maybe other reporters of this issue can confirm this too, please. Would you say this snapshot (2.10.0+18.gc0398ed) is okay to use in production until a release? |
Ok, thanks for testing :) The snapshot packages only contain a regression fix for the API and a Windows build fix. Nothing critical, although I wouldn't recommend snapshot packages in production. Once there's more feedback, we're planning to release 2.10.1 soon. |
Hi, sorry for my late response - I was out of office the last two days. Thanks to everyone for this fix! :) |
We push new configurations into icinga on an hourly basis. After this, a config validation and an icinga2 reload are performed. The script that does this runs every 11th minute. Our environment contains a master zone and 7 satellite zones. The masters and slaves in each zone are installed as HA pairs of two machines. We have 17k hosts and 73k services.
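The validate-then-reload step described above could be sketched roughly as follows. This is not the reporter's actual script; the function names and the injectable `run` parameter are hypothetical, and the sketch assumes that `icinga2 daemon -C` signals a failed validation with a non-zero exit code.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def validate_and_reload(run: Runner = _run) -> bool:
    """Reload icinga2 only if the configuration validates.

    Returns True if both the validation and the reload succeeded.
    """
    if run(["icinga2", "daemon", "-C"]) != 0:
        return False  # keep the running daemon untouched on invalid config
    return run(["systemctl", "reload", "icinga2"]) == 0
```

Passing the runner as a parameter keeps the logic testable without touching a live system; in production the default runner would be used.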
We recently noticed a few nasty side effects from those reloads:
Sometimes the check attempt count seems to be set back during a reload:
https://i.imgur.com/RcJfd8S.png
Shortly after a reload there's also a point in time where notifications are not being sent:
https://monitoring-portal.org/uploads/default/original/2X/1/1a2e69717db5d8e2382e64a6eb2ec8d44208b324.png
I already discussed this matter at monitoring-portal, where michi dropped an interesting hint in post #5.
Expected Behavior
Reloads should be stateful.
Current Behavior
Reloads seem to reset the check attempt count.
Also, notifications are not being sent shortly after a reload.
Possible Solution
unknown
Context
Both issues are a big problem for us because they undermine the trust in the monitoring solution :~
Your Environment
icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.1-1)
Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
System information:
Platform: Red Hat Enterprise Linux Server
Platform version: 7.4 (Maipo)
Kernel: Linux
Kernel version: 3.10.0-693.21.1.el7.x86_64
Architecture: x86_64
Build information:
Compiler: GNU 4.8.5
Build host: unknown
RHEL 7.4
icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb livestatus mainlog opentsdb statusdata
Enabled features: api checker ido-mysql notification perfdata syslog
Icinga Web 2 Version
2.6.0
Git Commit
cfe6c7b06587189b3ef688183cacd32594db071a
Git Commit Date
2018-07-19
Copyright
© 2013-2018 Das Icinga Projekt
Loaded Modules
Name Version
businessprocess 2.1.0
monitoring 2.6.0
pnp 1.1.0
icinga2 daemon -C
[2018-09-04 18:24:32 +0200] information/cli: Icinga application loader (version: r2.9.1-1)
[2018-09-04 18:24:32 +0200] information/cli: Loading configuration file(s).
[2018-09-04 18:24:33 +0200] information/ConfigItem: Committing config item(s).
[2018-09-04 18:24:33 +0200] information/ApiListener: My API identity: dxzmicinga01
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 73387 Services.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 SyslogLogger.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 42 HostGroups.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 81900 Notifications.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 17060 Hosts.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 PerfdataWriter.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 12 Zones.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 17 Endpoints.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 4 ApiUsers.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 249 CheckCommands.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 3 TimePeriods.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 User.
[2018-09-04 18:24:39 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-09-04 18:24:39 +0200] information/cli: Finished validating the configuration file(s).