Reloads seem to reset the check atempt count. Also notifications go missing shortly after a reload. #6592

K0nne · 2018-09-04T16:28:21Z

We push new configurations on an hourly base into icinga. After this a config validation and a icinga2 reload are performed. The script, which does this runs every 11th minute. Our environment contains a master zone and 7 satellite zones. The masters and slaves in each zone are installed as ha-pairs of two machines. We have 17k hosts and 73k services.

We recently noticed a few, but nasty side effects from those reloads:

Sometimes the check attempt count seems to be set back during a reload:

https://i.imgur.com/RcJfd8S.png

Shortly after a reload there's also a point in time where notifications are not being sent:

https://monitoring-portal.org/uploads/default/original/2X/1/1a2e69717db5d8e2382e64a6eb2ec8d44208b324.png

I allready discuss this matter at monitoring-portal where michi dropped an interesting hint at post #5.

Expected Behavior

Reloads should be statefull.

Current Behavior

reloads seem to reset the check attempt count.
also notifications are not being sent shortly after a reload.

Possible Solution

unknown

Context

both problems are a big problem for us. because they undermine the trust in the monitoring solution :~

Your Environment

Version used (icinga2 --version):

icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.1-1)

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

System information:
Platform: Red Hat Enterprise Linux Server
Platform version: 7.4 (Maipo)
Kernel: Linux
Kernel version: 3.10.0-693.21.1.el7.x86_64
Architecture: x86_64

Build information:
Compiler: GNU 4.8.5
Build host: unknown

Operating System and version:

RHEL 7.4

Enabled features (icinga2 feature list):

Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb livestatus mainlog opentsdb statusdata
Enabled features: api checker ido-mysql notification perfdata syslog

Icinga Web 2 version and modules (System - About):

Icinga Web 2 Version
2.6.0
Git Commit
cfe6c7b06587189b3ef688183cacd32594db071a
Git Commit Datum
2018-07-19
Copyright
© 2013-2018 Das Icinga Projekt

Geladene Module
Name Version
businessprocess 2.1.0
monitoring 2.6.0
pnp 1.1.0

Config validation (icinga2 daemon -C):

[2018-09-04 18:24:32 +0200] information/cli: Icinga application loader (version: r2.9.1-1)
[2018-09-04 18:24:32 +0200] information/cli: Loading configuration file(s).
[2018-09-04 18:24:33 +0200] information/ConfigItem: Committing config item(s).
[2018-09-04 18:24:33 +0200] information/ApiListener: My API identity: dxzmicinga01
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 73387 Services.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 SyslogLogger.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 42 HostGroups.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 81900 Notifications.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 17060 Hosts.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 PerfdataWriter.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 12 Zones.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 17 Endpoints.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 4 ApiUsers.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 249 CheckCommands.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 3 TimePeriods.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 User.
[2018-09-04 18:24:39 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-09-04 18:24:39 +0200] information/cli: Finished validating the configuration file(s).

The text was updated successfully, but these errors were encountered:

dnsmichi · 2018-09-04T16:35:19Z

Ensure that the icinga2.state file is intact and read during reloads, as otherwise you're loosing runtime data.

In terms of the notification 0 user problem, you should investigate further but not as part of this issue.

K0nne · 2018-09-05T13:05:01Z

Thanks for the input. The file is intact and readable. I'm using the command "systemctl reload icinga2". How can I ensure that the icinga2.state file is read? I searched the docs and the internet but sadly found no clue.

K0nne · 2018-09-12T17:58:48Z

bump

dnsmichi · 2018-09-14T17:41:14Z

No need for bumping issues, we'll reply once there's time to.

Once the state file is read, the log will write something like this: Restored X objects. Loaded Y new objects without state..

K0nne · 2018-09-19T08:40:24Z

Sorry for the bumping.

I can't find the relating log entries anywhere in the syslog or in the debuglog. I temporarely enabled the mainlog feature, did a reload (systemctl reload icinga2) and looked there too. Nothing.

To have a reference I looked for those log entries in other github issues. Should be something like this:

information/ConfigObject: Restoring program state from file '/var/lib/icinga2/icinga2.state'

I have never seen them before in our environment.

This is the syslog output during a reload:

Sep 19 08:24:12 icingamaster01 systemd: Removed slice User Slice of root.
Sep 19 08:24:12 icingamaster01 systemd: Stopping User Slice of root.
Sep 19 08:24:13 icingamaster01 xinetd[1899]: EXIT: csync2 status=0 pid=53087 duration=11(sec)
Sep 19 08:24:18 icingamaster01 safe-reload: Validating config files: Done
Sep 19 08:24:18 icingamaster01 safe-reload: Reloading Icinga 2: Done
Sep 19 08:24:18 icingamaster01 systemd: Reloaded Icinga host/service/network monitoring system.
Sep 19 08:24:26 icingamaster01 systemd: icinga2.service: Supervising process 53355 which is not our child. We'll most likely not notice when it exits.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone1' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone2' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'global-templates' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone3' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone4' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'linux-commands' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'master' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'windows-commands' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone5' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone6' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone7' because we have an authoritative version of the zone's config.

The content of /lib/systemd/system/icinga2.service looks like this:

[Unit]
Description=Icinga host/service/network monitoring system
After=syslog.target network-online.target postgresql.service mariadb.service carbon-cache.service carbon-relay.service

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/icinga2
ExecStartPre=/usr/lib/icinga2/prepare-dirs /etc/sysconfig/icinga2
ExecStart=/usr/sbin/icinga2 daemon -e ${ICINGA2_ERROR_LOG}
PIDFile=/run/icinga2/icinga2.pid
ExecReload=/usr/lib/icinga2/safe-reload /etc/sysconfig/icinga2
TimeoutStartSec=30m

# Systemd >228 enforces a lower process number for services.
# Depending on the distribution and Systemd version, this must
# be explicitly raised. Packages will set the needed values
# into /etc/systemd/system/icinga2.service.d/limits.conf
#
# Please check the troubleshooting documentation for further details.
# The values below can be used as examples for customized service files.

#TasksMax=infinity
#LimitNPROC=62883

[Install]
WantedBy=multi-user.target

Is there a chance that the icinga2.state file is being ignored?

marcofl · 2018-10-10T14:19:58Z

We have a similar issue with reloads. We have a check that monitors if check results of all services are fresh by using the internal state of the objects (e.g. service objects last_check attribute). after reloads / restarts these values are old (up to 5min, from the time where the last Dumping program state to file '/var/lib/icinga2/icinga2.state' appeared in the log).

So I highly suspect that it's not a problem with reading the state file, more it seems that the state file is not updated on a clean shutdown (reload).

If this is true and I got this right this would mean Icinga2 currently looses up to 5min of all state during every restart / reload.

K0nne · 2018-10-11T07:30:28Z

I can verify your observation. The icinga2.state file is not updated before/during a reload.

ekeih · 2018-10-11T13:21:19Z

I think we have a similar issue.

We have three different kinds of notifications:

firstlevel: gets a notification as soon as the check is in hard state, gets renotified every 15 minutes
secondlevel: gets a notification after 30 minutes when the check is still unhandled, gets renotified every 15 minutes
thirdlevel: gets a notification after 60 minutes when the check is still unhandled, gets renotified every 30 minutes

This is what happened:

At 12:52:38 Icinga2 sent a notification to our firstlevel.
At 12:53:05 we acknowledged the alert.
At 12:57:27 the acknowledgment was removed by Icinga2 according to the Icingaweb2 history. We had a puppet run at 12:52:00 which resulted in a Icinga2 reload around 12:57.
The check was still critical!
No additional reminder notification to our firstlevel or secondlevel happened! The correct behaviour would have been a notification every 15 minutes. (This made the situation even worse.)
At 14:05:15 Icinga2 started to send notifications to our thirdlevel.
Icinga2 repeated the thirdlevel notifications (correctly) every 30 minutes.
After our thirdlevel notified our firstlevel manually we fixed the issue.
After the recovery Icinga2 sent recovery notifications to our firstlevel and our thirdlevel.

I am still looking for a way to reproduce this in a reliable way. It would be also interesting to be able to check how often something like this happens. So if anyone has an idea I am happy to hear it :)

Some additional context that could be useful:

This happened with Icinga2 2.9.1. (In the meantime we updated to 2.9.2.)
We have a HA setup with two masters and 3x2 satellites. Puppet generates the config on master01 which synchronizes the config to master02. When both masters are up Icinga2 selects master02 to write into the IDODB.

According to the Icinga2 log it dumps the current state every 5 minutes, but not immediately before the reload, with the following line:

information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'

We do not have the following line anywhere in our logs:

information/ConfigObject: Restoring program state from file '/var/lib/icinga2/icinga2.state'

marcofl · 2018-10-12T08:14:33Z

@ekeih it's important to understand that the data (state of a check, last check result, ack, etc.) in the idodb (icingaweb2) is not necessarily the same as internally in icinga2. icinga2 does not read from the idodb, it only exports the state to it. after a restart, icinga2 depends on the state from the state file. which, if my observation is true, is not updated during a restart but periodically only.

if it's true that icinga2 looses the state during reloads (is reset back to the state where the state file was updated, which happens every 5min afaik) then you'll see the correct number of check attempts / hard state, etc. in icingaweb2 but icinga2 internally will not know that the service was already in a hard state for example.

What I don't know is whether IDO will remove e.g. an ACK when it is gone in icinga2's internal state (because of the issue we suspect here). This could partly explain the situation you described.

dnsmichi · 2018-10-12T08:33:48Z

I'd like to get an idea what's going on, so it is essential to know about the specialities on your platform. This includes the following:

icinga2 --version output
System health, are there any security limits, limits on I/O
Which features are enabled and running on the same box, do they take more CPU / IO than usual?
How many objects are involved here (icinga2 daemon -C output).
How long does a reload typically take?

watch -n 1 'ls -la /var/lib/icinga2/icinga2.state'

does the timestamp change during reload?

systemctl reload icinga2

Are there any temporary files written with a newer timestamp, e.g. icinga2.state.23Wckf?

watch -n 1 'ls -la /var/lib/icinga2/icinga2.state*'

Can you strace the processes, and look specifically for file write operations done during the reload?

pidof icinga2
<pick one>
strace -o strace.icinga2.log -e trace=open -p <PID>

K0nne · 2018-10-12T09:42:10Z

icinga2 --version

icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.1-1)

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Red Hat Enterprise Linux Server
  Platform version: 7.4 (Maipo)
  Kernel: Linux
  Kernel version: 3.10.0-693.21.1.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: unknown

We use a HA-Setup, which consists of 2 physical Machines.

sestatus

SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   permissive
Mode from config file:          permissive
Policy MLS status:              enabled
Policy deny_unknown status:     allowed

icinga2 feature list

Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb livestatus mainlog opentsdb statusdata
Enabled features: api checker ido-mysql notification perfdata syslog

Everything works fine so far. I found nothing unusual.

icinga2 daemon -C

[2018-10-12 11:22:47 +0200] information/cli: Icinga application loader (version: r2.9.1-1)
[2018-10-12 11:22:47 +0200] information/cli: Loading configuration file(s).
[2018-10-12 11:22:48 +0200] information/ConfigItem: Committing config item(s).
[2018-10-12 11:22:49 +0200] information/ApiListener: My API identity: dxzmicinga01
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 74463 Services.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 SyslogLogger.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 54 HostGroups.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 90291 Notifications.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 17485 Hosts.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 PerfdataWriter.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 76 Comments.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 12 Zones.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 17 Endpoints.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 5 ApiUsers.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 255 CheckCommands.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 3 TimePeriods.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 User.
[2018-10-12 11:22:55 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-10-12 11:22:55 +0200] information/cli: Finished validating the configuration file(s).

A reload takes about 20 seconds.

watch -n 1 'ls -la /var/lib/icinga2/icinga2.state'

no timestamp-changes

watch -n 1 'ls -la /var/lib/icinga2/icinga2.state*'

no temp-files

strace -o strace.icinga2.log -e trace=open -p

+++ exited with 0 +++

dnsmichi · 2018-10-12T12:26:05Z

Interesting, thanks. I'm not able to reproduce this locally, I can see that the temporary state file is written and renamed at the exact reload time, including an open call in dtrace. Likely it is a race condition when the child process with the validation takes too much time, and kills the parent process for some reason.

I'd still like to hear the setups from @marcofl and @ekeih to get a better picture.

marcofl · 2018-10-15T08:02:53Z

# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.2-1)

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Ubuntu
  Platform version: 16.04.5 LTS (Xenial Xerus)
  Kernel: Linux
  Kernel version: 4.4.0-137-generic
  Architecture: x86_64

Build information:
  Compiler: GNU 5.3.1
  Build host: c1410dac9c1d

# icinga2 daemon -C
[2018-10-15 09:49:21 +0200] information/cli: Icinga application loader (version: r2.9.2-1)
[2018-10-15 09:49:21 +0200] information/cli: Loading configuration file(s).
[2018-10-15 09:49:26 +0200] information/ConfigItem: Committing config item(s).
[2018-10-15 09:49:26 +0200] information/ApiListener: My API identity: master1.monitoring.vshn.net
[2018-10-15 09:49:36 +0200] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 30.6333/s (1838/min 1838/5min 1838/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #5 (InfluxdbWriter, influxdb) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 59932 Services.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 4 ServiceGroups.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 7689 ScheduledDowntimes.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 40 HostGroups.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 6 NotificationCommands.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 26429 Notifications.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 971 Hosts.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 15614 Downtimes.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 152 Comments.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 896 Zones.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 895 Endpoints.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 6 ApiUsers.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 7 UserGroups.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 320 CheckCommands.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 2 TimePeriods.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 14 Users.
[2018-10-15 09:49:40 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-10-15 09:49:40 +0200] information/cli: Finished validating the configuration file(s).

I think the problem could be that puppet (https://github.com/Icinga/puppet-icinga2/blob/master/manifests/service.pp#L36, https://github.com/Icinga/puppet-icinga2/blob/master/manifests/params.pp#L98) uses reload and not restart. maybe the state file update on reload is broken? because I see the file being updated during a systemctl restart icinga2 but NOT on a systemctl reload icinga2

This would also explain, why we see this issue after changing from our fork of the old icinga2 puppet module to the new one.

Every 1.0s: ls -la /var/lib/icinga2/icinga2.state*                                 Mon Oct 15 09:53:51 2018

-rw------- 1 nagios nagios 135209033 Oct 15 09:53 /var/lib/icinga2/icinga2.state

and nothing about this in the log (just the periodic writes)

[2018-10-15 09:30:53 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2018-10-15 09:36:01 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2018-10-15 09:43:12 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2018-10-15 09:48:20 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'

K0nne · 2018-10-15T08:47:18Z

Indeed. With systemctl restart icinga2 it works as it should. The state-file is written during the restart.

ekeih · 2018-10-15T10:26:54Z

icinga2 --version output
My comment from last week was about 2.9.1. But I do not think that the update to 2.9.2 changed the issue.

root@master01:/# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.2-1)

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Ubuntu
  Platform version: 16.04.5 LTS (Xenial Xerus)
  Kernel: Linux
  Kernel version: 2.6.32-042stab133.2
  Architecture: x86_64

Build information:
  Compiler: GNU 5.3.1
  Build host: c1410dac9c1d

System health, are there any security limits, limits on I/O
No
Which features are enabled and running on the same box, do they take more CPU / IO than usual?
All components are running in virtuozzo containers. We have two masters, 3x2 satellites and a separate MySQL Master and a Slave. The host systems and the containers do not show any performance issues.
How many objects are involved here (icinga2 daemon -C output).

root@master01:/# icinga2 daemon -C
[2018-10-15 10:45:56 +0200] information/cli: Icinga application loader (version: r2.9.2-1)
[2018-10-15 10:45:56 +0200] information/cli: Loading configuration file(s).
[2018-10-15 10:45:57 +0200] information/ConfigItem: Committing config item(s).
[2018-10-15 10:45:57 +0200] warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-10-15 10:45:57 +0200] warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-10-15 10:45:57 +0200] warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-10-15 10:45:57 +0200] warning/ApiListener: Please read the upgrading documentation for v2.8: https://www.icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
[2018-10-15 10:45:57 +0200] information/ApiListener: My API identity: foobar.live.master01
[2018-10-15 10:46:07 +0200] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 45.4667/s (2728/min 2728/5min 2728/15min);
[2018-10-15 10:46:07 +0200] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 10:46:07 +0200] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 10:46:07 +0200] information/WorkQueue: #7 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 65651 Services.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 ServiceGroup.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 605 ScheduledDowntimes.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 355 HostGroups.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 49649 Notifications.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 2569 Hosts.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 6761 Downtimes.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 54952 Dependencies.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 395 Comments.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 6 Zones.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 8 Endpoints.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 7 ApiUsers.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 210 UserGroups.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 14 CheckCommands.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 4 TimePeriods.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 327 Users.
[2018-10-15 10:46:13 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-10-15 10:46:13 +0200] information/cli: Finished validating the configuration file(s).

How long does a reload typically take?

root@master01:/# time systemctl reload icinga2

real	0m17.984s
user	0m0.000s
sys	0m0.003s

does the timestamp change during reload?
Not during a reload, but during a restart.
Are there any temporary files written with a newer timestamp, e.g. icinga2.state.23Wckf?
Also not during a reload, but during a restart.

So as @marcofl and @K0nne already pointed out a restart may work better than a reload. But interestingly Icinga2 log did not log Dumping program state to file or Restoring program state from file during the restart.

Can you strace the processes, and look specifically for file write operations done during the reload?

# during a reload
root@master01:/# strace -e open -p 26885
strace: Process 26885 attached
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=27670, si_uid=0} ---
--- SIGCHLD {si_signo=SIGCHLD, si_code=SI_USER, si_pid=27817, si_uid=107} ---
+++ killed by SIGABRT +++

# during a restart
root@master01:/# strace -e open -p 27817
strace: Process 27817 attached
open("/var/lib/icinga2/icinga2.state.UiOfHM", O_RDWR|O_CREAT|O_EXCL, 0600) = 13
open("/var/lib/icinga2/icinga2.state.UiOfHM", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 47
open("/var/lib/icinga2/modified-attributes.conf.GisxY1", O_RDWR|O_CREAT|O_EXCL, 0600) = 13
open("/var/lib/icinga2/modified-attributes.conf.GisxY1", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 47
+++ exited with 0 +++

Update: I just realized that my strace during a reload prints out +++ killed by SIGABRT +++. Maybe this is a clue?

dnsmichi · 2018-10-16T10:30:05Z

Thanks for the details, I wanted to get an overview if this is distribution agnostic, or maybe related to special signal handling with Systemd. One of our community members opened #6689 which includes a fix, similar to what I already had in mind from my analysis. Likely last week was too much stress, today I've found a reliable way to reproduce the issue.

You can find the steps to reproduce and test protocol in #6689, a PR is coming soon.

Please test the snapshot packages - I'll trigger them for el7 and Ubuntu 16 and let you know once available.

@west0rmann

Credits to @west0rmann finding the issue and providing the initial fix. fixes #6689 fixes #6592

@west0rmann

Credits to @west0rmann finding the issue and providing the initial fix. fixes #6689 fixes #6592

dnsmichi · 2018-10-16T12:15:07Z

https://build.icinga.com/view/Icinga%202/job/icinga2-snapshot/job/rpm-centos-7-1binary/arch=x86_64/275/
https://build.icinga.com/view/Icinga%202/job/icinga2-snapshot/job/deb-ubuntu-xenial-1binary/arch=x86_64/244/

dnsmichi · 2018-10-16T13:07:57Z

Packages are available, please report from your tests :-)

marcofl · 2018-10-17T07:31:37Z

I can confirm that with the version 2.10.0+18.gc0398ed the state file is changed during a reload. What I don't see is the new log message implemented in https://github.com/Icinga/icinga2/pull/6691/files

(master-test-monitoring_v2)root@master1:~# grep -i reload /var/log/icinga2/icinga2.log
[2018-10-17 09:28:05 +0200] information/Application: Got reload command: Starting new instance.
[2018-10-17 09:28:06 +0200] information/Application: Reload requested, letting new process take over.
[2018-10-17 09:29:28 +0200] information/Application: Got reload command: Starting new instance.
[2018-10-17 09:29:29 +0200] information/Application: Reload requested, letting new process take over.

dnsmichi · 2018-10-17T07:34:49Z

Maybe the logs are flushed too late and then exit() kills the stream prior to writing it. Did you test the thing with sending an acknowledgement during the reload time too?

marcofl · 2018-10-17T07:43:15Z

Our issue with having old check results internally in icinga2 after a reload seems to be gone with this. Also I did things like disable active checks for a service -> reload -> active check still disabled -> enable active check -> reload -> waiting for next check -> new check results was there. also acked a critical service and the ack was not gone after a reload.

I tend to say that this fixed the issue. But maybe other reporters of this issue can confirm this too please.

would you say this snapshot (2.10.0+18.gc0398ed) is okay to use in production until a release?

dnsmichi · 2018-10-17T07:55:15Z

Ok, thanks for testing :)

The snapshot packages only contain a regression fix for the API, and Windows build fix. Nothing critical, although I wouldn't recommend snapshot packages in production. When there's more feedback we're planning with 2.10.1 soon.

dnsmichi · 2018-10-18T08:09:26Z

Got a similar test response with #6624, so I am inclined to say that it is fixed. I know that @K0nne won't be able to test this in production.

ekeih · 2018-10-18T09:28:59Z

Hi, sorry for my late response - I was out of office the last two days.
According to another ticket we can excepct 2.10.1 later today. I will upgrade our stage environment to 2.10.1 and report back.

Thanks to everyone for this fix! :)

dnsmichi added the needs feedback We'll only proceed once we hear from you again label Sep 4, 2018

Crunsher mentioned this issue Sep 18, 2018

Master Reload Causes Passive Check State Change #6624

Closed

dnsmichi added bug Something isn't working and removed needs feedback We'll only proceed once we hear from you again labels Oct 16, 2018

dnsmichi self-assigned this Oct 16, 2018

dnsmichi added this to the 2.10.1 milestone Oct 16, 2018

dnsmichi pushed a commit that referenced this issue Oct 16, 2018

Add missing shutdown/program state dumps for SIGUSR2 reload handler

cd6f455

Credits to @west0rmann finding the issue and providing the initial fix. fixes #6689 fixes #6592

dnsmichi mentioned this issue Oct 16, 2018

Add missing shutdown/program state dumps for SIGUSR2 reload handler #6691

Merged

dnsmichi pushed a commit that referenced this issue Oct 16, 2018

Add missing shutdown/program state dumps for SIGUSR2 reload handler

af6c8a1

Credits to @west0rmann finding the issue and providing the initial fix. fixes #6689 fixes #6592

dnsmichi closed this as completed in #6691 Oct 16, 2018

dnsmichi reopened this Oct 16, 2018

dnsmichi added the needs feedback We'll only proceed once we hear from you again label Oct 16, 2018

dnsmichi closed this as completed Oct 18, 2018

dnsmichi removed the needs feedback We'll only proceed once we hear from you again label Jul 31, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reloads seem to reset the check atempt count. Also notifications go missing shortly after a reload. #6592

Reloads seem to reset the check atempt count. Also notifications go missing shortly after a reload. #6592

K0nne commented Sep 4, 2018

dnsmichi commented Sep 4, 2018

K0nne commented Sep 5, 2018 •

edited

Loading

K0nne commented Sep 12, 2018

dnsmichi commented Sep 14, 2018 •

edited

Loading

K0nne commented Sep 19, 2018 •

edited

Loading

marcofl commented Oct 10, 2018 •

edited

Loading

K0nne commented Oct 11, 2018

ekeih commented Oct 11, 2018

marcofl commented Oct 12, 2018 •

edited

Loading

dnsmichi commented Oct 12, 2018

K0nne commented Oct 12, 2018 •

edited

Loading

dnsmichi commented Oct 12, 2018

marcofl commented Oct 15, 2018

K0nne commented Oct 15, 2018

ekeih commented Oct 15, 2018 •

edited

Loading

dnsmichi commented Oct 16, 2018

dnsmichi commented Oct 16, 2018

dnsmichi commented Oct 16, 2018

marcofl commented Oct 17, 2018 •

edited

Loading

dnsmichi commented Oct 17, 2018

marcofl commented Oct 17, 2018 •

edited

Loading

dnsmichi commented Oct 17, 2018

dnsmichi commented Oct 18, 2018

ekeih commented Oct 18, 2018

Reloads seem to reset the check atempt count. Also notifications go missing shortly after a reload. #6592

Reloads seem to reset the check atempt count. Also notifications go missing shortly after a reload. #6592

Comments

K0nne commented Sep 4, 2018

Expected Behavior

Current Behavior

Possible Solution

Context

Your Environment

dnsmichi commented Sep 4, 2018

K0nne commented Sep 5, 2018 • edited Loading

K0nne commented Sep 12, 2018

dnsmichi commented Sep 14, 2018 • edited Loading

K0nne commented Sep 19, 2018 • edited Loading

marcofl commented Oct 10, 2018 • edited Loading

K0nne commented Oct 11, 2018

ekeih commented Oct 11, 2018

marcofl commented Oct 12, 2018 • edited Loading

dnsmichi commented Oct 12, 2018

K0nne commented Oct 12, 2018 • edited Loading

dnsmichi commented Oct 12, 2018

marcofl commented Oct 15, 2018

K0nne commented Oct 15, 2018

ekeih commented Oct 15, 2018 • edited Loading

dnsmichi commented Oct 16, 2018

dnsmichi commented Oct 16, 2018

dnsmichi commented Oct 16, 2018

marcofl commented Oct 17, 2018 • edited Loading

dnsmichi commented Oct 17, 2018

marcofl commented Oct 17, 2018 • edited Loading

dnsmichi commented Oct 17, 2018

dnsmichi commented Oct 18, 2018

ekeih commented Oct 18, 2018

K0nne commented Sep 5, 2018 •

edited

Loading

dnsmichi commented Sep 14, 2018 •

edited

Loading

K0nne commented Sep 19, 2018 •

edited

Loading

marcofl commented Oct 10, 2018 •

edited

Loading

marcofl commented Oct 12, 2018 •

edited

Loading

K0nne commented Oct 12, 2018 •

edited

Loading

ekeih commented Oct 15, 2018 •

edited

Loading

marcofl commented Oct 17, 2018 •

edited

Loading

marcofl commented Oct 17, 2018 •

edited

Loading