Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reloads seem to reset the check atempt count. Also notifications go missing shortly after a reload. #6592

Closed
K0nne opened this issue Sep 4, 2018 · 24 comments · Fixed by #6691
Assignees
Labels
bug Something isn't working
Milestone

Comments

@K0nne
Copy link
Contributor

K0nne commented Sep 4, 2018

We push new configurations on an hourly base into icinga. After this a config validation and a icinga2 reload are performed. The script, which does this runs every 11th minute. Our environment contains a master zone and 7 satellite zones. The masters and slaves in each zone are installed as ha-pairs of two machines. We have 17k hosts and 73k services.

We recently noticed a few, but nasty side effects from those reloads:

Sometimes the check attempt count seems to be set back during a reload:

https://i.imgur.com/RcJfd8S.png

Shortly after a reload there's also a point in time where notifications are not being sent:

https://monitoring-portal.org/uploads/default/original/2X/1/1a2e69717db5d8e2382e64a6eb2ec8d44208b324.png

I allready discuss this matter at monitoring-portal where michi dropped an interesting hint at post #5.

Expected Behavior

Reloads should be statefull.

Current Behavior

reloads seem to reset the check attempt count.
also notifications are not being sent shortly after a reload.

Possible Solution

unknown

Context

both problems are a big problem for us. because they undermine the trust in the monitoring solution :~

Your Environment

  • Version used (icinga2 --version):

icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.1-1)

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

System information:
Platform: Red Hat Enterprise Linux Server
Platform version: 7.4 (Maipo)
Kernel: Linux
Kernel version: 3.10.0-693.21.1.el7.x86_64
Architecture: x86_64

Build information:
Compiler: GNU 4.8.5
Build host: unknown

  • Operating System and version:

RHEL 7.4

  • Enabled features (icinga2 feature list):

Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb livestatus mainlog opentsdb statusdata
Enabled features: api checker ido-mysql notification perfdata syslog

  • Icinga Web 2 version and modules (System - About):

Icinga Web 2 Version
2.6.0
Git Commit
cfe6c7b06587189b3ef688183cacd32594db071a
Git Commit Datum
2018-07-19
Copyright
© 2013-2018 Das Icinga Projekt

Geladene Module
Name Version
businessprocess 2.1.0
monitoring 2.6.0
pnp 1.1.0

  • Config validation (icinga2 daemon -C):

[2018-09-04 18:24:32 +0200] information/cli: Icinga application loader (version: r2.9.1-1)
[2018-09-04 18:24:32 +0200] information/cli: Loading configuration file(s).
[2018-09-04 18:24:33 +0200] information/ConfigItem: Committing config item(s).
[2018-09-04 18:24:33 +0200] information/ApiListener: My API identity: dxzmicinga01
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 73387 Services.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 SyslogLogger.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 42 HostGroups.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 81900 Notifications.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 17060 Hosts.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 PerfdataWriter.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 12 Zones.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 17 Endpoints.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 4 ApiUsers.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 249 CheckCommands.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 3 TimePeriods.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 User.
[2018-09-04 18:24:39 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-09-04 18:24:39 +0200] information/cli: Finished validating the configuration file(s).

@dnsmichi
Copy link
Contributor

dnsmichi commented Sep 4, 2018

Ensure that the icinga2.state file is intact and read during reloads, as otherwise you're loosing runtime data.

In terms of the notification 0 user problem, you should investigate further but not as part of this issue.

@dnsmichi dnsmichi added the needs feedback We'll only proceed once we hear from you again label Sep 4, 2018
@K0nne
Copy link
Contributor Author

K0nne commented Sep 5, 2018

Thanks for the input. The file is intact and readable. I'm using the command "systemctl reload icinga2". How can I ensure that the icinga2.state file is read? I searched the docs and the internet but sadly found no clue.

@K0nne
Copy link
Contributor Author

K0nne commented Sep 12, 2018

bump

@dnsmichi
Copy link
Contributor

dnsmichi commented Sep 14, 2018

No need for bumping issues, we'll reply once there's time to.

Once the state file is read, the log will write something like this: Restored X objects. Loaded Y new objects without state..

@K0nne
Copy link
Contributor Author

K0nne commented Sep 19, 2018

Sorry for the bumping.

I can't find the relating log entries anywhere in the syslog or in the debuglog. I temporarely enabled the mainlog feature, did a reload (systemctl reload icinga2) and looked there too. Nothing.

To have a reference I looked for those log entries in other github issues. Should be something like this:

information/ConfigObject: Restoring program state from file '/var/lib/icinga2/icinga2.state'

I have never seen them before in our environment.

This is the syslog output during a reload:

Sep 19 08:24:12 icingamaster01 systemd: Removed slice User Slice of root.
Sep 19 08:24:12 icingamaster01 systemd: Stopping User Slice of root.
Sep 19 08:24:13 icingamaster01 xinetd[1899]: EXIT: csync2 status=0 pid=53087 duration=11(sec)
Sep 19 08:24:18 icingamaster01 safe-reload: Validating config files: Done
Sep 19 08:24:18 icingamaster01 safe-reload: Reloading Icinga 2: Done
Sep 19 08:24:18 icingamaster01 systemd: Reloaded Icinga host/service/network monitoring system.
Sep 19 08:24:26 icingamaster01 systemd: icinga2.service: Supervising process 53355 which is not our child. We'll most likely not notice when it exits.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone1' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone2' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'global-templates' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone3' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone4' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'linux-commands' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'master' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'windows-commands' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone5' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone6' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone7' because we have an authoritative version of the zone's config.

The content of /lib/systemd/system/icinga2.service looks like this:

[Unit]
Description=Icinga host/service/network monitoring system
After=syslog.target network-online.target postgresql.service mariadb.service carbon-cache.service carbon-relay.service

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/icinga2
ExecStartPre=/usr/lib/icinga2/prepare-dirs /etc/sysconfig/icinga2
ExecStart=/usr/sbin/icinga2 daemon -e ${ICINGA2_ERROR_LOG}
PIDFile=/run/icinga2/icinga2.pid
ExecReload=/usr/lib/icinga2/safe-reload /etc/sysconfig/icinga2
TimeoutStartSec=30m

# Systemd >228 enforces a lower process number for services.
# Depending on the distribution and Systemd version, this must
# be explicitly raised. Packages will set the needed values
# into /etc/systemd/system/icinga2.service.d/limits.conf
#
# Please check the troubleshooting documentation for further details.
# The values below can be used as examples for customized service files.

#TasksMax=infinity
#LimitNPROC=62883

[Install]
WantedBy=multi-user.target

Is there a chance that the icinga2.state file is being ignored?

@marcofl
Copy link

marcofl commented Oct 10, 2018

We have a similar issue with reloads. We have a check that monitors if check results of all services are fresh by using the internal state of the objects (e.g. service objects last_check attribute). after reloads / restarts these values are old (up to 5min, from the time where the last Dumping program state to file '/var/lib/icinga2/icinga2.state' appeared in the log).

So I highly suspect that it's not a problem with reading the state file, more it seems that the state file is not updated on a clean shutdown (reload).

If this is true and I got this right this would mean Icinga2 currently looses up to 5min of all state during every restart / reload.

@K0nne
Copy link
Contributor Author

K0nne commented Oct 11, 2018

I can verify your observation. The icinga2.state file is not updated before/during a reload.

@ekeih
Copy link
Contributor

ekeih commented Oct 11, 2018

I think we have a similar issue.

We have three different kinds of notifications:

  • firstlevel: gets a notification as soon as the check is in hard state, gets renotified every 15 minutes
  • secondlevel: gets a notification after 30 minutes when the check is still unhandled, gets renotified every 15 minutes
  • thirdlevel: gets a notification after 60 minutes when the check is still unhandled, gets renotified every 30 minutes

This is what happened:

  • At 12:52:38 Icinga2 sent a notification to our firstlevel.
  • At 12:53:05 we acknowledged the alert.
  • At 12:57:27 the acknowledgment was removed by Icinga2 according to the Icingaweb2 history. We had a puppet run at 12:52:00 which resulted in a Icinga2 reload around 12:57.
  • The check was still critical!
  • No additional reminder notification to our firstlevel or secondlevel happened! The correct behaviour would have been a notification every 15 minutes. (This made the situation even worse.)
  • At 14:05:15 Icinga2 started to send notifications to our thirdlevel.
  • Icinga2 repeated the thirdlevel notifications (correctly) every 30 minutes.
  • After our thirdlevel notified our firstlevel manually we fixed the issue.
  • After the recovery Icinga2 sent recovery notifications to our firstlevel and our thirdlevel.

I am still looking for a way to reproduce this in a reliable way. It would be also interesting to be able to check how often something like this happens. So if anyone has an idea I am happy to hear it :)

Some additional context that could be useful:

This happened with Icinga2 2.9.1. (In the meantime we updated to 2.9.2.)
We have a HA setup with two masters and 3x2 satellites. Puppet generates the config on master01 which synchronizes the config to master02. When both masters are up Icinga2 selects master02 to write into the IDODB.

According to the Icinga2 log it dumps the current state every 5 minutes, but not immediately before the reload, with the following line:

information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'

We do not have the following line anywhere in our logs:

information/ConfigObject: Restoring program state from file '/var/lib/icinga2/icinga2.state'

@marcofl
Copy link

marcofl commented Oct 12, 2018

@ekeih it's important to understand that the data (state of a check, last check result, ack, etc.) in the idodb (icingaweb2) is not necessarily the same as internally in icinga2. icinga2 does not read from the idodb, it only exports the state to it. after a restart, icinga2 depends on the state from the state file. which, if my observation is true, is not updated during a restart but periodically only.

if it's true that icinga2 looses the state during reloads (is reset back to the state where the state file was updated, which happens every 5min afaik) then you'll see the correct number of check attempts / hard state, etc. in icingaweb2 but icinga2 internally will not know that the service was already in a hard state for example.

What I don't know is whether IDO will remove e.g. an ACK when it is gone in icinga2's internal state (because of the issue we suspect here). This could partly explain the situation you described.

@dnsmichi
Copy link
Contributor

I'd like to get an idea what's going on, so it is essential to know about the specialities on your platform. This includes the following:

  • icinga2 --version output
  • System health, are there any security limits, limits on I/O
  • Which features are enabled and running on the same box, do they take more CPU / IO than usual?
  • How many objects are involved here (icinga2 daemon -C output).
  • How long does a reload typically take?
watch -n 1 'ls -la /var/lib/icinga2/icinga2.state'

does the timestamp change during reload?

systemctl reload icinga2

Are there any temporary files written with a newer timestamp, e.g. icinga2.state.23Wckf?

watch -n 1 'ls -la /var/lib/icinga2/icinga2.state*'

Can you strace the processes, and look specifically for file write operations done during the reload?

pidof icinga2
<pick one>
strace -o strace.icinga2.log -e trace=open -p <PID>

@K0nne
Copy link
Contributor Author

K0nne commented Oct 12, 2018

icinga2 --version

icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.1-1)

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Red Hat Enterprise Linux Server
  Platform version: 7.4 (Maipo)
  Kernel: Linux
  Kernel version: 3.10.0-693.21.1.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: unknown

We use a HA-Setup, which consists of 2 physical Machines.

sestatus

SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   permissive
Mode from config file:          permissive
Policy MLS status:              enabled
Policy deny_unknown status:     allowed

icinga2 feature list

Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb livestatus mainlog opentsdb statusdata
Enabled features: api checker ido-mysql notification perfdata syslog

Everything works fine so far. I found nothing unusual.

icinga2 daemon -C

[2018-10-12 11:22:47 +0200] information/cli: Icinga application loader (version: r2.9.1-1)
[2018-10-12 11:22:47 +0200] information/cli: Loading configuration file(s).
[2018-10-12 11:22:48 +0200] information/ConfigItem: Committing config item(s).
[2018-10-12 11:22:49 +0200] information/ApiListener: My API identity: dxzmicinga01
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 74463 Services.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 SyslogLogger.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 54 HostGroups.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 90291 Notifications.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 17485 Hosts.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 PerfdataWriter.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 76 Comments.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 12 Zones.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 17 Endpoints.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 5 ApiUsers.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 255 CheckCommands.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 3 TimePeriods.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 User.
[2018-10-12 11:22:55 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-10-12 11:22:55 +0200] information/cli: Finished validating the configuration file(s).

A reload takes about 20 seconds.

watch -n 1 'ls -la /var/lib/icinga2/icinga2.state'

no timestamp-changes

watch -n 1 'ls -la /var/lib/icinga2/icinga2.state*'

no temp-files

strace -o strace.icinga2.log -e trace=open -p

+++ exited with 0 +++

@dnsmichi
Copy link
Contributor

Interesting, thanks. I'm not able to reproduce this locally, I can see that the temporary state file is written and renamed at the exact reload time, including an open call in dtrace. Likely it is a race condition when the child process with the validation takes too much time, and kills the parent process for some reason.

I'd still like to hear the setups from @marcofl and @ekeih to get a better picture.

@marcofl
Copy link

marcofl commented Oct 15, 2018

# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.2-1)

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Ubuntu
  Platform version: 16.04.5 LTS (Xenial Xerus)
  Kernel: Linux
  Kernel version: 4.4.0-137-generic
  Architecture: x86_64

Build information:
  Compiler: GNU 5.3.1
  Build host: c1410dac9c1d
# icinga2 daemon -C
[2018-10-15 09:49:21 +0200] information/cli: Icinga application loader (version: r2.9.2-1)
[2018-10-15 09:49:21 +0200] information/cli: Loading configuration file(s).
[2018-10-15 09:49:26 +0200] information/ConfigItem: Committing config item(s).
[2018-10-15 09:49:26 +0200] information/ApiListener: My API identity: master1.monitoring.vshn.net
[2018-10-15 09:49:36 +0200] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 30.6333/s (1838/min 1838/5min 1838/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #5 (InfluxdbWriter, influxdb) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 59932 Services.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 4 ServiceGroups.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 7689 ScheduledDowntimes.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 40 HostGroups.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 6 NotificationCommands.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 26429 Notifications.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 971 Hosts.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 15614 Downtimes.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 152 Comments.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 896 Zones.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 895 Endpoints.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 6 ApiUsers.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 7 UserGroups.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 320 CheckCommands.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 2 TimePeriods.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 14 Users.
[2018-10-15 09:49:40 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-10-15 09:49:40 +0200] information/cli: Finished validating the configuration file(s).

I think the problem could be that puppet (https://github.com/Icinga/puppet-icinga2/blob/master/manifests/service.pp#L36, https://github.com/Icinga/puppet-icinga2/blob/master/manifests/params.pp#L98) uses reload and not restart. maybe the state file update on reload is broken? because I see the file being updated during a systemctl restart icinga2 but NOT on a systemctl reload icinga2

This would also explain, why we see this issue after changing from our fork of the old icinga2 puppet module to the new one.

Every 1.0s: ls -la /var/lib/icinga2/icinga2.state*                                 Mon Oct 15 09:53:51 2018

-rw------- 1 nagios nagios 135209033 Oct 15 09:53 /var/lib/icinga2/icinga2.state

and nothing about this in the log (just the periodic writes)

[2018-10-15 09:30:53 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2018-10-15 09:36:01 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2018-10-15 09:43:12 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2018-10-15 09:48:20 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'

@K0nne
Copy link
Contributor Author

K0nne commented Oct 15, 2018

Indeed. With systemctl restart icinga2 it works as it should. The state-file is written during the restart.

@ekeih
Copy link
Contributor

ekeih commented Oct 15, 2018

  • icinga2 --version output
    My comment from last week was about 2.9.1. But I do not think that the update to 2.9.2 changed the issue.
root@master01:/# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.2-1)

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Ubuntu
  Platform version: 16.04.5 LTS (Xenial Xerus)
  Kernel: Linux
  Kernel version: 2.6.32-042stab133.2
  Architecture: x86_64

Build information:
  Compiler: GNU 5.3.1
  Build host: c1410dac9c1d
  • System health, are there any security limits, limits on I/O
    No

  • Which features are enabled and running on the same box, do they take more CPU / IO than usual?
    All components are running in virtuozzo containers. We have two masters, 3x2 satellites and a separate MySQL Master and a Slave. The host systems and the containers do not show any performance issues.

  • How many objects are involved here (icinga2 daemon -C output).

root@master01:/# icinga2 daemon -C
[2018-10-15 10:45:56 +0200] information/cli: Icinga application loader (version: r2.9.2-1)
[2018-10-15 10:45:56 +0200] information/cli: Loading configuration file(s).
[2018-10-15 10:45:57 +0200] information/ConfigItem: Committing config item(s).
[2018-10-15 10:45:57 +0200] warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-10-15 10:45:57 +0200] warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-10-15 10:45:57 +0200] warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-10-15 10:45:57 +0200] warning/ApiListener: Please read the upgrading documentation for v2.8: https://www.icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
[2018-10-15 10:45:57 +0200] information/ApiListener: My API identity: foobar.live.master01
[2018-10-15 10:46:07 +0200] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 45.4667/s (2728/min 2728/5min 2728/15min);
[2018-10-15 10:46:07 +0200] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 10:46:07 +0200] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 10:46:07 +0200] information/WorkQueue: #7 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 65651 Services.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 ServiceGroup.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 605 ScheduledDowntimes.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 355 HostGroups.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 49649 Notifications.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 2569 Hosts.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 6761 Downtimes.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 54952 Dependencies.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 395 Comments.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 6 Zones.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 8 Endpoints.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 7 ApiUsers.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 210 UserGroups.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 14 CheckCommands.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 4 TimePeriods.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 327 Users.
[2018-10-15 10:46:13 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-10-15 10:46:13 +0200] information/cli: Finished validating the configuration file(s).
  • How long does a reload typically take?
root@master01:/# time systemctl reload icinga2

real	0m17.984s
user	0m0.000s
sys	0m0.003s
  • does the timestamp change during reload?
    Not during a reload, but during a restart.

  • Are there any temporary files written with a newer timestamp, e.g. icinga2.state.23Wckf?
    Also not during a reload, but during a restart.

So as @marcofl and @K0nne already pointed out a restart may work better than a reload. But interestingly Icinga2 log did not log Dumping program state to file or Restoring program state from file during the restart.

  • Can you strace the processes, and look specifically for file write operations done during the reload?
# during a reload
root@master01:/# strace -e open -p 26885
strace: Process 26885 attached
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=27670, si_uid=0} ---
--- SIGCHLD {si_signo=SIGCHLD, si_code=SI_USER, si_pid=27817, si_uid=107} ---
+++ killed by SIGABRT +++

# during a restart
root@master01:/# strace -e open -p 27817
strace: Process 27817 attached
open("/var/lib/icinga2/icinga2.state.UiOfHM", O_RDWR|O_CREAT|O_EXCL, 0600) = 13
open("/var/lib/icinga2/icinga2.state.UiOfHM", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 47
open("/var/lib/icinga2/modified-attributes.conf.GisxY1", O_RDWR|O_CREAT|O_EXCL, 0600) = 13
open("/var/lib/icinga2/modified-attributes.conf.GisxY1", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 47
+++ exited with 0 +++

Update: I just realized that my strace during a reload prints out +++ killed by SIGABRT +++. Maybe this is a clue?

@dnsmichi dnsmichi added bug Something isn't working and removed needs feedback We'll only proceed once we hear from you again labels Oct 16, 2018
@dnsmichi
Copy link
Contributor

Thanks for the details, I wanted to get an overview if this is distribution agnostic, or maybe related to special signal handling with Systemd. One of our community members opened #6689 which includes a fix, similar to what I already had in mind from my analysis. Likely last week was too much stress, today I've found a reliable way to reproduce the issue.

You can find the steps to reproduce and test protocol in #6689, a PR is coming soon.

Please test the snapshot packages - I'll trigger them for el7 and Ubuntu 16 and let you know once available.

@dnsmichi dnsmichi self-assigned this Oct 16, 2018
@dnsmichi dnsmichi added this to the 2.10.1 milestone Oct 16, 2018
dnsmichi pushed a commit that referenced this issue Oct 16, 2018
Credits to @west0rmann finding the issue and providing the initial fix.

fixes #6689
fixes #6592
dnsmichi pushed a commit that referenced this issue Oct 16, 2018
Credits to @west0rmann finding the issue and providing the initial fix.

fixes #6689
fixes #6592
@dnsmichi
Copy link
Contributor

Packages are available, please report from your tests :-)

@dnsmichi dnsmichi reopened this Oct 16, 2018
@dnsmichi dnsmichi added the needs feedback We'll only proceed once we hear from you again label Oct 16, 2018
@marcofl
Copy link

marcofl commented Oct 17, 2018

I can confirm that with the version 2.10.0+18.gc0398ed the state file is changed during a reload. What I don't see is the new log message implemented in https://github.com/Icinga/icinga2/pull/6691/files

(master-test-monitoring_v2)root@master1:~# grep -i reload /var/log/icinga2/icinga2.log
[2018-10-17 09:28:05 +0200] information/Application: Got reload command: Starting new instance.
[2018-10-17 09:28:06 +0200] information/Application: Reload requested, letting new process take over.
[2018-10-17 09:29:28 +0200] information/Application: Got reload command: Starting new instance.
[2018-10-17 09:29:29 +0200] information/Application: Reload requested, letting new process take over.

@dnsmichi
Copy link
Contributor

Maybe the logs are flushed too late and then exit() kills the stream prior to writing it. Did you test the thing with sending an acknowledgement during the reload time too?

@marcofl
Copy link

marcofl commented Oct 17, 2018

Our issue with having old check results internally in icinga2 after a reload seems to be gone with this. Also I did things like disable active checks for a service -> reload -> active check still disabled -> enable active check -> reload -> waiting for next check -> new check results was there. also acked a critical service and the ack was not gone after a reload.

I tend to say that this fixed the issue. But maybe other reporters of this issue can confirm this too please.

would you say this snapshot (2.10.0+18.gc0398ed) is okay to use in production until a release?

@dnsmichi
Copy link
Contributor

Ok, thanks for testing :)

The snapshot packages only contain a regression fix for the API, and Windows build fix. Nothing critical, although I wouldn't recommend snapshot packages in production. When there's more feedback we're planning with 2.10.1 soon.

@dnsmichi
Copy link
Contributor

Got a similar test response with #6624, so I am inclined to say that it is fixed. I know that @K0nne won't be able to test this in production.

@ekeih
Copy link
Contributor

ekeih commented Oct 18, 2018

Hi, sorry for my late response - I was out of office the last two days.
According to another ticket we can excepct 2.10.1 later today. I will upgrade our stage environment to 2.10.1 and report back.

Thanks to everyone for this fix! :)

@dnsmichi dnsmichi removed the needs feedback We'll only proceed once we hear from you again label Jul 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants