Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve reload handling for features (metric & queue flush, activation priority) #6970

Merged
merged 20 commits into from
Feb 26, 2019

Conversation

dnsmichi
Copy link
Contributor

@dnsmichi dnsmichi commented Feb 25, 2019

Summary

WQs need to be flushed on Pause/Stop handling,
this also includes metric buffer strings in InfluxDB/Elasticsearch.
The Perfdata feature did not properly close file handles,
leaving performance data behind.

Reconnect-Timers need to be stopped on Pause() too,
that was missing from #6725. IDO query queues had the wrong
priority for object activation and session cleanup, sometimes
resulting in zombie objects deactivated after minutes.

The activation priority needs to be respected in reverse order
on shutdown/reload. Furthermore, features need to be activated
before checker/notification, and on shutdown/reload, they need
to be stopped the soonest. There's also changes with making
the IcingaApplication a primary citizen and starting Downtime
objects early.

The ReloadTimeout constant allows to modify the default
of 300 previously hardcoded in two places: application
reload and config package reload. This is for advanced usage only.

Notes

This reverts #6882 and #6908 and re-implements them
with a refined behaviour.

The implemented changes may slow down the overall reload
time thus ensuring everything is stopped and flushed in order.

fixes #6841

ref/NC/591065

Michael Friedrich added 20 commits February 19, 2019 14:05
This stops the checker component first, then notifications, then
features, then config objects, then the API feature and logger(s).

Patch taken from @Al2Klimov
Refactored the code into a local mutex and added
some more debug logging while at it.
…eue on Pause/Shutdown/Reload

Patch taken from @Al2Klimov but moved into Pause()
@dnsmichi
Copy link
Contributor Author

Tests

T1

icinga2 daemon -x debug 

T2

kill -HUP $(pidof icinga2)

kill -9 $(pidof icinga2)

icinga2_6841_reload_flush.txt.zip

InfluxDB

[2019-02-20 16:24:49 +0100] debug/ConfigObject: Deactivate() called for config object 'influxdb' with type 'InfluxdbWriter'.
[2019-02-20 16:24:49 +0100] debug/InfluxdbWriter: Flushing pending data buffers.
[2019-02-20 16:24:49 +0100] debug/InfluxdbWriter: Flushing data buffer to InfluxDB.

[2019-02-20 16:24:49 +0100] debug/InfluxdbWriter: Joining existing WQ tasks.

[2019-02-20 16:24:49 +0100] debug/InfluxdbWriter: Flushing data buffers from WQ tasks.
[2019-02-20 16:24:49 +0100] debug/InfluxdbWriter: Flushing data buffer to InfluxDB.

[2019-02-20 16:24:49 +0100] debug/HttpRequest: line: HTTP/1.1 204 No Content, tokens: 4
[2019-02-20 16:24:49 +0100] information/InfluxdbWriter: 'influxdb' paused.

Perfdata

The code holds some stream writing now commented out. This was the easiest way to figure out whether the file handles are actually flushed ;-)

$ grep -r Pause /usr/local/icinga/icinga2/var/spool/icinga2/perfdata/
/usr/local/icinga/icinga2/var/spool/icinga2/perfdata//service-perfdata.1550666633:# Pause the feature
/usr/local/icinga/icinga2/var/spool/icinga2/perfdata//service-perfdata.1550667269:# Pause the feature
/usr/local/icinga/icinga2/var/spool/icinga2/perfdata//host-perfdata.1550666633:# Pause the feature
/usr/local/icinga/icinga2/var/spool/icinga2/perfdata//host-perfdata.1550667269:# Pause the feature
[2019-02-20 14:18:31 +0100] information/Application: Reload requested, letting new process take over.
[2019-02-20 14:18:31 +0100] debug/ConfigObject: Deactivate() called for config object 'checker' with type 'CheckerComponent'.
[2019-02-20 14:18:31 +0100] information/CheckerComponent: 'checker' stopped.
[2019-02-20 14:18:31 +0100] debug/ConfigObject: Deactivate() called for config object 'notification' with type 'NotificationComponent'.
[2019-02-20 14:18:31 +0100] information/NotificationComponent: 'notification' stopped.
[2019-02-20 14:18:31 +0100] debug/ConfigObject: Deactivate() called for config object 'perfdata' with type 'PerfdataWriter'.
[2019-02-20 14:18:31 +0100] debug/PerfdataWriter: Rotating perfdata files.
[2019-02-20 14:18:31 +0100] debug/PerfdataWriter: Closed output file and renaming into '/usr/local/icinga/icinga2/var/spool/icinga2/perfdata/service-perfdata.1550668711'.
[2019-02-20 14:18:31 +0100] debug/PerfdataWriter: Rotating perfdata files.
[2019-02-20 14:18:31 +0100] debug/PerfdataWriter: Closed output file and renaming into '/usr/local/icinga/icinga2/var/spool/icinga2/perfdata/host-perfdata.1550668711'.
[2019-02-20 14:18:31 +0100] information/PerfdataWriter: 'perfdata' paused.
[2019-02-20 14:18:31 +0100] debug/ConfigObject: Deactivate() called for config object 'ido-mysql' with type 'IdoMysqlConnection'.
[2019-02-20 14:18:31 +0100] information/IdoMysqlConnection: 'ido-mysql' paused.
[2019-02-20 14:18:31 +0100] information/DbConnection: Pausing IDO connection: ido-mysql
[2019-02-20 14:18:31 +0100] debug/IdoMysqlConnection: Rescheduling disconnect task.

Reload Logging

[2019-02-22 15:45:08 +0100] information/Application: Got reload command: Started new instance with PID '64850' (timeout is 300s).

[2019-02-20 14:18:31 +0100] debug/ConfigObject: Deactivate() called for config object 'root' with type 'ApiUser'.
[2019-02-20 14:18:31 +0100] debug/ConfigObject: Deactivate() called for config object 'never' with type 'TimePeriod'.
[2019-02-20 14:18:31 +0100] debug/ConfigObject: Deactivate() called for config object '24x7' with type 'TimePeriod'.
[2019-02-20 14:18:31 +0100] debug/ConfigObject: Deactivate() called for config object '9to5' with type 'TimePeriod'.
[2019-02-20 14:18:31 +0100] debug/ConfigObject: Deactivate() called for config object 'mbpmif.int.netways.de!load!4c26071e-2666-455d-8c30-c279d3392c94' with type 'Downtime'.
[2019-02-20 14:18:31 +0100] debug/ConfigObject: Deactivate() called for config object 'app' with type 'IcingaApplication'.
[2019-02-20 14:18:31 +0100] debug/ConfigObject: Deactivate() called for config object 'main-log' with type 'FileLogger'.
[2019-02-20 14:18:31 +0100] information/ConfigObject: Dumping program state to file '/usr/local/icinga/icinga2/var/lib/icinga2/icinga2.state'
[2019-02-20 14:18:31 +0100] information/Application: Reload done, parent process shutting down. Child process with PID '64850' is taking over.

@dnsmichi dnsmichi added this to the 2.11.0 milestone Feb 25, 2019
@dnsmichi dnsmichi added bug Something isn't working area/db-ido Database output area/metrics General metrics handling area/checks Check execution and results area/log Logging related labels Feb 25, 2019
@Al2Klimov
Copy link
Member

@dnsmichi, it's not about credits, it's about Git history noise (someone will complain about one day). Feel free to assign this one to me and I'll rebase it (w/o making actual changes).

@dnsmichi
Copy link
Contributor Author

Thanks, but I'll prefer to keep it this way. I'll now merge this for snapshot packages.

@dnsmichi dnsmichi merged commit e2df115 into master Feb 26, 2019
@dnsmichi dnsmichi deleted the bugfix/perfdata-gaps branch March 18, 2019 14:17
dnsmichi pushed a commit that referenced this pull request Mar 29, 2019
- Decrease Object Authority updates to 10s (was 30s)
- Decrease failover timeout to 30s (was 60s)
- Decrease cold startup (after (re)start) with no OA updates to 30s (was 60s)
- Immediately connect on Resume()
- Fix query priority which got broken with #6970
- Add more logging when a failover is in progress

```
[2019-03-29 16:13:53 +0100] information/IdoMysqlConnection: Last update by endpoint 'master1' was 8.33246s ago (< failover timeout of 30s). Retrying.

[2019-03-29 16:14:23 +0100] information/IdoMysqlConnection: Last update by endpoint 'master1' was 38.3288s ago. Taking over 'ido-mysql' in HA zone 'master'.
```

- Add more logging for reconnect and disconnect handling
- Add 'last_failover' attribute to IDO*Connection objects

refs #6970
dnsmichi pushed a commit that referenced this pull request Apr 1, 2019
- Decrease Object Authority updates to 10s (was 30s)
- Decrease failover timeout to 30s (was 60s)
- Decrease cold startup (after (re)start) with no OA updates to 30s (was 60s)
- Immediately connect on Resume()
- Fix query priority which got broken with #6970
- Add more logging when a failover is in progress

```
[2019-03-29 16:13:53 +0100] information/IdoMysqlConnection: Last update by endpoint 'master1' was 8.33246s ago (< failover timeout of 30s). Retrying.

[2019-03-29 16:14:23 +0100] information/IdoMysqlConnection: Last update by endpoint 'master1' was 38.3288s ago. Taking over 'ido-mysql' in HA zone 'master'.
```

- Add more logging for reconnect and disconnect handling
- Add 'last_failover' attribute to IDO*Connection objects

refs #6970
dnsmichi pushed a commit that referenced this pull request May 3, 2019
This follows the same principle as with the shutdown handler,
and was introduced with the changed reload handling with 2.9.
Previously IsShuttingDown() was sufficient which got set at one
location.

SigUsr2 as handler introduced a new location where m_ShuttingDown
is not necessarily set yet. Since this handler gets called when
l_Restarting is enabled, we'll use this flag to avoid config update
events resulting in object deactivation (object->IsActive() always
returns false).

refs #5996
refs #6691
refs #6970

fixes #7125
dnsmichi pushed a commit that referenced this pull request May 16, 2019
This follows the same principle as with the shutdown handler,
and was introduced with the changed reload handling with 2.9.
Previously IsShuttingDown() was sufficient which got set at one
location.

SigUsr2 as handler introduced a new location where m_ShuttingDown
is not necessarily set yet. Since this handler gets called when
l_Restarting is enabled, we'll use this flag to avoid config update
events resulting in object deactivation (object->IsActive() always
returns false).

refs #5996
refs #6691
refs #6970

fixes #7125

(cherry picked from commit 78e24c5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/checks Check execution and results area/db-ido Database output area/log Logging related area/metrics General metrics handling bug Something isn't working ref/NC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Icinga2 does not reliably collect all performance data values during reload
2 participants