Improve reload handling for features (metric & queue flush, activation priority) #6970
This reverts commit 1eaad06.
This reverts commit b81aa6a.
This reverts commit f0e12ff.
This reverts commit 8ad1717.
This reverts commit 8470fac.
This stops the checker component first, then notifications, then features, then config objects, then the API feature and logger(s). Patch taken from @Al2Klimov
Refactored the code into a local mutex and added some more debug logging while at it.
…eue on Pause/Shutdown/Reload Patch taken from @Al2Klimov but moved into Pause()
Tests
T1
T2
icinga2_6841_reload_flush.txt.zip
InfluxDB
Perfdata
The code holds some stream writing now commented out. This was the easiest way to figure out whether the file handles are actually flushed ;-)
Reload Logging
@dnsmichi, it's not about credits, it's about Git history noise (someone will complain about it one day). Feel free to assign this one to me and I'll rebase it (w/o making actual changes).
Thanks, but I'd prefer to keep it this way. I'll now merge this for snapshot packages.
- Decrease Object Authority updates to 10s (was 30s)
- Decrease failover timeout to 30s (was 60s)
- Decrease cold startup (after (re)start) with no OA updates to 30s (was 60s)
- Immediately connect on Resume()
- Fix query priority which got broken with #6970
- Add more logging when a failover is in progress
```
[2019-03-29 16:13:53 +0100] information/IdoMysqlConnection: Last update by endpoint 'master1' was 8.33246s ago (< failover timeout of 30s). Retrying.
[2019-03-29 16:14:23 +0100] information/IdoMysqlConnection: Last update by endpoint 'master1' was 38.3288s ago. Taking over 'ido-mysql' in HA zone 'master'.
```
- Add more logging for reconnect and disconnect handling
- Add 'last_failover' attribute to IDO*Connection objects
refs #6970
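
The two log lines in that commit message suggest a simple age-versus-timeout decision. The following is a minimal Python sketch of that decision, not the IDO implementation: the function names are hypothetical, and the behaviour at exactly 30s is an assumption (the log only shows that an age below the timeout leads to a retry).

```python
# Sketch only: names are hypothetical, not taken from the Icinga 2 code base.
FAILOVER_TIMEOUT = 30.0  # seconds, per the commit message (was 60s)


def should_take_over(seconds_since_last_update: float,
                     failover_timeout: float = FAILOVER_TIMEOUT) -> bool:
    """Return True once the other endpoint's last update is too old.

    Assumption: an age equal to the timeout counts as expired, because the
    log message only states that "< failover timeout" means retrying.
    """
    return seconds_since_last_update >= failover_timeout


def describe(endpoint: str, age: float,
             failover_timeout: float = FAILOVER_TIMEOUT) -> str:
    """Build a message loosely modelled on the log lines quoted above."""
    if should_take_over(age, failover_timeout):
        return f"Last update by endpoint '{endpoint}' was {age:g}s ago. Taking over."
    return (f"Last update by endpoint '{endpoint}' was {age:g}s ago "
            f"(< failover timeout of {failover_timeout:g}s). Retrying.")


if __name__ == "__main__":
    print(describe("master1", 8.33246))
    print(describe("master1", 38.3288))
```

With the ages from the quoted log, the first call reports a retry and the second a takeover.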
This follows the same principle as the shutdown handler, and was introduced with the changed reload handling in 2.9. Previously, IsShuttingDown() was sufficient because the flag was set in one location. The SigUsr2 handler introduced a new location where m_ShuttingDown is not necessarily set yet. Since this handler gets called when l_Restarting is enabled, we'll use this flag to avoid config update events resulting in object deactivation (object->IsActive() always returns false).
refs #5996 refs #6691 refs #6970 fixes #7125
(cherry picked from commit 78e24c5)
Summary
Work queues (WQs) need to be flushed on Pause/Stop handling;
this also includes the metric buffer strings in InfluxDB/Elasticsearch.
The Perfdata feature did not properly close file handles,
leaving performance data behind.
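
As an illustration of why flushing on pause matters, here is a minimal Python sketch of a buffered writer that drains its pending lines when paused. It is not the InfluxDB, Elasticsearch or Perfdata writer code: the class, its methods and the threshold are all hypothetical.

```python
import io


class BufferedMetricWriter:
    """Hypothetical writer that batches metric lines before sending them."""

    def __init__(self, sink, flush_threshold: int = 3):
        self.sink = sink                  # any file-like object with write()/flush()
        self.flush_threshold = flush_threshold
        self.buffer = []                  # lines not yet written to the sink
        self.paused = False

    def write(self, line: str) -> None:
        if self.paused:
            raise RuntimeError("writer is paused")
        self.buffer.append(line)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Write everything that is still buffered, then flush the sink itself.
        if self.buffer:
            self.sink.write("\n".join(self.buffer) + "\n")
            self.buffer.clear()
        self.sink.flush()

    def pause(self) -> None:
        # Without this flush, lines below the threshold would be left behind.
        self.flush()
        self.paused = True


if __name__ == "__main__":
    sink = io.StringIO()
    writer = BufferedMetricWriter(sink, flush_threshold=10)
    writer.write("load,host=a value=1")
    writer.write("load,host=b value=2")
    writer.pause()
    print(sink.getvalue(), end="")
```

The point of the sketch is the ordering inside `pause()`: drain first, then mark the writer as paused.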
Reconnect timers need to be stopped on Pause() too;
that was missing from #6725. IDO query queues had the wrong
priority for object activation and session cleanup, sometimes
resulting in zombie objects that were only deactivated after minutes.
The activation priority needs to be respected in reverse order
on shutdown/reload. Furthermore, features need to be activated
before checker/notification, and on shutdown/reload,
checker/notification need to be stopped first. There are also
changes that make IcingaApplication a primary citizen and start
Downtime objects early.
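
The ordering rule can be sketched in a few lines of Python. This is only an illustration: the component names and priority numbers below are hypothetical and were picked so that the stop order matches the one described in the commit message (checker, notifications, features, config objects, API feature, logger).

```python
# Hypothetical priorities: lower values start earlier. These numbers are
# illustrative only and are not the values used by Icinga 2.
PRIORITIES = {
    "logger": -100,
    "api": -50,
    "config_objects": 0,
    "features": 100,
    "notification": 200,
    "checker": 300,
}


def start_order(priorities: dict) -> list:
    """Components sorted by ascending activation priority."""
    return [name for name, _ in sorted(priorities.items(), key=lambda kv: kv[1])]


def stop_order(priorities: dict) -> list:
    """Shutdown/reload walks the activation order in reverse."""
    return list(reversed(start_order(priorities)))


if __name__ == "__main__":
    print("start:", start_order(PRIORITIES))
    print("stop: ", stop_order(PRIORITIES))
```

Deriving the stop order from the start order, rather than keeping two separate lists, keeps the two from drifting apart.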
The ReloadTimeout constant allows modifying the default
of 300 seconds, previously hardcoded in two places: application
reload and config package reload. This is for advanced usage only.
Notes
This reverts #6882 and #6908 and re-implements them
with refined behaviour.
The implemented changes may slow down the overall reload,
since they ensure everything is stopped and flushed in order.
fixes #6841
ref/NC/591065