WIP: Merge adjacent downtime segments #6579
Conversation
Put running downtimes in effect and merge segments

Put an already running downtime in effect immediately: If Icinga2 was restarted with a newly configured downtime that should be in effect at the time of restart, the should-be-running segment of it was not put into effect.

Merge adjacent downtime segments: As legacy time periods can't span midnight, a configured downtime spanning midnight is technically two (immediately adjacent) segments. As segments were queued individually, at midnight, the downtime technically ended (sending a DowntimeEnd notification) only to start immediately again (sending a DowntimeStart notification).

With this fix, an immediately following segment is merged into the current one in case the current one is ending soon (where "soon" is defined as "12 hours or less"). The time limit is arbitrary, but necessary to prevent endless merging in case of a 7*24 downtime.
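For illustration, a minimal sketch of the merge-on-demand check described above could look like the following. This is not the actual patch; the segment pair and the 12-hour constant are assumptions based on the description above.

// Sketch: merge the immediately following segment into the current downtime
// if the current downtime ends within the next 12 hours.
const double mergeWindow = 12 * 60 * 60; // arbitrary look-ahead limit
double currentEnd = downtime->GetEndTime();

if (currentEnd - Utility::GetTime() <= mergeWindow && segment.first == currentEnd) {
    // The next segment starts exactly when the current one ends,
    // so extend the current downtime instead of queueing a new one.
    downtime->SetEndTime(segment.second, false);
}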
Test
Create a new ScheduledDowntime which is already running. Ensure that the downtime is in effect immediately.
To verify that the downtime is immediately in effect I queried the API.
API: The Downtime is created.
API: Notice the

Problem
After a while there are multiple Downtimes created.
API: There are multiple entries.
Querying the host via API I can confirm that there are multiple downtimes in effect.
API: Notice the
Looking into the debug log I can see multiple log messages like the following. All log messages are for the same ScheduledDowntime object.
Short summary: The Downtime is re-created every minute, so after a while we have multiple downtime objects.
See my test.
In ScheduledDowntime::FindRunningSegment(), only regard downtimes that last longer than minEnd, not at least as long. Otherwise, a running downtime with a fixed start date will be queued over and over again.
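In code, the difference is just the comparison operator; a sketch (assuming segments are (begin, end) timestamp pairs as in the snippet later in this thread):

// Sketch: a segment only qualifies as still running if it ends strictly
// after minEnd; with >= a fixed-start downtime would be re-queued forever.
if (segment.second > minEnd) {
    // candidate for the running segment
}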
Thanks for testing and reporting the failure. There was a [...] I'm also unsure as to whether GitHub automagically updated my Pull Request.
@mcktr Can you have a look again please? I'd like to include this in 2.11 then.
Yep, I will have a look in the next few days.
Thanks for updating the PR.

Tests
Put running downtime in effect immediately
Downtime is in effect immediately after restart. Waiting for 10 minutes, there are no other downtimes created -> Good. This part of the patch works.

Merge adjacent downtime segments
Watching the log to verify the merging.
The first merge attempt is against the Downtime itself, I guess we can safely skip this attempt. The second merge attempt is against the adjacent downtime, which starts right after the initial downtime. But due to a date mismatch (Thu Jan 1 01:00:00 1970 != Mon Oct 15 12:00:00 2018) it doesn't fit and no merge will happen. I assume that some default date value is used since the date is equal to unixtime 0. Please have a look into the second part of your patch.
The second merge attempt is against the adjacent downtime
The patch's intention is not to merge adjacent downtimes. The intent is to merge adjacent segments of the same downtime (especially because the legacy timeperiod format doesn't allow specifying periods spanning midnight).
In your example, the behaviour is as intended, and, I guess, what makes sense: one downtime ends and another one starts. You may wish to be notified etc. in that case, I suppose.
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Evaluating segment: 2018-10-15: 10:00-11:30
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Next Segment doesn't fit: Thu Jan 1 01:00:00 1970 != Mon Oct 15 11:30:00 2018
I'll have to dig into this. Maybe it's an error, maybe just the debug output is misleading.
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Not by us (devel-host-001!devel-downtime-002 != devel-host-001!devel-downtime-001)
This is exactly what I would expect to happen. These are two different downtimes.
Please have a look into the second part of your patch.
I'll look into the confusing debug output.
Thanks for the clarification. 👍 So the above example is correctly not merged together, since these are two different downtimes.

More Tests
If I understand you correctly the second segment should be merged into the first, since this is the same downtime with two adjacent segments, correct? It does.
One thing I noticed in Icinga Web 2 during the tests: Notice the negative expires in value. After all segments of the downtime are expired and the downtime is finally over, it won't get cleared in Icinga Web 2. When the transition from the first segment to the second segment happens, a second Downtime shows up, but this one will be cleared on expiration. The Downtime with the negative expires in value stays there until I restart the Icinga 2 daemon.

A guess what happens: The Downtime is initially created with the original end time of the first segment; after a while it is merged with the adjacent segment (you can verify this via the API). But the updated end time does not find its way into the database which Icinga Web 2 uses. So in Icinga Web 2 we'll see the old end time of the first segment for the Downtime, which results in a negative expires in value.

I'm looking forward to your feedback.
ranges = {
"2018-10-15" = "17:55-18:00,18:00-18:05"
}
While this is not the intended use case (which is more like
"friday" = "17:00-24:00"
"saturday" = "00:00-24:00"
"sunday" = "00:00-24:00"
"monday" = "00:00-09:00"
or simply 24*7), yes, these are two adjacent segments of the same downtime.
It does.
Fine.
A guess what happens:
Well, I don't have to guess and you are right.
The reason for the delayed merge (which may look awkward) is that, in case of a 24*7 downtime, I would be merging endlessly. While there may be a way to detect that, it appeared easier to merge on demand.
you can verify this via the API
No need to since that's exactly how I designed it.
But the updated end time does not find it's way into the database
which Icinga Web 2 uses.
Hm. Could someone educate me on the subject? Are you simply not meant to extend a downtime? Is Icinga Web 2 at fault for not re-checking it? Is it simply me needing to call notify_api_of_downtime_change() or something?
What happens if you change a downtime manually entered via the Web Interface (should that be possible, I don't do that)?
There is currently no way to extend an existing downtime, except for removing and re-creating it with the new ending time. In terms of configuration objects, such as hosts, services, users, downtimes, etc., Icinga Web 2 relies on the IDO database. The downtime object which you can fetch via the API is current, so updating the API wouldn't help here.

I took pen and paper and drew some pictures to illustrate the problem a bit more. We have the following ScheduledDowntime object.
The expected behavior is a downtime start notification at 14:30 and a downtime end notification at 14:40. The current behavior looks like the following:

The downtime for the first segment (14:30-14:35) is created. The downtime object is written to the IDO database and exposed via the API. A downtime start notification will be sent at 14:30.

The initially created downtime is merged with the adjacent segment (14:35-14:40). This is done by setting the downtime ending time to the ending time of the adjacent segment (14:40):

if (segment.first == current_end) {
    Log(LogDebug, "ScheduledDowntime")
        << "Next Segment fits, extending end time "
        << Utility::FormatDateTime("%c", current_end) << " to "
        << Utility::FormatDateTime("%c", segment.second);
    downtime->SetEndTime(segment.second, false);
    return;
}

This works for the API, but the IDO database never sees the updated ending time.

Another downtime is created for the second segment (14:35-14:40). The downtime object is written to the IDO database and exposed via the API. A downtime start notification will be sent at 14:35.

At the end we have two downtimes, and four notifications will be sent out.

Summary: Extending the downtime ending time by setting it to the end time of the second segment is not sufficient.
Looking in downtime.cpp we have
If an adjacent segment is merged, there shouldn't be another downtime for the second segment. Since the first part of your patch (put an already running downtime in effect immediately) works flawlessly, could you split the first part out into another PR? This way we can focus here on getting the second part working. :)
There is currently no way to extend an existing downtime
I see.
Another downtime is created for the second segment (14:35-14:40).
This would be an error I'll have to correct.
the downtime object in the IDO database must be updated when merging the second segment
I have no idea how to do that.
the downtime object for the second segment must not be created
Yes.
could you split up the first part in another PR?
I'll try to do that. But I'm currently (Icinga2-wise) also working on trying to clean up ProcessCheckResult() to be able to introduce a Soft-OK state I'd like to have.
Suppress a misleading debug message stating the next segment won't fit because its start time (The Epoch) didn't match. Instead, log that no next segment exists.
1. I fixed the confusing debug message (The Epoch not matching the current end time) in efuss@1e58214 (see the sketch after this list).
2. You mention that after merging, a second downtime for the second segment is created. I can't figure out how that could happen. Other than your hand-written notes, could you provide me with debug output (or whatever) showing that effect?
3. I was thinking of, instead of merging one segment ahead, merging as much as possible when creating the downtime in the first place, avoiding the trouble that extending its end time may cause. However, I'd need to recognize a 24*7 downtime in order to avoid endless merging and couldn't come up with an idea for that (any recognition method that came to my mind could be subverted by, e.g., writing the "Monday" part as "00:00-12:00,12:00-24:00").
4. I (hopefully) split off the put-in-effect-immediately part in Pull Request #6704. I haven't even compile tested that yet.
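Regarding point 1, the kind of check involved might look roughly like this; a sketch only, assuming FindNextSegment() returns a (begin, end) pair whose begin is 0 when no further segment exists:

std::pair<double, double> nextSegment = FindNextSegment();

if (nextSegment.first == 0) {
    // No further segment exists; say so instead of comparing against The Epoch.
    Log(LogDebug, "ScheduledDowntime") << "No next segment found";
} else if (nextSegment.first != currentEnd) {
    Log(LogDebug, "ScheduledDowntime")
        << "Next segment doesn't fit: "
        << Utility::FormatDateTime("%c", nextSegment.first) << " != "
        << Utility::FormatDateTime("%c", currentEnd);
}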
EF> [...] I'd need to recognize a 24*7 downtime in order to avoid endless merging and couldn't come up with an idea for that [...]
I just had an idea for that. I'll probably come up with an implementation for this entirely different approach to downtime segment merging next week.
EF> I just had an idea for that. I'll probably come up with an implementation for this entirely different approach to downtime segment merging next week.
Unfortunately, it's complicated. I need more input/advice on this.
The objective is to work around a deficiency of legacy time periods, in which it's impossible to directly specify an interval spanning midnight. In particular, it's impossible to explicitly specify an "always" interval.
This implies the necessity of splitting scheduled downtimes at midnight, leading to spurious end/begin notifications, rendering downtime notifications unusable.
My first approach was to merge future segments into a running downtime on demand. This changes an already active downtime's end, leading to problems in at least Icinga Web 2.
Could some senior developer please comment on this?
Is it just that I should additionally be doing this-and-that when extending the active downtime's end (or Icinga Web 2 should be doing such) or is it simply contrary to Icinga2's design to tamper with an active downtime?
My second approach is to merge as many segments as possible prior to actually installing the downtime. This clearly avoids the problem mentioned above.
The tricky part is how to detect endless downtimes, which is close to impossible due to the convoluted syntax of legacy time periods. I do have an algorithm that can be proven to reliably detect any actually infinite specification; however, it can be fooled into regarding an actually finite specification as infinite: Take the finite specification
"Monday" = "00:00 - 24:00"
[...]
"Friday" = "00:00 - 24:00"
and add
<date of next saturday> = "00:00 - 24:00"
<date of next sunday> = "00:00 - 24:00"
and it will flag an infinite loop where there is none. One can work around this, but you'll only need a more convoluted example to fool an improved version.
One possible solution is to regard this specification as silly (it should probably be expressed as two different downtimes), document the exact behaviour and treat specifications like this as "always".
A third approach, which would be nearly trivial to implement, is to keep merging until having accumulated an enormous period, say, a year or ten years. Then, either simply stop merging (giving spurious notifications every <enormous period>) or stop merging and treat the period as infinite.
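A sketch of that third approach, with a hard cap on how far merging may look ahead (the cap value and the FindNextSegment() overload taking a start time are hypothetical, not existing code):

// Sketch: keep merging immediately adjacent segments until either no further
// segment fits or an (arbitrarily) enormous cap is reached.
const double mergeCap = 365 * 24 * 60 * 60; // e.g. one year

double begin = segment.first;
double end = segment.second;

while (end - begin < mergeCap) {
    std::pair<double, double> next = FindNextSegment(end); // hypothetical overload
    if (next.first != end)
        break; // not immediately adjacent, stop merging
    end = next.second;
}
// If the cap was hit, either stop merging here (spurious notifications every
// mergeCap seconds) or treat the downtime as infinite.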
Which approach should I pursue?
@efuss Can you please move this into a new issue for better discussion with involved developers and users? The PR unfortunately went stale in this regard. Thanks.
Put an already running downtime in effect immediately:
If Icinga2 was restarted with a newly configured downtime that should
be in effect at the time of restart, the should-be-running segment of
it was not put into effect.
Merge adjacent downtime segments:
As legacy time periods can't span midnight, a configured downtime
spanning midnight is technically two (immediately adjacent) segments.
As segments were queued individually, at midnight, the downtime
technically ended (sending a DowntimeEnd) only to start immediately
again (sending a DowntimeStart notification).
With this fix, an immediately following segment is merged into the
current one in case the current one is ending soon (where "soon" is
defined as "12 hours or less"). The time limit is arbitrary, but
necessary to prevent endless merging in case of a 7*24 downtime.
Note that the diff of scheduleddowntime.cpp looks weird because it
inserts a new FindRunningSegment() function in front of FindNextSegment()
and the initial lines of both functions look sufficiently similar to make
diff believe FindNextSegment() got changed.