-
Notifications
You must be signed in to change notification settings - Fork 581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Let redundancy groups not just fail #10190
Comments
To be honest, I don't quite understand what this issue is all about :). You have listed a number of open questions that don't quite fit the title of the issue.
This particular case might be covered by #10227 as that issue is about syncing dependencies to Redis.
The
Isn't that what #10014 is about? That's a confirmed bug that will be fixed by #10228, but I don't see how that applies to this issue.
This is already the case, all parents in a given dependency graph are evaluated recursively, but no redundancy groups have been taken into account so far.
Well, it's simply unreachable. Isn't the unreachable state intended for use cases like this? There's no need to add another meaning to it. |
The main point of this issue is that redundancy groups don't have a reachability flag in the current design proposal. They need one however, as Web needs it to identify root problems. #10227 does not include this yet, as Icinga doesn't even consider it. #10228 does exactly miss the point I made here and introduces contradictory states: A host or service might now be unreachable, although every related dependency states it is available. I don't know how #10227 determines a group's state at the moment, but let's assume it's a loop over all related dependencies which are asked if they're available. Which they all are. And in this case, this contradiction is written to redis and subsequently to the database where Web will have a hard time following a hierarchy up to the actual root problem as when it reaches the group in question it stops, as it isn't failed. And of course, it didn't fail, it just doesn't have any member that is reachable. And now let's assume every member of a redundancy group is itself dependent on another (grand) parent, just like in my example. The group only fails (and thus states the child isn't reachable) if all members are either unreachable themselves or in such a state that the dependency fails. But what if one of the members isn't in such a state that the dependency fails but it's (grand) parent is in such a state that their dependency fails? The answer: Nothing, unless the parent is checked. Which it won't if the parent <-> grandparent dependency disables checks. -- I know this is a culmination of several issues. But I hope I made it clear now what I'm up to. |
Well, redundancy groups simply don't exist from Icinga 2's point of view, or at least they're not one of those things that can have states and I don't think it would make sense to change that. If in doubt, #10227, if I understood @oxzi correctly, will construct the redundancy groups virtually and sync them to Redis. So I definitely think this is something that should be fixed with or based on #10227's codebase.
With #10228, a host or service becomes unreachable when any of its parent dependencies become unreachable, so I don't know why this contradicts the point you made. After all...
... isn't the PR doing exactly what you've asked for here?
This can only be answered by @oxzi. Since Icinga/icingadb#347 (comment) have listed all the necessary database tables, I guess this needs to be done or maybe is already in progress with #10227.
That sounds to me like a problem that might occur in the future, not something that already exists. There isn't even a corresponding PR for #10227 yet, so there doesn't seem to be any point in discussing it here now. Originally I just wanted to understand this issue and see if I could fix it in parallel with #10227, but these two are mutually dependent.
If I understand your twisted question correctly :), this seems to be related to #10143. |
✅ 😅
reachable is a state which a child is in, available is what a parent can be or not. There are two distinct objects, which one's reachability depends on the others availability. Even with #10228, a child becomes not reachable while the parent is still available.
No, it doesn't touch anything related to redundancy groups. And it cannot either, since they don't exist yet, as you pointed out. What I ask here cannot be fixed by a single change anyway, I just wanted to outline my theory that there is something wrong with how reachability and availability are currently determined.
It certainly will, if one relies on what
Yes…
|
Recently, we discussed this in person and I wasn't so sure anymore whether we required Note that, as always, implicit relations (host -> service) are of no importance here. graph TD;
A-->B;
B-->D;
C-->D;
subgraph BC [ redundancy group ]
B~~~C
end
The group is failed if #10158 talks of If Now, the root problem of say graph TD;
A[down]-->B;
B["up (unreachable)"]-->D;
C[down]-->D["down (unreachable)"];
subgraph BC [ unreachable ]
B~~~C
end
In this case |
This is not a bug, not yet, at least. Once #10014 is fixed, this should be considered.
tl/dr
Redundancy groups should not just fail, they should in addition be not reachable or not determinable, just like hosts and services are.
What's this about?
Right now, the child host will be unreachable in two cases:
Once the second case is fixed, it will likely behave like this instead:
But what happens in case only one of the parents is down and the other unreachable?
I suppose so, since already right now, the availability of the child host is influenced by its parent's reachable state. It will just be extended to consider redundancy groups in such a case.
But then there is
Dependency::IsAvailable
Which is what is used to determine a redundancy group's state at the moment. Though, it doesn't consider the parent's reachable state at all. The parent might be UP, but unreachable, and it returns true.
So, in order to really ensure that the child host is unreachable in the third case, this might need to change. (The potential bug)
Of course, this has larger side-effects, as it changes how dependencies work in general.
Which is what this issue is really about! As I believe this is a good thing. Why shouldn't a dependency fail, in case another one higher up in the hierarchy fails? This is already the case, I think, anyway. This discrepancy will only pop up in case checks are disabled on the parent in question. (As if checks run, the parent will eventually go down/unknown)
Additional note
What led me to this, was the wish to understand how a redundancy group's state will be determined as part of Icinga/icingadb#347.
As of today I thought it simply represents whether all related dependencies failed or not. But now a failed dependency might be failing, because a parent might just not be reachable. And so will a redundancy group.
This all sounds fine, unless you want to identify the actual root problem in a failing dependency chain. (A root problem is a dependency node, which is reachable and has a problem)
In a dependency chain, the current plan is, to represent redundancy groups as actual nodes with an actual state. While hosts and services also have state, they can in addition be not reachable. Redundancy groups can only fail.
But a redundancy group, with one host member that is UP but not reachable, is technically also not reachable. So it's not just a failing dependency, but one that is … not determinable?
The text was updated successfully, but these errors were encountered: