-
Notifications
You must be signed in to change notification settings - Fork 270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set limit the unknown status in ansible callback #407
Conversation
backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/AnsibleCallback.java
Outdated
Show resolved
Hide resolved
When calling ansible callback, we check the event and see the command status. Initially the command status is `unknown`. Until now the async ansible callback had no limitations, but if the ansible runner service and the artifacts are gone, it will keep being on `unknown` state or move to this state again, running for unlimited time. A new configuration, `AsyncAnsibleTimeout` added, by default set to 180 seconds as a time limit. When we reach the time limit we will try to cancel the playbook in order to release the usage of the relevant entities. Change-Id: I31c7c4a1bc17721806c70769377fe1a1cf059b57 Bug-Url: https://bugzilla.redhat.com/2030293 Signed-off-by: Liran Rotenberg <lrotenbe@redhat.com>
backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/AnsibleCallback.java
Show resolved
Hide resolved
It doesn't sound to me like a great idea to make this convoluted even further with yet another timeout. I think it should be equally simple and yet fixing this problem by:
if we really want to be paranoid about runner background execution we should probably wait right in AnsibleExecutor::runCommand() |
from my inspection, if they don't exists in the engine you will get here is a snippet from the code:
i think the only way we kinda can do something is by: |
right, I still think that the other approach is better but I see operating on a state transition is not so simple here so the posted code is a reasonable compromise in my opinion - we can go with that for solving the reported issue and then let the infra team do something like what Michal suggested if they think it's needed, after all it's a pretty small change |
Yes, i think this solution still won't fix a case where the job didn't start (no artifact yet) and somehow everything went down. |
I do not see any argument against doing what I suggested yet. It's less code, less extra configurable variables, so I do not see the point of this current change. |
if it's possible then why not? looking at this code, it seems like there was a race which could have caused the engine to monitor the task before that path exists and so the engine waits with that a bit. if there's an alternative that we can do to distinguish between the initial phase and the problematic phase that was reported, then great. maybe a bit of an overkill for this particular issue, considering that this looks like a result of restarting services which shouldn't really happen, but maybe it is an opportunity to improve that code, sure |
that was specifically for waiting for events to happen. I'm talking about the private dir itself. We do not need to wait for the first event to be emitted to know that it's running. |
Did some investigation and created another PR please checkout #468. |
This should solve the bug we are trying to fix in this PR. |
Closing this PR because of #468 . |
When calling ansible callback, we check the event and see the command
status. Initially the command status is
unknown
. Until now the asyncansible callback had no limitations, but if the ansible runner service
and the artifacts are gone, it will keep being on
unknown
state ormove to this state again, running for unlimited time.
Change-Id: I31c7c4a1bc17721806c70769377fe1a1cf059b57
Bug-Url: https://bugzilla.redhat.com/2030293
Signed-off-by: Liran Rotenberg lrotenbe@redhat.com