
Set a limit on the unknown status in the ansible callback #407

Closed
wants to merge 1 commit

Conversation

liranr23
Member

@liranr23 commented Jun 1, 2022

When calling the ansible callback, we check the event to see the command
status. Initially the command status is `unknown`. Until now the async
ansible callback had no time limit, so if the ansible runner service
and the artifacts are gone, it stays in the `unknown` state, or moves
back to it, and runs for an unlimited time.
A new configuration value, `AsyncAnsibleTimeout`, is added, set by
default to 180 seconds as the time limit.
When we reach the time limit, we try to cancel the playbook in order
to release the relevant entities.

Change-Id: I31c7c4a1bc17721806c70769377fe1a1cf059b57
Bug-Url: https://bugzilla.redhat.com/2030293
Signed-off-by: Liran Rotenberg <lrotenbe@redhat.com>
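
For illustration, a minimal sketch of the described timeout guard (this is not the actual ovirt-engine change; fetchStatus() and cancelPlaybook() are hypothetical stand-ins for the callback internals, and PlaybookStatus mirrors the class quoted later in this thread):

import java.time.Instant;
import java.util.UUID;

// Sketch only: AsyncAnsibleTimeout caps how long the callback may stay
// in the unknown state before the playbook is cancelled.
public class AnsibleCallbackTimeoutSketch {
    private static final long ASYNC_ANSIBLE_TIMEOUT_SECONDS = 180; // AsyncAnsibleTimeout default

    record PlaybookStatus(String status, String msg) {}

    void monitor(UUID playUuid) throws InterruptedException {
        Instant deadline = Instant.now().plusSeconds(ASYNC_ANSIBLE_TIMEOUT_SECONDS);
        while (true) {
            PlaybookStatus status = fetchStatus(playUuid); // hypothetical helper
            if (!"unknown".equals(status.status())) {
                return; // artifacts appeared; regular event processing takes over
            }
            if (Instant.now().isAfter(deadline)) {
                // Still unknown after the time limit: the runner service and
                // its artifacts are likely gone, so cancel the playbook to
                // release the entities it holds.
                cancelPlaybook(playUuid); // hypothetical helper
                return;
            }
            Thread.sleep(1_000); // next polling round
        }
    }

    PlaybookStatus fetchStatus(UUID playUuid) { return new PlaybookStatus("unknown", ""); } // stub
    void cancelPlaybook(UUID playUuid) { /* stub: would invoke the runner's cancel */ }
}
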
@michalskrivanek
Member

It doesn't sound to me like a great idea to make this even more convoluted with yet another timeout.

I think it could be equally simple, and still fix this problem, by:

  • returning a "running" state when the job's uuid exists, even when there are no events yet
  • returning "unknown" (or changing it to "missing" or "invalid") when it doesn't, because then it never will; someone must have removed it

If we really want to be paranoid about the runner's background execution, we should probably wait right in AnsibleExecutor::runCommand().
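
For illustration, a minimal sketch of that alternative, assuming each job gets a private artifacts directory named after its uuid; the class and method names here are illustrative, not the real engine API:

import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: derive the status from the existence of the job's private
// directory instead of from a timeout.
class PlaybookStatusFromDirSketch {
    record PlaybookStatus(String status, String msg) {}

    static PlaybookStatus statusFor(Path jobPrivateDir) {
        if (Files.exists(jobPrivateDir)) {
            // The uuid directory exists, so the job has started; report it
            // as running even before the first event is emitted.
            return new PlaybookStatus("running", "");
        }
        // The directory does not exist and never will again: someone removed it.
        return new PlaybookStatus("missing", "");
    }
}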

@liranr23
Member Author

> It doesn't sound to me like a great idea to make this even more convoluted with yet another timeout.
>
> I think it could be equally simple, and still fix this problem, by:
>
>   • returning a "running" state when the job's uuid exists, even when there are no events yet
>   • returning "unknown" (or changing it to "missing" or "invalid") when it doesn't, because then it never will; someone must have removed it
>
> If we really want to be paranoid about the runner's background execution, we should probably wait right in AnsibleExecutor::runCommand().

From my inspection, if the artifacts don't exist, you will get `unknown`. Initially, they won't exist.

Here is a snippet from the code:

if (!Files.exists(Paths.get(String.format("%1$s/status", playData)))) {
    // artifacts are not yet present, try to fetch them in the next polling round
    return new PlaybookStatus("unknown", "");
}

I think the only way we can really do something is by keeping the `continue` in the callback and saving whether we have ever been in the `running` state. I remember discussing this option with Arik, and it evolved into the logic we currently have in this PR.
I feel the current implementation is OK, and for now it is used only for OVA.
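
For reference, a minimal sketch of that state-transition idea (sawRunning and the status strings are illustrative assumptions, not code from the PR):

// Sketch only: keep polling, but remember whether the playbook was ever
// seen running. Unknown before any running state is the normal startup
// race; unknown after running means the artifacts disappeared mid-run.
class TransitionTrackerSketch {
    private boolean sawRunning = false;

    boolean shouldKeepPolling(String status) {
        if ("running".equals(status)) {
            sawRunning = true; // the job produced events at least once
            return true;
        }
        if ("unknown".equals(status)) {
            return !sawRunning; // keep waiting only if it never ran
        }
        return false; // terminal status: stop polling
    }
}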

@ahadas
Member

ahadas commented Jun 14, 2022

> I think the only way we can really do something is by keeping the `continue` in the callback and saving whether we have ever been in the `running` state. I remember discussing this option with Arik, and it evolved into the logic we currently have in this PR.

Right, I still think the other approach is better, but I see that operating on a state transition is not so simple here, so the posted code is a reasonable compromise in my opinion. We can go with it to solve the reported issue and then let the infra team do something like what Michal suggested if they think it's needed; after all, it's a pretty small change.

@liranr23
Member Author

> > I think the only way we can really do something is by keeping the `continue` in the callback and saving whether we have ever been in the `running` state. I remember discussing this option with Arik, and it evolved into the logic we currently have in this PR.
>
> Right, I still think the other approach is better, but I see that operating on a state transition is not so simple here, so the posted code is a reasonable compromise in my opinion. We can go with it to solve the reported issue and then let the infra team do something like what Michal suggested if they think it's needed; after all, it's a pretty small change.

Yes, and I think this solution still won't fix the case where the job didn't start (no artifacts yet) and somehow everything went down.

@michalskrivanek
Member

I do not yet see any argument against doing what I suggested. It's less code and fewer extra configurable variables, so I do not see the point of this current change.

@ahadas
Member

ahadas commented Jun 14, 2022

> I do not yet see any argument against doing what I suggested. It's less code and fewer extra configurable variables, so I do not see the point of this current change.

If it's possible, then why not? Looking at this code, it seems like there was a race which could have caused the engine to monitor the task before that path exists, so the engine waits a bit for it. If there's an alternative that lets us distinguish between the initial phase and the problematic phase that was reported, then great. It's maybe a bit of an overkill for this particular issue, considering that this looks like the result of restarting services, which shouldn't really happen, but maybe it's an opportunity to improve that code, sure.

@michalskrivanek
Member

Tagging @mnecas again to take this into consideration in #423.

@michalskrivanek
Member

> it seems like there was a race which could have caused the engine to monitor the task before that path exists, so the engine waits a bit for it

That was specifically about waiting for events to happen. I'm talking about the private dir itself: we do not need to wait for the first event to be emitted to know that the job is running.

@mnecas
Member

mnecas commented Jun 15, 2022

I did some investigation and created another PR; please check out #468.

@liranr23
Member Author

liranr23 commented Jun 15, 2022

> I did some investigation and created another PR; please check out #468.

This should solve the bug we are trying to fix in this PR.

@liranr23
Member Author

Closing this PR in favor of #468.

@liranr23 closed this Jun 16, 2022