
Set a limit on the unknown status in the ansible callback #407

Closed
wants to merge 1 commit

Conversation

liranr23
Member

@liranr23 commented Jun 1, 2022

When calling the ansible callback, we check the event to see the command
status. Initially the command status is `unknown`. Until now the async
ansible callback had no time limit, so if the ansible runner service
and the artifacts are gone, it stays in the `unknown` state, or moves
back to it, and runs for an unlimited time.
A new configuration value, `AsyncAnsibleTimeout`, is added, set by
default to 180 seconds as the time limit.
When we reach the time limit, we try to cancel the playbook in order
to release the relevant entities.

Change-Id: I31c7c4a1bc17721806c70769377fe1a1cf059b57
Bug-Url: https://bugzilla.redhat.com/2030293
Signed-off-by: Liran Rotenberg <lrotenbe@redhat.com>
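
For illustration, a minimal sketch of the described timeout guard (this is not the actual ovirt-engine change; fetchStatus() and cancelPlaybook() are hypothetical stand-ins for the callback internals, and PlaybookStatus mirrors the class quoted later in this thread):

import java.time.Instant;
import java.util.UUID;

// Sketch only: AsyncAnsibleTimeout caps how long the callback may stay
// in the unknown state before the playbook is cancelled.
public class AnsibleCallbackTimeoutSketch {
    private static final long ASYNC_ANSIBLE_TIMEOUT_SECONDS = 180; // AsyncAnsibleTimeout default

    record PlaybookStatus(String status, String msg) {}

    void monitor(UUID playUuid) throws InterruptedException {
        Instant deadline = Instant.now().plusSeconds(ASYNC_ANSIBLE_TIMEOUT_SECONDS);
        while (true) {
            PlaybookStatus status = fetchStatus(playUuid); // hypothetical helper
            if (!"unknown".equals(status.status())) {
                return; // artifacts appeared; regular event processing takes over
            }
            if (Instant.now().isAfter(deadline)) {
                // Still unknown after the time limit: the runner service and
                // its artifacts are likely gone, so cancel the playbook to
                // release the entities it holds.
                cancelPlaybook(playUuid); // hypothetical helper
                return;
            }
            Thread.sleep(1_000); // next polling round
        }
    }

    PlaybookStatus fetchStatus(UUID playUuid) { return new PlaybookStatus("unknown", ""); } // stub
    void cancelPlaybook(UUID playUuid) { /* stub: would invoke the runner's cancel */ }
}
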
@michalskrivanek
Member

It doesn't sound to me like a great idea to make this even more convoluted with yet another timeout.

I think it could be equally simple, and still fix this problem, by:

  • returning a "running" state when the job's uuid exists, even when there are no events yet
  • returning "unknown" (or changing it to "missing" or "invalid") when it doesn't, because then it never will; someone must have removed it

If we really want to be paranoid about the runner's background execution, we should probably wait right in AnsibleExecutor::runCommand().
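
For illustration, a minimal sketch of that alternative, assuming each job gets a private artifacts directory named after its uuid; the class and method names here are illustrative, not the real engine API:

import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: derive the status from the existence of the job's private
// directory instead of from a timeout.
class PlaybookStatusFromDirSketch {
    record PlaybookStatus(String status, String msg) {}

    static PlaybookStatus statusFor(Path jobPrivateDir) {
        if (Files.exists(jobPrivateDir)) {
            // The uuid directory exists, so the job has started; report it
            // as running even before the first event is emitted.
            return new PlaybookStatus("running", "");
        }
        // The directory does not exist and never will again: someone removed it.
        return new PlaybookStatus("missing", "");
    }
}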

@liranr23
Member Author

> It doesn't sound to me like a great idea to make this even more convoluted with yet another timeout.
>
> I think it could be equally simple, and still fix this problem, by:
>
>   • returning a "running" state when the job's uuid exists, even when there are no events yet
>   • returning "unknown" (or changing it to "missing" or "invalid") when it doesn't, because then it never will; someone must have removed it
>
> If we really want to be paranoid about the runner's background execution, we should probably wait right in AnsibleExecutor::runCommand().

From my inspection, if the artifacts don't exist, you will get `unknown`. Initially, they won't exist.

Here is a snippet from the code:

if (!Files.exists(Paths.get(String.format("%1$s/status", playData)))) {
    // artifacts are not yet present, try to fetch them in the next polling round
    return new PlaybookStatus("unknown", "");
}

I think the only way we can really do something is by keeping the `continue` in the callback and saving whether we have ever been in the `running` state. I remember discussing this option with Arik, and it evolved into the logic we currently have in this PR.
I feel the current implementation is OK, and for now it is used only for OVA.
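
For reference, a minimal sketch of that state-transition idea (sawRunning and the status strings are illustrative assumptions, not code from the PR):

// Sketch only: keep polling, but remember whether the playbook was ever
// seen running. Unknown before any running state is the normal startup
// race; unknown after running means the artifacts disappeared mid-run.
class TransitionTrackerSketch {
    private boolean sawRunning = false;

    boolean shouldKeepPolling(String status) {
        if ("running".equals(status)) {
            sawRunning = true; // the job produced events at least once
            return true;
        }
        if ("unknown".equals(status)) {
            return !sawRunning; // keep waiting only if it never ran
        }
        return false; // terminal status: stop polling
    }
}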

@ahadas
Member

ahadas commented Jun 14, 2022

> I think the only way we can really do something is by keeping the `continue` in the callback and saving whether we have ever been in the `running` state. I remember discussing this option with Arik, and it evolved into the logic we currently have in this PR.

Right, I still think the other approach is better, but I see that operating on a state transition is not so simple here, so the posted code is a reasonable compromise in my opinion. We can go with it to solve the reported issue and then let the infra team do something like what Michal suggested if they think it's needed; after all, it's a pretty small change.

@liranr23
Member Author

> > I think the only way we can really do something is by keeping the `continue` in the callback and saving whether we have ever been in the `running` state. I remember discussing this option with Arik, and it evolved into the logic we currently have in this PR.
>
> Right, I still think the other approach is better, but I see that operating on a state transition is not so simple here, so the posted code is a reasonable compromise in my opinion. We can go with it to solve the reported issue and then let the infra team do something like what Michal suggested if they think it's needed; after all, it's a pretty small change.

Yes, and I think this solution still won't fix the case where the job didn't start (no artifacts yet) and somehow everything went down.

@michalskrivanek
Member

I do not yet see any argument against doing what I suggested. It's less code and fewer extra configurable variables, so I do not see the point of this current change.

@ahadas
Member

ahadas commented Jun 14, 2022

> I do not yet see any argument against doing what I suggested. It's less code and fewer extra configurable variables, so I do not see the point of this current change.

If it's possible, then why not? Looking at this code, it seems like there was a race which could have caused the engine to monitor the task before that path exists, so the engine waits a bit for it. If there's an alternative that lets us distinguish between the initial phase and the problematic phase that was reported, then great. It's maybe a bit of an overkill for this particular issue, considering that this looks like the result of restarting services, which shouldn't really happen, but maybe it's an opportunity to improve that code, sure.

@michalskrivanek
Member

Tagging @mnecas again to take this into consideration in #423.

@michalskrivanek
Member

> it seems like there was a race which could have caused the engine to monitor the task before that path exists, so the engine waits a bit for it

That was specifically about waiting for events to happen. I'm talking about the private dir itself: we do not need to wait for the first event to be emitted to know that the job is running.

@mnecas
Member

mnecas commented Jun 15, 2022

I did some investigation and created another PR; please check out #468.

@liranr23
Member Author

liranr23 commented Jun 15, 2022

> I did some investigation and created another PR; please check out #468.

This should solve the bug we are trying to fix in this PR.

@liranr23
Member Author

Closing this PR in favor of #468.

@liranr23 closed this Jun 16, 2022