Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

artemis plugin does not cancel beaker guest after SIGTERM #2729

Open
thrix opened this issue Mar 4, 2024 · 3 comments
Open

artemis plugin does not cancel beaker guest after SIGTERM #2729

thrix opened this issue Mar 4, 2024 · 3 comments
Assignees
Labels
priority | must high priority, must be included in the next release
Milestone

Comments

@thrix
Copy link
Collaborator

thrix commented Mar 4, 2024

See this report:

https://issues.redhat.com/browse/TFT-2471

After SIGTERM tmt finished with 0, seems some signal handling issues maybe (destroy did not run).

�[08:02:54] [W] [worker_0] [stderr] [RHEL-9.4.0-Nightly:x86_64:/plans/nay] [server-1]         guest: has been requested�[0m
�[08:05:26] [W] [Thread-1] [testing-farm-request] Cancelling pipeline as requested
�[08:05:26] [W] Interrupted by SIGTERM�[0m
�[08:05:26] [W] Sending SIGTERM to child process 'tmt' (PID 102)
�[08:05:26] [W] [worker_0] [stderr] [RHEL-9.4.0-Nightly:x86_64:/plans/nay] [server-1]         guestname: 4f34a1f8-2443-455f-a5a8-70b41d72a1da
�[08:05:26] [+] ---^---^---^---^---^--- End of command output
�[08:05:26] [+] [worker_0] [test-schedule-tmt-multihost] tmt exited with code 0

http://artifacts.osci.redhat.com/testing-farm/90c29f96-66dd-43a6-b85b-f78634f624e2/pipeline.log
http://artifacts.osci.redhat.com/testing-farm/90c29f96-66dd-43a6-b85b-f78634f624e2/work-naysfc4zleb/log.txt

@thrix thrix self-assigned this Mar 4, 2024
@thrix thrix added this to the 1.33 milestone Mar 4, 2024
@thrix thrix modified the milestones: 1.33, 1.34 Apr 30, 2024
@thrix thrix modified the milestones: 1.34, 1.35 Jun 11, 2024
@thrix thrix modified the milestones: 1.35, 1.36 Jul 30, 2024
@thrix thrix modified the milestones: 1.36, 1.37 Aug 27, 2024
@thrix thrix modified the milestones: 1.37, 1.38 Oct 1, 2024
@thrix thrix modified the milestones: 1.38, 1.39 Oct 25, 2024
@thrix thrix modified the milestones: 1.39, 1.40 Nov 5, 2024
@bajertom bajertom modified the milestones: 1.40, 1.41 Dec 3, 2024
@happz happz removed this from the 1.41 milestone Jan 13, 2025
@thrix thrix assigned happz and unassigned thrix Jan 23, 2025
@thrix thrix added this to the 1.43 milestone Jan 23, 2025
@psss psss added the priority | must high priority, must be included in the next release label Jan 23, 2025
@happz
Copy link
Collaborator

happz commented Jan 24, 2025

Started poking it with a stick. So far it seems to me that:

  1. tmt needs explicit handling of SIGINT/SIGTERM (aka Ctrl+C) rather than ad hoc capture of the KeyboardInterrupt exception.
  2. We need an API that drivers could use to register cleanup actions to undo provisioning. This may be already in place with finish being called as the last step as long as it's enabled, but the provision step is not aware of guests until a plugin's go() returns, so finish has nothing to clean up. That we could change, and make provision collect guests from phases "live", ignoring None values, for example. Might have unexpected side effects.
  3. We must make sure there is no space for a race condition between a plugin submitting a request to its infrastructure, e.g. by sending the said request to Beaker API, and storing the received IDs in the guest's attributes for the rest of tmt to see - and, possibly, release when interrupted. This seems to require some signal-handling-related work, e.g. masking signals and postponing them until all info is saved correctly, but since tmt runs each provision plugin in its own thread, preventing signals may not be easy.

@psss
Copy link
Collaborator

psss commented Feb 6, 2025

  1. tmt needs explicit handling of SIGINT/SIGTERM (aka Ctrl+C) rather than ad hoc capture of the KeyboardInterrupt exception.

Ack, sounds good.

  1. We need an API that drivers could use to register cleanup actions to undo provisioning. This may be already in place with finish being called as the last step as long as it's enabled, but the provision step is not aware of guests until a plugin's go() returns, so finish has nothing to clean up. That we could change, and make provision collect guests from phases "live", ignoring None values, for example. Might have unexpected side effects.

So we already have the stop() method which takes care of the cleanup. I believe the main problem here is that when the execution is interrupted in the middle of the go() method there is not enough information stored for proper cleanup. If go() would properly store actions which have been already done, e.g. when a job has been already scheduled in Beaker store its id, then stop() would be able to correctly do the cleanup, e.g. cancel the running Beaker job.

  1. We must make sure there is no space for a race condition between a plugin submitting a request to its infrastructure, e.g. by sending the said request to Beaker API, and storing the received IDs in the guest's attributes for the rest of tmt to see - and, possibly, release when interrupted. This seems to require some signal-handling-related work, e.g. masking signals and postponing them until all info is saved correctly, but since tmt runs each provision plugin in its own thread, preventing signals may not be easy.

Good point, this part will have to be handled carefully.

@happz
Copy link
Collaborator

happz commented Feb 10, 2025

A brief update:

  • In Cooperative SIGINT/SIGTERM handling #3481, I tried to teach tmt about signals and how to handle them in a way that would work for our multithreaded reality and support uninterruptible moments.
  • In Provision step should store even guests not fully provisioned #3514, I tried to reshuffle the code so the provision step becomes aware of Guest instances much sooner, at the expense of them not being fully provisioned. It is the responsibility of Guest implementations to keep their stop() method ready to face incomplete provisioning, and I believe all of them already do.

Will it have some unexpected side effects? most likely. Waiting for tests to tell us more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority | must high priority, must be included in the next release
Projects
None yet
Development

No branches or pull requests

4 participants