Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disconnect platform when job is interrupted #1110

Merged
merged 2 commits into from
Dec 5, 2024
Merged

Conversation

stavros11
Copy link
Member

@stavros11 stavros11 commented Dec 1, 2024

Implements the suggestion of #1108 for 0.1, which is what was primarily causing the temperature issues. Note that it will only work when the Platform interface is used, not the instruments directly, but this should capture most cases in practice. It may be possible to lower it to the drivers. If you agree, I can also port it to 0.2.

I have tested and it seems that the handler is called successfully on qw11q when sbatch+scancel is used. It also works on dummy for both sbatch+scancel and srun+(Ctrl+C). However, srun+(Ctrl+C) does not seem to work on qw11q and the handler is skipped! I am not sure why.

I have also confirmed that the TWPA pump LO is turned off by the handler, which is not the case now (LO will remain on after job cancellation). I have not confirmed that QM jobs are properly cancelled, but I believe that

self.manager.close_all_quantum_machines()

should be taking care of this (will confirm soon).

@stavros11 stavros11 requested a review from alecandido December 1, 2024 14:12
Copy link

codecov bot commented Dec 1, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 69.88%. Comparing base (cc64ee1) to head (3025a0a).

Files with missing lines Patch % Lines
src/qibolab/platform/platform.py 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              0.1    #1110      +/-   ##
==========================================
- Coverage   69.88%   69.88%   -0.01%     
==========================================
  Files          64       64              
  Lines        6880     6886       +6     
==========================================
+ Hits         4808     4812       +4     
- Misses       2072     2074       +2     
Flag Coverage Δ
unittests 69.88% <66.66%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@alecandido alecandido left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing to amend for 0.1. Let's treat it as a quick fix, and start experimenting the approach.

Instead, for 0.2 I'd rather try to add a dedicated .stop() method, since I'm not sure whether disconnection operations will always be suitable to interrupt an ongoing job (they may also be intended to disconnect an idle device, or wait until the current execution/queue is terminated).
Moreover, I'm still considering delegating the handler installation to the user, possibly done during the platform creation (but maybe we could push even further) to avoid any library's side effect/global state.

Concerning the srun + CTRL-C, there is also the warning I added in the issue. That could be part of the issue (and good general reason to avoid srun usage which actual devices).
Maybe we should consider making it part of the docs. But this is also pretty irrelevant for 0.1.

@stavros11 stavros11 merged commit ac5157b into 0.1 Dec 5, 2024
32 of 33 checks passed
@stavros11 stavros11 deleted the cancel-disconnect branch December 5, 2024 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants