Add tests for FractalSlurmSSHExecutor.shutdown #1695

Closed
tcompa opened this issue Jul 26, 2024 · 1 comment · Fixed by #1696
Labels: bug, slurm, ssh, testing

Comments


tcompa commented Jul 26, 2024

What we are especially interested in is whether shutdown may fail because it is triggered while some other SSH command is in progress.
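A test for this scenario could be sketched as follows. This is only an illustration under stated assumptions: `FakeSSH`, `shutdown_during_command` and the command names are hypothetical stand-ins rather than the fractal-server API, and the sketch assumes that SSH commands are serialized through a lock on a shared connection.

```python
import threading


class FakeSSH:
    """Hypothetical stand-in for a shared SSH connection (not the real API)."""

    def __init__(self):
        self._lock = threading.Lock()  # assumption: one command at a time
        self.log = []

    def run(self, name, started=None, release=None):
        with self._lock:
            self.log.append(f"start {name}")
            if started is not None:
                started.set()  # signal that the command is now in progress
            if release is not None:
                release.wait(timeout=5)  # keep the command "in progress"
            self.log.append(f"end {name}")


def shutdown_during_command():
    ssh = FakeSSH()
    started = threading.Event()
    release = threading.Event()
    errors = []

    def long_command():
        try:
            ssh.run("extract", started, release)
        except Exception as exc:
            errors.append(exc)

    def shutdown():
        # Assumption: shutdown needs to run its own command over the same connection.
        try:
            ssh.run("shutdown-command")
        except Exception as exc:
            errors.append(exc)

    worker = threading.Thread(target=long_command)
    worker.start()
    started.wait(timeout=5)  # "extract" now holds the connection
    closer = threading.Thread(target=shutdown)
    closer.start()  # shutdown is triggered while "extract" is in progress
    release.set()  # let "extract" finish
    worker.join(timeout=5)
    closer.join(timeout=5)
    return ssh.log, errors
```

In this toy model the shutdown command simply waits for the command in progress, so the test would check that no error is raised and that the two commands do not interleave.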

tcompa added the bug, slurm, ssh and testing labels Jul 26, 2024

tcompa commented Jul 26, 2024

We debugged the corresponding bug (with @mfranzon), and the cause is that shutdown was triggered before any job had been submitted. This bug affects both the sudo and the SSH SLURM executors, but it is much more relevant for the SSH one, because a much longer time passes between the job-submission API call and the actual job submission (in the executor and on SLURM). A shutdown during this window corrupts the executor by closing its auxiliary thread.


Here is the timeline of the observed bug

19:54:13.5 -> job-submit API call
19:54:16.5 -> end of SSH handshake
19:54:16.5 -> many small things
19:54:16.5 -> start remote extract archive
19:54:19.5 -> shutdown, closes waiting thread
19:54:24.9 -> start remote extract archive
19:54:24.9 -> sbatch is successful
19:56:12.x -> python task fails (no actions triggered)
19:56:12.x -> SLURM job is complete (no actions triggered)

The reason why SLURM job completion does not trigger any action is that the waiting thread is already closed.
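The failure mode can be reproduced with a toy model. This is a minimal sketch, not the actual executor: `ToyExecutor`, `register_job`, `handled` and `reproduce` are hypothetical names, and the polling loop only stands in for the real waiting thread.

```python
import threading


class ToyExecutor:
    """Toy stand-in for an executor with an auxiliary waiting thread."""

    def __init__(self):
        self._jobs = []  # job ids the waiting thread should watch
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self.handled = []  # job ids whose completion was processed
        self._waiter = threading.Thread(target=self._wait_loop, daemon=True)
        self._waiter.start()

    def _wait_loop(self):
        # Poll until asked to stop, handling any registered job.
        while not self._stop.wait(timeout=0.01):
            with self._lock:
                while self._jobs:
                    self.handled.append(self._jobs.pop())

    def shutdown(self):
        # Buggy behavior: only the waiting thread is closed.
        self._stop.set()
        self._waiter.join()

    def register_job(self, job_id):
        # Reached only at the end of a slow submission (handshake, extraction, sbatch).
        with self._lock:
            self._jobs.append(job_id)


def reproduce():
    executor = ToyExecutor()
    executor.shutdown()  # arrives while the submission is still in progress
    executor.register_job(1)  # sbatch succeeds afterwards
    return executor.handled, executor._waiter.is_alive()
```

Here the job is registered after the waiting thread has exited, so nothing ever processes its completion, which matches the "no actions triggered" entries in the timeline.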


Here is how the timeline should have looked

19:54:13.5 -> job-submit API call
19:54:16.5 -> end of SSH handshake
19:54:16.5 -> many small things
19:54:16.5 -> start remote extract archive
19:54:19.5 -> shutdown, closes waiting thread AND CLOSES EXECUTOR
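One way to obtain this behavior is to mark the whole executor as closed and to check that flag around the slow submission steps. The sketch below is an assumption about how this could look, not the fix merged in fractal-server: `ToyExecutor`, `ExecutorClosedError`, `slow_step` and `run_demo` are hypothetical names.

```python
import threading


class ExecutorClosedError(RuntimeError):
    """Hypothetical error raised when a closed executor is used."""


class ToyExecutor:
    def __init__(self):
        self._closed = threading.Event()
        self.submitted = []

    def shutdown(self):
        # Mark the whole executor as closed, not only its waiting thread.
        self._closed.set()

    def _check_open(self):
        if self._closed.is_set():
            raise ExecutorClosedError("executor was shut down")

    def submit(self, job_id, slow_step=lambda: None):
        self._check_open()  # refuse new work after shutdown
        slow_step()  # stands in for SSH handshake / archive extraction
        self._check_open()  # shutdown may have arrived during the slow step
        self.submitted.append(job_id)  # stands in for sbatch


def run_demo():
    executor = ToyExecutor()
    try:
        # Shutdown arrives in the middle of the submission.
        executor.submit(1, slow_step=executor.shutdown)
    except ExecutorClosedError:
        return executor.submitted, "aborted"
    return executor.submitted, "submitted"
```

With this check the submission stops before sbatch, so no SLURM job is left running without a thread waiting for it. A real implementation would likely also need a lock around the final check and the sbatch call, since a shutdown could still arrive between the two.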
