We debugged the corresponding bug (with @mfranzon), and the cause is that shutdown was triggered before any job had been submitted. This bug affects both the sudo and the SSH SLURM executors, but it is much more relevant for the SSH one, because a much longer time passes between the job-submission API call and the actual job submission (in the executor and on SLURM). During this time, a shutdown corrupts the executor by closing its auxiliary thread.
Here is the timeline of the observed bug:
19:54:13.5 -> job-submit API call
19:54:16.5 -> end of SSH handshake
19:54:16.5 -> many small things
19:54:16.5 -> start remote extract archive
19:54:19.5 -> shutdown, closes waiting thread
19:54:24.9 -> start remote extract archive
19:54:24.9 -> sbatch is successful
19:56:12.x -> python task fails (no actions triggered)
19:56:12.x -> SLURM job is complete (no actions triggered)
SLURM job completion does not trigger any action because the waiting thread has already been closed by the time the job finishes.
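The race can be illustrated with a minimal sketch. This is a toy model, not the actual executor code: `ToyExecutor`, its attributes, and `reproduce` are hypothetical names, and the `time.sleep` calls only stand in for the SSH handshake and remote archive extraction.

```python
import threading
import time


class ToyExecutor:
    """Hypothetical stand-in for the SLURM executor; not the real class."""

    def __init__(self):
        self.jobs = set()      # jobs submitted but not yet handled
        self.handled = []      # jobs whose completion was processed
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._waiter = threading.Thread(target=self._wait_loop, daemon=True)
        self._waiter.start()

    def _wait_loop(self):
        # Auxiliary thread: processes completed jobs until shutdown is requested.
        while not self._stop.is_set():
            with self._lock:
                done = list(self.jobs)
                self.jobs.clear()
            self.handled.extend(done)
            time.sleep(0.01)

    def submit(self, job_id, prep_time):
        # Slow preparation (assumed: handshake, archive extraction) before sbatch.
        time.sleep(prep_time)
        with self._lock:
            self.jobs.add(job_id)

    def shutdown(self):
        # Buggy behaviour: only the waiting thread is stopped,
        # so a submission still in preparation goes through afterwards.
        self._stop.set()
        self._waiter.join()


def reproduce():
    executor = ToyExecutor()
    submitter = threading.Thread(target=executor.submit, args=("job-1", 0.2))
    submitter.start()
    time.sleep(0.05)       # shutdown arrives while submission is still preparing
    executor.shutdown()
    submitter.join()
    time.sleep(0.05)       # nothing is left running to notice the job
    return executor.handled, executor.jobs
```

In this sketch the job ends up registered but never handled, which mirrors the "no actions triggered" entries in the timeline.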
Here is how the timeline should have looked:
19:54:13.5 -> job-submit API call
19:54:16.5 -> end of SSH handshake
19:54:16.5 -> many small things
19:54:16.5 -> start remote extract archive
19:54:19.5 -> shutdown, closes waiting thread AND CLOSES EXECUTOR
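One possible way to obtain this behaviour is sketched below, under the assumption that the executor can carry a "closed" flag which is checked right before the job is handed to SLURM. `ClosableExecutor`, `ExecutorClosedError`, and `run_scenario` are hypothetical names used only for illustration; the actual fix may look different.

```python
import threading
import time


class ExecutorClosedError(RuntimeError):
    """Hypothetical error raised when a submission reaches a closed executor."""


class ClosableExecutor:
    """Hypothetical executor whose shutdown also blocks pending submissions."""

    def __init__(self):
        self._closed = False
        self._lock = threading.Lock()
        self.submitted = []

    def submit(self, job_id, prep_time):
        time.sleep(prep_time)  # stands in for handshake / archive extraction
        with self._lock:
            # Check the flag and submit atomically with respect to shutdown().
            if self._closed:
                raise ExecutorClosedError(f"executor closed, {job_id} not submitted")
            self.submitted.append(job_id)  # stands in for the sbatch call

    def shutdown(self):
        with self._lock:
            self._closed = True


def run_scenario():
    executor = ClosableExecutor()
    outcome = []

    def worker():
        try:
            executor.submit("job-1", 0.2)
            outcome.append("submitted")
        except ExecutorClosedError:
            outcome.append("aborted")

    thread = threading.Thread(target=worker)
    thread.start()
    time.sleep(0.05)       # shutdown arrives during preparation
    executor.shutdown()
    thread.join()
    return outcome[0], executor.submitted
```

Here the submission is aborted instead of reaching `sbatch`, so no job is left running without a thread waiting for it.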
What we are especially interested in is whether shutdown may fail when it is triggered while some other SSH command is in progress.