Hi,
This seems to be a longstanding issue in the old code, but I'm reporting it here as this is the version I'm using myself. In slurmdrmaa_job_update_status, a state of 32772/JOB_CANCELLED triggers an explicit setting of the exit status to -1, but this is immediately overwritten because execution falls through into the next case of the switch statement. It seems pretty obvious to me that the author intended to put a 'break' at the end of the code block. (A classic C bug!)
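To illustrate the fall-through for anyone skimming, here is a minimal standalone sketch. The enum values and function names are made up for illustration and are not the actual slurm-drmaa or SLURM definitions; it only assumes the general shape described above, where the cancelled case sets -1 and the following case assigns the status reported by the scheduler.

```c
/* Hypothetical stand-ins for the job states; not the real SLURM constants. */
enum demo_state { DEMO_CANCELLED, DEMO_COMPLETE };

/* Buggy shape: the CANCELLED case has no break, so the -1 is
 * overwritten by whatever status the scheduler reported. */
static int exit_status_fallthrough(enum demo_state state, int reported)
{
    int exit_status = 0;
    switch (state) {
    case DEMO_CANCELLED:
        exit_status = -1;
        /* falls through - the break is missing here */
    case DEMO_COMPLETE:
        exit_status = reported;
        break;
    }
    return exit_status;
}

/* Fixed shape: the break keeps the forced -1 for cancelled jobs. */
static int exit_status_with_break(enum demo_state state, int reported)
{
    int exit_status = 0;
    switch (state) {
    case DEMO_CANCELLED:
        exit_status = -1;
        break;
    case DEMO_COMPLETE:
        exit_status = reported;
        break;
    }
    return exit_status;
}
```

With a reported status of 0, the first function returns 0 for a cancelled job, while the second returns -1, which is the behaviour I believe was intended.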
Most of the time this doesn't matter, because SLURM gives the correct exit status. However, I've found that for jobs that have aborted due to overrunning a high memory limit on my cluster, the exit status gets reported as 0, and the caller then has no way to see that the job actually failed. Adding a small artificial wait and then re-querying the status fixes the problem, so I'm sure it's a race condition, and that forcing the status to -1 (or 15 or anything but 0) is reasonable behaviour to avoid it. Patch attached.
TIM
givemeabreak_patch.txt