Race condition in job.c/slurmdrmaa_job_update_status #3

tbooth · 2016-11-04T15:41:27Z

Hi,

This seems to be a longstanding issue in the old code, but I'm reporting it here because this is the version I use myself. In slurmdrmaa_job_update_status, a state of 32772/JOB_CANCELLED explicitly sets the exit status to -1, but that value is immediately overwritten because execution falls through into the next case of the switch statement. It seems pretty obvious to me that the author intended to put a 'break' at the end of the code block. (A classic C bug!)
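For anyone who wants to see the fall-through in isolation, here is a minimal sketch. It is not the actual job.c code: the enum, the function names and the `reported_status` parameter are all hypothetical stand-ins, assuming only that the cancelled case sets -1 and the following case assigns whatever status SLURM reported.

```c
/* Hypothetical stand-ins; these are NOT the real slurm-drmaa or SLURM definitions. */
enum demo_state { DEMO_CANCELLED, DEMO_COMPLETE };

/* Buggy shape: the cancelled case has no break, so the -1 is overwritten. */
int exit_status_buggy(enum demo_state state, int reported_status)
{
    int exit_status = 0;

    switch (state) {
    case DEMO_CANCELLED:
        exit_status = -1;
        /* falls through - no break, so the next case replaces the -1 */
    case DEMO_COMPLETE:
        exit_status = reported_status;
        break;
    }
    return exit_status;
}

/* Fixed shape: the break keeps the forced -1 for cancelled jobs. */
int exit_status_fixed(enum demo_state state, int reported_status)
{
    int exit_status = 0;

    switch (state) {
    case DEMO_CANCELLED:
        exit_status = -1;
        break;
    case DEMO_COMPLETE:
        exit_status = reported_status;
        break;
    }
    return exit_status;
}
```

With a cancelled job whose reported status is 0, the buggy version returns 0 and the fixed version returns -1, which is the difference the caller needs in order to notice the failure.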

Most of the time this doesn't matter, because SLURM gives the correct exit status. However, I've found that for jobs on my cluster that were aborted for overrunning a high memory limit, the exit status is reported as 0, and the caller then has no way to see that the job actually failed. Adding a small artificial wait and then re-querying the status fixes the problem, so I'm sure it's a race condition, and forcing the status to -1 (or 15, or anything but 0) is reasonable behaviour to avoid it. Patch attached.

TIM

givemeabreak_patch.txt

natefoo · 2017-07-12T13:51:23Z

Thanks! I'll include this over at natefoo/slurm-drmaa and pass it on upstream.

natefoo closed this as completed in natefoo/slurm-drmaa@fee7e6c Jul 12, 2017
