Null PID reference #290
Comments
Can confirm that I can reproduce this occasionally, on Apptainer and bpipe v0.9.9.2. I think that you're on the right track. Bpipe uses

```
printf $$ > .bpipe.$LAUNCHER_PID.run.pid
```

to communicate the PID between the subprocess and the main script, which waits for it with:

```
while [ ! -e .bpipe.$LAUNCHER_PID.run.pid ];
do
  if type usleep > /dev/null 2>&1 ;
  then
    usleep 100000
  else
    # this is just to waste time - sleep 1 is too long
    # and we have no usleep
    echo > /dev/null
  fi
done
BPIPE_PID=`cat .bpipe.$LAUNCHER_PID.run.pid`
```

The race condition exists when the PID file has already been created but the PID has not yet been written to it: the `-e` test succeeds, and `cat` then reads an empty file.
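To make the window concrete, here is a self-contained sketch (not bpipe code; the file name and timings are illustrative) of why an existence-only test is not enough:

```
# The writer creates the PID file before the PID bytes land in it, so a
# reader that only checks for existence can observe an empty file.
pidfile=demo.run.pid
rm -f "$pidfile"

# Writer: create the file first, write the PID a moment later, mimicking
# the window that container startup latency can widen.
( : > "$pidfile"; sleep 0.2; printf '%s' "$$" > "$pidfile" ) &

# Reader using the original existence-only test:
while [ ! -e "$pidfile" ]; do :; done
echo "right after -e succeeds: '$(cat "$pidfile")'"   # usually empty

wait  # let the writer finish
echo "after the writer finishes: '$(cat "$pidfile")'" # the real PID
```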
# - whether the file .bpipe.$LAUNCHER_PID.run.pid either does not exist OR does exist and is not empty
# - and also validates that the PID is in fact a multi-digit string
while [ ! -s .bpipe.$LAUNCHER_PID.run.pid ] || \
! [[ $(cat .bpipe.$LAUNCHER_PID.run.pid) =~ ^[0-9]+$ ]];
do
if type usleep > /dev/null 2>&1 ;
then
usleep 100000
else
# this is just to waste time - sleep 1 is too long
# and we have no usleep
echo > /dev/null
fi
done I haven't tested this extensively, but I did test it briefly. When running 180 concurrent apptainer instances (on bpipe 0.9.9.2), the occurrence rate of this bug was 14/180 without the patch; with the patch, it was 0/180. Let me know if it works for you @kevinkovalchik! |
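If it helps with testing, the new check can be pulled out into a small standalone function (`pidfile_ready` is my name for it, not anything in bpipe):

```
# true only if the file exists, is non-empty, and holds digits only
pidfile_ready() {
  [ -s "$1" ] && [[ $(cat "$1") =~ ^[0-9]+$ ]]
}

: > demo.pid
pidfile_ready demo.pid || echo "empty file rejected"
printf 'garbage' > demo.pid
pidfile_ready demo.pid || echo "non-numeric content rejected"
printf '4242' > demo.pid
pidfile_ready demo.pid && echo "numeric PID accepted"
```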
Thanks to both of you for looking into this! I'll take a look and merge in the change if there isn't a reason not to. I haven't done a lot of running Bpipe inside containers, but it makes a whole lot of sense that containerization could introduce latency that creates the issue here.
I believe this is related to #201.
I am running Bpipe 0.9.11 in Apptainer. @hh1985 in #201 was using Docker, so this may be related to containerization, though I really don't know.
I don't know if it is related, but I am running multiple instances of Bpipe concurrently. I have tried to isolate them by temporarily setting `$HOME` to a unique temporary directory for each instance (since the use of `$HOME` is hardcoded into Bpipe in at least one place, if I recall correctly); a sketch of that workaround is below.
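Roughly, the isolation looks like this (a minimal sketch; `pipeline.groovy` and the sample file names are illustrative, not from my actual setup):

```
# Launch several concurrent Bpipe instances, each with its own throwaway
# $HOME, so they cannot collide on any state Bpipe keeps there.
for i in 1 2 3 4; do
  HOME=$(mktemp -d) bpipe run pipeline.groovy sample_${i}.fastq &
done
wait
```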
It seems like sometimes `$BPIPE_PID` ends up being null or an empty string. I don't know if this is just due to an IO error reading the temporary PID file or if there is another issue behind it. Every path that should point to a file named with `$BPIPE_PID` actually points to the parent directory, which would result in the error seen in #201 (along with whatever other issues come up with the PID being null).

The head of the log file of a job that suffered from this issue supports this: the filename of the log is `.bpipe/logs/.bpipe.log`, when it should be `.bpipe/logs/$BPIPE_PID.bpipe.log`, so the PID is null. The contents of the log agree: there is no PID in the log, and the whole job ends up failing.
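For what it's worth, the collapse is easy to see with plain parameter expansion (a trivial sketch, using the paths from above):

```
# with a null PID, every path built around $BPIPE_PID loses its filename
BPIPE_PID=""
echo ".bpipe/logs/${BPIPE_PID}.bpipe.log"   # -> .bpipe/logs/.bpipe.log
echo ".bpipe/${BPIPE_PID}"                  # -> .bpipe/ (the parent directory itself)
```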
When the failed job is rerun, it then (usually) gets a PID and proceeds as expected. After a restart the log file is `.bpipe/logs/3648408.bpipe.log`: this time there is a PID.
My solution thus far has been to retry each job several times.
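In case it is useful to anyone else, the retry wrapper is nothing more than this sketch (`pipeline.groovy`, the input name, and the retry count are illustrative):

```
# Rerun the job a few times; a failed attempt usually succeeds on retry
# because the restarted launcher picks up a PID correctly.
for attempt in 1 2 3; do
  bpipe run pipeline.groovy sample.fastq && break
  echo "attempt $attempt failed; retrying" >&2
done
```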