Null PID reference #290

Open
kevinkovalchik opened this issue May 23, 2024 · 2 comments

kevinkovalchik commented May 23, 2024

I believe this is related to #201.

I am running Bpipe 0.9.11 in Apptainer. @hh1985 in #201 was using Docker, so this may be related to containerization, though I really don't know.

I don't know if it is related, but I am running multiple instances of Bpipe concurrently. I have tried to isolate them by temporarily setting $HOME to a unique temporary directory for each instance (since use of $HOME is hardcoded into Bpipe in at least one place, if I recall correctly).

It seems like sometimes $BPIPE_PID ends up being null or an empty string. I don't know if this is just due to an IO error reading the temporary PID file or if there is another issue behind it. Every path that points to a file named $BPIPE_PID actually points to the parent directory, which would result in the error seen in #201 (along with whatever other issues come up with the PID being null).

Below is the head of a log file from a job that suffered from this issue. Note that the filename of the log is .bpipe/logs/.bpipe.log when it should be .bpipe/logs/$BPIPE_PID.bpipe.log, which indicates that the PID is null. This is also supported by the contents of the log:

bpipe.Runner	[1]	INFO	|11:08:23 Starting 
bpipe.Runner	[1]	INFO	|11:08:24 OS: Linux (5.15.0-75-generic) Java: 11.0.23 Vendor: Debian 
bpipe.Runner	[1]	INFO	|11:08:24 Initializing plugins ... 
bpipe.Config	[1]	INFO	|11:08:24 No plugins directory found: /output/.bpipe/plugins 
bpipe.Runner	[1]	INFO	|11:08:26 =================== GUID=b35ee14be2f543845b570c1fa5de6d85742cbe76 PID= () ==============

There is no PID in the log, and the whole job ends up failing.

When the failed job is rerun, it then (usually) gets a PID and proceeds as expected. Head of the log after restarting, .bpipe/logs/3648408.bpipe.log:

bpipe.Runner	[1]	INFO	|11:10:11 Starting 
bpipe.Runner	[1]	INFO	|11:10:11 OS: Linux (5.15.0-75-generic) Java: 11.0.23 Vendor: Debian 
bpipe.Runner	[1]	INFO	|11:10:11 Initializing plugins ... 
bpipe.Config	[1]	INFO	|11:10:11 No plugins directory found: /output/.bpipe/plugins 
bpipe.Runner	[1]	INFO	|11:10:11 =================== GUID=fa5fa6ae3cbc25e0c44b5d8850817a28d18900cf PID=3648408 (3648408) ============== 

This time there is a PID.

My solution thus far has been to retry each job several times.
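A retry wrapper along these lines might look like the sketch below. It is only an illustration of the workaround: `retry_job` is a hypothetical helper name, and `pipeline.groovy` in the usage comment is a placeholder.

```shell
#!/bin/sh
# Sketch of a retry wrapper. retry_job is a hypothetical helper name.
# Usage: retry_job MAX_ATTEMPTS command [args...]
retry_job() {
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the command; stop as soon as one attempt succeeds.
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt of $max_attempts failed" >&2
    attempt=$((attempt + 1))
  done
  # Every attempt failed.
  return 1
}

# Example (pipeline.groovy is a placeholder):
# retry_job 3 bpipe run pipeline.groovy
```

This only masks the problem, since a run that starts with an empty PID still fails before the next attempt begins.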

olliecheng (Contributor) commented Aug 12, 2024

Can confirm that I can reproduce this occasionally, on apptainer and bpipe v0.9.9.2.

I think you're on the right track about $BPIPE_PID being assigned an empty string, possibly as the result of a race condition. It seems that $BPIPE_PID is read from a file, .bpipe.'$LAUNCHER_PID'.run.pid, which is created in L715-L764 of the launcher shell script. Specifically, a background process writes its PID to .bpipe.$LAUNCHER_PID.run.pid on line 716:

printf $$ > .bpipe.'$LAUNCHER_PID'.run.pid

To communicate between the subprocess and the main bpipe script, the main process waits in a loop until the .bpipe.'$LAUNCHER_PID'.run.pid file exists, and then reads the (now populated, supposedly) subprocess PID. It assigns this value to $BPIPE_PID:

while [ ! -e .bpipe.$LAUNCHER_PID.run.pid ];
do
  if type usleep > /dev/null 2>&1 ;
  then
      usleep 100000
  else
      # this is just to waste time - sleep 1 is too long
      # and we have no usleep
      echo > /dev/null
  fi
done

BPIPE_PID=`cat .bpipe.$LAUNCHER_PID.run.pid`

The race condition arises when .bpipe.'$LAUNCHER_PID'.run.pid exists but has not yet been populated: the shell's > redirection creates the file before printf writes to it, so the -e test can succeed while the file is still empty. For some reason, this seems to be much more common when using apptainer (and perhaps Docker), possibly because of differences in how the filesystem is accessed. In theory, it should also be possible outside a container, but I haven't been able to reproduce it there in my limited testing.
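The window between "file exists" and "file has contents" can be made visible with a small standalone demonstration. This is a sketch, not the launcher's code: `demo.pid` is a scratch file, the value 12345 is arbitrary, and the `sleep 1` stands in for whatever delay occurs before the real PID is written.

```shell
#!/bin/sh
# Demonstration of the gap between file creation and file contents.
DEMO_DIR=$(mktemp -d)
DEMO_FILE="$DEMO_DIR/demo.pid"

# The redirection creates demo.pid as soon as the background subshell starts;
# the contents only arrive after the artificial one-second delay.
( sleep 1; printf '%s' 12345 ) > "$DEMO_FILE" &
WRITER=$!

# Same kind of existence test as the launcher's wait loop.
while [ ! -e "$DEMO_FILE" ]; do :; done

EARLY=$(cat "$DEMO_FILE")   # read as soon as the file exists
wait "$WRITER"
LATE=$(cat "$DEMO_FILE")    # read after the writer has finished

echo "early='$EARLY' late='$LATE'"
rm -rf "$DEMO_DIR"
```

The early read returns an empty string even though the existence check has already passed, which matches the empty `PID=` seen in the failing log.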

I've gotten around this by patching my older bpipe version (which has the same code and issue) and replacing the while loop with:

# Keep waiting while either:
#  - .bpipe.$LAUNCHER_PID.run.pid does not exist or is still empty, or
#  - its contents are not yet a string of one or more digits
# (note: [[ ... =~ ... ]] is not POSIX sh)
while [ ! -s .bpipe.$LAUNCHER_PID.run.pid ] || \
      ! [[ $(cat .bpipe.$LAUNCHER_PID.run.pid) =~ ^[0-9]+$ ]];
do
  if type usleep > /dev/null 2>&1 ;
  then
      usleep 100000
  else
      # this is just to waste time - sleep 1 is too long
      # and we have no usleep
      echo > /dev/null
  fi
done

I haven't tested this extensively, but I did test it briefly. When running 180 concurrent apptainer instances (on bpipe 0.9.9.2), the occurrence rate of this bug was 14/180 without the patch; with the patch, it was 0/180.
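The patched loop above waits indefinitely if the PID never arrives. A bounded variant of the same check could be sketched as a function. This is not the patch that was tested: `wait_for_pid_file` and its default of 100 tries are made up for illustration, a `case` pattern is swapped in for the `[[ ... =~ ... ]]` test, and it assumes a `sleep` that accepts fractional seconds.

```shell
#!/bin/sh
# Sketch only: wait_for_pid_file is a hypothetical helper, not part of Bpipe.
# Usage: wait_for_pid_file FILE [MAX_TRIES]
# Prints the PID and returns 0 once FILE holds only digits; returns 1 on timeout.
wait_for_pid_file() {
  file=$1
  max_tries=${2:-100}
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    if [ -s "$file" ]; then
      # Read once, then validate what was read.
      content=$(cat "$file")
      case $content in
        ''|*[!0-9]*) ;;   # empty or contains a non-digit: keep waiting
        *) printf '%s\n' "$content"; return 0 ;;
      esac
    fi
    tries=$((tries + 1))
    sleep 0.1   # assumes fractional seconds are supported
  done
  return 1
}

# Example, using the file name from the launcher script:
# BPIPE_PID=$(wait_for_pid_file ".bpipe.$LAUNCHER_PID.run.pid") || exit 1
```

Reading the file once and validating that value avoids acting on a different read than the one that was checked.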

Let me know if it works for you @kevinkovalchik!

ssadedin (Owner) commented

Thanks to both of you for looking into this! I'll take a look and merge in the change if there isn't a reason not to. I haven't done a lot of running Bpipe inside containers, but it makes a whole lot of sense that doing so could introduce latency that creates the issue here.
