
Lock folder not cleaned up after (Slurm) job is killed #3280

Open
Flamefire opened this issue Apr 15, 2020 · 7 comments

@Flamefire (Contributor)

The following happened:

  • Create an interactive job with Slurm
  • Connect to the node
  • Start a build with EasyBuild
  • The session gets killed (connection closed in this case, but it can also happen due to a timeout, i.e. the Slurm time limit being exceeded)

Afterwards the lock was still there, so the next build failed too, and the lock had to be removed manually.

@boegel @mboisson

@mboisson (Contributor) commented Apr 15, 2020

Copied from a discussion with Todd Gamblin on Slack:

we use fcntl locks
they are released when the process dies
and we use one lockfile for everything — no mess.
the locks are “bytes” in the file
but the file’s zero-length
it’s one empty lockfile, we take a 63-bit prefix of the SHA-1 of the DAG hash and lock that byte for each build
the advantage is that a) fcntl locks are supported on most filesystems (NFS3 with lock server, NFS4, GPFS, Lustre) and b) they’re released when the process dies
which prevents the annoying SVN locked-forever-because-nfs-is-busy issue

if you want to steal ours, here you go: https://github.com/spack/spack/blob/develop/lib/spack/llnl/util/lock.py
tests are here: https://github.com/spack/spack/blob/develop/lib/spack/spack/test/llnl/util/lock.py
I think you will need our little barrier implementation if you want to test on python 2.6: https://github.com/spack/spack/blob/develop/lib/spack/llnl/util/multiproc.py — or you can just run MPI tests.
the only thing beyond that lock you need is the little trick we use to lock prefixes with one lock file: https://github.com/spack/spack/blob/develop/lib/spack/spack/database.py#L489
spack lets you install “anything” so every installation has a hash. so we map a hash to a byte in a lockfile and lock that byte (fcntl and that lock class I linked support byte-range locking)
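
For illustration, a minimal Python sketch of that byte-range trick, assuming a made-up lockfile path and helper names (this is not Spack's or EasyBuild's actual code):

```python
import fcntl
import hashlib
import os

# Illustrative lockfile path; not a real EasyBuild or Spack location.
LOCKFILE = "/tmp/eb.lock"

def byte_offset(build_hash):
    # Map the build's hash to a 63-bit offset, as in the Slack quote:
    # take a 63-bit prefix of the SHA-1 of the hash.
    digest = hashlib.sha1(build_hash.encode()).digest()
    return int.from_bytes(digest[:8], "big") >> 1

def acquire(build_hash):
    # The lockfile stays zero-length: we lock a single byte at a large
    # offset, which POSIX allows even beyond end-of-file.
    fd = os.open(LOCKFILE, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.lockf(fd, fcntl.LOCK_EX, 1, byte_offset(build_hash), os.SEEK_SET)
    return fd

def release(fd):
    # Closing the descriptor drops the lock; the kernel also releases it
    # automatically if the process dies, which is the point of the approach.
    os.close(fd)
```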

boegel added this to the next release (4.2.1?) milestone Apr 18, 2020
@boegel (Member) commented Apr 18, 2020

The problem with fcntl locks is that not all filesystems support them, though, so they're not a perfect solution either. If we can figure out a way to detect whether fcntl locks can be used, that seems like the better solution, but I'm not sure that can be done reliably...
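
A rough sketch of how such a check could look, assuming detection-by-probing is acceptable: take and immediately release a non-blocking fcntl lock on a throwaway file in the target directory. The probe-file prefix is made up, and on a misbehaving NFS mount the call may hang instead of failing, which is exactly why this isn't fully reliable:

```python
import fcntl
import os
import tempfile

def fcntl_locks_usable(path):
    # Best-effort probe: try to take (and immediately drop) a non-blocking
    # fcntl lock on a temporary file in `path`.
    try:
        fd, probe = tempfile.mkstemp(dir=path, prefix=".fcntl_probe_")
    except OSError:
        return False
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
        os.unlink(probe)
```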

I've also been bitten by the problem that @Flamefire reported; it's certainly annoying.

How about installing a signal handler that cleans up the locks that were created in that EasyBuild session in case a SIGTERM (and possibly other signals) is received?
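
A rough sketch of that idea, where clean_up_locks is a hypothetical callback that knows which locks the current session created (not an existing EasyBuild function):

```python
import signal
import sys

def install_signal_handlers(clean_up_locks):
    def handler(signum, _frame):
        # Remove this session's locks, then exit with the conventional
        # "killed by signal" exit code.
        clean_up_locks()
        sys.exit(128 + signum)

    # SIGKILL cannot be caught, so a hard kill would still leave locks behind.
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP, signal.SIGQUIT):
        signal.signal(sig, handler)
```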

@mboisson (Contributor) commented Apr 18, 2020 via email

@boegel (Member) commented Apr 18, 2020

I'm working on the part that automatically cleans up locks after receiving a signal, which is relatively easy to implement.

We can definitely also implement support for using fcntl-style locks, and let the user configure it. But we should stick with a locking mechanism that works anywhere by default imho (the one we have now), even though it has known downsides...

@boegel (Member) commented Apr 19, 2020

With the changes in #3291, locks are cleaned up if the EasyBuild session gets a SIGTERM signal (+ a couple of other signals).

That doesn't seem to help in the context of Slurm jobs that get cancelled or run into a timeout, though...
Although the Slurm documentation claims that this involves sending a SIGTERM to the job steps (incl. the job script), in practice that doesn't seem to be what happens. :-/

@Flamefire: Are you up for testing the changes in #3291 in the context of Slurm jobs?

@boegel (Member) commented Apr 19, 2020

The only way I could trigger the signal handler in eb for EasyBuild running in a Slurm job was to cancel the job with scancel -f --signal=TERM <jobid> (a regular scancel <jobid> apparently doesn't trigger a SIGTERM at all...).

@boegel (Member) commented May 1, 2020

This shouldn't be closed yet, since #3291 doesn't actually fix the problem...
