
Lock folder not cleaned up after (Slurm) job is killed #3280

Open
Flamefire opened this issue Apr 15, 2020 · 7 comments

@Flamefire (Contributor)

The following happened:

  • Create an interactive job with Slurm
  • Connect to the node
  • Start a build with EasyBuild
  • The session gets killed (connection closed in this case, but it can also happen due to a timeout, i.e. the Slurm time limit being exceeded)

Afterwards the lock was still there, so the next build failed too, and the lock had to be removed manually.

@boegel @mboisson

@mboisson (Contributor) commented Apr 15, 2020

Copied from a discussion with Todd Gamblin on Slack:

we use fcntl locks
they are released when the process dies
and we use one lockfile for everything — no mess.
the locks are “bytes” in the file
but the file’s zero-length
it’s one empty lockfile, we take a 63-bit prefix of the SHA-1 of the DAG hash and lock that byte for each build
the advantage is that a) fcntl locks are supported on most filesystems (NFS3 with lock server, NFS4, GPFS, Lustre) and b) they’re released when the process dies
which prevents the annoying SVN locked-forever-because-nfs-is-busy issue

if you want to steal ours, here you go: https://github.com/spack/spack/blob/develop/lib/spack/llnl/util/lock.py
tests are here: https://github.com/spack/spack/blob/develop/lib/spack/spack/test/llnl/util/lock.py
I think you will need our little barrier implementation if you want to test on python 2.6: https://github.com/spack/spack/blob/develop/lib/spack/llnl/util/multiproc.py — or you can just run MPI tests.
the only thing beyond that lock you need is the little trick we use to lock prefixes with one lock file: https://github.com/spack/spack/blob/develop/lib/spack/spack/database.py#L489
spack lets you install “anything” so every installation has a hash. so we map a hash to a byte in a lockfile and lock that byte (fcntl and that lock class I linked support byte-range locking)
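
For illustration, a minimal Python sketch of that byte-range trick, assuming a made-up lockfile path and helper names (this is not Spack's or EasyBuild's actual code):

```python
import fcntl
import hashlib
import os

# Illustrative lockfile path; not a real EasyBuild or Spack location.
LOCKFILE = "/tmp/eb.lock"

def byte_offset(build_hash):
    # Map the build's hash to a 63-bit offset, as in the Slack quote:
    # take a 63-bit prefix of the SHA-1 of the hash.
    digest = hashlib.sha1(build_hash.encode()).digest()
    return int.from_bytes(digest[:8], "big") >> 1

def acquire(build_hash):
    # The lockfile stays zero-length: we lock a single byte at a large
    # offset, which POSIX allows even beyond end-of-file.
    fd = os.open(LOCKFILE, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.lockf(fd, fcntl.LOCK_EX, 1, byte_offset(build_hash), os.SEEK_SET)
    return fd

def release(fd):
    # Closing the descriptor drops the lock; the kernel also releases it
    # automatically if the process dies, which is the point of the approach.
    os.close(fd)
```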

boegel added this to the next release (4.2.1?) milestone Apr 18, 2020
@boegel (Member) commented Apr 18, 2020

The problem with fcntl locks is that not all filesystems support them, though, so they're not a perfect solution either. If we can figure out a way to detect whether fcntl locks can be used, that seems like the better solution, but I'm not sure that can be done reliably...
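
A rough sketch of how such a check could look, assuming detection-by-probing is acceptable: take and immediately release a non-blocking fcntl lock on a throwaway file in the target directory. The probe-file prefix is made up, and on a misbehaving NFS mount the call may hang instead of failing, which is exactly why this isn't fully reliable:

```python
import fcntl
import os
import tempfile

def fcntl_locks_usable(path):
    # Best-effort probe: try to take (and immediately drop) a non-blocking
    # fcntl lock on a temporary file in `path`.
    try:
        fd, probe = tempfile.mkstemp(dir=path, prefix=".fcntl_probe_")
    except OSError:
        return False
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
        os.unlink(probe)
```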

I've also been bitten by the problem that @Flamefire reported; it's certainly annoying.

How about installing a signal handler that cleans up the locks that were created in that EasyBuild session in case a SIGTERM (and possibly other signals) is received?
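
A rough sketch of that idea, where clean_up_locks is a hypothetical callback that knows which locks the current session created (not an existing EasyBuild function):

```python
import signal
import sys

def install_signal_handlers(clean_up_locks):
    def handler(signum, _frame):
        # Remove this session's locks, then exit with the conventional
        # "killed by signal" exit code.
        clean_up_locks()
        sys.exit(128 + signum)

    # SIGKILL cannot be caught, so a hard kill would still leave locks behind.
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP, signal.SIGQUIT):
        signal.signal(sig, handler)
```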

@mboisson (Contributor) commented Apr 18, 2020 via email

@boegel (Member) commented Apr 18, 2020

I'm working on the part that automatically cleans up locks after receiving a signal, which is relatively easy to implement.

We can definitely also implement support for using fcntl-style locks, and let the user configure it. But we should stick with a locking mechanism that works anywhere by default imho (the one we have now), even though it has known downsides...

@boegel (Member) commented Apr 19, 2020

With the changes in #3291, locks are cleaned up if the EasyBuild session gets a SIGTERM signal (+ a couple of other signals).

That doesn't seem to help in the context of Slurm jobs that get cancelled or run into a timeout, though...
Although the Slurm documentation claims that this involves sending a SIGTERM to the job steps (incl. the job script), in practice that doesn't seem to be what happens. :-/

@Flamefire: Are you up for testing the changes in #3291 in the context of Slurm jobs?

@boegel (Member) commented Apr 19, 2020

The only way I could trigger the signal handler in eb for EasyBuild running in a Slurm job was to cancel the job with scancel -f --signal=TERM <jobid> (a regular scancel <jobid> apparently doesn't trigger a SIGTERM at all...).

@boegel (Member) commented May 1, 2020

This shouldn't be closed yet, since #3291 doesn't actually fix the problem...
