Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Locking System Across Multiple Pavilion Instances #735

Merged
merged 68 commits into from
Jul 10, 2024
Merged
Show file tree
Hide file tree
Changes from 48 commits
Commits
Show all changes
68 commits
Select commit Hold shift + click to select a range
ffe4f00
add notes
Jan 10, 2024
1277263
start towards prototype solution
Jan 12, 2024
5f35168
track build hashes on MultiBuildTracker
Jan 17, 2024
353b2cc
restore lost changes
Jan 17, 2024
d9c8571
fix mistake
Jan 17, 2024
f44a8dd
implement hash->lock map
Jan 19, 2024
a4e3fa7
fix lingering misnamed function call
Jan 19, 2024
992d526
remove stray code
hwikle-lanl Jan 26, 2024
113b552
cache build hash on creation
hwikle-lanl Jan 26, 2024
f72d24b
attempt to resolve style test failure
hwikle-lanl Jan 26, 2024
e3e036c
second attempt at style test
hwikle-lanl Jan 26, 2024
ba686f2
add "fuzzy" lockfile implementation
hwikle-lanl Jan 26, 2024
f6d69f6
fix minor style issue
hwikle-lanl Jan 26, 2024
0af8ad1
reimplement timeout on build locks
hwikle-lanl Jan 26, 2024
7d96487
update build locking test
hwikle-lanl Jan 26, 2024
6acbf4e
ignore vscode workspace
hwikle-lanl Jan 29, 2024
cb33723
attempt to fix timeout
hwikle-lanl Jan 29, 2024
99d60c2
fix timed lock context management
hwikle-lanl Jan 29, 2024
fad32cf
fix style issues; tweak lock context
hwikle-lanl Jan 29, 2024
51d3003
fix mistake; more style fixes
hwikle-lanl Jan 29, 2024
3b8c3e4
fix changed method name
hwikle-lanl Jan 29, 2024
1487d4b
another style fix
hwikle-lanl Jan 29, 2024
8641ae8
even more style fixes
hwikle-lanl Jan 29, 2024
6a4716b
(attempt to) fix broken unit test
hwikle-lanl Jan 31, 2024
04abb19
fix mistake with protected class
hwikle-lanl Jan 31, 2024
f14423f
fix accidentally deleted import
hwikle-lanl Jan 31, 2024
2550e1b
restore removed import
hwikle-lanl Jan 31, 2024
d29b0d1
attempt to fix broken docstring
hwikle-lanl Jan 31, 2024
0623307
restore (janky) docstring to pass unit test
hwikle-lanl Jan 31, 2024
e9d0d42
fix unit test
hwikle-lanl Jan 31, 2024
9e85e90
fix unit test
hwikle-lanl Jan 31, 2024
d7f824f
fix unit test
hwikle-lanl Jan 31, 2024
709541f
fix unit test
hwikle-lanl Jan 31, 2024
6411448
Merge branch 'fuzzy-lock' into build-race-condition
hwikle-lanl Jan 31, 2024
a987d76
tweak status file check
hwikle-lanl Jan 31, 2024
afc4c18
style fixes
hwikle-lanl Jan 31, 2024
f2a9b9d
fix build lock unit test
hwikle-lanl Feb 2, 2024
9524805
remove fuzzy lock -- for inclusion later
hwikle-lanl Feb 5, 2024
08a0afa
Merge remote-tracking branch 'origin/build-race-condition' into build…
hwikle-lanl Feb 5, 2024
611c752
add sphinx doctring; minor clean up
hwikle-lanl Feb 5, 2024
36c15f1
edit docstrings
hwikle-lanl Feb 9, 2024
f3f7d94
integrate NFSLock into builder and mb_tracker
hwikle-lanl Feb 9, 2024
872a0b1
progress towards integrating NFSLock object
hwikle-lanl Feb 9, 2024
adeefe5
fix some failing tests
hwikle-lanl Feb 12, 2024
42bcb99
pass test_rebuild test
hwikle-lanl Feb 14, 2024
45d1178
Merge branch 'master' into fuzzy-lock
hwikle-lanl Feb 23, 2024
e685fc0
fix style issues
hwikle-lanl Feb 23, 2024
9752932
fix NameError
hwikle-lanl Feb 23, 2024
3483891
implement PR feedback
hwikle-lanl Mar 1, 2024
cf1a416
fix style issues
hwikle-lanl Mar 1, 2024
15290ce
remove lockfile directory once empty
hwikle-lanl Mar 1, 2024
f26f5a0
miscellaneous changes
hwikle-lanl Mar 6, 2024
b6ef151
remove use of lock timeout in builder
hwikle-lanl Mar 11, 2024
0257f21
implement code review feedback
hwikle-lanl Mar 11, 2024
a955a44
add tests for fuzzylock
hwikle-lanl Mar 11, 2024
e1519ff
return lockfile from __enter__ method
hwikle-lanl Mar 11, 2024
f2b822f
create tests for FuzzyLock
hwikle-lanl Mar 11, 2024
a156cea
pass fuzzylock mutual exclusion test
hwikle-lanl Mar 11, 2024
5dbb32e
fix FuzzyLock cleanup test
hwikle-lanl Mar 11, 2024
f944e3e
fix style issues
hwikle-lanl Mar 11, 2024
56621ab
remove debug print statements
hwikle-lanl Mar 13, 2024
3528ef6
more style fixes
hwikle-lanl Mar 13, 2024
d2b81b1
implement code review feedback
hwikle-lanl Mar 22, 2024
9313406
fix failing tests
hwikle-lanl Apr 1, 2024
84f634b
Implement PR feedback
hwikle-lanl Apr 12, 2024
56e236a
Merge branch 'develop' into fuzzy-lock
Paul-Ferrell Apr 18, 2024
a87302f
Merge branch 'develop' into fuzzy-lock
Paul-Ferrell Apr 24, 2024
4cc6f08
Merge branch 'develop' into fuzzy-lock
Paul-Ferrell Jul 10, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 2 additions & 22 deletions lib/pavilion/build_tracker.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
from contextlib import contextmanager

from pavilion.status_file import STATES
from .lockfile import NFSLock


class MultiBuildTracker:
Expand All @@ -25,14 +26,13 @@ def __init__(self):
self.lock = threading.Lock()
self._build_locks = {} # type: Dict[str, threading.Lock]

def register(self, test) -> "BuildTracker":
def register(self, test: 'TestRun') -> 'BuildTracker':
"""Register a builder, and get your own build tracker.

:param test: The TestRun object to track.
:return: A build tracker instance that can be used by builds directly."""

tracker = BuildTracker(test, self)
hash = test.builder.build_hash

with self.lock:
# Test may actually be a TestRun object rather than a TestBuilder object,
Expand All @@ -42,28 +42,8 @@ def register(self, test) -> "BuildTracker":
self.messages[test.builder] = []
self.trackers[test.builder] = tracker

if hash not in self._build_locks:
self._build_locks[hash] = threading.Lock()

return tracker

@contextmanager
def make_lock_context(self, hash: str, timeout: float = -1) -> ContextManager[bool]:
"""Return a context manager to manage the build-specific lock.

:param str hash: The hash identifying the specific build.
:return: A context manager to manage the (optionally) timed lock
associated with the build."""

lock = self._build_locks[hash]

try:
result = lock.acquire(timeout=timeout)
yield result
finally:
if result:
lock.release()

def update(self, builder, note, state=None):
"""Add a message for the given builder without changes the status.

Expand Down
5 changes: 4 additions & 1 deletion lib/pavilion/builder.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@
import pavilion.errors
from pavilion import extract, lockfile, utils, wget, create_files
from pavilion.build_tracker import BuildTracker
from pavilion.lockfile import NFSLock
from pavilion.errors import TestBuilderError, TestConfigError
from pavilion.status_file import TestStatusFile, STATES
from pavilion.test_config import parse_timeout
Expand Down Expand Up @@ -408,7 +409,9 @@ def build(self, test_id: str, tracker: BuildTracker,
state=STATES.BUILD_WAIT,
note="Waiting on lock for build {}.".format(self.name))

timed_lock = mb_tracker.make_lock_context(self.build_hash)
# Pass a timeout here to prevent starvation
timed_lock = NFSLock(self.path.parent, self.name, timeout=5)

with timed_lock as acquired:
# Make sure the build wasn't created while we waited for
# the lock.
Expand Down
75 changes: 73 additions & 2 deletions lib/pavilion/lockfile.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@
from pathlib import Path
from typing import Union, TextIO
import threading
from uuid import uuid4
from time import sleep

from pavilion import output
from pavilion import utils
Expand All @@ -19,7 +21,7 @@


class LockFile:
"""An NFS friendly way to create a lock file. Locks contain information
"""A custom lock object that is not NFS-friendly. Locks contain information
on what host and user created the lock, and have a built in expiration
date. To be used in a 'with' context.

Expand Down Expand Up @@ -295,7 +297,6 @@ def _warn(self, msg):
output.fprint(self._errfile, msg, color=output.YELLOW)



class LockFilePoker:
"""This context creates a thread that regularly 'pokes' a lockfile to make sure it
doesn't expire."""
Expand Down Expand Up @@ -328,3 +329,73 @@ def __exit__(self, exc_type, exc_val, exc_tb):

self._done_event.set()
self._thread.join()


class NFSLock:
hwikle-lanl marked this conversation as resolved.
Show resolved Hide resolved
"""A custom lock object designed for use on NFS systems. The lock behaves like a queue:
each process declares its intent to take the lock, and the lock goes to whichever process
declared first. The object lessens, but does not fully resolve, some of the concurrency issues
related to NFS, particularly in the case where multiple Pavilion instances are working with
the same build. Intended to be invoked as a context manager, using the 'with' keyword.
"""
def __init__(self, lock_dir: Path, build_name: str,
hwikle-lanl marked this conversation as resolved.
Show resolved Hide resolved
wait_time: float = 0.5, timeout: float = -1):
"""
:param lock_dir: directory in which lockfiles will be created
:param build_name: full name of the build to which the lock controls access
:param wait_time: time to wait between checking status
:param timeout: time (in seconds) after which the lock times out.
A value of -1 indicates no timeout
"""
self._lock_dir = lock_dir
hwikle-lanl marked this conversation as resolved.
Show resolved Hide resolved
self.build_name = build_name
self._wait_time = wait_time
self._timeout = timeout

# create a globally unique lockfile
self._lockfile = lock_dir / f"{build_name}-{uuid4()}.lock"

def set_timeout(self, timeout: float) -> None:
"""Set the timeout for the lock object.

:param timeout: the length of the timeout, in seconds
"""
self._timeout = timeout
hwikle-lanl marked this conversation as resolved.
Show resolved Hide resolved

def __enter__(self) -> bool:
"""
Attempt to acquire the lock.

:return: True if lock has been successfully acquired; False if timed out
"""
# Declare intent to take lock
self._lockfile.touch()

first = False

start = time.time()

while not first:
# Give other instances opportunity to declare intent
sleep(self._wait_time)

if self._timeout >= 0 and (time.time() - start > self._timeout):
return False
hwikle-lanl marked this conversation as resolved.
Show resolved Hide resolved

first = self._get_earliest() == self._lockfile

return True

def __exit__(self, exc_type, exc_val, exc_tb) -> None:
# Remove the lockfile, since no longer competing for lock
self._lockfile.unlink()

def _get_earliest(self) -> Path:
"""Get the path to the lockfile that was created first.

:return: Path object to whichever lockfile was created first"""

lockfiles = self._lock_dir.glob(f'{self.build_name}-*.lock')
hwikle-lanl marked this conversation as resolved.
Show resolved Hide resolved

# Sort files by creation time, and return oldest
return sorted(lockfiles, key=lambda x: x.stat().st_ctime)[0]
hwikle-lanl marked this conversation as resolved.
Show resolved Hide resolved
Loading