-
Notifications
You must be signed in to change notification settings - Fork 696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Locking in the store for concurrent builds. #3741
Comments
@dcoutts, do you think you can commit to implementing this for 2.0 (this release cycle)? |
@ezyang Possibly, though we should discuss priorities & schedule. |
From #3946, we now test if the directory in the store exists, but we don't atomically rename the directory in, so there is currently a race condition. |
The concurrency strategy proposed in #4400 is more or less in line with the description above, specifically strategy 1 where we may occasionally waste work. |
#4400 is merged, can this be closed? |
Yep! |
In the new-build code we use a shared nix-style store. Without proper locking this could be corrupted by concurrent builds. Even though the store is in principle pure and immutable, because ghc compilation is not actually deterministic we can end up with problems.
Here is an example situation we really must avoid:
Here cabal process 2 has overwritten A-39af3 with a version that may not have been ABI compatible with the previous instance (because ghc is not deterministic) and thus broken B-d241c (in the sense that using B will now get linker errors or worse).
So the first point is that it's never ok to overwrite an instance (as we cannot guarantee the replacement is ABI compatible). Slightly more subtly: it's not ok to build something against one instance and later link against another.
Consider the final step above for process 2, when it comes to install and register A-39af3 and it notices that someone else got there first, what should it do? Certainly it cannot overwrite A, but moreover: it must not install and register C either. It cannot install C because it depends on the instance of A that we also have to throw away.
As an aside: note that this strategy of building everything and then installing it into the store is exactly the kind of strategy that one would like to take when using a package store/cache that is shared across a network (e.g. for a cluster of build bots). In that case it's probably more convenient to build a bunch of packages and do one shared cache update at the end, and without coordinating between build bots. But also in that circumstance it would be perfectly ok to simply not add A and C to the shared cache. Updating a shared cache is always optional, and so if we occasionally have clashes and have to throw away some potential cache additions then it's no big deal. So this kind of shared cache update would fit nicely with HTTP PUT semantics: in package dep order add each one to the cache with a conditional PUT based on the etag indicating the resource previously did not exist. Then on any conditional PUT constraint failure, skip adding all dependent packages.
So back to the local case. In the local case we don't actually ever build several packages and then install & register into the store, we install and register each one as we go along (using the store package db as the db to pass to ghc to find the dependencies).
So there are now two reasonable strategies for avoiding problems:
In 2 of course we avoid repeating the work of building A-39af3. On the other hand it may be that 1 is simpler to implement.
Atomic primitives like create directory enable two process to race and for the looser to discover they lost. However this is not enough because installing and registering cannot be made fully atomic. In particular there is the need both to install files and to register. While installing files could be made atomic by using directory rename, the ghc-pkg registration happens with a separate set of files. So we need locking of some sort, and ideally locking that enables a process to block until the lock is released (to avoid polling). Both unix and windows support blocking file locks (though some FFI would be required as the Win32 package does not bind LockFileEx).
A solution that is simple for readers would be one where testing for a package is just a matter of checking if a directory in the store exists (rather than that plus checking some "still in progress" lock is not held). So ideally a locking strategy here would involve atomically renaming the entry into the store as the final step, so that readers can simply rely on checking that entry. Note that currently cabal new-build uses the ghc package db entry as the thing to check if packages are available, but it seems like it would be nicer to use the store entry, and then to load package db info later for those packages where we need it.
So, what remains to be done here is to work out in detail the locking & installing protocol.
The text was updated successfully, but these errors were encountered: