Locking in the store for concurrent builds. #3741

dcoutts · 2016-08-30T17:15:29Z

In the new-build code we use a shared nix-style store. Without proper locking this could be corrupted by concurrent builds. Even though the store is in principle pure and immutable, because ghc compilation is not actually deterministic we can end up with problems.

Here is an example situation we really must avoid:

cabal process 1 decides to install A-39af3 and B-d241c into the store, where B depends on A
cabal process 2 decides to install A-39af3 and C-29ef2 into the store, where C depends on A
cabal process 1 builds both A-39af3 and B-d241c
cabal process 2 builds both A-39af3 and C-29ef2
cabal process 1 installs and registers into the store A-39af3 and B-d241c
cabal process 2 installs and registers into the store A-39af3 and C-29ef2

Here cabal process 2 has overwritten A-39af3 with a version that may not have been ABI compatible with the previous instance (because ghc is not deterministic) and thus broken B-d241c (in the sense that using B will now get linker errors or worse).

So the first point is that it's never ok to overwrite an instance (as we cannot guarantee the replacement is ABI compatible). Slightly more subtly: it's not ok to build something against one instance and later link against another.

Consider the final step above for process 2, when it comes to install and register A-39af3 and it notices that someone else got there first, what should it do? Certainly it cannot overwrite A, but moreover: it must not install and register C either. It cannot install C because it depends on the instance of A that we also have to throw away.

As an aside: note that this strategy of building everything and then installing it into the store is exactly the kind of strategy that one would like to take when using a package store/cache that is shared across a network (e.g. for a cluster of build bots). In that case it's probably more convenient to build a bunch of packages and do one shared cache update at the end, and without coordinating between build bots. But also in that circumstance it would be perfectly ok to simply not add A and C to the shared cache. Updating a shared cache is always optional, and so if we occasionally have clashes and have to throw away some potential cache additions then it's no big deal. So this kind of shared cache update would fit nicely with HTTP PUT semantics: in package dep order add each one to the cache with a conditional PUT based on the etag indicating the resource previously did not exist. Then on any conditional PUT constraint failure, skip adding all dependent packages.

So back to the local case. In the local case we don't actually ever build several packages and then install & register into the store, we install and register each one as we go along (using the store package db as the db to pass to ghc to find the dependencies).

So there are now two reasonable strategies for avoiding problems:

Upon trying to install and register A-39af3, if we find we've been beaten to it by another process then throw our instance away and continue building the other packages but now using the A-39af3 that has now appeared in the store. The only coordination required is that we need to know when the other process has finished installing and registering its instance of A-39af3.
At the point when we begin to build A-39af3, take out a lock on it. If the lock is currently held then it means another process beat us to it and we should wait until they have finished, at which point they will have installed and registered the package and we can continue building the other packages but now using the A-39af3 that has now appeared in the store.

In 2 of course we avoid repeating the work of building A-39af3. On the other hand it may be that 1 is simpler to implement.

Atomic primitives like create directory enable two process to race and for the looser to discover they lost. However this is not enough because installing and registering cannot be made fully atomic. In particular there is the need both to install files and to register. While installing files could be made atomic by using directory rename, the ghc-pkg registration happens with a separate set of files. So we need locking of some sort, and ideally locking that enables a process to block until the lock is released (to avoid polling). Both unix and windows support blocking file locks (though some FFI would be required as the Win32 package does not bind LockFileEx).

A solution that is simple for readers would be one where testing for a package is just a matter of checking if a directory in the store exists (rather than that plus checking some "still in progress" lock is not held). So ideally a locking strategy here would involve atomically renaming the entry into the store as the final step, so that readers can simply rely on checking that entry. Note that currently cabal new-build uses the ghc package db entry as the thing to check if packages are available, but it seems like it would be nicer to use the store entry, and then to load package db info later for those packages where we need it.

So, what remains to be done here is to work out in detail the locking & installing protocol.

ezyang · 2016-09-06T22:42:56Z

@dcoutts, do you think you can commit to implementing this for 2.0 (this release cycle)?

dcoutts · 2016-09-08T10:53:28Z

@ezyang Possibly, though we should discuss priorities & schedule.

ezyang · 2016-10-12T21:50:10Z

From #3946, we now test if the directory in the store exists, but we don't atomically rename the directory in, so there is currently a race condition.

dcoutts · 2017-03-15T11:27:27Z

The concurrency strategy proposed in #4400 is more or less in line with the description above, specifically strategy 1 where we may occasionally waste work.

23Skidoo · 2017-06-08T19:53:08Z

#4400 is merged, can this be closed?

ezyang · 2017-06-09T01:21:32Z

Yep!

dcoutts mentioned this issue Aug 30, 2016

Tracking ticket for Galois new-build requirements #3577

Open

20 tasks

23Skidoo added type: enhancement cabal-install: nix-local-build labels Aug 30, 2016

ezyang mentioned this issue Sep 2, 2016

ghc-pkg fails sometimes after removing dist-newstyle folder #3764

Closed

ezyang added this to the 2.0 milestone Sep 8, 2016

ezyang assigned dcoutts Sep 8, 2016

dcoutts mentioned this issue Oct 12, 2016

new-build copy/register race condition #3946

Closed

23Skidoo modified the milestones: 2.2, 2.0 Feb 17, 2017

dcoutts mentioned this issue Mar 15, 2017

WIP: concurrent store updates #4400

Merged

ezyang closed this as completed Jun 9, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Locking in the store for concurrent builds. #3741

Locking in the store for concurrent builds. #3741

dcoutts commented Aug 30, 2016

ezyang commented Sep 6, 2016

dcoutts commented Sep 8, 2016

ezyang commented Oct 12, 2016

dcoutts commented Mar 15, 2017

23Skidoo commented Jun 8, 2017

ezyang commented Jun 9, 2017

Locking in the store for concurrent builds. #3741

Locking in the store for concurrent builds. #3741

Comments

dcoutts commented Aug 30, 2016

ezyang commented Sep 6, 2016

dcoutts commented Sep 8, 2016

ezyang commented Oct 12, 2016

dcoutts commented Mar 15, 2017

23Skidoo commented Jun 8, 2017

ezyang commented Jun 9, 2017