Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locking in the store for concurrent builds. #3741

Closed
dcoutts opened this issue Aug 30, 2016 · 6 comments
Closed

Locking in the store for concurrent builds. #3741

dcoutts opened this issue Aug 30, 2016 · 6 comments

Comments

@dcoutts
Copy link
Contributor

dcoutts commented Aug 30, 2016

In the new-build code we use a shared nix-style store. Without proper locking this could be corrupted by concurrent builds. Even though the store is in principle pure and immutable, because ghc compilation is not actually deterministic we can end up with problems.

Here is an example situation we really must avoid:

  • cabal process 1 decides to install A-39af3 and B-d241c into the store, where B depends on A
  • cabal process 2 decides to install A-39af3 and C-29ef2 into the store, where C depends on A
  • cabal process 1 builds both A-39af3 and B-d241c
  • cabal process 2 builds both A-39af3 and C-29ef2
  • cabal process 1 installs and registers into the store A-39af3 and B-d241c
  • cabal process 2 installs and registers into the store A-39af3 and C-29ef2

Here cabal process 2 has overwritten A-39af3 with a version that may not have been ABI compatible with the previous instance (because ghc is not deterministic) and thus broken B-d241c (in the sense that using B will now get linker errors or worse).

So the first point is that it's never ok to overwrite an instance (as we cannot guarantee the replacement is ABI compatible). Slightly more subtly: it's not ok to build something against one instance and later link against another.

Consider the final step above for process 2, when it comes to install and register A-39af3 and it notices that someone else got there first, what should it do? Certainly it cannot overwrite A, but moreover: it must not install and register C either. It cannot install C because it depends on the instance of A that we also have to throw away.

As an aside: note that this strategy of building everything and then installing it into the store is exactly the kind of strategy that one would like to take when using a package store/cache that is shared across a network (e.g. for a cluster of build bots). In that case it's probably more convenient to build a bunch of packages and do one shared cache update at the end, and without coordinating between build bots. But also in that circumstance it would be perfectly ok to simply not add A and C to the shared cache. Updating a shared cache is always optional, and so if we occasionally have clashes and have to throw away some potential cache additions then it's no big deal. So this kind of shared cache update would fit nicely with HTTP PUT semantics: in package dep order add each one to the cache with a conditional PUT based on the etag indicating the resource previously did not exist. Then on any conditional PUT constraint failure, skip adding all dependent packages.

So back to the local case. In the local case we don't actually ever build several packages and then install & register into the store, we install and register each one as we go along (using the store package db as the db to pass to ghc to find the dependencies).

So there are now two reasonable strategies for avoiding problems:

  1. Upon trying to install and register A-39af3, if we find we've been beaten to it by another process then throw our instance away and continue building the other packages but now using the A-39af3 that has now appeared in the store. The only coordination required is that we need to know when the other process has finished installing and registering its instance of A-39af3.
  2. At the point when we begin to build A-39af3, take out a lock on it. If the lock is currently held then it means another process beat us to it and we should wait until they have finished, at which point they will have installed and registered the package and we can continue building the other packages but now using the A-39af3 that has now appeared in the store.

In 2 of course we avoid repeating the work of building A-39af3. On the other hand it may be that 1 is simpler to implement.

Atomic primitives like create directory enable two process to race and for the looser to discover they lost. However this is not enough because installing and registering cannot be made fully atomic. In particular there is the need both to install files and to register. While installing files could be made atomic by using directory rename, the ghc-pkg registration happens with a separate set of files. So we need locking of some sort, and ideally locking that enables a process to block until the lock is released (to avoid polling). Both unix and windows support blocking file locks (though some FFI would be required as the Win32 package does not bind LockFileEx).

A solution that is simple for readers would be one where testing for a package is just a matter of checking if a directory in the store exists (rather than that plus checking some "still in progress" lock is not held). So ideally a locking strategy here would involve atomically renaming the entry into the store as the final step, so that readers can simply rely on checking that entry. Note that currently cabal new-build uses the ghc package db entry as the thing to check if packages are available, but it seems like it would be nicer to use the store entry, and then to load package db info later for those packages where we need it.

So, what remains to be done here is to work out in detail the locking & installing protocol.

@ezyang
Copy link
Contributor

ezyang commented Sep 6, 2016

@dcoutts, do you think you can commit to implementing this for 2.0 (this release cycle)?

@dcoutts
Copy link
Contributor Author

dcoutts commented Sep 8, 2016

@ezyang Possibly, though we should discuss priorities & schedule.

@ezyang
Copy link
Contributor

ezyang commented Oct 12, 2016

From #3946, we now test if the directory in the store exists, but we don't atomically rename the directory in, so there is currently a race condition.

@23Skidoo 23Skidoo modified the milestones: 2.2, 2.0 Feb 17, 2017
@dcoutts
Copy link
Contributor Author

dcoutts commented Mar 15, 2017

The concurrency strategy proposed in #4400 is more or less in line with the description above, specifically strategy 1 where we may occasionally waste work.

@23Skidoo
Copy link
Member

23Skidoo commented Jun 8, 2017

#4400 is merged, can this be closed?

@ezyang
Copy link
Contributor

ezyang commented Jun 9, 2017

Yep!

@ezyang ezyang closed this as completed Jun 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants