Intensional store model #296

Closed
Mathnerd314 opened this issue Jul 16, 2014 · 34 comments
Labels: feature (Feature request or proposal), stale

Comments

@Mathnerd314
Contributor

Currently, any change to a package's build script, e.g. adding an extra newline in installPhase, changes the derivation hash, so the package and everything that depends on it is rebuilt. With an intensional store model, only the package itself would be rebuilt; if its output is unchanged, its dependents keep their store paths and need no rebuild, which reduces build times.

http://nixos.org/~eelco/pubs/phd-thesis.pdf refers to a "prototype implementation" of the intensional store, which appears to be in https://github.com/NixOS/nix/tree/secure; maybe that could be resurrected and merged?
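
To make the difference concrete, here is a minimal Python sketch. It is not Nix's actual path computation: the helper names (`digest`, `input_addressed`, `content_addressed`) and the truncated SHA-256 hex hashes are assumptions for illustration only. It shows a trailing newline in a library's build script changing the input-addressed path of a dependent, while naming the library by its output bits leaves the dependent's build key unchanged.

```python
import hashlib


def digest(*parts: bytes) -> str:
    """Hash a sequence of byte strings, length-prefixed to avoid ambiguity."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()[:32]


def input_addressed(name: str, script: str, dep_paths: list[str]) -> str:
    """Extensional naming: hash the recipe and the paths of its inputs."""
    parts = [script.encode()] + [p.encode() for p in sorted(dep_paths)]
    return f"/nix/store/{digest(*parts)}-{name}"


def content_addressed(name: str, output: bytes) -> str:
    """Intensional naming: hash the bytes that the build produced."""
    return f"/nix/store/{digest(output)}-{name}"


# Hypothetical library whose output does not depend on a trailing newline.
LIB_OUTPUT = b"libfoo bits"
old_script = "make install"
new_script = "make install\n"

# Extensional model: the script change propagates to the dependent.
ext_lib_old = input_addressed("libfoo", old_script, [])
ext_lib_new = input_addressed("libfoo", new_script, [])
ext_app_old = input_addressed("app", "make app", [ext_lib_old])
ext_app_new = input_addressed("app", "make app", [ext_lib_new])

# Intensional model: the library is named by its output, so the
# dependent's build key (recipe plus content paths) is unchanged.
int_lib_old = content_addressed("libfoo", LIB_OUTPUT)
int_lib_new = content_addressed("libfoo", LIB_OUTPUT)
int_app_old = input_addressed("app", "make app", [int_lib_old])
int_app_new = input_addressed("app", "make app", [int_lib_new])

print(ext_app_old == ext_app_new, int_app_old == int_app_new)
```

In this sketch the library is still rebuilt after the script change; only the rebuild of `app` is avoided, because the key it is looked up under did not move.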

@edolstra edolstra added this to the nix-2.0 milestone Jul 16, 2014
@Mathnerd314
Contributor Author

Another interesting possibility is to use OSTree as the underlying store (which already hashes and deduplicates) and then turn /nix/store/* into hardlinks. So then we'd have two levels of store, the intensional and the extensional, which would mostly coexist.

@lucabrunox
Contributor

Does OSTree hardcode paths in libraries with rpath? As far as I understood when I looked at OSTree, it was less granular than nix, as it switches a whole file system. I can't imagine OSTree being used with nix.
For example, after switching trees, it triggers an ldconfig. Now you have lost the ability to run service X with library L1.0 and service Y with library L1.0-custom. That brings back the classic global state problems: you have only swapped one global state for another.

@Mathnerd314
Contributor Author

OSTree is just a store of read-only files with extended attributes identified by the hash of their contents, together with code for hardlinking those files into a directory tree. RPM-OSTree is the software that manages and switches the filesystem; I'm not proposing we use that, since our activation scripts are about as featureful and are easier to work with.

@lucabrunox
Contributor

You didn't answer how it should be applied to nix. Should the whole /nix/store tree change whenever a new derivation is stored, or what?

@Mathnerd314
Contributor Author

See the API. Maybe you can see the similarity to nix-store's internal API. My plan was to store each derivation as a commit and check it out to /nix/store/whatever. The runtime dependencies can be parent commits or we can just ignore that part and do the GC ourselves (the SQL database is not going away).

@tomberek
Contributor

Any thoughts about the "intensional store"?

@Mathnerd314
Contributor Author

@tomberek I'm not certain who you were speaking to, but my thoughts are that it should be implemented ASAP.

@vcunat
Member

vcunat commented Jul 25, 2014

I thought OSTree doesn't allow accessing multiple versions at once, just as git doesn't. Anyway, we already do have the bare store-part that we need, with always-on top-level-path deduplication and optional file-level deduplication. That's IMHO the easy part.

I thought much about the intensional store many months ago, and we certainly do want it at some point. After delving deeper I was very surprised that the consequences are not at all as straightforward as they first appeared. IIRC derivation handling is the main stumbling block and can't be as straightforward as it is now. I have no idea if/how all is dealt with in that prototype code.

Also, using the intensional store will put much larger pressure on real binary determinism of the outputs. Our nixpkgs is most likely far from ready ATM. Currently if there's some slight semantics-preserving impurity (like programs wanting to print build date), we don't even notice it, just as in any usual distro. With intensional store these packages would change their output path on every build, including paths of anything that depends on it (transitive runtime dependents).
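
A small Python sketch of this effect, under stated assumptions: `build_tool` and `build_app` are hypothetical builders, and `ca_path` is a simplified stand-in for content-addressed naming (truncated SHA-256 hex, not Nix's real scheme). An embedded build date changes the tool's path, and because the app's output keeps a runtime reference to that path, the app's path changes as well.

```python
import hashlib


def ca_path(name: str, content: bytes) -> str:
    """Simplified content-addressed path: hash of the output bytes."""
    return f"/nix/store/{hashlib.sha256(content).hexdigest()[:32]}-{name}"


def build_tool(build_date: str) -> bytes:
    # Hypothetical impure build: the output embeds the build date.
    return b"tool v1, built " + build_date.encode()


def build_app(tool_path: str) -> bytes:
    # The output keeps a runtime reference to the tool's store path.
    return b"exec " + tool_path.encode()


# Two builds of identical sources on different days.
tool_day1 = ca_path("tool", build_tool("2014-07-25"))
tool_day2 = ca_path("tool", build_tool("2014-07-26"))
app_day1 = ca_path("app", build_app(tool_day1))
app_day2 = ca_path("app", build_app(tool_day2))

# With the date pinned, the same path comes out every time.
tool_pinned_1 = ca_path("tool", build_tool("1970-01-01"))
tool_pinned_2 = ca_path("tool", build_tool("1970-01-01"))

print(tool_day1 == tool_day2, app_day1 == app_day2)
print(tool_pinned_1 == tool_pinned_2)
```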

@Mathnerd314
Contributor Author

So, here is the intensional model as I understand it.

Data types (all can be hashed and/or stored on disk and/or streamed if necessary)

  • Checkouts contain arbitrary data with no pointers
  • Storeballs (NARs) contain arbitrary data and pointers to tokens
  • Builders contain input lists of tokens and of other builder outputs, a list of outputs, and arbitrary metadata
  • Tokens refer to ephemeral resources such as "the world as it was at a specific time" (arbitrary symbols) or to an output token of a builder together with a mapping from its requirements to specific storeballs.

Containers

  • Systems contain checkouts, generated manually by NixOps
  • Stores contain storeballs, generated strictly by NixOS (note that storeballs are rarely used)
  • Databases contain output tokens, generated automatically by Hydra.
  • Programs contain builders, generated lazily by Nixpkgs.

Operations

  • Parsing transforms checkouts into builders and is done by nix-instantiate
  • Hashing transforms builders into tokens and is done by nix-hash (this part needs work, because it is currently not reversible)
  • Realizing transforms tokens into storeballs and is done by nix-store and a checkout with the required tokens (this part is signed since it can be hijacked)
  • Configuring transforms storeballs into checkouts and is done by nix-env

From this, I can see 4 things:

  1. Determinism is a nice-to-have, as it allows multiple signatures for the same storeball and thus encourages security, but is not at all necessary for the model to function.
  2. Derivations, substitutions, and sources are simply a subset of the main building functionality.
  3. Hash-rewriting is a subset of configuring. Unlike the extensional model, where the file system does most of the configuring, the intensional model allows (and requires) complete control over this process.
  4. Nixpkgs can be a lot cleaner than it is now.

I've started by rebasing secure onto master, but unfortunately most of the changes were just commenting things out and the rest referred to things that don't exist anymore, so it was mostly useful for learning my way around the code. OSTree can only store checkouts, so it is a feature rather than part of the design.

@copumpkin
Member

Nixpkgs can be a lot cleaner than it is now.

Can you elaborate on that?

@Mathnerd314
Contributor Author

The main one is keeping the checksums out of Nixpkgs; they're already stored in the token-storeball mapping. I was also thinking about omitting the version numbers, but I have concluded that's better dealt with in Nixpkgs.

@Ericson2314
Member

I believe this has some interesting interactions with recursive nix (#13).

First of all, once nix exprs can be developed upstream, it will be even more useful to have an easy way to keep HEAD packages up to date. This implies three phases:

  1. Query repo for master tip, potentially a pre-fetch for other hashes -- non-deterministic.
  2. Download srcs, build dependencies (due to prefetch, the package may already be downloaded) -- deterministic.
  3. Eval and build the package's nix expr -- deterministic, provided proper hygiene.

The building of intentionally non-deterministic pkgs seems a lot safer with an intensional store. Whereas most builds would automatically extend the user's trusted build mapping (the one inducing an equivalence set over output paths), intentionally non-deterministic builds such as the repo pre-fetch could create a new mapping which the user could optionally subscribe to.

This makes me wonder if even the actions relating to nix-channels could be conceived of as installing non-deterministic packages.

@Ericson2314
Member

On another note, I don't know about OSTree, but http://ipfs.io/ once it is ready would make a fantastic intensional store for Nix; we could really be its killer app. I mentioned it on IRC, but thought I should mention it here too. [Disclaimer: I am not associated with IPFS in any way, but neither have I tried it. I just read its paper once and immediately thought it perfect for Nix.]

@CMCDragonkai
Member

Just to clarify the intensional model is explained on page 143 of the thesis: http://nixos.org/~eelco/pubs/phd-thesis.pdf

I was wondering why it was called "intensional"?

@shlevy
Member

shlevy commented Feb 6, 2015

@CMCDragonkai
Member

I have read that before. Could you elaborate as to how this applies to Nix?
On 06/02/2015 11:54 PM, "Shea Levy" notifications@github.com wrote:

http://en.wikipedia.org/wiki/Intensional_definition

@shlevy
Member

shlevy commented Feb 6, 2015

The idea is that the store path name reflects the entirety of the properties of the path by containing a hash of its contents.

@zimbatm
Member

zimbatm commented Mar 15, 2016

When I started using nix I was confused about why we need to calculate the checksum of git repos, since the git sha is already effectively unique. Now I know the distinction, but if we could store git checkouts by their sha it would be really nice and would remove a lot of boilerplate.

@ehmry
Contributor

ehmry commented Mar 16, 2016

A cheap and easy thing to do could be to store each store path at a content addressable hard link, and then make a symlink from the input hash to the hard location. Multiple hard links can be made for different hashing schemes, and multiple input symlinks can point to a single output.

I don't know how complicated it would be to perform the switch after build jobs complete, or how costly it would be to dereference a symlink for each package referenced by inputs.
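
A rough Python sketch of that layout, assuming a POSIX filesystem with symlink support. The function `add_to_store` is hypothetical, a store path is simplified to a single file, and a temporary directory stands in for /nix/store. Each output is written once under its content hash, and every input hash that produces it becomes a symlink to that location.

```python
import hashlib
import os
import tempfile


def add_to_store(store: str, input_hash: str, name: str, content: bytes) -> str:
    """Store content under its content hash; link the input hash to it."""
    cas_hash = hashlib.sha256(content).hexdigest()[:32]
    cas_path = os.path.join(store, f"{cas_hash}-{name}")
    if not os.path.exists(cas_path):
        with open(cas_path, "wb") as f:
            f.write(content)
    link = os.path.join(store, f"{input_hash}-{name}")
    if not os.path.lexists(link):
        # Relative target, resolved against the store directory itself.
        os.symlink(os.path.basename(cas_path), link)
    return link


store = tempfile.mkdtemp()  # stand-in for /nix/store

# Two different input hashes (illustrative values) with identical output.
link_a = add_to_store(store, "a" * 32, "hello", b"hello bits")
link_b = add_to_store(store, "b" * 32, "hello", b"hello bits")

print(os.path.realpath(link_a) == os.path.realpath(link_b))
print(sorted(os.listdir(store)))
```

The store ends up with one real file and two symlinks, so the set of symlinks doubles as the mapping from input hashes to content hashes.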

@jbenet

jbenet commented Mar 24, 2016

the IPFS community would love to help with this! let us know how we can.

cc @whyrusleeping @lgierth @diasdavid @noffle @davidar

@Ericson2314
Member

@jbenet Glad to hear it! The PhD thesis is still probably the best resource on the idea itself. #378, while superficially not about this at all, I think actually serves as a good resource on the quirks of the current system, and the use case where it is most wanting.

I'm not any sort of official Nix developer, but happy to answer any questions you may have.

@vcunat vcunat mentioned this issue Mar 24, 2016
@vcunat
Member

vcunat commented Mar 24, 2016

@jbenet: actually, I strongly believe that IPFS has much larger potential of use to nix than the intensional store itself (i.e. forcing the use of hashes from content instead of derivations for path references). Let's split that thread to #859.

@wmertens
Contributor

wmertens commented Aug 7, 2017

@ehmry I just had the same idea as you :) https://groups.google.com/forum/#!topic/nix-devel/m8Rrv3VpdBo

The difference is that I propose that the build step gets the CAS entries as inputs, not the input hashes. The input hashes would only be used in case the build product needs to refer to itself.

Obviously, this means that build outputs that need to access themselves will have a different $cas for different input hashes, even if the build output is otherwise the same.

Perhaps builds should be done in /nix/store/build-$randomstring, then build-$randomstring should be replaced with zeroes before calculating the output hash $cas, and then replaced again by $cas. The $cas will be slightly more work to calculate, but still unique and predictable.
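
The zero-then-rewrite step could look roughly like the Python sketch below. The names `build` and `finalize` are hypothetical, and the sketch assumes the temporary `build-$randomstring` token has the same length as the final hash so that rewriting never shifts offsets in the output.

```python
import hashlib
import secrets

HASH_LEN = 32


def build(token: str) -> bytes:
    # Hypothetical build whose output refers to its own store location.
    return f'#!/bin/sh\nexec /nix/store/{token}/bin/real "$@"\n'.encode()


def finalize(output: bytes, token: str) -> tuple[str, bytes]:
    """Zero the temporary token, hash the result, then substitute the hash."""
    assert len(token) == HASH_LEN
    zeroed = output.replace(token.encode(), b"0" * HASH_LEN)
    cas = hashlib.sha256(zeroed).hexdigest()[:HASH_LEN]
    return cas, output.replace(token.encode(), cas.encode())


# Two independent builds, each with its own random temporary location.
# "build-" plus 26 hex characters gives a 32-character token.
token1 = "build-" + secrets.token_hex(13)
token2 = "build-" + secrets.token_hex(13)
cas1, out1 = finalize(build(token1), token1)
cas2, out2 = finalize(build(token2), token2)

print(cas1 == cas2, out1 == out2)
```

This only holds when the build does nothing with the token except embed it verbatim; a build that computes on the token would still produce differing outputs.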

@ehmry
Contributor

ehmry commented Aug 9, 2017

@wmertens Yes, I had considered supporting multiple hashing schemes, but I no longer think that is worth the effort, so replacing the input hashes seems practical. I had a system like this running. I don't remember any specific problems with CAS entries, but the whole thing eventually collapsed from making too many changes to Nix.

@edolstra
Member

Some progress on this: edolstra@236e87c

@wmertens
Contributor

@edolstra wonderful! I just read the Intensional Store section of your thesis, I now wish I did that long ago ;)

I see that there is still quite a bit of work to do to get to the Intensional Store you laid out there. One thing that stands out is storing the equivalent hashes (refClasses) in the database.

I'm particularly curious about how this will play out with Hydra, how the refClasses will be provided over the network.

It would also be interesting to have a crowd-sourced refClasses database, where many builds by somewhat-trusted users show that a certain input hash leads to some CAS hash.
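
One way such a crowd-sourced table might be filtered, as a purely illustrative Python sketch: the report format, the `agreed_ref_classes` function, and the quorum rule are all assumptions, not anything Nix or Hydra implements. A mapping from input hash to CAS hash is accepted only when enough trusted builders report it and none of them disagree.

```python
from collections import defaultdict


def agreed_ref_classes(reports, trusted, quorum=2):
    """Return {input_hash: cas_hash} for mappings trusted builders agree on.

    reports: iterable of (builder, input_hash, cas_hash) tuples.
    """
    votes = defaultdict(lambda: defaultdict(set))
    for builder, input_hash, cas_hash in reports:
        if builder in trusted:
            votes[input_hash][cas_hash].add(builder)

    accepted = {}
    for input_hash, by_cas in votes.items():
        if len(by_cas) != 1:
            continue  # trusted builders disagree: possibly a non-deterministic build
        (cas_hash, builders), = by_cas.items()
        if len(builders) >= quorum:
            accepted[input_hash] = cas_hash
    return accepted


# Illustrative reports; all builder names and hashes are made up.
reports = [
    ("alice", "in1", "casA"),
    ("bob", "in1", "casA"),
    ("mallory", "in1", "casEvil"),  # untrusted, ignored
    ("alice", "in2", "casB"),
    ("bob", "in2", "casC"),         # disagreement, rejected
    ("alice", "in3", "casD"),       # below quorum, rejected
]
result = agreed_ref_classes(reports, trusted={"alice", "bob"})
print(result)
```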

@wmertens
Contributor

wmertens commented Mar 30, 2018

I just realized that this initial progress is already enough to rewrite the entire store into CAS equivalents with a script: move+link all the outputs to CAS paths and then rewrite all the hash references to their CAS hash.

The refClasses "database table" is then simply the set of symlinks that point from original to CAS.

Enough to already play with it :) I'll see if I can cook something up this weekend, but I will be happy if someone beats me to it ;)

EDIT: the CAS linking + rewriting should happen depth-first and rewrite first, otherwise the CAS hash changes. So if a depends on b depends on c, first calculate c', then rewrite b to use c' into b', then calculate b'', then rewrite a to a' with b'', then calculate a''.
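
The ordering described in the EDIT can be sketched in Python as follows. This is an assumption-laden toy, not the script mentioned above: the store is a dictionary from original hash to file contents, `deps` lists direct references, the dependency graph is assumed acyclic, and self-references are ignored.

```python
import hashlib


def cas_hash(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()[:32]


def rewrite_store(store: dict[str, bytes], deps: dict[str, list[str]]) -> dict[str, str]:
    """Map each original hash to its CAS hash, rewriting dependencies first."""
    ref_classes: dict[str, str] = {}

    def visit(h: str) -> str:
        if h in ref_classes:
            return ref_classes[h]
        content = store[h]
        # Depth-first: resolve each dependency's CAS hash, then rewrite
        # the reference, and only then hash the rewritten content.
        for dep in deps.get(h, []):
            content = content.replace(dep.encode(), visit(dep).encode())
        ref_classes[h] = cas_hash(content)
        return ref_classes[h]

    for h in store:
        visit(h)
    return ref_classes


def make_store(a: str, b: str, c: str):
    """Build a toy store where a depends on b depends on c."""
    store = {
        c: b"c bits",
        b: b"b uses " + b.encode()[:0] + c.encode(),
        a: b"a uses " + b.encode(),
    }
    return store, {b: [c], a: [b]}


# Two stores with different original (input) hashes but identical contents.
A1, B1, C1 = "1" * 32, "2" * 32, "3" * 32
A2, B2, C2 = "4" * 32, "5" * 32, "6" * 32
m1 = rewrite_store(*make_store(A1, B1, C1))
m2 = rewrite_store(*make_store(A2, B2, C2))

print(m1[A1] == m2[A2])
```

Because references are rewritten before hashing, both stores converge on the same CAS hashes even though their original hashes differ.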

@rrnewton

rrnewton commented Jun 1, 2018

I read the intensional store thesis chapter and I think it will be a big improvement, but at the same time there seems to be a small conflict with strict determinism. The hash rewriting policy (sec 6.3.2) allows the derivation to build with its own temporary output location in its environment, and a random hash is suggested for such a purpose. But such a random hash puts entropy into that build that it could use to create a different output.

If I'm understanding correctly, the minimum example of recompilation avoidance is something like this:

bar-1.2 depends on foo-3.4
foo-3.4 builds to bits XYZ
bar-1.2 builds to bits ABC

Then foo-3.4 receives a trivial tweak (e.g. README file), that changes its derivation hash, but not its output bits (still XYZ). As a result, bar-1.2 now has its derivation hash change as well, but we want a rock-solid guarantee that bar-1.2 need not be rebuilt, because it only really depends on bits XYZ, and it would just compute bits ABC once again.

That is, the "from scratch guarantee" (like in any incremental computing) is that the full rebuild would have created the same expected bits if it were run. But that's why visibility of bar-1.2's own output path during the build (either random, or based on its derivation hash) would inject entropy that could break this from-scratch guarantee.

Simple solution: Why not just set the output path to a constant? As long as the bar-1.2 rebuild only sees a constant output path, plus the (intensional) content-based paths of its dependencies, it should have no way to produce a separate output, assuming proper determinism enforcement (CC @RyanGlScott).

As a variation on that, it could be the determinism-enforcement sandbox itself that simulates a directory-rename of the $out path. That is, there could be some random destination on the real file system, /nix/store/2c8d367ae0c4...-bar-1.2, but the build process thinks it is mapped simply to /nix/store/bar-1.2 or something. (Directory "renaming" via syscall rewriting.)
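
The concern and the constant-path fix can be illustrated with a short Python sketch. The builder `sneaky_build` is hypothetical: instead of merely embedding its output path, it computes on it, which post-hoc rewriting cannot undo. The paths and the `normalise` helper are likewise assumptions for illustration.

```python
import hashlib
import secrets


def sneaky_build(out_path: str) -> bytes:
    # Hypothetical builder that derives extra bytes from its own output path.
    seed = hashlib.sha256(out_path.encode()).hexdigest()[:8]
    return f"prefix={out_path}\nseed={seed}\n".encode()


def normalise(output: bytes, out_path: str) -> bytes:
    """Post-hoc rewriting: blank out verbatim occurrences of the path."""
    return output.replace(out_path.encode(), b"0" * len(out_path))


# Random temporary output paths: rewriting removes the path itself,
# but not the bytes that were computed from it.
random_a = f"/nix/store/{secrets.token_hex(16)}-bar-1.2"
random_b = f"/nix/store/{secrets.token_hex(16)}-bar-1.2"
post_a = normalise(sneaky_build(random_a), random_a)
post_b = normalise(sneaky_build(random_b), random_b)

# Constant output path visible to the build: no entropy to exploit.
CONSTANT_OUT = "/nix/store/bar-1.2"
const_a = sneaky_build(CONSTANT_OUT)
const_b = sneaky_build(CONSTANT_OUT)

print(post_a == post_b, const_a == const_b)
```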

P.S. The current nix make-content-addressable patch linked above seems to use a post-facto mechanism for rewriting pointers from derivation-paths to contents-paths. But post-facto rewriting would put more paths in the environment that the build should treat as opaque to guarantee from-scratch consistency. (Perhaps this is fine, as it's generally an assumption made by Nix: store paths should be treated as opaque symbols, even if that is unenforceable. But again, some rewriting tricks could guarantee that these bits never make their way into any build process's memory.)

@domenkozar domenkozar removed this from the nix-2.0 milestone Apr 30, 2020
@stale

stale bot commented Feb 13, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the stale label Feb 13, 2021
@siraben
Member

siraben commented Mar 23, 2021

Still important to me.

@stale stale bot removed the stale label Mar 23, 2021
@stale

stale bot commented Sep 19, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the stale label Sep 19, 2021
@tomberek
Contributor

Still important, but perhaps this specific issue can be closed. Seems to be well underway with the CA effort. Well done @regnat !

@stale stale bot removed the stale label Sep 19, 2021
@stale

stale bot commented Apr 16, 2022

I marked this as stale due to inactivity. → More info

@stale stale bot added the stale label Apr 16, 2022
@Ericson2314
Member

Yes, we do have this now!
