-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a Git-based content-addressed tarball cache #9485
Conversation
GitArchiveInputScheme now streams tarballs into a Git repository. This deduplicates data a lot, e.g. when you're fetching different revisions of the Nixpkgs repo. It also warns if the tree hash returned by GitHub doesn't match the tree hash of the imported tarball.
(I have now down the git sink changes I wanted.) |
@Ericson2314 It would really be easier for review to have that in a separate PR. Unless there was anything blocking in the original PR, could you remove your commits and make another PR? I will review it. |
@edolstra I can make a separate PR sure, but for having this separation is a blocker. |
#9653 This PR is just the preparatory commit. The other 2 non-merge commits I made are just changing code that was added in this PR. |
Why is that a blocker? Having a more generic |
Same reason I was so adamant about combining I think #9325 ended up using that in a small way right away, and that not something I planned in advanced, so I am decently confident we'll find a concrete use-case for this too in addition. |
b57ee7c
to
e7d95a1
Compare
@Ericson2314 The best way to ensure that PRs never get merged is to tack on concerns about future use cases, or to bloat them by adding stuff like the |
I did the work myself precisely to alleviate bottlenecks. #9653 can be a preparatory PR --- small PRs are good! --- and the rest added back to this makes it no bigger since it is just shuffling new code around. The situation where we add code and then try to make it nicer later makes me nervous; to me it is pushing up against the release readiness principle. If code review gets things in the consensus state, then there is no "cleaner -> messier -> cleaner again" race condition. |
Using libgit for deduplication yields a very simple implementation but drastically worsens eval performance. I've mentioned this before in the "Source tree abstraction" PR. Benchmark source# https://github.com/NixOS/nix/pull/6530 as of 2023-12-31
nix build github:edolstra/nix/2055e28aedb3918d91dcd80a5cf9bdd32242c822 -o /tmp/nix/git-accessor
# The version of master which the above PR is based on
nix build github:nixos/nix/9dbfd186b129ddee4a7a66958be9abdf6dd6a668 -o /tmp/nix/baseline
nix_eval() {
nix=$1
shift
# A test with a single VM system evaluation that accesses lots of files
# nixpkgs-unstable as of 2023-02-01
/tmp/nix/$nix/bin/nix eval --json github:nixos/nixpkgs/ea692c2ad1afd6384e171eabef4f0887d2b882d3#nixosTests.paperless "$@"
}
export -f nix_eval
hyperfine --warmup 3 \
'nix_eval baseline --read-only' \
'nix_eval git-accessor --read-only'
rm -rf /tmp/nix
# Benchmark 1: nix_eval baseline --read-only
# Time (mean ± σ): 3.595 s ± 0.042 s [User: 3.221 s, System: 0.368 s]
# Range (min … max): 3.554 s … 3.684 s 10 runs
#
# Benchmark 2: nix_eval git-accessor --read-only
# Time (mean ± σ): 5.786 s ± 0.059 s [User: 5.273 s, System: 0.460 s]
# Range (min … max): 5.712 s … 5.909 s 10 runs
#
# Summary
# 'nix_eval baseline --read-only' ran
# 1.61 ± 0.02 times faster than 'nix_eval git-accessor --read-only' This issue can also be confirmed with this PR: Also, extracting tarballs to the libgit cache is very slow. Evaluating a NixOS system flake that has multiple github inputs with cold caches can be 5x slower compared to Nix master, moving the initial eval time into the minutes territory. Eval perf is already a major pain point of Nix. Adding the libgit penalty makes Nix impractical to use for larger NixOS systems. |
The performance issue should be addressed by #10013. With that, the runtime of
went from 32.0s to 12.5s, which is very close to master. |
There is now a separation of: 1. A `FileSystemObjectSink` for writing to git repos 2. Adapting libarchive to use that parse sink. The prepares a proper separation of concerns.
There is no longer an `importTarball` method. Instead, there is a `unpackTarfileToSink` function (back in libutil). The caller can use thisw with the `getParseSink` method we added in the last commit easily enough. In addition, tarball cache functionality is separated from `git-utils` and moved into `tarball-cache`. This ensures we are separating mechanism and policy.
TarballInfo is only used in github.cc, and getTarballCache() is a bit too trivial to have its own file.
This seems to be the cause of https://hydra.nixos.org/build/252347612 ( |
Found the culprit: Nix now requests the tree hash of Github inputs, and the mock server from the tests doesn't provide that |
Motivation
In combination with lazy-trees, this reduces the disk space required for flakes (e.g. different revisions of nixpkgs) by a lot.
Context
Priorities
Add 👍 to pull requests you find important.