fetchGit/fetchTree reproducibility problems #5313
Comments
As for the solution, perhaps it'd be best to make |
It’s not just smudge filters either. There is at least For true reproducibility it might be necessary to forgo the use of any “porcelain” and use the “plumbing” directly (either as commands or via libgit). |
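To illustrate the porcelain/plumbing distinction in miniature, here is a hedged sketch using a throwaway local repository (all file names and identities below are made up for illustration): plumbing commands like `git cat-file` read the object store verbatim and never run smudge filters, which is exactly the property a reproducible fetcher wants.

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"
git init -q repo; cd repo
git config user.email demo@example.com
git config user.name demo
printf 'hello\n' > file.txt
git add file.txt
git commit -qm init
# Plumbing: resolve the blob for a path at a commit and read its exact bytes.
# No filter (smudge or otherwise) is ever applied on this path.
blob=$(git rev-parse HEAD:file.txt)
git cat-file blob "$blob"
```

Porcelain checkout, by contrast, pipes blobs through whatever filters the local `.gitattributes` and config declare, so its output can differ between machines.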
If ditching the "porcelain" tools, please make sure that whatever the replacement is still has some solution for authentication for fetching from private repositories. Authentication does not affect reproducibility (just reliability) so there's no issue with reading authentication information from the environment or filesystem. |
The GitHub tarball API suffers from a similar problem: It runs I looked at this with @tomberek, and it seems that the most reliable solution would be to
An alternative for GitHub would be to just make it sugar for Git, but add some clever blob filtering to get reasonable performance. I've no idea how feasible that would be. |
This only applies during locking and impure use. Most of the fetching will happen on locked Note though that by making fetching tree-based, we solve the opposite problem: archive-based locking cannot fall back to git operations, because in the status quo we'd have to invoke or emulate Another possible way around the rate limits, which doesn't involve cloning, is perhaps to use Finally, I consider the false equivalence between the commit tarball and normal git fetching to be a serious bug. |
If we store the tree hash as part of the lock file, then yes. But it's problematic because it means that the commit hash isn't the ultimate source of truth any more (the tree hash is). So in a Flake context I could have
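To make the concern concrete: a commit object records exactly one tree, so the tree hash is derivable from the commit hash, but the mapping is not injective, as distinct commits can point at an identical tree. A minimal local sketch (throwaway repo, illustrative names only) shows two different commit hashes sharing one tree hash:

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo
echo hi > f; git add f
git commit -qm one
c1=$(git rev-parse HEAD)
# An empty commit reuses its parent's tree: new commit hash, same tree hash.
git commit -q --allow-empty -m two
c2=$(git rev-parse HEAD)
t1=$(git rev-parse "$c1^{tree}")
t2=$(git rev-parse "$c2^{tree}")
printf 'commits differ: %s %s\ntrees equal: %s %s\n' "$c1" "$c2" "$t1" "$t2"
```

So a lock file keyed on tree hashes can deduplicate sources, but it can no longer distinguish which commit (and thus which history and signatures) the tree came from.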
@tomberek has been trying that (directly |
Can be done with the git CLI. Crucially, we only incur the cost when using a truly new commit: e.g. if you have a flake with 12 versions of Nixpkgs, you only fetch the commits once.

```console
$ git init
$ git remote add origin https://github.com/NixOS/nixpkgs.git
$ time git fetch origin --filter=blob:none --depth=1 master
remote: Enumerating objects: 25463, done.
remote: Counting objects: 100% (25463/25463), done.
remote: Compressing objects: 100% (4729/4729), done.
remote: Total 25463 (delta 14), reused 24666 (delta 13), pack-reused 0
Receiving objects: 100% (25463/25463), 2.31 MiB | 3.46 MiB/s, done.
Resolving deltas: 100% (14/14), done.
From https://github.com/NixOS/nixpkgs
 * branch              master     -> FETCH_HEAD
 * [new branch]        master     -> origin/master

real	0m1.808s
user	0m0.394s
sys	0m0.127s
$ time git fetch origin --filter=blob:none --depth=1 master
remote: Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
From https://github.com/NixOS/nixpkgs
 * branch              master     -> FETCH_HEAD

real	0m0.976s
user	0m0.112s
sys	0m0.044s
$ git rev-parse refs/remotes/origin/master^{tree}  # is instant
76a68875f4fbbe51864431a639f7413c15b2469b
$ du -shc .git
3.1M	.git
3.1M	total
$ time git fetch origin --filter=blob:none --depth=1 nixos-unstable
remote: Enumerating objects: 1029, done.
remote: Counting objects: 100% (1029/1029), done.
remote: Compressing objects: 100% (257/257), done.
remote: Total 519 (delta 174), reused 333 (delta 0), pack-reused 0
Receiving objects: 100% (519/519), 48.52 KiB | 1.13 MiB/s, done.
Resolving deltas: 100% (174/174), completed with 174 local objects.
From https://github.com/NixOS/nixpkgs
 * branch              nixos-unstable -> FETCH_HEAD
 * [new branch]        nixos-unstable -> origin/nixos-unstable

real	0m1.299s
user	0m0.143s
sys	0m0.054s
```
That's already a problem with |
Oh, that is great! I didn't expect there would be such an easy (and cheap-ish) solution. |
Right, that's why I very much want to get my git hashing changes in at the same time we do this redesign: being able to
is very useful, especially as we approach a world where there are quite a lot of different ways to fetch git things. Ultimately, in a world with signed commits being the norm, we should be writing down commit hashes and public keys in the input spec, and then everything is verified via Merkle inclusion proofs from there. |
Probably should be in It doesn't support sparse checkouts, so it wouldn't surprise me if clone filtering (fetch filtering?) isn't implemented either; I assume it's simply not there yet. We might want to use the CLI for this procedure until libgit2 implements it. |
New commands with --filter=tree:0 instead of --filter=blob:none

We can improve on #5313 (comment)
Re-running it today gives
Getting or checking an arbitrary revision is slower, but this is only needed when we don't have a lock file to cache the tree hash.
For comparison, the GitHub API responds in 0.12 to 0.4 seconds for an arbitrary commit.
|
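As a self-contained variant of the `--filter=tree:0` idea above, here is a hedged sketch against a local "remote" over `file://`, so it runs without network access (the repository names are made up; `uploadpack.allowfilter` must be enabled because partial-clone filters are opt-in on the serving side):

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"
# A tiny stand-in for the remote repository.
git init -q remote
git -C remote config user.email demo@example.com
git -C remote config user.name demo
git -C remote config uploadpack.allowfilter true  # permit partial-clone filters
( cd remote && echo data > big.txt && git add big.txt && git commit -qm init )
git init -q local; cd local
git remote add origin "file://$tmp/remote"
# tree:0 omits blobs *and* trees; essentially only commit objects travel.
git fetch -q origin --filter=tree:0 --depth=1 HEAD
# The tree hash is recorded inside the commit object, so it is still readable.
git rev-parse 'FETCH_HEAD^{tree}'
```

The same shape works against a real remote, as in the transcript earlier in the thread; `tree:0` just shrinks the transfer further than `blob:none` at the cost of lazily fetching trees if a checkout is ever needed.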
@roberth that's sweet :)
|
@thufschmitt That's the plan. We could elaborate that a bit:
With @DavHau we discussed two implementation strategies
The latter appears more elegant, as we can re-frame the new fetching strategy as an alternate "git transport", somewhat similar to how git itself can deal with multiple protocols. Even submodule support seems within reach that way, although for that we do need the slightly slow |
Describe the bug
Builtin fetchers have to be reproducible. Any changes to their output either break the build because of a hash mismatch or cause the sources to change, if no hash is provided. Implementation errors in the builtin fetchers invalidate Nix as a tool for reproducibility.
While a git commit hash appears solid at first, its translation to a store path is not unique and prone to impurities.
This shows in #5260 for example, where someone relied on Nix's previous behavior of applying git's smudge filters. This was arguably an implementation error that was recently corrected, but it is a very breaking change nonetheless.
For a more detailed description of the problem with git smudge filters, I'll refer you to the motivation section of #4635, a PR that attempted to fix the problem, at least in most cases.
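The smudge-filter impurity can be reproduced in miniature with a throwaway repository and a deliberately lossy filter (the filter name `demo` and file names are illustrative): the commit hash pins the clean bytes in the object store, yet the checked-out worktree content depends on local filter configuration.

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo
# A smudge filter that rewrites content on checkout; "demo" is a made-up name.
git config filter.demo.clean  cat
git config filter.demo.smudge 'sed s/PLACEHOLDER/expanded/'
echo 'f.txt filter=demo' > .gitattributes
printf 'PLACEHOLDER\n' > f.txt
git add .gitattributes f.txt
git commit -qm init
rm f.txt
git checkout -q -- f.txt           # smudge filter runs here
cat f.txt                          # worktree: "expanded"
git cat-file blob HEAD:f.txt       # object store: "PLACEHOLDER"
```

Two machines with different filter configs (or none at all) would thus check out different file contents from the very same commit hash, which is precisely why a commit-hash-keyed fetcher must bypass filters.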