Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fetchGit/fetchTree reproducibility problems #5313

Open
roberth opened this issue Oct 1, 2021 · 13 comments
Open

fetchGit/fetchTree reproducibility problems #5313

roberth opened this issue Oct 1, 2021 · 13 comments
Labels

Comments

@roberth
Copy link
Member

roberth commented Oct 1, 2021

Describe the bug

Builtin fetchers have to be reproducible. Any changes to their output either break the build because of a hash mismatch or cause the sources to change, if no hash is provided. Implementation errors in the builtin fetchers invalidate Nix as a tool for reproducibility.

While a git commit hash appears solid at first, its translation to a store path is not unique and prone to impurities.

This shows in #5260 for example, where someone relied on Nix's previous behavior of applying git's smudge filters. This was arguably an implementation error that was recently corrected, but it is a very breaking change nonetheless.

For a more detailed description of the problem with git smudge filters I'll kindly refer to the motivation description of #4635, which was a PR to attempt to fix the problem, at least in most cases.

@roberth
Copy link
Member Author

roberth commented Oct 1, 2021

As for the solution, perhaps it'd be best to make builtins.fetchGit behave exactly like Nix 2.3 and implement the reproducibility fixes only in builtins.fetchTree.

@greedy
Copy link
Contributor

greedy commented Oct 1, 2021

It’s not just smudge filters either. There is at least core.eol and core.symlinks confit options that affect working copy contents. There’s also the possibility for a global gitattributes file that can cause files to be iconv’ed at checkout, and substitute RCS-style ID tokens.

For true reproducibility it might be necessary to forgo the use of any “porcelain” and use the “plumbing” directly (either as commands or using libgit)

@lilyball
Copy link
Member

If ditching the "porcelain" tools, please make sure that whatever the replacement is still has some solution for authentication for fetching from private repositories. Authentication does not affect reproducibility (just reliability) so there's no issue with reading authentication information from the environment or filesystem.

@stale stale bot added the stale label Aug 12, 2022
@zebreus zebreus mentioned this issue Sep 24, 2023
13 tasks
@thufschmitt
Copy link
Member

The GitHub tarball API suffers from a similar problem: It runs git archive behind the scenes, which honours the export-subst attribute (which can perform some nondeterministic substitutions, and gives a different tree than git clone). This isn't the case however if the API is used with a tree hash rather than a commit hash. GitLab has the same behaviour (and probably the other Git forges too since that's inherited from git archive).

I looked at this with @tomberek, and it seems that the most reliable solution would be to

  • use git archive {treeHash} for fetchTree (git)
  • Use the tarball API with a tree hash (https://github.com/{owner}/{repo}/archive/{treeHash}.tar.gz) for GitHub. However this is problematic because getting the tree hash from a commit hash requires hitting the GitHub API which is severely rate-limited for anonymous users.

An alternative for GitHub would be to just make it sugar for Git, but add some clever blob filtering to get some reasonable performances. I've no idea how possible that would be.

@roberth
Copy link
Member Author

roberth commented Oct 26, 2023

  • However this is problematic because getting the tree hash from a commit hash requires hitting the GitHub API which is severely rate-limited for anonymous users.

This only applies during locking and impure use. Most of the fetching will happen on locked fetchTree calls that already have a copy of the tree hash. Only manually pinned fetchTree calls will be affected, and fallback is possible, as mentioned.

Note though that by making fetching tree-based, we solve an opposite problem, which is that archive based locking can not fall back to git operations because in the status quo, we'd have to invoke or emulate git archive, which is not desirable.

Another possible way around the rate limits, which doesn't involve cloning, is perhaps to use libgit2 to fetch only the relevant packfile and get the tree hash from there. I believe that's feasible in 2 requests to the git (non-"api") endpoint. Not sure if such functions are exposed though.

Finally I consider the false equivalence between the commit tarball and normal git fetching to be a serious bug.

@thufschmitt
Copy link
Member

  • However this is problematic because getting the tree hash from a commit hash requires hitting the GitHub API which is severely rate-limited for anonymous users.

This only applies during locking and impure use. Most of the fetching will happen on locked fetchTree calls that already have a copy of the tree hash. Only manually pinned fetchTree calls will be affected, and fallback is possible, as mentioned.

If we store the tree hash as part of the lock file, then yes. But it's problematic because it means that the commit hash isn't the ultimate source of truth any more (the tree hash is). So in a Flake context I could have inputs.foo.url = "github:foo/bar?rev=abcde123", but a forged lockfile that maps that to a totally unrelated tree hash.
But indeed, we could fall back to plain Git if we get rate-limited.

Another possible way around the rate limits, which doesn't involve cloning, is perhaps to use libgit2 to fetch only the relevant packfile and get the tree hash from there. I believe that's feasible in 2 requests to the git (non-"api") endpoint. Not sure if such functions are exposed though.

@tomberek has been trying that (directly curl-ing the Git server, not through libgit2) with not much success. But it's probably theoretically possible indeed.

@roberth
Copy link
Member Author

roberth commented Oct 26, 2023

not much success. But it's probably theoretically possible

Can be done with the git CLI.
Tested with https://github.com/NixOS/nixpkgs.git.
master takes 3 seconds to fetch the hashes for the first time.
Also takes 3 MB storage, for all historic commits of master.
1-3 seconds for syncing up when needed (ie commit hash not found locally).
Network use also about 3 MB.

Crucially we only incur the cost when using a truly new commit. E.g. if you have a flake with 12 versions of Nixpkgs, you only fetch the commits once.

$ git init

$ git remote add origin https://github.com/NixOS/nixpkgs.git

$ time git fetch origin --filter=blob:none --depth=1 master
remote: Enumerating objects: 25463, done.
remote: Counting objects: 100% (25463/25463), done.
remote: Compressing objects: 100% (4729/4729), done.
remote: Total 25463 (delta 14), reused 24666 (delta 13), pack-reused 0
Receiving objects: 100% (25463/25463), 2.31 MiB | 3.46 MiB/s, done.
Resolving deltas: 100% (14/14), done.
From https://github.com/NixOS/nixpkgs
 * branch              master     -> FETCH_HEAD
 * [new branch]        master     -> origin/master

real	0m1.808s
user	0m0.394s
sys	0m0.127s

$ time git fetch origin --filter=blob:none --depth=1 master
remote: Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
From https://github.com/NixOS/nixpkgs
 * branch              master     -> FETCH_HEAD

real	0m0.976s
user	0m0.112s
sys	0m0.044s

$ git rev-parse refs/remotes/origin/master^{tree}   # is instant
76a68875f4fbbe51864431a639f7413c15b2469b

$ du -shc .git
3.1M	.git
3.1M	total

$ time git fetch origin --filter=blob:none --depth=1 nixos-unstable
remote: Enumerating objects: 1029, done.
remote: Counting objects: 100% (1029/1029), done.
remote: Compressing objects: 100% (257/257), done.
remote: Total 519 (delta 174), reused 333 (delta 0), pack-reused 0
Receiving objects: 100% (519/519), 48.52 KiB | 1.13 MiB/s, done.
Resolving deltas: 100% (174/174), completed with 174 local objects.
From https://github.com/NixOS/nixpkgs
 * branch              nixos-unstable -> FETCH_HEAD
 * [new branch]        nixos-unstable -> origin/nixos-unstable

real	0m1.299s
user	0m0.143s
sys	0m0.054s

problematic because it means that the commit hash isn't the ultimate source of truth any more (the tree hash is). So in a Flake context I could have inputs.foo.url = "github:foo/bar?rev=abcde123", but a forged lockfile that maps that to a totally unrelated tree hash.

That's already a problem with rev and narHash. Don't trust lockfile updates from untrusted sources.
Nonetheless, we could always check that tree and rev match. It's cheap.

@thufschmitt
Copy link
Member

Can be done with the git CLI.

Oh, that is great! I didn't expect there would be such an easy (and cheap-ish) solution.

@Ericson2314
Copy link
Member

Right that's why I very much want to get in my git hashing changes the same time we do this redesign: being able to

  • verify everything everything Nix-side with one hash
  • have that hash also be useful to upstream, not be "yet another" hatch.

is very useful, especially has we approach a world where there is quite a lot of different ways to fetch git things.

Ultimately, in a world with signed commits being the norm, we should be writing down commit hashes and public keys in the input spec, and then everything is verified via merkle inclusion proofs form there.

@stale stale bot removed the stale label Oct 28, 2023
@roberth
Copy link
Member Author

roberth commented Nov 3, 2023

git fetch origin --filter=blob:none --depth=1 master

Probably should be in git_fetch_options in libgit2, but I don't see how to do it.

It doesn't support sparse checkouts, so it wouldn't surprise me if clone filtering (fetch filtering?) isn't implemented yet either.
Partial cloning is also still a feature issue. (And the code suggests that the filter options are handled by the partial cloning feature)
Neither remote.h, nor fetch.h, nor clone.h mention "blob" either.

I assume that it's just not implemented yet. We might want to use the CLI for this procedure until it is implemented in libgit2.

@roberth
Copy link
Member Author

roberth commented Mar 25, 2024

New commands with --filter=tree:0 instead of --filter=blob:none

We can improve on #5313 (comment)

$ time git fetch origin --filter=blob:none --depth=1 master
[...]
real	0m1.808s

Re-running it today gives 0m1.299s, but tree:0 is even faster:

$ rm -rf .git/; git init; git remote add origin https://github.com/NixOS/nixpkgs.git;
$ time git fetch origin --filter=tree:0 --depth=1 master
[...]
real	0m0.556s

Getting or checking an arbitrary revision is slower, but this is only needed when we don't have a lock file to cache the tree hash.

$ time git fetch origin --filter=tree:0 --depth=1 ca09999d7a6d9e789e00078e660e90bd63acc7a4
[...]
real	0m2.140s

For comparison, the GitHub API responds in 0.12 to 0.4 seconds for an arbitrary commit.

curl https://api.github.com/repos/NixOS/nixpkgs/commits/ca09999d7a6d9e789e00078e660e90bd63acc7a4

@thufschmitt
Copy link
Member

@roberth that's sweet :)
Does that mean that for fetching github:nixos/nixpkgs/{commitHash} we can

  1. Use git to fetch the commit blob
  2. Use git to get the tree hash from the blob
  3. Use the /archive GitHub endpoint to fetch the archive corresponding to it
    ?

@roberth
Copy link
Member Author

roberth commented Mar 25, 2024

@thufschmitt That's the plan. We could elaborate that a bit:

  1. Use the git command to fetch the commit blob (including ref resolution if needed)
  2. Use libgit2 to get the tree hash from the commit
  3. Use the /archive GitHub endpoint to fetch the archive corresponding to the tree hash
  4. Verify the tree hash while unpacking the tarball (into a git repo?)
  5. (contingency) fall back to normal git-command fetch.

With @DavHau we discussed two implementation strategies

  • Extend GitArchiveInputScheme
  • "Rebase" onto GitInputScheme

The latter appears more elegant, as we can re-frame the new fetching strategy as an alternate "git transport", somewhat similar to how git itself can deal with multiple protocols.

Even submodule support seems within reach that way, although for that we do need the slightly slow blob:none, which fetches a "wintery" tree that does contain references to submodules. But well, that's extra anyway, because github: doesn't support submodules yet. (Also probably similar story for subtree fetching, like the nixpkgs/lib flake)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants