Allow specifying shallow clones in .config #1171

Closed
Manishearth opened this issue Jan 14, 2015 · 37 comments · Fixed by #13252
Labels
A-git Area: anything dealing with git C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` S-blocked-external Status: ❌ blocked on something out of the direct control of the Cargo project, e.g., upstream fix

Comments

@Manishearth
Member

Servo has a dep on Skia, which takes a very long time to download. Aside from appearing to hang (since Cargo doesn't show a progress message), this is a rather unnecessary delay for getting a new build running.

It would be nice if we could specify a --depth argument in the .config (most of the size is due to its extensive commit history).

Alternatively, a default --depth 50 would work too, though I'm not sure how well that would interact with cargo update.

Status

Blocked on libgit2/libgit2#3058

@alexcrichton
Member

Note that this is blocked on libgit2/libgit2#2782.

@alexcrichton
Member

And to clarify some more, I would love to implement this! The backing libgit2 support is a prerequisite, however (see libgit2/libgit2#1430 as well)

@larsbergstrom

Both of those blocked issues in libgit are now fixed :-)

@alexcrichton
Member

Ah, unfortunately they were closed in favor of a new metabug in libgit2: libgit2/libgit2#3058. I'll update the description accordingly.

@larsbergstrom

Darn, got my hopes up :-)

@larsbergstrom

This issue is less important to Servo now that crates.io supports larger crate sizes and we are able to shift more of our dependency graph from raw git to crates.io.

@cuviper
Member

cuviper commented May 9, 2017

This would be nice for crates.io-index too, not just git dependencies. A current full clone of the index takes 39MB network and 48MB .git/ on disk, whereas --depth 1 is only 4.5MB network and 6MB on disk.

@alexcrichton
Member

@cuviper oh it's worth pointing out that for the index at least we planned on this happening over time (clone being large) and we wanted to protect against that. It should be implemented such that at any time we can roll the entire index into one commit (e.g. one huge squash) and force push that up. That, in effect, should reset everyone to a "shallow clone".

We've never done that yet so we don't quite have a procedure and/or threshold to do that, but I should probably verify that we can actually do that soon :). Other than that all we'd need to do is briefly turn off crates.io, do the squash, force push, and then turn crates.io back. All instances of Cargo should then pick up the new commit and hard reset to it when they next need to update the registry.

@cuviper
Member

cuviper commented May 9, 2017

Hmm, will the history be saved to a branch before the reset? I'm not sure it's important, but it seems like we shouldn't just throw it all away. And FWIW git has a --single-branch option (implied by --depth) which we would want to use to get only the squashed master. It seems libgit2 can also do this by adjusting the remote refspec.

@alexcrichton
Member

Ah excellent point, I think we'd definitely want a backup somewhere. I don't personally know a huge amount about git refspecs, but my gut is that Cargo is cloning all branches of the index today (and all historical versions of Cargo). I think it'd be easy to update that though to only fetch master!

@cuviper
Member

cuviper commented May 9, 2017

Yeah, the refspec just becomes "refs/heads/master:refs/remotes/origin/master".
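
As a minimal sketch of how such a refspec could be assembled: the source side names the branch on the server and the destination side names the remote-tracking ref. The helper name `single_branch_refspec` is hypothetical and is not part of Cargo, git2, or libgit2.

```rust
/// Builds a fetch refspec that tracks exactly one branch of one remote.
/// `single_branch_refspec` is an illustrative helper, not a Cargo or libgit2 API.
fn single_branch_refspec(remote: &str, branch: &str) -> String {
    // Left of the colon: the ref on the server. Right: where it is stored locally.
    format!("refs/heads/{}:refs/remotes/{}/{}", branch, remote, branch)
}
```

Calling `single_branch_refspec("origin", "master")` yields the refspec quoted above. In git's refspec syntax a leading `+` additionally permits non-fast-forward updates, which a squashed and force-pushed index would presumably need.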

@alexcrichton
Member

I've filed an issue to update that

@cztomsik

Hey, any changes to this? I'm depending on my own forks of glfw & yoga which I'd rather not publish to crates.io yet - fetching them takes considerable time.

@Kerollmops

I think the progress of this issue is completely related to libgit2/libgit2#5254, because IIRC cargo uses the git2 rust library and it has recently made progress on shallow clones.

@DCjanus

DCjanus commented Jan 23, 2022

I think the progress of this issue is completely related to libgit2/libgit2#5254, because IIRC cargo uses the git2 rust library and it has recently made progress on shallow clones.

@Kerollmops

Actually, there is more to it than that: since GitHub limits shallow clones of large repositories, I don't think this would work even if git2 supported shallow clones.

There is a related RFC for a pure HTTP registry index, maybe that would be a solution.

@milahu

milahu commented Mar 12, 2022

example in the wild:

downstream use case:

in nixpkgs there is rustPlatform.fetchCargoTarball which only needs shallow clones

deep clones cannot be shared between fetches, so deep clones actually increase the server load

Shallow clones are discouraged by github as they increase load on their servers (source)

in nixpkgs, many packages use fetchFromGitHub
which uses github's archive API for better performance than git clone

the downside of using the archive API is:

there is no simple way to verify the files by the commit hash,
because the commit object is *not* included in the download (could be added as http header),
and the github repos API is lossy (timezones are missing)

tree hash of files + commit object → commit hash

as workaround, fetchFromGitHub requires an additional sha256 hash to verify files

example

c=d41f5ccd7ea91afee4f1a9d20b85dbcede135d3b

git clone --depth 10 https://github.com/rust-lang/cargo
git checkout $c
git log | head -n3 | grep ^Date
# Date:   Wed Mar 2 08:41:44 2022 -0600
# -> timezone is -0600

curl https://api.github.com/repos/rust-lang/cargo/commits/$c | jq .commit.author.date
# 2022-03-02T14:41:44Z
# -> no timezone

curl -I -L https://github.com/rust-lang/cargo/archive/$c.tar.gz
# -> no commit object in http headers
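
To illustrate why the missing timezone matters, here is a minimal sketch of how git derives an object ID: SHA-1 over a `<type> <size>\0` header followed by the object body. The SHA-1 routine is written out by hand only to keep the sketch dependency-free, and the helper names as well as the commit body (tree ID, author, timestamp, message) are made up for illustration.

```rust
/// Plain SHA-1, written out only so the sketch has no dependencies.
fn sha1(data: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0];
    // Padding: 0x80, zeros up to 56 mod 64, then the bit length as a big-endian u64.
    let mut msg = data.to_vec();
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&((data.len() as u64) * 8).to_be_bytes());

    for chunk in msg.chunks(64) {
        let mut w = [0u32; 80];
        for i in 0..16 {
            w[i] = u32::from_be_bytes([chunk[4 * i], chunk[4 * i + 1], chunk[4 * i + 2], chunk[4 * i + 3]]);
        }
        for i in 16..80 {
            w[i] = (w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]).rotate_left(1);
        }
        let (mut a, mut b, mut c, mut d, mut e) = (h[0], h[1], h[2], h[3], h[4]);
        for i in 0..80 {
            let (f, k): (u32, u32) = match i {
                0..=19 => ((b & c) | (!b & d), 0x5A827999),
                20..=39 => (b ^ c ^ d, 0x6ED9EBA1),
                40..=59 => ((b & c) | (b & d) | (c & d), 0x8F1BBCDC),
                _ => (b ^ c ^ d, 0xCA62C1D6),
            };
            let t = a.rotate_left(5).wrapping_add(f).wrapping_add(e).wrapping_add(k).wrapping_add(w[i]);
            e = d;
            d = c;
            c = b.rotate_left(30);
            b = a;
            a = t;
        }
        h[0] = h[0].wrapping_add(a);
        h[1] = h[1].wrapping_add(b);
        h[2] = h[2].wrapping_add(c);
        h[3] = h[3].wrapping_add(d);
        h[4] = h[4].wrapping_add(e);
    }
    let mut out = [0u8; 20];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// Object ID as git computes it: SHA-1 of "<kind> <len>\0" followed by the body.
fn git_object_id(kind: &str, body: &[u8]) -> String {
    let mut object = format!("{} {}\0", kind, body.len()).into_bytes();
    object.extend_from_slice(body);
    sha1(&object).iter().map(|byte| format!("{:02x}", byte)).collect()
}

/// A made-up commit body; only the timezone suffix varies between calls.
fn example_commit_body(timezone: &str) -> String {
    format!(
        "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n\
         author A U Thor <author@example.com> 1646232104 {tz}\n\
         committer A U Thor <author@example.com> 1646232104 {tz}\n\
         \n\
         example commit\n",
        tz = timezone
    )
}
```

Because the author and committer lines are part of the hashed bytes, `git_object_id("commit", example_commit_body("-0600").as_bytes())` and the same call with `"+0000"` give different IDs, even though both describe the same instant. A commit object reconstructed from an API that drops the timezone therefore cannot reproduce the original commit hash.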

(maybe someone with a better reputation than me could convince github/gitlab/gitea to implement this ...)

@Ralith

Ralith commented Mar 12, 2022

there is no simple way to verify the files by the commit hash

It is however possible to fetch by the commit hash, so maybe this is fine? If you ask for commit $FOO over TLS, then the result can be trusted insofar as github is.

@milahu

milahu commented Mar 12, 2022

It is however possible to fetch by the commit hash

yepp, but it's 2x slower than github's archive API

mkdir d
cd d
git init
git remote add o https://github.com/postgres/postgres
time git fetch o 5b68f75e12831cd5b7d8058320c0ca29bbe76067 --depth 1
# real	0m13.084s
time git checkout 5b68f75e12831cd5b7d8058320c0ca29bbe76067
# real	0m3.814s

time wget https://github.com/postgres/postgres/archive/5b68f75e12831cd5b7d8058320c0ca29bbe76067.tar.gz 
# real	0m6.431s

ask for commit $FOO over TLS

TLS/SSL does not help here, because the data is already content-addressed

@epage
Contributor

epage commented Jul 1, 2022

@Byron would this only be replacing libgit2 for clones but continue to use it for everywhere else or is this the last item needed to completely switch to gitoxide?

If we need to still use libgit2, I wonder what the trade offs look like vs waiting until we can completely switch (build time, binary size, error messages, behavior differences, etc).

@Eh2406
Contributor

Eh2406 commented Jul 1, 2022

@Byron I am very excited for your grant! I look forward to us working together!

We should be able to perform the initial clone of the crates index in a shallow fashion. Doing so is only technically related and not part of this issue at all, thus it's something we would need to sort out beforehand.

From our conversations with GitHub, doing a shallow clone is fine, but then attempting to do a fetch from that clone creates a lot of work on their servers. Shallow cloning the index is not a viable solution. #9069 is actively making progress to removing git from the index entirely. Happy to talk more about the details, but as you pointed out it's off-topic for this issue.

I'd consider shallow clones of the crates index as a stepping stone for bare clones as well, which seem to be able to save an additional 700MB on disk space right off the bat and will definitely speed up initial clones further.

I believe we already do a bare clone of the index. Is that what you're referring to?

If we need to still use libgit2, I wonder what the trade offs look like vs waiting until we can completely switch (build time, binary size, error messages, behavior differences, etc).

I appreciate the pressure to look at the data on this. My null hypothesis is that gitoxide will bring improvements even if we have to have places where we fall back to libgit2.
For example, it sounds like creating a working tree is harder than doing a clone, I wonder if we could use gitoxide to do a shallow and bare clone of dep dependencies, and then use libgit2 to actually create the working tree.

The Cargo Team should make a higher bandwidth communication channel with you to fully support the work in your grant!

@Byron
Member

Byron commented Jul 2, 2022

@Byron would this only be replacing libgit2 for clones but continue to use it for everywhere else or is this the last item needed to completely switch to gitoxide?

It would only replace it for clones and fetches, basically all network operations and worktree updates. Indeed git2 would remain everywhere else in the codebase. I am tracking a probably incomplete list of capabilities that gitoxide would need to support to fully replace git2. The latter is also my goal, and I would expect this to happen in a follow-up grant that I will be applying for.

That said, if there is time in this grant period I will be happy to make use of gitoxide in as many places where possible based on your guidance, and would expect the API to be more comfortable at least while also being faster in the common case.

And to be honest, grant or not, I know I won't rest until gitoxide fully replaces git2 and is meaningfully 'better' in more ways than just being 'pure Rust'.

If we need to still use libgit2, I wonder what the trade offs look like vs waiting until we can completely switch (build time, binary size, error messages, behavior differences, etc).

I think figuring out how to perform the initial integration where git2 and gitoxide have to be present in parallel is part of what we would come up with before any integration work can start. The way I see it and from my experience with cargo, both code-paths would be compiled in at all times and can be switched between. What I would want to avoid is to have to maintain a patch queue for a year until the transition is complete 😅. In any case, it's probably worth collecting some data first to see if the overhead incurred to binary size and build/CI times is acceptable for the duration of the transition.

Last but not least, something I forgot to mention: I plan to maintain the gitoxide integration until it reaches 1.0, so things like breaking changes you won't have to deal with as I will perform the upgrades.

@epage
Contributor

epage commented Jul 2, 2022

@Eh2406

I appreciate the pressure to look at the data on this.

To clarify, I don't think a deep analysis is needed. However, in the past, we have cared about dependencies being added (e.g. clap's derive feature) and about consistent behavior when having two dependencies fulfilling a similar role (toml and toml_edit). Overall, I see this as something that both the Cargo team and Byron would work together to figure out what is the acceptable set of trade offs if we can't exclusively switch.

@Byron
Member

Byron commented Jul 2, 2022

@Byron I am very excited for your grant! I look forward to us working together!

Thank you! And likewise :).

We should be able to perform the initial clone of the crates index in a shallow fashion. Doing so is only technically related and not part of this issue at all, thus it's something we would need to sort out beforehand.

From our conversations with GitHub, doing a shallow clone is fine, but then attempting to do a fetch from that clone creates a lot of work on their servers. Shallow cloning the index is not a viable solution. #9069 is actively making progress to removing git from the index entirely. Happy to talk more about the details, but as you pointed out it's off-topic for this issue.

Before posting here I dug into the tree of links spreading out from this issue and also noticed that shallow clones and fetches were discouraged. From what I could fathom, it all boils down to this comment by a GitHub engineer 6 years ago. From my shallow (🐌👏) understanding of how shallow clones work I believe it's critical to ensure that future fetches are correctly implemented so that the fetch will only include the changes, very similar to what happens with fetches into a non-shallow repository. Besides the fetch protocol having special handling of shallow commits I don't see why the server would be slower if the client is correctly implemented. Furthermore the GitHub engineer talked about a series of patches to land that would reduce waste on the server side when handling shallow fetches, and I assume these have landed by now. They also highlighted that the client was doing a non-shallow fetch after the first shallow clone, which defeats the purpose.

That said, I am happy to assume shallow fetches aren't viable, but hope that the previous paragraph can be a reason to re-validate that assumption.

Using gitoxide for clones and fetches of the crates.io index, shallow or not, still seems to be a viable mid-term goal though, and I think it's valuable as gitoxide can resolve packs faster thus speeding up the clone.

I'd consider shallow clones of the crates index as a stepping stone for bare clones as well, which seem to be able to save an additional 700MB on disk space right off the bat and will definitely speed up initial clones further.

I believe we already do a bare clone of the index. Is that what you're referring to?

Thanks for pointing that out! After removing my possibly years old index and running cargo again, it indeed created a checkout that looks more like a bare clone. The repository isn't 'officially' bare as bare = false can be seen in the .git/config file, but I assume cargo uses it as if it was bare anyway - in any case, there is no worktree checkout anymore.

This idea can certainly be discarded then.

For example, it sounds like creating a working tree is harder than doing a clone, I wonder if we could use gitoxide to do a shallow and bare clone of dep dependencies, and then use libgit2 to actually create the working tree.

Doing so could be a viable alternative mid-term goal if any work on using gitoxide for cloning and fetching the crates index should be avoided, as it would allow skipping worktree checkouts at first. I'd even go as far as to say that gitoxide can realistically checkout the initial clone but delegate submodule updates to git2 at first.

@Byron
Member

Byron commented Jul 3, 2022

After diving into cargo's git-related code a little more (conveniently, most of the relevant code seems to be in sources/git/utils.rs), I came up with a more fine-grained and hopefully more realistic approach to tackling this issue.

assumptions

The general assumption is that gitoxide will be put behind a feature toggle (or similar), along with some options to control more precisely where shallow clones are used, i.e. crates only, or crates + index. Whenever gitoxide can't be used due to lack of feature parity with git2, git2 is used instead, which will work just like it did before (special care will be taken to ensure we don't have two repos loaded at any time, which would easily double the amount of system resources required).
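
A minimal sketch of the fallback described above, assuming a single boolean toggle and a per-operation parity check. All names here (`Backend`, `Operation`, `gitoxide_supports`, `select_backend`) are hypothetical, and treating only fetches as supported is an assumption for illustration, not a statement about either library.

```rust
/// Which implementation ends up handling an operation.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Gitoxide,
    Git2,
}

/// Hypothetical set of git operations cargo might need.
#[derive(Debug, Clone, Copy)]
enum Operation {
    Fetch,
    Checkout,
    SubmoduleUpdate,
}

/// Assumption for this sketch: only fetches have reached feature parity.
fn gitoxide_supports(op: Operation) -> bool {
    matches!(op, Operation::Fetch)
}

/// Use gitoxide only when the toggle is on and the operation is supported;
/// otherwise fall back to git2, so exactly one backend handles each operation.
fn select_backend(toggle_enabled: bool, op: Operation) -> Backend {
    if toggle_enabled && gitoxide_supports(op) {
        Backend::Gitoxide
    } else {
        Backend::Git2
    }
}
```

Deciding once per operation, rather than opening the repository with both libraries, is one way to respect the concern about not having two repos loaded at the same time.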

integration steps

There are four distinct integration steps, with the first two being sufficient to fulfill the mid term and grant goal. The last two are more like stretch goals that I did have in mind when applying for the grant and that I personally want to see realized this year.

  1. Use gitoxide for git::fetch(…)
    • value: fetches get significantly faster in the resolving objects stage
  2. Add support for shallow fetches (and configure if this also applies to the crates index)
    • value: greatly reduced bandwidth and storage requirements for initial clones; resolve this issue.
  3. gitoxide does correct and bit-for-bit-perfect worktree checkouts if compared to git
  4. gitoxide can checkout submodules as well, and shallow settings apply when fetching these
    • value: submodules benefit from all of the above - shallow fetches, faster pack resolution and checkouts

getting to shallow fetches faster

Hidden in the listing above is some flexibility in implementation detail as it's possible to use standard transports (git+ssh, https) along with standard authentication that are already implemented in gitoxide to probably cover 90% of the cases to get to implementing shallow fetch negotiations more quickly. Then, between 2) and 3) one could take the time to truly cover all the details of transports and credentials (for instance, git2 seems to support NTLM and kerberos, and uses libssh2 instead of the ssh binary like gitoxide does currently, similar to git). This is really up to our preferences here, otherwise I'd expect transport and credentials parity to happen with 1).

measures of success

Runtimes of cargo that involve a lot of git interactions should be reduced visibly. Running the cargo benchmarks seems like a good baseline as it will clone a lot of git repositories. All of this should get much faster - 38m23s is the time to beat for a plain cargo bench.

call to action and the bigger picture

This write-up is the summary of my research issue and I hope it helps to come up with the next steps to take for integrating gitoxide with cargo. It might also be a starting point for tackling other questions like:

  • when can the gitoxide code-path be considered ready for being used by default?
  • what's the rough plan and timeline for completing the transition to gitoxide to avoid having to build with gitoxide and git2?
  • Should gitoxide be used to replace the two git command invocations currently present? (it already could be used for both of them today)

I am definitely looking forward to your feedback :), thanks a lot!

@tshepang
Member

tshepang commented Jul 3, 2022

Should gitoxide be used to replace the two git command invocations currently present? (it already could be used for both of them today)

This would be nice... having executables execute other executables smells.

@Byron
Member

Byron commented Jul 4, 2022

This would be nice... having executables execute other executables smells.

I think both invocations are well-motivated. The one in build.rs is probably to avoid increasing build times, as build.rs is always built for the host and thus may lead to git2/gitoxide being built multiple times in case of --target specifications. As git is typically installed on a machine tasked to build cargo, it's the fastest way to get some basic information out of a git repository. That's probably why git2 isn't used for this even though it could have been.

The other use is git gc with the sole purpose of aggregating various packs into one pack file, which isn't easily supported by git2. Since this is actual cargo code (as opposed to build.rs), using gitoxide there should be a great advantage as it would assure this maintenance happens wherever cargo is executed independently of whether git is available in the PATH or not. I imagine this positively affects windows users in particular.

What I like about this second opportunity is that it is high-value yet easy to accomplish, and all that's needed to enable a PR is a pre-existing feature toggle to allow switching gitoxide on in the code base - it could be some sort of warm up to the bigger operation proposed here. Doing so could motivate the feature toggle that future PRs will also use (see the 4-step plan above for what these could look like).

@Byron Byron mentioned this issue Dec 1, 2022
16 tasks
bors added a commit that referenced this issue Mar 2, 2023
gitoxide integration: fetch

This PR is the first step towards resolving #1171.

In order to get there, we integrate `gitoxide` into `cargo` in such a way that one can control its usage in nightly via `-Zgitoxide` or `-Zgitoxide=<feature>[,featureN]`.

Planned features are:

* **fetch** - all fetches are done with `gitoxide` (this PR)
* **shallow_index** - the crates index will be a shallow clone (_planned_)
* **shallow_deps** - git dependencies will be a shallow clone (_planned_)
* **checkout** - plain checkouts with `gitoxide` (_planned_)

The above list is a prediction and might change as we understand the requirements better.
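
As a rough sketch of how a comma-separated value like `fetch,shallow_deps` could be turned into individual toggles: the struct and function names below are hypothetical and this is not Cargo's actual parsing code. Only the four feature names listed above are accepted, spelled exactly as in that list.

```rust
/// One boolean per planned feature; hypothetical, mirrors the list above.
#[derive(Debug, Default, PartialEq)]
struct GitoxideFeatures {
    fetch: bool,
    shallow_index: bool,
    shallow_deps: bool,
    checkout: bool,
}

/// Parses the part after `-Zgitoxide=`; unknown names are reported as errors.
fn parse_gitoxide_features(value: &str) -> Result<GitoxideFeatures, String> {
    let mut features = GitoxideFeatures::default();
    // Empty segments (e.g. from a trailing comma) are skipped.
    for name in value.split(',').map(str::trim).filter(|s| !s.is_empty()) {
        match name {
            "fetch" => features.fetch = true,
            "shallow_index" => features.shallow_index = true,
            "shallow_deps" => features.shallow_deps = true,
            "checkout" => features.checkout = true,
            other => return Err(format!("unknown gitoxide feature: {}", other)),
        }
    }
    Ok(features)
}
```

Rejecting unknown names instead of ignoring them is a design choice of this sketch: since the list is expected to change, a typo or a renamed feature then fails loudly rather than silently disabling the behavior.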

### Testing and Transitioning

By default, everything stays as is. However, relevant tests can be re-run with `gitoxide` using

```
RUSTFLAGS='--cfg always_test_gitoxide' cargo test git
```

There are about 200 tests with 'git' in their name and I plan to enable them one by one. That way the costs for CI stay manageable (my first measurement with one test was 2min 30s), while allowing us to take one step at a time.

Custom tests shall be added once we realize that more coverage is needed.

That way we should be able to maintain running `git2` and `gitoxide` side by side until we are willing to switch over to `gitoxide` entirely on stable cargo. Then turning on `git2` might be a feature toggle for a while until we finally remove it from the codebase.

_Please see the above paragraph as invitation for discussion, it's merely a basis to explore from and improve upon._

### Tasks

* [x] add feature toggle
* [x] setup test system with one currently successful test
* [x] implement fetch with `gitoxide` (MVP)
* [x] fetch progress
* [x] detect spurious errors
* [x] enable as many git tests as possible (and ignore what's not possible)
* [x] fix all git-related test failures (except for 1: built-in upload-pack, skipped for now)
* [x] validate that all HTTP handle options that come from `cargo` specific values are passed to `gitoxide`
* [x] a test to validate `git2` code can handle crates-index clones created with `gitoxide` and vice-versa
* [x] remove patches that enabled `gitoxide` testing - it's not used anymore
* [x] ~~remove all TODOs and use crates-index version of `git-repository`~~ The remaining 2 TODO's are more like questions for the reviewer.
* [x] run all tests with gitoxide on the fastest platform as another parallel task
* [x] switch to released version
* [x] [Tasks from first review round](#11448 (comment))
* [x] create a new `gitoxide` release and refer to the latest version from crates.io (instead of git-dependency)
* [x] [address 2nd review round comments](#11448 (comment))

### Postponed Tasks

I suggest to go breadth-first and implement the most valuable features first, and then aim for a broad replacement of `git2`. What's left is details and improved compatibility with the `git2` implementation that will be required once `gitoxide` should become the default implementation on stable to complete the transition.

* **built-in support for serving the `file` protocol** (i.e. without using `git`). Simple cases like `clone` can probably be supported quickly, `fetch` needs more work though due to negotiation.
* SSH name fallbacks via a native (probably ~~libssh~~ (avoid LGPL) `libssh2` based) transport. Look at [this issue](#2399) for some history.
* additional tasks from [this tracking issue](GitoxideLabs/gitoxide#450 (comment))

### Proposed Workflow

I am now using [stacked git](https://stacked-git.github.io) to keep commits meaningful during development. This will also mean that before reviews I will force-push a lot as changes will be bucketed into their respective commits.

Once review officially begins I will stop force-pushing and create small commits to address review comments. That way it should be easier to understand how things change over time.

Those review-comments can certainly be squashed into one commit before merging.

_Please let me know if this is feasible or if there are other ways of working you prefer._

### Development notes

* unrelated: [this line](https://github.com/rust-lang/cargo/blob/9827412fee4f5a88ac85e013edd954b2b63f399b/src/cargo/ops/registry.rs#L620) refers to an issue that has since been resolved in `curl`.
* Additional tasks related to a correct fetch implementation are collected in this [tracking issue](GitoxideLabs/gitoxide#450). **These affect how well the HTTP transport can be configured, needs work**
* _authentication_ [is quite complex](https://github.com/rust-lang/cargo/blob/37cad5bd7f7dcd2f6d3e45312a99a9d3eec1e2a0/src/cargo/sources/git/utils.rs#L490) and centred around making SSH connections work. This feature is currently the weakest in `gitoxide` as it simply uses `ssh` (the program) and calls it a day. No authentication flows are supported there yet and the goal would be to match `git` there at least (which it might already do by just calling `ssh`). Needs investigation. Once on par with `git` I think `cargo` can restart the whole fetch operation to try different user names like before.
   - the built-in `ssh`-program based transport can now understand permission-denied errors, but the capability isn't used after all since a builtin ssh transport is required.
* It would be possible to implement `git::Progress` and just ignore most of the calls, but that's known to be too slow as the implementation assumes a `Progress::inc()` call is as fast as an atomic increment and makes no attempt to reduce its calls to it.
* learning about [a way to get custom traits in `thiserror`](dtolnay/thiserror#212) could make spurious error checks nicer and less error prone during maintenance. It's not a problem though.
* I am using `RUSTFLAGS=--cfg` to influence the entire build and unit-tests as environment variables didn't get through to the binary built and run for tests.

### Questions

* The way `gitoxide` is configured the user has the opportunity to override these values using more specific git options, for example using url specific http settings. This looks like a feature to me, but if it's not `gitoxide` needs to provide a way to disable applying these overrides. Please let me know what's desired here - my preference is to allow overrides.
* `gitoxide` currently opens repositories similar to how `git` does which respects git specific environment variables. This might be a deviation from how it was before and can be turned off. My preference is to see it as a feature.

### Prerequisite PRs

* #11602
@weihanglo weihanglo added S-blocked-external Status: ❌ blocked on something out of the direct control of the Cargo project, e.g., upstream fix C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` S-blocked labels Mar 9, 2023
@Qix-

Qix- commented Apr 5, 2023

Note that upstream shallow cloning functionality is in final review stages over at libgit2/libgit2#6396. Perhaps an interested party can implement this using a patched version + PR here so that when it lands it's ready to go?

@pwnorbitals

It looks like shallow-cloning on libgit2 side has landed 🎉🎉
So this issue is actually unblocked now

@connorworley
Contributor

I'm interested in taking on this issue as a first contribution.

@connorworley
Contributor

My initial thoughts are to move the -Zgitoxide=shallow-index,shallow-deps flag to a new -Zgit=shallow-index,shallow-deps flag that applies to both gitoxide and libgit2. One pain point I've discovered in testing is that libgit2 doesn't support local shallow fetches. Otherwise, it's a pretty straightforward change.

@Byron
Member

Byron commented Jan 5, 2024

Thanks for sharing! This reminded me to keep track of the shallow requirement in the respective gitoxide ticket.

Besides that, I really like the generalization of the shallow-* feature.

@Qix-

Qix- commented Jan 12, 2024

Amazing, thanks everyone for getting this landed! Exciting :)
