-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support for shallow clones and fetches with gitoxide
#11840
Conversation
r? @epage (rustbot has picked a reviewer for you, use r? to override) |
gitoxide
gitoxide
4459329
to
367074c
Compare
tests/testsuite/git.rs
Outdated
.all()? | ||
.count(), | ||
2, | ||
"it fetched only the new portion of the history, even though it was unaware of the shallow boundary" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is truly amazing. Do we know that this will always work correctly?
I'm worried about the case where a repository returns to an earlier version of the file. Say we have commit 1 with "a", 2 with "b", and 3 with "a". We do a shallow clone of 2. Then we do a libgit2 full update from 2->3, the server tells us "the contents of the file is the same as it was in commit 1 which you already have" but we don't have it.
I am concerned that we always correctly handle any errors from libgit2 and that we never deepen a shallow clone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, a valid point! But is it that exhaustive?
If that was true, it wouldn't only have to align the commits to send along with the changes within them, but also back-track through the entire history to do some sort of tree optimization. Not having implemented that part yet definitely leaves my knowledge spotty there, but having implemented pack generation I know that when building the smallest possible pack of a set of commits, I do diff each of them to only send changes. Each of these object candidates to send would now have to pass through a filter that answers the question of "does this object exist already in the portion of the commit graph that the client has? If yes, don't send it." However, what speaks for it being able to do that is its ability to base a delta entry in the back off an object that it knows the other side has (so it doesn't have to send it).
I am aware that the server side has caches which facilitate all kinds of operations (which are mostly non-functional when doing a shallow clone), but that level of sophistication still baffles me :D.
Excuse my ramblings, I think the answer here is that it's still early and I am writing tests to see what happens, and more testing has to be done to see if it's truly safe. If there is any doubt, and I will see if I can get clarification, it would be better to nuke the repository and re-clone it with git2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of researching this further, I thought it's safest to reinitialize repo instead so using git2
on top of a shallow repo will unshallow it effectively and in a controlled fashion. I am also watching the 'shallow' PR over at the libgit2
repo so this can be adjusted once it becomes available there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will always be older cargoes that do not know that shallow clones are a thing. We need a mechanism to ensure that if we shallow clone a repository older cargoes will ether "work entirely correctly" or "not see the checkout".
There are two big things that would not qualify as "work entirely correctly". The obvious one, is crashes with an error. The less obvious one is unshallow the checkout in one of the ways that is extremely expensive for GitHub.
A undesirable scenario that I think still qualifies as "work entirely correctly" is, do a Delta fetch (which as you demonstrate almost always works) and if there's an error blow away the repository and do a full fetch. If we can determine what errors are likely from libgit2 and we have a re-fetch fallback for those errors, I will feel significantly better.
If for any of these reasons we need to make sure that old cargoes do not see shallow clones we will have to make sure that they are put in a different location. Either by adding it to the hash in the repo name or by /git/shallow-db
.
All of this applies just as well whether were adding shallowness from git2
or from gix
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this to the forefront of my mind.
Having seen what happens, I think the default would be that git2
can operate on shallow repos and receive only an incremental pack. This might not contain all objects they need due to the scenario you mentioned if git
really works that well (which it probably does), resulting in some probability of hard errors. With that being a possibility (and even if it is not right now it might be in the future), I feel like it would be best to not let git2
touch shallow repos at all by placing them in their own directory.
Something I also thought about is if it's an option to 'cut' the feature down to an MVP level that yields value while not being used by everyone. Specifically, what if those shallow clones of the index would only be 'for CI' which makes them one-time-use only. Then there is no worry of older cargo
's seeing them.
Thinking about it more, git-dependencies might be major benefactors after all, and having them on CI only is certainly a start, but ultimately people probably also want them locally which makes me arrive with the original problem that seemed like changing their hash (to make them invisible) is the safest bet and easiest to implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After having made sure that git-dependencies with revisions also work as they should by disabling shallow clones or unshallowing existing shallow clones it became clear that these situations can't be handled correctly by older cargos - instead they will find the revision they try to checkout missing and fail with an error.
The only way around this that I see would be to adjust the place where shallow clones are stored or make them unusable in other ways, I don't know yet.
A problem I see with the current implementation is that it itself makes a decision what to do about the shallowness of a clone, at a stage where the destination repository is already set and with it its location. I will see if it's feasible to move the whole clone into another location then once it's observable whether or not a clone is still shallow.
Running all tests against the 'make all clone and fetches shallow' has proven itself to be very useful to see more special cases and make them work, so when changing the location of clones I will use that as a measure of feasibility as well even though as of now there is no automated test that runs like this.
45f874a
to
e81797d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I will add one more test today, but besides that I believe shallow clones and fetches are working :). I will also run the entire test suite with shallow clones to see if there is unusual failures.
Please feel free to try it yourself.
&msg, | ||
)?; | ||
} else if let Some((action, remote)) = | ||
find_in(&tasks, |t| progress_by_id(remote_progress, t)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is specifically for avoiding pauses when shallow cloning the index, which is when the remote does far more work.
0ee9c5d
to
44924a7
Compare
A couple of additional tests have been added to particularly validate that locked revisions of git dependencies work as they should, and furthermore all shallow clones are now placed under a name with the All in all, this seems to address all issues that were brought up so far and I am looking forward to receiving more feedback. |
☔ The latest upstream changes (presumably #11881) made this pull request unmergeable. Please resolve the merge conflicts. |
☔ The latest upstream changes (presumably #11928) made this pull request unmergeable. Please resolve the merge conflicts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great work! It is so close to complete that only tiny suggestions can I give 😆.
As of dealing with the blocker, we're getting closer of making cargo a workspace I believe.
src/cargo/sources/git/mod.rs
Outdated
if matches!(self, RemoteKind::GitDependencyForbidShallow) { | ||
return History::Unshallow; | ||
} else { | ||
gix::remote::fetch::Shallow::NoChange |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it put pressure on server side? According to this post from GitHub, they suggest avoiding subsequent fetches on a shallow cloned repo. Is this a thing we need to consider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the gist of this advice is that unshallowing shallow clones intentionally or accidentally incurs high cost on the server as it bypasses caches.
Unintentionally unshallowing is something that depends on branch configuration and refspecs (assuming the client code supports shallow), and the fact that cargo
limits refspecs for fetches means that unintentional unshallowing would be the exception, not the rule. That could happen nonetheless when the history is rewritten past the shallow boundary for example.
For the crates index, I think this is really something to think about as users would run into this very situation every couple of months. But when they do, the server would traverse only a small history, so the cost incurred is not on the worst end of the spectrum.
Now that we have an implementation to show for, we might even pass it maybe along with an analysis to GitHub to be sure, and I think it's worth seeing this on a timescale as well. We are talking about maybe a year from now when this code might become available by default once git2
is phased out. And by then, the HTTP index is probably already cemented as default so the amount of clones we't actually see like this would be greatly reduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for the record. Here are some concern of shallow clones from GitHub Infra Team in the past.
- Issues Cloning Spec repo - GitHub taking a very long time to download changes to the Specs Repo CocoaPods/CocoaPods#4989 (comment)
- update.sh: refuse to update shallow homebrew-core/cask clones. Homebrew/brew#9383
We will continue the conversation on Zulip and post a summary here when we make progess.
Thanks a lot for the review, it's much appreciated. In todays session I didn't address everything yet, but replied wherever I did. Tomorrow (or in the next days), I'd expect to address everything. |
439c4f2
to
c700901
Compare
This recreates the previously lost change from rust-lang#12005.
tests/testsuite/git_shallow.rs
Outdated
Ok(()) | ||
} | ||
|
||
fn find_lexicographically_first_bar_checkout() -> std::path::PathBuf { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one not important style suggestion: move these helper functions either to the top or the bottom.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for calling it out - fixed in c207407551ca7888bff0ab06ea86ea879997f81a.
tests/testsuite/git_shallow.rs
Outdated
} | ||
|
||
#[cargo_test] | ||
fn gitoxide_unshallows_git_dependencies_that_may_not_be_shallow_anymore() -> anyhow::Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No more unshallow, so title needs a chance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good catch, done in in c207407551ca7888bff0ab06ea86ea879997f81a.
src/cargo/sources/git/utils.rs
Outdated
// There is a specific commit to fetch and we will just do so in shallow-mode only | ||
// to not disturb the previous logic. Note that with typical settings for shallowing, | ||
// we will just fetch a specific slice of the history. | ||
refspecs.push(format!("+{0}:refs/remotes/origin/HEAD", rev)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! We still have room to optimize with fallbacks and retries. I am not too concerned with that at this moment, but perhaps we mark it as unresolved in the tracking issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some highlights
- Two unstable options for
-Zgitoxide
-Zgitoxide=fetch,shallow_deps
for fetching git dependencies-Zgitoxide=fetch,shallow_index
for fetching registry index
- Shallow-cloned and shallow-checked-out git repositories reside at their own
-shallow
suffixed directories, i.e,~/.cargo/registry/index/*-shallow
~/.cargo/git/db/*-shallow
~/.cargo/git/checkouts/*-shallow
- When the unstable feature is on, fetching/cloning a git repository is always a shallow fetch. This roughly equals to
git fetch --depth 1
everywhere. - Even with the presence of
Cargo.lock
or specifying a commit{ rev = "…" }
, gitoxide is still smart enough to shallow fetch without unshallowing the existing repository.
I am going to merge this sooner or later.
Complete, | ||
} | ||
|
||
#[cargo_test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really your fault. I found git-related tests are all a bit hard to follow, because you need to understand git2
and each helper functions beforehand. I wonder if there is a way to make it simple to learn. Maybe add some equivalent git command in comments to explain what it tries to do?
Not a blocker. I don't expect we to do more sizable updates in this PR 😆.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a blocker. I don't expect we to do more sizable updates in this PR 😆
Maybe no change is done in this PR, but I'd love to do better. My only means of describing what's going on right now is a wordy title for the #[test]
function, and messages as last argument in assertions. Maybe a leading paragraph could describe the journey from start to end?
I wish this summary would be retained in a more prominent spot as it is so very useful :). The tracking issue might be a spot, but maybe there are other options as well. In any case, I am excited to see this one merged :). |
Could we add the summary to the existing notes in |
Co-authored-by: Weihang Lo <me@weihanglo.tw>
@arlosi I think that's just the right place :) - please take a look at the updated |
Seems great enough for an unstable feature. We can always tweak it if needed afterward. Thank you! @bors r+ |
☀️ Test successful - checks-actions |
Update lock to normalize `home` dep I'm not sure what happened, but in #11840 `Cargo.lock` was updated in such a way that the `home` dependency had a qualified source, even though it was equivalent to crates.io. This should only happen if there is some ambiguity (like if something had a path source to the in-tree home). This causes the `Cargo.lock` file to be modified whenever running cargo commands locally. This doesn't fail `--locked` because the resolve is equivalent, pointing to the the same source. Closes #12082
Update cargo 10 commits in ac84010322a31f4a581dafe26258aa4ac8dea9cd..569b648b5831ae8a515e90c80843a5287c3304ef 2023-05-02 13:41:16 +0000 to 2023-05-05 15:49:44 +0000 - xtask-unpublished: output a markdown table (rust-lang/cargo#12085) - fix: hack around `libsysroot` instead of `libtest` (rust-lang/cargo#12088) - Optimize usage under rustup. (rust-lang/cargo#11917) - Update lock to normalize `home` dep (rust-lang/cargo#12084) - fix: doc-test failures (rust-lang/cargo#12055) - feat(cargo-metadata): add `workspace_default_members` (rust-lang/cargo#11978) - doc: clarify implications of `cargo-yank` (rust-lang/cargo#11862) - chore: Use `[workspace.dependencies]` (rust-lang/cargo#12057) - support for shallow clones and fetches with `gitoxide` (rust-lang/cargo#11840) - Build by PackageIdSpec, not name, to avoid ambiguity (rust-lang/cargo#12015) r? `@ghost`
This PR makes it possible to enable shallow clones and fetches for git dependencies and crate indices independently with the
-Zgitoxide=fetch,shallow_deps
and-Zgitoxide=fetch,shallow_index
respectively.Tasks
git2
(it can open it and fetch like normal, no issues)git2
can safely operate on a shallow clone - we unshallow it beforehand, both for registries and git dependenciesgix
with full shallow support and use it hereshallow
files remain after unshallowing. Should they not rather be deleted if empty? - Yes,git
does so as well, implemented with this commitsee if it can be avoided to ever unshallow an existingCannot happen anymore as it can predict the final location perfectly.-shallow
clone by using the right location from the start. If not, test that we can goshallow->unshallow->shallow
without a hitch.Cargo.lock
files don't prevent shallow clonesReview Notes
Notes
gitoxide
version of shallow clones as thegit2
version might soon be available as well. Having that would be good as it would ensure interoperability remains intact.git2
has been phased out, i.e. everything else is working, I think (unscientifically) there might be benefits in using worktrees for checkouts. Admittedly I don't know the history of why they weren't used in the first place. Also:gitoxide
doesn't yet support local clones and might not have to if worktrees were used instead.