Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port all git functionality to use git CLI #3833

Merged
merged 6 commits into from
May 30, 2024

Conversation

ibraheemdev
Copy link
Member

@ibraheemdev ibraheemdev commented May 24, 2024

Summary

We currently rely on libgit2 for most git-related functionality. However, libgit2 has long-standing performance issues, as well as lags significantly behind git in terms of new features. For these reasons we now use the git CLI by default for fetching repositories (#1781). This PR completely drops libgit2 in favor of the git CLI for all git-related functionality, which should allow us to use features such as partial clones and sparse checkouts in the future for performance.

There is also a lot of technical debt in the current git code as it's mostly taken from Cargo. Switching to the git CLI vastly simplifies the uv-git codebase.

Eventually we might want to look into switching to gitoxide, but it's currently too immature for our use case.

Test Plan

@ibraheemdev ibraheemdev force-pushed the git-cli branch 3 times, most recently from db17525 to a8686e2 Compare May 24, 2024 22:41
crates/uv-git/src/git.rs Outdated Show resolved Hide resolved
@konstin
Copy link
Member

konstin commented May 25, 2024

Do we introduce a specific git cli minimum version requirement this way? I remember graphite breaking because my linux distro git was to old for their wrapper.

Copy link

codspeed-hq bot commented May 28, 2024

CodSpeed Performance Report

Merging #3833 will degrade performances by 9.61%

Comparing ibraheemdev:git-cli (915d8d8) with main (502e042)

Summary

⚡ 1 improvements
❌ 1 regressions
✅ 11 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main ibraheemdev:git-cli Change
resolve_warm_airflow 1.5 s 1.4 s +6.39%
resolve_warm_jupyter 80.2 ms 88.8 ms -9.61%

@ibraheemdev ibraheemdev force-pushed the git-cli branch 2 times, most recently from 3a8b6cb to b631841 Compare May 28, 2024 16:46
@ibraheemdev
Copy link
Member Author

@konstin I don't think this relies on anything even relatively new, but I can audit the commands and see the minimum version we require.

@ibraheemdev
Copy link
Member Author

ibraheemdev commented May 28, 2024

Benchmarks are pretty inconsistent but it seems like overall this change introduces a ~5% regression, which is expected because we have to shell out to Git quite a bit. We may be able to improve this somewhat taking into account the new cost (e.g. some of the rev-parse calls). However, the features that this enables us to use have a lot more potential for performance improvements, so, along with the reduced maintenance burden, it seems worth it.

#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct GitOid {
len: usize,
bytes: [u8; 40],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libgit2 hex encodes object IDs to reduce their size (by half). However we can't use the hex crate directly because we would need to support odd-length strings, so I left it out for now. We could also just store the IDs as Strings and do a bit of extra cloning.

// Skip anything related to templates, they just call all sorts of issues as
// we really don't want to use them yet they insist on being used. See #6240
// for an example issue that comes up.
// opts.external_template(false);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't seem to find a way to do this with the CLI...

git_config.set_bool("core.autocrlf", false)?;
}
// if let Ok(mut git_config) = self.repo.config() {
// git_config.set_bool("core.autocrlf", false)?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also unsure if we need to do this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just remove this. We don't support vendoring right? Unless I'm misunderstanding.

Copy link
Member Author

@ibraheemdev ibraheemdev May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What sort of vendoring are you thinking of? core.autocrlf configures whether git automatically converts line endings (CRLF to LF) for checked out files. I think this can still affect us, but I'm not sure whether it matters (or why it matters for Cargo)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you mean cargo vendoring. Yeah that looks to be why they needed this, it shouldn't affect us. I'll remove it.

}

let mut out = [0; 40];
out[..s.len()].copy_from_slice(s.as_bytes());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to validate that this a valid hex?

Copy link
Member Author

@ibraheemdev ibraheemdev May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, git will end up failing if we try to identify a commit using a malformed/broken hash anyways. I guess we could for better error messages, but it seems unlikely anyways?

@@ -970,10 +458,6 @@ pub(crate) fn fetch(
}
};

maybe_gc_repo(repo)?;

clean_repo_temp_files(repo);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove these two?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of those were linked to libgit2 issues. The git cli runs garbage collection automatically on git fetch, git commit, and other commands, so it shouldn't be a problem now.

}
let result = match self {
// Resolve the commit pointed to by the tag.
Self::Tag(s) => repo.rev_parse(&format!("refs/remotes/origin/tags/{s}^0")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the ^0 signify here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^0 is short for ^{commit}, which recursively "peels" to the underlying commit object. e.g. if the tag points to another tag, which then points to a commit. I'm not 100% sure if we need to do this, but the previous code had calls to peel(ObjectType::Commit), so this retains the original behavior. I'll add a more detailed comment explaining this.

@charliermarsh
Copy link
Member

So just confirming -- the overall scheme here is the same, right? We have one checkout for Git repo, and then we fetch individual commits and clone them into separate directories to build?

ProcessBuilder::new("git")
.arg("init")
.cwd(path)
.exec_with_streaming(&mut silent, &mut silent, true)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any risk of dropping important output here? (What did we do with Git CLI output before, when fetching?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same thing we've always done in fetch_with_cli. The true means that stderr output will still be captured in the error message if the command fails, the silent callbacks just make sure that the stdout/stderr output is not piped to the user's terminal (Command::{stdout,stderr} are overridden).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out we can use exec_with_output here instead, which is a bit clearer.

@ibraheemdev
Copy link
Member Author

So just confirming -- the overall scheme here is the same, right? We have one checkout for Git repo, and then we fetch individual commits and clone them into separate directories to build?

Yes, there should be no change in behavior.

@ibraheemdev ibraheemdev merged commit 261aa2c into astral-sh:main May 30, 2024
45 of 46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants