-
-
Notifications
You must be signed in to change notification settings - Fork 312
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add basic connectivity check #1081
feat: add basic connectivity check #1081
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for contributing, great work!
I took a first look and the most impactful remark I have is that either the new crate needs tests, or the code should move to gitoxide-core
for experimentation.
Thanks and cheers
I seem to have slightly botched this PR... I spent some time iterating on the formatting CI checks and fixed them, but then it looks like I failed to push the fixes before submitting the draft PR 😮💨. This revision of the PR also includes the incorrect committer email address (my work one). I'll get this fixed. Your comments are all still perfectly valid as they don't relate to the formatting checks - replies incoming... |
4dce6bb
to
b12b07d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for keeping at it!
Besides leaving a few comments which I hope are descriptive, I will provide you with a new trait that helps you to avoid using gix-odb
as dependency.
And even though this PR is still some steps away from being mergable, it's getting there and I think the learning experience will be valuable to you. If that's ever not the case or you feel exhausted, please do let me know as I can also push it to completion myself then, no problem.
gix-fsck/Cargo.toml
Outdated
gix-hash = { version = "^0.13.1", path = "../gix-hash" } | ||
gix-hashtable = { version = "^0.4.0", path = "../gix-hashtable" } | ||
gix-object = { version = "^0.38.0", path = "../gix-object" } | ||
gix-odb = { version = "^0.54.0", path = "../gix-odb" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gix-fsck
cannot use git-odb
- I always avoid such a heavy dependency. However, I see why you do it and believe it's finally time to improve object access.
I will be adding a trait to gix-object
that supports the features you need, so the gix-odb
dependency can go away.
gix-fsck/tests/Cargo.toml
Outdated
@@ -0,0 +1,22 @@ | |||
[package] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need to make this a separate project. Standard integration tests should be fine here.
These 'test-projects' exist to avoid cyclic dependencies which break rust-analyzer, but that doesn't seem to be an issue here.
Extra-dependencies can be added as dev-dependencies
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and thank you! I was a bit confused because by attempt at partial cloning using the existing test tooling was causing my tests to fail, but that ended up being due to uploadpack.allowFilter
not being set in persistent git config. See below for details on that...
hex_ids.into_iter().map(hex_to_id).collect() | ||
} | ||
|
||
fn hex_to_objects<'a>(hex_ids: impl IntoIterator<Item = &'a str>, kind: Kind) -> HashMap<ObjectId, Kind> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be called hex_to_ids
to stay in line with hex_to_id
which is used everywhere.
In theory, this utility could be provided by gix-testtools
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function doesn't just spit out a collection of ObjectId
's, but rather, it provides a map of ObjectID
s to Kind
's. The function above this does exactly what you're implying, and it is named as you'd expect (hex_to_ids
).
I understand that "objects" probably isn't the best naming here, as it is not outputting actual objects, but rather a representation of their (expected) metadata. Totally open to better naming!
Once #1088 is merged, you could rebase remove the |
b12b07d
to
a4f6121
Compare
Done! |
Thanks! I should be able to finish the review today - it will be an active one so I will look at everything and make changes as needed. These I will push while rewriting the commit-queue, possibly multiple times - please refrain to make any further changes to avoid complications. Even though the gist of changes will be described in their respective commits, I will provide a summary here before merging with everything I think could help further the feature. |
a4f6121
to
098b9a3
Compare
Implement a simple connectivity check in a new `gix-fsck` crate, and add this to `gix` via a new `fsck` subcommand. Currently this is functionally equivalent to: `git rev-list --objects --quiet --missing=print` feat: add basic connectivity check - address review feedback
- update `gix-fsck` crate status with all obvious features of `git-fsck` - clarify that fsck is only partial just to be sure people won't use it in place of `git fsck` - if BufWriter's are truly needed, let's expect it as type to avoid double-buffering unnecessarily. - consistent naming of the changelog markdown fail (all caps) - set version to 0.1.0 for fsck crate as 0.0 is used only for name-keeping when there is no functionality. - general cleanup - avoid allocating for each tree (but it's still recursive)
098b9a3
to
df9b264
Compare
df9b264
to
7ab5c76
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot! If you are interested, you can take a look at the added commits to participate in my refactoring-journey.
A few notes:
- when traversing trees, the
gix-traverse
crate can be used for that. However, in this case it seems easier to just implement traversal ourselves. - Please avoid recursion, as doing so removes an attack vector and typically leads to a better time with the borrowchecker - win-win if you will :).
Performance
On the git
repository, git fsck --connectivity-only
, which still performs more checking than is implemented here, will run in ~2.5s, while gix fsck
takes ~6s.
❯ hyperfine -N -w 1 'git fsck --connectivity-only' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix fsck'
Benchmark 1: git fsck --connectivity-only
Time (mean ± σ): 2.461 s ± 0.022 s [User: 2.333 s, System: 0.090 s]
Range (min … max): 2.415 s … 2.491 s 10 runs
Benchmark 2: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix fsck
Time (mean ± σ): 6.278 s ± 0.026 s [User: 6.216 s, System: 0.018 s]
Range (min … max): 6.253 s … 6.341 s 10 runs
Summary
git fsck --connectivity-only ran
2.55 ± 0.03 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix fsck
It was somewhat expected that this happens as git
- somehow - manages to access all objects much more efficiently. gix
spends the majority of its time decoding deltified objets, with the pack-cache for deltas getting just a few hits. This leads to a huge waste of time.
What I feel uncomfortable with is also that fsck
is actually a lot more than what's currently implemented, as technically it's only a check for missing objects. Even though I tried to be upfront about this in the command's documentation, this in conjunction with the performance issue makes me want to spend more time on it.
Something I'd love to see is if there is something in git fsck
that allows git to be so much faster or more efficient when accessing the ODB. Running git fsck --connectivity-only -v
shows that commits seemingly have been ordered 'a little'. Maybe it's worth it to study their algorithm and try to mimic it, maybe to get access patterns that improve the object access performance which now is the major bottle-neck.
Please let me know what you think about this.
A simple solution to one of these problems would be to rename Connectivity
into something that shows what it's currently doing, and not what it aspires to be. The performance issue would remain though.
CorrectionPerformance isn't as simple apparently, as on the linux repository, I see 46s for
|
Another note: maybe it really does need a special algorithm. Maybe it's just unrelated though: https://github.com/praetorian-inc/noseyparker/blob/3b160d86eca484b36ff754fab7337368547af704/crates/input-enumerator/src/git_metadata_graph.rs#L313 , but I thought I'd share anyway in case it can inspire to find novel solutions to old problems. The original source of the above link is here: #1096 (comment) |
Implement a simple connectivity check in a new
gix-fsck
crate, and add this togix
via a newfsck
subcommand. Currently this is functionally equivalent to:git rev-list --objects --quiet --missing=print
Tasks
Please note that these tasks are notes of Byron, feel free to make it yours though.
crate-status.md
with information about what's done and what is not, especially in comparison to whatgit fsck
really does.README.md
with new crateTasks for Byron
fsck
command to be more similar tofsck
(in lack of a vision what the other subcommands would be)try to improve gix-testtools by configuring the preamble in the fixture script used here- does not work, it's very particular about where it wants to find that settings, needs it to be in globals. Fair enough.