Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add basic connectivity check #1081

Merged
merged 4 commits into from
Nov 10, 2023

Conversation

cesfahani
Copy link
Contributor

@cesfahani cesfahani commented Oct 28, 2023

Implement a simple connectivity check in a new gix-fsck crate, and add this to gix via a new fsck subcommand. Currently this is functionally equivalent to:
git rev-list --objects --quiet --missing=print

Tasks

Please note that these tasks are notes of Byron, feel free to make it yours though.

  • update crate-status.md with information about what's done and what is not, especially in comparison to what git fsck really does.
  • update README.md with new crate

Tasks for Byron

  • refactor
  • flatten fsck command to be more similar to fsck (in lack of a vision what the other subcommands would be)
  • try to improve gix-testtools by configuring the preamble in the fixture script used here - does not work, it's very particular about where it wants to find that settings, needs it to be in globals. Fair enough.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for contributing, great work!

I took a first look and the most impactful remark I have is that either the new crate needs tests, or the code should move to gitoxide-core for experimentation.

Thanks and cheers

gix-fsck/LICENSE-APACHE Outdated Show resolved Hide resolved
gix-fsck/LICENSE-MIT Outdated Show resolved Hide resolved
gix-fsck/src/lib.rs Outdated Show resolved Hide resolved
gix-fsck/src/lib.rs Outdated Show resolved Hide resolved
gix/Cargo.toml Outdated Show resolved Hide resolved
src/plumbing/main.rs Outdated Show resolved Hide resolved
gix-fsck/src/lib.rs Outdated Show resolved Hide resolved
@cesfahani
Copy link
Contributor Author

I seem to have slightly botched this PR... I spent some time iterating on the formatting CI checks and fixed them, but then it looks like I failed to push the fixes before submitting the draft PR 😮‍💨. This revision of the PR also includes the incorrect committer email address (my work one). I'll get this fixed.

Your comments are all still perfectly valid as they don't relate to the formatting checks - replies incoming...

@cesfahani cesfahani force-pushed the feat_basic_connectivity_check branch from 4dce6bb to b12b07d Compare November 4, 2023 06:11
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for keeping at it!

Besides leaving a few comments which I hope are descriptive, I will provide you with a new trait that helps you to avoid using gix-odb as dependency.

And even though this PR is still some steps away from being mergable, it's getting there and I think the learning experience will be valuable to you. If that's ever not the case or you feel exhausted, please do let me know as I can also push it to completion myself then, no problem.

gix-hash = { version = "^0.13.1", path = "../gix-hash" }
gix-hashtable = { version = "^0.4.0", path = "../gix-hashtable" }
gix-object = { version = "^0.38.0", path = "../gix-object" }
gix-odb = { version = "^0.54.0", path = "../gix-odb" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gix-fsck cannot use git-odb - I always avoid such a heavy dependency. However, I see why you do it and believe it's finally time to improve object access.

I will be adding a trait to gix-object that supports the features you need, so the gix-odb dependency can go away.

@@ -0,0 +1,22 @@
[package]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to make this a separate project. Standard integration tests should be fine here.
These 'test-projects' exist to avoid cyclic dependencies which break rust-analyzer, but that doesn't seem to be an issue here.

Extra-dependencies can be added as dev-dependencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and thank you! I was a bit confused because by attempt at partial cloning using the existing test tooling was causing my tests to fail, but that ended up being due to uploadpack.allowFilternot being set in persistent git config. See below for details on that...

gix-fsck/tests/fixtures/make_test_repos.sh Outdated Show resolved Hide resolved
gix-fsck/tests/fixtures/make_test_repos.sh Outdated Show resolved Hide resolved
gix-fsck/tests/fixtures/make_test_repos.sh Outdated Show resolved Hide resolved
gix-fsck/tests/fixtures/make_test_repos.sh Outdated Show resolved Hide resolved
gix-fsck/tests/fixtures/make_test_repos.sh Outdated Show resolved Hide resolved
hex_ids.into_iter().map(hex_to_id).collect()
}

fn hex_to_objects<'a>(hex_ids: impl IntoIterator<Item = &'a str>, kind: Kind) -> HashMap<ObjectId, Kind> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be called hex_to_ids to stay in line with hex_to_id which is used everywhere.
In theory, this utility could be provided by gix-testtools as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't just spit out a collection of ObjectId's, but rather, it provides a map of ObjectIDs to Kind's. The function above this does exactly what you're implying, and it is named as you'd expect (hex_to_ids).

I understand that "objects" probably isn't the best naming here, as it is not outputting actual objects, but rather a representation of their (expected) metadata. Totally open to better naming!

src/plumbing/main.rs Outdated Show resolved Hide resolved
@Byron
Copy link
Member

Byron commented Nov 4, 2023

Once #1088 is merged, you could rebase remove the gix-odb dependency, please feel free to subscribe for notifications.

@cesfahani cesfahani force-pushed the feat_basic_connectivity_check branch from b12b07d to a4f6121 Compare November 7, 2023 04:40
@cesfahani
Copy link
Contributor Author

Once #1088 is merged, you could rebase remove the gix-odb dependency, please feel free to subscribe for notifications.

Done!

@Byron
Copy link
Member

Byron commented Nov 9, 2023

Thanks!

I should be able to finish the review today - it will be an active one so I will look at everything and make changes as needed. These I will push while rewriting the commit-queue, possibly multiple times - please refrain to make any further changes to avoid complications.

Even though the gist of changes will be described in their respective commits, I will provide a summary here before merging with everything I think could help further the feature.

@Byron Byron force-pushed the feat_basic_connectivity_check branch from a4f6121 to 098b9a3 Compare November 9, 2023 10:18
cesfahani and others added 2 commits November 10, 2023 09:11
Implement a simple connectivity check in a new `gix-fsck` crate, and add
this to `gix` via a new `fsck` subcommand. Currently this is
functionally equivalent to:
`git rev-list --objects --quiet --missing=print`

feat: add basic connectivity check - address review feedback
- update `gix-fsck` crate status with all obvious features of `git-fsck`
- clarify that fsck is only partial just to be sure people won't use it in place of `git fsck`
- if BufWriter's are truly needed, let's expect it as type to avoid double-buffering unnecessarily.
- consistent naming of the changelog markdown fail (all caps)
- set version to 0.1.0 for fsck crate as 0.0 is used only for name-keeping when there is no functionality.
- general cleanup
- avoid allocating for each tree (but it's still recursive)
@Byron Byron force-pushed the feat_basic_connectivity_check branch from 098b9a3 to df9b264 Compare November 10, 2023 08:21
@Byron Byron force-pushed the feat_basic_connectivity_check branch from df9b264 to 7ab5c76 Compare November 10, 2023 08:50
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! If you are interested, you can take a look at the added commits to participate in my refactoring-journey.

A few notes:

  • when traversing trees, the gix-traverse crate can be used for that. However, in this case it seems easier to just implement traversal ourselves.
  • Please avoid recursion, as doing so removes an attack vector and typically leads to a better time with the borrowchecker - win-win if you will :).

Performance

On the git repository, git fsck --connectivity-only, which still performs more checking than is implemented here, will run in ~2.5s, while gix fsck takes ~6s.

❯ hyperfine -N -w 1 'git fsck --connectivity-only' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix fsck'
Benchmark 1: git fsck --connectivity-only
  Time (mean ± σ):      2.461 s ±  0.022 s    [User: 2.333 s, System: 0.090 s]
  Range (min … max):    2.415 s …  2.491 s    10 runs

Benchmark 2: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix fsck
  Time (mean ± σ):      6.278 s ±  0.026 s    [User: 6.216 s, System: 0.018 s]
  Range (min … max):    6.253 s …  6.341 s    10 runs

Summary
  git fsck --connectivity-only ran
    2.55 ± 0.03 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix fsck

It was somewhat expected that this happens as git - somehow - manages to access all objects much more efficiently. gix spends the majority of its time decoding deltified objets, with the pack-cache for deltas getting just a few hits. This leads to a huge waste of time.

Screenshot 2023-11-10 at 09 37 09 Screenshot 2023-11-10 at 09 38 05

What I feel uncomfortable with is also that fsck is actually a lot more than what's currently implemented, as technically it's only a check for missing objects. Even though I tried to be upfront about this in the command's documentation, this in conjunction with the performance issue makes me want to spend more time on it.

Something I'd love to see is if there is something in git fsck that allows git to be so much faster or more efficient when accessing the ODB. Running git fsck --connectivity-only -v shows that commits seemingly have been ordered 'a little'. Maybe it's worth it to study their algorithm and try to mimic it, maybe to get access patterns that improve the object access performance which now is the major bottle-neck.

Please let me know what you think about this.

A simple solution to one of these problems would be to rename Connectivity into something that shows what it's currently doing, and not what it aspires to be. The performance issue would remain though.

@Byron
Copy link
Member

Byron commented Nov 10, 2023

Correction

Performance isn't as simple apparently, as on the linux repository, I see 46s for git and only 22s for gix - this would give room for additional checks to make me think that gix isn't always slower.

❯ hyperfine -N -w 1 'git fsck --connectivity-only' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix fsck'
Benchmark 1: git fsck --connectivity-only
  Time (mean ± σ):     46.783 s ±  1.158 s    [User: 45.172 s, System: 0.808 s]
  Range (min … max):   45.004 s … 48.705 s    10 runs

Benchmark 2: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix fsck
  Time (mean ± σ):     21.987 s ±  0.556 s    [User: 21.639 s, System: 0.190 s]
  Range (min … max):   21.242 s … 22.689 s    10 runs

Summary
  /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix fsck ran
    2.13 ± 0.08 times faster than git fsck --connectivity-only

@Byron
Copy link
Member

Byron commented Nov 10, 2023

Another note: maybe it really does need a special algorithm. Maybe it's just unrelated though: https://github.com/praetorian-inc/noseyparker/blob/3b160d86eca484b36ff754fab7337368547af704/crates/input-enumerator/src/git_metadata_graph.rs#L313 , but I thought I'd share anyway in case it can inspire to find novel solutions to old problems.

The original source of the above link is here: #1096 (comment)

@Byron Byron merged commit 1f9aca5 into GitoxideLabs:main Nov 10, 2023
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants