Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tree -> index diff for status #1410

Merged
merged 17 commits into from
Jan 3, 2025
Merged

tree -> index diff for status #1410

merged 17 commits into from
Jan 3, 2025

Conversation

Byron
Copy link
Member

@Byron Byron commented Jun 20, 2024

Based on #1368


diff-correctness → gix-status → gix reset


Improve gix status to the point where it's suitable for use in reset functionality.
Leads to a proper worktree reset implementation, eventually leading to a high-level reset similar to how git supports it.

Architecture

The reason this PR deals quite a bit with gix status is that for a safe implementation of reset() we need to be sure that the files we would want to touch don't don't carry modifications or are untracked files. In order to know what would need to be done, we have to diff the current-index with target-index. The set of files to touch can then be used to lookup information provided by git-status, like worktree modifications, index modifications, and untracked files, to know if we can proceed or not. Here is also where the reset-modes would affect the outcome, i.e. what to change and how.

This is a very modular approach which facilitates testing and understanding of what otherwise would be a very complex algorithm. Having a set of changes as output also allows to one day parallelize applying these changes.

This leaves us in a situation where the current checkout() implementation wants to become a fastpath for situations where the reset involves an empty tree as source (i.e. create everything and overwrite local changes).

Extra Tasks

Out-of-band tasks that just should finally be done, with potential for great impact.

  • support for hasconfig as part of resolve_includes() without actual lookahead.

Tasks

  • depth-first tree traversal to produce the same order as the index
    • make it available in gix
    • use depth-first traversal in gix tree entries
  • depth-first tree iterator as basis for efficient-just-in-time diffing
    • very hard/impossible to do without borrowing
  • consider using depth-first in gix-index::from_tree() if that's faster - try with benchmark, or big-repo tests.
    • index creation now 26% faster
  • diff index with index
    • Must include rename tracking.
  • Is there any flags that are to be considered?
    • IntentToAdd needs to be ignored when diffing with with another tree
    • skip-worktree - what to do there?
  • pathspec support for index/index diffs
  • gix::status should respect status.renames and status.renameLimit - also update docs of StatusPlatform::index_worktree_rewrites()
  • update is_dirty() to use full status
  • make diff results available from status with all transformations applied. Compare to tree-tree-diff which has that functionality already.
  • add status-iterator tests for tree-index diff parts.
  • Submodule::status() to do full status.
  • wire it up in gix
  • should Conflict::try_from_entries() be used to condense merge-conflicts in tree-index diffs? - no as it's taken care off with the status on index/worktree
  • double-check docs and structure
  • remove TODOs

Preliminary Performance

On WebKit

❯ git rev-parse @
886077e077a496a6e398df52a4b7915d8cd68f76
❯ hyperfine -N -w1 '/Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/gix status' 'git status --porcelain'
Benchmark 1: /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/gix status
  Time (mean ± σ):      1.305 s ±  0.059 s    [User: 0.548 s, System: 3.522 s]
  Range (min … max):    1.245 s …  1.419 s    10 runs

Benchmark 2: git status --porcelain
  Time (mean ± σ):      3.701 s ±  0.045 s    [User: 0.453 s, System: 29.213 s]
  Range (min … max):    3.630 s …  3.776 s    10 runs

Summary
  /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/gix status ran
    2.84 ± 0.13 times faster than git status --porcelain

On the linux kernel it's only 1.34x though, and without parallelism on WebKit it's 10% slower.

Status Enables

Inbetween

Next PR: Reset

  • reset() that checks if it's allowed to perform a worktree modification is allowed, or if an entry should be skipped. That way we can postpone safety checks like --hard

Postponed

What follows is important for resets, but won't be needed for cargo worktree resets.

  • a way to expand sparse dirs (but figure out if this is truly always necessary) - probably not, unless sparse dirs can be empty, but even then no expansion is needed
    • wire it up in gix index entries to optionally expand sparse entries
  • gix status with implemented 'porcelain-v2` display mode

Research

  • git ls-files -s and git ls-tree -r use the same ordering, so comparisons are trivial and similar to tree/tree diffs.
  • run_diff_index is the entrypoint for tree/index diffs. It's in diff-core though, so hard to follow. Oh, and that goes into unpack_trees which is even worse.
  • wt_status_collect_updated_cb is the primary callback to collect the actual tree/index change.
  • What to do with submodules?
    • It doesn't appear they get special treatment in the diff-callback at least.
  • What about sparse indices?
    • They are definitely handled, and unsparsed where needed so both indices match.
  • how to deal with unmerged entries?
    • only allow them on the right side
Abandoned DF-Iter
commit aaefdb2a083fbe24b700fbbd7ac376b284867e9d
Author: Sebastian Thiel <sebastian.thiel@icloud.com>
Date:   Sat Dec 28 18:58:43 2024 +0100

    feat: add `tree::depthfirst::Iter` for iterating tree-entries depth-first.

diff --git a/gix-traverse/src/tree/depthfirst.rs b/gix-traverse/src/tree/depthfirst.rs
index 5b0c9eab7..df074af39 100644
--- a/gix-traverse/src/tree/depthfirst.rs
+++ b/gix-traverse/src/tree/depthfirst.rs
@@ -47,14 +47,6 @@ pub(super) mod function {
         StateMut: BorrowMut<State>,
         V: Visit,
     {
-        enum Machine {
-            GetTree(ObjectId),
-            Iterate {
-                tree_buf: Vec<u8>,
-                byte_offset_to_next_entry: usize,
-            },
-        }
-
         let state = state.borrow_mut();
         let mut stack = vec![Machine::GetTree(root)];
         'outer: while let Some(item) = stack.pop() {
@@ -109,4 +101,100 @@ pub(super) mod function {
         }
         Ok(())
     }
+
+    pub(super) enum Machine {
+        GetTree(ObjectId),
+        Iterate {
+            tree_buf: Vec<u8>,
+            byte_offset_to_next_entry: usize,
+        },
+    }
+}
+
+mod iter {
+    use crate::tree::depthfirst::function::Machine;
+    use crate::tree::visit::Action;
+    use gix_object::{FindExt, TreeRefIter};
+    use std::borrow::BorrowMut;
+
+    pub struct State {
+        inner: super::State,
+        stack: Vec<Machine>,
+    }
+
+    /// An iterator yielding entries that reference external state.
+    ///
+    /// It's not using a delegate for simplicity.
+    pub struct Iter<'a, Visit, Find> {
+        /// The state
+        state: &'a mut State,
+        delegate: &'a mut Visit,
+        objects: &'a Find,
+    }
+
+    impl<'a, Visit, Find> Iter<'a, Visit, Find>
+    where
+        Find: gix_object::Find,
+        Visit: crate::tree::Visit,
+    {
+        fn next_inner(&mut self) -> Result<Option<gix_object::tree::EntryRef<'a>>, crate::tree::depthfirst::Error> {
+            let state = &mut self.state.inner;
+            let stack = &mut self.state.stack;
+            let objects = self.objects;
+            let delegate = &mut *self.delegate;
+
+            'outer: while let Some(item) = stack.pop() {
+                match item {
+                    Machine::GetTree(id) => {
+                        let mut buf = state.pop_buf();
+                        objects.find_tree_iter(&id, &mut buf)?;
+                        stack.push(Machine::Iterate {
+                            tree_buf: buf,
+                            byte_offset_to_next_entry: 0,
+                        });
+                    }
+                    Machine::Iterate {
+                        tree_buf: buf,
+                        byte_offset_to_next_entry,
+                    } => {
+                        let mut iter = TreeRefIter::from_bytes(&buf[byte_offset_to_next_entry..]);
+                        delegate.pop_back_tracked_path_and_set_current();
+                        while let Some(entry) = iter.next() {
+                            let entry = entry?;
+                            if entry.mode.is_tree() {
+                                delegate.push_path_component(entry.filename);
+                                let res = delegate.visit_tree(&entry);
+                                delegate.pop_path_component();
+                                match res {
+                                    Action::Continue => {}
+                                    Action::Cancel => break 'outer,
+                                    Action::Skip => continue,
+                                }
+
+                                delegate.push_back_tracked_path_component("".into());
+                                delegate.push_back_tracked_path_component(entry.filename);
+                                let recurse_tree = Machine::GetTree(entry.oid.to_owned());
+                                let continue_at_next_entry = Machine::Iterate {
+                                    byte_offset_to_next_entry: iter.offset_to_next_entry(&buf),
+                                    tree_buf: buf,
+                                };
+                                stack.push(continue_at_next_entry);
+                                stack.push(recurse_tree);
+                                continue 'outer;
+                            } else {
+                                delegate.push_path_component(entry.filename);
+                                if let Action::Cancel = delegate.visit_nontree(&entry) {
+                                    break 'outer;
+                                }
+                                return Ok(Some(entry));
+                                delegate.pop_path_component();
+                            }
+                        }
+                        state.push_buf(buf);
+                    }
+                }
+            }
+            Ok(None)
+        }
+    }
 }

Research

  • Ignored files are considered expandable and can be overwritten on reset
  • How to integrate submodules - probably easy to answer once gix status can deal a little better with submodules. Even though in this case a lot of submodule-related information is needed for a complete reset, probably only doable by a higher-level caller which orchestrates it.
  • How to deal with various modes like merge and keep? How to control refresh? Maybe partial (only the files we touch), and full, to also update the files we don't touch as part of status? Maybe it's part of status if that is run before.
  • Worthwhile to make explicit the difference between git reset and git checkout in terms of HEAD modifications. With the former changing HEADs referent, and the latter changing HEAD itself.
  • figure out how this relates to the current checkout() method as technically that's a reset --hard with optional overwrite check. Could it be rolled into one, with pathspec support added?
    • just keep them separate until it's clear that reset() performs just as well, which is unlikely as there is more overhead. But maybe it's not worth to maintain two versions over it. But if so, one should probably rename it.
  • for git status: what about rename tracking? It's available for tree-diffs and quite complex on its own. Probably only needs HEAD-vs-index rename tracking. No, also can have worktree rename tracking, even though it's hard to imagine how this can be fast unless it's tightly integrated with untracked-files handling. This screams for a generalization of the tracking code though as the testing and implementation is complex, but should be generalisable.

Re-learn

  • pathspecs normalize themselves to turn from any kind of specification into repo-root relative patterns.
  • attribute/ignore file sources are naturally relative to the root of the repo, which remains relative (i.e. can be .. and that root will be always be used to open files like ../.gitignore, which is useful for display to the user)

By default, each thread consumes 8MB of memory for the stack which can quickly
stack as machines have more cores and, especially during status, more threads
are started than there are cores. This overcommitting is by design, but
at least we should make sure that memory doesn't grow unnecessarily.

Especially iterators know the code they execute, hence these versions
should have a way to tune the stack size to reduce the peak memory footprint.

@Byron Byron force-pushed the status branch 2 times, most recently from bff7dde to 5c886aa Compare June 24, 2024 05:20
@Byron Byron force-pushed the status branch 2 times, most recently from af64f8f to eef0fe0 Compare July 12, 2024 13:06
@Byron Byron force-pushed the status branch 2 times, most recently from 964280d to acd054a Compare December 28, 2024 08:26
Add a function to help resume the iterator without holding onto the data directly.
@Byron Byron force-pushed the status branch 2 times, most recently from 531179c to 3115ed5 Compare December 28, 2024 16:24
A depth-first traversal yields the `.git/index` order.

It's a breaking change as the `Visitor` trait gets another way
to pop a tracked path, suitable for the stack used for depth first.
This allows a depth-first traversal with a delegate.
…)` now returns `&mut str`.

This helps callers to avoid converting to UTF8 by hand.
This makes the result similar to `git ls-tree` in terms of ordering.
It uses depth-first traversal out of the box which allows it to save
the sorting in the end. It's also a little bit faster.
@Byron Byron force-pushed the status branch 5 times, most recently from 28a8e93 to a867322 Compare December 30, 2024 14:49
It comes with pathspec support to allow for easier integration into
the `status()` machinery.
@Byron Byron force-pushed the status branch 2 times, most recently from 46bbae2 to 38e3c50 Compare December 30, 2024 20:21
Byron added 2 commits January 2, 2025 20:37
…tree and an index.

It also respects `status.rename` and `status.renameLimit` to configure rename tracking.
This copmpletes the `is_dirty()` implementation.
That way one can officially use "section.name" strings or `&Section::NAME`.
@Byron Byron force-pushed the status branch 2 times, most recently from 5b8140f to 21baf6f Compare January 3, 2025 19:05
Byron added 3 commits January 3, 2025 20:20
…tatus.

Note that it is still possible to disable the head-index status.

Types moved around, effectivey removing the `iter::` module for most
more general types, i.e. those that are quite genericlally useful in
a status.
@Byron Byron marked this pull request as ready for review January 3, 2025 19:31
@Byron Byron enabled auto-merge January 3, 2025 19:34
@Byron Byron merged commit 0ab4f64 into main Jan 3, 2025
20 checks passed
@Byron Byron deleted the status branch January 3, 2025 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

1 participant