Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Sparse Index] Integrate with git status #374

Conversation

derrickstolee
Copy link
Collaborator

Replaces #367, which used the upstream commits directly. This instead includes a merge commit onto vfs-2.32.0 so we can move forward even if the upstream version is modified.

This integrates the sparse-index with git status. There is a LOT of code here, and much of it makes future changes (#361 and #364) much simpler than they would be otherwise.

This merges into features/sparse-index which is intended to be released as an experimental release. The goal of reviewing this PR is to find "good enough for a limited release to dogfooders" and not the same scrutiny of upstream.

In particular, I am currently looking for an issue with directory/file conflicts at the sparse-checkout boundary. They don't reveal themselves until git checkout, and only in very rare cases that won't affect our dogfooders.

The sparse-index format is designed to be compatible with merge
conflicts, even those outside the sparse-checkout definition. The reason
is that when converting a full index to a sparse one, a cache entry with
nonzero stage will not be collapsed into a sparse directory entry.

However, this behavior was not tested, and a different behavior within
convert_to_sparse() fails in this scenario. Specifically,
cache_tree_update() will fail when unmerged entries exist.
convert_to_sparse_rec() uses the cache-tree data to recursively walk the
tree structure, but also to compute the OIDs used in the
sparse-directory entries.

Add an index scan to convert_to_sparse() that will detect if these merge
conflict entries exist and skip the conversion before trying to update
the cache-tree. This is marked as NEEDSWORK because this can be removed
with a suitable update to cache_tree_update() or a similar method that
can construct a cache-tree with invalid nodes, but still allow creating
the nodes necessary for creating sparse directory entries.

It is possible that in the future we will not need to make such an
update, since if we do not expand a sparse-index into a full one, this
conversion does not need to happen. Thus, this can be deferred until the
merge machinery is made to integrate with the sparse-index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When creating a full index from a sparse one, we create cache entries
for every blob within a given sparse directory entry. These are
correctly marked with the CE_SKIP_WORKTREE flag, but the CE_EXTENDED
flag is not included. The CE_EXTENDED flag would exist if we loaded a
full index from disk with these entries marked with CE_SKIP_WORKTREE, so
we can add the flag here to be consistent. This allows us to directly
compare the flags present in cache entries when testing the sparse-index
feature, but has no significance to its correctness in the user-facing
functionality.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
This fixes the test data shape to be as expected, allowing rename
detection to work properly now that the 'larger-content' file actually
has meaningful lines.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
As more features integrate with the sparse-index feature, more and more
special cases arise that require different data shapes within the tree
structure of the repository in order to demonstrate those cases.

Add several interesting special cases all at once instead of sprinkling
them across several commits. The interesting cases being added here are:

* Add sparse-directory entries on both sides of directories within the
  sparse-checkout definition.

* Add directories outside the sparse-checkout definition who have only
  one entry and are the first entry of a directory with multiple
  entries.

Later tests will take advantage of these shapes, but they also deepen
the tests that already exist.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Before moving to update 'git status' and 'git add' to work with sparse
indexes, add an explicit test that ensures the sparse-index works the
same as a normal sparse-checkout when the worktree contains directories
and files outside of the sparse cone.

Specifically, 'folder1/a' is a file in our test repo, but 'folder1' is
not in the sparse cone. When 'folder1/a' is modified, the file is not
shown as modified and adding it will fail. This is new behavior as of
a20f704 (add: warn when asked to update SKIP_WORKTREE entries,
2021-04-08). Before that change, these adds would be silently ignored.

Untracked files are fine: adding new files both with 'git add .' and
'git add folder1/' works just as in a full checkout. This may not be
entirely desirable, but we are not intending to change behavior at the
moment, only document it. A future change could alter the behavior to
be more sensible, and this test could be modified to satisfy the new
expected behavior.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The cache_bottom member of 'struct unpack_trees_options' is used to
track the range of index entries corresponding to a node of the cache
tree. While recursing with traverse_by_cache_tree(), this value is
preserved on the call stack using a local and then restored as that
method returns.

The mark_ce_used() method normally modifies the cache_bottom member when
it refers to the marked cache entry. However, sparse directory entries
are stored as nodes in the cache-tree data structure as of 2de37c5
(cache-tree: integrate with sparse directory entries, 2021-03-30). Thus,
the cache_bottom will be modified as the cache-tree walk advances. Do
not update it as well within mark_ce_used().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
As we further integrate the sparse-index into unpack-trees, we need to
ensure that we compare sparse directory entries correctly with other
entries. This affects searching for an exact path as well as sorting
index entries.

Sparse directory entries contain the trailing directory separator. This
is important for the sorting, in particular. Thus, within
do_compare_entry() we stop using S_IFREG in all cases, since sparse
directories should use S_IFDIR to indicate that the comparison should
treat the entry name as a dirctory.

Within compare_entry(), it first calls do_compare_entry() to check the
leading portion of the name. When the input path is a directory name, we
could match exactly already. Thus, we should return 0 if we have an
exact string match on a sparse directory entry.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
During unpack_callback(), index entries are compared against tree
entries. These are matched according to names and types. One goal is to
decide if we should recurse into subtrees or simply operate on one index
entry.

In the case of a sparse-directory entry, we do not want to recurse into
that subtree and instead simply compare the trees. In some cases, we
might want to perform a merge operation on the entry, such as during
'git checkout <commit>' which wants to replace a sparse tree entry with
the tree for that path at the target commit. We extend the logic within
unpack_nondirectories() to create a sparse-directory entry in this case,
and then that is sent to call_unpack_fn(). Since the name becomes
confusing by handling directories, rename it to unpack_single_entry()
since it handles a blob entry or a sparse directory entry without using
traverse_trees_recursive().

There are some subtleties in this process. For instance, we need to
update find_cache_entry() to allow finding a sparse-directory entry that
exactly matches a given path. Use the new helper method
sparse_dir_matches_path() for this.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When we have sparse directory entries in the index, we want to compare
that directory against sparse-checkout patterns. Those pattern matching
algorithms are built expecting a file path, not a directory path. This
is especially important in the "cone mode" patterns which will match
files that exist within the "parent directories" as well as the
recursive directory matches.

If path_matches_pattern_list() is given a directory, we can add a fake
filename ("-") to the directory and get the same results as before,
assuming we are in cone mode. Since sparse index requires cone mode
patterns, this is an acceptable assumption.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
While comparing an index to a tree, we may see a sparse directory entry.
In this case, we should compare that portion of the tree to the tree
represented by that entry. This could include a new tree which needs to
be expanded to a full list of added files. It could also include an
existing tree, in which case all of the changes inside are important to
describe, including the modifications, additions, and deletions. Note
that the case where the tree has a path and the index does not remains
identical to before: the lack of a cache entry is the same with a sparse
index.

Use diff_tree_oid() appropriately to appropriately compute the diff.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
'git status' began reporting a percentage of populated paths when
sparse-checkout is enabled in 051df3c (wt-status: show sparse
checkout status as well, 2020-07-18). This percentage is incorrect when
the index has sparse directories. It would also be expensive to
calculate as we would need to parse trees to count the total number of
possible paths.

Avoid the expensive computation by simplifying the output to only report
that a sparse checkout exists, without the percentage.

This change is the reason we use 'git status --porcelain=v2' in
t1092-sparse-checkout-compatibility.sh. We don't want to ensure that
this message is equal across both modes, but instead just the important
information about staged, modified, and untracked files are compared.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
By testing 'git -c core.fsmonitor= status -uno', we can check for the
simplest index operations that can be made sparse-aware. The necessary
implementation details are already integrated with sparse-checkout, so
modify command_requires_full_index to be zero for cmd_status().

In refresh_index(), we loop through the index entries to refresh their
stat() information. However, sparse directories have no stat()
information to populate. Ignore these entries.

This allows 'git status' to no longer expand a sparse index to a full
one. This is further tested by dropping the "-uno" option and adding an
untracked file into the worktree.

The performance test p2000-sparse-checkout-operations.sh demonstrates
these improvements:

Test                                  HEAD~1           HEAD
-----------------------------------------------------------------------------
2000.2: git status (full-index-v3)    0.31(0.30+0.05)  0.31(0.29+0.06) +0.0%
2000.3: git status (full-index-v4)    0.31(0.29+0.07)  0.34(0.30+0.08) +9.7%
2000.4: git status (sparse-index-v3)  2.35(2.28+0.10)  0.04(0.04+0.05) -98.3%
2000.5: git status (sparse-index-v4)  2.35(2.24+0.15)  0.05(0.04+0.06) -97.9%

Note that since HEAD~1 was expanding the sparse index by parsing trees,
it was artificially slower than the full index case. Thus, the 98%
improvement is misleading, and instead we should celebrate the 0.34s to
0.05s improvement of 85%. This is more indicative of the peformance
gains we are expecting by using a sparse index.

Note: we are dropping the assignment of core.fsmonitor here. This is not
necessary for the test script as we are not altering the config any
other way. Correct integration with FS Monitor will be validated in
later changes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
It is difficult, but possible, to get into a state where we intend to
add a directory that is outside of the sparse-checkout definition. Add a
test to t1092-sparse-checkout-compatibility.sh that demonstrates this
using a combination of 'git reset --mixed' and 'git checkout --orphan'.

This test failed before because the output of 'git status
--porcelain=v2' would not match on the lines for folder1/:

* The sparse-checkout repo (with a full index) would output each path
  name that is intended to be added.

* The sparse-index repo would only output that "folder1/" is staged for
  addition.

The status should report the full list of files to be added, and so this
sparse-directory entry should be expanded to a full list when reaching
it inside the wt_status_collect_changes_initial() method. Use
read_tree_at() to assist.

Somehow, this loop over the cache entries was not guarded by
ensure_full_index() as intended.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
If we need to expand a sparse-index into a full one, then the FS Monitor
bitmap is going to be incorrect. Ensure that we start fresh at such an
event.

While this is currently a performance drawback, the eventual hope of the
sparse-index feature is that these expansions will be rare and hence we
will be able to keep the FS Monitor data accurate across multiple Git
commands.

These tests are added to demonstrate that the behavior is the same
across a full index and a sparse index, but also that file modifications
to a tracked directory outside of the sparse cone will trigger
ensure_full_index().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Copy link

@jeffhostetler jeffhostetler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I marked up a few things, but didn't find anything serious. (I'm still a little fuzzy on some of the details, but that is to be expected.)

t/t1092-sparse-checkout-compatibility.sh Outdated Show resolved Hide resolved
unpack-trees.c Show resolved Hide resolved
unpack-trees.c Show resolved Hide resolved
unpack-trees.c Show resolved Hide resolved
unpack-trees.c Show resolved Hide resolved
diff-lib.c Show resolved Hide resolved
t/t1092-sparse-checkout-compatibility.sh Show resolved Hide resolved
wt-status.c Outdated Show resolved Hide resolved
sparse-index.c Show resolved Hide resolved
t/t7519-status-fsmonitor.sh Show resolved Hide resolved
Copy link
Member

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

I do have a fundamental question, though. If we can already say that a cache entry refers to a sparse dir by seeing whether its ce_mode says that it is a directory and its flags say CE_SKIP_WORKTREE, do we really need to append a / at the end? I ask because appending that slash seems to cause more trouble than it helps (sorting, binary searching, etc).

*/
if (parent_pathname.len > 0 &&
parent_pathname.buf[parent_pathname.len - 1] == '/')
strbuf_add(&parent_pathname, "-", 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would . make sense here instead of -?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason having a filename as . seems stranger to me than -. Perhaps _ would seem less odd?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked with @dscho offline here and here is the conclusion:

  • The recommendation here is clever: path/to/directory/. is semantically identical to path/to/directory, so . is a natural option here.
  • This relies on the cone-mode path-checking logic to be ignorant of this meaning of the trailing /.. While it is dumb enough to ignore that, we would be adding an expectation that it remains dumb and otherwise doesn't check that the input path is a valid name for a file.

The conclusion is that this is a clever idea, but perhaps too clever.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure there is a correct answer here. But it makes me want to say that we want to treat the ce's as a {type,path} tuple with an accessor and all that and get away from the pattern of just iterating over a dumb list of strings (that all callers know is a dumb list of strings).

Is there existing code to detect file -vs- symlink collisions? or file -vs- submodule collisions? How are they handled?
I mean we're adding the concept of (sparse) directory nodes into the index; it would be nice if they collided the same way. (I'm not trying to cause problems, just asking dumb questions.)

wt-status.c Outdated Show resolved Hide resolved
istate->fsmonitor_has_run_once = 0;
FREE_AND_NULL(istate->fsmonitor_dirty);
FREE_AND_NULL(istate->fsmonitor_last_update);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative would be to map the old extension data (which is essentially a compressed version of the fsmonitor_dirty bitmap, which we should be able to map relatively easily by iterating over the full and the sparse index simultaneously).

I do not suggest to implement this here, but maybe mention it as a potential future improvement?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you recommending a translation of the old bitmap into a new bitmap based on the new path positions? Yes, that would work, but I don't think it is worth pursuing much at all. We are better off investing in the sparse-index work such that we only change the sparse directory entries during git sparse-checkout set, which is rare and likely is on a clean working directory.

wt-status.c Show resolved Hide resolved
sparse-index.c Show resolved Hide resolved
@derrickstolee
Copy link
Collaborator Author

I do have a fundamental question, though. If we can already say that a cache entry refers to a sparse dir by seeing whether its ce_mode says that it is a directory and its flags say CE_SKIP_WORKTREE, do we really need to append a / at the end? I ask because appending that slash seems to cause more trouble than it helps (sorting, binary searching, etc).

While it makes things a bit more complicated in this scenario, you missed the part in the fundamentals where it is important for things like index_name_pos() when someone searches for a path within the sparse-directory entry. It is important that the sparse-directory entry fits in the correct position where its contained blobs would fit if the index was expanded.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee
Copy link
Collaborator Author

I amended and force-pushed because some of the vfs-test jobs were stalled on unpacking the SDKs, but canceling the jobs did not give me a way to restart the jobs (the "Cancel Workflow" button was still up instead of "re-run all jobs")

@derrickstolee derrickstolee merged commit 47cf534 into microsoft:features/sparse-index Jun 21, 2021
derrickstolee added a commit that referenced this pull request Jun 21, 2021
See gitgitgadget#973 for the upstream submission.

This PR follows #374 which does the heavy lifting.

There is one known issue with this integration: directory/file conflicts at the sparse-checkout boundary. This will not affect our dogfooders so we should ignore that bug for the purpose of the `features/sparse-index` branch. I will fix that upstream.
derrickstolee added a commit that referenced this pull request Jun 21, 2021
This follows #374 and #361 to complete the necessary functionality for the sparse-index experimental release.

This does not include the behavior changes requested upstream. The request is about how `git add` currently ignores changes that match a tracked path outside of the sparse-checkout definition. However, this is an established behavior of Git, so we don't need to change it for our dogfooders.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants