Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Sparse Index] Integrate with git commit and git checkout #361

Conversation

derrickstolee
Copy link
Collaborator

@derrickstolee derrickstolee commented Jun 1, 2021

See gitgitgadget#973 for the upstream submission.

This PR follows #374 which does the heavy lifting.

There is one known issue with this integration: directory/file conflicts at the sparse-checkout boundary. This will not affect our dogfooders so we should ignore that bug for the purpose of the features/sparse-index branch. I will fix that upstream.

@derrickstolee derrickstolee changed the base branch from vfs-2.31.1 to vfs-2.31.1-exp June 1, 2021 18:42
@derrickstolee derrickstolee force-pushed the sparse-index/vfs-checkout branch 3 times, most recently from 1e1af99 to b160b38 Compare June 3, 2021 14:34
@derrickstolee derrickstolee force-pushed the sparse-index/vfs-checkout branch 2 times, most recently from a80ef2a to 8e163eb Compare June 3, 2021 21:19
@derrickstolee derrickstolee changed the base branch from vfs-2.31.1-exp to features/sparse-index June 7, 2021 14:00
@derrickstolee derrickstolee changed the title [Experimental] Integrate git checkout with sparse index [Sparse Index] Integrate with git commit and git checkout Jun 7, 2021
@derrickstolee derrickstolee force-pushed the sparse-index/vfs-checkout branch 2 times, most recently from 4e35f10 to 46172a8 Compare June 9, 2021 14:06
As we increase our list of commands to test in
p2000-sparse-operations.sh, we will want to have a slightly smaller test
repository. Reduce the size by a factor of four by reducing the depth of
the step that creates a big index around a moderately-sized repository.

Also add a step to run 'git checkout -' on repeat. This requires having
a previous location in the reflog, so add that to the initialization
steps.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
By using shorter names for the test repos, we will get a slightly more
compressed performance summary without comprimising clarity.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Update 'git commit' to allow using the sparse-index in memory without
expanding to a full one. The only place that had an ensure_full_index()
call was in cache_tree_update(). The recursive algorithm for
update_one() was already updated in 2de37c5 (cache-tree: integrate
with sparse directory entries, 2021-03-03) to handle sparse directory
entries in the index.

Most of this change involves testing different command-line options that
allow specifying which on-disk changes should be included in the commit.
This includes no options (only take currently-staged changes), -a (take
all tracked changes), and --include (take a list of specific changes).
To simplify testing that these options do not expand the index, update
the test that previously verified that 'git status' does not expand the
index with a helper method, ensure_not_expanded().

This allows 'git commit' to operate much faster when the sparse-checkout
cone is much smaller than the full list of files at HEAD.

Here are the relevant lines from p2000-sparse-operations.sh:

Test                                      HEAD~1           HEAD
----------------------------------------------------------------------------------
2000.14: git commit -a -m A (full-v3)     0.35(0.26+0.06)  0.36(0.28+0.07) +2.9%
2000.15: git commit -a -m A (full-v4)     0.32(0.26+0.05)  0.34(0.28+0.06) +6.3%
2000.16: git commit -a -m A (sparse-v3)   0.63(0.59+0.06)  0.04(0.05+0.05) -93.7%
2000.17: git commit -a -m A (sparse-v4)   0.64(0.59+0.08)  0.04(0.04+0.04) -93.8%

It is important to compare the full-index case to the sparse-index case,
so the improvement for index version v4 is actually an 88% improvement in
this synthetic example.

In a real repository with over two million files at HEAD and 60,000
files in the sparse-checkout definition, the time for 'git commit -a'
went from 2.61 seconds to 134ms. I compared this to the result if the
index only contained the paths in the sparse-checkout definition and
found the theoretical optimum to be 120ms, so the out-of-cone paths only
add a 12% overhead.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Previous changes did the necessary improvements to unpack-trees.c and
diff-lib.c in order to modify a sparse index based on its comparision
with a tree. The only remaining work is to remove some
ensure_full_index() calls and add tests that verify that the index is
not expanded in our interesting cases. Include 'switch' and 'restore' in
these tests, as they share a base implementation with 'checkout'.

Here are the relevant performance results from
p2000-sparse-operations.sh:

Test                                     HEAD~1           HEAD
--------------------------------------------------------------------------------
2000.18: git checkout -f - (full-v3)     0.49(0.43+0.03)  0.47(0.39+0.05) -4.1%
2000.19: git checkout -f - (full-v4)     0.45(0.37+0.06)  0.42(0.37+0.05) -6.7%
2000.20: git checkout -f - (sparse-v3)   0.76(0.71+0.07)  0.04(0.03+0.04) -94.7%
2000.21: git checkout -f - (sparse-v4)   0.75(0.72+0.04)  0.05(0.06+0.04) -93.3%

It is important to compare the full index case to the sparse index case,
as the previous results for the sparse index were inflated by the index
expansion. For index v4, this is an 88% improvement.

On an internal repository with over two million paths at HEAD and a
sparse-checkout definition containing ~60,000 of those paths, 'git
checkout' went from 3.5s to 297ms with this change. The theoretical
optimum where only those ~60,000 paths exist was 275ms, so the extra
sparse directory entries contribute a 22ms overhead.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee derrickstolee self-assigned this Jun 15, 2021
derrickstolee added a commit that referenced this pull request Jun 21, 2021
Replaces #367, which used the upstream commits directly. This instead includes a merge commit onto `vfs-2.32.0` so we can move forward even if the upstream version is modified.

This integrates the sparse-index with `git status`. There is a LOT of code here, and much of it makes future changes (#361 and #364) much simpler than they would be otherwise.

This merges into `features/sparse-index` which is intended to be released as an experimental release. The goal of reviewing this PR is to find "good enough for a limited release to dogfooders" and not the same scrutiny of upstream.

In particular, I am currently looking for an issue with directory/file conflicts at the sparse-checkout boundary. They don't reveal themselves until `git checkout`, and only in very rare cases that won't affect our dogfooders.
@derrickstolee derrickstolee marked this pull request as ready for review June 21, 2021 14:32
@derrickstolee
Copy link
Collaborator Author

Now ready for review. Force-pushed to kick of a new merge and builds now that #374 is merged.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Copy link

@jeffhostetler jeffhostetler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done!

@jeffhostetler
Copy link

I have Trace2 data for the size of the index already being logged.
Would it be useful to extract confirm cache_nr when sparse-index is enabled in
some of your tests when you're grepping for the presence/absence of various
regions? Or at least confirm that the sparse-vx cache_nr value is much, much
smaller than the full-vx version?

	trace2_data_intmax("index", the_repository, "read/cache_nr",
			   istate->cache_nr);

@derrickstolee
Copy link
Collaborator Author

I have Trace2 data for the size of the index already being logged.
Would it be useful to extract confirm cache_nr when sparse-index is enabled in
some of your tests when you're grepping for the presence/absence of various
regions? Or at least confirm that the sparse-vx cache_nr value is much, much
smaller than the full-vx version?

I have some tests that extract the list of cache entries and verify that
certain directories are stored as sparse directory entries when configured
to do so. The later checks to verify we don't expand them (after starting
with a known-sparse state) should be sufficient for testing.

@derrickstolee derrickstolee merged commit da4283c into microsoft:features/sparse-index Jun 21, 2021
derrickstolee added a commit that referenced this pull request Jun 21, 2021
This follows #374 and #361 to complete the necessary functionality for the sparse-index experimental release.

This does not include the behavior changes requested upstream. The request is about how `git add` currently ignores changes that match a tracked path outside of the sparse-checkout definition. However, this is an established behavior of Git, so we don't need to change it for our dogfooders.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants