Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse index: checkout-index #424

Merged
merged 5 commits into from
Sep 14, 2021
Merged

Sparse index: checkout-index #424

merged 5 commits into from
Sep 14, 2021

Conversation

vdye
Copy link

@vdye vdye commented Sep 10, 2021

Changes

  • Expand testing in t1092 to cover checkout-index
  • Modify the behavior of checkout-index --all to make more sense with sparse checkouts (sparse index or not)
    • Introduce --sparse flag maintaining "old" behavior
  • Integrate sparse index with checkout-index, including corresponding ensure_not_expanded tests

t/t1092-sparse-checkout-compatibility.sh Show resolved Hide resolved
@@ -59,6 +61,10 @@ OPTIONS
write the content to temporary files. The temporary name
associations will be written to stdout.

--sparse::

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a little odd. From what I understand, we generally use the --sparse flag to enable sparse behavior, but here we seem to be circumventing it. We may want to consider something more like --ignore-sparse. Let me know if I'm missing something, though!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check out gitgitgadget#1018 which restricts how Git operates outside of the sparse-checkout. The --sparse option allows overriding these protections. It's really new.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldennington I definitely see where you're coming from (it does seem somewhat backwards that --sparse would expand the index/do things outside the sparse checkout). I did base it on @derrickstolee's linked PR (+ls-trees), and I think this line conveys it best:

a new '--sparse' option is added that ignores the sparse-checkout patterns and the SKIP_WORKTREE bit

The option here is pretty much treating --sparse as "does things in parts of the repository that should be sparse", rather than ignoring those parts of the repository. @derrickstolee please correct me if I'm using it wrong, though - I'm mostly looking for this to be consistent with the other already-implemented options (and offer a way to use checkout-index --all without always expanding the full index).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vdye: I think you have the right approach. We should make Git ignore files outside of the sparse-checkout cone as much as possible, and allow users to override that behavior by specifying --sparse. We do not expect users to have a legitimate reason to care about files outside of the sparse-checkout, but Git needs to care unless we make these changes.

Documentation/git-checkout-index.txt Show resolved Hide resolved
Documentation/git-checkout-index.txt Show resolved Hide resolved
Copy link
Collaborator

@derrickstolee derrickstolee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress. I have a few questions.

Comment on lines 747 to 945
# NEEDSWORK: even in sparse checkouts, checkout-index --all will create all
# files (even those outside the sparse definition) on disk. However, these files
# don't appear in the percentage of tracked files in git status.
test_expect_failure 'checkout-index --all' '
init_repos &&

test_all_match git checkout-index --all &&
test_sparse_match test_path_is_missing folder1
'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Scary that this could completely abolish all performance benefits for a sparse-checkout!

Comment on lines +744 to +930
# Note: although all tests fail (as expected), the messaging differs. For
# non-sparse index checkouts, the error is that the "file" does not appear
# in the index; for sparse checkouts, the error is explicitly that the
# entry is a sparse directory.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to make both implementations say "file is outside the sparse-checkout definition"?

  • The "file does not appear in the index" is not entirely correct for full indexes.
  • The "file is a sparse directory" has a bit too much internal information.

I wonder if using if (!path_in_sparse_checkout(name, &the_index)) somewhere would help here.

It's perhaps a bit tricky how this loop is structured, since it is doing a prefix match on a range of index entries.

In this test script, what happens with this example?

git sparse-checkout set deep/deeper1
git checkout-index -f -- deep/

Note that deep/deeper2/ is a sparse directory within the deep/ directory, but we would still want the other entries to succeed, silently ignoring the deep/deeper2/ entry. This would only be an error if the user ran git checkout-index -f -- deep/deeper2/, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's just comparing the names, not performing a prefix match (the first condition in this check makes sure the names are the same length):

if (ce_namelen(ce) != namelen ||
		    memcmp(ce->name, name, namelen))

The result is that, from your example, git checkout-index -f -- deep/ (or any path that doesn't refer to a file entry in the index) returns git checkout-index: <name> is not in the cache. It seems like this command expects exact names of cache entries (except for the subcommands using a pathspec, but they handle things differently anyway), in which case <name> is not in the cache makes some amount of sense? Agreed that it's misleading, though - would that be something that should be clarified in the command doc, or in this message itself?

As for checking whether something is in the sparse checkout, this doesn't actually block checking out files outside the sparse definition (you can successfully run git checkout-index -- folder1/a without any other repo setup). The issue is that sparse directories do appear as standalone cache entries, so without the S_ISSPARSEDIR check the entry would be passed along to checkout_entry (which then fails because it doesn't handle directories). Since there's no support checkout-index-ing directories otherwise, this probably needs to be blocked rather than worked around. Maybe the message should be the same as whenever anyone tries to checkout-index a directory in general? I.e., "lie" and say <name> is not in the cache, or some other error message that's the same between them?

Comment on lines +256 to +267
if (include_sparse)
die("git checkout-index: don't mix '--sparse' and explicit filenames");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why shouldn't these be combined? What if I want to poke at a file outside of my sparse-checkout? It certainly seems like git checkout-index --sparse -- deep/deeper2/a would be a good way to get that file on-disk.

Copy link
Author

@vdye vdye Sep 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I didn't include --sparse for individual files is because, as it is right now, individually-specified files will always expand the index if that specified file is not already in the index - I wasn't really sure what the "right" way to handle it was, though. On one hand, not requiring a --sparse flag to checkout-index individual files (even those outside the checkout definition) is the most consistent with existing behavior. On the other hand, it is kind of weird that we need a special --sparse option for files in sparse directories for --all, but not when you list out individual files.

What do you think would be better (especially based on how you've handled similar situations in other commands)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a plumbing command, let's be cautious and avoid the behavior break by requiring --sparse for individual files. (This is a change of opinion from my earlier comment.)

Add tests to cover basic `checkout-index` cases related to sparse checkouts:
files and folder, both inside and outside of sparse checkout definition. New
tests will serve as a baseline for expected behavior when integrating
`checkout-index` with the sparse index.

Of note is the test demonstrating the behavior of `checkout-index --all`;
creating files even outside the sparse checkout definition is somewhat
unintuitive, and will cause significant performance issues when run with a
sparse index. The goal of later changes will be to change this default
behavior and introduce a `--sparse` flag to enable populating files outside
the sparse definition.

Signed-off-by: Victoria Dye <vdye@github.com>
Change the default behavior of `checkout-index --all` for sparse checkouts
to no longer refresh files outside the sparse checkout definition. The
newly-added `--sparse` option, when used with `--all`, maintains the "old"
behavior and checks out files outside the sparse checkout definition.

Signed-off-by: Victoria Dye <vdye@github.com>
Add repository settings to allow usage of the sparse index. In order to prevent
unexpected errors when attempting to check out a sparse directory entry,
`checkout_file` directly checks whether a found entry is a sparse directory
and, if so, exits with an error. The test corresponding to this case now
verifies the error message, intentionally differing from the non-sparse index
scenarios.

Signed-off-by: Victoria Dye <vdye@github.com>
Update `checkout_all` to only run `ensure_full_index` when entries in
sparse directories are needed (i.e., `--sparse` is specified).

Signed-off-by: Victoria Dye <vdye@github.com>
Copy link
Collaborator

@derrickstolee derrickstolee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of my prior concerns were misconceptions about how git checkout-index works. Thanks for clearing them up!

`git checkout-index --all` is a subcommand of `git stash`, so it is helpful
to verify the performance of that particular usage.

Signed-off-by: Victoria Dye <vdye@github.com>
@vdye
Copy link
Author

vdye commented Sep 13, 2021

@derrickstolee + @ldennington I added one last change (performance test for git checkout-index -f --all), which had these results (tip of vfs-2.33.0 vs the tip of this branch):

Test                                                   ms/vfs-2.33.0       HEAD                  
-------------------------------------------------------------------------------------------------
2000.30: git checkout-index -f --all (full-v3)         76.79(9.89+64.59)   0.52(0.22+0.26) -99.3%
2000.31: git checkout-index -f --all (full-v4)         80.96(9.65+63.89)   0.52(0.21+0.26) -99.4%
2000.32: git checkout-index -f --all (sparse-v3)       89.43(9.83+64.99)   0.34(0.05+0.24) -99.6%
2000.33: git checkout-index -f --all (sparse-v4)       93.67(9.65+63.46)   0.34(0.05+0.25) -99.6%

The huge speedup seems to mostly come from having checkout-index --all respect the sparse checkout definition, but there is also improvement from full index to sparse index (which is nice to see)!

@derrickstolee
Copy link
Collaborator

@derrickstolee + @ldennington I added one last change (performance test for git checkout-index -f --all), which had these results (tip of vfs-2.33.0 vs the tip of this branch):

Test                                                   ms/vfs-2.33.0       HEAD                  
-------------------------------------------------------------------------------------------------
2000.30: git checkout-index -f --all (full-v3)         76.79(9.89+64.59)   0.52(0.22+0.26) -99.3%
2000.31: git checkout-index -f --all (full-v4)         80.96(9.65+63.89)   0.52(0.21+0.26) -99.4%
2000.32: git checkout-index -f --all (sparse-v3)       89.43(9.83+64.99)   0.34(0.05+0.24) -99.6%
2000.33: git checkout-index -f --all (sparse-v4)       93.67(9.65+63.46)   0.34(0.05+0.25) -99.6%

The huge speedup seems to mostly come from having checkout-index --all respect the sparse checkout definition, but there is also improvement from full index to sparse index (which is nice to see)!

Yeah! Those 75-90s times are definitely from writing too much data to the working tree. It's nice to see exactly why this is such an important change to include and why users would not like that old behavior!

@vdye vdye merged commit 18844c9 into microsoft:vfs-2.33.0 Sep 14, 2021
@vdye vdye deleted the sparse-index/checkout-index branch September 14, 2021 12:46
dscho pushed a commit that referenced this pull request Oct 30, 2021
derrickstolee pushed a commit that referenced this pull request Oct 30, 2021
derrickstolee pushed a commit that referenced this pull request Oct 31, 2021
derrickstolee pushed a commit that referenced this pull request Nov 4, 2021
derrickstolee pushed a commit that referenced this pull request Nov 4, 2021
derrickstolee pushed a commit that referenced this pull request Nov 10, 2021
derrickstolee pushed a commit that referenced this pull request Nov 15, 2021
ldennington pushed a commit to ldennington/git that referenced this pull request Jan 12, 2022
ldennington pushed a commit to ldennington/git that referenced this pull request Jan 19, 2022
ldennington pushed a commit to ldennington/git that referenced this pull request Jan 20, 2022
ldennington pushed a commit to ldennington/git that referenced this pull request Jan 20, 2022
ldennington pushed a commit to ldennington/git that referenced this pull request Jan 20, 2022
ldennington pushed a commit to ldennington/git that referenced this pull request Jan 20, 2022
ldennington pushed a commit to ldennington/git that referenced this pull request Jan 20, 2022
ldennington pushed a commit to ldennington/git that referenced this pull request Jan 20, 2022
ldennington pushed a commit to ldennington/git that referenced this pull request Jan 25, 2022
ldennington pushed a commit to ldennington/git that referenced this pull request Jan 25, 2022
dscho pushed a commit that referenced this pull request Feb 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants