Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse Index: integrate with reset #1048

Closed
wants to merge 8 commits into from
Closed

Sparse Index: integrate with reset #1048

wants to merge 8 commits into from

Conversation

vdye
Copy link

@vdye vdye commented Sep 29, 2021

This series integrates the sparse index with git reset and provides miscellaneous fixes and improvements to the command in sparse checkouts. This includes:

  1. tests added to t1092 and p2000 to establish the baseline functionality of the command
  2. repository settings to enable the sparse index with ensure_full_index guarding any code paths that break tests without other compatibility updates.
  3. modifications to remove or reduce the scope in which ensure_full_index must be called.

The sparse index updates are predicated on a fix originating from the microsoft/git fork [1], correcting how git reset --mixed handles resetting entries outside the sparse checkout definition. Additionally, a performance "bug" in next_cache_entry with sparse index is corrected, preventing repeatedly looping over already-searched entries.

The p2000 tests demonstrate a ~70% execution time reduction in git reset using a sparse index, and no change (within expected variability [2]) using a full index. Results summarized below [3, 4]:

Test                           base              [5/8]                 
-----------------------------------------------------------------------
git reset --hard (full-v3)     1.00(0.50+0.39)   0.97(0.50+0.37) -3.0% 
git reset --hard (full-v4)     1.00(0.51+0.38)   0.96(0.50+0.36) -4.0% 
git reset --hard (sparse-v3)   1.68(1.17+0.39)   1.37(0.91+0.35) -18.5%
git reset --hard (sparse-v4)   1.70(1.18+0.40)   1.41(0.94+0.35) -17.1%

Test                           base              [6/8]   
-----------------------------------------------------------------------
git reset --hard (full-v3)     1.00(0.50+0.39)   0.94(0.48+0.34) -6.0% 
git reset --hard (full-v4)     1.00(0.51+0.38)   0.95(0.51+0.34) -5.0% 
git reset --hard (sparse-v3)   1.68(1.17+0.39)   0.46(0.05+0.29) -72.6%
git reset --hard (sparse-v4)   1.70(1.18+0.40)   0.46(0.06+0.29) -72.9%

Test                               base              [7/8]
---------------------------------------------------------------------------
git reset (full-v3)                0.77(0.27+0.37)   0.72(0.26+0.32) -6.5%
git reset (full-v4)                0.75(0.27+0.34)   0.73(0.26+0.32) -2.7%
git reset (sparse-v3)              1.44(0.96+0.36)   0.43(0.04+0.96) -70.1%
git reset (sparse-v4)              1.46(0.97+0.36)   0.43(0.05+0.79) -70.5%
git reset -- missing (full-v3)     0.72(0.26+0.32)   0.69(0.26+0.30) -4.2%
git reset -- missing (full-v4)     0.74(0.28+0.33)   0.71(0.27+0.32) -4.1% 
git reset -- missing (sparse-v3)   1.45(0.97+0.35)   0.81(0.42+0.90) -44.1%
git reset -- missing (sparse-v4)   1.41(0.94+0.34)   0.79(0.42+0.76) -44.0%

Test                               base              [8/8]            
---------------------------------------------------------------------------
git reset -- missing (full-v3)     0.72(0.26+0.32)   0.73(0.26+0.33) +1.4% 
git reset -- missing (full-v4)     0.74(0.28+0.33)   0.74(0.27+0.32) +0.0% 
git reset -- missing (sparse-v3)   1.45(0.97+0.35)   0.43(0.05+0.80) -70.3%
git reset -- missing (sparse-v4)   1.41(0.94+0.34)   0.44(0.05+0.76) -68.8%

Changes since V1

  • Add --force-full-index option to update-index. The option is used circumvent changing command_requires_full_index from its default value - right now this is effectively a no-op, but will change once update-index is integrated with sparse index. By using this option in the t1092 expand/collapse test, the command used to test will not need to be updated with subsequent sparse index integrations.
  • Update implementation of mixed reset for entries outside sparse checkout definition. The condition in which a file should be checked out before index reset is simplified to "if it has skip-worktree enabled and a reset would change the file, check it out".
    • After checking the behavior of update_index_from_diff with renames, found that the diff used by reset does not produce diff queue entries with different pathnames for one and two. Because of this, and that nothing in the implementation seems to rely on identical path names, no BUG check is added.
  • Correct a bug in the sparse index is not expanded tests in t1092 where failure of a git reset --mixed test was not being reported. Test now verifies an appropriate scenario with corrected failure-checking.

Changes since V2

  • Replace patch adding checkouts for git reset --mixed with sparse checkout with preserving the skip-worktree flag (including a new test for git reset --mixed and update to t1092 - checkout and reset (mixed))
  • Move rename of is_missing into its own patch
  • Further extend t1092 tests and remove unnecessary commands/tests where possible
  • Refine logic determining which pathspecs require ensure_full_index in git reset --mixed, add related ensure_not_expanded tests
  • Add index_search_mode enum to index_name_stage_pos
  • Clean up variable usage & remove unnecessary subtree_path in prime_cache_tree_rec
  • Update cover letter performance data
  • More thoroughly explain changes in each commit message

Changes since V3

  • Replace git update-index --force-full-index with git reset update-folder1 -- folder1/a, remove introduction of new --force-full-index option entirely, and add comment clarifying the intent of sparse-index is expanded and converted back test
  • Fix authorship on reset: preserve skip-worktree bit in mixed reset (current patch fully replaces original patch, but metadata of the original wasn't properly replaced)

Changes since V4

  • Update t1092 test 'checkout and reset (mixed)' to explicitly verify differences between sparse and full checkouts

Changes since V5

  • Update t1092 test 'reset with wildcard pathspec' with more cases and better checks
  • Add "special case" wildcard pathspec check when determining whether to expand the index (avoids double-loop over pathspecs & index entries)

Changes since V6

  • Update t1092 test 'reset with pathspecs outside sparse definition' to verify index contents
  • Simplify "special case" verification for wildcard pathspecs (added in V5)

Thanks!
-Victoria

[1] microsoft/git@6b8a074
[2] https://lore.kernel.org/git/8b9fe3f8-f0e3-4567-b20b-17c92bd1a5c5@github.com/
[3] If a test and/or commit is not mentioned, there is no significant change to performance
[4] Pathspec "does-not-exist" is changed to "missing" to save space in performance report

cc: stolee@gmail.com
cc: gitster@pobox.com
cc: newren@gmail.com
cc: Taylor Blau me@ttaylorr.com
cc: Bagas Sanjaya bagasdotme@gmail.com
cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: Phillip Wood phillip.wood123@gmail.com
cc: Emily Shaffer emilyshaffer@google.com

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 29, 2021

Welcome to GitGitGadget

Hi @vdye, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that your Pull Request has a good description, as it will be used as cover letter.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "revisions:" to state which subsystem the change is about, and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the Libera Chat IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Libera Chat. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@derrickstolee
Copy link

/allow

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 29, 2021

User vdye is now allowed to use GitGitGadget.

WARNING: vdye has no public email address set on GitHub

Copy link

@derrickstolee derrickstolee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pausing review for meetings.

t/perf/p2000-sparse-operations.sh Show resolved Hide resolved
cache-tree.c Outdated Show resolved Hide resolved
cache-tree.c Outdated Show resolved Hide resolved
builtin/reset.c Outdated Show resolved Hide resolved
t/t1092-sparse-checkout-compatibility.sh Outdated Show resolved Hide resolved
@derrickstolee
Copy link

@vdye: have you considered taking only the git reset changes for your first series, just as a way to try it out? That will also improve the time-to-merge. From there, you could make a case for checkout-index and update-index as being necessary for git stash, so maybe they get grouped with those changes instead.

Just a thought. If you're comfortable going with a big series right away, then ignore me.

@derrickstolee derrickstolee changed the title Sparse index: reset, checkout-index, and update-index Sparse index: reset Sep 30, 2021
@derrickstolee
Copy link

I changed the title so you didn't forget (it becomes the subject of your cover letter).

@vdye vdye changed the title Sparse index: reset Sparse index integrations: reset Sep 30, 2021
@vdye vdye changed the title Sparse index integrations: reset Sparse Index: integrate with reset Sep 30, 2021
@vdye
Copy link
Author

vdye commented Sep 30, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 30, 2021

Submitted as pull.1048.git.1633013461.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-1048/vdye/vdye/sparse-index-part1-v1

To fetch this version to local tag pr-1048/vdye/vdye/sparse-index-part1-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-1048/vdye/vdye/sparse-index-part1-v1

@@ -459,9 +459,7 @@ test_expect_failure 'blame with pathspec outside sparse definition' '
test_all_match git blame deep/deeper2/deepest/a
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Sep 30, 2021 at 02:50:56PM +0000, Victoria Dye via GitGitGadget wrote:
> From: Victoria Dye <vdye@github.com>
>
> In anticipation of multiple commands being fully integrated with sparse
> index, update the test for index expand and collapse for non-sparse index
> integrated commands to use `mv`.
>
> Signed-off-by: Victoria Dye <vdye@github.com>
> ---
>  t/t1092-sparse-checkout-compatibility.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index c5977152661..aed8683e629 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -642,7 +642,7 @@ test_expect_success 'sparse-index is expanded and converted back' '
>  	init_repos &&
>
>  	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> -		git -C sparse-index -c core.fsmonitor="" reset --hard &&
> +		git -C sparse-index -c core.fsmonitor="" mv a b &&

Double-checking my understanding as somebody who is not so familiar with
t1092: this test is to ensure that commands which don't yet understand
the sparse index can temporarily expand it in order to do their work?

If so, makes sense to me. And renaming 'a' to 'b' is arbitrary and fine
to do since we end up recreating the sparse-index repository each time
via init_repos.

Looks good to me.

Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Victoria Dye wrote (reply to this):

Taylor Blau wrote:
> On Thu, Sep 30, 2021 at 02:50:56PM +0000, Victoria Dye via GitGitGadget wrote:
>> From: Victoria Dye <vdye@github.com>
>>
>> In anticipation of multiple commands being fully integrated with sparse
>> index, update the test for index expand and collapse for non-sparse index
>> integrated commands to use `mv`.
>>
>> Signed-off-by: Victoria Dye <vdye@github.com>
>> ---
>>  t/t1092-sparse-checkout-compatibility.sh | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
>> index c5977152661..aed8683e629 100755
>> --- a/t/t1092-sparse-checkout-compatibility.sh
>> +++ b/t/t1092-sparse-checkout-compatibility.sh
>> @@ -642,7 +642,7 @@ test_expect_success 'sparse-index is expanded and converted back' '
>>  	init_repos &&
>>
>>  	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
>> -		git -C sparse-index -c core.fsmonitor="" reset --hard &&
>> +		git -C sparse-index -c core.fsmonitor="" mv a b &&
> 
> Double-checking my understanding as somebody who is not so familiar with
> t1092: this test is to ensure that commands which don't yet understand
> the sparse index can temporarily expand it in order to do their work?

Exactly - if a command doesn't explicitly enable use of the sparse index by
setting `command_requires_full_index` to 0, the index is expanded if/when it
is first read during the command's execution and collapsed if/when it is
written to disk. This test makes sure that mechanism works as intended.

-Victoria



Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Victoria Dye <vdye@github.com> writes:

> Taylor Blau wrote:
>> On Thu, Sep 30, 2021 at 02:50:56PM +0000, Victoria Dye via GitGitGadget wrote:
>>> From: Victoria Dye <vdye@github.com>
>>>
>>> In anticipation of multiple commands being fully integrated with sparse
>>> index, update the test for index expand and collapse for non-sparse index
>>> integrated commands to use `mv`.
>>> ...
>>>  	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
>>> -		git -C sparse-index -c core.fsmonitor="" reset --hard &&
>>> +		git -C sparse-index -c core.fsmonitor="" mv a b &&
>> 
>> Double-checking my understanding as somebody who is not so familiar with
>> t1092: this test is to ensure that commands which don't yet understand
>> the sparse index can temporarily expand it in order to do their work?
>
> Exactly - if a command doesn't explicitly enable use of the sparse index by
> setting `command_requires_full_index` to 0, the index is expanded if/when it
> is first read during the command's execution and collapsed if/when it is
> written to disk. This test makes sure that mechanism works as intended.

Sorry, I do not quite follow.  

So is this "before this series of patches, 'reset --hard' can be
used to as a sample of a command that expands and then collapses,
but because it no longer is a good sample of a command so we replace
it with 'mv a b'"?  Do we need to update this further when "mv a b"
learns to expand and then collapse?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Victoria Dye wrote (reply to this):

Junio C Hamano wrote:
> Victoria Dye <vdye@github.com> writes:
> 
>> Taylor Blau wrote:
>>> On Thu, Sep 30, 2021 at 02:50:56PM +0000, Victoria Dye via GitGitGadget wrote:
>>>> From: Victoria Dye <vdye@github.com>
>>>>
>>>> In anticipation of multiple commands being fully integrated with sparse
>>>> index, update the test for index expand and collapse for non-sparse index
>>>> integrated commands to use `mv`.
>>>> ...
>>>>  	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
>>>> -		git -C sparse-index -c core.fsmonitor="" reset --hard &&
>>>> +		git -C sparse-index -c core.fsmonitor="" mv a b &&
>>>
>>> Double-checking my understanding as somebody who is not so familiar with
>>> t1092: this test is to ensure that commands which don't yet understand
>>> the sparse index can temporarily expand it in order to do their work?
>>
>> Exactly - if a command doesn't explicitly enable use of the sparse index by
>> setting `command_requires_full_index` to 0, the index is expanded if/when it
>> is first read during the command's execution and collapsed if/when it is
>> written to disk. This test makes sure that mechanism works as intended.
> 
> Sorry, I do not quite follow.  
> 
> So is this "before this series of patches, 'reset --hard' can be
> used to as a sample of a command that expands and then collapses,
> but because it no longer is a good sample of a command so we replace
> it with 'mv a b'"?

Yes, because this series enables sparse index integration in `git reset`,
the test no longer applies to that command (but it does apply to `git mv`).

> Do we need to update this further when "mv a b"
> learns to expand and then collapse?

Unfortunately, yes. `git mv` was picked more-or-less at random from the set
of commands that read the index and don't already have sparse index
integrations (excluding those I know are planned for sparse index
integration in the near future). If `git mv` were to be updated to disable
`command_requires_full_index`, the command in the test would need to change
again.

For what it's worth, I do think the test itself is valuable, since it makes
sure a command's capability to use the sparse index is always the result of
an intentional update to (and review of) the code.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Victoria Dye <vdye@github.com> writes:

>> Do we need to update this further when "mv a b"
>> learns to expand and then collapse?
>
> Unfortunately, yes. `git mv` was picked more-or-less at random from the set
> of commands that read the index and don't already have sparse index
> integrations (excluding those I know are planned for sparse index
> integration in the near future). If `git mv` were to be updated to disable
> `command_requires_full_index`, the command in the test would need to change
> again.
>
> For what it's worth, I do think the test itself is valuable, since it makes
> sure a command's capability to use the sparse index is always the result of
> an intentional update to (and review of) the code.

Oh, of course.  

I was actually wondering if it woudl be a good idea to leave a
command that will never be "converted" so that we can keep using it
for testing.

Perhaps a new option that is invented exactly for the purpose added
to a plumbing e.g. "git update-index --expand-collapse"?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Victoria Dye wrote (reply to this):

Junio C Hamano wrote:
> Victoria Dye <vdye@github.com> writes:
> 
>>> Do we need to update this further when "mv a b"
>>> learns to expand and then collapse?
>>
>> Unfortunately, yes. `git mv` was picked more-or-less at random from the set
>> of commands that read the index and don't already have sparse index
>> integrations (excluding those I know are planned for sparse index
>> integration in the near future). If `git mv` were to be updated to disable
>> `command_requires_full_index`, the command in the test would need to change
>> again.
>>
>> For what it's worth, I do think the test itself is valuable, since it makes
>> sure a command's capability to use the sparse index is always the result of
>> an intentional update to (and review of) the code.
> 
> Oh, of course.  
> 
> I was actually wondering if it woudl be a good idea to leave a
> command that will never be "converted" so that we can keep using it
> for testing.
> 
> Perhaps a new option that is invented exactly for the purpose added
> to a plumbing e.g. "git update-index --expand-collapse"?
> 

That sounds good to me! I'll add an `update-index --expand-collapse`
implementation and update the test in v2 of this series.

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 30, 2021

User Taylor Blau <me@ttaylorr.com> has been added to the cc: list.

@@ -459,9 +459,7 @@ test_expect_failure 'blame with pathspec outside sparse definition' '
test_all_match git blame deep/deeper2/deepest/a
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Bagas Sanjaya wrote (reply to this):

On 30/09/21 21.50, Victoria Dye via GitGitGadget wrote:
> From: Victoria Dye <vdye@github.com>
> 
> In anticipation of multiple commands being fully integrated with sparse
> index, update the test for index expand and collapse for non-sparse index
> integrated commands to use `mv`.
> 

We can say "use git sparse-index mv instead of git sparse-index reset".

Why is mv used for this case?

-- 
An old man doll... just what I always wanted! - Clara

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 1, 2021

User Bagas Sanjaya <bagasdotme@gmail.com> has been added to the cc: list.

@@ -25,6 +25,8 @@
#include "cache-tree.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Victoria Dye wrote (reply to this):

Victoria Dye via GitGitGadget wrote:
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 0b6ff0de17d..c9b9ef4992c 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -801,14 +801,25 @@ test_expect_success 'sparse-index is not expanded' '
>  	for ref in update-deep update-folder1 update-folder2 update-deep
>  	do
>  		echo >>sparse-index/README.md &&
> +		ensure_not_expanded reset --mixed $ref
>  		ensure_not_expanded reset --hard $ref || return 1
>  	done &&
This is a bug - it's missing `&&` at the end of the line (and adding it will
cause the test to fail). The index is expanded if a mixed reset modifies an
entry outside the sparse cone, so I'll update the test in V2 to verify reset
between two refs with only in-cone files changed between them. 

@vdye
Copy link
Author

vdye commented Oct 5, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 5, 2021

Submitted as pull.1048.v2.git.1633440057.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-1048/vdye/vdye/sparse-index-part1-v2

To fetch this version to local tag pr-1048/vdye/vdye/sparse-index-part1-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-1048/vdye/vdye/sparse-index-part1-v2

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 5, 2021

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Oct 05 2021, Victoria Dye via GitGitGadget wrote:

> The p2000 tests demonstrate an overall ~70% execution time reduction across
> all tested usages of git reset using a sparse index:

[...]

> Test                                               before   after       
> ------------------------------------------------------------------------
> 2000.22: git reset (full-v3)                       0.48     0.51 +6.3% 
> 2000.23: git reset (full-v4)                       0.47     0.50 +6.4% 
> 2000.24: git reset (sparse-v3)                     0.93     0.30 -67.7%
> 2000.25: git reset (sparse-v4)                     0.94     0.29 -69.1%
> 2000.26: git reset --hard (full-v3)                0.69     0.68 -1.4% 
> 2000.27: git reset --hard (full-v4)                0.75     0.68 -9.3% 
> 2000.28: git reset --hard (sparse-v3)              1.29     0.34 -73.6%
> 2000.29: git reset --hard (sparse-v4)              1.31     0.34 -74.0%
> 2000.30: git reset -- does-not-exist (full-v3)     0.54     0.51 -5.6% 
> 2000.31: git reset -- does-not-exist (full-v4)     0.54     0.52 -3.7% 
> 2000.32: git reset -- does-not-exist (sparse-v3)   1.02     0.31 -69.6%
> 2000.33: git reset -- does-not-exist (sparse-v4)   1.07     0.30 -72.0%

This series looks like it really improves some cases, but at the cost of
that -70% improvement we've got a ~5% regression in 7/7 for the full-v3
--does-not-exist cases. As noted in your 7/7 (which improves all other
cases):

    (full-v3)     0.79(0.38+0.30)   0.91(0.43+0.34) +15.2%
    (full-v4)     0.80(0.38+0.29)   0.85(0.40+0.35) +6.2%

Which b.t.w. I had to read a couple of times before realizig that its
quoted:
    
    Test          before            after
    ------------------------------------------------------
    (full-v3)     0.79(0.38+0.30)   0.91(0.43+0.34) +15.2%
    (full-v4)     0.80(0.38+0.29)   0.85(0.40+0.35) +6.2%
    (sparse-v3)   0.76(0.43+0.69)   0.44(0.08+0.67) -42.1%
    (sparse-v4)   0.71(0.40+0.65)   0.41(0.09+0.65) -42.3%

Is just the does-not-exist part of this bigger table, are the other
cases all ~0% changed, or ...?
    
Anyway, until 7/7 the v3 had been sped up, but a ~10% increase landed us
at ~+6%, and full-v4 had been ~0% but got ~6% worse?

Is there a way we can get those improvements in performance without
regressing on the full-* cases?

Also, these tests only check sparse performance, but isn't some of the
code being modified here general enough to not be used exclusively by
the sparse mode, full checkout cone or not?

It looks fairly easy to extend p2000-sparse-operations.sh to run the
same tests but just pretend that it's running in a "full" mode without
actually setting up anyting sparse-specific (the meat of those tests
just runs "git status" etc. How does that look with this series?

Since only the CL and 7/7 quote numbers from p2000, and 7/7 is at least
a partial regression, it would be nice to have perf numbers on each
commit (if only as a one-off for ML consumption). Are there any more
improvements followed by regressions followed by improvements as we go
along? Would be useful to know...

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 5, 2021

User Ævar Arnfjörð Bjarmason <avarab@gmail.com> has been added to the cc: list.

builtin/reset.c Show resolved Hide resolved
@gitgitgadget
Copy link

gitgitgadget bot commented Oct 5, 2021

On the Git mailing list, Victoria Dye wrote (reply to this):

Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Oct 05 2021, Victoria Dye via GitGitGadget wrote:
> 
>> The p2000 tests demonstrate an overall ~70% execution time reduction across
>> all tested usages of git reset using a sparse index:
> 
> [...]
> 
>> Test                                               before   after       
>> ------------------------------------------------------------------------
>> 2000.22: git reset (full-v3)                       0.48     0.51 +6.3% 
>> 2000.23: git reset (full-v4)                       0.47     0.50 +6.4% 
>> 2000.24: git reset (sparse-v3)                     0.93     0.30 -67.7%
>> 2000.25: git reset (sparse-v4)                     0.94     0.29 -69.1%
>> 2000.26: git reset --hard (full-v3)                0.69     0.68 -1.4% 
>> 2000.27: git reset --hard (full-v4)                0.75     0.68 -9.3% 
>> 2000.28: git reset --hard (sparse-v3)              1.29     0.34 -73.6%
>> 2000.29: git reset --hard (sparse-v4)              1.31     0.34 -74.0%
>> 2000.30: git reset -- does-not-exist (full-v3)     0.54     0.51 -5.6% 
>> 2000.31: git reset -- does-not-exist (full-v4)     0.54     0.52 -3.7% 
>> 2000.32: git reset -- does-not-exist (sparse-v3)   1.02     0.31 -69.6%
>> 2000.33: git reset -- does-not-exist (sparse-v4)   1.07     0.30 -72.0%
> 
> This series looks like it really improves some cases, but at the cost of
> that -70% improvement we've got a ~5% regression in 7/7 for the full-v3
> --does-not-exist cases. As noted in your 7/7 (which improves all other
> cases):
> 
>     (full-v3)     0.79(0.38+0.30)   0.91(0.43+0.34) +15.2%
>     (full-v4)     0.80(0.38+0.29)   0.85(0.40+0.35) +6.2%
> 

New performance numbers at the end - I think I have an explanation for this.

> Which b.t.w. I had to read a couple of times before realizig that its
> quoted:
>     
>     Test          before            after
>     ------------------------------------------------------
>     (full-v3)     0.79(0.38+0.30)   0.91(0.43+0.34) +15.2%
>     (full-v4)     0.80(0.38+0.29)   0.85(0.40+0.35) +6.2%
>     (sparse-v3)   0.76(0.43+0.69)   0.44(0.08+0.67) -42.1%
>     (sparse-v4)   0.71(0.40+0.65)   0.41(0.09+0.65) -42.3%
> 
> Is just the does-not-exist part of this bigger table, are the other
> cases all ~0% changed, or ...?
>     

These numbers were for the `git reset -- does-not-exist` case only. If I end
up needing to send a V3, though, I'll probably remove the performance
numbers from 7/7 altogether - looking at them now, they make the commit
message somewhat cluttered. That said, performance numbers *are* helpful for
reviews on the mailing list, so I'd keep the information in the cover
letter at the very least.

> Anyway, until 7/7 the v3 had been sped up, but a ~10% increase landed us
> at ~+6%, and full-v4 had been ~0% but got ~6% worse?
> 
> Is there a way we can get those improvements in performance without
> regressing on the full-* cases?
> 
> Also, these tests only check sparse performance, but isn't some of the
> code being modified here general enough to not be used exclusively by
> the sparse mode, full checkout cone or not?
> 
> It looks fairly easy to extend p2000-sparse-operations.sh to run the
> same tests but just pretend that it's running in a "full" mode without
> actually setting up anyting sparse-specific (the meat of those tests
> just runs "git status" etc. How does that look with this series?
> 

I updated `p2000` locally to do this but the setup was substantially slower
for the full checkout, to the point that it was infeasible to run the
complete test for all relevant commits. Looking at the changes in this
series, nothing appears to affect the full checkout case differently than
the sparse checkout/full index case, so I'm fairly confident there won't be
a regression specific to full checkouts.

> Since only the CL and 7/7 quote numbers from p2000, and 7/7 is at least
> a partial regression, it would be nice to have perf numbers on each
> commit (if only as a one-off for ML consumption). Are there any more
> improvements followed by regressions followed by improvements as we go
> along? Would be useful to know...
> 

I don't think any of the apparent slowdowns seen in these results represent
real regressions. After re-running the performance tests, I saw variability
of up to ~20% execution time across changes with commands that should see no
effect on their execution time (e.g. sparse-v* from 1/7 to 4/7).
Additionally, I saw different increases & decreases each time for each
end-to-end run of the tests. The most reliable, noticeable changes across
the test executions were:

1. When each variant of `git reset` was integrated with sparse index, a
   65-75% execution time reduction in relevant sparse-v* tests.
2. `git reset -- does-not-exist` slower than `git reset` in 6/7,
   then matching its speed after 7/7.
3. As of 7/7, full-v* to sparse-v* showing a 50% execution time reduction.

My guess is that the variability comes from general "uncontrolled" factors
when running the tests (e.g., background processes on my system). The good
news is, when the tests are re-run with more trials (and the recent bugfix
to `t/perf/perf-lib.sh` [1]), the execution times look a lot less worrisome 
(apologies for the table width, but I'd like to err on the side of providing
more complete information):

Test                                               base              [1/7]                    [4/7]                    [5/7]                    [6/7]                    [7/7]            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2000.22: git reset (full-v3)                       0.44(0.16+0.19)   0.44(0.17+0.18) +0.0%    0.44(0.17+0.19) +0.0%    0.45(0.17+0.18) +2.3%    0.44(0.17+0.19) +0.0%    0.45(0.17+0.18) +2.3% 
2000.23: git reset (full-v4)                       0.43(0.16+0.18)   0.43(0.16+0.19) +0.0%    0.45(0.17+0.18) +4.7%    0.44(0.17+0.18) +2.3%    0.44(0.17+0.18) +2.3%    0.44(0.18+0.18) +2.3% 
2000.24: git reset (sparse-v3)                     0.82(0.54+0.19)   0.84(0.56+0.19) +2.4%    0.81(0.54+0.19) -1.2%    0.88(0.60+0.19) +7.3%    0.27(0.03+0.45) -67.1%   0.27(0.03+0.47) -67.1%
2000.25: git reset (sparse-v4)                     0.82(0.55+0.18)   0.82(0.53+0.20) +0.0%    0.83(0.55+0.19) +1.2%    0.82(0.54+0.19) +0.0%    0.27(0.03+0.50) -67.1%   0.27(0.03+0.48) -67.1%
2000.26: git reset --hard (full-v3)                0.71(0.38+0.24)   0.69(0.37+0.23) -2.8%    0.70(0.37+0.24) -1.4%    0.78(0.41+0.27) +9.9%    0.71(0.38+0.25) +0.0%    0.70(0.37+0.23) -1.4% 
2000.27: git reset --hard (full-v4)                0.71(0.38+0.23)   0.77(0.42+0.25) +8.5%    0.76(0.41+0.26) +7.0%    0.72(0.40+0.24) +1.4%    0.68(0.37+0.23) -4.2%    0.67(0.36+0.22) -5.6%
2000.28: git reset --hard (sparse-v3)              1.29(0.93+0.26)   1.33(0.95+0.27) +3.1%    1.11(0.76+0.25) -14.0%   0.38(0.05+0.25) -70.5%   0.36(0.04+0.22) -72.1%   0.34(0.04+0.21) -73.6%
2000.29: git reset --hard (sparse-v4)              1.17(0.84+0.24)   1.10(0.79+0.23) -6.0%    1.01(0.69+0.24) -13.7%   0.42(0.05+0.26) -64.1%   0.39(0.05+0.25) -66.7%   0.38(0.05+0.23) -67.5%
2000.30: git reset -- does-not-exist (full-v3)     0.50(0.19+0.20)   0.50(0.19+0.20) +0.0%    0.53(0.21+0.22) +6.0%    0.47(0.18+0.19) -6.0%    0.45(0.18+0.18) -10.0%   0.45(0.18+0.19) -10.0%
2000.31: git reset -- does-not-exist (full-v4)     0.45(0.18+0.18)   0.46(0.18+0.19) +2.2%    0.47(0.19+0.19) +4.4%    0.45(0.18+0.19) +0.0%    0.45(0.18+0.18) +0.0%    0.45(0.18+0.18) +0.0% 
2000.32: git reset -- does-not-exist (sparse-v3)   1.01(0.70+0.21)   0.91(0.62+0.20) -9.9%    0.93(0.64+0.20) -7.9%    0.89(0.61+0.20) -11.9%   0.48(0.23+0.46) -52.5%   0.27(0.03+0.49) -73.3%
2000.33: git reset -- does-not-exist (sparse-v4)   0.99(0.67+0.21)   1.02(0.70+0.22) +3.0%    1.04(0.70+0.22) +5.1%    0.83(0.55+0.19) -16.2%   0.48(0.24+0.48) -51.5%   0.27(0.03+0.49) -72.7%

Note that some commits in this series are not included because they don't
touch any code used by `git reset`.

[1] https://lore.kernel.org/git/pull.1051.git.1633386543759.gitgitgadget@gmail.com/

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2021

On the Git mailing list, Elijah Newren wrote (reply to this):

Hi Victoria,

On Tue, Oct 5, 2021 at 6:20 AM Victoria Dye via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This series integrates the sparse index with git reset and provides
> miscellaneous fixes and improvements to the command in sparse checkouts.
> This includes:
>
>  1. tests added to t1092 and p2000 to establish the baseline functionality
>     of the command
>  2. repository settings to enable the sparse index with ensure_full_index
>     guarding any code paths that break tests without other compatibility
>     updates.
>  3. modifications to remove or reduce the scope in which ensure_full_index
>     must be called.
>
> The sparse index updates are predicated on a fix originating from the
> microsoft/git fork [1], correcting how git reset --mixed handles resetting
> entries outside the sparse checkout definition. Additionally, a performance
> "bug" in next_cache_entry with sparse index is corrected, preventing
> repeatedly looping over already-searched entries.
>
> The p2000 tests demonstrate an overall ~70% execution time reduction across
> all tested usages of git reset using a sparse index:
>
> Test                                               before   after
> ------------------------------------------------------------------------
> 2000.22: git reset (full-v3)                       0.48     0.51 +6.3%
> 2000.23: git reset (full-v4)                       0.47     0.50 +6.4%
> 2000.24: git reset (sparse-v3)                     0.93     0.30 -67.7%
> 2000.25: git reset (sparse-v4)                     0.94     0.29 -69.1%
> 2000.26: git reset --hard (full-v3)                0.69     0.68 -1.4%
> 2000.27: git reset --hard (full-v4)                0.75     0.68 -9.3%
> 2000.28: git reset --hard (sparse-v3)              1.29     0.34 -73.6%
> 2000.29: git reset --hard (sparse-v4)              1.31     0.34 -74.0%
> 2000.30: git reset -- does-not-exist (full-v3)     0.54     0.51 -5.6%
> 2000.31: git reset -- does-not-exist (full-v4)     0.54     0.52 -3.7%
> 2000.32: git reset -- does-not-exist (sparse-v3)   1.02     0.31 -69.6%
> 2000.33: git reset -- does-not-exist (sparse-v4)   1.07     0.30 -72.0%
>
>
>
> Changes since V1
> ================
>
>  * Add --force-full-index option to update-index. The option is used
>    circumvent changing command_requires_full_index from its default value -
>    right now this is effectively a no-op, but will change once update-index
>    is integrated with sparse index. By using this option in the t1092
>    expand/collapse test, the command used to test will not need to be
>    updated with subsequent sparse index integrations.
>  * Update implementation of mixed reset for entries outside sparse checkout
>    definition. The condition in which a file should be checked out before
>    index reset is simplified to "if it has skip-worktree enabled and a reset
>    would change the file, check it out".
>    * After checking the behavior of update_index_from_diff with renames,
>      found that the diff used by reset does not produce diff queue entries
>      with different pathnames for one and two. Because of this, and that
>      nothing in the implementation seems to rely on identical path names, no
>      BUG check is added.
>  * Correct a bug in the sparse index is not expanded tests in t1092 where
>    failure of a git reset --mixed test was not being reported. Test now
>    verifies an appropriate scenario with corrected failure-checking.

I read over the first six patches.  I tried to read over the seventh,
but I've never figured out cache_bottom for some reason and I did
nothing beyond spot checking when Stolee touched that area either.

Anyway, I had lots of little comments, tweaks to the way to fix the
inconsistency in patch 1, various questions, etc.  It probably adds up
to a lot, but it's all small fixable stuff; overall it looks like you
(and Kevin) are making a solid contribution on the sparse-checkout
stuff; I look forward to reading the next round.

Rename and invert value of `is_missing` to `is_in_reset_tree` to make the
variable more descriptive of what it represents.

Signed-off-by: Victoria Dye <vdye@github.com>
@vdye
Copy link
Author

vdye commented Oct 7, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 7, 2021

Submitted as pull.1048.v3.git.1633641339.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-1048/vdye/vdye/sparse-index-part1-v3

To fetch this version to local tag pr-1048/vdye/vdye/sparse-index-part1-v3:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-1048/vdye/vdye/sparse-index-part1-v3

@@ -24,6 +24,7 @@ SYNOPSIS
[--[no-]fsmonitor]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Bagas Sanjaya wrote (reply to this):

On 08/10/21 04.15, Victoria Dye via GitGitGadget wrote:
> From: Victoria Dye <vdye@github.com>
> 
> Add a new `--force-full-index` option to `git update-index`, which skips
> explicitly setting `command_requires_full_index`. This option, intended for
> use in internal testing purposes only, lets `git update-index` run as a
> command without sparse index compatibility implemented, even after it
> receives updates to otherwise use the sparse index.
> 
> The specific test `--force-full-index` is intended for - `t1092 -
> sparse-index is expanded and converted back` - verifies index compatibility
> in commands that do not change the default (enabled)
> `command_requires_full_index` repo setting. In the past, the test used `git
> reset`. However, as `reset` and other commands are integrated with the
> sparse index, the command used in the test would need to keep changing.
> Conversely, the `--force-full-index` option makes `git update-index` behave
> like a not-yet-sparse-aware command, and can be used in the test
> indefinitely without interfering with future sparse index integrations.
> 
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: Victoria Dye <vdye@github.com>

Grammar looks OK.

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>

-- 
An old man doll... just what I always wanted! - Clara

Disable `command_requires_full_index` repo setting and add
`ensure_full_index` guards around code paths that cannot yet use sparse
directory index entries. `reset --soft` does not modify the index, so no
compatibility changes are needed for it to function without expanding the
index. For all other reset modes (`--mixed`, `--hard`, `--keep`, `--merge`),
the full index is expanded to prevent cache tree corruption and invalid
variable accesses.

Additionally, the `read_cache()` check verifying an uncorrupted index is
moved after argument parsing and preparing the repo settings. The index is
not used by the preceding argument handling, but `read_cache()` must be run
*after* enabling sparse index for the command (so that the index is not
expanded unnecessarily) and *before* using the index for reset (so that it
is verified as uncorrupted).

Signed-off-by: Victoria Dye <vdye@github.com>
Remove `ensure_full_index` guard on `prime_cache_tree` and update
`prime_cache_tree_rec` to correctly reconstruct sparse directory entries in
the cache tree. While processing a tree's entries, `prime_cache_tree_rec`
must determine whether a directory entry is sparse or not by searching for
it in the index (*without* expanding the index). If a matching sparse
directory index entry is found, no subtrees are added to the cache tree
entry and the entry count is set to 1 (representing the sparse directory
itself). Otherwise, the tree is assumed to not be sparse and its subtrees
are recursively added to the cache tree.

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Victoria Dye <vdye@github.com>
@gitgitgadget
Copy link

gitgitgadget bot commented Nov 29, 2021

There is an issue in commit 1b3631a:
Commit checks stopped - the message is too short

Remove the `ensure_full_index` guard on `read_from_tree` and update `git
reset --mixed` to ensure it can use sparse directory index entries wherever
possible. Sparse directory entries are reset using `diff_tree_oid`, which
requires `change` and `add_remove` functions to process the internal
contents of the sparse directory. The `recursive` diff option handles cases
in which `reset --mixed` must diff/merge files that are nested multiple
levels deep in a sparse directory.

The use of pathspecs with `git reset --mixed` introduces scenarios in which
internal contents of sparse directories may be matched by the pathspec. In
order to reset *all* files in the repo that may match the pathspec, the
following conditions on the pathspec require index expansion before
performing the reset:

* "magic" pathspecs
* wildcard pathspecs that do not match only in-cone files or entire sparse
  directories
* literal pathspecs matching something outside the sparse checkout
  definition

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Victoria Dye <vdye@github.com>
To find the first non-unpacked cache entry, `next_cache_entry` iterates
through index, starting at `cache_bottom`. The performance of this in full
indexes is helped by `cache_bottom` advancing with each invocation of
`mark_ce_used` (called by `unpack_index_entry`). However, the presence of
sparse directories can prevent the `cache_bottom` from advancing in a sparse
index case, effectively forcing `next_cache_entry` to search from the
beginning of the index each time it is called.

The `cache_bottom` must be preserved for the sparse index (see 17a1bb5
(unpack-trees: preserve cache_bottom, 2021-07-14)). Therefore, to retain the
benefit `cache_bottom` provides in non-sparse index cases, a separate `hint`
position indicates the first position `next_cache_entry` should search,
updated each execution with a new position.

Signed-off-by: Victoria Dye <vdye@github.com>
@gitgitgadget
Copy link

gitgitgadget bot commented Nov 29, 2021

There is an issue in commit 1b3631a:
Commit checks stopped - the message is too short

@vdye
Copy link
Author

vdye commented Nov 29, 2021

/preview

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 29, 2021

Preview email sent as pull.1048.v6.git.1638200459.gitgitgadget@gmail.com

@vdye
Copy link
Author

vdye commented Nov 29, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 29, 2021

Submitted as pull.1048.v6.git.1638201164.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-1048/vdye/vdye/sparse-index-part1-v6

To fetch this version to local tag pr-1048/vdye/vdye/sparse-index-part1-v6:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-1048/vdye/vdye/sparse-index-part1-v6

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 29, 2021

On the Git mailing list, Elijah Newren wrote (reply to this):

Hi,

On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Changes since V5
> ================
>
>  * Update t1092 test 'reset with wildcard pathspec' with more cases and
>    better checks
>  * Add "special case" wildcard pathspec check when determining whether to
>    expand the index (avoids double-loop over pathspecs & index entries)

Looks pretty good.  However, I'm worried this special case you added
at my prodding might be problematic, and that I may have been wrong to
prod you into it...

> Thanks! -Victoria
>
> Range-diff vs v5:
>
>  7:  a9135a5ed64 ! 7:  822d7344587 reset: make --mixed sparse-aware
>      @@ Commit message
>
>           Remove the `ensure_full_index` guard on `read_from_tree` and update `git
>           reset --mixed` to ensure it can use sparse directory index entries wherever
>      -    possible. Sparse directory entries are reset use `diff_tree_oid`, which
>      +    possible. Sparse directory entries are reset using `diff_tree_oid`, which
>           requires `change` and `add_remove` functions to process the internal
>           contents of the sparse directory. The `recursive` diff option handles cases
>           in which `reset --mixed` must diff/merge files that are nested multiple
>      @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q,
>       +          * (since we can reset whole sparse directories without expanding them).
>       +          */
>       +         if (item.nowildcard_len < item.len) {
>      ++                 /*
>      ++                  * Special case: if the pattern is a path inside the cone
>      ++                  * followed by only wildcards, the pattern cannot match
>      ++                  * partial sparse directories, so we don't expand the index.
>      ++                  */
>      ++                 if (path_in_cone_mode_sparse_checkout(item.original, &the_index) &&
>      ++                     strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len)

I usually expect in an &&-chain to see the cheaper function call first
(because that ordering often avoids the need to call the second
function), and I would presume that strspn() would be the cheaper of
the two.  Did you switch the order because you expect the strspn call
to nearly always return true, though?

Could the strspn() call be replaced by a `item.len ==
item.nowildcard_len + 1`?  I mean, sure, folks could list multiple
asterisks in a row in their pathspec, but that seems super unlikely
and even if it does happen the code will just fall back to the slower
codepath and still give them the right answer.  And the simpler check
feels a lot easier to parse for human readers.

But I'm worried there's a deeper issue here:


Is the wildcard character (or characters) in path treated as a literal
by path_in_cone_mode_sparse_checkout()?  I think it is...and I'm
worried that may be incorrect.  For example, if the path is

   foo/*

and the user has done a

  git sparse-checkout set foo/bar/

Then 'foo/baz/file' is not in the sparse checkout.  However, 'foo/*'
should match 'foo/baz/file' and yet 'foo/*' when treated as a literal
path would be considered in the sparse checkout by
path_in_cone_mode_sparse_checkout.  Does this result in the code
returning an incorrect answer?  (Or did I misunderstand something so
far?)

I'm wondering if I misled you earlier in my musings about whether we
could avoid the slow codepath for pathspecs with wildcard characters.
Maybe there's no safe optimization here and wildcard characters should
always go through the slower codepath.

>      ++                         continue;
>      ++
>       +                 for (pos = 0; pos < active_nr; pos++) {
>       +                         struct cache_entry *ce = active_cache[pos];
>       +
>  8:  f91d1dcf024 = 8:  ddd97fb2837 unpack-trees: improve performance of next_cache_entry
>
> --
> gitgitgadget

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 29, 2021

On the Git mailing list, Victoria Dye wrote (reply to this):

Elijah Newren wrote:
> Hi,
> 
> On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> Changes since V5
>> ================
>>
>>  * Update t1092 test 'reset with wildcard pathspec' with more cases and
>>    better checks
>>  * Add "special case" wildcard pathspec check when determining whether to
>>    expand the index (avoids double-loop over pathspecs & index entries)
> 
> Looks pretty good.  However, I'm worried this special case you added
> at my prodding might be problematic, and that I may have been wrong to
> prod you into it...
> 
>> Thanks! -Victoria
>>
>> Range-diff vs v5:
>>
>>  7:  a9135a5ed64 ! 7:  822d7344587 reset: make --mixed sparse-aware
>>      @@ Commit message
>>
>>           Remove the `ensure_full_index` guard on `read_from_tree` and update `git
>>           reset --mixed` to ensure it can use sparse directory index entries wherever
>>      -    possible. Sparse directory entries are reset use `diff_tree_oid`, which
>>      +    possible. Sparse directory entries are reset using `diff_tree_oid`, which
>>           requires `change` and `add_remove` functions to process the internal
>>           contents of the sparse directory. The `recursive` diff option handles cases
>>           in which `reset --mixed` must diff/merge files that are nested multiple
>>      @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q,
>>       +          * (since we can reset whole sparse directories without expanding them).
>>       +          */
>>       +         if (item.nowildcard_len < item.len) {
>>      ++                 /*
>>      ++                  * Special case: if the pattern is a path inside the cone
>>      ++                  * followed by only wildcards, the pattern cannot match
>>      ++                  * partial sparse directories, so we don't expand the index.
>>      ++                  */
>>      ++                 if (path_in_cone_mode_sparse_checkout(item.original, &the_index) &&
>>      ++                     strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len)
> 
> I usually expect in an &&-chain to see the cheaper function call first
> (because that ordering often avoids the need to call the second
> function), and I would presume that strspn() would be the cheaper of
> the two.  Did you switch the order because you expect the strspn call
> to nearly always return true, though?
> 

This is a miss on my part, the `strspn()` check is probably less expensive
and should be first.

> Could the strspn() call be replaced by a `item.len ==
> item.nowildcard_len + 1`?  I mean, sure, folks could list multiple
> asterisks in a row in their pathspec, but that seems super unlikely
> and even if it does happen the code will just fall back to the slower
> codepath and still give them the right answer.  And the simpler check
> feels a lot easier to parse for human readers.
> 

Agreed on wanting better readability - if the multiple-wildcard case is
unlikely, the `PATHSPEC_ONESTAR` flag would indicate whether the pathspec
ends in a single wildcard character. If that flag is still too obscure,
though, I can stick with the length comparison.

> But I'm worried there's a deeper issue here:
> 
> 
> Is the wildcard character (or characters) in path treated as a literal
> by path_in_cone_mode_sparse_checkout()?  I think it is...and I'm
> worried that may be incorrect.  For example, if the path is
> 
>    foo/*
> 
> and the user has done a
> 
>   git sparse-checkout set foo/bar/
> 
> Then 'foo/baz/file' is not in the sparse checkout.  However, 'foo/*'
> should match 'foo/baz/file' and yet 'foo/*' when treated as a literal
> path would be considered in the sparse checkout by
> path_in_cone_mode_sparse_checkout.  Does this result in the code
> returning an incorrect answer?  (Or did I misunderstand something so
> far?)
> 

Correct: `path_in_cone_mode_sparse_checkout` interprets the wildcard
literally, and the checks here take that into account. The goal of
`pathspec_needs_expanded_index` is to determine if the pathspec *may* match
only partial contents of a sparse directory (like '*.c', or 'f*le'). For a
`git reset --mixed`, only this scenario requires expansion; if an entire
sparse directory is matched by a pathspec, the entire sparse directory is
reset. 

Using your example, 'foo/*' does match 'foo/baz/file', but it also matches
'foo/' itself; as a result, the `foo/` sparse directory index entry is reset
(rather than some individual files contained within it). The same goes for a
patchspec like 'fo*' ("in-cone" and ending in a wildcard). Conversely, a
pathspec like 'foo/ba*' would _not_ work (it wouldn't match something like
'foo/test-file'), and neither would 'f*o' (it would match all of 'foo', but
would only match files ending in "o" in a directory 'f/').

Hope that helps!

> I'm wondering if I misled you earlier in my musings about whether we
> could avoid the slow codepath for pathspecs with wildcard characters.
> Maybe there's no safe optimization here and wildcard characters should
> always go through the slower codepath.
> 
>>      ++                         continue;
>>      ++
>>       +                 for (pos = 0; pos < active_nr; pos++) {
>>       +                         struct cache_entry *ce = active_cache[pos];
>>       +
>>  8:  f91d1dcf024 = 8:  ddd97fb2837 unpack-trees: improve performance of next_cache_entry
>>
>> --
>> gitgitgadget

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 29, 2021

This patch series was integrated into seen via git@7484d84.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 30, 2021

This patch series was integrated into seen via git@dcb05a1.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 30, 2021

There was a status update in the "Cooking" section about the branch vd/sparse-reset on the Git mailing list:

Various operating modes of "git reset" have been made to work
better with the sparse index.

Will merge to 'next'.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 30, 2021

On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Nov 29, 2021 at 11:44 AM Victoria Dye <vdye@github.com> wrote:
>
> Elijah Newren wrote:
> > Hi,
> >
> > On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> Changes since V5
> >> ================
> >>
> >>  * Update t1092 test 'reset with wildcard pathspec' with more cases and
> >>    better checks
> >>  * Add "special case" wildcard pathspec check when determining whether to
> >>    expand the index (avoids double-loop over pathspecs & index entries)
> >
> > Looks pretty good.  However, I'm worried this special case you added
> > at my prodding might be problematic, and that I may have been wrong to
> > prod you into it...
> >
> >> Thanks! -Victoria
> >>
> >> Range-diff vs v5:
> >>
> >>  7:  a9135a5ed64 ! 7:  822d7344587 reset: make --mixed sparse-aware
> >>      @@ Commit message
> >>
> >>           Remove the `ensure_full_index` guard on `read_from_tree` and update `git
> >>           reset --mixed` to ensure it can use sparse directory index entries wherever
> >>      -    possible. Sparse directory entries are reset use `diff_tree_oid`, which
> >>      +    possible. Sparse directory entries are reset using `diff_tree_oid`, which
> >>           requires `change` and `add_remove` functions to process the internal
> >>           contents of the sparse directory. The `recursive` diff option handles cases
> >>           in which `reset --mixed` must diff/merge files that are nested multiple
> >>      @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q,
> >>       +          * (since we can reset whole sparse directories without expanding them).
> >>       +          */
> >>       +         if (item.nowildcard_len < item.len) {
> >>      ++                 /*
> >>      ++                  * Special case: if the pattern is a path inside the cone
> >>      ++                  * followed by only wildcards, the pattern cannot match
> >>      ++                  * partial sparse directories, so we don't expand the index.
> >>      ++                  */
> >>      ++                 if (path_in_cone_mode_sparse_checkout(item.original, &the_index) &&
> >>      ++                     strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len)
> >
> > I usually expect in an &&-chain to see the cheaper function call first
> > (because that ordering often avoids the need to call the second
> > function), and I would presume that strspn() would be the cheaper of
> > the two.  Did you switch the order because you expect the strspn call
> > to nearly always return true, though?
> >
>
> This is a miss on my part, the `strspn()` check is probably less expensive
> and should be first.
>
> > Could the strspn() call be replaced by a `item.len ==
> > item.nowildcard_len + 1`?  I mean, sure, folks could list multiple
> > asterisks in a row in their pathspec, but that seems super unlikely
> > and even if it does happen the code will just fall back to the slower
> > codepath and still give them the right answer.  And the simpler check
> > feels a lot easier to parse for human readers.
> >
>
> Agreed on wanting better readability - if the multiple-wildcard case is
> unlikely, the `PATHSPEC_ONESTAR` flag would indicate whether the pathspec
> ends in a single wildcard character. If that flag is still too obscure,
> though, I can stick with the length comparison.
>
> > But I'm worried there's a deeper issue here:
> >
> >
> > Is the wildcard character (or characters) in path treated as a literal
> > by path_in_cone_mode_sparse_checkout()?  I think it is...and I'm
> > worried that may be incorrect.  For example, if the path is
> >
> >    foo/*
> >
> > and the user has done a
> >
> >   git sparse-checkout set foo/bar/
> >
> > Then 'foo/baz/file' is not in the sparse checkout.  However, 'foo/*'
> > should match 'foo/baz/file' and yet 'foo/*' when treated as a literal
> > path would be considered in the sparse checkout by
> > path_in_cone_mode_sparse_checkout.  Does this result in the code
> > returning an incorrect answer?  (Or did I misunderstand something so
> > far?)
> >
>
> Correct: `path_in_cone_mode_sparse_checkout` interprets the wildcard
> literally, and the checks here take that into account. The goal of
> `pathspec_needs_expanded_index` is to determine if the pathspec *may* match
> only partial contents of a sparse directory (like '*.c', or 'f*le'). For a
> `git reset --mixed`, only this scenario requires expansion; if an entire
> sparse directory is matched by a pathspec, the entire sparse directory is
> reset.
>
> Using your example, 'foo/*' does match 'foo/baz/file', but it also matches
> 'foo/' itself; as a result, the `foo/` sparse directory index entry is reset
> (rather than some individual files contained within it). The same goes for a
> patchspec like 'fo*' ("in-cone" and ending in a wildcard). Conversely, a
> pathspec like 'foo/ba*' would _not_ work (it wouldn't match something like
> 'foo/test-file'), and neither would 'f*o' (it would match all of 'foo', but
> would only match files ending in "o" in a directory 'f/').
>
> Hope that helps!

Ah, yes, thanks for the explanation.  :-)

> > I'm wondering if I misled you earlier in my musings about whether we
> > could avoid the slow codepath for pathspecs with wildcard characters.
> > Maybe there's no safe optimization here and wildcard characters should
> > always go through the slower codepath.
> >
> >>      ++                         continue;
> >>      ++
> >>       +                 for (pos = 0; pos < active_nr; pos++) {
> >>       +                         struct cache_entry *ce = active_cache[pos];
> >>       +
> >>  8:  f91d1dcf024 = 8:  ddd97fb2837 unpack-trees: improve performance of next_cache_entry
> >>
> >> --
> >> gitgitgadget
>

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 1, 2021

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Nov 29 2021, Victoria Dye wrote:

> Elijah Newren wrote:
>> Hi,
>> 
>> On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget
>> <gitgitgadget@gmail.com> wrote:
>>>
>>> Changes since V5
>>> ================
>>>
>>>  * Update t1092 test 'reset with wildcard pathspec' with more cases and
>>>    better checks
>>>  * Add "special case" wildcard pathspec check when determining whether to
>>>    expand the index (avoids double-loop over pathspecs & index entries)
>> 
>> Looks pretty good.  However, I'm worried this special case you added
>> at my prodding might be problematic, and that I may have been wrong to
>> prod you into it...
>> 
>>> Thanks! -Victoria
>>>
>>> Range-diff vs v5:
>>>
>>>  7:  a9135a5ed64 ! 7:  822d7344587 reset: make --mixed sparse-aware
>>>      @@ Commit message
>>>
>>>           Remove the `ensure_full_index` guard on `read_from_tree` and update `git
>>>           reset --mixed` to ensure it can use sparse directory index entries wherever
>>>      -    possible. Sparse directory entries are reset use `diff_tree_oid`, which
>>>      +    possible. Sparse directory entries are reset using `diff_tree_oid`, which
>>>           requires `change` and `add_remove` functions to process the internal
>>>           contents of the sparse directory. The `recursive` diff option handles cases
>>>           in which `reset --mixed` must diff/merge files that are nested multiple
>>>      @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q,
>>>       +          * (since we can reset whole sparse directories without expanding them).
>>>       +          */
>>>       +         if (item.nowildcard_len < item.len) {
>>>      ++                 /*
>>>      ++                  * Special case: if the pattern is a path inside the cone
>>>      ++                  * followed by only wildcards, the pattern cannot match
>>>      ++                  * partial sparse directories, so we don't expand the index.
>>>      ++                  */
>>>      ++                 if (path_in_cone_mode_sparse_checkout(item.original, &the_index) &&
>>>      ++                     strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len)
>> 
>> I usually expect in an &&-chain to see the cheaper function call first
>> (because that ordering often avoids the need to call the second
>> function), and I would presume that strspn() would be the cheaper of
>> the two.  Did you switch the order because you expect the strspn call
>> to nearly always return true, though?
>> 
>
> This is a miss on my part, the `strspn()` check is probably less expensive
> and should be first.

I doubt it matters either way, and I didn't look into this to any degree
of carefulness.

But having followed the breadcrumb trail from the "What's Cooking"
discussion & looked at the code one thing that stuck out for me was that
path_in_cone_mode_sparse_checkout() appears returns 1 inconditionally in
some cases based on global state:

	/*
	 * We default to accepting a path if there are no patterns or
	 * they are of the wrong type.
	 */
	if (init_sparse_checkout_patterns(istate) ||
	    (require_cone_mode &&
	     !istate->sparse_checkout_patterns->use_cone_patterns))
		return 1;

So moreso than the nano-optimization of strspn()
v.s. path_in_cone_mode_sparse_checkout() I found it a bit odd that we're
calling something in a loop where presumably we can punt out a lot
earlier, and at least make that "continue" a "break" or "return" in that
case.

I.e. something in this direction (this patch obviously doesn't even
compile, but should clarify what I'm blathering about :); but again, I
really haven't looked at this properly, so just food for thought:

diff --git a/builtin/reset.c b/builtin/reset.c
index b1ff699b43a..cefdabb09c2 100644
--- a/builtin/reset.c
+++ b/builtin/reset.c
@@ -187,6 +187,9 @@ static int pathspec_needs_expanded_index(const struct pathspec *pathspec)
 	if (pathspec->magic)
 		return 1;
 
+	if (cant_possibly_have_path_in_cone_mode_blah_blah(&the_index))
+		return 1;
+
 	for (i = 0; i < pathspec->nr; i++) {
 		struct pathspec_item item = pathspec->items[i];
 
diff --git a/dir.c b/dir.c
index 5aa6fbad0b7..19f2d989dd3 100644
--- a/dir.c
+++ b/dir.c
@@ -1456,14 +1456,8 @@ int init_sparse_checkout_patterns(struct index_state *istate)
 	return 0;
 }
 
-static int path_in_sparse_checkout_1(const char *path,
-				     struct index_state *istate,
-				     int require_cone_mode)
+int cant_possibly_have_path_in_cone_mode_blah_blah(...)
 {
-	int dtype = DT_REG;
-	enum pattern_match_result match = UNDECIDED;
-	const char *end, *slash;
-
 	/*
 	 * We default to accepting a path if there are no patterns or
 	 * they are of the wrong type.
@@ -1472,6 +1466,16 @@ static int path_in_sparse_checkout_1(const char *path,
 	    (require_cone_mode &&
 	     !istate->sparse_checkout_patterns->use_cone_patterns))
 		return 1;
+}
+
+
+static int path_in_sparse_checkout_1(const char *path,
+				     struct index_state *istate,
+				     int require_cone_mode)
+{
+	int dtype = DT_REG;
+	enum pattern_match_result match = UNDECIDED;
+	const char *end, *slash;
 
 	/*
 	 * If UNDECIDED, use the match from the parent dir (recursively), or

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 2, 2021

This patch series was integrated into seen via git@434cd01.

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 2, 2021

This patch series was integrated into seen via git@c8099ad.

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 2, 2021

This patch series was integrated into seen via git@2f9cc17.

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 2, 2021

This patch series was integrated into next via git@47b1095.

@gitgitgadget gitgitgadget bot added the next label Dec 2, 2021
@gitgitgadget
Copy link

gitgitgadget bot commented Dec 3, 2021

This patch series was integrated into seen via git@d595dbd.

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 4, 2021

There was a status update in the "Cooking" section about the branch vd/sparse-reset on the Git mailing list:

Various operating modes of "git reset" have been made to work
better with the sparse index.

Will merge to 'master'.
source: <pull.1048.v6.git.1638201164.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 7, 2021

This patch series was integrated into seen via git@3c82291.

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 8, 2021

This patch series was integrated into seen via git@8ca2776.

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 8, 2021

There was a status update in the "Cooking" section about the branch vd/sparse-reset on the Git mailing list:

Various operating modes of "git reset" have been made to work
better with the sparse index.

Will merge to 'master'.
source: <pull.1048.v6.git.1638201164.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 8, 2021

This patch series was integrated into seen via git@55ef6d4.

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 11, 2021

There was a status update in the "Graduated to 'master'" section about the branch vd/sparse-reset on the Git mailing list:

Various operating modes of "git reset" have been made to work
better with the sparse index.
source: <pull.1048.v6.git.1638201164.gitgitgadget@gmail.com>

@vdye vdye closed this Dec 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants