pack-objects: create new name-hash algorithm #5157

Merged 11 commits on Sep 24, 2024

derrickstolee

This is an updated version of gitgitgadget#1785, intended for early consumption into Git for Windows.

The idea here is to add a new --full-name-hash option to git pack-objects and git repack. This adjusts the name-hash value used for finding delta bases in such a way that uses the full path name with a lower likelihood of collisions than the default name-hash algorithm. In many repositories with name-hash collisions and many versions of those paths, this can significantly reduce the size of a full repack. It can also help in certain cases of git push, but only if the pack is already artificially inflated by name-hash collisions; cases that find "sibling" deltas as better choices become worse with --full-name-hash.

Thus, this option is currently recommended for full repacks of large repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either writing bitmaps or using a bitmap walk during reads. The bitmap file format contains name-hash values, but no way to indicate which function is used, so compatibility is a concern for bitmaps. Future work could explore this idea.

After this PR is merged, then the more-involved --path-walk option may be considered.

The pack_name_hash() method has not been materially changed since it was
introduced in ce0bd64 (pack-objects: improve path grouping
heuristics., 2006-06-05). The intention here is to group objects by path
name, but also attempt to group similar file types together by making
the most-significant digits of the hash be focused on the final
characters.

Here's the crux of the implementation:

	/*
	 * This effectively just creates a sortable number from the
	 * last sixteen non-whitespace characters. Last characters
	 * count "most", so things that end in ".c" sort together.
	 */
	while ((c = *name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}

As the comment mentions, this only cares about the last sixteen
non-whitespace characters. This causes some filenames to collide more
than others. Here are some examples that I've seen while investigating
repositories that are growing more than they should be:

 * "/CHANGELOG.json" is 15 characters, and is created by the beachball
   [1] tool. Only the final character of the parent directory can
   differentiate different versions of this file, and even then only
   through its two most-significant bits. If that character is a letter,
   then this is always a collision. Similar issues occur with the similar
   "/CHANGELOG.md" path, though there is more opportunity for
   differences in the parent directory.

 * Localization files frequently have common filenames but differentiate
   via parent directories. In C#, the name "/strings.resx.lcl" is used
   for these localization files and they will all collide in name-hash.

[1] https://github.com/microsoft/beachball

I've come across many other examples where some internal tool uses a
common name across multiple directories and is causing Git to repack
poorly due to name-hash collisions.

It is clear that the existing name-hash algorithm is optimized for
repositories with short path names, and also for packing a single
snapshot of a repository rather than a repository with many versions of
the same file. My testing bears this out: the name-hash algorithm does
a good job of finding peer files as delta bases when no historical
version of that exact file is available.

However, for repositories that have many versions of most files and
directories, it is more important that the objects that appear at the
same path are grouped together.

Create a new pack_full_name_hash() method and a new --full-name-hash
option for 'git pack-objects' to call that method instead. Add a simple
pass-through for 'git repack --full-name-hash' for additional testing in
the context of a full repack, where I expect this will be most
effective.

The hash algorithm is as simple as possible to be reasonably effective:
for each character of the path string, add a multiple of that character
and a large prime number (chosen arbitrarily, but intended to be large
relative to the size of a uint32_t). Then, shift the current hash value
to the right by 5, with overlap. The addition and shift parameters are
standard mechanisms for creating hard-to-predict behaviors in the bits
of the resulting hash.

This is not meant to be cryptographic at all, but uniformly distributed
across the possible hash values. This creates a hash that appears
pseudorandom. There is no ability to consider similar file types as
being close to each other.

In a later change, a test-tool will be added so the effectiveness of
this hash can be demonstrated directly.

For now, let's consider how effective this mechanism is when repacking a
repository with and without the --full-name-hash option. Specifically,
let's use 'git repack -adf [--full-name-hash]' as our test.

On the Git repository, we do not expect much difference. All path names
are short. This is backed by our results:

| Stage                 | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone           | 260 MB    | N/A         |
| Standard Repack       | 127 MB    | 106s        |
| With --full-name-hash | 126 MB    | 99s         |

This example demonstrates how there is some natural overhead coming from
the cloned copy because the server is hosting many forks and has not
optimized for exactly this set of reachable objects. But the full repack
has similar characteristics with and without --full-name-hash.

However, we can test this in a repository that uses one of the
problematic naming conventions above. The fluentui [2] repo uses
beachball to generate CHANGELOG.json and CHANGELOG.md files, and these
files have very poor delta characteristics when comparing against
versions across parent directories.

| Stage                 | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone           | 694 MB    | N/A         |
| Standard Repack       | 438 MB    | 728s        |
| With --full-name-hash | 168 MB    | 142s        |

[2] https://github.com/microsoft/fluentui

In this example, we see significant gains in the compressed packfile
size as well as the time taken to compute the packfile.

Using a collection of repositories that use the beachball tool, I was
able to make similar comparisons with dramatic results. While the
fluentui repo is public, the others are private so cannot be shared for
reproduction. The results are so significant that I find it important to
share here:

| Repo     | Standard Repack | With --full-name-hash |
|----------|-----------------|-----------------------|
| fluentui |         438 MB  |               168 MB  |
| Repo B   |       6,255 MB  |               829 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |

Future changes could include making --full-name-hash implied by a config
value or even implied by default during a full repack.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The new '--full-name-hash' option for 'git repack' is a simple
pass-through to the underlying 'git pack-objects' subcommand. However,
this subcommand may have other options and a temporary filename as part
of the subcommand execution that may not be predictable or could change
over time.

The existing test_subcommand method requires an exact list of arguments
for the subcommand. This is too rigid for our needs here, so create a
new method, test_subcommand_flex. Use it to check that the
--full-name-hash option is passing through.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Add a new environment variable to opt-in to the --full-name-hash option
in 'git pack-objects'. This allows for extra testing of the feature
without repeating all of the test scenarios.

But this option isn't free. There are a few tests that change behavior
with the variable enabled.

First, there are a few tests that are very sensitive to certain delta
bases being picked. Both involve generating thin bundles and then
counting their objects via 'git index-pack --fix-thin', which pulls the
delta base into the new packfile. For these tests, disable the
variable; this is a reasonable long-term choice.

Second, there are two tests in t5616-partial-clone.sh that I believe are
actually broken scenarios. While the client is set up to clone the
'promisor-server' repo via a treeless partial clone filter (tree:0),
that filter does not translate to the 'server' repo. Thus, fetching from
these repos causes the server to think that the client has all reachable
trees and blobs from the commits advertised as 'haves'. This leads the
server to provide a thin pack assuming those objects as delta bases.
Changing the name-hash algorithm presents new delta bases and thus
breaks the expectations of these tests. An alternative could be to set
up 'server' as a promisor server with the correct filter enabled. This
may also point out more issues with partial clone being set up as a
remote-based filtering mechanism and not a repository-wide setting. For
now, do the minimal change to make the test work by disabling the test
variable.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
This also adds the '--full-name-hash' option introduced in the previous
change and adds newlines to the synopsis.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
As custom options are added to 'git pack-objects' and 'git repack' to
adjust how compression is done, use this new performance test script to
demonstrate their effectiveness in performance and size.

The recently-added --full-name-hash option swaps the default name-hash
algorithm with one that attempts to uniformly distribute the hashes
based on the full path name instead of the last 16 characters.

This has a dramatic effect on full repacks for repositories with many
versions of most paths. It can have a negative impact on cases such as
pushing a single change.

This can be seen by running pt5313 on the open source fluentui
repository [1]. Most commits will have this kind of output for the thin
and big pack cases, though certain commits (such as [2]) will have
problematic thin pack size for other reasons.

[1] https://github.com/microsoft/fluentui
[2] a637a06df05360ce5ff21420803f64608226a875

Checked out at the parent of [2], I see the following statistics:

Test                                           this tree
------------------------------------------------------------------
5313.2: thin pack                              0.02(0.01+0.01)
5313.3: thin pack size                                    1.1K
5313.4: thin pack with --full-name-hash        0.02(0.01+0.00)
5313.5: thin pack size with --full-name-hash              3.0K
5313.6: big pack                               1.65(3.35+0.24)
5313.7: big pack size                                    58.0M
5313.8: big pack with --full-name-hash         1.53(2.52+0.18)
5313.9: big pack size with --full-name-hash              57.6M
5313.10: repack                                176.52(706.60+3.53)
5313.11: repack size                                    446.7K
5313.12: repack with --full-name-hash          37.47(134.18+3.06)
5313.13: repack size with --full-name-hash              183.1K

Note that this demonstrates a 3x size _increase_ in the case that
simulates a small "git push". The size change is neutral on the case of
pushing the difference between HEAD and HEAD~1000.

However, the full repack case is both faster and more efficient.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Add a new test-tool helper, name-hash, to output the value of the
name-hash algorithms for the input list of strings, one per line.

Since the name-hash values can be stored in the .bitmap files, it is
important that these hash functions do not change across Git versions.
Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
the current values. Due to how these functions are implemented, it would
be difficult to change them without disturbing these values.

Create a performance test that uses test_size to demonstrate how
collisions occur for these hash algorithms. This test helps inform
someone as to the behavior of the name-hash algorithms for their repo
based on the paths at HEAD.

My copy of the Git repository shows modest statistics around the
collisions of the default name-hash algorithm:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                        4.5K
5314.2: number of distinct name-hashes                       4.1K
5314.3: number of distinct full-name-hashes                  4.5K
5314.4: maximum multiplicity of name-hashes                    13
5314.5: maximum multiplicity of fullname-hashes                 1

Here, the maximum collision multiplicity is 13, but around 10% of paths
have a collision with another path.

In a more interesting example, the microsoft/fluentui [1] repo had these
statistics at time of committing:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                       19.6K
5314.2: number of distinct name-hashes                       8.2K
5314.3: number of distinct full-name-hashes                 19.6K
5314.4: maximum multiplicity of name-hashes                   279
5314.5: maximum multiplicity of fullname-hashes                 1

[1] https://github.com/microsoft/fluentui

That demonstrates that the nearly twenty thousand path names are
assigned only around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.

In this repository, no collisions occur for the full-name-hash
algorithm.

In a more extreme example, an internal monorepo had a much worse
collision rate:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                      221.6K
5314.2: number of distinct name-hashes                      72.0K
5314.3: number of distinct full-name-hashes                221.6K
5314.4: maximum multiplicity of name-hashes                 14.4K
5314.5: maximum multiplicity of fullname-hashes                 2

Even in this repository with many more paths at HEAD, the collision rate
was low and the maximum number of paths being grouped into a single
bucket by the full-name-hash algorithm was two.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
dscho and others added 4 commits September 24, 2024 09:29
Update to the latest iteration of gitgitgadget#1785.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Update to the latest iteration of gitgitgadget#1785.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Update to the latest iteration of gitgitgadget#1785.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This applies the patch at https://lore.kernel.org/git/ZvJj7PeB52m_1mG9@pks.im:

On Wed, Sep 18, 2024 at 08:46:21PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
> diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
> new file mode 100644
> index 00000000000..15fb8f853c1
> --- /dev/null
> +++ b/t/helper/test-name-hash.c
> @@ -0,0 +1,23 @@
> +/*
> + * test-name-hash.c: Read a list of paths over stdin and report on their
> + * name-hash and full name-hash.
> + */
> +
> +#include "test-tool.h"
> +#include "git-compat-util.h"
> +#include "pack-objects.h"
> +#include "strbuf.h"
> +
> +int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
> +{
> +	struct strbuf line = STRBUF_INIT;
> +
> +	while (!strbuf_getline(&line, stdin)) {
> +		uint32_t name_hash = pack_name_hash(line.buf);
> +		uint32_t full_hash = pack_full_name_hash(line.buf);
> +
> +		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);
> +	}
> +
> +	return 0;
> +}

This patch breaks t5310 with the leak sanitizer enabled due to the
leaking `struct strbuf line`. It needs the following diff on top:

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho dscho left a comment

Great work! I plan on integrating this into Git for Windows v2.46.2, due today, with minor adjustments.

test_expect_success '--full-name-hash option passes through to pack-objects' '
	GIT_TRACE2_EVENT="$(pwd)/full-trace.txt" \
		git repack -a --full-name-hash &&
	test_subcommand_flex git pack-objects --full-name-hash <full-trace.txt

Technically, we could (ab-)use the fact that test_subcommand, like all shell functions, really, is quite lax. One side effect of this is that it does not special-case characters that have special meaning in regular expressions. Therefore, we could easily write:

test_subcommand git pack-objects ".*--full-name-hash.*" <full-trace.txt

here.

But this is a minor point, and irrelevant for correctness (and I really want to focus on correctness because I want to slip this into v2.46.2 that, just like v2.46.1, showed up at my doorstep under-announced).

Comment on lines 4557 to 4560
-	if (write_bitmap_index && use_full_name_hash)
+	if (write_bitmap_index && use_full_name_hash > 0)
 		die(_("currently, the --full-name-hash option is incompatible with --write-bitmap-index"));
+	if (use_full_name_hash < 0)
+		use_full_name_hash = git_env_bool("GIT_TEST_FULL_NAME_HASH", 0);

The environment variable should probably be interpreted before validating that we're not writing a bitmap, right? I guess that would let more tests fail, though...

@dscho dscho added this to the v2.46.2 milestone Sep 24, 2024
This option is still under discussion on the Git mailing list.

We still would like to have some real-world data, and the best way to
get it is to get a Git for Windows release into users' hands so that
they can test it.

Nevertheless, without the official blessing of the Git maintainer, this
option is experimental, and we need to be clear about that.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho dscho merged commit 4bb8c65 into git-for-windows:main Sep 24, 2024
44 checks passed

dscho commented Sep 24, 2024

/add relnote feature Comes with the new, experimental --full-name-hash option for git repack that helps packing monorepos more tightly.

The workflow run was started

git-for-windows-ci pushed a commit that referenced this pull request Sep 24, 2024
github-actions bot pushed a commit to git-for-windows/build-extra that referenced this pull request Sep 24, 2024
Comes with the [new, experimental `--full-name-hash` option for `git
repack`](git-for-windows/git#5157) that helps
packing monorepos more tightly.

Signed-off-by: gitforwindowshelper[bot] <gitforwindowshelper-bot@users.noreply.github.com>
dscho added a commit to dscho/git that referenced this pull request Sep 24, 2024
@dscho dscho mentioned this pull request Sep 24, 2024
git-for-windows-ci pushed a commit that referenced this pull request Sep 24, 2024
derrickstolee added a commit that referenced this pull request Sep 25, 2024
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but they focus on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Sep 25, 2024
git-for-windows-ci pushed a commit that referenced this pull request Sep 25, 2024
git-for-windows-ci pushed a commit that referenced this pull request Sep 26, 2024
Talonkinkade added a commit to Talonkinkade/git that referenced this pull request Nov 22, 2024
commit 2996b56fa7470c29e418e4e7249629ea74cdfdca
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Oct 11 16:55:26 2024 +0200

    amend! Add experimental 'git survey' builtin (#5174)

    Add experimental 'git survey' builtin (#5174)

    This introduces `git survey` to Git for Windows ahead of upstream for
    the express purpose of getting the path-based analysis in the hands of
    more folks.

    The inspiration of this builtin is
    [`git-sizer`](https://github.com/github/git-sizer), but since that
    command relies on `git cat-file --batch` to get the contents of objects,
    it has limits to how much information it can provide.

    This is mostly a rewrite of the `git survey` builtin that was introduced
    into the `microsoft/git` fork in microsoft/git#667. That version had a
    lot more bells and whistles, including an analysis much closer to what
    `git-sizer` provides.

    The biggest difference in this version is that this one is focused on
    using the path-walk API in order to visit batches of objects based on a
    common path. This allows identifying, for instance, the path that is
    contributing the most to the on-disk size across all versions at that
    path.

    For example, here are the top ten paths contributing to my local Git
    repository (which includes `microsoft/git` and `gitster/git`):

    ```
    TOP FILES BY DISK SIZE
    ============================================================================
                                        Path | Count | Disk Size | Inflated Size
    -----------------------------------------+-------+-----------+--------------
                           whats-cooking.txt |  1373 |  11637459 |      37226854
                 t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                          git-rebase--helper |     1 |   6027849 |      15269664
                              compat/mingw.c |  6111 |   5194453 |     463466970
                 t/helper/test-parse-options |     1 |   3420385 |       8807968
                      t/helper/test-pkt-line |     1 |   3408661 |       8778960
          t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
                t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                    po/vi.po |   104 |   1376337 |      51441603
                                    po/de.po |   210 |   1360112 |      71198603
    ```

    This kind of analysis has been helpful in identifying the reasons for
    growth in a few internal monorepos. Those findings motivated the changes
    in #5157 and #5171.

    With this early version in Git for Windows, we can expand the reach of
    the experimental tool in advance of it being contributed to the upstream
    project.

    Unfortunately, this will mean that in the next `microsoft/git` rebase,
    Jeff Hostetler's version will need to be pulled out since there are
    enough conflicts. These conflicts include how tables are stored and
    generated, as the version in this PR is slightly more general to allow
    for different kinds of data.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 5f1db046948aa39d2a28ddc9fba5a8975df40fa3
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Oct 11 16:54:53 2024 +0200

    amend! path-walk: improve path-walk speed with many tags (#5205)

    path-walk: improve path-walk speed with many tags (#5205)

    In the presence of many tags, the use of oid_array_lookup() can become
    extremely slow. We should rely upon the SEEN bit instead.

    This affects the tag-peeling walk as well as the switch statement for
    adding the peeled object to the correct oid_array.

    ----

    Derrick Stolee found this while testing the 2.47.0.vfs.0.0 pre-release
    against a repo with many annotated tags.

    This is a backport of https://github.com/microsoft/git/pull/695.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit f380030a00c15fa00d66799f875cf2c986e1fc97
Merge: 61ec9331b6 c33368b771
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Oct 11 13:58:53 2024 +0200

    path-walk: improve path-walk speed with many tags (#5205)

    In the presence of many tags, the use of oid_array_lookup() can become
    extremely slow. We should rely upon the SEEN bit instead.

    This affects the tag-peeling walk as well as the switch statement for
    adding the peeled object to the correct oid_array.

    ----

    @derrickstolee found this while testing the 2.47.0.vfs.0.0 pre-release
    against a repo with many annotated tags.

    This is a backport of https://github.com/microsoft/git/pull/695.

commit 61ec9331b61ab857a259d6e1c4c4f86775b34f26
Merge: 12031c299c 3ead00a02c
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Jun 7 08:45:01 2018 +0200

    Merge 'readme' into HEAD

    Add a README.md for GitHub goodness.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit c33368b7717c90296ad2ffac3b8ffb364e6173ef
Author: Derrick Stolee <stolee@gmail.com>
Date:   Wed Oct 9 09:57:32 2024 -0400

    path-walk: improve path-walk speed with many tags

    In the presence of many tags, the use of oid_array_lookup() can become
    extremely slow. We should rely upon the SEEN bit instead.

    This affects the tag-peeling walk as well as the switch statement for
    adding the peeled object to the correct oid_array.

    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 12031c299c10eeb8af636303901db371b931b272
Merge: 3c2f5aa314 740b27f844
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Wed Oct 7 16:13:40 2020 +0200

    Merge pull request #2837 from dscho/monitor-component-updates

    Start monitoring updates of Git for Windows' components in the open

commit 3c2f5aa3148b8487f15ab828482ddc4e222c5262
Merge: fe2b01e513 8c90275e38
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Tue Jan 24 11:46:21 2023 +0100

    Merge branch 'deprecate-core.useBuiltinFSMonitor'

    Originally introduced as `core.useBuiltinFSMonitor` in Git for Windows
    and developed, improved and stabilized there, the built-in FSMonitor
    only made it into upstream Git (after unnecessarily long hemming and
    hawing and throwing overly perfectionist style review sticks into the
    spokes) as `core.fsmonitor = true`.

    In Git for Windows, with this topic branch, we re-introduce the
    now-obsolete config setting, with warnings suggesting to existing users
    how to switch to the new config setting, with the intention to
    ultimately drop the patch at some stage.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit fe2b01e51341088486df2b3747c0a40b365af1a5
Merge: c36a4deb88 aa062e96ec
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Jun 8 18:34:51 2018 +0200

    Merge branch 'phase-out-reset-stdin'

    This topic branch re-adds the deprecated --stdin/-z options to `git
    reset`. Those patches were overridden by a different set of options in
    the upstream Git project before we could propose `--stdin`.

    We offered this in MinGit to applications that wanted a safer way to
    pass lots of pathspecs to Git, and these applications will need to be
    adjusted.

    Instead of `--stdin`, `--pathspec-from-file=-` should be used, and
    instead of `-z`, `--pathspec-file-nul`.
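
    A minimal sketch of that adjustment for calling applications, assuming a
    hypothetical helper named `migrate_reset_args` (not part of Git). It
    rewrites only the two deprecated options named above and passes all
    other arguments through unchanged, one per line:

    ```shell
    # Hypothetical helper: translate the deprecated `git reset` options
    # to their replacements, printing one argument per line.
    migrate_reset_args () {
    	for arg
    	do
    		case "$arg" in
    		--stdin) printf '%s\n' --pathspec-from-file=- ;;
    		-z) printf '%s\n' --pathspec-file-nul ;;
    		*) printf '%s\n' "$arg" ;;
    		esac
    	done
    }
    ```

    This sketch assumes that `-z` only ever appears together with `--stdin`;
    a caller that uses `-z` with another meaning would need a stricter
    mapping.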

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit c36a4deb885361c37441564b1c62731809aa3917
Merge: 2fdfc3089a 2711b9ca0a
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Sat Jan 15 11:37:56 2022 +0100

    Merge branch 'un-revert-editor-save-and-reset'

    A fix for calling `vim` in Windows Terminal caused a regression and was
    reverted. We partially un-revert this, to get the fix again.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 2fdfc3089a6e7dd44a8c373f68bfaf6950c4da12
Merge: 42778ded1e a5cc82fd83
Author: Victoria Dye <vdye@github.com>
Date:   Thu Oct 28 15:16:10 2021 -0400

    Merge pull request #3492 from dscho/ns/batched-fsync

    Switch to batched fsync by default

commit 42778ded1e9b9bac269a37c3e4c163dbb200a853
Merge: 0d956e0879 7e12ac9200
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Oct 11 23:29:20 2018 +0200

    Merge pull request #1170 from dscho/mingw-kill-process

    Handle Ctrl+C in Git Bash nicely

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 0d956e0879144c6c6d736dde7ac80c3bf30c73a6
Merge: 7773e3c9fa 0a57a784e6
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Mon Aug 7 22:24:29 2023 +0200

    Merge branch 'wsl-file-mode-bits'

    This patch introduces support to set special NTFS attributes that are
    interpreted by the Windows Subsystem for Linux as file mode bits, UID
    and GID.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 7773e3c9fac969b26c6c0238053fb9f7e511d147
Merge: fd8673cf9c c576da7398
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Feb 7 14:35:33 2019 +0100

    Merge branch 'busybox-w32'

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit fd8673cf9cf25119fbc242960b08fa9c4c0412fb
Merge: 2a3d2866f2 3feb8f7dfb
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Wed Oct 31 15:08:16 2018 +0100

    Merge pull request #1897 from piscisaureus/symlink-attr

    Specify symlink type in .gitattributes

commit 2a3d2866f2186843b09b922c25f6ca9c750e0770
Merge: 4725764578 f93344df80
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Mon Aug 7 16:12:10 2023 +0200

    mingw: try resetting the read-only bit if rename fails (#4527)

    With this patch, Git for Windows works as intended on mounted APFS
    volumes (where renaming read-only files would fail).

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 4725764578421fc40b2de1b230f31e7af6d62055
Merge: a4ba13da06 426566b1e4
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Oct 12 23:04:03 2018 +0200

    Merge 'docker-volumes-are-no-symlinks'

    This was pull request #1645 from ZCube/master

    Support windows container.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit a4ba13da0615d74ad7c8b4ed9488bee4edd91e2b
Merge: f860813f4e bd2c03e214
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Nov 15 12:23:43 2018 +0100

    Merge branch 'kblees/kb/symlinks'

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit f860813f4e58b4050eec6c2cd4a896aad7ea6cc5
Merge: 343c75d471 b9a9681993
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Nov 15 12:23:43 2018 +0100

    Merge branch 'msys2'

commit 343c75d4716087b9a4a787a400c9391ba4248460
Merge: 86a198c2e2 0987f685d1
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Apr 22 23:46:50 2022 +0200

    Merge pull request #3817 from mathstuf/name-too-long-advice

    clean: suggest using `core.longPaths` if paths are too long to remove

commit 86a198c2e27492928a052f924509c5ec5d4cc03c
Merge: c7917ce9d3 29f7afc8f8
Author: Jeff Hostetler <jeffhost@microsoft.com>
Date:   Wed Sep 29 17:58:38 2021 -0400

    Merge branch 'fix-v4-fsmonitor-long-paths' into try-v4-fsmonitor

commit c7917ce9d3aaa7a78b42a2eb59519472b7582971
Merge: fa336b3e18 b8923c8fba
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Nov 15 12:23:43 2018 +0100

    Merge branch 'long-paths'

commit fa336b3e18bd1ae110429fbfffa61e42cc6cd665
Merge: b885fdedff f210ba75b1
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Oct 11 13:38:58 2018 +0200

    Merge branch 'gitk-and-git-gui-patches'

    These are Git for Windows' Git GUI and gitk patches. We will have to
    decide at some point what to do about them, but that's a little lower
    priority (as Git GUI seems to be unmaintained for the time being, and
    the gitk maintainer keeps a very low profile on the Git mailing list,
    too).

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 3ead00a02cbbe54627fd9a2f0a64d9cc19167aa6
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Aug 23 14:14:42 2019 +0200

    SECURITY.md: document Git for Windows' policies

    This is the recommended way on GitHub to describe policies revolving around
    security issues and about supported versions.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 740b27f84487f3459c3be078bba05587b6038a09
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Tue Feb 6 18:45:35 2024 +0100

    dependabot: help keep GitHub Actions versions up to date

    See https://docs.github.com/en/code-security/dependabot/working-with-dependabot/keeping-your-actions-up-to-date-with-dependabot#enabling-dependabot-version-updates-for-actions for details.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 3092e2ace9b944fe13be59dbe5839ab5bc4e9a4e
Author: Alejandro Barreto <alejandro.barreto@ni.com>
Date:   Fri Mar 9 14:17:54 2018 -0600

    Document how $HOME is set on Windows

    Git documentation refers to $HOME and $XDG_CONFIG_HOME often, but does not specify how or where these values come from on Windows where neither is set by default. The new documentation reflects the behavior of setup_windows_environment() in compat/mingw.c.

    Signed-off-by: Alejandro Barreto <alejandro.barreto@ni.com>

commit 43ce7da7f3866891bfddef68d1e4353838711f9d
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Tue Feb 20 15:44:57 2018 +0100

    .github: Add configuration for the Sentiment Bot

    The sentiment bot will help detect when things get too heated.
    Hopefully.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit b1fc6f3c7b58dab4a2e0e35a9b980f9feaa11efe
Author: Philip Oakley <philipoakley@iee.org>
Date:   Fri Dec 22 17:15:50 2017 +0000

    Modify the GitHub Pull Request template (to reflect Git for Windows)

    Git for Windows accepts pull requests; Core Git does not. Therefore we
    need to adjust the template (because it only matches core Git's
    project management style, not ours).

    Also: direct Git for Windows enhancements to their contributions page,
    space out the text for easy reading, and clarify that the mailing list
    is plain text, not HTML.

    Signed-off-by: Philip Oakley <philipoakley@iee.org>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit f95739fa44d273f84a4801aab986f83d5cac3406
Author: Brendan Forster <brendan@github.com>
Date:   Thu Feb 18 21:29:50 2016 +1100

    Add an issue template

    With improvements by Clive Chan, Adric Norris, Ben Bodenmiller and
    Philip Oakley.

    Helped-by: Clive Chan <cc@clive.io>
    Helped-by: Adric Norris <landstander668@gmail.com>
    Helped-by: Ben Bodenmiller <bbodenmiller@hotmail.com>
    Helped-by: Philip Oakley <philipoakley@iee.org>
    Signed-off-by: Brendan Forster <brendan@github.com>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 862899ec93dc1cb1d91e0eeb31d0c3c6b2c57211
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Jan 10 16:16:03 2014 -0600

    README.md: Add a Windows-specific preamble

    Includes touch-ups by 마누엘, Philip Oakley and 孙卓识.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 58997bb46572ba290809301db1b547223004af42
Author: Derrick Stolee <dstolee@microsoft.com>
Date:   Thu Mar 1 12:10:14 2018 -0500

    CONTRIBUTING.md: add guide for first-time contributors

    Getting started contributing to Git can be difficult on a Windows
    machine. CONTRIBUTING.md contains a guide to getting started, including
    detailed steps for setting up build tools, running tests, and
    submitting patches to upstream.

    [includes an example by Pratik Karki how to submit v2, v3, v4, etc.]

    Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

commit 12e0f90bea754eebe6ddf4c5b2af79572a891a63
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Oct 11 13:22:24 2019 +0200

    Modify the Code of Conduct for Git for Windows

    The Git project followed Git for Windows' lead and added their Code of
    Conduct, based on the Contributor Covenant v1.4, later updated to v2.0.

    We adapt it slightly to Git for Windows.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 898aba7180f0de1d7a26a12ce810788614c4f7b1
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Mon Feb 13 13:31:35 2023 +0100

    Describe Git for Windows' architecture [no ci]

    The Git for Windows project has grown quite complex over the years,
    certainly much more complex than during the first years where the
    `msysgit.git` repository was abusing Git for package management purposes
    and the `git/git` fork was called `4msysgit.git`.

    Let's describe the status quo in a thorough way.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 8c90275e38a704e321af7f7cd1c42911526f6fff
Author: Victoria Dye <vdye@github.com>
Date:   Mon Apr 4 15:38:58 2022 -0700

    fsmonitor: reintroduce core.useBuiltinFSMonitor

    Reintroduce the 'core.useBuiltinFSMonitor' config setting (originally added
    in 0a756b2a25 (fsmonitor: config settings are repository-specific,
    2021-03-05)) after its removal from the upstream version of FSMonitor.

    Upstream, the 'core.useBuiltinFSMonitor' setting was rendered obsolete by
    "overloading" the 'core.fsmonitor' setting to take a boolean value. However,
    several applications (e.g., 'scalar') utilize the original config setting,
    so it should be preserved for a deprecation period before complete removal:

    * if 'core.fsmonitor' is a boolean, the user is correctly using the new
      config syntax; do not use 'core.useBuiltinFSMonitor'.
    * if 'core.fsmonitor' is unspecified, use 'core.useBuiltinFSMonitor'.
    * if 'core.fsmonitor' is a path, override and use the builtin FSMonitor if
      'core.useBuiltinFSMonitor' is 'true'; otherwise, use the FSMonitor hook
      indicated by the path.
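
    The three precedence rules above can be sketched as a small decision
    function. This is an illustration, not Git's implementation: the name
    `resolve_fsmonitor` and its output strings are hypothetical, and only
    the literal values `true` and `false` are treated as booleans, although
    Git accepts more spellings.

    ```shell
    # Hypothetical helper, not Git's code.
    # $1: value of core.fsmonitor ("" if unspecified)
    # $2: value of core.useBuiltinFSMonitor ("" if unspecified)
    # Prints "builtin", "disabled", or "hook:<path>".
    resolve_fsmonitor () {
    	case "$1" in
    	true)
    		# New boolean syntax: ignore core.useBuiltinFSMonitor.
    		echo builtin ;;
    	false)
    		echo disabled ;;
    	'')
    		# Unspecified: fall back to the deprecated setting.
    		if test "$2" = true
    		then
    			echo builtin
    		else
    			echo disabled
    		fi ;;
    	*)
    		# A hook path: the deprecated setting may override it.
    		if test "$2" = true
    		then
    			echo builtin
    		else
    			printf 'hook:%s\n' "$1"
    		fi ;;
    	esac
    }
    ```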

    Additionally, for this deprecation period, advise users to switch to using
    'core.fsmonitor' to specify their use of the builtin FSMonitor.

    Signed-off-by: Victoria Dye <vdye@github.com>

commit aa062e96ec4348844793382b3eb1cb14c777372e
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Tue Dec 10 21:41:57 2019 +0100

    reset: reinstate support for the deprecated --stdin option

    The `--stdin` option was a well-established paradigm in other commands,
    therefore we implemented it in `git reset` for use by Visual Studio.

    Unfortunately, upstream Git decided that it is time to introduce
    `--pathspec-from-file` instead.

    To keep backwards-compatibility for some grace period, we therefore
    reinstate the `--stdin` option on top of the `--pathspec-from-file`
    option, but mark it firmly as deprecated.

    Helped-by: Victoria Dye <vdye@github.com>
    Helped-by: Matthew John Cheetham <mjcheetham@outlook.com>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 2711b9ca0a7639de281c8f289c52083823728e03
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Nov 25 11:26:41 2021 +0100

    Partially un-revert "editor: save and reset terminal after calling EDITOR"

    In e3f7e01b50be (Revert "editor: save and reset terminal after calling
    EDITOR", 2021-11-22), we reverted the commit wholesale where the
    terminal state would be saved and restored before/after calling an
    editor.

    The reverted commit was intended to fix a problem with Windows Terminal
    where simply calling `vi` would cause problems afterwards.

    To fix the problem addressed by the revert, but _still_ keep the problem
    with Windows Terminal fixed, let's revert the revert, with a twist: we
    restrict the save/restore _specifically_ to the case where `vi` (or
    `vim`) is called, and do not do the same for any other editor.

    This should still catch the majority of the cases, and will bridge the
    time until the original patch is re-done in a way that addresses all
    concerns.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 281b4c972ab365c75653b88c4628698f8bc27858
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Tue Sep 29 13:50:59 2020 +0200

    Add a GitHub workflow to monitor component updates

    Rather than using private IFTTT Applets that send mails to this
    maintainer whenever a new version of a Git for Windows component was
    released, let's use the power of GitHub workflows to make this process
    publicly visible.

    This workflow monitors the Atom/RSS feeds, and opens a ticket whenever a
    new version is released.

    Note: Bash sometimes releases multiple patched versions within a few
    minutes of each other (i.e. 5.1p1 through 5.1p4, 5.0p15 and 5.0p16). The
    MSYS2 runtime also has a similar system. We can address those patches as
    a group, so we shouldn't get multiple issues about them.

    Note further: We're not acting on newlib releases, OpenSSL alphas, Perl
    release candidates or non-stable Perl releases. There's no need to open
    issues about them.

    Co-authored-by: Matthias Aßhauer <mha1993@live.de>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit a5cc82fd8332925e77b1acf9714959e47ff67932
Author: Neeraj K. Singh <neerajsi@microsoft.com>
Date:   Wed Oct 27 14:22:42 2021 -0700

    mingw: do not call xutftowcs_path in mingw_mktemp

    The `xutftowcs_path` function canonicalizes absolute paths using GetFullPathNameW.
    This canonicalization may change the length of the string (e.g. getting rid of \.\),
    which breaks callers that pass the template string in a strbuf and expect the
    length of the string to remain the same.

    In my particular case, the tmp-objdir code is passing a strbuf to mkdtemp and is
    breaking since the strbuf.len is no longer synchronized with strlen(strbuf.buf).

    Signed-off-by: Neeraj K. Singh <neerajsi@microsoft.com>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 7e12ac9200f6795d579a77b01d866a9123a81b74
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Mon Apr 23 00:24:29 2018 +0200

    mingw: really handle SIGINT

    Previously, we did not install any handler for Ctrl+C, but now we really
    want to because the MSYS2 runtime learned the trick to call the
    ConsoleCtrlHandler when Ctrl+C was pressed.

    With this, hitting Ctrl+C while `git log` is running will only terminate
    the Git process, but not the pager. This finally matches the behavior on
    Linux and on macOS.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 1c88e62100f1cdf513bb74e9bbe7bbf018f1d6c6
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Wed May 17 17:05:09 2017 +0200

    mingw: kill child processes in a gentler way

    The TerminateProcess() function does not actually leave the child
    processes any chance to perform any cleanup operations. This is bad
    insofar as Git itself expects its signal handlers to run.

    A symptom is e.g. a left-behind .lock file that would not be left behind
    if the same operation was run, say, on Linux.

    To remedy this situation, we use an obscure trick: we inject a thread
    into the process that needs to be killed and let that thread run the
    ExitProcess() function with the desired exit status. Thanks J Wyman for
    describing this trick.

    The advantage is that the ExitProcess() function lets the atexit
    handlers run. While this is still different from what Git expects (i.e.
    running a signal handler), in practice Git sets up signal handlers and
    atexit handlers that call the same code to clean up after itself.

    If the gentle method to terminate the process fails, we still
    fall back to calling TerminateProcess(), but in that case we now also
    make sure that processes spawned by the spawned process are terminated;
    TerminateProcess() does not give the spawned process a chance to do so
    itself.

    Please note that this change only affects how Git for Windows tries to
    terminate processes spawned by Git's own executables. Third-party
    software that *calls* Git and wants to terminate it *still* needs to make
    sure to imitate this gentle method, otherwise this patch will not have
    any effect.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 0a57a784e6a1b684ed4f5f6bf2b4e96e62261e3f
Author: xungeng li <xungeng@gmail.com>
Date:   Wed Jun 7 20:26:33 2023 +0800

    mingw: optionally enable wsl compatibility file mode bits

    The Windows Subsystem for Linux (WSL) version 2 allows using `chmod` on
    NTFS volumes provided that they are mounted with metadata enabled (see
    https://devblogs.microsoft.com/commandline/chmod-chown-wsl-improvements/
    for details), for example:

    	$ chmod 0755 /mnt/d/test/a.sh

    In order to facilitate better collaboration between the Windows
    version of Git and the WSL version of Git, we can make the Windows
    version of Git also support reading and writing NTFS file modes
    in a manner compatible with WSL.

    Since this slightly slows down operations where lots of files are
    created (such as an initial checkout), this feature is only enabled when
    `core.WSLCompat` is set to true. Note that you also have to set
    `core.fileMode=true` in repositories that have been initialized without
    enabling WSL compatibility.

    There are several ways to enable metadata loading for NTFS volumes
    in WSL, one of which is to modify `/etc/wsl.conf` by adding:

    ```
    [automount]
    enabled = true
    options = "metadata,umask=027,fmask=117"
    ```

    And reboot WSL.

    It can also be enabled temporarily by this incantation:

    	$ sudo umount /mnt/c &&
    	  sudo mount -t drvfs C: /mnt/c -o metadata,uid=1000,gid=1000,umask=22,fmask=111

    It's important to note that this modification is compatible with, but
    does not depend on, WSL. The helper functions in this commit operate
    independently and function normally on devices where WSL is not
    installed or not properly configured.

    Signed-off-by: xungeng li <xungeng@gmail.com>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit c576da7398572a272d490aea4d42f783130529dc
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Jul 20 00:23:26 2017 +0200

    mingw: add a Makefile target to copy test artifacts

    The Makefile target `install-mingit-test-artifacts` simply copies stuff
    and things directly into a MinGit directory, including an init.bat
    script to set everything up so that the tests can be run in a cmd
    window.

    Sadly, Git's test suite still relies on a Perl interpreter even if
    compiled with NO_PERL=YesPlease. We punt for now, installing a small
    script into /usr/bin/perl that hands off to an existing Perl of a Git
    for Windows SDK.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 68a9e1049d8af655ffdc19513654f4cb79116825
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Jul 7 10:15:36 2017 +0200

    t9200: skip tests when $PWD contains a colon

    On Windows, the current working directory is pretty much guaranteed to
    contain a colon. If we feed that path to CVS, it mistakes it for a
    separator between host and port, though.

    This has not been a problem so far because Git for Windows uses MSYS2's
    Bash using a POSIX emulation layer that also pretends that the current
    directory is a Unix path (at least as long as we're in a shell script).

    However, that is rather limiting, as Git for Windows also explores other
    ports of other Unix shells. One of those is BusyBox-w32's ash, which is
    a native port (i.e. *not* using any POSIX emulation layer, and certainly
    not emulating Unix paths).

    So let's just detect if there is a colon in $PWD and punt in that case.
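
    A minimal sketch of such a check, assuming a hypothetical helper named
    `path_contains_colon`; how the actual test script reports the skip is
    not shown here:

    ```shell
    # Hypothetical helper: succeed if the given path contains a colon.
    path_contains_colon () {
    	case "$1" in
    	*:*) return 0 ;;
    	*) return 1 ;;
    	esac
    }

    if path_contains_colon "$PWD"
    then
    	echo "skipping: \$PWD contains a colon" >&2
    fi
    ```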

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 895c312461bb353a31cc6b1bb8460cb65a75fe86
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Wed Jul 5 15:14:50 2017 +0200

    t5813: allow for $PWD to be a Windows path

    Git for Windows uses MSYS2's Bash to run the test suite, which comes
    with benefits but also at a heavy price: on the plus side, MSYS2's
    POSIX emulation layer allows us to continue pretending that we are on a
    Unix system, e.g. use Unix paths instead of Windows ones, yet this is
    bought at a rather noticeable performance penalty.

    There *are* some more native ports of Unix shells out there, though,
    most notably BusyBox-w32's ash. These native ports do not use any POSIX
    emulation layer (or at most a *very* thin one, choosing to avoid
    features such as fork() that are expensive to emulate on Windows), and
    they use native Windows paths (usually with forward slashes instead of
    backslashes, which is perfectly legal in almost all use cases).

    And here comes the problem: with a $PWD looking like, say,
    C:/git-sdk-64/usr/src/git/t/trash directory.t5813-proto-disable-ssh
    Git's test scripts get quite a bit confused, as their assumptions have
    been shattered. Not only does this path contain a colon (oh no!), it
    also does not start with a slash.

    This is a problem e.g. when constructing a URL as t5813 does it:
    ssh://remote$PWD. Not only is it impossible to separate the "host" from
    the path with a $PWD as above, even prefixing $PWD by a slash won't
    work, as /C:/git-sdk-64/... is not a valid path.

    As a workaround, detect when $PWD does not start with a slash on
    Windows, and simply strip the drive prefix, using an obscure feature of
    Windows paths: if an absolute Windows path starts with a slash, it is
    implicitly prefixed by the drive prefix of the current directory. As we
    are talking about the current directory here, anyway, that strategy
    works.
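
    The prefix stripping can be sketched as follows, assuming a hypothetical
    helper named `strip_drive_prefix` and a path that uses forward slashes
    as in the example above:

    ```shell
    # Hypothetical helper: drop a leading drive prefix such as "C:" from
    # an absolute Windows path written with forward slashes. Paths that
    # do not start with a drive prefix are printed unchanged.
    strip_drive_prefix () {
    	case "$1" in
    	[A-Za-z]:/*) printf '%s\n' "${1#?:}" ;;
    	*) printf '%s\n' "$1" ;;
    	esac
    }
    ```

    With this, `strip_drive_prefix "C:/git-sdk-64/usr/src/git/t"` prints
    `/git-sdk-64/usr/src/git/t`, which Windows resolves relative to the
    drive of the current directory.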

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 9cc63e93ae3d0bd959c09c2ef321777d71d60497
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Jul 21 13:24:55 2017 +0200

    t5605: special-case hardlink test for BusyBox-w32

    When t5605 tries to verify that files are hardlinked (or that they are
    not), it uses the `-links` option of the `find` utility.

    BusyBox' implementation does not support that option, and BusyBox-w32's
    lstat() does not even report the number of hard links correctly (for
    performance reasons).

    So let's just switch to a different method that actually works on
    Windows.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 40c1fdd50bc71a91773792a86a16034e2f880bd2
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Jul 21 12:48:33 2017 +0200

    t5532: workaround for BusyBox on Windows

    While it may seem super convenient to some old Unix hands to simply
    require Perl to be available when running the test suite, this is a
    major hassle on Windows, where we want to verify that Perl is not,
    actually, required in a NO_PERL build.

    As a super ugly workaround, we "install" a script into /usr/bin/perl
    reading like this:

    	#!/bin/sh

    	# We'd much rather avoid requiring Perl altogether when testing
    	# an installed Git. Oh well, that's why we cannot have nice
    	# things.
    	exec c:/git-sdk-64/usr/bin/perl.exe "$@"

    The problem with that is that BusyBox assumes that the #! line in a
    script refers to an executable, not to a script. So when it encounters
    the line #!/usr/bin/perl in t5532's proxy-get-cmd, it barfs.

    Let's help this situation by simply executing the Perl script with the
    "interpreter" specified explicitly.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit cc77bda048f544e7469f7b39ea584431faef7475
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Sat Aug 5 21:36:01 2017 +0200

    t5003: use binary file from t/lib-diff/

    At some stage, t5003-archive-zip wants to add a file that is not ASCII.
    To that end, it uses /bin/sh. But that file may actually not exist (it
    is too easy to forget that not all the world is Unix/Linux...)! Besides,
    we already have perfectly fine binary files intended for use solely by
    the tests. So let's use one of them instead.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit e31674004b7bbc4205ab017bdfa0912d8500612b
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Wed Jul 19 17:07:56 2017 +0200

    test-lib: add BUSYBOX prerequisite

    When running with BusyBox, we will want to avoid calling executables on
    the PATH that are implemented in BusyBox itself.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit ee93b74edbb7baafc3d937a29e70b69c6ea54056
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Jun 30 22:32:33 2017 +0200

    tests (mingw): remove Bash-specific pwd option

    The -W option is only understood by MSYS2 Bash's pwd command. We already
    make sure to override `pwd` by `builtin pwd -W` for MINGW, so let's not
    double the effort here.

    This will also help when switching the shell to another one (such as
    BusyBox' ash) whose pwd does *not* understand the -W option.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit c242e5bae9f7ba61d0c5a40a94a85240f925be23
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Jun 30 00:35:40 2017 +0200

    mingw: only use Bash-ism `builtin pwd -W` when available

    Traditionally, Git for Windows' SDK uses Bash as its default shell.
    However, other Unix shells are available, too. Most notably, the Win32
    port of BusyBox comes with `ash` whose `pwd` command already prints
    Windows paths as Git for Windows wants them, while there is not even a
    `builtin` command.

    Therefore, let's be careful not to override `pwd` unless we know that
    the `builtin` command is available.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 416f9924f5b2d9dc491b1e18d3df868136c6bafd
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Mon Nov 19 20:34:13 2018 +0100

    tests: use the correct path separator with BusyBox

    BusyBox-w32 is a true Win32 application, i.e. it does not come with a
    POSIX emulation layer.

    That also means that it does *not* use the Unix convention of separating
    the entries in the PATH variable using colons, but semicolons.

    However, there are also BusyBox ports to Windows which use a POSIX
    emulation layer such as Cygwin's or MSYS2's runtime, i.e. using colons
    as PATH separators.

    As a tell-tale, let's use the presence of semicolons in the PATH
    variable: on Unix, it is highly unlikely that it contains semicolons,
    and on Windows (without POSIX emulation), it is virtually guaranteed, as
    everybody should have both $SYSTEMROOT and $SYSTEMROOT/system32 in their
    PATH.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
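
The tell-tale described in this commit message can be sketched in a few lines of Python. This only illustrates the heuristic and is not the shell code used by the test suite; the function name `guess_path_separator` is hypothetical.

```python
def guess_path_separator(path_value):
    """Guess the PATH list separator from the value of PATH itself.

    Heuristic from the commit message: a semicolon in PATH suggests a
    native Win32 environment; otherwise assume the Unix convention.
    """
    return ";" if ";" in path_value else ":"
```

A caller would pass the current value of the PATH variable and split on whatever the function returns.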

commit ece253d9dab1d398b9a897caf0324dcd64c0a662
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Tue Jul 18 01:15:40 2017 +0200

    tests: only override sort & find if there are usable ones in /usr/bin/

    The idea is to allow running the test suite on MinGit with BusyBox
    installed in /mingw64/bin/sh.exe. In that case, we will want to exclude
    sort & find (and other Unix utilities) from being bundled.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 87fcc0611ca22bbad574c939ab32f955fb0e6507
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Sat Aug 5 20:28:37 2017 +0200

    tests: move test PNGs into t/lib-diff/

    We already have a directory where we store files intended for use by
    multiple test scripts. The same directory is a better home for the
    test-binary-*.png files than t/.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 9679df2f5509f8b988b79812a1712ecd2c531916
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Oct 11 23:55:44 2018 +0200

    gitattributes: mark .png files as binary

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 271dace1e6bc34defdda72628732733cbd82d888
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Jul 20 22:25:21 2017 +0200

    tests(mingw): if `iconv` is unavailable, use `test-helper --iconv`

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 61271ae6f4a22d6b4bdee5f2cc5359c4dbc37e6e
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Jul 20 22:18:56 2017 +0200

    test-tool: learn to act as a drop-in replacement for `iconv`

    It is convenient to assume that everybody who wants to build & test Git
    has access to a working `iconv` executable (after all, we already pretty
    much require libiconv).

    However, that limits esoteric test scenarios such as Git for Windows',
    where an end user installation has to ship with `iconv` for the sole
    purpose of being testable. That payload serves no other purpose.

    So let's just have a test helper (to be able to test Git, the test
    helpers have to be available, after all) to act as `iconv` replacement.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit fb11e63d6d3430e62a8fc2372736034a8af929ed
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Sat Aug 5 22:23:36 2017 +0200

    test-lib: avoid unnecessary Perl invocation

    It is a bit strange, and even undesirable, to require Perl just to run
    the test suite even when NO_PERL was set.

    This patch does not fix this problem by any stretch of imagination.
    However, it fixes *the* Perl invocation that *every single* test script
    has to run.

    While at it, it makes the source code also more grep'able, as the code
    that unsets some, but not all, GIT_* environment variables just became a
    *lot* more explicit. And all that while still reducing the total number
    of lines.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 69165a9a4807e855b634f96e573ba8f120fc6ee4
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Jul 20 20:41:29 2017 +0200

    mingw: when path_lookup() failed, try BusyBox

    BusyBox comes with a ton of applets ("applet" being the identical
    concept to Git's "builtins"). And similar to Git's builtins, the applets
    can be called via `busybox <command>`, or the BusyBox executable can be
    copied/hard-linked to the command name.

    The similarities do not end here. Just as with Git's builtins, it is
    problematic that BusyBox' hard-linked applets cannot easily be put into
    a .zip file: .zip archives have no concept of hard-links and therefore
    would store identical copies (and also extract identical copies,
    "inflating" the archive unnecessarily).

    To counteract that issue, MinGit already ships without hard-linked
    copies of the builtins, and the plan is to do the same with BusyBox'
    applets: simply ship busybox.exe as single executable, without
    hard-linked applets.

    To accommodate that, Git is being taught by this commit a very special
    trick, exploiting the fact that it is possible to call an executable
    with a command-line whose argv[0] is different from the executable's
    name: when `sh` is to be spawned, and no `sh` is found in the PATH, but
    busybox.exe is, use that executable (with unchanged argv).

    Likewise, if any executable to be spawned is not on the PATH, but
    busybox.exe is found, parse the output of `busybox.exe --help` to find
    out what applets are included, and if the command matches an included
    applet name, use busybox.exe to execute it.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
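
The lookup order described above might look roughly like the following Python sketch. It is not the C code in Git. It assumes that the help text contains a line "Currently defined functions:" followed by a comma-separated applet list, and both function names are hypothetical.

```python
def parse_busybox_applets(help_text):
    """Extract applet names from the text printed by `busybox --help`.

    Assumption: the applet list follows the marker line and is
    comma-separated, possibly wrapped over several lines.
    """
    marker = "Currently defined functions:"
    _, found, tail = help_text.partition(marker)
    if not found:
        return set()
    names = tail.replace("\n", " ").split(",")
    return {name.strip() for name in names if name.strip()}


def resolve_command(command, on_path, busybox_path, help_text):
    """Return the executable to spawn for `command`, or None.

    `on_path` maps command names to executables found on the PATH.
    """
    if command in on_path:
        return on_path[command]
    if busybox_path and command in parse_busybox_applets(help_text):
        # argv stays unchanged; only the executable is substituted.
        return busybox_path
    return None
```

In this sketch `sh` is treated like any other applet, whereas the commit message describes it as a separate special case.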

commit 3feb8f7dfb06fb497fcc8472497b05a04c1fe209
Author: Bert Belder <bertbelder@gmail.com>
Date:   Fri Oct 26 23:42:09 2018 +0200

    Win32: symlink: add test for `symlink` attribute

    To verify that the symlink is resolved correctly, we use the fact that
    `git.exe` is a native Win32 program, and that `git.exe config -f <path>`
    therefore uses the native symlink resolution.

    Signed-off-by: Bert Belder <bertbelder@gmail.com>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit f93344df80b4572059b2a73de44512ee4bfcba84
Author: David Lomas <dl3@pale-eds.co.uk>
Date:   Fri Jul 28 15:20:43 2023 +0100

    mingw: work around rename() failing on a read-only file

    At least on _some_ APFS network shares, Git fails to rename the object
    files because they are marked as read-only, because that has the effect
    of setting the uchg flag on APFS, which then means the file can't be
    renamed or deleted.

    To work around that, when a rename failed, and the read-only flag is
    set, try to turn it off and on again.

    This fixes https://github.com/git-for-windows/git/issues/4482

    Signed-off-by: David Lomas <dl3@pale-eds.co.uk>
    Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
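
A minimal Python sketch of the "off and on again" idea follows. It uses the POSIX write bit as a stand-in for the Windows read-only attribute, and the function name and the injectable `rename` parameter are hypothetical, so this is an illustration rather than the actual fix.

```python
import os
import stat


def rename_with_readonly_workaround(src, dst, rename=os.rename):
    """Rename src to dst, retrying once with the read-only flag cleared."""
    try:
        rename(src, dst)
        return
    except PermissionError:
        mode = os.stat(src).st_mode
        if mode & stat.S_IWUSR:
            raise  # not read-only, so the workaround does not apply
    # Turn the read-only flag off, retry, then turn it back on.
    os.chmod(src, mode | stat.S_IWUSR)
    try:
        rename(src, dst)
    except OSError:
        os.chmod(src, mode)  # restore the flag on the untouched source
        raise
    os.chmod(dst, mode)
```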

commit 426566b1e42954be10d43b8367342cffe03f718e
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Mon Apr 23 23:20:00 2018 +0200

    mingw: Windows Docker volumes are *not* symbolic links

    ... even if they may look like them.

    As looking up the target of the "symbolic link" (just to see whether it
    starts with `/ContainerMappedDirectories/`) is pretty expensive, we
    do it only when we can be *really* sure that there is a possibility that
    this might be the case.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
    Signed-off-by: JiSeop Moon <zcube@zcube.kr>

commit db57f75bc17c4fad6ae9e345e321ac8ffff6ef3e
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Jul 20 22:45:01 2017 +0200

    mingw: explicitly specify with which cmd to prefix the cmdline

    The main idea of this patch is that even if we have to look up the
    absolute path of the script, if only the basename was specified as
    argv[0], then we should use that basename on the command line, too, not
    the absolute path.

    This patch will also help with the upcoming patch where we automatically
    substitute "sh ..." by "busybox sh ..." if "sh" is not in the PATH but
    "busybox" is: we will do that by substituting the actual executable, but
    still keep prepending "sh" to the command line.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 2251b4fce26326ae9c0f163a7f32b00d226b801e
Author: Bert Belder <bertbelder@gmail.com>
Date:   Fri Oct 26 11:51:51 2018 +0200

    mingw: allow to specify the symlink type in .gitattributes

    On Windows, symbolic links have a type: a "file symlink" must point at
    a file, and a "directory symlink" must point at a directory. If the
    type of symlink does not match its target, it doesn't work.

    Git does not record the type of symlink in the index or in a tree. On
    checkout it'll guess the type, which only works if the target exists
    at the time the symlink is created. This may often not be the case,
    for example when the link points at a directory inside a submodule.

    By specifying `symlink=file` or `symlink=dir` the user can specify what
    type of symlink Git should create, so Git doesn't have to rely on
    unreliable heuristics.

    Signed-off-by: Bert Belder <bertbelder@gmail.com>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 3dbd00ff85659be1f0cc300fda9b1caae4b39e1f
Author: JiSeop Moon <zcube@zcube.kr>
Date:   Mon Apr 23 22:35:26 2018 +0200

    mingw: move the file_attr_to_st_mode() function definition

    In preparation for making this function a bit more complicated (to allow
    for special-casing the `ContainerMappedDirectories` in Windows
    containers, which look like a symbolic link, but are not), let's move it
    out of the header.

    Signed-off-by: JiSeop Moon <zcube@zcube.kr>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 2a564d1c88b7bcf970b41afe94997a3ecb11b225
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Mon Feb 11 14:19:18 2019 +0100

    Introduce helper to create symlinks that knows about index_state

    On Windows, symbolic links actually have a type depending on the target:
    it can be a file or a directory.

    In certain circumstances, this poses problems, e.g. when a symbolic link
    is supposed to point into a submodule that is not checked out, so there
    is no way for Git to auto-detect the type.

    To help with that, we will add support over the course of the next
    commits to specify that symlink type via the Git attributes. This
    requires an index_state, though, something that Git for Windows'
    `symlink()` replacement cannot know about because the function signature
    is defined by the POSIX standard and not ours to change.

    So let's introduce a helper function to create symbolic links that
    *does* know about the index_state.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 7c48a30c9c1a82f25137192c1a3ad23ce2df45f4
Author: JiSeop Moon <zcube@zcube.kr>
Date:   Mon Apr 23 22:31:42 2018 +0200

    mingw: when running in a Windows container, try to rename() harder

    It is a known issue that a rename() can fail with an "Access denied"
    error at times, when copying followed by deleting the original file
    works. Let's just fall back to that behavior.

    Signed-off-by: JiSeop Moon <zcube@zcube.kr>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 954733566ebe8baf18f0f64fc56b27094e8d8273
Author: Bert Belder <bertbelder@gmail.com>
Date:   Fri Oct 26 11:13:45 2018 +0200

    Win32: symlink: move phantom symlink creation to a separate function

    Signed-off-by: Bert Belder <bertbelder@gmail.com>

commit bd2c03e21439ab4399fabd3dd0735bed03393e0a
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Thu Jun 4 23:16:07 2020 +0200

    mingw: special-case index entries for symlinks with buggy size

    In https://github.com/git-for-windows/git/pull/2637, we fixed a bug
    where symbolic links' target path sizes were recorded incorrectly in the
    index. The downside of this fix was that every user with tracked
    symbolic links in their checkouts would see them as modified in `git
    status`, but not in `git diff`, and only a `git add <path>` (or `git add
    -u`) would "fix" this.

    Let's do better than that: we can detect that situation and simply
    pretend that a symbolic link with a known bad size (or a size that just
    happens to be that bad size, a _very_ unlikely scenario because it would
    overflow our buffers due to the trailing NUL byte) means that it needs
    to be re-checked as if we had just checked it out.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 783d2ff71119dc92a4c3c079f3cbe8cca6eb16cf
Author: JiSeop Moon <zcube@zcube.kr>
Date:   Mon Apr 23 22:30:18 2018 +0900

    mingw: introduce code to detect whether we're inside a Windows container

    This will come in handy in the next commit.

    Signed-off-by: JiSeop Moon <zcube@zcube.kr>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 048c98b04bb5a1bb785d40ba702f9847fa9d0740
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Mon Mar 2 21:54:29 2020 +0100

    mingw: emulate stat() a little more faithfully

    When creating directories via `safe_create_leading_directories()`, we
    might encounter an already-existing directory which is not
    readable by the current user. To handle that situation, Git's code calls
    `stat()` to determine whether we're looking at a directory.

    In such a case, `CreateFile()` will fail, though, no matter what, and
    consequently `mingw_stat()` will fail, too. But POSIX semantics seem to
    still allow `stat()` to go forward.

    So let's call `mingw_lstat()` to the rescue if we fail to get a file
    handle due to denied permission in `mingw_stat()`, and fill the stat
    info that way.

    We need to be careful to not allow this to go forward in case that we're
    looking at a symbolic link: to resolve the link, we would still have to
    create a file handle, and we just found out that we cannot. Therefore,
    `stat()` still needs to fail with `EACCES` in that case.

    This fixes https://github.com/git-for-windows/git/issues/2531.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
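
The fallback rule in this commit message could be sketched as below. This is a Python illustration only: `stat_fn` and `lstat_fn` are hypothetical parameters that stand in for `mingw_stat()` and `mingw_lstat()`.

```python
import os
import stat


def stat_with_lstat_fallback(path, stat_fn=os.stat, lstat_fn=os.lstat):
    """stat() that falls back to lstat() when access is denied.

    The fallback is refused for symbolic links, because resolving the
    link would need the very handle that could not be opened.
    """
    try:
        return stat_fn(path)
    except PermissionError:
        info = lstat_fn(path)
        if stat.S_ISLNK(info.st_mode):
            raise  # still fails with EACCES for symbolic links
        return info
```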

commit cb8f3ed7022fa1ddf2097da3db9f69ddfa968540
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Tue May 30 21:50:57 2017 +0200

    mingw: try to create symlinks without elevated permissions

    With Windows 10 Build 14972 in Developer Mode, a new flag is supported
    by CreateSymbolicLink() to create symbolic links even when running
    outside of an elevated session (which was previously required).

    This new flag is called SYMBOLIC_LINK_FLAG_ALLOW_UNPRIVILEGED_CREATE and
    has the numeric value 0x02.

    Previous Windows 10 versions will not understand that flag and return an
    ERROR_INVALID_PARAMETER, therefore we have to be careful to try passing
    that flag only when the build number indicates that it is supported.

    For more information about the new flag, see this blog post:
    https://blogs.windows.com/buildingapps/2016/12/02/symlinks-windows-10/

    This patch is loosely based on the patch submitted by Samuel D. Leslie
    as https://github.com/git-for-windows/git/pull/1184.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
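
The build-number check can be illustrated with a short Python sketch. The flag value 0x02 and the build number 14972 come from the commit message. The directory flag 0x01 is the Win32 constant `SYMBOLIC_LINK_FLAG_DIRECTORY`, which this commit message does not mention, and the function name is hypothetical.

```python
SYMBOLIC_LINK_FLAG_DIRECTORY = 0x01
SYMBOLIC_LINK_FLAG_ALLOW_UNPRIVILEGED_CREATE = 0x02
FIRST_BUILD_WITH_UNPRIVILEGED_FLAG = 14972


def symlink_flags(build_number, is_directory=False):
    """Compute the flags to pass when creating a symbolic link."""
    flags = SYMBOLIC_LINK_FLAG_DIRECTORY if is_directory else 0
    # Older builds reject the unknown flag with ERROR_INVALID_PARAMETER,
    # so only add it when the build is known to support it.
    if build_number >= FIRST_BUILD_WITH_UNPRIVILEGED_FLAG:
        flags |= SYMBOLIC_LINK_FLAG_ALLOW_UNPRIVILEGED_CREATE
    return flags
```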

commit cdebdc1d15e108a7b2d2671e38e60499954ec3c3
Author: Karsten Blees <blees@dcon.de>
Date:   Sun May 24 01:48:35 2015 +0200

    Win32: symlink: add support for symlinks to directories

    Symlinks on Windows have a flag that indicates whether the target is a file
    or a directory. Symlinks of wrong type simply don't work. This even affects
    core Win32 APIs (e.g. DeleteFile() refuses to delete directory symlinks).

    However, CreateFile() with FILE_FLAG_BACKUP_SEMANTICS doesn't seem to care.
    Check the target type by first creating a tentative file symlink, opening
    it, and checking the type of the resulting handle. If it is a directory,
    recreate the symlink with the directory flag set.

    It is possible to create symlinks before the target exists (or in case of
    symlinks to symlinks: before the target type is known). If this happens,
    create a tentative file symlink and postpone the directory decision: keep
    a list of phantom symlinks to be processed whenever a new directory is
    created in mingw_mkdir().

    Limitations: This algorithm may fail if a link target changes from file to
    directory or vice versa, or if the target directory is created in another
    process.

    Signed-off-by: Karsten Blees <blees@dcon.de>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
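
The phantom symlink bookkeeping can be modelled with a toy in-memory file system. The Python sketch below is a simplification under stated assumptions: the class `FakeFs` is hypothetical, paths are compared verbatim, and symlinks to symlinks are ignored.

```python
class FakeFs:
    """In-memory stand-in for a file system (for illustration only)."""

    def __init__(self):
        self.dirs = set()
        self.files = set()
        self.symlinks = {}  # link path -> (target, "file" or "dir")
        self.phantoms = []  # links whose type could not be decided yet

    def symlink(self, target, link):
        # Start with a tentative file symlink.
        self.symlinks[link] = (target, "file")
        if target in self.dirs:
            # The target is a directory: recreate as a directory symlink.
            self.symlinks[link] = (target, "dir")
        elif target not in self.files:
            # The target does not exist yet: postpone the decision.
            self.phantoms.append(link)

    def mkdir(self, path):
        self.dirs.add(path)
        still_pending = []
        for link in self.phantoms:
            target, _ = self.symlinks[link]
            if target == path:
                self.symlinks[link] = (target, "dir")
            else:
                still_pending.append(link)
        self.phantoms = still_pending
```

The limitation named in the commit message shows up here as well: a directory created by another process never passes through `mkdir()`, so its phantom links would stay file symlinks.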

commit 7eb9bc36a16cbbfb90cf07d8e274de36198ed42b
Author: Karsten Blees <blees@dcon.de>
Date:   Sun May 24 01:32:03 2015 +0200

    Win32: implement basic symlink() functionality (file symlinks only)

    Implement symlink() that always creates file symlinks. Fails with ENOSYS
    if symlinks are disabled or unsupported.

    Note: CreateSymbolicLinkW() was introduced with symlink support in Windows
    Vista. For compatibility with Windows XP, we need to load it dynamically
    and fail gracefully if it isn't available.

    Signed-off-by: Karsten Blees <blees@dcon.de>

commit d039b811d62069194c5625117311e37edf825b50
Author: Bill Zissimopoulos <billziss@navimatics.com>
Date:   Thu May 28 16:35:57 2020 -0700

    mingw: lstat: compute correct size for symlinks

    This commit fixes mingw_lstat by computing the proper size for symlinks
    according to POSIX. POSIX specifies that upon successful return from
    lstat: "the value of the st_size member shall be set to the length of
    the pathname contained in the symbolic link not including any
    terminating null byte".

    Prior to this commit the mingw_lstat function returned a fixed size of
    4096. This caused problems in git repositories that were accessed by
    git for Cygwin or git for WSL. For example, doing `git reset --hard`
    using git for Windows would update the size of symlinks in the index
    to be 4096; at a later time git for Cygwin or git for WSL would find
    that symlinks have changed size during `git status`. Vice versa doing
    `git reset --hard` in git for Cygwin or git for WSL would update the
    size of symlinks in the index with the correct value, only for git for
    Windows to find incorrectly at a later time that the size had changed.

    Signed-off-by: Bill Zissimopoulos <billziss@navimatics.com>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
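
The POSIX rule quoted above can be checked on any platform whose `lstat()` follows it. The following Python helper is a sketch with a hypothetical name, not part of Git.

```python
import os


def symlink_size_is_posix_conformant(link_path):
    """True if lstat() reports the length of the link target as st_size."""
    target = os.readlink(link_path)
    # st_size counts bytes and excludes any terminating null byte.
    return os.lstat(link_path).st_size == len(os.fsencode(target))
```

An implementation that always reports 4096, as described above, would fail this check for nearly every link.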

commit ff6e3753038838be224f2f0b38670ba9acb39df6
Author: Karsten Blees <blees@dcon.de>
Date:   Sun May 24 01:24:41 2015 +0200

    Win32: implement readlink()

    Implement readlink() by reading NTFS reparse points. Works for symlinks
    and directory junctions. If symlinks are disabled, fail with ENOSYS.

    Signed-off-by: Karsten Blees <blees@dcon.de>

commit 9b5c73a65f3670c13c1eabe119c0c89363ce6a0c
Author: Karsten Blees <blees@dcon.de>
Date:   Sun May 24 01:17:31 2015 +0200

    Win32: mingw_chdir: change to symlink-resolved directory

    If symlinks are enabled, resolve all symlinks when changing directories,
    as required by POSIX.

    Note: Git's real_path() function bases its link resolution algorithm on
    this property of chdir(). Unfortunately, the current directory on Windows
    is limited to only MAX_PATH (260) characters. Therefore using symlinks and
    long paths in combination may be problematic.

    Signed-off-by: Karsten Blees <blees@dcon.de>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit 1c3cf1465db872d62cb260ee9a099aa8222578b1
Author: Karsten Blees <blees@dcon.de>
Date:   Tue May 19 22:42:48 2015 +0200

    Win32: mingw_rename: support renaming symlinks

    MSVCRT's _wrename() cannot rename symlinks over existing files: it returns
    success without doing anything. Newer MSVCR*.dll versions probably do not
    have this problem: according to CRT sources, they just call MoveFileEx()
    with the MOVEFILE_COPY_ALLOWED flag.

    Get rid of _wrename() and call MoveFileEx() with proper error handling.

    Signed-off-by: Karsten Blees <blees@dcon.de>

commit d241df32732a3e4c21b3465ac7f8905c9de12830
Author: Karsten Blees <blees@dcon.de>
Date:   Sun May 24 01:06:10 2015 +0200

    Win32: mingw_unlink: support symlinks to directories

    _wunlink() / DeleteFileW() refuses to delete symlinks to directories. If
    _wunlink() fails with ERROR_ACCESS_DENIED, try _wrmdir() as well.

    Signed-off-by: Karsten Blees <blees@dcon.de>
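
As a sketch of the fallback, assuming that an access-denied failure surfaces as `PermissionError`, the logic might be written as follows in Python. The function name and the injectable `unlink` and `rmdir` parameters are hypothetical.

```python
import os


def unlink_symlink(path, unlink=os.unlink, rmdir=os.rmdir):
    """Delete a link, falling back to rmdir() for directory symlinks."""
    try:
        unlink(path)
    except PermissionError:
        # Directory symlinks are refused by the file-deletion call on
        # Windows; removing them as a directory works instead.
        rmdir(path)
```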

commit a6f791abafd69161cbb065ce8deeae68a0728223
Author: Karsten Blees <blees@dcon.de>
Date:   Sat May 16 00:32:03 2015 +0200

    Win32: add symlink-specific error codes

    Signed-off-by: Karsten Blees <blees@dcon.de>

commit b9a96819930b28bbba13267887871cbb843d4025
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Fri Feb 23 02:50:03 2018 +0100

    mingw (git_terminal_prompt): do fall back to CONIN$/CONOUT$ method

    To support Git Bash running in a MinTTY, we use a dirty trick to access
    the MSYS2 pseudo terminal: we execute a Bash snippet that accesses
    /dev/tty.

    The idea was to fall back to writing to/reading from CONOUT$/CONIN$ if
    that Bash call failed because Bash was not found.

    However, we should fall back even in other error conditions, because we
    have not successfully read the user input. Let's make it so.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit d8e8d4805d6469ed3e3c9e7d4404cb42414f0caf
Author: Karsten Blees <blees@dcon.de>
Date:   Sun May 24 01:55:05 2015 +0200

    Win32: change default of 'core.symlinks' to false

    Symlinks on Windows don't work the same way as on Unix systems. E.g. there
    are different types of symlinks for directories and files, creating
    symlinks requires administrative privileges etc.

    By default, disable symlink support on Windows. I.e. users explicitly have
    to enable it with 'git config [--system|--global] core.symlinks true'.

    The test suite ignores system / global config files. Allow testing *with*
    symlink support by checking if native symlinks are enabled in MSys2 (via
    'MSYS=winsymlinks:nativestrict').

    Reminder: This would need to be changed if / when we find a way to run the
    test suite in a non-MSys-based shell (e.g. dash).

    Signed-off-by: Karsten Blees <blees@dcon.de>

commit 88e607cab871f5740cee61723ceadcaeadf17c4f
Author: Karsten Blees <blees@dcon.de>
Date:   Tue May 19 21:48:55 2015 +0200

    Win32: factor out retry logic

    The retry pattern is duplicated in three places. It also seems to be too
    hard to use: mingw_unlink() and mingw_rmdir() duplicate the code to retry,
    and both of them do so incompletely. They also do not restore errno if the
    user answers 'no'.

    Introduce a retry_ask_yes_no() helper function that handles retry with
    small delay, asking the user, and restoring errno.

    mingw_unlink: include _wchmod in the retry loop (which may fail if the
    file is locked exclusively).

    mingw_rmdir: include special error handling in the retry loop.

    Signed-off-by: Karsten Blees <blees@dcon.de>
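
A retry helper of this shape could look like the Python sketch below. The delay values and the `ask` callback are assumptions made for illustration, and raising the last exception plays the role of restoring errno.

```python
import time


def retry_ask_yes_no(operation, ask, delays_ms=(0, 1, 10, 20, 40)):
    """Run operation() until it succeeds or the user gives up.

    Each failed attempt is followed by a short delay. Once the delays
    are used up, ask(error) decides whether to start another round.
    """
    last_error = None
    while True:
        for delay in delays_ms:
            try:
                return operation()
            except OSError as err:
                last_error = err
                time.sleep(delay / 1000)
        if not ask(last_error):
            # Report the original failure, not one caused by prompting.
            raise last_error
```

Callers then wrap only the failing system call, which keeps the retry policy in one place instead of three.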

commit 8c63e2136907ad6472b3cbd98456735f0973d217
Author: Karsten Blees <blees@dcon.de>
Date:   Sat May 16 01:11:37 2015 +0200

    Win32: lstat(): return adequate stat.st_size for symlinks

    Git typically doesn't trust the stat.st_size member of symlinks (e.g. see
    strbuf_readlink()). However, some functions take shortcuts if st_size is 0
    (e.g. diff_populate_filespec()).

    In mingw_lstat() and fscache_lstat(), make sure to return an adequate size.

    The extra overhead of opening and reading the reparse point to calculate
    the exact size is not necessary, as git doesn't rely on the value anyway.

    Signed-off-by: Karsten Blees <blees@dcon.de>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit fce1ceb95c9267bec3ec05eb376ba9e792d60202
Author: Karsten Blees <blees@dcon.de>
Date:   Tue Jan 10 23:21:56 2017 +0100

    mingw: teach fscache and dirent about symlinks

    Move S_IFLNK detection to file_attr_to_st_mode() and reuse it in fscache.

    Implement DT_LNK detection in dirent.c and the fscache readdir version.

    Signed-off-by: Karsten Blees <blees@dcon.de>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit d90d4afa7220a16ef2628b1449f5d91d43c7329c
Author: Karsten Blees <blees@dcon.de>
Date:   Sun May 24 00:17:56 2015 +0200

    Win32: let mingw_lstat() error early upon problems with reparse points

    When obtaining lstat information for reparse points, we need to call
    FindFirstFile() in addition to GetFileAttributesEx() to obtain the type
    of the reparse point (symlink, mount point etc.). However, currently there
    is no error handling whatsoever if FindFirstFile() fails.

    Call FindFirstFile() before modifying the stat *buf output parameter and
    error out if the call fails.

    Note: The FindFirstFile() return value includes all the data that we get
    from GetFileAttributesEx(), so we could replace GetFileAttributesEx() with
    FindFirstFile(). We don't do that because GetFileAttributesEx() is about
    twice as fast for single files. I.e. we only pay the extra cost of calling
    FindFirstFile() in the rare case that we encounter a reparse point.

    Note: The indentation of the remaining reparse point code will be fixed in
    the next patch.

    Signed-off-by: Karsten Blees <blees@dcon.de>

commit 730279b83dea7c4fd91a6a905ca48966c5679bab
Author: Karsten Blees <blees@dcon.de>
Date:   Tue May 12 00:58:39 2015 +0200

    Win32: remove separate do_lstat() function

    With the new mingw_stat() implementation, do_lstat() is only called from
    mingw_lstat() (with follow == 0). Remove the extra function and the old
    mingw_stat()-specific (follow == 1) logic.

    Signed-off-by: Karsten Blees <blees@dcon.de>

commit e404d24d78940c9682f0e66ef7d2e68a26c8b90f
Author: Karsten Blees <blees@dcon.de>
Date:   Sat May 16 01:18:14 2015 +0200

    Win32: implement stat() with symlink support

    With respect to symlinks, the current stat() implementation is almost the
    same as lstat(): except for the file type (st_mode & S_IFMT), it returns
    information about the link rather than the target.

    Implement stat by opening the file with as few permissions as possible
    and calling GetFileInformationByHandle on it. This way, all link resolution
    is handled by the Windows file system layer.

    If symlinks are disabled, use lstat() as before, but fail with ELOOP if a
    symlink would have to be resolved.

    Signed-off-by: Karsten Blees <blees@dcon.de>
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

commit bf011e9eec04507605c77f7a826164bfe1ec2d98
Author: Karsten Blees <blees@dcon.de>
Date:   Tue May 12 11:09:01 2015 +0200

    Win32: don't call GetFileAttributes twice in mingw_lstat()

    GetFileAttributes cannot handle paths with trailing dir separator. The
    current [l]stat implementation calls GetFileAttributes twice if the path
    has trailing slashes (first with the original path passed to [l]stat, and
    a second time with a path copy with trailing '/' removed).

    With Unicode conversion, we get the length of the path for free and also
    have a (wide char) buffer that can be modified.

    Remove trailing directory separators before calling the Win32 API.

    Signed-off-by: Karsten Blees <blees@dcon.de>
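
Stripping the separators up front amounts to something like the following Python sketch. The function name is hypothetical, and edge cases such as drive roots are deliberately not handled here.

```python
def strip_trailing_dir_seps(path, seps="/\\"):
    """Remove trailing directory separators before an API call."""
    stripped = path.rstrip(seps)
    # Keep a lone separator so that the root directory stays addressable.
    return stripped if stripped else path[:1]
```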

commit 743fbb109b9f674eee17e4743b54ec068dff9bd9
Author: Karsten Blees <blees@dcon.de>
Date:   Mon May 11 19:58:14 2015 +0200

    lockfile.c: use is_dir_sep() instead of hardcoded '/' checks

    Signed-off-by: Karsten Blees <blees@dcon.de>

commit cae0ed2b69fc4cc92fbc0a209d69b76d593be9a9
Author: Karsten Blees <blees@dcon.de>
Date:   Mon May 11 22:15:40 2015 +0200

    strbuf_readlink: support link targets that exceed PATH_MAX

    strbuf_readlink() refuses to read link targets that exceed PATH_MAX (even
    if a sufficient size was specified by the caller).

    As some platforms support longer paths, remove this restriction (similar
    to strbuf_getcwd()).

    Signed-off-by: Karsten Blees <blees@dcon.de>

commit ab6cf4a85e9eb7feb01e9835b12a63ee88c7b502
Author: Karsten Blees <blees@dcon.de>
Date:   Mon May 11 19:54:23 2015 +0200

    strbuf_readlink: don't call readlink twice if hint is the exact link size

    strbuf_readlink() calls readlink() twice if the hint argument specifies the
    exact size of the link target (e.g. by passing stat.st_size as returned by
    lstat()). This is necessary because 'readlink(..., hint) == hint' could
    mean that the buffer was too small.

    Use hint + 1 as buffer size to prevent this.

    Signed-off-by: Karsten Blees <blees@dcon.de>
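
The effect of the extra byte can be shown with a small simulation. In this Python sketch `fake_readlink` mimics the truncating behaviour of `readlink()`, and the function counts how many calls the grow-and-retry loop needs. The names and the doubling strategy are assumptions made for illustration.

```python
def read_link(target, hint, extra=1):
    """Simulate a grow-and-retry readlink loop.

    Returns the bytes read and the number of readlink() calls made.
    """
    def fake_readlink(bufsize):
        # Like readlink(): silently truncates to the buffer size.
        return target[:bufsize]

    size = max(hint + extra, 1)
    calls = 0
    while True:
        calls += 1
        data = fake_readlink(size)
        if len(data) < size:
            return data, calls  # the buffer was definitely large enough
        size *= 2  # the result may be truncated: grow and retry
```

With an exact hint, `extra=0` needs two calls because a full buffer is ambiguous, while `extra=1` settles it in one.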

commit 0987f685d16c047ccb9e317857339812ab1255c7
Author: Ben Boeckel <mathstuf@gmail.com>
Date:   Fri Apr 22 09:06:23 2022 -0400

    clean: suggest using `core.longPaths` if paths are too long to remove

    On Windows, git repositories may have extra files which need to be cleaned
    (e.g., a build directory) that may be arbitrarily deep. Suggest using
    `core.longPaths` if such situations are encountered.

    Fixes: #2715
    Signed-off-by: Ben Boeckel <mathstuf@gmail.com>

commit 29f7afc8f88a589928379298c56bf527ad10185a
Author: Jeff Hostetler <jeffhost@microsoft.com>
Date:   Fri Mar 25 16:56:04 2022 -0400

    compat/fsmonitor/fsm-*-win32: support long paths

    Update wchar_t buffers to use MAX_LONG_PATH instead of MAX_PATH and call
    xutftowcs_long_path() in the Win32 backend source files.

    Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>

commit b8923c8fba0c566a57bd7192f41df8f6f666c1dc
Author: Johannes Schindelin <johannes.schindelin@gmx.de>
Date:   Wed Sep 6 09:14:47 2023 +0200

    win32(long path support): leave drive-less absolute paths intact

    When trying to ensure that long paths are handled correctly, we
    first normalize absolute paths as we encounter them.

    However, if the path is a so-called "drive-less" absolute path, i.e. if
    it is relative to the current drive but _does_ start with a directory
    separator, we would want the normalized path to be such a drive-less
    absolute path, too.

    Let's do that, being careful to still include the drive prefix when we
    need to go through the `\\?\` dance (because there, the drive prefix is
    absolutely required).

    This fixes https://github.com/git-for-windows/git/issues/4586.

    Signed-off-by: Johannes Schindelin <johannes.schindelin@gm…
dscho added a commit that referenced this pull request Nov 22, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.
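
For context, the default name-hash mentioned above is commonly described as a shift-and-add over the non-whitespace characters of the path. The Python transcription below is a sketch of that description for ASCII paths, not the code in Git, and it does not reproduce the new `--full-name-hash` function.

```python
def pack_name_hash(name):
    """Approximate the default name-hash for an ASCII path (bytes)."""
    h = 0
    for c in name:
        if c in b" \t\n\r":
            continue  # whitespace does not contribute
        # Earlier characters move two bits to the right each step, so
        # the end of the path dominates the 32-bit result.
        h = ((h >> 2) + (c << 24)) & 0xFFFFFFFF
    return h
```

Because older characters are shifted out after roughly sixteen steps, long paths that share their trailing characters tend to land on the same or nearly the same value, which is the kind of collision the new option is meant to reduce.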

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
dscho added a commit that referenced this pull request Nov 22, 2024
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.
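
As a rough way to read such a table, the sketch below parses a few of the rows shown above and ranks them by the ratio of inflated size to disk size. The ratio is my own derived metric, not something `git survey` prints, and `parse_rows` is a hypothetical helper that assumes the column layout shown above.

```python
# A subset of the rows from the table above, copied verbatim.
SURVEY_OUTPUT = """\
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
                                po/de.po |   210 |   1360112 |      71198603
"""


def parse_rows(text):
    """Return (path, count, disk_size, inflated_size) tuples from the table."""
    rows = []
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != 4 or not cells[1].isdigit():
            continue  # skip the header and separator lines
        path, count, disk, inflated = cells
        rows.append((path, int(count), int(disk), int(inflated)))
    return rows


rows = parse_rows(SURVEY_OUTPUT)
# Highest ratio first: these paths shrink the most once packed.
for path, count, disk, inflated in sorted(rows, key=lambda r: r[3] / r[2], reverse=True):
    print(f"{path:30} {count:5d} versions  {inflated / disk:6.1f}x")
```

In this subset, `compat/mingw.c` (6111 versions) shrinks by roughly 89x on disk, while the single-version `git-rebase--helper` binary shrinks by only about 2.5x, so a large disk size with a low ratio is a hint that the path is not packing well.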

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
Jeff Hostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
dscho added a commit that referenced this pull request Nov 22, 2024
dscho added a commit that referenced this pull request Nov 22, 2024
dscho added a commit that referenced this pull request Nov 22, 2024
dscho pushed a commit that referenced this pull request Nov 22, 2024
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but they focus on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.
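
The walk order described above can be sketched with a toy in-memory object store. Everything here is hypothetical: `path_walk`, the `commits` and `trees` dictionaries, and the object names are made up for illustration and are not the C API added by these patches. Annotated tags are left out, and each object is reported only at the first path where it is found.

```python
def path_walk(commits, trees):
    """Yield (kind, path, oids) batches: commits, root trees, then paths depth-first.

    commits: dict mapping commit id -> root tree id
    trees:   dict mapping tree id -> list of (name, kind, oid), kind is "tree" or "blob"
    """
    yield ("commit", "", list(commits))

    seen = set()
    root_batch = []
    for root in commits.values():
        if root not in seen:
            seen.add(root)
            root_batch.append(root)

    # Each stack entry is one batch: every object found at a single path.
    stack = [("tree", "", root_batch)]
    while stack:
        kind, path, oids = stack.pop()
        yield (kind, path, oids)
        if kind != "tree":
            continue

        # Expand all trees at this path together, grouping children by path.
        children = {}
        for tree_oid in oids:
            for name, child_kind, child_oid in trees[tree_oid]:
                if child_oid in seen:
                    continue
                seen.add(child_oid)
                suffix = "/" if child_kind == "tree" else ""
                children.setdefault((child_kind, path + name + suffix), []).append(child_oid)

        # Push in reverse so that batches pop in the order they were discovered.
        for (child_kind, child_path), child_oids in reversed(list(children.items())):
            stack.append((child_kind, child_path, child_oids))


commits = {"c1": "t1", "c2": "t2"}
trees = {
    "t1": [("README", "blob", "b1"), ("src", "tree", "s1")],
    "t2": [("README", "blob", "b2"), ("src", "tree", "s2")],
    "s1": [("main.c", "blob", "m1")],
    "s2": [("main.c", "blob", "m2")],
}
for batch in path_walk(commits, trees):
    print(batch)
```

Both versions of `src/main.c` arrive in the same batch, which is the locality that the commit-by-commit walk does not provide.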

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
dscho added a commit that referenced this pull request Nov 22, 2024
dscho added a commit that referenced this pull request Nov 22, 2024
dscho pushed a commit that referenced this pull request Nov 22, 2024
dscho added a commit that referenced this pull request Nov 22, 2024
dscho added a commit that referenced this pull request Nov 22, 2024
dscho pushed a commit that referenced this pull request Nov 22, 2024
dscho added a commit that referenced this pull request Nov 22, 2024
git-for-windows-ci pushed a commit that referenced this pull request Nov 25, 2024
git-for-windows-ci pushed a commit that referenced this pull request Nov 25, 2024
git-for-windows-ci pushed a commit that referenced this pull request Nov 25, 2024
git-for-windows-ci pushed a commit that referenced this pull request Nov 25, 2024
git-for-windows-ci pushed a commit that referenced this pull request Nov 25, 2024
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```
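
The per-path totals in a table like this can be thought of as a simple
group-and-sum over every object version found at each path. The sketch
below is only an illustration of that aggregation, not the `git survey`
implementation; the function name, the record layout of
`(path, disk_size, inflated_size)` and the sample numbers are all
assumptions made for the example.

```python
from collections import defaultdict


def top_paths_by_disk_size(records, limit=10):
    """Group (path, disk_size, inflated_size) records by path.

    Hypothetical helper: each record stands for one version of an
    object at a path. Returns (path, count, disk, inflated) rows,
    largest on-disk total first.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for path, disk_size, inflated_size in records:
        entry = totals[path]
        entry[0] += 1              # number of versions seen at this path
        entry[1] += disk_size      # compressed, on-disk bytes
        entry[2] += inflated_size  # uncompressed bytes
    rows = [(path, c, d, i) for path, (c, d, i) in totals.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    # Invented sample data, not taken from any real repository.
    sample = [
        ("a.txt", 100, 400),
        ("b.bin", 900, 1000),
        ("a.txt", 50, 300),
    ]
    for path, count, disk, inflated in top_paths_by_disk_size(sample):
        print(f"{path:>10} | {count:5} | {disk:9} | {inflated:13}")
```

Comparing the last two columns per path gives a rough sense of how well
the versions at that path compress and delta against each other.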

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this means that in the next `microsoft/git` rebase,
Jeff Hostetler's version will need to be pulled out because the two
versions conflict too heavily. These conflicts include how tables are
stored and generated, as the version in this PR is slightly more general
to allow for different kinds of data.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
git-for-windows-ci pushed a commit that referenced this pull request Nov 25, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This changes the name-hash value used
for finding delta bases so that it is computed from the full path name,
with a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.
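
To make the collision problem concrete, here is a minimal Python sketch.
`default_name_hash` is modeled on my understanding of the existing
`pack_name_hash()` shift-and-add loop, so treat it as an approximation
rather than a verified port. `stand_in_full_hash` is purely a
hypothetical stand-in that uses CRC-32 over the whole path; it is not
the algorithm this PR adds. Both function names and the sample paths are
invented for the example.

```python
import zlib

# Bytes skipped by the hash loop (assumed whitespace set).
GIT_WHITESPACE = b" \t\n\r"


def default_name_hash(path: str) -> int:
    """Approximation of the default name-hash: later bytes weigh most."""
    h = 0
    for byte in path.encode("utf-8"):
        if byte in GIT_WHITESPACE:
            continue
        # Older bytes are shifted toward the low bits and eventually
        # fall off, so mostly the end of the path survives.
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


def stand_in_full_hash(path: str) -> int:
    """Hypothetical stand-in: any hash that mixes in every byte."""
    return zlib.crc32(path.encode("utf-8"))


if __name__ == "__main__":
    a = "a/long_dir/file.txt"
    b = "b/long_dir/file.txt"
    print(default_name_hash(a) == default_name_hash(b))
    print(stand_in_full_hash(a) == stand_in_full_hash(b))
```

In this sketch the two paths differ only in their first byte, which has
been shifted out by the time the loop ends, so they share a default hash
while the full-path stand-in tells them apart. It also hints at the
trade-off described above: a hash of the whole path no longer places
similarly named files in other directories next to each other.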

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

Once this PR is merged, the more involved `--path-walk` option may be
considered.
git-for-windows-ci pushed a commit that referenced this pull request Nov 25, 2024
This is a follow-up to #5157 and is also motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but they focus on visiting a
single commit and then expanding the trees and blobs reachable from
that commit that have not yet been visited. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.
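
The batching order can be illustrated with a toy, in-memory model. This
is only a sketch of the idea, not the path-walk API itself: the
`path_walk` function, the dictionary-based object model and the object
names are all hypothetical, and annotated tags are left out for brevity.

```python
def path_walk(commits, trees):
    """Return batches of (kind, path, object ids) in walk order.

    commits: dict mapping commit id -> root tree id (toy model).
    trees:   dict mapping tree id -> {entry name: (kind, object id)},
             where kind is "tree" or "blob".
    """
    batches = [("commit", "", sorted(commits))]
    seen = set()  # each object is reported at the first path it appears at
    stack = [("", sorted(set(commits.values())))]
    while stack:
        path, tree_ids = stack.pop()
        batches.append(("tree", path, tree_ids))
        children = {}
        for tree_id in tree_ids:
            for name, (kind, oid) in trees[tree_id].items():
                if oid in seen:
                    continue
                seen.add(oid)
                children.setdefault((name, kind), []).append(oid)
        subtrees = []
        for (name, kind), oids in sorted(children.items()):
            if kind == "blob":
                # All versions of the file at this path arrive together.
                batches.append(("blob", path + name, oids))
            else:
                subtrees.append((path + name + "/", oids))
        # Reverse so the stack pops subtrees in sorted order (depth-first).
        stack.extend(reversed(subtrees))
    return batches


if __name__ == "__main__":
    commits = {"c1": "t1", "c2": "t2"}
    trees = {
        "t1": {"README": ("blob", "b1"), "src": ("tree", "s1")},
        "t2": {"README": ("blob", "b2"), "src": ("tree", "s1")},
        "s1": {"main.c": ("blob", "m1")},
    }
    for batch in path_walk(commits, trees):
        print(batch)
```

The point of the sketch is that both versions of `README` land in a
single batch, which is the locality that a commit-by-commit walk does
not provide.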

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
dscho added a commit to dscho/git that referenced this pull request Nov 25, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This changes the name-hash value used
for finding delta bases so that it is computed from the full path name,
with a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

Once this PR is merged, the more involved `--path-walk` option may be
considered.
dscho pushed a commit to dscho/git that referenced this pull request Nov 25, 2024
…5171)

This is a follow-up to git-for-windows#5157 and is also motivated by the
RFC in gitgitgadget#1786.

We have ways of walking all objects, but they focus on visiting a
single commit and then expanding the trees and blobs reachable from
that commit that have not yet been visited. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
dscho added a commit to dscho/git that referenced this pull request Nov 25, 2024
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis into the hands of
more people.

The inspiration for this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it is limited in how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference is that this version focuses on using the
path-walk API to visit batches of objects that share a common path.
This allows identifying, for instance, the path that contributes the
most to the on-disk size across all versions at that path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in git-for-windows#5157 and git-for-windows#5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this means that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out because the two
versions conflict too heavily. These conflicts include how tables are
stored and generated, as the version in this PR is slightly more general
to allow for different kinds of data.