Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pack-objects: add --path-walk option for better deltas #1813

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

derrickstolee
Copy link

@derrickstolee derrickstolee commented Oct 8, 2024

This is a reviewable series introducing the path-walk API and its application within the 'git pack-objects --path-walk' machinery. This API was previously discussed in the path-walk RFC [1] and the patch series around the --full-name-hash option (branch ds/pack-name-hash-tweak) [2]. This series conflicts with ds/pack-name-hash-tweak, but that was on hold because it did not seem as critical based on community interest.

[1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com

[2] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com

The primary motivation for this feature is its use to shrink the packfile created by 'git push' when there are many name-hash collisions. This need was discovered in several Javascript repositories that use the beachball tool [3] to generate CHANGELOG.md and CHANGELOG.json files. When a batch of these files are created at the same time and pushed to a release branch, the 'git pack-objects' process has many collisions among these files and delta bases are selected poorly.

[3] https://github.com/microsoft/beachball

In some cases, 'git push' is pushing 60-100 MB when the new path-walk algorithm will identify better delta bases and pack the same objects into a thin pack less than 1 MB. This was the most extreme example we could find and is present in a private monorepo. However, the microsoft/fluentui repo [4] is a good example for demonstrating similar improvements. The patch descriptions frequently refer to this repo and which commit should be checked out to reproduce this behavior.

[4] https://github.com/microsoft/fluentui

The path-walk API is a new way to walk objects. Traditionally, the revision API walks objects by visiting a commit, then visiting its root tree and recursing through trees to visit reachable objects that were not previously visited. The path-walk API visits objects in batches based on the path walked from a root tree to that object. (Only the first discovered path is chosen; this avoids certain kinds of Git bombs.)

This has an immediate application to 'git pack-objects'.

When using the traditional revision API to walk objects, each object is emitted with an associated path. Since this path may appear for many objects spread across the full list, a heuristic is used: the "name-hash" is stored for that object instead of the full path name. This name-hash will group objects at the same path together, but also has a form of "locality" to group likely-similar objects together. When there are few collisions in the name-hash function, this helps compress objects that appear at the same path as well as help compress objects across different paths that have similar suffixes. When there are many versions of the same path, then finding delta bases across that family of objects is very important. When there are few versions of the same path, then finding cross-name delta bases is also important. The former is important for clones and repacks while the latter is important for shallow clones. They can both be important for pushes. In all cases, this approach is fraught when there are many name-hash collisions as the window size becomes a limiting factor for finding quality delta bases.

When using the path-walk API to walk objects, we group objects by the same path from the start. We don't need to store the path name, since we have the objects already in a group. We can compute deltas within that group and then use the name-hash approach to resort the object list and look for opportunistic cross-path deltas. Thus, the path-walk approach allows finding delta bases at least as good as the traditional revision API approach. (Caveat: if we assume delta reuse and the existing deltas were computed with the revision API approach, then the path-walk API approach may result in slightly worse delta compression. The performance tests in this series use --no-reuse-delta for this reason.)

Once 'git pack-objects --path-walk' exists, we have a few ways to take advantage of it in other places:

  • The new 'pack.usePathWalk' config option will assume the --path-walk option. This allows 'git push' to assume this value and get the effect we want. This is similar to the 'pack.useSparse' config option that uses a similar path-based walk to limit the set of boundary objects.

  • 'git repack' learns a '--path-walk' option to pass to its child 'git pack-objects' process. This is also implied by 'pack.usePathWalk' but allows for testing without modifying repository config.

I'll repeat the following table of repacking results when using 'git repack -adf [--path-walk]' on a set of repositories with many name-hash collisions. Only the microsoft/fluentui repository is publicly available for testing, so the others are left as Repo B/C/D.

| Repo     | Standard Repack | With --path-walk |
|----------|-----------------|------------------|
| fluentui |         438 MB  |          148 MB  |
| Repo B   |       6,255 MB  |          778 MB  |
| Repo C   |      37,737 MB  |        6,158 MB  |
| Repo D   |     130,049 MB  |        4,432 MB  |

While this series is replacing ds/pack-name-hash-tweak and its introduction of the --full-name-hash option, it is worth comparing that option to the --path-walk option.

  • The --full-name-hash option is a much smaller code change, as it drops into the existing uses of the name-hash function.

  • The --full-name-hash option is more likely to integrate with server-side features such as delta islands and reachability bitmaps due to not changing the object walk. It was already noted that the .bitmap files store name-hash values, so there is some compatibility required to integrate with bitmaps. The --path-walk option will be more difficult to integrate with those options (at least during a repack), but maybe is not impossible; the --path-walk option will not work when reading reachability bitmaps, since we are avoiding walking trees entirely.

  • The --full-name-hash option is good when there are many name-hash collisions and many versions of the paths with those collisions. When creating a shallow clone or certain kinds of pushes, the --full-name-hash option is much worse at finding cross-path delta bases since it loses the locality of the standard name-hash function. The --path-walk option includes a second pass of delta computation using the standard name-hash function and thus finds good cross-path delta bases when they improve upon the same-path delta bases.

There are a few differences from the RFC version of this series:

  1. The last two patches refactor the approach to perform delta calculations by path after the object walk and then allows those delta calculations to happen in a threaded manner.

  2. Both 'git pack-objects' and 'git repack' are removed from the t0450 exclusion list.

  3. The path-walk API has improved technical documentation that is extended as its functionality is expanded.

  4. Various bugs have been patched with matching tests. The new 'test-tool path-walk' helper allows for careful testing of the API separately from its use within other commands.

Updates in v2

I'm sending this v2 to request some review feedback on the series. I'm sorry it's so long.

There are two updates in this version:

  • Fixed a performance issue in the presence of many annotated tags. This is caught by p5313 when run on a repo with 10,000+ annotated tags.
  • The Scalar config was previously wrong and should be pack.usePathWalk, not push.usePathWalk.

Thanks, - Stolee

cc: gitster@pobox.com
cc: johannes.schindelin@gmx.de
cc: peff@peff.net
cc: ps@pks.im
cc: me@ttaylorr.com
cc: johncai86@gmail.com
cc: newren@gmail.com
cc: christian.couder@gmail.com
cc: kristofferhaugsbakk@fastmail.com
cc: Jonathan Tan jonathantanmy@google.com

@derrickstolee
Copy link
Author

/submit

Copy link

gitgitgadget bot commented Oct 8, 2024

Submitted as pull.1813.git.1728396723.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1813/derrickstolee/path-walk-upstream-v1

To fetch this version to local tag pr-1813/derrickstolee/path-walk-upstream-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1813/derrickstolee/path-walk-upstream-v1

Copy link

gitgitgadget bot commented Oct 8, 2024

This patch series was integrated into seen via git@b0017d4.

@gitgitgadget gitgitgadget bot added the seen label Oct 8, 2024
Copy link

gitgitgadget bot commented Oct 9, 2024

This patch series was integrated into seen via git@e1dc636.

Copy link

gitgitgadget bot commented Oct 9, 2024

This branch is now known as ds/path-walk.

Copy link

gitgitgadget bot commented Oct 10, 2024

This patch series was integrated into seen via git@be8d416.

Copy link

gitgitgadget bot commented Oct 10, 2024

This patch series was integrated into seen via git@5ddc828.

Copy link

gitgitgadget bot commented Oct 10, 2024

There was a status update in the "Cooking" section about the branch ds/path-walk on the Git mailing list:

A new algorithm for object graph traversal to favor visiting the
objects at the same tree path in succession (as opposed to visiting
objects that are different between trees as we walk commit
histories) is introduced to optimize object packing.

Needs review.
source: <pull.1813.git.1728396723.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Oct 11, 2024

This patch series was integrated into seen via git@7d1078a.

Copy link

gitgitgadget bot commented Oct 11, 2024

This patch series was integrated into seen via git@24e32b0.

Copy link

gitgitgadget bot commented Oct 12, 2024

There was a status update in the "Cooking" section about the branch ds/path-walk on the Git mailing list:

A new algorithm for object graph traversal to favor visiting the
objects at the same tree path in succession (as opposed to visiting
objects that are different between trees as we walk commit
histories) is introduced to optimize object packing.

Needs review.
source: <pull.1813.git.1728396723.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Oct 14, 2024

This patch series was integrated into seen via git@542db49.

Copy link

gitgitgadget bot commented Oct 14, 2024

This patch series was integrated into seen via git@33f243d.

Copy link

gitgitgadget bot commented Oct 16, 2024

This patch series was integrated into seen via git@1a84029.

@derrickstolee derrickstolee force-pushed the path-walk-upstream branch 2 times, most recently from 7b57944 to fddc320 Compare October 18, 2024 19:05
Copy link

gitgitgadget bot commented Oct 18, 2024

This patch series was integrated into seen via git@f497f37.

Copy link

gitgitgadget bot commented Oct 18, 2024

This patch series was integrated into seen via git@5f6571b.

@derrickstolee
Copy link
Author

/submit

Copy link

gitgitgadget bot commented Oct 20, 2024

Submitted as pull.1813.v2.git.1729431810.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1813/derrickstolee/path-walk-upstream-v2

To fetch this version to local tag pr-1813/derrickstolee/path-walk-upstream-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1813/derrickstolee/path-walk-upstream-v2

Copy link

gitgitgadget bot commented Oct 21, 2024

On the Git mailing list, Taylor Blau wrote (reply to this):

On Sun, Oct 20, 2024 at 01:43:13PM +0000, Derrick Stolee via GitGitGadget wrote:
> Updates in v2
> =============
>
> I'm sending this v2 to request some review feedback on the series. I'm sorry
> it's so long.
>
> There are two updates in this version:
>
>  * Fixed a performance issue in the presence of many annotated tags. This is
>    caught by p5313 when run on a repo with 10,000+ annotated tags.
>  * The Scalar config was previously wrong and should be pack.usePathWalk,
>    not push.usePathWalk.

Thanks. I queued the new round. As an aside, I would like to find the
time to review this series in depth, but haven't been able to do so
while also trying to keep up with the volume of the rest of the list.

I know that this topic was split out of a larger one. It may be worth
seeing if there is a way to split this topic out into multiple series
that are more easily review-able, but still sensible on their own.

I haven't looked in enough depth to know myself whether such a cut
exists, but it is worth thinking about if you haven't done so already.

Thanks,
Taylor

Copy link

gitgitgadget bot commented Oct 21, 2024

This patch series was integrated into seen via git@a3cf0fb.

Copy link

gitgitgadget bot commented Oct 22, 2024

This patch series was integrated into seen via git@900e2f6.

Copy link

gitgitgadget bot commented Oct 23, 2024

This patch series was integrated into seen via git@05f5018.

Copy link

gitgitgadget bot commented Oct 24, 2024

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 10/21/24 5:43 PM, Taylor Blau wrote:
> On Sun, Oct 20, 2024 at 01:43:13PM +0000, Derrick Stolee via GitGitGadget wrote:
>> Updates in v2
>> =============
>>
>> I'm sending this v2 to request some review feedback on the series. I'm sorry
>> it's so long.
>>
>> There are two updates in this version:
>>
>>   * Fixed a performance issue in the presence of many annotated tags. This is
>>     caught by p5313 when run on a repo with 10,000+ annotated tags.
>>   * The Scalar config was previously wrong and should be pack.usePathWalk,
>>     not push.usePathWalk.
> > Thanks. I queued the new round. As an aside, I would like to find the
> time to review this series in depth, but haven't been able to do so
> while also trying to keep up with the volume of the rest of the list.
> > I know that this topic was split out of a larger one. It may be worth
> seeing if there is a way to split this topic out into multiple series
> that are more easily review-able, but still sensible on their own.

I'll see what I can do. I needed to re-roll after discovering an issue
when trying to integrate the algorithm with shallow clones. The solution
ends up simplifying the code, which is nice.

> I haven't looked in enough depth to know myself whether such a cut
> exists, but it is worth thinking about if you haven't done so already.

In the current series, there's a natural cut between patches 1-4
and the rest, if we want to put the API in without a non-test consumer.

I could also split out the 'git repack' changes into a third series.

Finally, the threading implementation could be done separately, but I
think it's not complicated enough to leave out from the first version
of the --path-walk option in 'git pack-objects'.

Thanks,
-Stolee

Copy link

gitgitgadget bot commented Oct 24, 2024

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Oct 24, 2024 at 09:29:02AM -0400, Derrick Stolee wrote:
> > I know that this topic was split out of a larger one. It may be worth
> > seeing if there is a way to split this topic out into multiple series
> > that are more easily review-able, but still sensible on their own.
>
> I'll see what I can do. I needed to re-roll after discovering an issue
> when trying to integrate the algorithm with shallow clones. The solution
> ends up simplifying the code, which is nice.

It's always nice when that happens :-).

Should we avoid reviewing the current round in anticipation of a
somewhat restructured series, or would you like us to review the current
round as well?

> > I haven't looked in enough depth to know myself whether such a cut
> > exists, but it is worth thinking about if you haven't done so already.
>
> In the current series, there's a natural cut between patches 1-4
> and the rest, if we want to put the API in without a non-test consumer.
>
> I could also split out the 'git repack' changes into a third series.
>
> Finally, the threading implementation could be done separately, but I
> think it's not complicated enough to leave out from the first version
> of the --path-walk option in 'git pack-objects'.

I'd suggest erring on the side of more smaller series rather than a
single large one. If you feel like there are cut points where we can
review them in isolation and still see some benefit, or at least
clearly how they each fit into the larger puzzle, I think that is worth
doing.

But I trust your judgement here, so if you think that the series is best
reviewed as a whole, then that's fine too. Just my $.02 :-).

Thanks,
Taylor

@derrickstolee derrickstolee force-pushed the path-walk-upstream branch 2 times, most recently from 7ae9a40 to 389c18f Compare October 30, 2024 22:20
Copy link

gitgitgadget bot commented Oct 30, 2024

This patch series was integrated into seen via git@d8f1bc3.

Copy link

gitgitgadget bot commented Oct 31, 2024

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 10/29/24 6:16 PM, Taylor Blau wrote:
> On Tue, Oct 29, 2024 at 02:36:57PM -0700, Jonathan Tan wrote:
>> By incompatibility, do you mean the incompatibility between bitmaps
>> and the overall --path-walk feature as implemented collectively by the
>> patches in Stolee's patch set? If so, I suspect that we will need a
>> parallel code path that takes in the "want" and "uninteresting" commits
>> and emits the list of objects (possibly before sorting the objects by
>> path hash), much like in builtin/pack-objects.c, so I think there will
>> be some effort involved in making the two work together.
> > I am not sure yet, in all honesty, because I haven't had enough time to
> spend yet reviewing these patches to have anything intelligent to say.

I think that the --path-walk option is fundamentally incompatible with
the --use-bitmap-index option (using reachability bitmaps to reduce how
many objects we parse to discover reachability) but is not necessarily
incompatible with writing bitmaps. But it would require testing to be
sure that there are no surprises due to something like an object order
changing or something like that.

The feature is also not currently integrated with delta islands, so
that would need integration and testing to make sure things are
grouped together and delta chains go in the right direction.

These things may be something possible to overcome in the future, but
the lack of current integration points and testing makes me want to
leave this version with guard rails that prevent users from getting
into a bind.

Thanks,
-Stolee

Copy link

gitgitgadget bot commented Oct 31, 2024

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 10/28/24 3:54 PM, Jonathan Tan wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> This new walk is incompatible with some features and is ignored by
>> others:
>>
>>   * Object filters are not currently integrated with the path-walk API,
>>     such as sparse-checkout or tree depth. A blobless packfile could be
>>     integrated easily, but that is deferred for later.
>>
>>   * Server-focused features such as delta islands, shallow packs, and
>>     using a bitmap index are incompatible with the path-walk API.
>>
>>   * The path walk API is only compatible with the --revs option, not
>>     taking object lists or pack lists over stdin. These alternative ways
>>     to specify the objects currently ignores the --path-walk option
>>     without even a warning.
> > It might be better to declare --path-walk as "internal use only" and
> only supporting what send-pack.c (used by "git push") and "git repack"
> needs. (From this list, it seems that there is a lot of incompatibility,
> some of which can happen without a warning to the user, so it sounds
> better to be up-front and say that we only support what send-pack.c
> needs. This also makes reviewing easier, as we don't have to think about
> the possible interactions with every other rev-list feature - only what
> is used by send-pack.c.)

I do wonder what the value of doing this would be. I consider 'gitpack-objects' to already be a plumbing command, so marking any option
as "internal use only" seems like overkill. It takes effort to combine
the options carefully for the right effect. The tests in p5313 are not
terribly simple, such as needing --no-reuse-delta to guarantee we are
using the desired delta algorithm.

> Also from a reviewer perspective, it might be better to restrict this
> patch set to what send-pack.c needs and leave "git repack" for a future
> patch set. This means that we would not need features such as blob
> and tree exclusions, and possibly not even bitmap use or delta reuse
> (assuming that the user would typically push recently-created objects
> that have not been repacked).

While I can understand that as being a potential place to split the
patch series, the integration to add 'git repack --path-walk' is actually
very simple. Repacking "everything" needs to happen to be able to push a
repo to an empty remote, after all.

There are some subtleties around indexed objects, reflogs, and the like
that add some complexity, but they also are handled in the path-walk API
layer. Some of that complexity was helpful to know about during repack
tests.

Finally, the 'git repack --path-walk' use case is a great one for
demonstrating the benefits to threading the 'git pack-objects
--path-walk' algorithm.

>> +		/* Skip objects that do not exist locally. */
>> +		if (exclude_promisor_objects &&
>> +		    oid_object_info_extended(the_repository, oid, &oi,
>> +					     OBJECT_INFO_FOR_PREFETCH) < 0)
>> +			continue;
> > This functionality is typically triggered by --missing=allow;
> --exclude_promisor_objects means (among other things) that we allow
> a missing link only if that object is known to be a promisor object
> (because another promisor object refers to it) (see Documentation/
> rev-list-options.txt, and also get_reference() and elsewhere in
> revision.c - notice how is_promisor_object() is paired with it)
> > Having said that, we should probably just fail outright on missing
> objects, whether or not we have exclude_promisor_objects. If we have
> computed that we need to push an object, that object needs to exist.
> (Same for repack.)

I think that this is not a reasonable assumption that a hard fail
should be expected. Someone could create a blobless clone with a
sparse-checkout and then add a new file outside their sparse-checkout
without ever having the previous version downloaded.

When pushing, they won't be able to use the previous version as a
delta base, but they would certainly be confused about an object
being downloaded during 'git push'.

While the example I bring up is somewhat contrived, I can easily
imagine cases where missing objects are part of the commit boundary
and could be marked as UNINTERESTING but would still be sent in the
batch to be considered as a delta base.

Thanks,
-Stolee

Copy link

gitgitgadget bot commented Oct 31, 2024

On the Git mailing list, Derrick Stolee wrote (reply to this), regarding f8ee11d (outdated):

On 10/29/24 2:07 PM, Taylor Blau wrote:
> On Mon, Oct 28, 2024 at 12:54:04PM -0700, Jonathan Tan wrote:

> Is the thinking there that we care mostly about 'git push' and 'git
> repack' on the client-side?
> > I don't think it's unreasonable necessarily, but I would add that
> client-side users definitely do use bitmaps (though not delta islands),
> either when working in a bare repository (where bitmaps are the default)
> or when using 'git gc' (and/or through 'git maintenance') when
> 'repack.writeBitmaps' is enabled.
I suppose some users do use bitmaps, but in my experience, client-side
pushes are slower with bitmaps because a typical target branch is
faster to compute by doing a commit walk, at least when the bitmaps are
older than the new commits in the topic branch. This may be outdated by
now, as it has been a few years since I did a client-side test of
bitmaps.

Thanks,
-Stolee

Copy link

gitgitgadget bot commented Oct 31, 2024

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 10/29/24 2:02 PM, Taylor Blau wrote:
> On Mon, Oct 28, 2024 at 03:46:11PM -0400, Derrick Stolee wrote:
>> On 10/28/24 1:25 PM, Taylor Blau wrote:
>>> Unfortunately, there is no easy way to reuse the format of the existing
>>> hashcache extension as-is to indicate to the reader whether they are
>>> recording traditional name-hash values, or the new --path-walk hash
>>> values.
>>
>> The --path-walk option does not mess with the name-hash. You're thinking
>> of the --full-name-hash feature [1] that was pulled out due to a lack of
>> interest (and better results with --path-walk).
>>
>> [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/
>
> Ah, gotcha. Thanks for clarifying.
>
> What is the incompatibility between the two, then? Is it just that
> bitmaps give us the objects in pack- or pseudo-pack order, and we don't
> have a way to permute that back into the order that --path-walk would
> give us?

The incompatibility of reading bitmaps and using the path-walk API is
that the path-walk API does not check a bitmap to see if an object is
already discovered. Thus, it does not use the reachability information
from the bitmap at all and would parse commits and trees to find the
objects that should be in the pack-file.

It should also be worth noting that using something like 'git repack
--path-walk' does not mean that future 'git pack-objects' executions
from that packfile data need to use the --path-walk option. I expect
that it should be painless to write bitmaps on top of a packfile created
with 'git repack -adf --path-walk', but since most places doing so also
likely want delta islands, I have not explored this option thoroughly.

(Delta islands are their own challenge, since the path-walk API is not
spreading the reachability information across the objects it walks.
However, this could be remedied by doing a separate walk to identify
islands using the normal method. I believe Peff had an idea in that
direction in another thread. This requires some integration and testing
that I don't have the expertise to provide.)

> If so, a couple of thoughts:
> ...

Since the incompatibility is in a different direction, I don't think
these thoughts were relevant to the problem.

> OTOH, the order in which we pack objects is extremely important to
> performance as you no doubt are aware of. So changing that order to more
> closely match the --path-walk option should be done with great care.

This is a place where I'm unsure about how the --path-walk option adjusts
the object order within the pack. The packing list gets resorted to match
the typical method, at least for how the delta compression window works.

This would be another good reason to consider the --path-walk option in
server environments very carefully. My patch series puts up guard rails
specifically because it makes no claim to be effective in all of the
dimensions that matter for those scenarios. Hopefully, others will be
motivated enough to determine if the compression that's possible with
this algorithm could be achieved in a way that is compatible with server
needs.

> Anyway. All of that is to say that I want to better understand what does
> and doesn't work together between bitmaps and path-walk. Given my
> current understanding, it seems there are a couple of approaches to
> unifying these two things together, so it would be nice to be able to
> do so if possible.

I think this is an excellent opportunity for testing and debugging to
build up more intuition with how the path-walk API works. When I submit
the next version later tonight, the path-walk algorithm will be better
documented.

That said, I don't have any personal motivation to integrate the two
together, so I don't expect to be contributing that integration point
myself. I think that the results speak for themselves in the very
common environment of a Git client without bitmaps.

Thanks,
-Stolee

Copy link

gitgitgadget bot commented Oct 31, 2024

On the Git mailing list, Taylor Blau wrote (reply to this), regarding f8ee11d (outdated):

On Wed, Oct 30, 2024 at 10:14:08PM -0400, Derrick Stolee wrote:
> On 10/29/24 2:07 PM, Taylor Blau wrote:
> > On Mon, Oct 28, 2024 at 12:54:04PM -0700, Jonathan Tan wrote:
>
> > Is the thinking there that we care mostly about 'git push' and 'git
> > repack' on the client-side?
> >
> > I don't think it's unreasonable necessarily, but I would add that
> > client-side users definitely do use bitmaps (though not delta islands),
> > either when working in a bare repository (where bitmaps are the default)
> > or when using 'git gc' (and/or through 'git maintenance') when
> > 'repack.writeBitmaps' is enabled.
> I suppose some users do use bitmaps, but in my experience, client-side
> pushes are slower with bitmaps because a typical target branch is
> faster to compute by doing a commit walk, at least when the bitmaps are
> older than the new commits in the topic branch. This may be outdated by
> now, as it has been a few years since I did a client-side test of
> bitmaps.

All true, though it's hard to estimate the size of "some". I share your
intuition that bitmaps are often a drag on performance for the
client-side because doing a pure commit walk is often faster, especially
if the client has a reasonably up-to-date commit graph.

Thanks,
Taylor

Copy link

gitgitgadget bot commented Oct 31, 2024

On the Git mailing list, Taylor Blau wrote (reply to this):

On Wed, Oct 30, 2024 at 10:28:22PM -0400, Derrick Stolee wrote:
> On 10/29/24 2:02 PM, Taylor Blau wrote:
> > On Mon, Oct 28, 2024 at 03:46:11PM -0400, Derrick Stolee wrote:
> >> On 10/28/24 1:25 PM, Taylor Blau wrote:
> >>> Unfortunately, there is no easy way to reuse the format of the existing
> >>> hashcache extension as-is to indicate to the reader whether they are
> >>> recording traditional name-hash values, or the new --path-walk hash
> >>> values.
> >>
> >> The --path-walk option does not mess with the name-hash. You're thinking
> >> of the --full-name-hash feature [1] that was pulled out due to a lack of
> >> interest (and better results with --path-walk).
> >>
> >> [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/
> >
> > Ah, gotcha. Thanks for clarifying.
> >
> > What is the incompatibility between the two, then? Is it just that
> > bitmaps give us the objects in pack- or pseudo-pack order, and we don't
> > have a way to permute that back into the order that --path-walk would
> > give us?
>
> The incompatibility of reading bitmaps and using the path-walk API is
> that the path-walk API does not check a bitmap to see if an object is
> already discovered. Thus, it does not use the reachability information
> from the bitmap at all and would parse commits and trees to find the
> objects that should be in the pack-file.

Sure, I think what I'm trying to understand here is whether this
"incapability" is a fundamental one, or just that we haven't implemented
checking bitmaps in the path-walk API yet.

Thanks,
Taylor

In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.

These batches are collected in 'struct type_and_oid_list' objects, which
store an object type and an oid_array of objects.

The data structures are documented in 'struct path_walk_context', but in
summary the most important are:

  * 'paths_to_lists' is a strmap that connects a path to a
    type_and_oid_list for that path. To avoid conflicts in path names,
    we make sure that tree paths end in "/" (except the root path with
    is an empty string) and blob paths do not end in "/".

  * 'path_stack' is a string list that is added to in an append-only
    way. This stores the stack of our depth-first search on the heap
    instead of using recursion.

  * 'path_stack_pushed' is a strmap that stores path names that were
    already added to 'path_stack', to avoid repeating paths in the
    stack. Mostly, this saves us from quadratic lookups from doing
    unsorted checks into the string_list.

The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
push_to_stack() method. Call this instead of inserting into these
structures directly.

The walk_objects_by_path() method initializes these structures and
starts walking commits from the given rev_info struct. The commits are
used to find the list of root trees which populate the start of our
depth-first search.

The core of our depth-first search is in a while loop that continues
while we have not indicated an early exit and our 'path_stack' still has
entries in it. The loop body pops a path off of the stack and "visits"
the path via the walk_path() method.

The walk_path() method gets the list of OIDs from the 'path_to_lists'
strmap and executes the callback method on that list with the given path
and type. If the OIDs correspond to tree objects, then iterate over all
trees in the list and run add_children() to add the child objects to
their own lists, adding new entries to the stack if necessary.

In testing, this depth-first search approach was the one that used the
least memory while iterating over the object lists. There is still a
chance that repositories with too-wide path patterns could cause memory
pressure issues. Limiting the stack size could be done in the future by
limiting how many objects are being considered in-progress, or by
visiting blob paths earlier than trees.

There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
This test helper will be helpful to reduce repeated logic in
t6601-path-walk.sh, but may be helpful elsewhere, too.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
derrickstolee and others added 17 commits December 6, 2024 06:49
Add some tests based on the current behavior, doing interesting checks
for different sets of branches, ranges, and the --boundary option. This
sets a baseline for the behavior and we can extend it as new options are
introduced.

Store and output a 'batch_nr' value so we can demonstrate that the paths are
grouped together in a batch and not following some other ordering. This
allows us to test the depth-first behavior of the path-walk API. However, we
purposefully do not test the order of the objects in the batch, so the
output is compared to the expected output through a sort.

It is important to mention that the behavior of the API will change soon as
we start to handle UNINTERESTING objects differently, but these tests will
demonstrate the change in behavior.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.

This adds the ability to ask for the commits in a list, as well. We
re-use the empty string for this set of objects because these are passed
directly to the callback function instead of being part of the
'path_stack'.

Future changes will add the ability to visit annotated tags.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The rev_info that is specified for a path-walk traversal may specify
visiting tag refs (both lightweight and annotated) and also may specify
indexed objects (blobs and trees). Update the path-walk API to walk
these objects as well.

When walking tags, we need to peel the annotated objects until reaching
a non-tag object. If we reach a commit, then we can add it to the
pending objects to make sure we visit in the commit walk portion. If we
reach a tree, then we will assume that it is a root tree. If we reach a
blob, then we have no good path name and so add it to a new list of
"tagged blobs".

When the rev_info includes the "--indexed-objects" flag, then the
pending set includes blobs and trees found in the cache entries and
cache-tree. The cache entries are usually blobs, though they could be
trees in the case of a sparse index. The cache-tree stores
previously-hashed tree objects but these are cleared out when staging
objects below those paths. We add tests that demonstrate this.

The indexed objects come with a non-NULL 'path' value in the pending
item. This allows us to prepopulate the 'path_to_lists' strmap with
lists for these paths.

The tricky thing about this walk is that we will want to combine the
indexed objects walk with the commit walk, especially in the future case
of walking objects during a command like 'git repack'.

Whenever possible, we want the objects from the index to be grouped with
similar objects in history. We don't want to miss any paths that appear
only in the index and not in the commit history.

Thus, we need to be careful to let the path stack be populated initially
with only the root tree path (and possibly tags and tagged blobs) and go
through the normal depth-first search. Afterwards, if there are other
paths that are remaining in the paths_to_lists strmap, we should then
iterate through the stack and visit those objects recursively.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
When the input rev_info has UNINTERESTING starting points, we want to be
sure that the UNINTERESTING flag is passed appropriately through the
objects. To match how this is done in places such as 'git pack-objects', we
use the mark_edges_uninteresting() method.

This method has an option for using the "sparse" walk, which is similar in
spirit to the path-walk API's walk. To be sure to keep it independent, add a
new 'prune_all_uninteresting' option to the path_walk_info struct.

To check how the UNINTERSTING flag is spread through our objects, extend the
'test-tool path-walk' command to output whether or not an object has that
flag. This changes our tests significantly, including the removal of some
objects that were previously visited due to the incomplete implementation.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The path-walk API currently uses a stack-based approach to recursing
through the list of paths within the repository. This guarantees that
after a tree path is explored, all paths contained within that tree path
will be explored before continuing to explore siblings of that tree
path.

The initial motivation of this depth-first approach was to minimize
memory pressure while exploring the repository. A breadth-first approach
would have too many "active" paths being stored in the paths_to_lists
map.

We can take this approach one step further by making sure that blob
paths are visited before tree paths. This allows the API to free the
memory for these blob objects before continuing to perform the
depth-first search. This modifies the order in which we visit siblings,
but does not change the fact that we are performing depth-first search.

To achieve this goal, use a priority queue with a custom sorting method.
The sort needs to handle tags, blobs, and trees (commits are handled
slightly differently). When objects share a type, we can sort by path
name. This will keep children of the latest path to leave the stack be
preferred over the rest of the paths in the stack, since they agree in
prefix up to and including a directory separator. When the types are
different, we can prefer tags over other types and blobs over trees.

This causes significant adjustments to t6601-path-walk.sh to rearrange
the order of the visited paths.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
This will be helpful in a future change.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
In order to more easily compute delta bases among objects that appear at the
exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given by
the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.

This new walk is incompatible with some features and is ignored by
others:

 * Object filters are not currently integrated with the path-walk API,
   such as sparse-checkout or tree depth. A blobless packfile could be
   integrated easily, but that is deferred for later.

 * Server-focused features such as delta islands, shallow packs, and
   using a bitmap index are incompatible with the path-walk API.

 * The path walk API is only compatible with the --revs option, not
   taking object lists or pack lists over stdin. These alternative ways
   to specify the objects currently ignores the --path-walk option
   without even a warning.

Future changes will create performance tests that demonstrate the power
of this approach.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The t0450 test script verifies that builtin usage matches the synopsis
in the documentation. Adjust the builtin to match and then remove 'git
pack-objects' from the exception list.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The previous change added a --path-walk option to 'git pack-objects'.
Create a performance test that demonstrates the time and space benefits
of the feature.

In order to get an appropriate comparison, we need to avoid reusing
deltas and recompute them from scratch.

Compare the creation of a thin pack representing a small push and the
creation of a relatively large non-thin pack.

Running on my copy of the Git repository results in this data:

Test                                      HEAD
--------------------------------------------------------------
5313.2: thin pack                         0.00(0.00+0.00)
5313.3: thin pack size                                589
5313.4: thin pack with --path-walk        0.00(0.00+0.00)
5313.5: thin pack size with --path-walk               589
5313.6: big pack                          2.76(7.19+0.27)
5313.7: big pack size                               14.0M
5313.8: big pack with --path-walk         5.76(6.72+0.16)
5313.9: big pack size with --path-walk              13.2M

Note that the timing is slower because there is no threading in the
--path-walk case (yet).

The cases where the --path-walk option really shines is when the default
name-hash is overwhelmed with collisions. An open source example can be
found in the microsoft/fluentui repo [1] at a certain commit [2].

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Running the tests on this repo results in the following output:

Test                                      HEAD
--------------------------------------------------------------
5313.2: thin pack                         0.28(0.40+0.03)
5313.3: thin pack size                               1.2M
5313.4: thin pack with --path-walk        0.07(0.06+0.00)
5313.5: thin pack size with --path-walk             18.4K
5313.6: big pack                          4.05(29.48+0.41)
5313.7: big pack size                               19.7M
5313.8: big pack with --path-walk         6.01(9.17+0.20)
5313.9: big pack size with --path-walk              16.5M

Notice in particular that in the small thin pack, the time performance
has improved from 0.28s to 0.08s and this is likely due to the improved
size of the resulting pack: 18.4K instead of 1.2M.

Finally, running this on a copy of the Linux kernel repository results
in these data points:

Test                                      HEAD
--------------------------------------------------------------
5313.2: thin pack                         0.03(0.02+0.00)
5313.3: thin pack size                               4.6K
5313.4: thin pack with --path-walk        0.03(0.01+0.01)
5313.5: thin pack size with --path-walk              4.6K
5313.6: big pack                          21.06(60.57+1.45)
5313.7: big pack size                              158.3M
5313.8: big pack with --path-walk         37.65(57.83+0.67)
5313.9: big pack size with --path-walk             152.3M

Signed-off-by: Derrick Stolee <stolee@gmail.com>
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the implementation of the --path-walk
implementation, especially in conjunction with test such as:

 - t0411-clone-from-partial.sh : One test fetches from a repo that does
   not have the boundary objects. This causes the path-based walk to
   fail. Disable the variable for this test.

 - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
   without a boundary object.

 - t5310-pack-bitmaps.sh : One test compares the case when packing with
   bitmaps to the case when packing without them. Since we disable the
   test variable when writing bitmaps, this causes a difference in the
   object list (the --path-walk option adds an extra object). Specify
   --no-path-walk in both processes for the comparison. Another test
   checks for a specific delta base, but when computing dynamically
   without using bitmaps, the base object it too small to be considered
   in the delta calculations so no base is used.

 - t5316-pack-delta-depth.sh : This script cares about certain delta
   choices and their chain lengths. The --path-walk option changes how
   these chains are selected, and thus changes the results of this test.

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

 - t5332-multi-pack-reuse.sh : This test verifies that the preferred
   pack is used for delta reuse when possible. The --path-walk option is
   not currently aware of the preferred pack at all, so finds a
   different delta base.

 - t7406-submodule-update.sh : When using the variable, the --depth
   option collides with the --path-walk feature, resulting in a warning
   message. Disable the variable so this warning does not appear.

I want to call out one specific test change that is only temporary:

 - t5530-upload-pack-error.sh : One test cares specifically about an
   "unable to read" error message. Since the current implementation
   performs delta calculations within the path-walk API callback, a
   different "unable to get size" error message appears. When this
   is changed in a future refactoring, this test change can be reverted.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
It can be notoriously difficult to detect if delta bases are being
computed properly during 'git push'. Construct an example where it will
make a kilobyte worth of difference when a delta base is not found. We
can then use the progress indicators to distinguish between bytes and
KiB depending on whether the delta base is found and used.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the results are very interesting:

Test                                      HEAD
----------------------------------------------------------
5313.2: thin pack                         0.41(0.48+0.03)
5313.3: thin pack size                               1.2M
5313.4: thin pack with --path-walk        0.08(0.05+0.02)
5313.5: thin pack size with --path-walk             18.4K
5313.6: big pack                          4.47(30.62+0.40)
5313.7: big pack size                               19.6M
5313.8: big pack with --path-walk         6.76(9.87+0.23)
5313.9: big pack size with --path-walk              16.5M
5313.10: repack                           96.87(664.29+2.75)
5313.11: repack size                               439.5M
5313.12: repack with --path-walk          95.68(109.90+0.92)
5313.13: repack size with --path-walk              122.6M

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

This repo suffers from having a lot of paths that collide in the name
hash, so examining them in groups by path leads to better deltas. Also,
in this case, the single-threaded implementation is competitive with the
full repack. This is saving time diffing files that have significant
differences from each other.

A similar, but private, repo has even more extremes during repacking:

Test                                      HEAD
-----------------------------------------------------------------
5313.10: repack                           2138.22(11961.00+17.67)
5313.11: repack size                                 6.4G
5313.12: repack with --path-walk          1351.46(1418.28+3.96)
5313.13: repack size with --path-walk              804.1M

There are small benefits in size for my copy of the Git repository:

Test                                      HEAD
-----------------------------------------------------------
5313.10: repack                           22.11(98.37+1.64)
5313.11: repack size                               126.4M
5313.12: repack with --path-walk          66.89(75.61+0.58)
5313.13: repack size with --path-walk              109.6M

As well as in the nodejs/node repository [3]:

Test                                      HEAD
-------------------------------------------------------------
5313.2: thin pack                         0.01(0.01+0.00)
5313.3: thin pack size                               1.6K
5313.4: thin pack with --path-walk        0.02(0.01+0.00)
5313.5: thin pack size with --path-walk              1.6K
5313.6: big pack                          5.35(12.43+0.32)
5313.7: big pack size                               52.2M
5313.8: big pack with --path-walk         7.12(11.97+0.27)
5313.9: big pack size with --path-walk              52.1M
5313.10: repack                           87.74(342.90+4.24)
5313.11: repack size                               739.7M
5313.12: repack with --path-walk          212.79(245.05+1.78)
5313.13: repack size with --path-walk              697.6M

[3] https://github.com/nodejs/node

This benefit also repeats in this instance in the Linux kernel repository:

Test                                      HEAD
---------------------------------------------------------------
5313.2: thin pack                         0.04(0.00+0.03)
5313.3: thin pack size                               4.6K
5313.4: thin pack with --path-walk        0.03(0.01+0.01)
5313.5: thin pack size with --path-walk              4.6K
5313.6: big pack                          21.16(62.81+1.35)
5313.7: big pack size                              158.3M
5313.8: big pack with --path-walk         36.09(55.25+0.67)
5313.9: big pack size with --path-walk             152.2M
5313.10: repack                           734.26(2149.62+31.24)
5313.11: repack size                                 2.5G
5313.12: repack with --path-walk          1457.23(1618.15+7.00)
5313.13: repack size with --path-walk                2.2G

It is important to see that even when the repository shape does not have
many name-hash collisions, there is a slight space boost to be found
using this method. Also, there is no known case where the space is
worse with --path-walk. This is of course due to the second pass where
all objects to be packed are sorted in the usual way and checked for
deltas. This second pass is usually very fast as the path-walk has
primed many objects with quality deltas that short-circuit other delta
computation attempts.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The t0450 test script verifies that the builtin usage matches the
synopsis in the documentation. Update 'git repack' to match and remove
it from the exception list.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computed deltas.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but could be
done in a future update.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                                      HEAD~1    HEAD
-----------------------------------------------------------------
5313.2: thin pack                         0.00        0.00  =
5313.3: thin pack size                     589         589  +0.0%
5313.4: thin pack with --path-walk        0.00        0.00  =
5313.5: thin pack size with --path-walk    589         589  +0.0%
5313.6: big pack                          2.84        2.80  -1.4%
5313.7: big pack size                    14.0M       14.1M  +0.3%
5313.8: big pack with --path-walk         5.46        3.77 -31.0%
5313.9: big pack size with --path-walk   13.2M       13.2M  -0.0%
5313.10: repack                           22.11      21.50  -2.8%
5313.11: repack size                     126.4M     126.2M  -0.2%
5313.12: repack with --path-walk          66.89      26.41 -60.5%
5313.13: repack size with --path-walk    109.6M     109.6M  +0.0%

This 60% reduction in 'git repack --path-walk' time is typical across
all repos I used for testing. What is interesting is to compare when the
overall time improves enough to outperform the standard case. These time
improvements correlate with repositories with data shapes that
significantly improve their data size as well.

For example, the microsoft/fluentui repo has a 439M to 122M size
reduction, and the repack time is now 36.6 seconds with --path-walk
compared to 95+ seconds without it:

Test                                      HEAD~!    HEAD
-----------------------------------------------------------------
5313.2: thin pack                         0.41        0.42  +2.4%
5313.3: thin pack size                    1.2M        1.2M  +0.0%
5313.4: thin pack with --path-walk        0.08        0.05 -37.5%
5313.5: thin pack size with --path-walk  18.4K       18.4K  +0.0%
5313.6: big pack                          4.47        4.53  +1.3%
5313.7: big pack size                    19.6M       19.7M  +0.3%
5313.8: big pack with --path-walk         6.76        3.51 -48.1%
5313.9: big pack size with --path-walk   16.5M       16.4M  -0.2%
5313.10: repack                          96.87       99.05  +2.3%
5313.11: repack size                    439.5M      439.0M  -0.1%
5313.12: repack with --path-walk         95.68       36.55 -61.8%
5313.13: repack size with --path-walk   122.6M      122.6M  +0.0%

In a more extreme example, an internal repository that has a similar
name-hash collision issue to microsoft/fluentui reduces its size from
6.4G to 805M with the --path-walk option. This also reduces the
repacking time from 2,138 seconds to 478 seconds.

Test                                      HEAD~1    HEAD
------------------------------------------------------------------
5313.10: repack                           2138.22   2138.19  -0.0%
5313.11: repack size                         6.4G      6.4G  -0.0%
5313.12: repack with --path-walk          1351.46    477.91 -64.6%
5313.13: repack size with --path-walk      804.1M    804.1M  -0.0%

Finally, the Linux kernel repository is a good test for this repacking
time change, even though the space savings is more reasonable:

Test                                      HEAD~1      HEAD
----------------------------------------------------------------
5313.10: repack                           734.26   735.11  +0.1%
5313.11: repack size                        2.5G     2.5G  -0.0%
5313.12: repack with --path-walk         1457.23   598.17 -59.0%
5313.13: repack size with --path-walk       2.2G     2.2G  +0.0%

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant