[RFC] Path-walk API and applications #1786

Closed
wants to merge 30 commits into from

Commits on Sep 8, 2024

  1. path-walk: introduce an object walk by path

    In anticipation of a few planned applications, introduce the most basic form
    of a path-walk API. It currently assumes that there are no UNINTERESTING
    objects, and does not include any complicated filters. It calls a function
    pointer on groups of tree and blob objects as grouped by path. This only
    includes objects the first time they are discovered, so an object that
    appears at multiple paths will not be included in two batches.
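
    To make the batching rule above concrete, here is a minimal Python sketch
    of a walk grouped by path. It is not the C implementation from this
    commit: the toy TREES store, the walk_objects_by_path() name, the use of
    "" for the root path, and the trailing-slash convention for tree paths
    are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy object store: tree OID -> list of (name, type, oid) entries.
TREES = {
    "root1": [("README", "blob", "b1"), ("src", "tree", "src1")],
    "root2": [("README", "blob", "b2"), ("src", "tree", "src1"), ("COPY", "blob", "b1")],
    "src1": [("main.c", "blob", "b3")],
}


def walk_objects_by_path(root_trees, callback):
    """Call callback(path, type, oids) once per path, skipping repeated objects."""
    seen = set()
    batches = {"": ("tree", [])}  # path -> (object type, OIDs found at that path)
    queue = deque([""])
    for oid in root_trees:
        if oid not in seen:
            seen.add(oid)
            batches[""][1].append(oid)
    while queue:
        path = queue.popleft()
        obj_type, oids = batches.pop(path)
        callback(path, obj_type, oids)
        if obj_type != "tree":
            continue
        for tree_oid in oids:
            for name, child_type, child_oid in TREES[tree_oid]:
                if child_oid in seen:
                    continue  # only report an object the first time it is found
                seen.add(child_oid)
                # A trailing slash keeps "foo" (blob) and "foo/" (tree) apart.
                child_path = path + name + ("/" if child_type == "tree" else "")
                if child_path not in batches:
                    batches[child_path] = (child_type, [])
                    queue.append(child_path)
                batches[child_path][1].append(child_oid)


calls = []
walk_objects_by_path(["root1", "root2"], lambda p, t, o: calls.append((p, t, list(o))))
```

    In this toy data, blob "b1" appears at both "README" and "COPY" but is
    only reported in the "README" batch, and the shared tree "src1" is
    visited once.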
    
    There are many future adaptations that could be made, but they are left for
    future updates when consumers are ready to take advantage of those features.
    
    RFC TODO: It would be helpful to create a test-tool that allows printing of
    each batch for strong testing.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 8, 2024
    a53bd0d
  2. backfill: add builtin boilerplate

    In anticipation of implementing 'git backfill', populate the necessary files
    with the boilerplate of a new builtin.
    
    RFC TODO: When preparing this for a full implementation, make sure it is
    based on the newest standards introduced by [1].
    
    [1] https://lore.kernel.org/git/xmqqjzfq2f0f.fsf@gitster.g/T/#m606036ea2e75a6d6819d6b5c90e729643b0ff7f7
        [PATCH 1/3] builtin: add a repository parameter for builtin functions
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 8, 2024
    41c49bb
  3. backfill: basic functionality and tests

    The default behavior of 'git backfill' is to fetch all missing blobs that
    are reachable from HEAD. Document and test this behavior.
    
    The implementation is a very simple use of the path-walk API, initializing
    the revision walk at HEAD to start the path-walk from all commits reachable
    from HEAD. Ignore the object arrays that correspond to tree entries,
    assuming that they are all present already.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 8, 2024
    be21c83
  4. backfill: add --batch-size=<n> option

    Users may want to specify a minimum batch size for their needs. This is only
    a minimum: the path-walk API provides a list of OIDs that correspond to the
    same path, and thus it is optimal to allow delta compression across those
    objects in a single server request.
    
    We could consider limiting the request to have a maximum batch size in the
    future.
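
    The "minimum, not maximum" semantics can be sketched as follows. This is
    an illustrative Python model rather than the builtin's code; the
    backfill_requests() name, the 'missing' set, and the shape of
    'path_batches' are assumptions.

```python
def backfill_requests(path_batches, missing, min_batch_size):
    """Group missing blob OIDs into requests of at least min_batch_size.

    path_batches: iterable of OID lists, one list per path (as a path-walk
    callback might deliver them). A path's list is never split, so a request
    can exceed the minimum.
    """
    pending = []
    for oids in path_batches:
        # Only objects that are not present locally need to be requested.
        pending.extend(oid for oid in oids if oid in missing)
        if len(pending) >= min_batch_size:
            yield pending
            pending = []
    if pending:
        yield pending  # final, possibly smaller, request


batches = [["a", "x", "b"], ["c"], ["d", "e", "f", "g"], ["h"]]
missing = set("abcdefgh")  # "x" is already present locally
requests = list(backfill_requests(batches, missing, min_batch_size=3))
```

    Here the third path contributes four objects at once, so that request
    is larger than the requested minimum of three.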
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 8, 2024
    f904b02
  5. backfill: add --sparse option

    One way to significantly reduce the cost of a Git clone and later fetches is
    to use a blobless partial clone and combine that with a sparse-checkout that
    reduces the paths that need to be populated in the working directory. Not
    only does this reduce the cost of clones and fetches, the sparse-checkout
    reduces the number of objects needed to download from a promisor remote.
    
    However, history investigations can be expensive as computing blob diffs will
    trigger promisor remote requests for one object at a time. This can be
    avoided by downloading the blobs needed for the given sparse-checkout using
    'git backfill' and its new '--sparse' mode, at a time that the user is
    willing to pay that extra cost.
    
    Note that this is distinctly different from the '--filter=sparse:<oid>'
    option, as this assumes that the partial clone has all reachable trees and
    we are using client-side logic to avoid downloading blobs outside of the
    sparse-checkout cone. This avoids the server-side cost of walking trees
    while also achieving a similar goal. It also downloads in batches based on
    similar path names, presenting a resumable download if things are
    interrupted.
    
    This augments the path-walk API to have a possibly-NULL 'pl' member that may
    point to a 'struct pattern_list'. This could be more general than the
    sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
    the only consumer.
    
    Be sure to test this in both cone mode and non-cone mode. Cone mode has the
    benefit that the path-walk can skip certain paths once they would expand
    beyond the sparse-checkout.
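
    The cone-mode shortcut can be illustrated with a small Python model. It
    follows the documented cone-mode rule (everything under the listed
    directories, plus files directly inside their ancestor directories) and
    is not the 'struct pattern_list' matching code; cone_sets() and
    in_cone() are hypothetical helpers, and tree paths are assumed to end
    in "/".

```python
def cone_sets(cone_dirs):
    """Return (recursive dirs, ancestor dirs) for a list of cone directories."""
    recursive = {d.strip("/") + "/" for d in cone_dirs}
    parents = {""}  # the root directory is always an ancestor
    for d in recursive:
        parts = d.rstrip("/").split("/")
        for i in range(1, len(parts)):
            parents.add("/".join(parts[:i]) + "/")
    return recursive, parents


def in_cone(path, recursive, parents):
    """Decide whether the walk should keep this tree or blob path."""
    if any(path.startswith(d) for d in recursive):
        return True  # inside a recursively-included directory
    if path.endswith("/"):
        # A tree outside the cone is only worth walking if it leads to the cone.
        return path in parents
    parent = path.rsplit("/", 1)[0] + "/" if "/" in path else ""
    return parent in parents  # blob sitting directly in an ancestor directory


recursive, parents = cone_sets(["A/B"])
```

    With the cone "A/B", the tree "C/" is rejected as soon as it is seen, so
    nothing below it needs to be walked. In non-cone mode no such early exit
    is available in general, because a pattern may match at any depth.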
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 8, 2024
    cd33c62
  6. backfill: assume --sparse when sparse-checkout is enabled

    The previous change introduced the '--[no-]sparse' option for the 'git
    backfill' command, but did not assume it as enabled by default. However,
    this is likely the behavior that users will most often want to happen.
    Without this default, users with a small sparse-checkout may be confused
    when 'git backfill' downloads every version of every object in the full
    history.
    
    However, this is left as a separate change so this decision can be reviewed
    independently of the value of the '--[no-]sparse' option.
    
    Add a test of adding the '--sparse' option to a repo without sparse-checkout
    to make it clear that supplying it without a sparse-checkout is an error.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 8, 2024
    aa34653
  7. path-walk: allow consumer to specify object types

    This adds the ability to ask for the commits as a single list. It also lets
    'git backfill' request only blobs, so its callback can reduce to a BUG()
    statement if called with any other object type.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 8, 2024
    2829fe3

Commits on Sep 9, 2024

  1. path-walk: allow visiting tags

    In anticipation of using the path-walk API to analyze tags or include
    them in a pack-file, add the ability to walk the tags that were included
    in the revision walk.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    d67679d
  2. survey: stub in new experimental git-survey command

    Start work on a new `git survey` command to scan the repository
    for monorepo performance and scaling problems.  The goal is to
    measure the various known "dimensions of scale" and serve as a
    foundation for adding additional measurements as we learn more
    about Git monorepo scaling problems.
    
    The initial goal is to complement the scanning and analysis performed
    by the Go-based `git-sizer` (https://github.com/github/git-sizer) tool.
    It is hoped that by creating a builtin command, we may be able to take
    advantage of internal Git data structures and code that is not
    accessible from Go to gain further insight into potential scaling
    problems.
    
    RFC TODO: Adapt this boilerplate to match the upcoming changes to builtin
    methods that include a 'struct repository' pointer.
    
    Co-authored-by: Derrick Stolee <stolee@gmail.com>
    Signed-off-by: Jeff Hostetler <jeffhostetler@github.com>
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    jeffhostetler and derrickstolee committed Sep 9, 2024
    7d43a16
  3. survey: add command line opts to select references

    By default we will scan all references in "refs/heads/", "refs/tags/"
    and "refs/remotes/".
    
    Add command line opts to let the user ask for all refs or a subset of
    them, and to include a detached HEAD.
    
    Signed-off-by: Jeff Hostetler <jeffhostetler@github.com>
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    jeffhostetler authored and derrickstolee committed Sep 9, 2024
    9098687
  4. survey: collect the set of requested refs

    Collect the set of requested branches, tags, etc. into a ref_array and
    collect the set of requested patterns into a strvec.
    
    RFC TODO: This patch has some changes that should be in the previous patch,
    to make the diff look a lot better.
    
    Co-authored-by: Derrick Stolee <stolee@gmail.com>
    Signed-off-by: Jeff Hostetler <jeffhostetler@github.com>
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    jeffhostetler and derrickstolee committed Sep 9, 2024
    efa1793
  5. survey: start pretty printing data in table form

    When 'git survey' provides information to the user, this will be presented
    in one of two formats: plaintext and JSON. The JSON implementation will be
    delayed until the functionality is complete for the plaintext format.
    
    The most important parts of the plaintext format are headers specifying the
    different sections of the report and tables providing concrete data.
    
    Create a custom table data structure that allows specifying a list of
    strings for the row values. When printing the table, check each column for
    the maximum width so we can create a table of the correct size from the
    start.
    
    The table structure is designed to be flexible to the different kinds of
    output that will be implemented in future changes.
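
    As a rough illustration of the width-first approach, this Python sketch
    computes each column's maximum width before printing anything. It is
    not the C table structure added by this commit; format_table() is a
    hypothetical name, and the sample rows are taken from the object count
    summary shown in a later commit of this series.

```python
def format_table(title, header, rows):
    """Render a right-aligned plaintext table sized to its widest cells."""
    # Transpose so each tuple holds one column: its header cell plus row cells.
    columns = list(zip(header, *rows))
    widths = [max(len(cell) for cell in col) for col in columns]

    def fmt(row):
        return " | ".join(cell.rjust(w) for cell, w in zip(row, widths))

    lines = [
        title,
        "=" * len(title),
        fmt(header),
        "-+-".join("-" * w for w in widths),
    ]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


report = format_table(
    "REACHABLE OBJECT SUMMARY",
    ("Object Type", "Count"),
    [("Tags", "0"), ("Commits", "178573"), ("Trees", "312745"), ("Blobs", "183035")],
)
print(report)
```

    Because all rows are known before the first line is emitted, no line
    has to be re-rendered when a wider value shows up later.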
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    44417cc
  6. survey: add object count summary

    At the moment, the reason for using the path-walk API is not obvious, but
    it will become more apparent in future iterations. For now, use the
    path-walk API to sum up the counts of each kind of object.
    
    For example, this is the reachable object summary output for my local repo:
    
    REACHABLE OBJECT SUMMARY
    ========================
    Object Type |  Count
    ------------+-------
           Tags |      0
        Commits | 178573
          Trees | 312745
          Blobs | 183035
    
    (Note: the "Tags" are zero right now because the path-walk API has not been
    integrated to walk tags yet. This will be fixed in a later change.)
    
    RFC TODO: make sure tags are walked before this change.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    fcc281a
  7. survey: summarize total sizes by object type

    Now that we have explored objects by count, we can expand that a bit more to
    summarize the data for the on-disk and inflated size of those objects. This
    information is helpful for diagnosing both why disk space (and perhaps
    clone or fetch times) is growing and why certain operations are slow
    because the inflated size of the abstract objects that must be processed is
    so large.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    462ca0b
  8. survey: show progress during object walk

    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    9c54c14
  9. survey: add ability to track prioritized lists

    In future changes, we will make use of these methods. The intention is to
    keep track of the top contributors according to some metric. We don't want
    to store all of the entries and do a sort at the end, so track a
    constant-size table and remove rows that get pushed out depending on the
    chosen sorting algorithm.
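
    A minimal Python sketch of such a constant-size table, assuming entries
    are ranked by a single numeric value in descending order.
    add_to_top_list() and the sample names are hypothetical and do not
    mirror the C methods added here.

```python
def add_to_top_list(top, limit, name, value):
    """Insert (name, value) into `top`, kept sorted descending, at most `limit` rows."""
    pos = len(top)
    # Walk up from the bottom past every row that the new value outranks.
    while pos > 0 and top[pos - 1][1] < value:
        pos -= 1
    if pos >= limit:
        return  # not large enough to make the table
    top.insert(pos, (name, value))
    del top[limit:]  # drop the row that was pushed out, if any


top = []
for name, value in [("a", 5), ("b", 9), ("c", 1), ("d", 7), ("e", 3)]:
    add_to_top_list(top, 3, name, value)
```

    Each insertion costs at most `limit` comparisons, and memory stays
    bounded no matter how many entries are offered.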
    
    Co-authored-by: Jeff Hostetler <git@jeffhostetler.com>
    Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee and Jeff Hostetler committed Sep 9, 2024
    3504abb
  10. survey: add report of "largest" paths

    Since we are already walking our reachable objects using the path-walk API,
    let's now collect lists of the paths that contribute most to different
    metrics. Specifically, we care about
    
     * Number of versions.
     * Total size on disk.
     * Total inflated size (no delta or zlib compression).
    
    This information can be critical to discovering which parts of the
    repository are causing the most growth, especially on-disk size. Different
    packing strategies might help compress data more efficiently, but the total
    inflated size is a representation of the raw size of all snapshots of those
    paths. Even when stored efficiently on disk, that size represents how much
    information must be processed to complete a command such as 'git blame'.
    
    Since the on-disk size is likely to be fragile, stop testing the exact
    output of 'git survey' and check that the correct set of headers is
    output.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    9e95914
  11. revision: create mark_trees_uninteresting_dense()

    The sparse tree walk algorithm was created in d5d2e93 (revision:
    implement sparse algorithm, 2019-01-16) and involves using the
    mark_trees_uninteresting_sparse() method. This method takes a repository
    and an oidset of tree IDs, some of which have the UNINTERESTING flag and
    some of which do not.
    
    Create a method that has an equivalent set of preconditions but uses a
    "dense" walk (recursively visits all reachable trees, as long as they
    have not previously been marked UNINTERESTING). This is an important
    difference from mark_tree_uninteresting(), which short-circuits if the
    given tree has the UNINTERESTING flag.
    
    A use of this method will be added in a later change, with a condition
    set whether the sparse or dense approach should be used.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    98a854c
  12. path-walk: add prune_all_uninteresting option

    This option causes the path-walk API to act like the sparse tree-walk
    algorithm implemented by mark_trees_uninteresting_sparse() in
    list-objects.c.
    
    Starting from the commits marked as UNINTERESTING, their root trees and
    all objects reachable from those trees are UNINTERESTING, at least as we
    walk path-by-path. When we reach a path where all objects associated
    with that path are marked UNINTERESTING, then do not continue walking the
    children of that path.
    
    We need to be careful to pass the UNINTERESTING flag in a deep way on
    the UNINTERESTING objects before we start the path-walk, or else the
    depth-first search for the path-walk API may accidentally report some
    objects as interesting.
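
    The two steps described above (deep-mark first, then prune path by path)
    can be modeled in a few lines of Python. This is a sketch under
    assumptions, not the path-walk code: the TREES store, both function
    names, and the choice to report only the interesting objects of each
    batch are illustrative.

```python
from collections import deque

# Hypothetical toy object store: tree OID -> list of (name, type, oid) entries.
TREES = {
    "old-root": [("docs", "tree", "docs1"), ("src", "tree", "src1")],
    "new-root": [("docs", "tree", "docs1"), ("src", "tree", "src2")],
    "docs1": [("guide.txt", "blob", "g1")],
    "src1": [("main.c", "blob", "m1")],
    "src2": [("main.c", "blob", "m2")],
}


def mark_uninteresting_deep(oid, uninteresting):
    """Flag an object and everything reachable from it (blobs have no entries)."""
    if oid in uninteresting:
        return
    uninteresting.add(oid)
    for _name, _type, child in TREES.get(oid, []):
        mark_uninteresting_deep(child, uninteresting)


def path_walk_pruned(interesting_roots, uninteresting_roots):
    uninteresting = set()
    for root in uninteresting_roots:
        mark_uninteresting_deep(root, uninteresting)  # must happen before the walk

    seen, batches, queue, reported = set(), {"": []}, deque([""]), []
    for oid in list(interesting_roots) + list(uninteresting_roots):
        if oid not in seen:
            seen.add(oid)
            batches[""].append(oid)
    while queue:
        path = queue.popleft()
        oids = batches.pop(path)
        if all(oid in uninteresting for oid in oids):
            continue  # prune: nothing below this path can be interesting
        reported.append((path, [o for o in oids if o not in uninteresting]))
        if path != "" and not path.endswith("/"):
            continue  # blob path, no children
        for tree_oid in oids:
            for name, child_type, child_oid in TREES[tree_oid]:
                if child_oid in seen:
                    continue
                seen.add(child_oid)
                child_path = path + name + ("/" if child_type == "tree" else "")
                if child_path not in batches:
                    batches[child_path] = []
                    queue.append(child_path)
                batches[child_path].append(child_oid)
    return reported


result = path_walk_pruned(["new-root"], ["old-root"])
```

    In this toy history "docs/" is unchanged between the two roots, so its
    only tree is UNINTERESTING and "docs/guide.txt" is never visited.
    Without the deep marking, "docs1" would have been discovered through
    "new-root" with no flag and reported as interesting.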
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    78168d9
  13. pack-objects: add --path-walk option

    In order to more easily compute delta bases among objects that appear at the
    exact same path, add a --path-walk option to 'git pack-objects'.
    
    This option will use the path-walk API instead of the object walk given by
    the revision machinery. Since objects will be provided in batches
    representing a common path, those objects can be tested for delta bases
    immediately instead of waiting for a sort of the full object list by
    name-hash. This has multiple benefits, including avoiding collisions by
    name-hash.
    
    The objects marked as UNINTERESTING are included in these batches, so we
    are guaranteeing some locality to find good delta bases.
    
    After the individual passes are done on a per-path basis, the default
    name-hash is used to find other opportunistic delta bases that did not
    match exactly by the full path name.
    
    RFC TODO: It is important to note that this option is inherently
    incompatible with using a bitmap index. This walk probably also does not
    work with other advanced features, such as delta islands.
    
    Getting ahead of myself, this option compares well with --full-name-hash
    when the packfile is large enough, but also performs at least as well as
    the default in all cases that I've seen.
    
    RFC TODO: this should probably be recording the batch locations to another
    list so they could be processed in a second phase using threads.
    
    RFC TODO: list some examples of how this outperforms previous pack-objects
    strategies. (This is coming in later commits that include performance
    test changes.)
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    3455af2
  14. pack-objects: extract should_attempt_deltas()

    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    502008b
  15. pack-objects: introduce GIT_TEST_PACK_PATH_WALK

    There are many tests that validate whether 'git pack-objects' works as
    expected. Instead of duplicating these tests, add a new test environment
    variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
    when specified.
    
    This was useful in testing the --path-walk implementation, especially in
    conjunction with tests such as:
    
     - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
       the --sparse option and how it combines with --path-walk.
    
    RFC TODO: list other helpful test cases, as well as the ones where the
    behavior breaks if this is enabled...
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    Configuration menu
    Copy the full SHA
    b52ee33 View commit details
    Browse the repository at this point in the history
  16. p5313: add size comparison test

    To test the benefits of the new --path-walk option in 'git
    pack-objects', create a performance test that times the process but also
    compares the size of the output.
    
    Against the microsoft/fluentui repo [1] at a particular commit [2],
    this has reproducible results of a similar scale:
    
    Test                                            this tree
    ---------------------------------------------------------------
    5313.2: thin pack                               0.39(0.48+0.03)
    5313.3: thin pack size                                     1.2M
    5313.4: thin pack with --path-walk              0.09(0.07+0.01)
    5313.5: thin pack size with --path-walk                   20.8K
    5313.6: big recent pack                         2.13(8.29+0.26)
    5313.7: big recent pack size                              17.7M
    5313.8: big recent pack with --path-walk        3.18(4.21+0.22)
    5313.9: big recent pack size with --path-walk             15.0M
    
    [1] https://github.com/microsoft/fluentui
    [2] e70848ebac1cd720875bccaa3026f4a9ed700e08
    
    RFC TODO: Note that the path-walk version is slower for the big case,
    but the delta calculation is single-threaded with the current
    implementation! It's still faster for the small case that mimics a
    typical push.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    54bd807
  17. repack: add --path-walk option

    Since 'git pack-objects' supports a --path-walk option, allow passing it
    through in 'git repack'. This presents interesting testing opportunities for
    comparing the different repacking strategies against each other.
    
    For the microsoft/fluentui repo [1], the results are very interesting:
    
    Test                                            this tree
    -------------------------------------------------------------------
    5313.10: full repack                            97.91(663.47+2.83)
    5313.11: full repack size                                449.1K
    5313.12: full repack with --path-walk           105.42(120.49+0.95)
    5313.13: full repack size with --path-walk               159.1K
    
    [1] https://github.com/microsoft/fluentui
    
    This repo suffers from having a lot of paths that collide in the name
    hash, so examining them in groups by path leads to better deltas. Also,
    in this case, the single-threaded implementation is competitive with the
    full repack. This is saving time diffing files that have significant
    differences from each other.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    d3284d0
  18. pack-objects: enable --path-walk via config

    Users may want to enable the --path-walk option for 'git pack-objects' by
    default, especially underneath commands like 'git push' or 'git repack'.
    
    This should be limited to client repositories, since the --path-walk option
    disables bitmap walks and so would be a poor fit for Git servers when
    serving fetches and clones. It may still be worth considering when
    repacking the repository, to take advantage of improved deltas across
    historical versions of the same files.
    
    Much like how "pack.useSparse" was introduced and included in
    "feature.experimental" before being enabled by default, use the repository
    settings infrastructure to make the new "pack.usePathWalk" config enabled by
    "feature.experimental" and "feature.manyFiles".
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    1942f7d
  19. scalar: enable path-walk during push via config

    Repositories registered with Scalar are expected to be client-only
    repositories that are rather large. This means that they are more likely to
    be good candidates for using the --path-walk option when running 'git
    pack-objects', especially under the hood of 'git push'. Enable this config
    in Scalar repositories.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    4c10f85
  20. pack-objects: add --full-name-hash option

    RFC NOTE: this is essentially the same as the patch introduced
    independently of the RFC, but now is on top of the --path-walk option
    instead. This is included in the RFC for comparison purposes.
    
    RFC NOTE: As you can see from the details below, the --full-name-hash
    option essentially attempts to do similar things as the --path-walk
    option, but sometimes misses the mark. Collisions still happen with the
    --full-name-hash option, leading to some misses. However, in cases where
    the default name-hash algorithm has low collision rates and deltas are
    actually desired across objects with similar names but different full
    names, the --path-walk option can still take advantage of the default
    name hash approach.
    
    Here are the new performance details simulating a single push in an
    internal monorepo using a lot of paths that collide in the default name
    hash. We can see that --full-name-hash gets close to the --path-walk
    option's size.
    
    Test                                           this tree
    --------------------------------------------------------------
    5313.2: thin pack                              2.43(2.92+0.14)
    5313.3: thin pack size                                    4.5M
    5313.4: thin pack with --full-name-hash        0.31(0.49+0.12)
    5313.5: thin pack size with --full-name-hash             15.5K
    5313.6: thin pack with --path-walk             0.35(0.31+0.04)
    5313.7: thin pack size with --path-walk                  14.2K
    
    However, when simulating pushes on repositories that do not have issues
    with name-hash collisions, the --full-name-hash option presents a
    potential of worse delta calculations, such as this example using my
    local Git repository:
    
    Test                                           this tree
    --------------------------------------------------------------
    5313.2: thin pack                              0.03(0.01+0.01)
    5313.3: thin pack size                                     475
    5313.4: thin pack with --full-name-hash        0.02(0.01+0.01)
    5313.5: thin pack size with --full-name-hash             14.8K
    5313.6: thin pack with --path-walk             0.02(0.01+0.01)
    5313.7: thin pack size with --path-walk                    475
    
    Note that the path-walk option found the same delta bases as the default
    options in this case.
    
    In the full repack case, the --full-name-hash option may be preferable
    because it interacts well with other advanced features, such as using
    bitmap indexes and tracking delta islands.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    db8cc46
  21. test-name-hash: add helper to compute name-hash functions

    Using this tool, we can count how many distinct name-hash values exist
    within a list of paths. Examples include
    
     git ls-tree -r --name-only HEAD | \
         test-tool name-hash | \
         awk "{print \$1;}" | \
         sort -ns | uniq | wc -l
    
    which outputs the number of distinct name-hash values that appear at
    HEAD. Or, the following which presents the resulting name-hash values of
    maximum multiplicity:
    
     git ls-tree -r --name-only HEAD | \
         test-tool name-hash | \
         awk "{print \$1;}" | \
         sort -n | uniq -c | sort -nr | head -n 25
    
    For an internal monorepo with around a quarter million paths at HEAD,
    the highest multiplicity for the standard name-hash function was 14,424
    while the full name-hash algorithm had only seven hash values with any
    collision, with a maximum multiplicity of two.
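
    The same two statistics can be reproduced on a small list of paths with
    the Python sketch below. name_hash() is a transcription of Git's default
    pack_name_hash() as I understand it (ASCII paths assumed);
    full_hash_stand_in() uses CRC-32 purely as a placeholder for "some hash
    of the whole path" and is NOT the full-name-hash function from this
    series. hash_stats() and PATHS are hypothetical.

```python
import zlib
from collections import Counter


def name_hash(path):
    """Default name-hash: later characters weigh most, whitespace is skipped."""
    h = 0
    for byte in path.encode():
        if byte in b" \t\n\r":
            continue
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF  # emulate uint32_t arithmetic
    return h


def full_hash_stand_in(path):
    """Placeholder for a hash over the full path (assumption: CRC-32)."""
    return zlib.crc32(path.encode())


def hash_stats(paths, hash_fn):
    """Return (number of distinct hash values, maximum multiplicity)."""
    counts = Counter(hash_fn(p) for p in paths)
    return len(counts), max(counts.values())


PATHS = [
    "a/pkg/CHANGELOG.md",
    "b/pkg/CHANGELOG.md",
    "a/pkg/index.ts",
    "README.md",
]
default_stats = hash_stats(PATHS, name_hash)
full_stats = hash_stats(PATHS, full_hash_stand_in)
```

    The two CHANGELOG.md paths differ only in their first directory, which
    lies more than sixteen characters from the end, so the default hash
    assigns them the same value and the maximum multiplicity is at least
    two.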
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    Configuration menu
    Copy the full SHA
    8df39a4 View commit details
    Browse the repository at this point in the history
  22. p5314: add a size test for name-hash collisions

    This test helps inform someone as to the behavior of the name-hash
    algorithms for their repo based on the paths at HEAD.
    
    For example, the microsoft/fluentui repo had these statistics at time of
    committing:
    
    Test                                              this tree
    -----------------------------------------------------------------
    5314.1: paths at head                                       19.6K
    5314.2: number of distinct name-hashes                       8.2K
    5314.3: number of distinct full-name-hashes                 19.6K
    5314.4: maximum multiplicity of name-hashes                   279
    5314.5: maximum multiplicity of fullname-hashes                 1
    
    This demonstrates that the nearly twenty thousand path names are
    assigned only around eight thousand distinct name-hash values. 279 paths
    are assigned to a single value, leading the packing algorithm to sort
    objects from those paths together, by size.
    
    In this repository, no collisions occur for the full-name-hash
    algorithm.
    
    In a more extreme example, an internal monorepo had a much worse
    collision rate:
    
    Test                                              this tree
    -----------------------------------------------------------------
    5314.1: paths at head                                      221.6K
    5314.2: number of distinct name-hashes                      72.0K
    5314.3: number of distinct full-name-hashes                221.6K
    5314.4: maximum multiplicity of name-hashes                 14.4K
    5314.5: maximum multiplicity of fullname-hashes                 2
    
    Even in this repository with many more paths at HEAD, the collision rate
    of the full-name-hash algorithm was low: the maximum number of paths
    grouped into a single bucket was two.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    5dcb20a
  23. pack-objects: output debug info about deltas

    In order to debug what is going on during delta calculations, add a
    --debug-file=<file> option to 'git pack-objects'. This leads to sending
    a JSON-formatted description of the delta information to that file.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 9, 2024
    460feef