pack-objects: create new name-hash algorithm #1785

Closed
wants to merge 6 commits into from

Commits on Sep 23, 2024

  1. pack-objects: add --full-name-hash option

    The pack_name_hash() method has not been materially changed since it was
    introduced in ce0bd64 (pack-objects: improve path grouping
    heuristics., 2006-06-05). The intention here is to group objects by path
    name, but also attempt to group similar file types together by making
    the most-significant digits of the hash be focused on the final
    characters.
    
    Here's the crux of the implementation:
    
    	/*
    	 * This effectively just creates a sortable number from the
    	 * last sixteen non-whitespace characters. Last characters
    	 * count "most", so things that end in ".c" sort together.
    	 */
    	while ((c = *name++) != 0) {
    		if (isspace(c))
    			continue;
    		hash = (hash >> 2) + (c << 24);
    	}
    
    As the comment mentions, this only cares about the last sixteen
    non-whitespace characters. This causes some filenames to collide more
    than others. Here are some examples that I've seen while investigating
    repositories that are growing more than they should be:
    
     * "/CHANGELOG.json" is 15 characters, and is created by the beachball
       [1] tool. Only the final character of the parent directory can
       differentiate different versions of this file, and only through its
       two most-significant bits. If that character is a letter, then this
       is always a collision. Similar issues occur with the similar
       "/CHANGELOG.md" path, though there is more opportunity for
       differences in the parent directory.
    
     * Localization files frequently have common filenames but differentiate
       via parent directories. In C#, the name "/strings.resx.lcl" is used
       for these localization files and they will all collide in name-hash.
    
    [1] https://github.com/microsoft/beachball
    
    I've come across many other examples where some internal tool uses a
    common name across multiple directories and is causing Git to repack
    poorly due to name-hash collisions.
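
To make the collisions described above concrete, here is a minimal standalone sketch that copies the quoted loop into a helper. The names name_hash_sketch and name_hash_collides, and the sample paths in the note below, are illustrative assumptions and not symbols from the patch. Each character is added at bit 24 and everything older slides down two bits per step, so a character sixteen positions from the end keeps only its top bits.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Standalone copy of the loop quoted above. The function name is
 * hypothetical; it is not the symbol used in git's source.
 */
static uint32_t name_hash_sketch(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;

	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		/* Older characters slide down two bits per step. */
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}

/* Returns 1 when two paths land in the same name-hash bucket. */
static int name_hash_collides(const char *a, const char *b)
{
	return name_hash_sketch(a) == name_hash_sketch(b);
}
```

With this sketch, name_hash_collides("pkg/alpha/CHANGELOG.json", "pkg/beta/CHANGELOG.json") reports a collision, while a path with a different final character, such as "pkg/alpha/README.md", lands in a different bucket.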
    
    It is clear that the existing name-hash algorithm is optimized for
    repositories with short path names, but also is optimized for packing a
    single snapshot of a repository, not a repository with many versions of
    the same file. In my testing, this has held true: the name-hash
    algorithm does a good job of finding peer files as delta bases when
    unable to use a historical version of that exact file.
    
    However, for repositories that have many versions of most files and
    directories, it is more important that the objects that appear at the
    same path are grouped together.
    
    Create a new pack_full_name_hash() method and a new --full-name-hash
    option for 'git pack-objects' to call that method instead. Add a simple
    pass-through for 'git repack --full-name-hash' for additional testing in
    the context of a full repack, where I expect this will be most
    effective.
    
    The hash algorithm is as simple as possible to be reasonably effective:
    for each character of the path string, add the product of that character
    and a large prime number (chosen arbitrarily, but intended to be large
    relative to the size of a uint32_t). Then, rotate the current hash value
    to the right by 5 bits, so the bits shifted off the low end wrap around
    to the high end. The addition and rotation are standard mechanisms for
    creating hard-to-predict behaviors in the bits of the resulting hash.
    
    This is not meant to be cryptographic at all, but uniformly distributed
    across the possible hash values. This creates a hash that appears
    pseudorandom. There is no ability to consider similar file types as
    being close to each other.
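
The description above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the multiplier value, its use as the initial hash value, and the names rotr32 and full_name_hash_sketch are all assumptions, since this message does not give them.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (0 < n < 32). */
static uint32_t rotr32(uint32_t v, unsigned n)
{
	return (v >> n) | (v << (32 - n));
}

/*
 * SKETCH_MULTIPLIER is an assumed large odd constant chosen for
 * illustration; the constant used by the actual patch is not stated
 * in this message. Seeding the hash with it is also an assumption.
 */
#define SKETCH_MULTIPLIER 1234572167U

static uint32_t full_name_hash_sketch(const char *name)
{
	uint32_t c, hash = SKETCH_MULTIPLIER;

	if (!name)
		return 0;

	while ((c = (unsigned char)*name++) != 0) {
		hash += c * SKETCH_MULTIPLIER; /* mix in the character */
		hash = rotr32(hash, 5);        /* shift right by 5, wrapping */
	}
	return hash;
}
```

Because the multiplier is odd and a rotation loses no bits, each step of this sketch is reversible. Two paths of equal length that differ in exactly one character therefore always receive different values, so "a/CHANGELOG.json" and "b/CHANGELOG.json" no longer share a bucket as they do under the default hash.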
    
    In a later change, a test-tool will be added so the effectiveness of
    this hash can be demonstrated directly.
    
    For now, let's consider how effective this mechanism is when repacking a
    repository with and without the --full-name-hash option. Specifically,
    let's use 'git repack -adf [--full-name-hash]' as our test.
    
    On the Git repository, we do not expect much difference. All path names
    are short. This is backed by our results:
    
    | Stage                 | Pack Size | Repack Time |
    |-----------------------|-----------|-------------|
    | After clone           | 260 MB    | N/A         |
    | Standard Repack       | 127 MB    | 106s        |
    | With --full-name-hash | 126 MB    | 99s         |
    
    This example demonstrates how there is some natural overhead coming from
    the cloned copy because the server is hosting many forks and has not
    optimized for exactly this set of reachable objects. But the full repack
    has similar characteristics with and without --full-name-hash.
    
    However, we can test this in a repository that uses one of the
    problematic naming conventions above. The fluentui [2] repo uses
    beachball to generate CHANGELOG.json and CHANGELOG.md files, and these
    files have very poor delta characteristics when comparing against
    versions across parent directories.
    
    | Stage                 | Pack Size | Repack Time |
    |-----------------------|-----------|-------------|
    | After clone           | 694 MB    | N/A         |
    | Standard Repack       | 438 MB    | 728s        |
    | With --full-name-hash | 168 MB    | 142s        |
    
    [2] https://github.com/microsoft/fluentui
    
    In this example, we see significant gains in the compressed packfile
    size as well as the time taken to compute the packfile.
    
    Using a collection of repositories that use the beachball tool, I was
    able to make similar comparisons with dramatic results. While the
    fluentui repo is public, the others are private so cannot be shared for
    reproduction. The results are so significant that I find it important to
    share here:
    
    | Repo     | Standard Repack | With --full-name-hash |
    |----------|-----------------|-----------------------|
    | fluentui |         438 MB  |               168 MB  |
    | Repo B   |       6,255 MB  |               829 MB  |
    | Repo C   |      37,737 MB  |             7,125 MB  |
    | Repo D   |     130,049 MB  |             6,190 MB  |
    
    Future changes could include making --full-name-hash implied by a config
    value or even implied by default during a full repack.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 23, 2024
    9c8f8f3
  2. repack: test --full-name-hash option

    The new '--full-name-hash' option for 'git repack' is a simple
    pass-through to the underlying 'git pack-objects' subcommand. However,
    this subcommand may have other options and a temporary filename as part
    of the subcommand execution that may not be predictable or could change
    over time.
    
    The existing test_subcommand method requires an exact list of arguments
    for the subcommand. This is too rigid for our needs here, so create a
    new method, test_subcommand_flex. Use it to check that the
    --full-name-hash option is passing through.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 23, 2024
    612dbd1
  3. pack-objects: add GIT_TEST_FULL_NAME_HASH

    Add a new environment variable to opt-in to the --full-name-hash option
    in 'git pack-objects'. This allows for extra testing of the feature
    without repeating all of the test scenarios.
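
As a rough illustration of how an environment variable can supply a default when no command-line flag is given, consider the standalone sketch below. Git has its own helpers for reading boolean environment variables, and this sketch does not use them; the function names and the accepted spellings here are assumptions made for the example.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Interpret an environment-style value as a boolean. This sketch
 * accepts only a small set of spellings: unset (NULL) yields the
 * default, and anything unrecognized counts as false.
 */
static int parse_bool_value(const char *value, int default_value)
{
	if (!value)
		return default_value;
	return !strcmp(value, "1") || !strcmp(value, "true") ||
	       !strcmp(value, "yes") || !strcmp(value, "on");
}

/* Decide whether the full-name hash is on when no flag was given. */
static int use_full_name_hash_by_default(void)
{
	return parse_bool_value(getenv("GIT_TEST_FULL_NAME_HASH"), 0);
}
```

An explicit --full-name-hash or its negation on the command line would be expected to take precedence over this default.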
    
    But this option isn't free. There are a few tests that change behavior
    with the variable enabled.
    
    First, there are a few tests that are very sensitive to certain delta
    bases being picked. These both involve generating thin bundles and
    then counting their objects via 'git index-pack --fix-thin', which
    pulls the delta base into the new packfile. For these tests, disabling
    the variable is a reasonable long-term solution.
    
    Second, there are two tests in t5616-partial-clone.sh that I believe are
    actually broken scenarios. While the client is set up to clone the
    'promisor-server' repo via a treeless partial clone filter (tree:0),
    that filter does not translate to the 'server' repo. Thus, fetching from
    these repos causes the server to think that the client has all reachable
    trees and blobs from the commits advertised as 'haves'. This leads the
    server to provide a thin pack that uses those objects as delta bases.
    Changing the name-hash algorithm presents new delta bases and thus
    breaks the expectations of these tests. An alternative could be to set
    up 'server' as a promisor server with the correct filter enabled. This
    may also point out more issues with partial clone being set up as a
    remote-based filtering mechanism and not a repository-wide setting. For
    now, do the minimal change to make the test work by disabling the test
    variable.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 23, 2024
    e173de6
  4. git-repack: update usage to match docs

    This also adds the '--full-name-hash' option introduced in the previous
    change and adds newlines to the synopsis.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 23, 2024
    543382b
  5. p5313: add size comparison test

    As custom options are added to 'git pack-objects' and 'git repack' to
    adjust how compression is done, use this new performance test script to
    demonstrate their effectiveness in performance and size.
    
    The recently-added --full-name-hash option swaps the default name-hash
    algorithm with one that attempts to uniformly distribute the hashes
    based on the full path name instead of the last 16 characters.
    
    This has a dramatic effect on full repacks for repositories with many
    versions of most paths. It can have a negative impact on cases such as
    pushing a single change.
    
    This can be seen by running p5313 on the open source fluentui
    repository [1]. Most commits will have this kind of output for the thin
    and big pack cases, though certain commits (such as [2]) will have
    problematic thin pack size for other reasons.
    
    [1] https://github.com/microsoft/fluentui
    [2] a637a06df05360ce5ff21420803f64608226a875
    
    Checked out at the parent of [2], I see the following statistics:
    
    Test                                           this tree
    ------------------------------------------------------------------
    5313.2: thin pack                              0.02(0.01+0.01)
    5313.3: thin pack size                                    1.1K
    5313.4: thin pack with --full-name-hash        0.02(0.01+0.00)
    5313.5: thin pack size with --full-name-hash              3.0K
    5313.6: big pack                               1.65(3.35+0.24)
    5313.7: big pack size                                    58.0M
    5313.8: big pack with --full-name-hash         1.53(2.52+0.18)
    5313.9: big pack size with --full-name-hash              57.6M
    5313.10: repack                                176.52(706.60+3.53)
    5313.11: repack size                                    446.7K
    5313.12: repack with --full-name-hash          37.47(134.18+3.06)
    5313.13: repack size with --full-name-hash              183.1K
    
    Note that this demonstrates a 3x size _increase_ in the case that
    simulates a small "git push". The size change is neutral on the case of
    pushing the difference between HEAD and HEAD~1000.
    
    However, the full repack case is both faster and more efficient.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 23, 2024
    4d2381a

Commits on Sep 24, 2024

  1. test-tool: add helper for name-hash values

    Add a new test-tool helper, name-hash, to output the value of the
    name-hash algorithms for the input list of strings, one per line.
    
    Since the name-hash values can be stored in the .bitmap files, it is
    important that these hash functions do not change across Git versions.
    Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
    the current values. Due to how these functions are implemented, it would
    be difficult to change them without disturbing these values.
    
    Create a performance test that uses test_size to demonstrate how
    collisions occur for these hash algorithms. This test helps inform
    someone as to the behavior of the name-hash algorithms for their repo
    based on the paths at HEAD.
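
The two statistics reported by that test, the number of distinct hash values and the maximum multiplicity, can be computed along the following lines. This is a standalone sketch, not the performance test itself: struct hash_stats, collect_hash_stats, and name_hash_sketch are hypothetical names, and the hash is a copy of the default name-hash loop quoted in the first commit.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

struct hash_stats {
	size_t distinct;         /* number of distinct hash values */
	size_t max_multiplicity; /* most paths sharing one value */
};

/* Same loop as the default name-hash quoted in the first commit. */
static uint32_t name_hash_sketch(const char *name)
{
	uint32_t c, hash = 0;

	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}

static int cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
	return (x > y) - (x < y);
}

/*
 * Hash every path, sort the values, then scan runs of equal values.
 * Returns all-zero stats for an empty list or on allocation failure.
 */
static struct hash_stats collect_hash_stats(const char **paths, size_t nr)
{
	struct hash_stats st = { 0, 0 };
	uint32_t *hashes;
	size_t i, run = 0;

	if (!nr)
		return st;
	hashes = malloc(nr * sizeof(*hashes));
	if (!hashes)
		return st;

	for (i = 0; i < nr; i++)
		hashes[i] = name_hash_sketch(paths[i]);
	qsort(hashes, nr, sizeof(*hashes), cmp_u32);

	for (i = 0; i < nr; i++) {
		if (i && hashes[i] == hashes[i - 1]) {
			run++;
		} else {
			st.distinct++;
			run = 1;
		}
		if (run > st.max_multiplicity)
			st.max_multiplicity = run;
	}
	free(hashes);
	return st;
}
```

For the three paths "a/CHANGELOG.json", "b/CHANGELOG.json", and "README.md", this sketch reports two distinct values and a maximum multiplicity of two, because the two CHANGELOG.json paths share a bucket.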
    
    My copy of the Git repository shows modest statistics around the
    collisions of the default name-hash algorithm:
    
    Test                                              this tree
    -----------------------------------------------------------------
    5314.1: paths at head                                        4.5K
    5314.2: number of distinct name-hashes                       4.1K
    5314.3: number of distinct full-name-hashes                  4.5K
    5314.4: maximum multiplicity of name-hashes                    13
    5314.5: maximum multiplicity of fullname-hashes                 1
    
    Here, the maximum collision multiplicity is 13, but around 10% of paths
    have a collision with another path.
    
    In a more interesting example, the microsoft/fluentui [1] repo had these
    statistics at time of committing:
    
    Test                                              this tree
    -----------------------------------------------------------------
    5314.1: paths at head                                       19.6K
    5314.2: number of distinct name-hashes                       8.2K
    5314.3: number of distinct full-name-hashes                 19.6K
    5314.4: maximum multiplicity of name-hashes                   279
    5314.5: maximum multiplicity of fullname-hashes                 1
    
    [1] https://github.com/microsoft/fluentui
    
    This demonstrates that the nearly twenty thousand path names are
    assigned only around eight thousand distinct values. 279 paths are
    assigned to a single value, leading the packing algorithm to sort
    objects from those paths together, by size.
    
    In this repository, no collisions occur for the full-name-hash
    algorithm.
    
    In a more extreme example, an internal monorepo had a much worse
    collision rate:
    
    Test                                              this tree
    -----------------------------------------------------------------
    5314.1: paths at head                                      221.6K
    5314.2: number of distinct name-hashes                      72.0K
    5314.3: number of distinct full-name-hashes                221.6K
    5314.4: maximum multiplicity of name-hashes                 14.4K
    5314.5: maximum multiplicity of fullname-hashes                 2
    
    Even in this repository with many more paths at HEAD, the collision rate
    was low and the maximum number of paths being grouped into a single
    bucket by the full-name-hash algorithm was two.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 24, 2024
    80ba362