pack-objects: create new name-hash algorithm #1785

The pack_name_hash() method has not been materially changed since it was introduced in ce0bd64 (pack-objects: improve path grouping heuristics., 2006-06-05). The intention here is to group objects by path name, but also attempt to group similar file types together by making the most-significant digits of the hash be focused on the final characters. Here's the crux of the implementation: /* * This effectively just creates a sortable number from the * last sixteen non-whitespace characters. Last characters * count "most", so things that end in ".c" sort together. */ while ((c = *name++) != 0) { if (isspace(c)) continue; hash = (hash >> 2) + (c << 24); } As the comment mentions, this only cares about the last sixteen non-whitespace characters. This cause some filenames to collide more than others. Here are some examples that I've seen while investigating repositories that are growing more than they should be: * "/CHANGELOG.json" is 15 characters, and is created by the beachball [1] tool. Only the final character of the parent directory can differntiate different versions of this file, but also only the two most-significant digits. If that character is a letter, then this is always a collision. Similar issues occur with the similar "/CHANGELOG.md" path, though there is more opportunity for differences in the parent directory. * Localization files frequently have common filenames but differentiate via parent directories. In C#, the name "/strings.resx.lcl" is used for these localization files and they will all collide in name-hash. [1] https://github.com/microsoft/beachball I've come across many other examples where some internal tool uses a common name across multiple directories and is causing Git to repack poorly due to name-hash collisions. It is clear that the existing name-hash algorithm is optimized for repositories with short path names, but also is optimized for packing a single snapshot of a repository, not a repository with many versions of the same file. In my testing, this has proven out where the name-hash algorithm does a good job of finding peer files as delta bases when unable to use a historical version of that exact file. However, for repositories that have many versions of most files and directories, it is more important that the objects that appear at the same path are grouped together. Create a new pack_full_name_hash() method and a new --full-name-hash option for 'git pack-objects' to call that method instead. Add a simple pass-through for 'git repack --full-name-hash' for additional testing in the context of a full repack, where I expect this will be most effective. The hash algorithm is as simple as possible to be reasonably effective: for each character of the path string, add a multiple of that character and a large prime number (chosen arbitrarily, but intended to be large relative to the size of a uint32_t). Then, shift the current hash value to the right by 5, with overlap. The addition and shift parameters are standard mechanisms for creating hard-to-predict behaviors in the bits of the resulting hash. This is not meant to be cryptographic at all, but uniformly distributed across the possible hash values. This creates a hash that appears pseudorandom. There is no ability to consider similar file types as being close to each other. In a later change, a test-tool will be added so the effectiveness of this hash can be demonstrated directly. For now, let's consider how effective this mechanism is when repacking a repository with and without the --full-name-hash option. Specifically, let's use 'git repack -adf [--full-name-hash]' as our test. On the Git repository, we do not expect much difference. All path names are short. This is backed by our results: | Stage | Pack Size | Repack Time | |-----------------------|-----------|-------------| | After clone | 260 MB | N/A | | Standard Repack | 127MB | 106s | | With --full-name-hash | 126 MB | 99s | This example demonstrates how there is some natural overhead coming from the cloned copy because the server is hosting many forks and has not optimized for exactly this set of reachable objects. But the full repack has similar characteristics with and without --full-name-hash. However, we can test this in a repository that uses one of the problematic naming conventions above. The fluentui [2] repo uses beachball to generate CHANGELOG.json and CHANGELOG.md files, and these files have very poor delta characteristics when comparing against versions across parent directories. | Stage | Pack Size | Repack Time | |-----------------------|-----------|-------------| | After clone | 694 MB | N/A | | Standard Repack | 438 MB | 728s | | With --full-name-hash | 168 MB | 142s | [2] https://github.com/microsoft/fluentui In this example, we see significant gains in the compressed packfile size as well as the time taken to compute the packfile. Using a collection of repositories that use the beachball tool, I was able to make similar comparisions with dramatic results. While the fluentui repo is public, the others are private so cannot be shared for reproduction. The results are so significant that I find it important to share here: | Repo | Standard Repack | With --full-name-hash | |----------|-----------------|-----------------------| | fluentui | 438 MB | 168 MB | | Repo B | 6,255 MB | 829 MB | | Repo C | 37,737 MB | 7,125 MB | | Repo D | 130,049 MB | 6,190 MB | Future changes could include making --full-name-hash implied by a config value or even implied by default during a full repack. Signed-off-by: Derrick Stolee <stolee@gmail.com>

The new '--full-name-hash' option for 'git repack' is a simple pass-through to the underlying 'git pack-objects' subcommand. However, this subcommand may have other options and a temporary filename as part of the subcommand execution that may not be predictable or could change over time. The existing test_subcommand method requires an exact list of arguments for the subcommand. This is too rigid for our needs here, so create a new method, test_subcommand_flex. Use it to check that the --full-name-hash option is passing through. Signed-off-by: Derrick Stolee <stolee@gmail.com>

Add a new environment variable to opt-in to the --full-name-hash option in 'git pack-objects'. This allows for extra testing of the feature without repeating all of the test scenarios. But this option isn't free. There are a few tests that change behavior with the variable enabled. First, there are a few tests that are very sensitive to certain delta bases being picked. These are both involving the generation of thin bundles and then counting their objects via 'git index-pack --fix-thin' which pulls the delta base into the new packfile. For these tests, disable the option as a decent long-term option. Second, there are two tests in t5616-partial-clone.sh that I believe are actually broken scenarios. While the client is set up to clone the 'promisor-server' repo via a treeless partial clone filter (tree:0), that filter does not translate to the 'server' repo. Thus, fetching from these repos causes the server to think that the client has all reachable trees and blobs from the commits advertised as 'haves'. This leads the server to providing a thin pack assuming those objects as delta bases. Changing the name-hash algorithm presents new delta bases and thus breaks the expectations of these tests. An alternative could be to set up 'server' as a promisor server with the correct filter enabled. This may also point out more issues with partial clone being set up as a remote-based filtering mechanism and not a repository-wide setting. For now, do the minimal change to make the test work by disabling the test variable. Signed-off-by: Derrick Stolee <stolee@gmail.com>

This also adds the '--full-name-hash' option introduced in the previous change and adds newlines to the synopsis. Signed-off-by: Derrick Stolee <stolee@gmail.com>

As custom options are added to 'git pack-objects' and 'git repack' to adjust how compression is done, use this new performance test script to demonstrate their effectiveness in performance and size. The recently-added --full-name-hash option swaps the default name-hash algorithm with one that attempts to uniformly distribute the hashes based on the full path name instead of the last 16 characters. This has a dramatic effect on full repacks for repositories with many versions of most paths. It can have a negative impact on cases such as pushing a single change. This can be seen by running pt5313 on the open source fluentui repository [1]. Most commits will have this kind of output for the thin and big pack cases, though certain commits (such as [2]) will have problematic thin pack size for other reasons. [1] https://github.com/microsoft/fluentui [2] a637a06df05360ce5ff21420803f64608226a875 Checked out at the parent of [2], I see the following statistics: Test this tree ------------------------------------------------------------------ 5313.2: thin pack 0.02(0.01+0.01) 5313.3: thin pack size 1.1K 5313.4: thin pack with --full-name-hash 0.02(0.01+0.00) 5313.5: thin pack size with --full-name-hash 3.0K 5313.6: big pack 1.65(3.35+0.24) 5313.7: big pack size 58.0M 5313.8: big pack with --full-name-hash 1.53(2.52+0.18) 5313.9: big pack size with --full-name-hash 57.6M 5313.10: repack 176.52(706.60+3.53) 5313.11: repack size 446.7K 5313.12: repack with --full-name-hash 37.47(134.18+3.06) 5313.13: repack size with --full-name-hash 183.1K Note that this demonstrates a 3x size _increase_ in the case that simulates a small "git push". The size change is neutral on the case of pushing the difference between HEAD and HEAD~1000. However, the full repack case is both faster and more efficient. Signed-off-by: Derrick Stolee <stolee@gmail.com>

Add a new test-tool helper, name-hash, to output the value of the name-hash algorithms for the input list of strings, one per line. Since the name-hash values can be stored in the .bitmap files, it is important that these hash functions do not change across Git versions. Add a simple test to t5310-pack-bitmaps.sh to provide some testing of the current values. Due to how these functions are implemented, it would be difficult to change them without disturbing these values. Create a performance test that uses test_size to demonstrate how collisions occur for these hash algorithms. This test helps inform someone as to the behavior of the name-hash algorithms for their repo based on the paths at HEAD. My copy of the Git repository shows modest statistics around the collisions of the default name-hash algorithm: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 4.5K 5314.2: number of distinct name-hashes 4.1K 5314.3: number of distinct full-name-hashes 4.5K 5314.4: maximum multiplicity of name-hashes 13 5314.5: maximum multiplicity of fullname-hashes 1 Here, the maximum collision multiplicity is 13, but around 10% of paths have a collision with another path. In a more interesting example, the microsoft/fluentui [1] repo had these statistics at time of committing: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 19.6K 5314.2: number of distinct name-hashes 8.2K 5314.3: number of distinct full-name-hashes 19.6K 5314.4: maximum multiplicity of name-hashes 279 5314.5: maximum multiplicity of fullname-hashes 1 [1] https://github.com/microsoft/fluentui That demonstrates that of the nearly twenty thousand path names, they are assigned around eight thousand distinct values. 279 paths are assigned to a single value, leading the packing algorithm to sort objects from those paths together, by size. In this repository, no collisions occur for the full-name-hash algorithm. In a more extreme example, an internal monorepo had a much worse collision rate: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 221.6K 5314.2: number of distinct name-hashes 72.0K 5314.3: number of distinct full-name-hashes 221.6K 5314.4: maximum multiplicity of name-hashes 14.4K 5314.5: maximum multiplicity of fullname-hashes 2 Even in this repository with many more paths at HEAD, the collision rate was low and the maximum number of paths being grouped into a single bucket by the full-path-name algorithm was two. Signed-off-by: Derrick Stolee <stolee@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pack-objects: create new name-hash algorithm #1785

pack-objects: create new name-hash algorithm #1785

Commits on Sep 23, 2024

Commits on Sep 24, 2024