-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Path-walk API and applications #1786
Commits on Sep 8, 2024
-
path-walk: introduce an object walk by path
In anticipation of a few planned applications, introduce the most basic form of a path-walk API. It currently assumes that there are no UNINTERESTING objects, and does not include any complicated filters. It calls a function pointer on groups of tree and blob objects as grouped by path. This only includes objects the first time they are discovered, so an object that appears at multiple paths will not be included in two batches. There are many future adaptations that could be made, but they are left for future updates when consumers are ready to take advantage of those features. RFC TODO: It would be helpful to create a test-tool that allows printing of each batch for strong testing. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for a53bd0d - Browse repository at this point
Copy the full SHA a53bd0dView commit details -
backfill: add builtin boilerplate
In anticipation of implementing 'git backfill', populate the necessary files with the boilerplate of a new builtin. RFC TODO: When preparing this for a full implementation, make sure it is based on the newest standards introduced by [1]. [1] https://lore.kernel.org/git/xmqqjzfq2f0f.fsf@gitster.g/T/#m606036ea2e75a6d6819d6b5c90e729643b0ff7f7 [PATCH 1/3] builtin: add a repository parameter for builtin functions Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 41c49bb - Browse repository at this point
Copy the full SHA 41c49bbView commit details -
backfill: basic functionality and tests
The default behavior of 'git backfill' is to fetch all missing blobs that are reachable from HEAD. Document and test this behavior. The implementation is a very simple use of the path-walk API, initializing the revision walk at HEAD to start the path-walk from all commits reachable from HEAD. Ignore the object arrays that correspond to tree entries, assuming that they are all present already. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for be21c83 - Browse repository at this point
Copy the full SHA be21c83View commit details -
backfill: add --batch-size=<n> option
Users may want to specify a minimum batch size for their needs. This is only a minimum: the path-walk API provides a list of OIDs that correspond to the same path, and thus it is optimal to allow delta compression across those objects in a single server request. We could consider limiting the request to have a maximum batch size in the future. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for f904b02 - Browse repository at this point
Copy the full SHA f904b02View commit details -
One way to significantly reduce the cost of a Git clone and later fetches is to use a blobless partial clone and combine that with a sparse-checkout that reduces the paths that need to be populated in the working directory. Not only does this reduce the cost of clones and fetches, the sparse-checkout reduces the number of objects needed to download from a promisor remote. However, history investigations can be expensie as computing blob diffs will trigger promisor remote requests for one object at a time. This can be avoided by downloading the blobs needed for the given sparse-checkout using 'git backfill' and its new '--sparse' mode, at a time that the user is willing to pay that extra cost. Note that this is distinctly different from the '--filter=sparse:<oid>' option, as this assumes that the partial clone has all reachable trees and we are using client-side logic to avoid downloading blobs outside of the sparse-checkout cone. This avoids the server-side cost of walking trees while also achieving a similar goal. It also downloads in batches based on similar path names, presenting a resumable download if things are interrupted. This augments the path-walk API to have a possibly-NULL 'pl' member that may point to a 'struct pattern_list'. This could be more general than the sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently the only consumer. Be sure to test this in both cone mode and not cone mode. Cone mode has the benefit that the path-walk can skip certain paths once they would expand beyond the sparse-checkout. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for cd33c62 - Browse repository at this point
Copy the full SHA cd33c62View commit details -
backfill: assume --sparse when sparse-checkout is enabled
The previous change introduced the '--[no-]sparse' option for the 'git backfill' command, but did not assume it as enabled by default. However, this is likely the behavior that users will most often want to happen. Without this default, users with a small sparse-checkout may be confused when 'git backfill' downloads every version of every object in the full history. However, this is left as a separate change so this decision can be reviewed independently of the value of the '--[no-]sparse' option. Add a test of adding the '--sparse' option to a repo without sparse-checkout to make it clear that supplying it without a sparse-checkout is an error. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for aa34653 - Browse repository at this point
Copy the full SHA aa34653View commit details -
path-walk: allow consumer to specify object types
This adds the ability to ask for the commits as a single list. This will also reduce the calls in 'git backfill' to be a BUG() statement if called with anything other than blobs. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 2829fe3 - Browse repository at this point
Copy the full SHA 2829fe3View commit details
Commits on Sep 9, 2024
-
path-walk: allow visiting tags
In anticipation of using the path-walk API to analyze tags or include them in a pack-file, add the ability to walk the tags that were included in the revision walk. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for d67679d - Browse repository at this point
Copy the full SHA d67679dView commit details -
survey: stub in new experimental
git-survey
commandStart work on a new `git survey` command to scan the repository for monorepo performance and scaling problems. The goal is to measure the various known "dimensions of scale" and serve as a foundation for adding additional measurements as we learn more about Git monorepo scaling problems. The initial goal is to complement the scanning and analysis performed by the GO-based `git-sizer` (https://github.com/github/git-sizer) tool. It is hoped that by creating a builtin command, we may be able to take advantage of internal Git data structures and code that is not accessible from GO to gain further insight into potential scaling problems. RFC TODO: Adapt this boilerplat to match the upcoming changes to builtin methods that include a 'struct repository' pointer. Co-authored-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Jeff Hostetler <jeffhostetler@github.com> Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 7d43a16 - Browse repository at this point
Copy the full SHA 7d43a16View commit details -
survey: add command line opts to select references
By default we will scan all references in "refs/heads/", "refs/tags/" and "refs/remotes/". Add command line opts let the use ask for all refs or a subset of them and to include a detached HEAD. Signed-off-by: Jeff Hostetler <jeffhostetler@github.com> Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 9098687 - Browse repository at this point
Copy the full SHA 9098687View commit details -
survey: collect the set of requested refs
Collect the set of requested branches, tags, and etc into a ref_array and collect the set of requested patterns into a strvec. RFC TODO: This patch has some changes that should be in the previous patch, to make the diff look a lot better. Co-authored-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Jeff Hostetler <jeffhostetler@github.com> Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for efa1793 - Browse repository at this point
Copy the full SHA efa1793View commit details -
survey: start pretty printing data in table form
When 'git survey' provides information to the user, this will be presented in one of two formats: plaintext and JSON. The JSON implementation will be delayed until the functionality is complete for the plaintext format. The most important parts of the plaintext format are headers specifying the different sections of the report and tables providing concreted data. Create a custom table data structure that allows specifying a list of strings for the row values. When printing the table, check each column for the maximum width so we can create a table of the correct size from the start. The table structure is designed to be flexible to the different kinds of output that will be implemented in future changes. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 44417cc - Browse repository at this point
Copy the full SHA 44417ccView commit details -
survey: add object count summary
At the moment, nothing is obvious about the reason for the use of the path-walk API, but this will become more prevelant in future iterations. For now, use the path-walk API to sum up the counts of each kind of object. For example, this is the reachable object summary output for my local repo: REACHABLE OBJECT SUMMARY ======================== Object Type | Count ------------+------- Tags | 0 Commits | 178573 Trees | 312745 Blobs | 183035 (Note: the "Tags" are zero right now because the path-walk API has not been integrated to walk tags yet. This will be fixed in a later change.) RFC TODO: make sure tags are walked before this change. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for fcc281a - Browse repository at this point
Copy the full SHA fcc281aView commit details -
survey: summarize total sizes by object type
Now that we have explored objects by count, we can expand that a bit more to summarize the data for the on-disk and inflated size of those objects. This information is helpful for diagnosing both why disk space (and perhaps clone or fetch times) is growing but also why certain operations are slow because the inflated size of the abstract objects that must be processed is so large. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 462ca0b - Browse repository at this point
Copy the full SHA 462ca0bView commit details -
survey: show progress during object walk
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 9c54c14 - Browse repository at this point
Copy the full SHA 9c54c14View commit details -
survey: add ability to track prioritized lists
In future changes, we will make use of these methods. The intention is to keep track of the top contributors according to some metric. We don't want to store all of the entries and do a sort at the end, so track a constant-size table and remove rows that get pushed out depending on the chosen sorting algorithm. Co-authored-by: Jeff Hostetler <git@jeffhostetler.com> Signed-off-by; Jeff Hostetler <git@jeffhostetler.com> Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 3504abb - Browse repository at this point
Copy the full SHA 3504abbView commit details -
survey: add report of "largest" paths
Since we are already walking our reachable objects using the path-walk API, let's now collect lists of the paths that contribute most to different metrics. Specifically, we care about * Number of versions. * Total size on disk. * Total inflated size (no delta or zlib compression). This information can be critical to discovering which parts of the repository are causing the most growth, especially on-disk size. Different packing strategies might help compress data more efficiently, but the toal inflated size is a representation of the raw size of all snapshots of those paths. Even when stored efficiently on disk, that size represents how much information must be processed to complete a command such as 'git blame'. Since the on-disk size is likely to be fragile, stop testing the exact output of 'git survey' and check that the correct set of headers is output. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 9e95914 - Browse repository at this point
Copy the full SHA 9e95914View commit details -
revision: create mark_trees_uninteresting_dense()
The sparse tree walk algorithm was created in d5d2e93 (revision: implement sparse algorithm, 2019-01-16) and involves using the mark_trees_uninteresting_sparse() method. This method takes a repository and an oidset of tree IDs, some of which have the UNINTERESTING flag and some of which do not. Create a method that has an equivalent set of preconditions but uses a "dense" walk (recursively visits all reachable trees, as long as they have not previously been marked UNINTERESTING). This is an important difference from mark_tree_uninteresting(), which short-circuits if the given tree has the UNINTERESTING flag. A use of this method will be added in a later change, with a condition set whether the sparse or dense approach should be used. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 98a854c - Browse repository at this point
Copy the full SHA 98a854cView commit details -
path-walk: add prune_all_uninteresting option
This option causes the path-walk API to act like the sparse tree-walk algorithm implemented by mark_trees_uninteresting_sparse() in list-objects.c. Starting from the commits marked as UNINTERESTING, their root trees and all objects reachable from those trees are UNINTERSTING, at least as we walk path-by-path. When we reach a path where all objects associated with that path are marked UNINTERESTING, then do no continue walking the children of that path. We need to be careful to pass the UNINTERESTING flag in a deep way on the UNINTERESTING objects before we start the path-walk, or else the depth-first search for the path-walk API may accidentally report some objects as interesting. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 78168d9 - Browse repository at this point
Copy the full SHA 78168d9View commit details -
pack-objects: add --path-walk option
In order to more easily compute delta bases among objects that appear at the exact same path, add a --path-walk option to 'git pack-objects'. This option will use the path-walk API instead of the object walk given by the revision machinery. Since objects will be provided in batches representing a common path, those objects can be tested for delta bases immediately instead of waiting for a sort of the full object list by name-hash. This has multiple benefits, including avoiding collisions by name-hash. The objects marked as UNINTERESTING are included in these batches, so we are guaranteeing some locality to find good delta bases. After the individual passes are done on a per-path basis, the default name-hash is used to find other opportunistic delta bases that did not match exactly by the full path name. RFC TODO: It is important to note that this option is inherently incompatible with using a bitmap index. This walk probably also does not work with other advanced features, such as delta islands. Getting ahead of myself, this option compares well with --full-name-hash when the packfile is large enough, but also performs at least as well as the default in all cases that I've seen. RFC TODO: this should probably be recording the batch locations to another list so they could be processed in a second phase using threads. RFC TODO: list some examples of how this outperforms previous pack-objects strategies. (This is coming in later commits that include performance test changes.) Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 3455af2 - Browse repository at this point
Copy the full SHA 3455af2View commit details -
pack-objects: extract should_attempt_deltas()
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 502008b - Browse repository at this point
Copy the full SHA 502008bView commit details -
pack-objects: introduce GIT_TEST_PACK_PATH_WALK
There are many tests that validate whether 'git pack-objects' works as expected. Instead of duplicating these tests, add a new test environment variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default when specified. This was useful in testing the implementation of the --path-walk implementation, especially in conjunction with test such as: - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of the --sparse option and how it combines with --path-walk. RFC TODO: list other helpful test cases, as well as the ones where the behavior breaks if this is enabled... Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for b52ee33 - Browse repository at this point
Copy the full SHA b52ee33View commit details -
p5313: add size comparison test
To test the benefits of the new --path-walk option in 'git pack-objects', create a performance test that times the process but also compares the size of the output. Against the microsoft/fluentui repo [1] against a particular commit [2], this has reproducible results of a similar scale: Test this tree --------------------------------------------------------------- 5313.2: thin pack 0.39(0.48+0.03) 5313.3: thin pack size 1.2M 5313.4: thin pack with --path-walk 0.09(0.07+0.01) 5313.5: thin pack size with --path-walk 20.8K 5313.6: big recent pack 2.13(8.29+0.26) 5313.7: big recent pack size 17.7M 5313.8: big recent pack with --path-walk 3.18(4.21+0.22) 5313.9: big recent pack size with --path-walk 15.0M [1] https://github.com/microsoft/reactui [2] e70848ebac1cd720875bccaa3026f4a9ed700e08 RFC TODO: Note that the path-walk version is slower for the big case, but the delta calculation is single-threaded with the current implementation! It's still faster for the small case that mimics a typical push. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 54bd807 - Browse repository at this point
Copy the full SHA 54bd807View commit details -
repack: add --path-walk option
Since 'git pack-objects' supports a --path-walk option, allow passing it through in 'git repack'. This presents interesting testing opportunities for comparing the different repacking strategies against each other. For the microsoft/fluentui repo [1], the results are very interesting: Test this tree ------------------------------------------------------------------- 5313.10: full repack 97.91(663.47+2.83) 5313.11: full repack size 449.1K 5313.12: full repack with --path-walk 105.42(120.49+0.95) 5313.13: full repack size with --path-walk 159.1K [1] https://github.com/microsoft/fluentui This repo suffers from having a lot of paths that collide in the name hash, so examining them in groups by path leads to better deltas. Also, in this case, the single-threaded implementation is competitive with the full repack. This is saving time diffing files that have significant differences from each other. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for d3284d0 - Browse repository at this point
Copy the full SHA d3284d0View commit details -
pack-objects: enable --path-walk via config
Users may want to enable the --path-walk option for 'git pack-objects' by default, especially underneath commands like 'git push' or 'git repack'. This should be limited to client repositories, since the --path-walk option disables bitmap walks, so would be bad to include in Git servers when serving fetches and clones. There is potential that it may be helpful to consider when repacking the repository, to take advantage of improved deltas across historical versions of the same files. Much like how "pack.useSparse" was introduced and included in "feature.experimental" before being enabled by default, use the repository settings infrastructure to make the new "pack.usePathWalk" config enabled by "feature.experimental" and "feature.manyFiles". Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 1942f7d - Browse repository at this point
Copy the full SHA 1942f7dView commit details -
scalar: enable path-walk during push via config
Repositories registered with Scalar are expected to be client-only repositories that are rather large. This means that they are more likely to be good candidates for using the --path-walk option when running 'git pack-objects', especially under the hood of 'git push'. Enable this config in Scalar repositories. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 4c10f85 - Browse repository at this point
Copy the full SHA 4c10f85View commit details -
pack-objects: add --full-name-hash option
RFC NOTE: this is essentially the same as the patch introduced independently of the RFC, but now is on top of the --path-walk option instead. This is included in the RFC for comparison purposes. RFC NOTE: As you can see from the details below, the --full-name-hash option essentially attempts to do similar things as the --path-walk option, but sometimes misses the mark. Collisions still happen with the --full-name-hash option, leading to some misses. However, in cases where the default name-hash algorithm has low collision rates and deltas are actually desired across objects with similar names but different full names, the --path-walk option can still take advantage of the default name hash approach. Here are the new performance details simulating a single push in an internal monorepo using a lot of paths that collide in the default name hash. We can see that --full-name-hash gets close to the --path-walk option's size. Test this tree -------------------------------------------------------------- 5313.2: thin pack 2.43(2.92+0.14) 5313.3: thin pack size 4.5M 5313.4: thin pack with --full-name-hash 0.31(0.49+0.12) 5313.5: thin pack size with --full-name-hash 15.5K 5313.6: thin pack with --path-walk 0.35(0.31+0.04) 5313.7: thin pack size with --path-walk 14.2K However, when simulating pushes on repositories that do not have issues with name-hash collisions, the --full-name-hash option presents a potential of worse delta calculations, such as this example using my local Git repository: Test this tree -------------------------------------------------------------- 5313.2: thin pack 0.03(0.01+0.01) 5313.3: thin pack size 475 5313.4: thin pack with --full-name-hash 0.02(0.01+0.01) 5313.5: thin pack size with --full-name-hash 14.8K 5313.6: thin pack with --path-walk 0.02(0.01+0.01) 5313.7: thin pack size with --path-walk 475 Note that the path-walk option found the same delta bases as the default options in this case. In the full repack case, the --full-name-hash option may be preferable because it interacts well with other advanced features, such as using bitmap indexes and tracking delta islands. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for db8cc46 - Browse repository at this point
Copy the full SHA db8cc46View commit details -
test-name-hash: add helper to compute name-hash functions
Using this tool, we can count how many distinct name-hash values exist within a list of paths. Examples include git ls-tree -r --name-only HEAD | \ test-tool name-hash | \ awk "{print \$1;}" | \ sort -ns | uniq | wc -l which outputs the number of distinct name-hash values that appear at HEAD. Or, the following which presents the resulting name-hash values of maximum multiplicity: git ls-tree -r --name-only HEAD | \ test-tool name-hash | \ awk "{print \$1;}" | \ sort -n | uniq -c | sort -nr | head -n 25 For an internal monorepo with around a quarter million paths at HEAD, the highest multiplicity for the standard name-hash function was 14,424 while the full name-hash algorithm had only seven hash values with any collision, with a maximum multiplicity of two. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 8df39a4 - Browse repository at this point
Copy the full SHA 8df39a4View commit details -
p5314: add a size test for name-hash collisions
This test helps inform someone as to the behavior of the name-hash algorithms for their repo based on the paths at HEAD. For example, the microsoft/fluentui repo had these statistics at time of committing: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 19.6K 5314.2: number of distinct name-hashes 8.2K 5314.3: number of distinct full-name-hashes 19.6K 5314.4: maximum multiplicity of name-hashes 279 5314.5: maximum multiplicity of fullname-hashes 1 That demonstrates that of the nearly twenty thousand path names, they are assigned around eight thousand distinct values. 279 paths are assigned to a single value, leading the packing algorithm to sort objects from those paths together, by size. In this repository, no collisions occur for the full-name-hash algorithm. In a more extreme example, an internal monorepo had a much worse collision rate: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 221.6K 5314.2: number of distinct name-hashes 72.0K 5314.3: number of distinct full-name-hashes 221.6K 5314.4: maximum multiplicity of name-hashes 14.4K 5314.5: maximum multiplicity of fullname-hashes 2 Even in this repository with many more paths at HEAD, the collision rate was low and the maximum number of paths being grouped into a single bucket by the full-path-name algorithm was two. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 5dcb20a - Browse repository at this point
Copy the full SHA 5dcb20aView commit details -
pack-objects: output debug info about deltas
In order to debug what is going on during delta calculations, add a --debug-file=<file> option to 'git pack-objects'. This leads to sending a JSON-formatted description of the delta information to that file. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 460feef - Browse repository at this point
Copy the full SHA 460feefView commit details