forked from git-for-windows/git
-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DO NOT MERGE] sparse index prototype #305
Closed
derrickstolee
wants to merge
121
commits into
microsoft:vfs-2.30.0
from
derrickstolee:sparse-index/vfs
Closed
[DO NOT MERGE] sparse index prototype #305
derrickstolee
wants to merge
121
commits into
microsoft:vfs-2.30.0
from
derrickstolee:sparse-index/vfs
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Some tests in t6407 uses a if/then/else to check if a command failed or not, but we have the `test_must_fail' function to do it correctly for us nowadays. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Currently, merge-index iterates over every index entry, skipping stage0 entries. It will then count how many entries following the current one have the same name, then fork to do the merge. It will then increase the iterator by the number of entries to skip them. This behaviour is correct, as even if the subprocess modifies the index, merge-index does not reload it at all. But when it will be rewritten to use a function, the index it will use will be modified and may shrink when a conflict happens or if a file is removed, so we have to be careful to handle such cases. Here is an example: * Merge branches, file1 and file2 are trivially mergeable. |\ | * Modifies file1 and file2. * | Modifies file1 and file2. |/ * Adds file1 and file2. When the merge happens, the index will look like that: i -> 0. file1 (stage1) 1. file1 (stage2) 2. file1 (stage3) 3. file2 (stage1) 4. file2 (stage2) 5. file2 (stage3) merge-index handles `file1' first. As it appears 3 times after the iterator, it is merged. The index is now stale, `i' is increased by 3, and the index now looks like this: 0. file1 (stage1) 1. file1 (stage2) 2. file1 (stage3) i -> 3. file2 (stage1) 4. file2 (stage2) 5. file2 (stage3) `file2' appears three times too, so it is merged. With a naive rewrite, the index would look like this: 0. file1 (stage0) 1. file2 (stage1) 2. file2 (stage2) i -> 3. file2 (stage3) `file2' appears once at the iterator or after, so it will be added, _not_ merged. Which is wrong. A naive rewrite would lead to unproperly merged files, or even files not handled at all. This changes t6060 to reproduce this case, by creating 2 files instead of 1, to check the correctness of the soon-to-be-rewritten merge-index. The files are identical, which is not really important -- the factors that could trigger this issue are that they should be separated by at most one entry in the index, and that the first one in the index should be trivially mergeable. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This moves the function add_cacheinfo() that already exists in update-index.c to update-index.c, renames it add_to_index_cacheinfo(), and adds an `istate' parameter. The new cache entry is returned through a pointer passed in the parameters. The return value is either 0 (success), -1 (invalid path), or -2 (failed to add the file in the index). This will become useful in the next commit, when the three-way merge will need to call this function. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This rewrites `git merge-one-file' from shell to C. This port is not completely straightforward: to save precious cycles by avoiding reading and flushing the index repeatedly, write temporary files when an operation can be performed in-memory, or allow other function to use the rewrite without forking nor worrying about the index, the calls to external processes are replaced by calls to functions in libgit.a: - calls to `update-index --add --cacheinfo' are replaced by calls to add_to_index_cacheinfo(); - calls to `update-index --remove' are replaced by calls to remove_file_from_index(); - calls to `checkout-index -u -f' are replaced by calls to checkout_entry(); - calls to `unpack-file' and `merge-files' are replaced by calls to read_mmblob() and xdl_merge(), respectively, to merge files in-memory; - calls to `checkout-index -f --stage=2' are removed, as this is needed to have the correct permission bits on the merged file from the script, but not in the C version; - calls to `update-index' are replaced by calls to add_file_to_index(). The bulk of the rewrite is done in a new file in libgit.a, merge-strategies.c. This will enable the resolve and octopus strategies to directly call it instead of forking. This also fixes a bug present in the original script: instead of checking if a _regular_ file exists when a file exists in the branch to merge, but not in our branch, the rewritten version checks if a file of any kind (ie. a directory, ...) exists. This fixes the tests t6035.14, where the branch to merge had a new file, `a/b', but our branch had a directory there; it should have failed because a directory exists, but it did not because there was no regular file called `a/b'. This test is now marked as successful. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "resolve" and "octopus" merge strategies do not call directly `git merge-one-file', they delegate the work to another git command, `git merge-index', that will loop over files in the index and call the specified command. Unfortunately, these functions are not part of libgit.a, which means that once rewritten, the strategies would still have to invoke `merge-one-file' by spawning a new process first. To avoid this, this moves and renames merge_one_path(), merge_all(), and their helpers to merge-strategies.c. They also take a callback to dictate what they should do for each file. For now, to preserve the behaviour of `merge-index', only one callback, launching a new process, is defined. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since `git-merge-one-file' has been rewritten and libified, this teaches `merge-index' to call merge_three_way() without forking using a new callback, merge_one_file_func(). To avoid any issue with a shrinking index because of the merge function used (directly in the process or by forking), as described earlier, the iterator of the loop of merge_all_index() is increased by the number of entries with the same name, minus the difference between the number of entries in the index before and after the merge. This should handle a shrinking index correctly, but could lead to issues with a growing index. However, this case is not treated, as there is no callback that can produce such a case. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This rewrites `git merge-resolve' from shell to C. As for `git merge-one-file', this port is not completely straightforward and removes calls to external processes to avoid reading and writing the index over and over again. - The call to `update-index -q --refresh' is replaced by a call to refresh_index(). - The call to `read-tree' is replaced by a call to unpack_trees() (and all the setup needed). - The call to `write-tree' is replaced by a call to write_index_as_tree(). - The call to `merge-index', needed to invoke `git merge-one-file', is replaced by a call to the new merge_all_index() function. The index is read in cmd_merge_resolve(), and is wrote back by merge_strategies_resolve(). The parameters of merge_strategies_resolve() will be surprising at first glance: why using a commit list for `bases' and `remote', where we could use an oid array, and a pointer to an oid? Because, in a later commit, try_merge_strategy() will be able to call merge_strategies_resolve() directly, and it already uses a commit list for `bases' (`common') and `remote' (`remoteheads'), and a string for `head_arg'. To reduce frictions later, merge_strategies_resolve() takes the same types of parameters. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
better_branch_name() will be used by merge-octopus once it is rewritten in C, so instead of duplicating it, this moves this function preventively inside an appropriate file in libgit.a. This function is also renamed to reflect its usage by merge strategies. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This rewrites `git merge-octopus' from shell to C. As for the two last conversions, this port removes calls to external processes to avoid reading and writing the index over and over again. - Calls to `read-tree -u -m (--aggressive)?' are replaced by calls to unpack_trees(). - The call to `write-tree' is replaced by a call to write_index_as_tree(). - The call to `diff-index ...' is replaced by a call to repo_index_has_changes(). - The call to `merge-index', needed to invoke `git merge-one-file', is replaced by a call to merge_all_index(). The index is read in cmd_merge_octopus(), and is wrote back by merge_strategies_octopus(). Here to, merge_strategies_octopus() takes two commit lists and a string to reduce frictions when try_merge_strategies() will be modified to call it directly. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This teaches `git merge' to invoke the "resolve" strategy with a function call instead of forking. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This teaches `git merge' to invoke the "octopus" strategy with a function call instead of forking. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This teaches the sequencer to invoke the "resolve" strategy with a function call instead of forking. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This teaches the sequencer to invoke the "octopus" strategy with a function call instead of forking. Signed-off-by: Alban Gruin <alban.gruin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The traverse_trees() method recursively walks through trees, but also prunes the tree-walk based on a callback. Some callers, such as unpack_trees(), are quite complicated and can have wildly different performance between two different commands. Create constants that count these values and then report the results at the end of a process. These counts are cumulative across multiple "root" instances of traverse_trees(), but they provide reproducible values for demonstrating improvements to the pruning algorithm when possible. This change is modeled after a similar statistics reporting in 42e50e7 (revision.c: add trace2 stats around Bloom filter usage, 2020-04-06). Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The unpack_trees() method is quite complicated and its performance can change dramatically depending on how it is used. We already have some performance tracing regions, but they have not been updated to the trace2 API. Do so now. We already have trace2 regions in unpack_trees.c:clear_ce_flags(), which uses a linear scan through the index without recursing into trees. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
This matches a trace_performance_enter()/trace_performance_leave() pair added by 0d1ed59 (unpack-trees: add performance tracing, 2018-08-18). Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
As we write or read the cached tree index extension, it can be good to isolate how much of the file I/O time is spent constructing this in-memory tree from the existing index or writing it out again to the new index file. Use trace2 regions to indicate that we are spending time on this operation. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Commands such as "git reset --hard" rebuild the in-memory representation of the cached tree index extension by parsing tree objects starting at a known root tree. The performance of this operation can vary widely depending on the width and depth of the repository's working directory structure. Measure the time in this operation using trace2 regions in prime_cache_tree(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The traverse_trees() method recursively walks through trees, but also prunes the tree-walk based on a callback. Some callers, such as unpack_trees(), are quite complicated and can have wildly different performance between two different commands. Create constants that count these values and then report the results at the end of a process. These counts are cumulative across multiple "root" instances of traverse_trees(), but they provide reproducible values for demonstrating improvements to the pruning algorithm when possible. This change is modeled after a similar statistics reporting in 42e50e7 (revision.c: add trace2 stats around Bloom filter usage, 2020-04-06). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The unpack_trees() method is quite complicated and its performance can change dramatically depending on how it is used. We already have some performance tracing regions, but they have not been updated to the trace2 API. Do so now. We already have trace2 regions in unpack_trees.c:clear_ce_flags(), which uses a linear scan through the index without recursing into trees. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This matches a trace_performance_enter()/trace_performance_leave() pair added by 0d1ed59 (unpack-trees: add performance tracing, 2018-08-18). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The mv builtin uses the compatibility macros to interact with the index. Update these to use modern methods referring to a 'struct index_state' pointer. Several helper methods need to be updated to consider such a pointer, but the modifications are rudimentary. Two macros can be deleted from cache.h because these are the last uses. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The rm builtin still uses the antiquated compatibility macros for interacting with the index. Update these to the more modern uses by passing around a 'struct index_state' pointer. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The index has a "cache tree" extension. This corresponds to a significant API implemented in cache-tree.[ch]. However, there are a few places that refer to this erroneously as "cached tree". These are rare, but notably the index-format.txt file itself makes this error. The only other reference is in t7104-reset-hard.sh. Reported-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
I had difficulty in my efforts to learn about the cache tree extension based on the documentation and code because I had an incorrect assumption about how it behaved. This might be due to some ambiguity in the documentation, so this change modifies the beginning of the cached tree format by expanding the description of the feature. My hope is that this documentation clarifies a few things: 1. There is an in-memory recursive tree structure that is constructed from the extension data. This structure has a few differences, such as where the name is stored. 2. What does it mean for an entry to be invalid? 3. When exactly are "new" trees created? Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The end of the cached tree index extension format trails off with ellipses ever since 23fcc98 (doc: technical details about the index file format, 2011-03-01). While an intuitive reader could gather what this means, it could be better to use "and so on" instead. Really, this is only justified because I also wanted to point out that the number of subtrees in the index format is used to determine when the recursive depth-first-search stack should be "popped." This should help to add clarity to the format. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Use the name length field of cache entries instead of calculating its value anew. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The previous change reduced time spent in strlen() while comparing consecutive paths in verify_cache(), but we can do better. The conditional checks the existence of a directory separator at the correct location, but only after doing a string comparison. Swap the order to be logically equivalent but perform fewer string comparisons. To test the effect on performance, I used a repository with over three million paths in the index. I then ran the following command on repeat: git -c index.threads=1 commit --amend --allow-empty --no-edit Here are the measurements over 10 runs after a 5-run warmup: Benchmark #1: v2.30.0 Time (mean ± σ): 854.5 ms ± 18.2 ms Range (min … max): 825.0 ms … 892.8 ms Benchmark #2: Previous change Time (mean ± σ): 833.2 ms ± 10.3 ms Range (min … max): 815.8 ms … 849.7 ms Benchmark #3: This change Time (mean ± σ): 815.5 ms ± 18.1 ms Range (min … max): 795.4 ms … 849.5 ms This change is 2% faster than the previous change and 5% faster than v2.30.0. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
To reduce the need for the index compatibility macros, we will replace their uses in update-index mechanically. This is the most interesting change, which creates global "repo" and "istate" pointers. The macros that expand to use the_index can then be mechanically replaced by references to the istate pointer. We will be careful to use "repo->index" over "istate" whenever repo is needed by a method. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Also use repo->index over istate, when possible. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When calling index_name_pos() to find a cache entry, we need to be aware that a path might be within a sparse directory. Ensure a full index on each use in apply.c. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Any use of index_name_pos() will require modification to properly handle sparse directory entries. Ensure a full index in diff.c:reuse_worktree_file(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The checkout_file() method in builtin/checkout-index.c uses cache_name_pos() to get the current path position from the current index. This might have strange behavior when the path is inside a sparse directory, so use ensure_full_index(). While we are here, drop an instance of the antiquated cache_name_pos() and replace it with index_name_pos(), even though we need to use the_repository->index as a parameter. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
This builtin may require special handling of sparse directories, so require a full index for now. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Until we can test this builtin with a sparse index, ensure the index is expanded to a full index. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The rm builtin removes matching entries from the index. In the presence of sparse directories, this behavior will change. This behavior change might be desireable, but should be delayed until we can properly test and document the behavior. Further, we will likely want to make it possible to still do the previous behavior, so that will need to be created as well. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Add this guard until we can test and document the behavior in the presence of a sparse index. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
TBD WARNING: actually guard all callers. we will peel them away as necessary. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
By testing 'git -c core.fsmonitor= status -uno', we can check for the simplest index operations that can be made sparse-aware. read-cache.c:refresh_index() This works as long as we are not ignoring the skip-worktree bit, since the refresh_cache_ent() method will just copy the sparse directories into the refreshed index. TODO: check how to trigger this method with ignore_skip_worktree. diff-lib.c:run_diff_files() This loop skips things that are in stage 0 and have skip-worktree enabled, so seems safe to disable ensure_full_index() here. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Before we check if a specific file or directory exists in the index, it would be good to see if a leading directory is a sparse-directory. If so, we will want to expand the index _just enough_ to be sure that the paths we are interested in are in the index. The actually-interesting implementation will follow in a later change. For now, simply call ensure_full_index() to expand every directory simultaneously. Calls like index_dir_exists(), adjust_dirname_case(), and index_file_exists() in name-hash.c can trust the name hash if the index was properly expanded for the requested names. These methods can transition from ensure_full_index() to expand_to_path(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
We need to check the file hashmap first, then look to see if the directory signals a non-sparse directory entry. In such a case, we can rely on the contents of the sparse-index. We still use ensure_full_index() in the case that we hit a path that is within a sparse-directory entry. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Replace enough callers to ensure_full_index() to instead call expand_to_path() to reduce how often 'git add' expands a sparse index in memory (before writing a sparse index again). Add a test to check that 'git add -A' and 'git add <file>' does not expand the index at all, as long as <file> is not within a sparse directory. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Since we already do not collapse a sparse directory if it contains a submodule, we don't need to expand to a full index in die_path_inside_submodule(). A simple scan of the index entries is sufficient. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The FS Monitor feature uses a bitmap over the index entries. This currently interacts poorly with a sparse index. We will revisit this interaction in the future, but for now protect the index by refusing to use the FS Monitor extension at all if the index is sparse. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee
force-pushed
the
sparse-index/vfs
branch
from
January 25, 2021 02:23
538aaa1
to
bd3fb1a
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
In order to get better tests, I'm creating this PR to generate installers. Those will be used to test performance across all three platforms. Further, I'll run the Scalar functional tests while enabling the sparse-index feature.