[TEST] Make ORT the default merge strategy #400

derrickstolee · 2021-07-22T18:16:09Z

I'm creating this to run the Scalar functional tests and generate installers that I can test for performance data.

The ORT strategy will be required for git merge to work with the sparse-index, so let's get it integrated earlier.

Add a new testcase where one side of history renames: olddir/ -> newdir/ and the other side of history renames: olddir/a -> olddir/alpha When using merge.directoryRenames=true, it seems logical to expect the file to end up at newdir/alpha. Unfortunately, both merge-recursive and merge-ort currently see this as a rename/rename conflict: olddir/a -> newdir/a vs. olddir/a -> newdir/alpha Suggesting that there's some extra logic we probably want to add somewhere to allow this case to run without triggering a conflict. For now simply document this known issue. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

There's no point in creating a repository or directory only to decide right afterwards that we're skipping all the tests. We can save ourselves the redundant "git init" or "mkdir" and "rm -rf" in this case. We carry around the "$remove_trash" variable because if the directory is unexpectedly gone at test_done time we'll still want to hit the "trash directory already removed" error, but not if we never created the trash directory. See df4c0d1 (test-lib: abort when can't remove trash directory, 2017-04-20) for the addition of that error. So let's partially revert 06478da (test-lib: retire $remove_trash variable, 2017-04-23) and move the decision about whether to skip all tests earlier. Let's also fix a bug that was with us since abc5d37 (Enable parallel tests, 2008-08-08): we would leak $remove_trash from the environment. We don't want this to error out, so let's reset it to the empty string first: remove_trash=t GIT_SKIP_TESTS=t0001 ./t0001-init.sh I tested this with --debug, see 4d0912a (test-lib.sh: do not barf under --debug at the end of the test, 2017-04-24) for a bug we don't want to re-introduce. While I'm at it, let's move the HOME assignment to just before test_create_repo, it could be lower, but it seems better to set it before calling anything in test-lib-functions.sh Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Stop setting the GIT_TEST_FRAMEWORK_SELFTEST variable. This was originally needed back in 4231d1b (t0000: do not get self-test disrupted by environment warnings, 2018-09-20). It hasn't been needed since I deleted the relevant code in test-lib.sh in c0eedbc (test-lib: remove check_var_migration, 2021-02-09), I just didn't notice that it was set here. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Reword the documentation for "test_commit --append" added in my 3373518 (test-lib functions: add an --append option to test_commit, 2021-01-12). A follow-up commit will make the "echo" part of this configurable, and in any case saying "echo >>" rather than ">>" was redundant. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

In 76b8b8d (test-lib functions: document arguments to test_commit, 2021-01-12) I added missing documentation to test_commit, but in less than a month later in 3803a3a (t: add --no-tag option to test_commit, 2021-02-09) we got another undocumented option. Let's fix that. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Add an --annotated option to test_commit to create annotated tags. The tag will share the same message as the commit, and we'll call test_tick before creating it (unless --notick) is provided. There's quite a few tests that could be simplified with this construct. I've picked one to convert in this change as a demonstration. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Convert the setup of the describe tests to use test_commit when possible. This makes use of the new --annotate option to test_commit. Some of the setup here could simply be removed since the data being created wasn't important to any of the subsequent tests, so I've done so. E.g. assigning to the "one" variable was always useless, and just checking that we can describe HEAD after the first commit wasn't useful. In the case of the "two" variable we could instead use the tag we just created. See 5312ab1 (Add describe test., 2007-01-13) for the initial version of this code. There's other cases here like redundant "test_tick" invocations, or the simplification of not echoing "X" to a file we're about to tag as "x", now we just use "x" in both cases. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Add a --printf option to test_commit to allow writing to the file with "printf" instead of "echo". This is useful for writing "\n", "\0" etc., in particular in combination with the --append option added in 3373518 (test-lib functions: add an --append option to test_commit, 2021-01-12). I'm converting a few tests to use the new option rather than a manual printf/add/commit combination to demonstrate its usefulness. While I'm at it use "test_create_repo" where appropriate, and give the first/second commit a meaningful/more conventional log message in cases where no test cared about that message. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Change a use of $GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME added in 704fed9 (tests: start moving to a different default main branch name, 2020-10-23) to simply discover the initial branch name of a repository set up in this function with "symbolic-ref --short". That's something done in another test in 704fed9, so doing it like this seems like an omission, or rather an overly eager search/replacement instead of fixing the test logic. There are only three uses of the GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME variable in the test suite, this gets rid of one of those. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Reformat an argument list changed in 675704c (init: provide useful advice about init.defaultBranch, 2020-12-11) to have the "-c" on the same line as the argument it sets. This whitespace-only change makes it easier to review a subsequent commit. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Arrange for the advice about naming the initial branch not to be shown in the --verbose output of the test suite. Since 675704c (init: provide useful advice about init.defaultBranch, 2020-12-11) some tests have been very chatty with repeated occurrences of this multi-line advice. Having it be this verbose isn't helpful for anyone in the context of git's own test suite, and it makes debugging tests that use their own "git init" invocations needlessly distracting. By setting the GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME variable early in test-lib.sh itself we'll squash the warning not only for test_create_repo(), as 675704c explicitly intended, but also for other "git init" invocations. And once we'd like to have this configuration set for all "git init" invocations in the test suite we can get rid of the init.defaultBranch configuration setting in test_create_repo(), as repo_default_branch_name() in refs.c will take the GIT_TEST_* variable over it being set. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Remove various redundant or obsolete code from the test_create_repo() function, and split up its use in test-lib.sh from what tests need from it. This leave us with a pass-through wrapper for "git init" in test-lib-functions.sh, in test-lib.sh we have the same, except for needing to redirect stdout/stderr, and emitting an error ourselves if it fails. We don't need to error() ourselves when test_create_repo() is invoked, as the invocation will be a part of a test's "&&"-chain. Everything below this paragraph is a detailed summary of the history of test_create_repo() explaining why it's safe to remove the various things it was doing: 1. "mkdir -p" isn't needed because "git init" itself will create leading directories if needed. 2. Since we're now a simple wrapper for "git init" we don't need to check that we have only one argument. If someone wants to run "test_create_repo --bare x" that's OK. 3. We won't ever hit that "Cannot setup test environment" error. Checking the test environment sanity when doing "git init" dates back to eea4206 (t0000: catch trivial pilot errors., 2005-12-10) and 2ccd202 (trivial: check, if t/trash directory was successfully created, 2006-01-05). We can also see it in another form a bit later in my own 0d314ce (test-lib: use subshell instead of cd $new && .. && cd $old, 2010-08-30). But since 2006f0a (t/test-lib: make sure Git has already been built, 2012-09-17) we already check if we have a built git earlier. The one thing this was testing after that 2012 change was that we'd just built "git", but not "git-init", but since 3af4c71 (tests: respect GIT_TEST_INSTALLED when initializing repositories, 2018-11-12) we invoke "git", not "git-init". So all of that's been checked already, and we don't need to re-check it here. 4. We don't need to move .git/hooks out of the way. That dates back to c09a69a (Disable hooks during tests., 2005-10-16), since then hooks became disabled by default in f98f8cb (Ship sample hooks with .sample suffix, 2008-06-24). So the hooks were already disabled by default, but as can be seen from "mkdir .git/hooks" changes various tests needed to re-setup that directory. Now they no longer do. This makes us implicitly depend on the default hooks being disabled, which is a good thing. If and when we'd have any on-by-default hooks (I see no reason we ever would) we'd want to see the subtle and not so subtle ways that would break the test suite. 5. We don't need to "cd" to the "$repo" directory at all anymore. In the code being removed here we both "cd"'d to the repository before calling "init", and did so in a subshell. It's not important to do either, so both of those can be removed. We cd'd because this code grew from test-lib.sh code where we'd have done so already, see eedf8f9 (Abstract test_create_repo out for use in tests., 2006-02-17), and later "cd"'d inside a subshell since 0d314ce to avoid having to keep track of an "old pwd" variable to cd back after the setup. Being in the repository directory made moving the hooks around easier (we wouldn't have to fully qualify the path). Since we're not moving the hooks per #4 above we don't need to "cd" for that reason either. 6. We can drop the --template argument and instead rely on the GIT_TEMPLATE_DIR set to the same path earlier in test-lib.sh. See 8683a45 (Introduce GIT_TEMPLATE_DIR, 2006-12-19) 7. We only needed that ">&3 2>&4" redirection when invoked from test-lib.sh. We could still invoke test_create_repo() there, but as the invocation is now trivial and we don't have a good reason to use test_create_repo() elsewhere let's call "git init" there ourselves. 8. We didn't need to resolve "git" as "${GIT_TEST_INSTALLED:-$GIT_EXEC_PATH}/git$X" in test_create_repo(), even for the use of test-lib.sh PATH is already set up in test-lib.sh to start with GIT_TEST_INSTALLED and/or GIT_EXEC_PATH before test_create_repo() (now "git init") is called.. So we can simply run "git" and rely on the PATH lookup choosing the right executable. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When the support for "objectsize:disk" was bolted onto the existing support for "objectsize", it didn't follow the usual pattern for handling "atomtype:modifier", which reads the <modifier> part just once while parsing the format string, and store the parsed result in the union in the used_atom structure, so that the string form of it does not have to be parsed over and over at runtime (e.g. in grab_common_values()). Add a new member `objectsize` to the union `used_atom.u`, so that we can separate the check of <modifier> from the check of <atomtype>, this will bring scalability to atom `%(objectsize)`. Signed-off-by: ZheNing Hu <adlternative@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

In the original ref-filter design, it will copy the parsed atom's name and attributes to `used_atom[i].name` in the atom's parsing step, and use it again for string matching in the later specific ref attributes filling step. It use a lot of string matching to determine which atom we need. Introduce the enum "atom_type", each enum value is named as `ATOM_*`, which is the index of each corresponding valid_atom entry. In the first step of the atom parsing, `used_atom.atom_type` will record corresponding enum value from valid_atom entry index, and then in specific reference attribute filling step, only need to compare the value of the `used_atom[i].atom_type` to check the atom type. Helped-by: Junio C Hamano <gitster@pobox.com> Helped-by: Christian Couder <christian.couder@gmail.com> Signed-off-by: ZheNing Hu <adlternative@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

These strings have not been modified in any translation, nor should they be. Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The sendemail.smtpServer configuration option and --smtp-server command line option both support using a sendmail-like program to send emails by specifying an absolute file path. However, this is not ideal for the following reasons: 1. It overloads the meaning of smtpServer (now a program is being used for the server?) 2. It doesn't allow for non-absolute paths, arguments, or arbitrary scripting Requiring an absolute path is bad for portability, as the same program may be in different locations on different systems. If a user wishes to pass arguments to their program, they have to use the smtpServerOption option, which is cumbersome (as it must be repeated for each option) and doesn't adhere to normal git conventions. Introduce a new configuration option sendemail.sendmailCmd as well as a command line option --sendmail-cmd that can be used to specify a command (with or without arguments) or shell expression to run to send email. The name of this option is consistent with --to-cmd and --cc-cmd. This invocation honors the user's $PATH so that absolute paths are not necessary. Arbitrary shell expressions are also supported, allowing users to do basic scripting. Give this option a higher precedence over --smtp-server and sendemail.smtpServer, as the new interface is more flexible. For backward compatibility, continue to support absolute paths in --smtp-server and sendemail.smtpServer. Signed-off-by: Gregory Anders <greg@gpanders.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

I sometimes receive patches from people with short mononyms, and in my cultural environment these are not uncommon. To my dismay, git-am currently discards their names, and replaces them with their email addresses. Link: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ Signed-off-by: edef <edef@edef.eu> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Clarify the default length used for the abbreviated form used for commits in git describe. The behavior was modified in Git 2.11.0, but the documentation was not updated to clarify the new behavior. Signed-off-by: Anders Höckersten <anders@hockersten.se> Signed-off-by: Junio C Hamano <gitster@pobox.com>

These error messages are intended for the user. Let's touch them up since we're here from the previous commit. Signed-off-by: Wolfgang Müller <wolf@oriole.systems> Signed-off-by: Junio C Hamano <gitster@pobox.com>

An object_id storing a SHA-1 name has some unused bytes at the end of the hash array. Since these bytes are not used, they are usually not initialized to any value either. However, at parallel_checkout.c:send_one_item() the object_id of a cache entry is copied into a buffer which is later sent to a checkout worker through a pipe write(). This makes Valgrind complain about passing uninitialized bytes to a syscall. The worker won't use these uninitialized bytes either, but the warning could confuse someone trying to debug this code; So instead of using oidcpy(), send_one_item() uses hashcpy() to only copy the used/initialized bytes of the object_id, and leave the remaining part with zeros. However, since cf09832 ("hash: add an algo member to struct object_id", 2021-04-26), using hashcpy() is no longer sufficient here as it won't copy the new algo field from the object_id. Let's add and use a new function which meets both our requirements of copying all the important object_id data while still avoiding the uninitialized bytes, by padding the end of the hash array in the destination object_id. With this change, we also no longer need the destination buffer from send_one_item() to be initialized with zeros, so let's switch from xcalloc() to xmalloc() to make this clear. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The flush() logic in csum-file.c was introduced originally by c38138c (git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26) and a portion of the logic performs similar utility to write_in_full() in wrapper.c. The history of write_in_full() is full of moves and renames, but was originally introduced by 7230e6d (Add write_or_die(), a helper function, 2006-08-21). The point of these sections of code are to flush a write buffer using xwrite() and report errors in the case of disk space issues or other generic input/output errors. The logic in flush() can interpret the output of write_in_full() to provide the correct error messages to users. The logic in the hashfile API has an additional set of logic to augment the progress indicator between calls to xwrite(). This was introduced by 2a128d6 (add throughput display to git-push, 2007-10-30). It seems that since the hashfile's buffer is only 8KB, these additional progress indicators might not be incredibly necessary. Instead, update the progress only when write_in_full() complete. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Sometimes new people are confused by how a revision "range" works, in that it is not a random collection of commits but a set of commits that are all connected to each other, and most Git commands work on a single such "range". Give an example to clarify it. Signed-off-by: Junio C Hamano <gitster@pobox.com>

The xsize_t helper aims to safely convert an off_t to a size_t, erroring out when a file offset is too large to fit into a memory address. It does this by using two casts: size_t size = (size_t) len; if (len != (off_t) size) ... error out ... On a platform with sizeof(size_t) < sizeof(off_t), this check is safe and correct. The first cast truncates to a size_t by finding the remainder modulo SIZE_MAX+1 (see C99 section 6.3.1.3 Signed and unsigned integers) and the second promotes to an off_t, meaning the result is true if and only if len is representable as a size_t. On other platforms, this two-casts strategy still works well (always succeeds) for len >= 0. But for len < 0, when the first cast succeeds and produces SIZE_MAX + 1 + len, the resulting value is too large to be represented as an off_t, so the second cast produces implementation defined behavior. In practice, it is likely to produce a result of true despite len not being representable as size_t. Simplify by replacing with a more straightforward check: compare len to the relevant bounds and then cast it. (To avoid a -Wsign-compare warning, after checking that len >= 0, we explicitly convert to a sufficiently-large unsigned type before comparing to SIZE_MAX.) In practice, this is not likely to come up since typical callers use nonnegative len. Still, it's helpful to handle this case to make the behavior easy to reason about. Historical note: the original bounds-checking in 46be82d (xsize_t: check whether we lose bits, 2010-07-28) did not produce this implementation-defined behavior, though it still did not handle negative offsets. It was not until 73560c7 (git-compat-util.h: xsize_t() - avoid -Wsign-compare warnings, 2017-09-21) introduced the double cast that the implementation-defined behavior was triggered. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The hashfile API uses a hard-coded buffer size of 8KB and has ever since it was introduced in c38138c (git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26). It performs a similar function to the hashing buffers in read-cache.c, but that code was updated from 8KB to 128KB in f279894 (read-cache: make the index write buffer size 128K, 2021-02-18). The justification there was that do_write_index() improves from 1.02s to 0.72s. Since our end goal is to have the index writing code use the hashfile API, we need to unify this buffer size to avoid a performance regression. There is a buffer, 'check_buffer', that is used to verify the check_fd file descriptor. When this buffer increases to 128K to fit the data being flushed, it causes the stack to overflow the limits placed in the test suite. To avoid issues with stack size, move both 'buffer' and 'check_buffer' to be heap pointers within 'struct hashfile'. The 'check_buffer' member is left as NULL unless check_fd is set in hashfd_check(). Both buffers are cleared as part of finalize_hashfile() which also frees the full structure. Since these buffers are now on the heap, we can adjust their size based on the needs of the consumer. In particular, callers to hashfd_throughput() are expecting to report progress indicators as the buffer flushes. These callers would prefer the smaller 8k buffer to avoid large delays between updates, especially for users with slower networks. When the progress indicator is not used, the larger buffer is preferrable. By adding a new trace2 region in the chunk-format API, we can see that the writing portion of 'git multi-pack-index write' lowers from ~1.49s to ~1.47s on a Linux machine. These effects may be more pronounced or diminished on other filesystems. The end-to-end timing is too noisy to have a definitive change either way. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The do_write_index() method in read-cache.c has its own hashing logic and buffering mechanism. Specifically, the ce_write() method was introduced by 4990aad (Speed up index file writing by chunking it nicely, 2005-04-20) and similar mechanisms were introduced a few months later in c38138c (git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26). Based on the timing, in the early days of the Git codebase, I figured that these roughly equivalent code paths were never unified only because it got lost in the shuffle. The hashfile API has since been used extensively in other file formats, such as pack-indexes, multi-pack-indexes, and commit-graphs. Therefore, it seems prudent to unify the index writing code to use the same mechanism. I discovered this disparity while trying to create a new index format that uses the chunk-format API. That API uses a hashfile as its base, so it is incompatible with the custom code in read-cache.c. This rewrite is rather straightforward. It replaces all writes to the temporary file with writes to the hashfile struct. This takes care of many of the direct interactions with the_hash_algo. There are still some git_hash_ctx uses remaining: the extension headers are hashed for use in the End of Index Entries (EOIE) extension. This use of the git_hash_ctx is left as-is. There are multiple reasons to not use a hashfile here, including the fact that the data is not actually writing to a file, just a hash computation. These hashes do not block our adoption of the chunk-format API in a future change to the index, so leave it as-is. The internals of the algorithms are mostly identical. Previously, the hashfile API used a smaller 8KB buffer instead of the 128KB buffer from read-cache.c. The previous change already unified these sizes. There is one subtle point: we do not pass the CSUM_FSYNC to the finalize_hashfile() method, which differs from most consumers of the hashfile API. The extra fsync() call indicated by this flag causes a significant peformance degradation that is noticeable for quick commands that write the index, such as "git add". Other consumers can absorb this cost with their more complicated data structure organization, and further writing structures such as pack-files and commit-graphs is rarely in the critical path for common user interactions. Some static methods become orphaned in this diff, so I marked them as MAYBE_UNUSED. The diff is much harder to read if they are deleted during this change. Instead, they will be deleted in the following change. In addition to the test suite passing, I computed indexes using the previous binaries and the binaries compiled after this change, and found the index data to be exactly equal. Finally, I did extensive performance testing of "git update-index --force-write" on repos of various sizes, including one with over 2 million paths at HEAD. These tests demonstrated less than 1% difference in behavior. As expected, the performance should be considered unchanged. The previous changes to increase the hashfile buffer size from 8K to 128K ensured this change would not create a peformance regression. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

These methods were marked as MAYBE_UNUSED in the previous change to avoid a complicated diff. Delete them entirely, since we now use the hashfile API instead of this custom hashing code. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The text was somewhat confusing between the revision itself and the author. Signed-off-by: Reuven Yagel <robi@post.jce.ac.il> Signed-off-by: Junio C Hamano <gitster@pobox.com>

git-clone started respecting errors from the transport subsystem in aab179d (builtin/clone.c: don't ignore transport_fetch_refs() errors, 2020-12-03). However, that commit didn't handle the cleanup of the filesystem quite right. The cleanup of the directory that cmd_clone() creates is done by an atexit() handler, which we control with a flag. It starts as JUNK_LEAVE_NONE ("clean up everything"), then progresses to JUNK_LEAVE_REPO when we know we have a valid repo but not working tree, and then finally JUNK_LEAVE_ALL when we have a successful checkout. Most errors cause us to die(), which then triggers the handler to do the right thing based on how far into cmd_clone() we got. But the checks added by aab179d instead set the "err" variable and then jump to a new "cleanup" label, which then returns our non-zero status. However, the code after the cleanup label includes setting the flag to JUNK_LEAVE_ALL, and so we accidentally leave the repository and working tree in place. One obvious option to fix this is to reorder the end of the function to set the flag first, before cleanup code, and put the label between them. But we can observe another small bug: the error return from transport_fetch_refs() is generally "-1", and we propagate that to the return value of cmd_clone(), which ultimately becomes the exit code of the process. And we try to avoid transmitting negative values via exit codes (only the low 8 bits are passed along as an unsigned value, though in practice for "-1" this at least retains the property that it's non-zero). Instead, let's just die(). That makes us consistent with rest of the code in the function. It does add a new "fatal:" line to the output, but I'd argue that's a good thing: - in the rare case that the transport code didn't say anything, now the user gets _some_ error message - even if the transport code said something like "error: ssh died of signal 9", it's nice to also say "fatal" to indicate that we considered that to be a show-stopper. Triggering this in the test suite turns out to be surprisingly difficult. Almost every error we'd encounter, including ones deep inside the transport code, cause us to just die() right there! However, one way is to put a fake wrapper around git-upload-pack that sends the complete packfile but exits with a failure code. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When some commands run with command_requires_full_index=1, then the index can get in a state where the in-memory cache tree is actually equal to the sparse index's cache tree instead of the full one. This results in incorrect entry_count values. By clearing the cache tree before converting to sparse, we avoid this issue. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Previous changes did the necessary improvements to unpack-trees.c and diff-lib.c in order to modify a sparse index based on its comparision with a tree. The only remaining work is to remove some ensure_full_index() calls and add tests that verify that the index is not expanded in our interesting cases. Include 'switch' and 'restore' in these tests, as they share a base implementation with 'checkout'. Here are the relevant performance results from p2000-sparse-operations.sh: Test HEAD~1 HEAD -------------------------------------------------------------------------------- 2000.18: git checkout -f - (full-v3) 0.49(0.43+0.03) 0.47(0.39+0.05) -4.1% 2000.19: git checkout -f - (full-v4) 0.45(0.37+0.06) 0.42(0.37+0.05) -6.7% 2000.20: git checkout -f - (sparse-v3) 0.76(0.71+0.07) 0.04(0.03+0.04) -94.7% 2000.21: git checkout -f - (sparse-v4) 0.75(0.72+0.04) 0.05(0.06+0.04) -93.3% It is important to compare the full index case to the sparse index case, as the previous results for the sparse index were inflated by the index expansion. For index v4, this is an 88% improvement. On an internal repository with over two million paths at HEAD and a sparse-checkout definition containing ~60,000 of those paths, 'git checkout' went from 3.5s to 297ms with this change. The theoretical optimum where only those ~60,000 paths exist was 275ms, so the extra sparse directory entries contribute a 22ms overhead. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Add new branches to the test repo that demonstrate directory/file conflicts in different ways. Since the directory 'folder1/' has adjacent files 'folder1-', 'folder1.txt', and 'folder10' it causes searches for 'folder1/' to land in a different place in the index than a search for 'folder1'. This causes a change in behavior when working with the df-conflict-1 and df-conflict-2 branches, whose only difference is that the first uses 'folder1' as the conflict and the other uses 'folder2' which does not have these adjacent files. We can extend two tests that compare the behavior across different 'git checkout' commands, and we see already that the behavior will be different in some cases and not in others. The difference between the two test loops is that one uses 'git reset --hard' between iterations. Further, we isolate the behavior of creating a staged change within a directory and then checking out a branch where that directory is replaced with a file. A full checkout behaves differently across these two cases, while a sparse-checkout cone behaves consistently. In both cases, the behavior is wrong. In one case, the staged change is dropped entirely. The other case the staged change is kept, replacing the file at that location, but none of the other files in the directory are kept. Likely, the correct behavior in this case is to reject the checkout and report the conflict, leaving HEAD in its previous location. None of the cases behave this way currently. Use comments to demonstrate that the tested behavior is only a documentation of the current, incorrect behavior to ensure we do not _accidentally_ change it. Instead, we would prefer to change it on purpose with a future change. At this point, the sparse-index does not handle these 'git checkout' commands correctly. Or rather, it _does_ reject the 'git checkout' when we have the staged change, but for the wrong reason. It also rejects the 'git checkout' commands when there is no staged change and we want to replace a directory with a file. A fix for that unstaged case will follow in the next change, but that will make the sparse-index agree with the full checkout case in these documented incorrect behaviors. Helped-by: Elijah Newren <newren@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

When running unpack_trees() with a sparse index, we attempt to operate on the index without expanding the sparse directory entries. Thus, we operate by manipulating entire directories and passing them to the unpack function. In the case of the 'git checkout' command, this is the twoway_merge() function. There are several cases in twoway_merge() that handle different situations. One new one to add is the case of a directory/file conflict where the directory is sparse. Before the sparse index, such a conflict would appear as a list of file additions and deletions. Now, twoway_merge() initializes 'current', 'oldtree', and 'newtree' from src[0], src[1], and src[2], then sets 'oldtree' to NULL because it is equal to the df_conflict_entry. The way to determine that we have a directory/file conflict is to test that 'current' and 'newtree' disagree on being sparse directory entries. When we are in this case, we want to resolve the situation by calling merged_entry(). This allows replacing the 'current' entry with the 'newtree' entry. This is important for cases where we want to run 'git checkout' across the conflict and have the new HEAD represent the new file type at that path. The first NEEDSWORK comment dropped in t1092 demonstrates this necessary behavior. However, we still are in a confusing state when 'current' corresponds to a staged change within a sparse directory that is not present at HEAD. This should be atypical, because it requires adding a change outside of the sparse-checkout cone, but it is possible. Since we are unable to determine that this is a staged change within twoway_merge(), we cannot add a case to reject the merge at this point. I believe this is due to the use of df_conflict_entry in the place of 'oldtree' instead of using the valud at HEAD, which would provide some perspective to this decision. Any change that would allow this differentiation for staged entries would need to involve information further up in unpack_trees(). That work should be done, sometime, because we are further confusing the behavior of a directory/file conflict when staging a change in the directory. The two cases 'checkout behaves oddly with df-conflict-?' in t1092 demonstrate that even without a sparse-checkout, Git is not consistent in its behavior. Neither of the two options seems correct, either. This change makes the sparse-index behave differently than the typcial sparse-checkout case, but it does match the full checkout behavior in the df-conflict-2 case. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Disable command_requires_full_index for 'git add'. This does not require any additional removals of ensure_full_index(). The main reason is that 'git add' discovers changes based on the pathspec and the worktree itself. These are then inserted into the index directly, and calls to index_name_pos() or index_file_exists() already call expand_to_path() at the appropriate time to support a sparse-index. Add a test to check that 'git add -A' and 'git add <file>' does not expand the index at all, as long as <file> is not within a sparse directory. This does not help the global 'git add .' case. We can measure the improvement using p2000-sparse-operations.sh with these results: Test HEAD~1 HEAD ------------------------------------------------------------------------------ 2000.6: git add -A (full-index-v3) 0.35(0.30+0.05) 0.37(0.29+0.06) +5.7% 2000.7: git add -A (full-index-v4) 0.31(0.26+0.06) 0.33(0.27+0.06) +6.5% 2000.8: git add -A (sparse-index-v3) 0.57(0.53+0.07) 0.05(0.04+0.08) -91.2% 2000.9: git add -A (sparse-index-v4) 0.58(0.55+0.06) 0.05(0.05+0.06) -91.4% While the 91% improvement seems impressive, it's important to recognize that previously we had significant overhead for expanding the sparse-index. Comparing to the full index case, 'git add -A' goes from 0.37s to 0.05s, which is "only" an 86% improvement. This modification to 'git add' creates some behavior change depending on the use of a sparse index. We modify a test in t1092 to demonstrate these changes which will be remedied in future changes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The add_pathspec_matches_against_index() focuses on matching a pathspec to file entries in the index. This already works correctly for its only use: checking if untracked files exist in the index. The compatibility checks in t1092 already test that 'git add <dir>' works for a directory outside of the sparse cone. That provides coverage for removing this guard. This finalizes our ability to run 'git add .' without expanding a sparse index to a full one. This is evidenced by an update to t1092 and by these performance numbers for p2000-sparse-operations.sh: Test HEAD~1 HEAD -------------------------------------------------------------------------------- 2000.10: git add . (full-index-v3) 0.37(0.28+0.07) 0.36(0.27+0.06) -2.7% 2000.11: git add . (full-index-v4) 0.33(0.26+0.06) 0.32(0.28+0.05) -3.0% 2000.12: git add . (sparse-index-v3) 0.57(0.53+0.07) 0.06(0.06+0.07) -89.5% 2000.13: git add . (sparse-index-v4) 0.57(0.53+0.07) 0.05(0.03+0.09) -91.2% While the ~90% improvement is shown by the test results, it is worth noting that expanding the sparse index was adding overhead in previous commits. Comparing to the full index case, we see the performance go from 0.33s to 0.05s, an 85% improvement. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Since b243012 (refresh_index(): add flag to ignore SKIP_WORKTREE entries, 2021-04-08), 'git add --refresh <path>' will output a warning message when the path is outside the sparse-checkout definition. The implementation of this warning happened in parallel with the sparse-index work to add ensure_full_index() calls throughout the codebase. Update this loop to have the proper logic that checks to see if the pathspec is outside the sparse-checkout definition. This avoids the need to expand the sparse directory entry and determine if the path is tracked, untracked, or ignored. We simply avoid updating the stat() information because there isn't even an entry that matches the path! Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The --renormalize option updates the EOL conversions for the tracked files. However, the loop already ignores files marked with the SKIP_WORKTREE bit, so it will continue to do so with a sparse index because the sparse directory entries also have this bit set. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

While the sparse-index is only enabled when core.sparseCheckoutCone is also enabled, it is possible for the user to modify the sparse-checkout file manually in a way that does not match cone-mode patterns. In this case, we should refuse to convert an index into a sparse index, since the sparse_checkout_patterns will not be initialized with recursive and parent path hashsets. Also silently return if there are no cache entries, which is a simple case: there are no paths to make sparse! Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

If cache_tree_update() returns a non-zero value, then it could not create the cache tree. This is likely due to a path having a merge conflict. Since we are already returning early, let's return silently to avoid making it seem like we failed to write the index at all. If we remove our dependence on the cache tree within convert_to_sparse(), then we could still recover from this scenario and have a sparse index. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

When constructing the cache-tree extension in convert_to_sparse(), it is possible that we construct a tree object that is new to the object database. Without the WRITE_TREE_MISSING_OK flag, this results in an error that halts our conversion to a sparse index. Add this flag to remove this limitation. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The iterated search in find_cache_entry() was recently modified to include a loop that searches backwards for a sparse directory entry that matches the given traverse_info and name_entry. However, the string comparison failed to actually concatenate those two strings, so this failed to find a sparse directory when it was not a top-level directory. This caused some errors in rare cases where a 'git checkout' spanned a diff that modified files within the sparse directory entry, but we could not correctly find the entry. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The sparse index is tested with the FS Monitor hook and extension since f8fe49e (fsmonitor: integrate with sparse index, 2021-07-14). This test was very fragile because it shared an index across sparse and non-sparse behavior. Since that expansion and contraction could cause the index to lose its FS Monitor bitmap and token, behavior is fragile to changes in 'git sparse-checkout set'. Rewrite the test to use two clones of the original repo: full and sparse. This allows us to also keep the test files (actual, expect, trace2.txt) out of the repos we are testing with 'git status'. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

When changing the scope of a sparse-checkout using cone mode, we might have some tracked directories go out of scope. The current logic removes the tracked files from within those directories, but leaves the ignored files within those directories. This is a bit unexpected to users who have given input to Git saying they don't need those directories anymore. This is something that is new to the cone mode pattern type: the user has explicitly said "I want these directories and _not_ those directories." The typical sparse-checkout patterns more generally apply to "I want files with with these patterns" so it is natural to leave ignored files as they are. This focus on directories in cone mode provides us an opportunity to change the behavior. Leaving these ignored files in the sparse directories makes it impossible to gain performance benefits in the sparse index. When we track into these directories, we need to know if the files are ignored or not, which might depend on the _tracked_ .gitignore file(s) within the sparse directory. This depends on the indexed version of the file, so the sparse directory must be expanded. By deleting the sparse directories when changing scope (or running 'git sparse-checkout reapply') we regain these performance benefits as if the repository was in a clean state. Since these ignored files are frequently build output or helper files from IDEs, the users should not need the files now that the tracked files are removed. If the tracked files reappear, then they will have newer timestamps than the build artifacts, so the artifacts will need to be regenerated anyway. If users depend on ignored files within the sparse directories, then they have created a bad shape in their repository. Regardless, such shapes would create risk that changing the behavior for all cone mode users might be too risky to take on at the moment. Since this data shape makes it impossible to get performance benefits using the sparse index, we limit the change to only be enabled when the sparse index is enabled. Users can opt out of this behavior by disabline the sparse index. Depending on user feedback or real-world use, we might want to consider expanding the behavior change to all of cone mode. Since we are currently restricting to the sparse index case, we can use the existence of sparse directory entries in the index as indicators of which directories should be removed. The contained files are deleted by a 'git clean -dfx' subcommand while the directories themselves are deleted once they are empty. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The sparse index will be compatible with the ORT merge strategy, so let's use it explicitly in our tests. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Add tests to check that cherry-pick and rebase behave the same in the sparse-index case as in the full index cases. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The hard work was already done with 'git merge' and the ORT strategy. Just add extra tests to see that we get the expected results in the non-conflict cases. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

newren and others added 30 commits May 4, 2021 12:53

merge: don't translate literal commands

a30e43f

These strings have not been modified in any translation, nor should they be. Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

stash: don't translate literal commands

4901884

Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

submodule: use the imperative mood to describe the --files option

f5f5a61

Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

rev-parse: mark die() messages for translation

e2c5993

These error messages are intended for the user. Let's touch them up since we're here from the previous commit. Signed-off-by: Wolfgang Müller <wolf@oriole.systems> Signed-off-by: Junio C Hamano <gitster@pobox.com>

docs: improve fast-forward in glossary content

e22f2da

The text was somewhat confusing between the revision itself and the author. Signed-off-by: Reuven Yagel <robi@post.jce.ac.il> Signed-off-by: Junio C Hamano <gitster@pobox.com>

derrickstolee added 4 commits July 20, 2021 09:40

derrickstolee self-assigned this Jul 22, 2021

derrickstolee added 11 commits July 26, 2021 10:14

t1092: test merge conflicts outside cone

8f2fd93

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

derrickstolee force-pushed the merge-ort-vfs branch from 50045c8 to a9f66aa Compare July 26, 2021 15:33

derrickstolee added 10 commits July 26, 2021 16:48

t1092: use ORT merge strategy

35e7978

The sparse index will be compatible with the ORT merge strategy, so let's use it explicitly in our tests. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

sparse-checkout: create helper methods

0acbbd6

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

attr: be careful about sparse directories

3b24580

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

diff: ignore sparse paths in diffstat

15d9ae1

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

merge: make sparse-aware with ORT

fda3709

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

t1092: add cherry-pick, rebase tests

6d09074

Add tests to check that cherry-pick and rebase behave the same in the sparse-index case as in the full index cases. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

sparse-index: integrate with cherry-pick and rebase

efc0bf7

The hard work was already done with 'git merge' and the ORT strategy. Just add extra tests to see that we get the expected results in the non-conflict cases. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Merge branch 'sparse-index/merge' into sparse-index/merge-vfs

0224adb

fixup! t1092: use ORT merge strategy

f36b3ff

merge-ort: expand only for out-of-cone conflicts

4f48196

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

derrickstolee force-pushed the merge-ort-vfs branch from a9f66aa to 4f48196 Compare July 28, 2021 16:32

derrickstolee closed this Aug 2, 2021

derrickstolee mentioned this pull request Aug 2, 2021

Make 'ort' the default merge strategy #404

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TEST] Make ORT the default merge strategy #400

[TEST] Make ORT the default merge strategy #400

derrickstolee commented Jul 22, 2021

[TEST] Make ORT the default merge strategy #400

[TEST] Make ORT the default merge strategy #400

Conversation

derrickstolee commented Jul 22, 2021