Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bundle URIs Part 5: creationToken heuristic #22

Closed

Commits on Jan 24, 2023

  1. bundle: test unbundling with incomplete history

    When verifying a bundle, Git checks first that all prerequisite commits
    exist in the object store, then adds an additional check: those
    prerequisite commits must be reachable from references in the
    repository.
    
    This check is stronger than what is checked for refs being added during
    'git fetch', which simply guarantees that the new refs have a complete
    history up to the point where it intersects with the current reachable
    history.
    
    However, we also do not have any tests that check the behavior under
    this condition. Create a test that demonstrates its behavior.
    
    In order to construct a broken history, perform a shallow clone of a
    repository with a linear history, but whose default branch ('base') has
    a single commit, so dropping the shallow markers leaves a complete
    history from that reference. However, the 'tip' reference adds a
    shallow commit whose parent is missing in the cloned repository. Trying
    to unbundle a bundle with the 'tip' as a prerequisite will succeed past
    the object store check and move into the reachability check.
    
    The two errors that are reported are of this form:
    
      error: Could not read <missing-commit>
      fatal: Failed to traverse parents of commit <present-commit>
    
    These messages are not particularly helpful for the person running the
    unbundle command, but they do prevent the command from succeeding.
    
    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    derrickstolee committed Jan 24, 2023
    Configuration menu
    Copy the full SHA
    f9b0cc8 View commit details
    Browse the repository at this point in the history

Commits on Jan 31, 2023

  1. bundle: verify using check_connected()

    When Git verifies a bundle to see if it is safe for unbundling, it first
    looks to see if the prerequisite commits are in the object store. This
    is an easy way to "fail fast" but it is not a sufficient check for
    updating refs that guarantee closure under reachability. There could
    still be issues if those commits are not reachable from the repository's
    references. The repository only has guarantees that its object store is
    closed under reachability for the objects that are reachable from
    references.
    
    Thus, the code in verify_bundle() has previously had the additional
    check that all prerequisite commits are reachable from repository
    references. This is done via a revision walk from all references,
    stopping only if all prerequisite commits are discovered or all commits
    are walked. This uses a custom walk to verify_bundle().
    
    This check is more strict than what Git applies to fetched pack-files.
    In the fetch case, Git guarantees that the new references are closed
    under reachability by walking from the new references until walking
    commits that are reachable from repository refs. This is done through
    the well-used check_connected() method.
    
    To better align with the restrictions required by 'git fetch',
    reimplement this check in verify_bundle() to use check_connected(). This
    also simplifies the code significantly.
    
    The previous change added a test that verified the behavior of 'git
    bundle verify' and 'git bundle unbundle' in this case, and the error
    messages looked like this:
    
      error: Could not read <missing-commit>
      fatal: Failed to traverse parents of commit <extant-commit>
    
    However, by changing the revision walk slightly within check_connected()
    and using its quiet mode, we can omit those messages. Instead, we get
    only this message, tailored to describing the current state of the
    repository:
    
      error: some prerequisite commits exist in the object store,
             but are not connected to the repository's history
    
    (Line break added here for the commit message formatting, only.)
    
    While this message does not include any object IDs, there is no
    guarantee that those object IDs would help the user diagnose what is
    going on, as they could be separated from the prerequisite commits by
    some distance. At minimum, this situation describes the situation in a
    more informative way than the previous error messages.
    
    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    derrickstolee committed Jan 31, 2023
    Configuration menu
    Copy the full SHA
    8dd4760 View commit details
    Browse the repository at this point in the history
  2. t5558: add tests for creationToken heuristic

    As documented in the bundle URI design doc in 2da14fa (docs:
    document bundle URI standard, 2022-08-09), the 'creationToken' member of
    a bundle URI allows a bundle provider to specify a total order on the
    bundles.
    
    Future changes will allow the Git client to understand these members and
    modify its behavior around downloading the bundles in that order. In the
    meantime, create tests that add creation tokens to the bundle list. For
    now, the Git client correctly ignores these unknown keys.
    
    Create a new test helper function, test_remote_https_urls, which filters
    GIT_TRACE2_EVENT output to extract a list of URLs passed to
    git-remote-https child processes. This can be used to verify the order
    of these requests as we implement the creationToken heuristic. For now,
    we need to sort the actual output since the current client does not have
    a well-defined order that it applies to the bundles.
    
    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    derrickstolee committed Jan 31, 2023
    Configuration menu
    Copy the full SHA
    103792e View commit details
    Browse the repository at this point in the history
  3. bundle-uri: parse bundle.heuristic=creationToken

    The bundle.heuristic value communicates that the bundle list is
    organized to make use of the bundle.<id>.creationToken values that may
    be provided in the bundle list. Those values will create a total order
    on the bundles, allowing the Git client to download them in a specific
    order and even remember previously-downloaded bundles by storing the
    maximum creation token value.
    
    Before implementing any logic that parses or uses the
    bundle.<id>.creationToken values, teach Git to parse the
    bundle.heuristic value from a bundle list. We can use 'test-tool
    bundle-uri' to print the heuristic value and verify that the parsing
    works correctly.
    
    As an extra precaution, create the internal 'heuristics' array to be a
    list of (enum, string) pairs so we can iterate through the array entries
    carefully, regardless of the enum values.
    
    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    derrickstolee committed Jan 31, 2023
    Configuration menu
    Copy the full SHA
    c4b1c9b View commit details
    Browse the repository at this point in the history
  4. bundle-uri: parse bundle.<id>.creationToken values

    The previous change taught Git to parse the bundle.heuristic value,
    especially when its value is "creationToken". Now, teach Git to parse
    the bundle.<id>.creationToken values on each bundle in a bundle list.
    
    Before implementing any logic based on creationToken values for the
    creationToken heuristic, parse and print these values for testing
    purposes.
    
    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    derrickstolee committed Jan 31, 2023
    Configuration menu
    Copy the full SHA
    11b2c91 View commit details
    Browse the repository at this point in the history
  5. bundle-uri: download in creationToken order

    The creationToken heuristic provides an ordering on the bundles
    advertised by a bundle list. Teach the Git client to download bundles
    differently when this heuristic is advertised.
    
    The bundles in the list are sorted by their advertised creationToken
    values, then downloaded in decreasing order. This avoids the previous
    strategy of downloading bundles in an arbitrary order and attempting
    to apply them (likely failing in the case of required commits) until
    discovering the order through attempted unbundling.
    
    During a fresh 'git clone', it may make sense to download the bundles in
    increasing order, since that would prevent the need to attempt
    unbundling a bundle with required commits that do not exist in our empty
    object store. The cost of testing an unbundle is quite low, and instead
    the chosen order is optimizing for a future bundle download during a
    'git fetch' operation with a non-empty object store.
    
    Since the Git client continues fetching from the Git remote after
    downloading and unbundling bundles, the client's object store can be
    ahead of the bundle provider's object store. The next time it attempts
    to download from the bundle list, it makes most sense to download only
    the most-recent bundles until all tips successfully unbundle. The
    strategy implemented here provides that short-circuit where the client
    downloads a minimal set of bundles.
    
    However, we are not satisfied by the naive approach of downloading
    bundles until one successfully unbundles, expecting the earlier bundles
    to successfully unbundle now. The example repository in t5558
    demonstrates this well:
    
     ---------------- bundle-4
    
           4
          / \
     ----|---|------- bundle-3
         |   |
         |   3
         |   |
     ----|---|------- bundle-2
         |   |
         2   |
         |   |
     ----|---|------- bundle-1
          \ /
           1
           |
     (previous commits)
    
    In this repository, if we already have the objects for bundle-1 and then
    try to fetch from this list, the naive approach will fail. bundle-4
    requires both bundle-3 and bundle-2, though bundle-3 will successfully
    unbundle without bundle-2. Thus, the algorithm needs to keep this in
    mind.
    
    A later implementation detail will store the maximum creationToken seen
    during such a bundle download, and the client will avoid downloading a
    bundle unless its creationToken is strictly greater than that stored
    value. For now, if the client seeks to download from an identical
    bundle list since its previous download, it will download the
    most-recent bundle then stop since its required commits are already in
    the object store.
    
    Add tests that exercise this behavior, but we will expand upon these
    tests when incremental downloads during 'git fetch' make use of
    creationToken values.
    
    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    derrickstolee committed Jan 31, 2023
    Configuration menu
    Copy the full SHA
    a052b32 View commit details
    Browse the repository at this point in the history
  6. clone: set fetch.bundleURI if appropriate

    Bundle providers may organize their bundle lists in a way that is
    intended to improve incremental fetches, not just initial clones.
    However, they do need to state that they have organized with that in
    mind, or else the client will not expect to save time by downloading
    bundles after the initial clone. This is done by specifying a
    bundle.heuristic value.
    
    There are two types of bundle lists: those at a static URI and those
    that are advertised from a Git remote over protocol v2.
    
    The new fetch.bundleURI config value applies for static bundle URIs that
    are not advertised over protocol v2. If the user specifies a static URI
    via 'git clone --bundle-uri', then Git can set this config as a reminder
    for future 'git fetch' operations to check the bundle list before
    connecting to the remote(s).
    
    For lists provided over protocol v2, we will want to take a different
    approach and create a property of the remote itself by creating a
    remote.<id>.* type config key. That is not implemented in this change.
    
    Later changes will update 'git fetch' to consume this option.
    
    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    derrickstolee committed Jan 31, 2023
    Configuration menu
    Copy the full SHA
    bd9bd3d View commit details
    Browse the repository at this point in the history
  7. bundle-uri: drop bundle.flag from design doc

    The Implementation Plan section lists a 'bundle.flag' option that is not
    documented anywhere else. What is documented elsewhere in the document
    and implemented by previous changes is the 'bundle.heuristic' config
    key. For now, a heuristic is required to indicate that a bundle list is
    organized for use during 'git fetch', and it is also sufficient for all
    existing designs.
    
    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    derrickstolee committed Jan 31, 2023
    Configuration menu
    Copy the full SHA
    d13547d View commit details
    Browse the repository at this point in the history
  8. fetch: fetch from an external bundle URI

    When a user specifies a URI via 'git clone --bundle-uri', that URI may
    be a bundle list that advertises a 'bundle.heuristic' value. In that
    case, the Git client stores a 'fetch.bundleURI' config value storing
    that URI.
    
    Teach 'git fetch' to check for this config value and download bundles
    from that URI before fetching from the Git remote(s). Likely, the bundle
    provider has configured a heuristic (such as "creationToken") that will
    allow the Git client to download only a portion of the bundles before
    continuing the fetch.
    
    Since this URI is completely independent of the remote server, we want
    to be sure that we connect to the bundle URI before creating a
    connection to the Git remote. We do not want to hold a stateful
    connection for too long if we can avoid it.
    
    To test that this works correctly, extend the previous tests that set
    'fetch.bundleURI' to do follow-up fetches. The bundle list is updated
    incrementally at each phase to demonstrate that the heuristic avoids
    downloading older bundles. This includes the middle fetch downloading
    the objects in bundle-3.bundle from the Git remote, and therefore not
    needing that bundle in the third fetch.
    
    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    derrickstolee committed Jan 31, 2023
    Configuration menu
    Copy the full SHA
    44235ec View commit details
    Browse the repository at this point in the history
  9. bundle-uri: store fetch.bundleCreationToken

    When a bundle list specifies the "creationToken" heuristic, the Git
    client downloads the list and then starts downloading bundles in
    descending creationToken order. This process stops as soon as all
    downloaded bundles can be applied to the repository (because all
    required commits are present in the repository or in the downloaded
    bundles).
    
    When checking the same bundle list twice, this strategy requires
    downloading the bundle with the maximum creationToken again, which is
    wasteful. The creationToken heuristic promises that the client will not
    have a use for that bundle if its creationToken value is at most the
    previous creationToken value.
    
    To prevent these wasteful downloads, create a fetch.bundleCreationToken
    config setting that the Git client sets after downloading bundles. This
    value allows skipping that maximum bundle download when this config
    value is the same value (or larger).
    
    To test that this works correctly, we can insert some "duplicate"
    fetches into existing tests and demonstrate that only the bundle list is
    downloaded.
    
    The previous logic for downloading bundles by creationToken worked even
    if the bundle list was empty, but now we have logic that depends on the
    first entry of the list. Terminate early in the (non-sensical) case of
    an empty bundle list.
    
    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    derrickstolee committed Jan 31, 2023
    Configuration menu
    Copy the full SHA
    9399cce View commit details
    Browse the repository at this point in the history
  10. bundle-uri: test missing bundles with heuristic

    The creationToken heuristic uses a different mechanism for downloading
    bundles from the "standard" approach. Specifically: it uses a concrete
    order based on the creationToken values and attempts to download as few
    bundles as possible. It also modifies local config to store a value for
    future fetches to avoid downloading bundles, if possible.
    
    However, if any of the individual bundles has a failed download, then
    the logic for the ordering comes into question. It is important to avoid
    infinite loops, assigning invalid creation token values in config, but
    also to be opportunistic as possible when downloading as many bundles as
    seem appropriate.
    
    These tests were used to inform the implementation of
    fetch_bundles_by_token() in bundle-uri.c, but are being added
    independently here to allow focusing on faulty downloads. There may be
    more cases that could be added that result in modifications to
    fetch_bundles_by_token() as interesting data shapes reveal themselves in
    real scenarios.
    
    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    derrickstolee committed Jan 31, 2023
    Configuration menu
    Copy the full SHA
    aba61e2 View commit details
    Browse the repository at this point in the history