Hydrate missing loose objects in check_and_freshen()
Hydrate missing loose objects in check_and_freshen() when running
virtualized. Add test cases to verify read-object hook works when
running virtualized.

This hook is called in check_and_freshen() rather than
check_and_freshen_local() to make the hook work also with alternates.

Helped-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Ben Peart <Ben.Peart@microsoft.com>
Ben Peart authored and dscho committed Sep 22, 2022
1 parent 6cc32f9 commit 52e1bb5
Showing 5 changed files with 482 additions and 16 deletions.
102 changes: 102 additions & 0 deletions Documentation/technical/read-object-protocol.txt
@@ -0,0 +1,102 @@
Read Object Process
^^^^^^^^^^^^^^^^^^^

The read-object process enables Git to read all missing blobs with a
single process invocation for the entire life of a single Git command.
This is achieved by using a protocol based on the pkt-line packet format
(see technical/protocol-common.txt) over standard input and standard
output as follows. All packets, except for the "*CONTENT" packets and
the "0000" flush packet, are considered text and therefore are
terminated by a LF.
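
As a rough illustration of the framing described above, the following C
sketch encodes one text packet: four hex digits giving the total length
(which counts the four length bytes themselves), then the payload and a
terminating LF. The name `pkt_line_format` is hypothetical and is not
part of Git's own pkt-line API.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not Git's API): encode `text` as one pkt-line
 * text packet into `out`. The 4-digit hex length counts the header,
 * the payload, and the trailing LF. Returns the number of bytes
 * written (excluding the NUL terminator), or -1 if the packet does
 * not fit in `out` or its length does not fit in four hex digits.
 */
int pkt_line_format(char *out, size_t outsz, const char *text)
{
	size_t total = 4 + strlen(text) + 1; /* header + payload + LF */

	if (total > 0xffff || total + 1 > outsz)
		return -1;
	snprintf(out, outsz, "%04x%s\n", (unsigned int)total, text);
	return (int)total;
}
```

With this sketch, the payload "version=1" becomes the 14 bytes
"000eversion=1" followed by LF. A flush packet is just the four bytes
"0000" with no payload and no LF.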

Git starts the process when it encounters the first missing object that
needs to be retrieved. After the process is started, Git sends a welcome
message ("git-read-object-client"), a list of supported protocol version
numbers, and a flush packet. Git expects to read a welcome response
message ("git-read-object-server"), exactly one protocol version number
from the previously sent list, and a flush packet. All further
communication will be based on the selected version.

The remaining protocol description below documents "version=1". Please
note that "version=42" in the example below does not exist and is only
there to illustrate how the protocol would look with more than one
version.
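
The version negotiation can be sketched from the process side as
follows. The text above only requires that the process answer with
exactly one version from Git's list; picking the highest common version
is an assumption made for this example, and `select_version` is a
hypothetical name.

```c
#include <stddef.h>

/*
 * Hypothetical process-side helper: choose exactly one version from
 * the list Git offered. Returns the highest version present in both
 * arrays, or 0 if there is no common version (0 is never a valid
 * protocol version in this sketch).
 */
int select_version(const int *offered, size_t n_offered,
		   const int *supported, size_t n_supported)
{
	int best = 0;
	size_t i, j;

	for (i = 0; i < n_offered; i++)
		for (j = 0; j < n_supported; j++)
			if (offered[i] == supported[j] && offered[i] > best)
				best = offered[i];
	return best;
}
```

A process that only implements version 1 and is offered versions 1 and
42 would select 1 here.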

After the version negotiation Git sends a list of all capabilities that
it supports and a flush packet. Git expects to read a list of desired
capabilities, which must be a subset of the supported capabilities list,
and a flush packet as response:
------------------------
packet: git> git-read-object-client
packet: git> version=1
packet: git> version=42
packet: git> 0000
packet: git< git-read-object-server
packet: git< version=1
packet: git< 0000
packet: git> capability=get
packet: git> capability=have
packet: git> capability=put
packet: git> capability=not-yet-invented
packet: git> 0000
packet: git< capability=get
packet: git< 0000
------------------------
The only supported capability in version 1 is "get".
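
One way a process might compute its capability reply is sketched below.
It keeps only the offered capabilities it knows, so the answer is a
subset of Git's list by construction. `select_capabilities` and
`known_caps` are hypothetical names; `CAP_GET` mirrors the flag defined
in object-file.c in this commit.

```c
#include <string.h>

#define CAP_GET (1u << 0)

/* Capabilities this hypothetical process implements (version 1: "get"). */
static const struct {
	const char *name;
	unsigned int flag;
} known_caps[] = {
	{ "get", CAP_GET },
};

/*
 * Given the "capability=<name>" lines received from Git (LF already
 * stripped), return a bitmask of the capabilities that this process
 * also implements. Unknown names and malformed lines are ignored.
 */
unsigned int select_capabilities(const char *const *lines, size_t n_lines)
{
	const char *prefix = "capability=";
	size_t plen = strlen(prefix);
	unsigned int mask = 0;
	size_t i, j;

	for (i = 0; i < n_lines; i++) {
		if (strncmp(lines[i], prefix, plen))
			continue;
		for (j = 0; j < sizeof(known_caps) / sizeof(known_caps[0]); j++)
			if (!strcmp(lines[i] + plen, known_caps[j].name))
				mask |= known_caps[j].flag;
	}
	return mask;
}
```

For the four capabilities offered in the exchange above, this sketch
yields only `CAP_GET`.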

Afterwards Git sends a list of "key=value" pairs terminated with a flush
packet. The list will contain at least the command (based on the
supported capabilities) and the sha1 of the object to retrieve. Note
that the process must not send any response before it has received the
final flush packet.

When the process receives the "get" command, it should make the requested
object available in the git object store and then return success. Git will
then check the object store again and this time find it and proceed.
------------------------
packet: git> command=get
packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
packet: git> 0000
------------------------
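
The bytes behind the request shown above can be assembled as in the
following sketch: two text packets followed by the flush packet. The
function name `build_get_request` is hypothetical; Git itself sends
these packets with its own pkt-line helpers.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the complete "get" request for the
 * 40-hex-digit object name `hex` into `out`, i.e. the packets
 * "command=get", "sha1=<hex>" and the flush packet "0000".
 * Returns the number of bytes written (excluding the NUL terminator),
 * or -1 on malformed input or if `out` is too small.
 */
int build_get_request(char *out, size_t outsz, const char *hex)
{
	const char *cmd = "command=get";
	int n;

	if (strlen(hex) != 40)
		return -1;
	/* Each length is: 4 header bytes + payload + 1 byte for the LF. */
	n = snprintf(out, outsz, "%04x%s\n%04x%s%s\n0000",
		     (unsigned int)(4 + strlen(cmd) + 1), cmd,
		     (unsigned int)(4 + 5 + 40 + 1), "sha1=", hex);
	if (n < 0 || (size_t)n >= outsz)
		return -1;
	return n;
}
```

For the object name used above, the first packet is 16 bytes ("0010"),
the second is 50 bytes ("0032"), and the whole request is 70 bytes long.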

The process is expected to respond with a list of "key=value" pairs
terminated with a flush packet. If the process does not experience
problems then the list must contain a "success" status.
------------------------
packet: git< status=success
packet: git< 0000
------------------------

In case the process cannot or does not want to retrieve the object, it
is expected to respond with an "error" status.
------------------------
packet: git< status=error
packet: git< 0000
------------------------

In case the process cannot or does not want to retrieve this object or
any future object for the lifetime of the Git process, it is expected
to respond with an "abort" status at any point in the protocol.
------------------------
packet: git< status=abort
packet: git< 0000
------------------------

Git neither stops nor restarts the process in case the "error"/"abort"
status is set.

If the process dies during the communication or does not adhere to the
protocol then Git will stop the process and restart it with the next
object that needs to be processed.
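
The three documented status values, plus the protocol-violation case,
map to four different reactions on the Git side. The sketch below
summarizes that mapping; the enum and the function name
`classify_status` are hypothetical and only loosely follow
read_object_process() in object-file.c.

```c
#include <string.h>

enum ro_action {
	RO_RETRY_LOOKUP,    /* "success": look in the object store again */
	RO_FAIL_OBJECT,     /* "error": this object failed, process is kept */
	RO_DISABLE_GET,     /* "abort": send no more "get" in this Git process */
	RO_RESTART_PROCESS  /* anything else: stop now, restart on next object */
};

/* Classify one status line from the process (LF already stripped). */
enum ro_action classify_status(const char *line)
{
	if (!strcmp(line, "status=success"))
		return RO_RETRY_LOOKUP;
	if (!strcmp(line, "status=error"))
		return RO_FAIL_OBJECT;
	if (!strcmp(line, "status=abort"))
		return RO_DISABLE_GET;
	return RO_RESTART_PROCESS;
}
```

Only the last case stops the process; "error" and "abort" leave it
running, as stated above.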

After the read-object process has processed an object it is expected to
wait for the next "key=value" list containing a command. Git will close
the command pipe on exit. The process is expected to detect EOF and exit
gracefully on its own. Git will wait until the process has stopped.
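
The read side of such a loop has to tell data packets, flush packets,
and a clean EOF apart. The following sketch does this over an in-memory
buffer so that it stays self-contained; a real process would read from
standard input instead. All names are hypothetical, and packet sizes of
1 to 4 are rejected to mirror packet_bin_read() in the example script.

```c
#include <stddef.h>

enum pkt_kind { PKT_DATA, PKT_FLUSH, PKT_EOF, PKT_ERROR };

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Decode the packet that starts at *off in buf[0..len). On PKT_DATA,
 * *payload and *payload_len describe the payload (LF included) and
 * *off is advanced past the packet. On PKT_FLUSH, *off is advanced
 * past the four length bytes. PKT_EOF means the input ended cleanly
 * on a packet boundary, which is the signal to exit gracefully.
 */
enum pkt_kind pkt_next(const char *buf, size_t len, size_t *off,
		       const char **payload, size_t *payload_len)
{
	size_t size = 0;
	int i;

	if (*off >= len)
		return PKT_EOF;
	if (len - *off < 4)
		return PKT_ERROR; /* truncated length header */
	for (i = 0; i < 4; i++) {
		int v = hex_val(buf[*off + i]);
		if (v < 0)
			return PKT_ERROR;
		size = size * 16 + (size_t)v;
	}
	if (size == 0) {
		*off += 4;
		return PKT_FLUSH;
	}
	if (size <= 4 || size > len - *off)
		return PKT_ERROR; /* invalid size or truncated payload */
	*payload = buf + *off + 4;
	*payload_len = size - 4;
	*off += size;
	return PKT_DATA;
}
```

A command loop would call `pkt_next` repeatedly, collect "key=value"
payloads until PKT_FLUSH, answer, and leave the loop on PKT_EOF.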

A demo implementation of a long-running read-object process can be found
in `contrib/long-running-read-object/example.pl` located in the Git core
repository. If you develop your own long-running process then the
`GIT_TRACE_PACKET` environment variable can be very helpful for
debugging (see linkgit:git[1]).
114 changes: 114 additions & 0 deletions contrib/long-running-read-object/example.pl
@@ -0,0 +1,114 @@
#!/usr/bin/perl
#
# Example implementation for the Git read-object protocol version 1
# See Documentation/technical/read-object-protocol.txt
#
# Allows you to test the ability for blobs to be pulled from a host git repo
# "on demand." Called when git needs a blob it couldn't find locally due to
# a lazy clone that only cloned the commits and trees.
#
# A lazy clone can be simulated via the following commands from the host repo
# you wish to create a lazy clone of:
#
# cd /host_repo
# git rev-parse HEAD
# git init /guest_repo
# git cat-file --batch-check --batch-all-objects | grep -v 'blob' |
# cut -d' ' -f1 | git pack-objects /guest_repo/.git/objects/pack/noblobs
# cd /guest_repo
# git config core.virtualizeobjects true
# git reset --hard <sha from rev-parse call above>
#
# Please note, this sample is a minimal skeleton. No proper error handling
# was implemented.
#

use strict;
use warnings;

#
# Point $DIR to the folder where your host git repo is located so we can pull
# missing objects from it
#
my $DIR = "/host_repo/.git/";

sub packet_bin_read {
	my $buffer;
	my $bytes_read = read STDIN, $buffer, 4;
	if ( $bytes_read == 0 ) {

		# EOF - Git stopped talking to us!
		exit();
	}
	elsif ( $bytes_read != 4 ) {
		die "invalid packet: '$buffer'";
	}
	my $pkt_size = hex($buffer);
	if ( $pkt_size == 0 ) {
		return ( 1, "" );
	}
	elsif ( $pkt_size > 4 ) {
		my $content_size = $pkt_size - 4;
		$bytes_read = read STDIN, $buffer, $content_size;
		if ( $bytes_read != $content_size ) {
			die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
		}
		return ( 0, $buffer );
	}
	else {
		die "invalid packet size: $pkt_size";
	}
}

sub packet_txt_read {
	my ( $res, $buf ) = packet_bin_read();
	unless ( $buf =~ s/\n$// ) {
		die "A non-binary line MUST be terminated by an LF.";
	}
	return ( $res, $buf );
}

sub packet_bin_write {
	my $buf = shift;
	print STDOUT sprintf( "%04x", length($buf) + 4 );
	print STDOUT $buf;
	STDOUT->flush();
}

sub packet_txt_write {
	packet_bin_write( $_[0] . "\n" );
}

sub packet_flush {
	print STDOUT sprintf( "%04x", 0 );
	STDOUT->flush();
}

( packet_txt_read() eq ( 0, "git-read-object-client" ) ) || die "bad initialize";
( packet_txt_read() eq ( 0, "version=1" ) ) || die "bad version";
( packet_bin_read() eq ( 1, "" ) ) || die "bad version end";

packet_txt_write("git-read-object-server");
packet_txt_write("version=1");
packet_flush();

( packet_txt_read() eq ( 0, "capability=get" ) ) || die "bad capability";
( packet_bin_read() eq ( 1, "" ) ) || die "bad capability end";

packet_txt_write("capability=get");
packet_flush();

while (1) {
	my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;

	if ( $command eq "get" ) {
		my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
		packet_bin_read();

		system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c core.virtualizeobjects=false hash-object -w --stdin >/dev/null 2>&1');
		packet_txt_write(($?) ? "status=error" : "status=success");
		packet_flush();
	} else {
		die "bad command '$command'";
	}
}
141 changes: 125 additions & 16 deletions object-file.c
@@ -34,6 +34,9 @@
 #include "promisor-remote.h"
 #include "submodule.h"
 #include "hook.h"
+#include "sigchain.h"
+#include "sub-process.h"
+#include "pkt-line.h"

 /* The maximum size for an object header. */
 #define MAX_HEADER_LEN 32
@@ -939,6 +942,115 @@ void prepare_alt_odb(struct repository *r)
 	r->objects->loaded_alternates = 1;
 }

+#define CAP_GET (1u<<0)
+
+static int subprocess_map_initialized;
+static struct hashmap subprocess_map;
+
+struct read_object_process {
+	struct subprocess_entry subprocess;
+	unsigned int supported_capabilities;
+};
+
+static int start_read_object_fn(struct subprocess_entry *subprocess)
+{
+	struct read_object_process *entry = (struct read_object_process *)subprocess;
+	static int versions[] = {1, 0};
+	static struct subprocess_capability capabilities[] = {
+		{ "get", CAP_GET },
+		{ NULL, 0 }
+	};
+
+	return subprocess_handshake(subprocess, "git-read-object", versions,
+				    NULL, capabilities,
+				    &entry->supported_capabilities);
+}
+
+static int read_object_process(const struct object_id *oid)
+{
+	int err;
+	struct read_object_process *entry;
+	struct child_process *process;
+	struct strbuf status = STRBUF_INIT;
+	const char *cmd = find_hook("read-object");
+	uint64_t start;
+
+	start = getnanotime();
+
+	if (!subprocess_map_initialized) {
+		subprocess_map_initialized = 1;
+		hashmap_init(&subprocess_map, (hashmap_cmp_fn)cmd2process_cmp,
+			     NULL, 0);
+		entry = NULL;
+	} else {
+		entry = (struct read_object_process *) subprocess_find_entry(&subprocess_map, cmd);
+	}
+
+	if (!entry) {
+		entry = xmalloc(sizeof(*entry));
+		entry->supported_capabilities = 0;
+
+		if (subprocess_start(&subprocess_map, &entry->subprocess, cmd,
+				     start_read_object_fn)) {
+			free(entry);
+			return -1;
+		}
+	}
+	process = &entry->subprocess.process;
+
+	if (!(CAP_GET & entry->supported_capabilities))
+		return -1;
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+	err = packet_write_fmt_gently(process->in, "command=get\n");
+	if (err)
+		goto done;
+
+	err = packet_write_fmt_gently(process->in, "sha1=%s\n", oid_to_hex(oid));
+	if (err)
+		goto done;
+
+	err = packet_flush_gently(process->in);
+	if (err)
+		goto done;
+
+	err = subprocess_read_status(process->out, &status);
+	err = err ? err : strcmp(status.buf, "success");
+
+done:
+	sigchain_pop(SIGPIPE);
+
+	if (err || errno == EPIPE) {
+		err = err ? err : errno;
+		if (!strcmp(status.buf, "error")) {
+			/* The process signaled a problem with the file. */
+		}
+		else if (!strcmp(status.buf, "abort")) {
+			/*
+			 * The process signaled a permanent problem. Don't try to read
+			 * objects with the same command for the lifetime of the current
+			 * Git process.
+			 */
+			entry->supported_capabilities &= ~CAP_GET;
+		}
+		else {
+			/*
+			 * Something went wrong with the read-object process.
+			 * Force shutdown and restart if needed.
+			 */
+			error("external process '%s' failed", cmd);
+			subprocess_stop(&subprocess_map,
+					(struct subprocess_entry *)entry);
+			free(entry);
+		}
+	}
+
+	trace_performance_since(start, "read_object_process");
+
+	return err;
+}
+
 /* Returns 1 if we have successfully freshened the file, 0 otherwise. */
 static int freshen_file(const char *fn)
 {
@@ -989,8 +1101,19 @@ static int check_and_freshen_nonlocal(const struct object_id *oid, int freshen)

 static int check_and_freshen(const struct object_id *oid, int freshen)
 {
-	return check_and_freshen_local(oid, freshen) ||
+	int ret;
+	int tried_hook = 0;
+
+retry:
+	ret = check_and_freshen_local(oid, freshen) ||
 	       check_and_freshen_nonlocal(oid, freshen);
+	if (!ret && core_virtualize_objects && !tried_hook) {
+		tried_hook = 1;
+		if (!read_object_process(oid))
+			goto retry;
+	}
+
+	return ret;
 }

 int has_loose_object_nonlocal(const struct object_id *oid)
@@ -1526,20 +1649,6 @@ void disable_obj_read_lock(void)
 	pthread_mutex_destroy(&obj_read_mutex);
 }

-static int run_read_object_hook(const struct object_id *oid)
-{
-	struct run_hooks_opt opt = RUN_HOOKS_OPT_INIT;
-	int ret;
-	uint64_t start;
-
-	start = getnanotime();
-	strvec_push(&opt.args, oid_to_hex(oid));
-	ret = run_hooks_opt("read-object", &opt);
-	trace_performance_since(start, "run_read_object_hook");
-
-	return ret;
-}
-
 int fetch_if_missing = 1;

 static int do_oid_object_info_extended(struct repository *r,
@@ -1601,7 +1710,7 @@ static int do_oid_object_info_extended(struct repository *r,
 				break;
 			if (core_virtualize_objects && !tried_hook) {
 				tried_hook = 1;
-				if (!run_read_object_hook(oid))
+				if (!read_object_process(oid))
 					goto retry;
 			}
 		}