status: add status serialization mechanism
Teach STATUS to optionally serialize the results of a
status computation to a file.

Teach STATUS to optionally read an existing serialization
file and simply print the results, rather than actually
scanning.

This is intended for immediate status results on extremely
large repos and assumes the use of a service/daemon to
maintain a fresh current status snapshot.

2021-10-30: packet_read() changed its prototype in ec9a37d (pkt-line.[ch]:
remove unused packet_read_line_buf(), 2021-10-14).

2021-10-30: sscanf() now does an extra check that "%d" goes into an "int"
and complains about "uint32_t". Replacing with "%u" fixes the compile-time
error.

2021-10-30: string_list_init() was removed by abf897b (string-list.[ch]:
remove string_list_init() compatibility function, 2021-09-28), so we need to
initialize manually.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
jeffhostetler authored and dscho committed Aug 11, 2023
1 parent 8ee13ce commit 6979cb3
Showing 14 changed files with 1,337 additions and 4 deletions.
6 changes: 6 additions & 0 deletions Documentation/config/status.txt
@@ -75,3 +75,9 @@ status.submoduleSummary::
the --ignore-submodules=dirty command-line option or the 'git
submodule summary' command, which shows a similar output but does
not honor these settings.

status.deserializePath::
(EXPERIMENTAL) Pathname to a file containing cached status results
generated by `--serialize`. This will be overridden by
`--deserialize=<path>` on the command line. If the cache file is
invalid or stale, git will fall back and compute status normally.
33 changes: 33 additions & 0 deletions Documentation/git-status.txt
@@ -149,6 +149,19 @@ ignored, then the directory is not shown, but all contents are shown.
threshold.
See also linkgit:git-diff[1] `--find-renames`.

--serialize[=<version>]::
(EXPERIMENTAL) Serialize raw status results to stdout in a
format suitable for use by `--deserialize`. Valid values for
`<version>` are "1" and "v1".

--deserialize[=<path>]::
(EXPERIMENTAL) Deserialize raw status results from a file or
stdin rather than scanning the worktree. If `<path>` is omitted
and `status.deserializePath` is unset, input is read from stdin.

--no-deserialize::
(EXPERIMENTAL) Disable implicit deserialization of status results
from the value of `status.deserializePath`.

<pathspec>...::
See the 'pathspec' entry in linkgit:gitglossary[7].

@@ -421,6 +434,26 @@ quoted as explained for the configuration variable `core.quotePath`
(see linkgit:git-config[1]).


SERIALIZATION and DESERIALIZATION (EXPERIMENTAL)
------------------------------------------------

The `--serialize` option allows git to cache the result of a
possibly time-consuming status scan to a binary file. A local
service/daemon watching file system events could use this to
periodically pre-compute a fresh status result.

Interactive users could then use `--deserialize` to simply
(and immediately) print the last-known-good result without
waiting for the status scan.

The binary serialization file format includes some worktree state
information allowing `--deserialize` to reject the cached data
and force a normal status scan if, for example, the commit, branch,
or status modes/options change. The format cannot, however, indicate
when the cached data is otherwise stale -- that coordination belongs
to the task driving the serializations.


CONFIGURATION
-------------

107 changes: 107 additions & 0 deletions Documentation/technical/status-serialization-format.txt
@@ -0,0 +1,107 @@
Git status serialization format
===============================

Git status serialization enables git to dump the results of a status scan
to a binary file. This file can then be loaded by later status invocations
to print the cached status results.

The file contains the essential fields from:
() the index
() the "struct wt_status" for the overall results
() the contents of "struct wt_status_change_data" for tracked changed files
() the list of untracked and ignored files

Version 1 Format:
=================

The V1 file begins with a required header section followed by optional
sections for each type of item (changed, untracked, ignored). Individual
item sections are only present if necessary. Each item section begins
with an item-type header with the number of items in the section.

Each "line" in the format is encoded using pkt-line with a final LF.
Flush packets are used to terminate sections.

-----------------
PKT-LINE("version" SP "1")
<v1-header-section>
[<v1-changed-item-section>]
[<v1-untracked-item-section>]
[<v1-ignored-item-section>]
-----------------
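
The pkt-line framing described above can be sketched in C as follows. This is a minimal illustration and not the code from this commit: the helper names `pkt_line_encode` and `pkt_line_flush` are hypothetical, and the sketch writes into a caller-supplied buffer rather than a file descriptor. It assumes the usual pkt-line convention of a 4-digit hex length that counts its own 4 bytes, with "0000" as the flush packet.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PKT_MAX 65520 /* largest packet git's pkt-line code allows */

/*
 * Encode one pkt-line into out: a 4-digit lowercase hex length (which
 * counts the 4 length bytes themselves) followed by the payload.
 * Returns the number of bytes written, or 0 if it does not fit.
 * (Hypothetical helper, for illustration only.)
 */
static size_t pkt_line_encode(char *out, size_t out_size,
			      const char *payload, size_t len)
{
	size_t total = len + 4;

	assert(out && payload);
	/* snprintf() also writes a NUL, so one spare byte is required */
	if (total > PKT_MAX || total + 1 > out_size)
		return 0;
	snprintf(out, 5, "%04x", (unsigned int)total);
	memcpy(out + 4, payload, len);
	return total;
}

/* A flush packet is the special length "0000" with no payload. */
static size_t pkt_line_flush(char *out, size_t out_size)
{
	assert(out);
	if (out_size < 4)
		return 0;
	memcpy(out, "0000", 4);
	return 4;
}
```

For example, encoding the 13-byte header line "is_initial 0" plus LF yields the 17-byte packet "0011is_initial 0" plus LF, and a section is closed by appending the 4 bytes "0000".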


V1 Header
---------

The v1-header-section fields are taken directly from "struct wt_status".
Each field is printed on a separate pkt-line. Lines for NULL string
values are omitted. All integers are printed with "%d". OIDs are
printed in hex.

v1-header-section = <v1-index-headers>
<v1-wt-status-headers>
PKT-LINE(<flush>)

v1-index-headers = PKT-LINE("index_mtime" SP <sec> SP <nsec> LF)

v1-wt-status-headers = PKT-LINE("is_initial" SP <integer> LF)
[ PKT-LINE("branch" SP <branch-name> LF) ]
[ PKT-LINE("reference" SP <reference-name> LF) ]
PKT-LINE("show_ignored_files" SP <integer> LF)
PKT-LINE("show_untracked_files" SP <integer> LF)
PKT-LINE("show_ignored_directory" SP <integer> LF)
[ PKT-LINE("ignore_submodule_arg" SP <string> LF) ]
PKT-LINE("detect_rename" SP <integer> LF)
PKT-LINE("rename_score" SP <integer> LF)
PKT-LINE("rename_limit" SP <integer> LF)
PKT-LINE("detect_break" SP <integer> LF)
PKT-LINE("sha1_commit" SP <oid> LF)
PKT-LINE("committable" SP <integer> LF)
PKT-LINE("workdir_dirty" SP <integer> LF)
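
A reader of the header section might parse individual lines along these lines. This is only a sketch under stated assumptions: `parse_index_mtime` and `parse_int_header` are hypothetical names, the payload is assumed to have been copied into a NUL-terminated string already, and the `SCNu32` conversion is used so that the `sscanf()` format matches a `uint32_t` destination (the type mismatch mentioned in the commit message).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse the payload of an "index_mtime" header line.
 * Returns 0 on success, -1 if the line does not match.
 * (Hypothetical helper, for illustration only.)
 */
static int parse_index_mtime(const char *line, uint32_t *sec, uint32_t *nsec)
{
	assert(line && sec && nsec);
	if (sscanf(line, "index_mtime %" SCNu32 " %" SCNu32, sec, nsec) != 2)
		return -1;
	return 0;
}

/*
 * Parse a "<key> SP <integer> LF" header line such as "is_initial 0\n".
 * Returns 0 on success, -1 if the key or the value does not match.
 * (Hypothetical helper, for illustration only.)
 */
static int parse_int_header(const char *line, const char *key, int *value)
{
	size_t key_len = strlen(key);
	char lf;

	assert(line && value);
	if (strncmp(line, key, key_len) || line[key_len] != ' ')
		return -1;
	/* require an integer followed by the terminating LF */
	if (sscanf(line + key_len + 1, "%d%c", value, &lf) != 2 || lf != '\n')
		return -1;
	return 0;
}
```

Because optional lines (for example "branch") may be absent, a real reader would presumably dispatch on the key of each line rather than expect a fixed order; the sketch leaves that loop out.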


V1 Changed Items
----------------

The v1-changed-item-section lists all of the changed items with one
item per pkt-line. Each pkt-line contains: a binary block of data
from "struct wt_status_serialize_data_fixed" in a fixed header where
integers are in network byte order and OIDs are in raw (non-hex) form.
This is followed by one or two raw pathnames (not c-quoted) with NUL
terminators (both NULs are always present even if there is no rename).

v1-changed-item-section = PKT-LINE("changed" SP <count> LF)
[ PKT-LINE(<changed_item> LF) ]+
PKT-LINE(<flush>)

changed_item = <byte[4] worktree_status>
<byte[4] index_status>
<byte[4] stagemask>
<byte[4] score>
<byte[4] mode_head>
<byte[4] mode_index>
<byte[4] mode_worktree>
<byte[4] dirty_submodule>
<byte[4] new_submodule_commits>
<byte[20] oid_head>
<byte[20] oid_index>
<byte[*] path>
NUL
[ <byte[*] src_path> ]
NUL
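
The changed_item layout above could be packed as in the following sketch. It is not the implementation from this commit: `struct changed_item`, `put_be32`, and `pack_changed_item` are hypothetical names, the nine integer fields are held in an array in layout order for brevity, and the pkt-line framing and the trailing LF shown in the grammar are left to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OID_RAWSZ 20 /* raw SHA-1 size used by the V1 layout */
#define CHANGED_FIXED_SIZE (9 * 4 + 2 * OID_RAWSZ) /* 76 bytes */

/* Hypothetical in-memory form of one changed item (host byte order). */
struct changed_item {
	uint32_t fields[9]; /* worktree_status .. new_submodule_commits */
	unsigned char oid_head[OID_RAWSZ];
	unsigned char oid_index[OID_RAWSZ];
	const char *path;
	const char *src_path; /* NULL when there is no rename */
};

/* Store a 32-bit value in network (big-endian) byte order. */
static void put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/*
 * Pack the fixed header, then path NUL [src_path] NUL.
 * Returns the number of bytes written, or 0 if out is too small.
 */
static size_t pack_changed_item(unsigned char *out, size_t out_size,
				const struct changed_item *it)
{
	size_t path_len, src_len, total;
	unsigned char *p = out;
	int i;

	assert(out && it && it->path);
	path_len = strlen(it->path);
	src_len = it->src_path ? strlen(it->src_path) : 0;
	total = CHANGED_FIXED_SIZE + path_len + 1 + src_len + 1;
	if (total > out_size)
		return 0;

	for (i = 0; i < 9; i++, p += 4)
		put_be32(p, it->fields[i]);
	memcpy(p, it->oid_head, OID_RAWSZ);
	p += OID_RAWSZ;
	memcpy(p, it->oid_index, OID_RAWSZ);
	p += OID_RAWSZ;
	memcpy(p, it->path, path_len + 1); /* includes the first NUL */
	p += path_len + 1;
	if (src_len)
		memcpy(p, it->src_path, src_len);
	p[src_len] = '\0'; /* second NUL is written even without a rename */
	return total;
}
```

Writing each integer byte by byte keeps the sketch independent of the host byte order and of platform headers such as `<arpa/inet.h>`.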


V1 Untracked and Ignored Items
------------------------------

These sections are simple lists of pathnames. They ARE NOT
c-quoted.

v1-untracked-item-section = PKT-LINE("untracked" SP <count> LF)
[ PKT-LINE(<pathname> LF) ]+
PKT-LINE(<flush>)

v1-ignored-item-section = PKT-LINE("ignored" SP <count> LF)
[ PKT-LINE(<pathname> LF) ]+
PKT-LINE(<flush>)
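
Reading an untracked or ignored section back could look like the sketch below. It is an illustration under stated assumptions, not the deserializer from this commit: `hex_val`, `pkt_line_next`, and `count_section_items` are hypothetical names, and the whole file is assumed to be in memory already. The walker checks the section name, counts one pathname per pkt-line until the flush packet, and compares the result with the count announced in the section header.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static int hex_val(char c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	return -1;
}

/*
 * Read one pkt-line from buf. Returns the bytes consumed, or 0 on
 * malformed input. A flush packet sets *payload to NULL and consumes 4.
 */
static size_t pkt_line_next(const char *buf, size_t size,
			    const char **payload, size_t *len)
{
	size_t total = 0;
	int i;

	assert(buf && payload && len);
	if (size < 4)
		return 0;
	for (i = 0; i < 4; i++) {
		int v = hex_val(buf[i]);
		if (v < 0)
			return 0;
		total = total * 16 + (size_t)v;
	}
	if (!total) {
		*payload = NULL;
		*len = 0;
		return 4;
	}
	if (total < 4 || total > size)
		return 0;
	*payload = buf + 4;
	*len = total - 4;
	return total;
}

/*
 * Walk a "<name> SP <count> LF" section. Returns the number of
 * pathnames, or -1 if it is malformed or disagrees with <count>.
 */
static int count_section_items(const char *buf, size_t size, const char *name)
{
	const char *payload;
	size_t len, used, name_len = strlen(name);
	char header[64];
	int expected, seen = 0;

	used = pkt_line_next(buf, size, &payload, &len);
	if (!used || !payload || len <= name_len + 1 || len >= sizeof(header))
		return -1;
	memcpy(header, payload, len);
	header[len] = '\0';
	if (strncmp(header, name, name_len) || header[name_len] != ' ' ||
	    sscanf(header + name_len + 1, "%d", &expected) != 1)
		return -1;

	for (;;) {
		buf += used;
		size -= used;
		used = pkt_line_next(buf, size, &payload, &len);
		if (!used)
			return -1;
		if (!payload)
			break; /* flush packet ends the section */
		if (!len || payload[len - 1] != '\n')
			return -1; /* each pathname line ends with LF */
		seen++;
	}
	return seen == expected ? seen : -1;
}
```

Since the pathnames are not c-quoted, a caller that wants the names themselves would take `len - 1` bytes of each payload as the raw path.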
2 changes: 2 additions & 0 deletions Makefile
@@ -1201,6 +1201,8 @@ LIB_OBJS += wrapper.o
LIB_OBJS += write-or-die.o
LIB_OBJS += ws.o
LIB_OBJS += wt-status.o
LIB_OBJS += wt-status-deserialize.o
LIB_OBJS += wt-status-serialize.o
LIB_OBJS += xdiff-interface.o

BUILTIN_OBJS += builtin/add.o
123 changes: 122 additions & 1 deletion builtin/commit.c
@@ -175,6 +175,70 @@ static int opt_parse_porcelain(const struct option *opt, const char *arg, int un
return 0;
}

static int do_serialize = 0;
static int do_implicit_deserialize = 0;
static int do_explicit_deserialize = 0;
static char *deserialize_path = NULL;

/*
* --serialize | --serialize=1 | --serialize=v1
*
* Request that we serialize our output rather than printing in
* any of the established formats. Optionally specify serialization
* version.
*/
static int opt_parse_serialize(const struct option *opt, const char *arg, int unset)
{
enum wt_status_format *value = (enum wt_status_format *)opt->value;
if (unset || !arg)
*value = STATUS_FORMAT_SERIALIZE_V1;
else if (!strcmp(arg, "v1") || !strcmp(arg, "1"))
*value = STATUS_FORMAT_SERIALIZE_V1;
else
die("unsupported serialize version '%s'", arg);

if (do_explicit_deserialize)
die("cannot mix --serialize and --deserialize");
do_implicit_deserialize = 0;

do_serialize = 1;
return 0;
}

/*
* --deserialize | --deserialize=<path> |
* --no-deserialize
*
* Request that we deserialize status data from some existing resource
* rather than performing a status scan.
*
* The input source can come from stdin or a path given here -- or be
* inherited from the config settings.
*/
static int opt_parse_deserialize(const struct option *opt, const char *arg, int unset)
{
if (unset) {
do_implicit_deserialize = 0;
do_explicit_deserialize = 0;
} else {
if (do_serialize)
die("cannot mix --serialize and --deserialize");
if (arg) {
/* override config or stdin */
free(deserialize_path);
deserialize_path = xstrdup(arg);
}
if (deserialize_path && *deserialize_path
&& (access(deserialize_path, R_OK) != 0))
die("cannot find serialization file '%s'",
deserialize_path);

do_explicit_deserialize = 1;
}

return 0;
}

static int opt_parse_m(const struct option *opt, const char *arg, int unset)
{
struct strbuf *buf = opt->value;
@@ -1176,6 +1240,8 @@ static void handle_untracked_files_arg(struct wt_status *s)
s->show_untracked_files = SHOW_NORMAL_UNTRACKED_FILES;
else if (!strcmp(untracked_files_arg, "all"))
s->show_untracked_files = SHOW_ALL_UNTRACKED_FILES;
else if (!strcmp(untracked_files_arg, "complete"))
s->show_untracked_files = SHOW_COMPLETE_UNTRACKED_FILES;
/*
* Please update $__git_untracked_file_modes in
* git-completion.bash when you add new options
@@ -1463,6 +1529,19 @@ static int git_status_config(const char *k, const char *v,
s->relative_paths = git_config_bool(k, v);
return 0;
}
if (!strcmp(k, "status.deserializepath")) {
/*
* Automatically assume deserialization if this is
* set in the config and the file exists. Do not
* complain if the file does not exist, because we
* silently fall back to normal mode.
*/
if (v && *v && access(v, R_OK) == 0) {
do_implicit_deserialize = 1;
deserialize_path = xstrdup(v);
}
return 0;
}
if (!strcmp(k, "status.showuntrackedfiles")) {
if (!v)
return config_error_nonbool(k);
@@ -1503,7 +1582,8 @@ int cmd_status(int argc, const char **argv, const char *prefix)
static const char *rename_score_arg = (const char *)-1;
static struct wt_status s;
unsigned int progress_flag = 0;
-int fd;
+int try_deserialize;
+int fd = -1;
struct object_id oid;
static struct option builtin_status_options[] = {
OPT__VERBOSE(&verbose, N_("be verbose")),
@@ -1518,6 +1598,12 @@ int cmd_status(int argc, const char **argv, const char *prefix)
OPT_CALLBACK_F(0, "porcelain", &status_format,
N_("version"), N_("machine-readable output"),
PARSE_OPT_OPTARG, opt_parse_porcelain),
{ OPTION_CALLBACK, 0, "serialize", &status_format,
N_("version"), N_("serialize raw status data to stdout"),
PARSE_OPT_OPTARG | PARSE_OPT_NONEG, opt_parse_serialize },
{ OPTION_CALLBACK, 0, "deserialize", NULL,
N_("path"), N_("deserialize raw status data from file"),
PARSE_OPT_OPTARG, opt_parse_deserialize },
OPT_SET_INT(0, "long", &status_format,
N_("show status in long format (default)"),
STATUS_FORMAT_LONG),
@@ -1562,10 +1648,26 @@ int cmd_status(int argc, const char **argv, const char *prefix)
s.show_untracked_files == SHOW_NO_UNTRACKED_FILES)
die(_("Unsupported combination of ignored and untracked-files arguments"));

if (s.show_untracked_files == SHOW_COMPLETE_UNTRACKED_FILES &&
s.show_ignored_mode == SHOW_NO_IGNORED)
die(_("Complete Untracked only supported with ignored files"));

parse_pathspec(&s.pathspec, 0,
PATHSPEC_PREFER_FULL,
prefix, argv);

/*
* If we want to try to deserialize status data from a cache file,
* we need to re-order the initialization code. The problem is that
* this makes for a very nasty diff and causes merge conflicts as we
carry it forward. And it is easy to mess up the merge, so we
* duplicate some code here to hopefully reduce conflicts.
*/
try_deserialize = (!do_serialize &&
(do_implicit_deserialize || do_explicit_deserialize));
if (try_deserialize)
goto skip_init;

enable_fscache(0);
if (status_format != STATUS_FORMAT_PORCELAIN &&
status_format != STATUS_FORMAT_PORCELAIN_V2)
@@ -1580,6 +1682,7 @@ int cmd_status(int argc, const char **argv, const char *prefix)
else
fd = -1;

skip_init:
s.is_initial = repo_get_oid(the_repository, s.reference, &oid) ? 1 : 0;
if (!s.is_initial)
oidcpy(&s.oid_commit, &oid);
@@ -1596,6 +1699,24 @@ int cmd_status(int argc, const char **argv, const char *prefix)
s.rename_score = parse_rename_score(&rename_score_arg);
}

if (try_deserialize) {
if (s.relative_paths)
s.prefix = prefix;

if (wt_status_deserialize(&s, deserialize_path) == DESERIALIZE_OK)
return 0;

/* deserialize failed, so force the initialization we skipped above. */
enable_fscache(1);
repo_read_index_preload(the_repository, &s.pathspec, 0);
refresh_index(&the_index, REFRESH_QUIET|REFRESH_UNMERGED, &s.pathspec, NULL, NULL);

if (use_optional_locks())
fd = repo_hold_locked_index(the_repository, &index_lock, 0);
else
fd = -1;
}

wt_status_collect(&s);

if (0 <= fd)
2 changes: 1 addition & 1 deletion contrib/completion/git-completion.bash
@@ -1675,7 +1675,7 @@ _git_clone ()
esac
}

-__git_untracked_file_modes="all no normal"
+__git_untracked_file_modes="all no normal complete"

_git_commit ()
{
2 changes: 1 addition & 1 deletion pkt-line.c
@@ -230,7 +230,7 @@ static int do_packet_write(const int fd_out, const char *buf, size_t size,
return 0;
}

-static int packet_write_gently(const int fd_out, const char *buf, size_t size)
+int packet_write_gently(const int fd_out, const char *buf, size_t size)
{
struct strbuf err = STRBUF_INIT;
if (do_packet_write(fd_out, buf, size, &err)) {
1 change: 1 addition & 0 deletions pkt-line.h
@@ -30,6 +30,7 @@ void packet_write(int fd_out, const char *buf, size_t size);
void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
int packet_flush_gently(int fd);
int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
int packet_write_gently(const int fd_out, const char *buf, size_t size);
int write_packetized_from_fd_no_flush(int fd_in, int fd_out);
int write_packetized_from_buf_no_flush_count(const char *src_in, size_t len,
int fd_out, int *packet_counter);