[RFC] Multi-pack Index (MIDX) #1

Closed
wants to merge 18 commits into from
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -95,6 +95,7 @@
/git-merge-subtree
/git-mergetool
/git-mergetool--lib
/git-midx
/git-mktag
/git-mktree
/git-name-rev
3 changes: 3 additions & 0 deletions Documentation/config.txt
@@ -896,6 +896,9 @@ core.notesRef::
This setting defaults to "refs/notes/commits", and it can be overridden by
the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1].

core.midx::
Enable "multi-pack-index" feature. Set to true to read and write MIDX files.

core.sparseCheckout::
Enable "sparse checkout" feature. See section "Sparse checkout" in
linkgit:git-read-tree[1] for more information.
106 changes: 106 additions & 0 deletions Documentation/git-midx.txt
@@ -0,0 +1,106 @@
git-midx(1)
============

NAME
----
git-midx - Write and read multi-pack-indexes (MIDX files).


SYNOPSIS
--------
[verse]
'git midx' [--write|--read|--clear] <options> [--pack-dir <pack_dir>]

DESCRIPTION
-----------
Write, read, or clear multi-pack-index (MIDX) files and the midx-head file that points to the current one.

OPTIONS
-------

--pack-dir <pack_dir>::
Use given directory for the location of packfiles, pack-indexes,
and MIDX files.

--clear::
If specified, delete the midx file specified by midx-head, and the
midx-head file itself. (Cannot be combined with --write or --read.)

--read::
If specified, read a midx file specified by the midx-head file
and output basic details about the midx file. (Cannot be combined
with --write.)

--midx-id <oid>::
If specified with --read, use the given oid to read midx-[oid].midx
instead of using midx-head.
--write::
If specified, write a new midx file to the pack directory using
the packfiles present. Outputs the hash of the result midx file.
(Cannot be combined with --read.)

--update-head::
If specified with --write, update the midx-head file to point to
the written midx file.

--delete-expired::
If specified with --write and --update-head, delete the midx file
previously pointed to by midx-head (if changed).

EXAMPLES
--------

* Read the midx-head file and output the OID of the head MIDX file.
+
------------------------------------------------
$ git midx
------------------------------------------------

* Write a MIDX file for the packfiles in your local .git folder.
+
------------------------------------------------
$ git midx --write
------------------------------------------------

* Write a MIDX file for the packfiles in your local .git folder and
update the midx-head file.
+
------------------------------------------------
$ git midx --write --update-head
------------------------------------------------

* Write a MIDX file for the packfiles in a different folder.
+
---------------------------------------------------------
$ git midx --write --pack-dir ../../alt/pack/
---------------------------------------------------------

* Read the MIDX file referenced by midx-head and output basic details about it.
+
-----------------------------------------------
$ git midx --read
-----------------------------------------------

* Read a specific MIDX file in the local .git folder.
+
--------------------------------------------------------------------
$ git midx --read --midx-id 3e50d982a2257168c7fd0ff12ffe5cf6af38c74e
--------------------------------------------------------------------

* Delete the current midx-head and the file it references.
+
-----------------------------------------------
$ git midx --clear
-----------------------------------------------

CONFIGURATION
-------------

core.midx::
The midx command will fail if core.midx is false.
Also, the written MIDX files will be ignored by other commands
unless core.midx is true.

GIT
---
Part of the linkgit:git[1] suite
149 changes: 149 additions & 0 deletions Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,149 @@
Multi-Pack-Index (MIDX) Design Notes
====================================

The Git object directory contains a 'pack' directory containing
packfiles (with suffix ".pack") and pack-indexes (with suffix
".idx"). The pack-indexes provide a way to look up objects and
navigate to their offset within the pack, but these must come
in pairs with the packfiles. This pairing depends on the file
names, as the pack-index differs only in suffix from its
packfile. While the pack-indexes provide fast lookup per packfile,
this performance degrades as the number of packfiles increases,
because resolving abbreviated object IDs requires inspecting every
packfile and lookups are more likely to miss the most-recently-used
packfile. For some large repositories, repacking into a single
packfile is not feasible due to storage space or excessive repack times.

The multi-pack-index (MIDX for short, with suffix ".midx")
stores a list of objects and their offsets into multiple
packfiles. It contains:

- A list of packfile names.
- A sorted list of object IDs.
- A list of metadata for the ith object ID including:
- A value j referring to the jth packfile.
- An offset within the jth packfile for the object.
- If large offsets are required, we use another list of large
offsets similar to version 2 pack-indexes.

Thus, we can provide O(log N) lookup time for any number
of packfiles.
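
As a rough illustration of the structure listed above (not the code in this series), the sorted OID list and its parallel metadata list can be searched with a plain binary search. The names `MidxSketch` and `lookup` are hypothetical and exist only for this sketch.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


class MidxSketch:
    """Hypothetical in-memory model of the lists a MIDX file stores."""

    def __init__(self, pack_names: List[str],
                 entries: List[Tuple[bytes, int, int]]):
        # Each entry is (object ID, pack index j, offset within pack j).
        self.pack_names = pack_names
        ordered = sorted(entries)
        self.oids = [oid for oid, _, _ in ordered]
        self.meta = [(j, offset) for _, j, offset in ordered]

    def lookup(self, oid: bytes) -> Optional[Tuple[str, int]]:
        """Binary-search the sorted OID list, independent of pack count."""
        i = bisect_left(self.oids, oid)
        if i == len(self.oids) or self.oids[i] != oid:
            return None
        j, offset = self.meta[i]
        return self.pack_names[j], offset
```

The cost of `lookup` depends only on the number of objects N, which is the point of the O(log N) claim: adding more packfiles does not add more searches.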

A new config setting 'core.midx' must be enabled before writing
or reading MIDX files.

The MIDX files are updated by the 'midx' builtin with the
following common parameter combinations:

- 'git midx' gives the hash of the current MIDX head.
- 'git midx --write --update-head --delete-expired' writes a new
MIDX file, points the MIDX head to that file, and deletes the
existing MIDX file if out-of-date.
- 'git midx --read' lists some basic information about the current
MIDX head. Used for basic tests.
- 'git midx --clear' deletes the current MIDX head.

Design Details
--------------

- The MIDX file refers only to packfiles in the same directory
as the MIDX file.

- A special file, 'midx-head', stores the hash of the latest
MIDX file so we can load the file without performing a dirstat.
This file is especially important with incremental MIDX files,
pointing to the newest file.

- If a packfile exists in the pack directory but is not referenced
by the MIDX file, then the packfile is loaded into the packed_git
list and Git can access the objects as usual. This behavior is
necessary since other tools could add packfiles to the pack
directory without notifying Git.

- The MIDX file should be only a supplemental structure. If a
user downgrades or disables the `core.midx` config setting,
then the existing .idx and .pack files should be sufficient
to operate correctly.

- The file format includes parameters for the object id length
and hash algorithm, so a future change of hash algorithm does
not require a change in format.

- If an object appears in multiple packfiles, then only one copy
is stored in the MIDX. This has a possible performance issue:
If an object appears as the delta-base of multiple objects from
multiple packs, then cross-pack delta calculations may slow down.
This is currently only theoretical and has not been demonstrated
to be a measurable issue.
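
The 'midx-head' indirection described in the list above can be sketched as follows. This assumes that 'midx-head' holds the hash as hexadecimal text and that the MIDX file is named midx-[oid].midx, as the git-midx documentation in this series suggests. The function name `head_midx_path` is hypothetical.

```python
import os
from typing import Optional


def head_midx_path(pack_dir: str) -> Optional[str]:
    """Return the path of the MIDX file named by midx-head, if any.

    Assumption: midx-head contains the hash as hex text, possibly
    followed by a newline.
    """
    head = os.path.join(pack_dir, "midx-head")
    try:
        with open(head, "r") as f:
            oid = f.read().strip()
    except FileNotFoundError:
        return None  # no MIDX head: fall back to the .idx files
    if not oid:
        return None
    return os.path.join(pack_dir, "midx-" + oid + ".midx")
```

Reading one small file avoids listing the pack directory, and a missing head simply means the regular pack-indexes are used.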

Current Limitations
-------------------

- MIDX files are managed only by the midx builtin and are not
automatically updated on clone or fetch.

- There is no '--verify' option for the midx builtin to verify
the contents of the MIDX file against the pack contents.

- Constructing a MIDX file currently requires the single-pack
index for every pack being added to the MIDX.

- The fsck builtin does not check MIDX files, but should.

- The repack builtin is not aware of the MIDX files, and may
invalidate the MIDX files by deleting existing packfiles. The
MIDX may also be extended in the future to store metadata about
a packfile that can be used for faster repack commands.

- The naive Git HTTP server advertises lists of packfiles using
the file system directly.

Future Work
-----------

- The current file-format requires between 28 and 36 bytes per
object. As the repository grows, the MIDX file can become
very large and become a bottleneck when updating the file. To
fix this "big write" problem, we can make the MIDX file
incremental. Instead of just one MIDX file, we will have a
sequence of MIDX files that can be unioned together. Then
on write we take the new objects to add and consider how many
existing files should be merged into a new file containing
the latest objects.

This list of "base indexes" will be presented as an optional
chunk in the MIDX format and contains the OIDs for the base
files. Thus, the `midx-head` file only stores the OID for the
"tip" MIDX file and then the rest are loaded based on those
pointers, such as the following figure:

[ BIG ] <- [ MEDIUM ] <- [tiny] <- midx-head
^___________________________|

The plan is that every write replaces the "tiny" index,
and when that index becomes large enough it merges with the
"medium" index and a new tiny index is created in the next
write. Very rarely, the "big" index would be updated, causing
a slow write.

- After the MIDX feature is sufficiently hardened and widely used,
consider making Git more fully depend on the MIDX file. If MIDX
is the default, then we can delete the single-pack-indexes from
the pack directory. We could also allow thin packs in the pack
directory.

- The MIDX could be extended to store a "stable object order" such
that adding objects to the order does not change the existing
objects. This would enable re-using the reachability bitmaps after
repacking and updating the MIDX file.
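
The "between 28 and 36 bytes per object" figure in the first item can be reproduced from the chunk sizes in the format description: H bytes in the OID Lookup chunk (20 for SHA-1), 8 bytes in the Object Offsets chunk, and 8 more bytes for each object that needs a large-offset row. The helper below is a hypothetical estimate only and ignores the header, chunk lookup, fanout, packfile names, and trailer.

```python
def midx_object_bytes(num_objects: int,
                      num_large_offsets: int = 0,
                      oid_len: int = 20) -> int:
    """Estimate the per-object payload of a MIDX file in bytes."""
    per_object = oid_len + 8  # OID Lookup row + Object Offsets row
    return num_objects * per_object + num_large_offsets * 8
```

Under these assumptions, ten million objects with no large offsets need 280,000,000 bytes of per-object data, which gives a sense of why rewriting the whole file on every update becomes a "big write".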

Related Links
-------------

[0] https://bugs.chromium.org/p/git/issues/detail?id=6
Chromium work item for: Multi-Pack Index (MIDX)

[1] https://public-inbox.org/git/CB5074CF.3AD7A%25joshua.redstone@fb.com/T/#u
Subject: Git performance results on a large repository
Date: 3 Feb 2012

85 changes: 85 additions & 0 deletions Documentation/technical/pack-format.txt
@@ -160,3 +160,88 @@ Pack file entry: <+
corresponding packfile.

20-byte SHA-1-checksum of all of the above.

== midx-*.midx files have the following format:

The multi-pack-index (MIDX) files refer to multiple pack-files.

In order to allow extensions that add extra data to the MIDX format, we
organize the body into "chunks" and provide a lookup table at the beginning
of the body. The header includes certain length values, such as the number
of packs, the number of base MIDX files, hash lengths and types.

All 4-byte numbers are in network order.

HEADER:

4-byte signature:
The signature is: {'M', 'I', 'D', 'X'}

4-byte version number:
Git currently only supports version 1.

1-byte Object Id Version (1 = SHA-1)

1-byte Object Id Length (H)

1-byte number (I) of base multi-pack-index files:
This value is currently always zero.

1-byte number (C) of "chunks"

4-byte number (P) of pack files
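
A minimal sketch of reading the header fields listed above, assuming they are laid out back to back in the stated order for a total of 16 bytes. The names `MidxHeader` and `parse_midx_header` are hypothetical and are not taken from this series.

```python
import struct
from typing import NamedTuple

MIDX_SIGNATURE = b"MIDX"
# ">" selects network (big-endian) order with no padding:
# 4-byte signature, 4-byte version, four 1-byte fields, 4-byte pack count.
HEADER_FORMAT = ">4sIBBBBI"


class MidxHeader(NamedTuple):
    version: int
    oid_version: int
    oid_len: int
    num_base_midx: int
    num_chunks: int
    num_packs: int


def parse_midx_header(data: bytes) -> MidxHeader:
    """Decode and sanity-check the fixed-size MIDX header."""
    if len(data) < struct.calcsize(HEADER_FORMAT):
        raise ValueError("truncated MIDX header")
    signature, *fields = struct.unpack_from(HEADER_FORMAT, data)
    if signature != MIDX_SIGNATURE:
        raise ValueError("bad MIDX signature")
    header = MidxHeader(*fields)
    if header.version != 1:
        raise ValueError("unsupported MIDX version")
    return header
```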

CHUNK LOOKUP:

(C + 1) * 12 bytes providing the chunk offsets:
First 4 bytes describe chunk id. Value 0 is a terminating label.
Other 8 bytes provide offset in current file for chunk to start.
(Chunks are provided in file-order, so you can infer the length
using the next chunk position if necessary.)

The remaining data in the body is described one chunk at a time, and
these chunks may be given in any order. Chunks are required unless
otherwise specified.
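
The chunk lookup table described above could be decoded roughly as follows. This sketch assumes that the terminating entry's offset marks the end of the last chunk, which the text implies but does not state outright. The function name `parse_chunk_lookup` is hypothetical.

```python
import struct
from typing import Dict, Tuple


def parse_chunk_lookup(data: bytes, table_start: int,
                       num_chunks: int) -> Dict[bytes, Tuple[int, int]]:
    """Return {chunk id: (offset, length)} from the (C + 1)-row table."""
    rows = []
    for i in range(num_chunks + 1):
        # Each row is a 4-byte chunk id followed by an 8-byte file offset.
        chunk_id, offset = struct.unpack_from(">4sQ", data,
                                              table_start + 12 * i)
        rows.append((chunk_id, offset))
    if rows[-1][0] != b"\0\0\0\0":
        raise ValueError("chunk lookup is missing its terminating entry")
    chunks = {}
    # Rows are in file order, so each length is the gap to the next row.
    for (chunk_id, start), (_, next_start) in zip(rows, rows[1:]):
        chunks[chunk_id] = (start, next_start - start)
    return chunks
```

Keying the result by chunk id lets a reader skip chunk types it does not know, which is the extension mechanism the format is aiming for.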

CHUNK DATA:

OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
The ith entry, F[i], stores the number of OIDs with first
byte at most i. Thus F[255] stores the total
number of objects (N). The number of objects with first byte
value i is (F[i] - F[i-1]) for i > 0.
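
To make the fanout definition concrete, the sketch below builds F from a sorted OID list and uses it to narrow a search to the objects sharing a first byte. The helper names are hypothetical.

```python
import struct
from typing import List, Tuple


def build_fanout(sorted_oids: List[bytes]) -> List[int]:
    """F[i] = number of OIDs whose first byte is at most i."""
    counts = [0] * 256
    for oid in sorted_oids:
        counts[oid[0]] += 1
    fanout, total = [], 0
    for count in counts:
        total += count
        fanout.append(total)
    return fanout


def fanout_range(fanout: List[int], first_byte: int) -> Tuple[int, int]:
    """Half-open index range [lo, hi) of OIDs starting with first_byte."""
    lo = fanout[first_byte - 1] if first_byte > 0 else 0
    return lo, fanout[first_byte]


def encode_fanout(fanout: List[int]) -> bytes:
    """Serialize as 256 4-byte integers in network order."""
    return struct.pack(">256I", *fanout)
```

A binary search restricted to `fanout_range(F, oid[0])` examines only the objects that share the first byte of the OID being looked up.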

OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
The OIDs for all objects in the MIDX are stored in lexicographic
order in this chunk.

Object Offsets (ID: {'O', 'O', 'F', 'F'}) (N * 8 bytes)
Stores two 4-byte values for every object.
1: The pack-int-id for the pack storing this object.
2: The offset within the pack.
If all offsets are less than 2^31, then the large offset chunk
will not exist and offsets are stored as in IDX v1.
If there is at least one offset value larger than 2^32-1, then
the large offset chunk must exist. If the large offset chunk
exists and the most-significant bit (bit 31) of the 4-byte offset
is on, then removing that bit reveals the row in the large offsets
chunk containing the 8-byte offset of this object.
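
The offset decoding rule above can be sketched as follows, reading the flag bit as the high bit of the 4-byte offset (value 0x80000000), which is an interpretation of the text rather than something taken from the implementation. The function name and the convention of passing `None` for a missing large offset chunk are hypothetical.

```python
import struct
from typing import Optional, Tuple

LARGE_OFFSET_FLAG = 0x80000000  # assumed: high bit of the 4-byte offset


def read_object_offset(ooff: bytes, loff: Optional[bytes],
                       i: int) -> Tuple[int, int]:
    """Return (pack-int-id, offset) for the i-th object.

    ooff is the Object Offsets chunk; loff is the Object Large Offsets
    chunk, or None when that optional chunk is absent.
    """
    pack_int_id, offset = struct.unpack_from(">II", ooff, 8 * i)
    if loff is not None and offset & LARGE_OFFSET_FLAG:
        # Clearing the flag leaves the row number in the large offsets.
        row = offset & 0x7FFFFFFF
        (offset,) = struct.unpack_from(">Q", loff, 8 * row)
    return pack_int_id, offset
```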

[Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
8-byte offsets into large packfiles.

Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
P * 4 bytes storing the offset in the packfile name chunk for
the null-terminated string containing the filename for the
ith packfile. The filename is relative to the MIDX file's parent
directory.

Packfile Names (ID: {'P', 'N', 'A', 'M'})
Stores the packfile names as concatenated, null-terminated strings.
Packfiles must be listed in lexicographic order for fast lookups by
name. This is the only chunk not guaranteed to be a multiple of four
bytes in length, so it should be the last chunk for alignment reasons.
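
The relationship between the two packfile name chunks can be illustrated with a small round trip. This is a sketch under the assumption that names are sorted bytewise and that each lookup entry is a 4-byte offset relative to the start of the names chunk. Both helper names are hypothetical.

```python
import struct
from typing import List, Tuple


def build_pack_name_chunks(names: List[str]) -> Tuple[bytes, bytes]:
    """Return (Packfile Name Lookup, Packfile Names) chunk bodies."""
    ploo, pnam = b"", b""
    for name in sorted(n.encode("utf-8") for n in names):
        ploo += struct.pack(">I", len(pnam))  # offset where this name starts
        pnam += name + b"\0"
    return ploo, pnam


def read_pack_names(ploo: bytes, pnam: bytes, num_packs: int) -> List[str]:
    """Recover the i-th name by following its offset to the next NUL."""
    names = []
    for i in range(num_packs):
        (start,) = struct.unpack_from(">I", ploo, 4 * i)
        end = pnam.index(b"\0", start)
        names.append(pnam[start:end].decode("utf-8"))
    return names
```

Because the names are stored in sorted order, the index i used here is also the position a binary search by name would return.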

TRAILER:

H-byte HASH-checksum of all of the above.
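
A trailer check might look like the sketch below. It assumes Object Id Version 1, so the checksum is a 20-byte SHA-1 over every byte that precedes the trailer. Other hash algorithms are not handled, and the function name is hypothetical.

```python
import hashlib

SHA1_LEN = 20  # H for Object Id Version 1 (SHA-1)


def verify_midx_trailer(data: bytes) -> bool:
    """Check that the last 20 bytes are the SHA-1 of everything before."""
    if len(data) < SHA1_LEN:
        return False
    body, trailer = data[:-SHA1_LEN], data[-SHA1_LEN:]
    return hashlib.sha1(body).digest() == trailer
```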
2 changes: 2 additions & 0 deletions Makefile
@@ -827,6 +827,7 @@ LIB_OBJS += merge.o
LIB_OBJS += merge-blobs.o
LIB_OBJS += merge-recursive.o
LIB_OBJS += mergesort.o
LIB_OBJS += midx.o
LIB_OBJS += mru.o
LIB_OBJS += name-hash.o
LIB_OBJS += notes.o
@@ -979,6 +980,7 @@ BUILTIN_OBJS += builtin/merge-index.o
BUILTIN_OBJS += builtin/merge-ours.o
BUILTIN_OBJS += builtin/merge-recursive.o
BUILTIN_OBJS += builtin/merge-tree.o
BUILTIN_OBJS += builtin/midx.o
BUILTIN_OBJS += builtin/mktag.o
BUILTIN_OBJS += builtin/mktree.o
BUILTIN_OBJS += builtin/mv.o
1 change: 1 addition & 0 deletions builtin.h
@@ -188,6 +188,7 @@ extern int cmd_merge_ours(int argc, const char **argv, const char *prefix);
extern int cmd_merge_file(int argc, const char **argv, const char *prefix);
extern int cmd_merge_recursive(int argc, const char **argv, const char *prefix);
extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
extern int cmd_midx(int argc, const char **argv, const char *prefix);
extern int cmd_mktag(int argc, const char **argv, const char *prefix);
extern int cmd_mktree(int argc, const char **argv, const char *prefix);
extern int cmd_mv(int argc, const char **argv, const char *prefix);