Integrate Scalar (ported to C) into vfs-2.32.0 #363

Closed

dscho
Member

@dscho dscho commented Jun 2, 2021

Scalar is, in its own words, "an opinionated repository management tool". It builds on top of Git and aims to make it easy and effortless to work with large repositories.

Originally built using .NET, with the take-home lessons from VFS for Git, Scalar provides sort of a laboratory for experimenting with tactics and strategies to help Git scale better. Many recent scalability improvements in Git originate from Scalar, for example:

  • partial clone
  • sparse checkout (cone mode)
  • commit graphs
  • multi-pack indices
  • scheduled maintenance
  • prefetch
  • ...

While providing an experimentation lab outside of Git, the intention of the Scalar project always was to ship its improvements into core Git (i.e. to "upstream" them). As the list above demonstrates, it worked.

It worked so well that only a few bits and pieces are not (yet) upstreamed. The remaining parts fall roughly into these categories:

  • The scalar executable itself
  • The concept of an "enlistment", where the Git-tracked files live in the src/ subdirectory (which is the actual Git worktree), to encourage clear separation of tracked vs untracked files
  • A list of registered Scalar enlistments that is maintained independently from the list of Git repositories registered with git maintenance
  • A set of recommended config settings that get configured upon scalar clone or scalar register
  • Support for side-stepping the missing partial clone support in Azure Repos by using the GVFS protocol instead via the gvfs-helper

While the gvfs-helper part is very unlikely to ever make it into core Git, the remainder can easily be contributed in the form of contrib/scalar/.

This Pull Request adds these parts, in a neatly-structured thicket of topic branches, and it concludes the effort of three developers and almost two months.

dscho and others added 2 commits May 21, 2021 22:58
When two `git maintenance` processes try to write the `.plist` file, we
need to help them with serializing their efforts.

The 150ms time-out value was determined from thin air.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
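
To illustrate the serialization described in the commit message above, here is a minimal POSIX sketch of taking a lock file with a bounded wait. It is not the code from the patch (Git has its own lockfile API for this); the names `acquire_lock_with_timeout()` and `release_lock()` and the 10ms polling interval are assumptions made purely for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to create `lock_path` exclusively, polling
 * until roughly `timeout_ms` milliseconds have passed. Returns an open
 * file descriptor on success, -1 on timeout or on an unexpected error.
 */
static int acquire_lock_with_timeout(const char *lock_path, long timeout_ms)
{
	const long step_ms = 10; /* arbitrary polling interval (assumption) */
	long waited_ms = 0;

	for (;;) {
		/* O_EXCL makes creation fail if another process holds the lock */
		int fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0666);
		if (fd >= 0)
			return fd;
		if (errno != EEXIST || waited_ms >= timeout_ms)
			return -1;

		struct timespec ts = { 0, step_ms * 1000000L };
		nanosleep(&ts, NULL);
		waited_ms += step_ms;
	}
}

/* Hypothetical helper: close the descriptor and remove the lock file. */
static void release_lock(const char *lock_path, int fd)
{
	close(fd);
	unlink(lock_path);
}
```

A second process calling `acquire_lock_with_timeout(path, 150)` while the first still holds the lock would wait up to about 150ms and then give up, which is the behavior the time-out value in the message refers to.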
On macOS, we use launchctl to manage the background maintenance
schedule. This uses a set of .plist files to describe the schedule, but
these files are also registered with 'launchctl bootstrap'. If multiple
'git maintenance start' commands run concurrently, then they can collide
replacing these schedule files and registering them with launchctl.

To avoid extra launchctl commands, do a check for the .plist files on
disk and check if they are registered using 'launchctl list <name>'.
This command will return with exit code 0 if it exists, or exit code 113
if it does not.

We can test this behavior using the GIT_TEST_MAINT_SCHEDULER environment
variable.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
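
The exit-code check described above can be sketched as follows. This is only an illustration under assumptions: `classify_launchctl_list()` and `run_and_get_exit_code()` are hypothetical names, and the only facts taken from the commit message are the two exit codes (0 and 113).

```c
#include <stdlib.h>
#include <sys/wait.h>

enum registration_state { REGISTERED, NOT_REGISTERED, UNKNOWN_STATE };

/* Map the exit codes named in the commit message to a state. */
static enum registration_state classify_launchctl_list(int exit_code)
{
	if (exit_code == 0)
		return REGISTERED;
	if (exit_code == 113)
		return NOT_REGISTERED;
	return UNKNOWN_STATE; /* anything else is treated as "don't know" */
}

/*
 * Run a shell command and return its exit code, or -1 if it could not
 * be started or did not exit normally (e.g. was killed by a signal).
 */
static int run_and_get_exit_code(const char *command)
{
	int status = system(command);

	if (status == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

On macOS, the command passed to `run_and_get_exit_code()` would be the `launchctl list <name>` invocation; only when the result is `NOT_REGISTERED` (or the `.plist` file is missing) would the extra launchctl commands be needed.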
@dscho
Member Author

dscho commented Jun 2, 2021

I had to cancel a run-away job (GitHub Actions experienced some problems, which might be related). In any case, the idea is to merge this only after vfs-2.32.0 is finalized, anyway, so I'll leave it in this state for now.

@derrickstolee
Collaborator

derrickstolee commented Jun 2, 2021

This looks great! I'm very excited to see how this goes upstream. The organization is impeccable.

We might want to combine microsoft/scalar#505 with microsoft/scalar#510 to test this version, but we can also wait for the 2.32.0 release before doing that.

Also, we will want to update the microsoft/git README file to mention Scalar and point at the Scalar docs. But for now, we don't need that with this PR.

@dscho
Member Author

dscho commented Jun 2, 2021

> We might want to combine microsoft/scalar#505 with microsoft/scalar#510 to test this version

About this: maybe we can somehow make it possible to run the Functional Tests on both Scalar.NET and Scalar/C?

@derrickstolee
Collaborator

> > We might want to combine microsoft/scalar#505 with microsoft/scalar#510 to test this version
>
> About this: maybe we can somehow make it possible to run the Functional Tests on both Scalar.NET and Scalar/C?

I want to delete the product code from microsoft/scalar, leaving only the test code.

@dscho
Member Author

dscho commented Jun 2, 2021

> > > We might want to combine microsoft/scalar#505 with microsoft/scalar#510 to test this version
> >
> > About this: maybe we can somehow make it possible to run the Functional Tests on both Scalar.NET and Scalar/C?
>
> I want to delete the product code from microsoft/scalar, leaving only the test code.

Okay, so you want to commit to the way forward. Makes sense. If need be, we can always start a maintenance track.

@dscho dscho force-pushed the tentative/vfs-2.32.0 branch 2 times, most recently from d577ac9 to b39873a Compare June 7, 2021 11:42
dscho and others added 19 commits June 7, 2021 14:17
With this patch, we start the journey from the C# project at
https://github.com/microsoft/scalar to move what is left to Git's own
`contrib/` directory.

The idea of Scalar, and before that VFS for Git, has always been to
prove that Git _can_ scale, and to upstream whatever strategies have
been demonstrated to help.

For example, while the virtual filesystem provided by VFS for Git helped
the team developing the Windows operating system to move onto Git, it is
not really an upstreamable strategy: getting it to work, and the
required server-side support, make this not quite feasible.

The Scalar project learned from that and tackled the problem with
different tactics: instead of pretending to Git that the working
directory is fully populated, it _specifically_ teaches Git about
partial clone (which is based on VFS for Git's cache server), about
sparse checkout (which VFS for Git tried to do transparently, in the
file system layer), and regularly runs maintenance tasks to keep the
repository in a healthy state.

With partial clone, sparse checkout and `git maintenance` having been
upstreamed, there is little left that `scalar.exe` does which
`git.exe` cannot do. One such thing is that `scalar clone <url>` will
automatically set up a partial, sparse clone, and configure
known-helpful settings from the start.

Let's bring this convenience directly into Git's tree.

The idea here is that you can (optionally) build Scalar via

	make -C contrib/scalar

This will build the `scalar` executable and put it into the
contrib/scalar/ subdirectory.

The slightly awkward addition of the `contrib/scalar/*` bits to the
top-level `Makefile` is actually required: we want to link to
`libgit.a`, which means that we will need to use the very same `CFLAGS`
and `LDFLAGS` as the rest of Git.

An early development version of this patch tried to replicate the
respective conditionals in `contrib/scalar/Makefile` (just like
`contrib/svn-fe/Makefile` tried to do). It turned out to be quite the
whack-a-mole game: the SHA-1-related flags, the flags enabling/disabling
`compat/poll/`, `compat/regex/`, `compat/win32mmap.c` etc. based on the
current platform... To put it mildly: it was a major mess.

Instead, this patch makes minimal changes to the top-level `Makefile` so
that the bits in `contrib/scalar/` can be compiled and linked, and
adds a `contrib/scalar/Makefile` that uses the top-level `Makefile` in a
most minimal way to do the actual compiling.

Note: With this commit, we only establish the infrastructure; no
Scalar functionality is implemented yet. We will do that incrementally
over the next few commits.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
... which does not do much, yet...

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Over the course of Scalar's development, it became obvious that there is
a need for a command that can gather all kinds of useful information
that can help identify the most typical problems with large
worktrees/repositories.

The `diagnose` command is the culmination of this hard-won knowledge: it
gathers the installed hooks, the config, a couple statistics describing
the data shape, among other pieces of information, and then wraps
everything up in a tidy, neat `.zip` archive.

Note: in the .NET version we have the luxury of a comprehensive standard
library that includes basic functionality such as writing a `.zip` file.
In the C version, we lack such a commodity. Rather than introducing a
dependency on, say, libzip, we slightly abuse Git's `archive` command:
instead of writing the `.zip` file directly, we stage the file contents
in a Git index of a temporary, bare repository, only to let `git
archive` have at it, and finally remove the temporary repository.

Also note: Due to the frequently spawned `git hash-object` processes, this
command is quite slow on Windows. Should it turn out to be a big
problem, the lack of a batch mode of the `hash-object` command could
potentially be worked around via using `git fast-import` with a crafted
`stdin`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Let's start implementing the `register` command. With this commit,
recommended settings are configured upon `scalar register`.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This implements Scalar's opinionated `clone` command: it tries to use a
partial clone and sets up a sparse checkout by default. In contrast to
`git clone`, `scalar clone` sets up the worktree in the `src/`
subdirectory, to encourage a separation between the source files and the
build output (which helps Git tremendously because it avoids untracked
files that have to be specifically ignored when refreshing the index).

Also, it registers the repository for regular, scheduled maintenance,
and configures a slew of configuration settings based on the experience
of the Microsoft Windows and the Microsoft Office development teams.

Note: We intentionally use a slightly wasteful `set_config()` function
(which does not reuse a single `strbuf`, for example, though performance
_really_ does not matter here) because it is very, very convenient.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This commit establishes the infrastructure to build the manual page for
the `scalar` command.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Arguably, the biggest learning from the Scalar project is that scheduled
maintenance is crucial to keep large repositories in a good shape.

With this commit, `scalar register` starts those scheduled maintenance
tasks, and `scalar unregister` stops them.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This commit adds a simple regression test, modeled after Git's own
test suite.

A more comprehensive functional (or: integration) test suite can be
found at https://github.com/microsoft/scalar. There is no intention to
port that fuller test suite to `contrib/scalar/`; instead, it will still
be used to verify the `scalar` functionality in Microsoft's Git fork.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This comes in handy during Scalar upgrades, or when config settings were
messed up by mistake.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Teach the `scalar diagnose` command to gather file size information
about pack files.

Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com>
Let's populate the manual page of `scalar` a bit.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
With this commit, `git help scalar` will open the appropriate manual
or HTML page (instead of looking for `gitscalar`).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The list is simply those registered under the multi-valued scalar.repo
config setting.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
For example after a Scalar upgrade, it can come in really handy if there
is an easy way to reconfigure all Scalar enlistments. This new option
offers this functionality.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Teach the `scalar diagnose` command to gather loose object counts.

Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com>
Using the built-in FSMonitor makes many common commands quite a bit
faster. So let's teach the `scalar register` command to enable the
built-in FSMonitor and kick-start the fsmonitor--daemon process (for
convenience).

For simplicity, we only support the built-in FSMonitor (and no external
file system monitor such as e.g. Watchman).

Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Continuing the documentation journey.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
dscho and others added 26 commits June 7, 2021 14:17
We already have the `config` command that accesses the `gvfs/config`
endpoint.

To implement `scalar`, we also need to be able to access the `vsts/info`
endpoint. Let's add a command to do precisely that.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
On Windows, both the forward slash and the backslash are directory
separators. Which means that `a\b\c` really is inside `a/b`. Therefore,
we need to special-case the directory separators in the helper function
`cmp_icase()` that is used in the loop in `dir_inside_of()`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
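
The separator handling described above can be sketched like this. It is a simplified, standalone illustration and not the patch itself: `cmp_icase_sketch()` and `path_is_inside()` are hypothetical stand-ins, and the return convention (1 for "inside", 0 otherwise) is an assumption that differs from Git's real `dir_inside_of()`.

```c
#include <ctype.h>
#include <stddef.h>

/* On Windows, both '/' and '\\' separate directories. */
static int is_dir_sep_win(char c)
{
	return c == '/' || c == '\\';
}

/* Returns 0 when the two characters should be treated as equal. */
static int cmp_icase_sketch(char a, char b)
{
	if (a == b)
		return 0;
	if (is_dir_sep_win(a) && is_dir_sep_win(b))
		return 0; /* '/' and '\\' are interchangeable */
	return tolower((unsigned char)a) - tolower((unsigned char)b);
}

/*
 * Hypothetical simplification: returns 1 if `dir` is `prefix` itself
 * or lies below it, 0 otherwise.
 */
static int path_is_inside(const char *dir, const char *prefix)
{
	size_t i = 0;

	while (prefix[i]) {
		if (!dir[i] || cmp_icase_sketch(dir[i], prefix[i]))
			return 0;
		i++;
	}
	/* The match must end at a path component boundary. */
	return !dir[i] || is_dir_sep_win(dir[i]) ||
	       (i > 0 && is_dir_sep_win(prefix[i - 1]));
}
```

With this, `path_is_inside("a\\b\\c", "a/b")` reports a match, while `a\bc` is correctly rejected because the prefix does not end at a component boundary.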
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This finalizes the port of the `QueryVstsInfo()` function: we already
taught `gvfs-helper` to access the `vsts/info` endpoint on demand, we
implemented proper JSON parsing, and now it is time to hook it all up.

To that end, we also provide a default local cache root directory. It
works the same way as the .NET version of Scalar: it uses

    C:\scalarCache on Windows,

    ~/.scalarCache/ on macOS and

    ~/.cache/scalar on Linux

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
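
As a rough sketch of the per-platform defaults listed above: the function name `default_cache_root()`, the `enum platform` type, and passing the home directory as a parameter are all assumptions for illustration; only the three paths come from the commit message (the trailing slash of the macOS path is dropped here).

```c
#include <stdio.h>
#include <string.h>

enum platform { PLATFORM_WINDOWS, PLATFORM_MACOS, PLATFORM_LINUX };

/*
 * Hypothetical helper: write the default cache root for `p` into `buf`.
 * `home` stands in for the user's home directory ("~"). Returns 0 on
 * success, -1 if `home` is needed but missing or `buf` is too small.
 */
static int default_cache_root(enum platform p, const char *home,
			      char *buf, size_t size)
{
	int n;

	switch (p) {
	case PLATFORM_WINDOWS:
		n = snprintf(buf, size, "C:\\scalarCache");
		break;
	case PLATFORM_MACOS:
		if (!home || !strlen(home))
			return -1;
		n = snprintf(buf, size, "%s/.scalarCache", home);
		break;
	case PLATFORM_LINUX:
		if (!home || !strlen(home))
			return -1;
		n = snprintf(buf, size, "%s/.cache/scalar", home);
		break;
	default:
		return -1;
	}
	/* snprintf() reports the length it wanted; detect truncation */
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```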
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Well, technically also the http:// protocol is allowed _when testing_...

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Azure Repos does not support partial clones at the moment, but it does
support the GVFS protocol. To that end, the Microsoft fork of Git has a
`gvfs-helper` command that is optionally used to perform essentially the
same functionality as partial clone.

Let's verify that `scalar clone` detects that situation and enables the
GVFS helper.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This allows setting the GVFS-enabled cache server, or listing the one(s)
associated with the remote repository.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Sadly, this is a bit trickier than merely flipping the
`INCLUDE_SCALAR=YesPlease` switch: The Windows tests are run in a very
different way.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
In Scalar's functional tests, we do not do anything with authentication.
Therefore, we do want to avoid accessing the `vsts/info` endpoint
because it requires authentication even on otherwise public
repositories.

Let's introduce the environment variable `SCALAR_TEST_SKIP_VSTS_INFO`
which can be set to `true` to simply skip that step (and force the
`url_*` style repository IDs instead of `id_*` whenever possible).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
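
A standalone sketch of how such an environment switch could be read is shown below. Inside Git's code base one would probably reach for `git_env_bool()` instead; `parse_bool_sketch()` and `skip_vsts_info()` are hypothetical names, and the accepted spellings are a simplified subset of what Git treats as booleans.

```c
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/*
 * Simplified boolean parser (assumption: only these spellings are
 * recognized). Returns `fallback` for NULL or unrecognized values.
 */
static int parse_bool_sketch(const char *value, int fallback)
{
	if (!value)
		return fallback;
	if (!strcasecmp(value, "true") || !strcasecmp(value, "yes") ||
	    !strcasecmp(value, "on") || !strcmp(value, "1"))
		return 1;
	if (!strcasecmp(value, "false") || !strcasecmp(value, "no") ||
	    !strcasecmp(value, "off") || !strcmp(value, "0") || !*value)
		return 0;
	return fallback;
}

/* Should the `vsts/info` request be skipped? Defaults to "no". */
static int skip_vsts_info(void)
{
	return parse_bool_sketch(getenv("SCALAR_TEST_SKIP_VSTS_INFO"), 0);
}
```

A test script would then export `SCALAR_TEST_SKIP_VSTS_INFO=true` before running `scalar clone`, so that the unauthenticated functional tests never touch the endpoint.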
This adds the bare minimum to compile the `scalar` executable.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This implements the subcommands `register`, `unregister` and `list`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This implements `clone`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This implements `scalar run`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This allows fixing settings after a Scalar upgrade, or after botching
the enlistments configuration.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This implements the `diagnose` subcommand.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This is a convenient shortcut for `scalar unregister <enlistment> &&
rm -rf <enlistment>`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This implements the `version` command for backwards-compatibility with
the .NET version of Scalar.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
For convenience, this ports the `git -c <key>=<value> -C <dir>
<command>` functionality to `scalar`, allowing config settings and
working directories to be set for the duration of the Scalar invocation.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
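
To make the `-c`/`-C` handling more concrete, here is a minimal sketch of consuming such leading options before the subcommand. It is not the patch's code: `struct global_opts`, `parse_global_opts()`, and the fixed limit of 16 config entries are assumptions, and unlike Git (where repeated `-C` options accumulate) this simplification just remembers the last `-C`.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CONFIG_OPTS 16 /* arbitrary limit for this sketch */

struct global_opts {
	const char *dir;                     /* last -C <dir>, or NULL */
	const char *config[MAX_CONFIG_OPTS]; /* -c <key>=<value> entries */
	int config_nr;
};

/*
 * Consume leading -c/-C options from argv[1..]. Returns the index of
 * the first remaining argument (the subcommand), or -1 on error.
 */
static int parse_global_opts(int argc, const char **argv,
			     struct global_opts *out)
{
	int i = 1;

	out->dir = NULL;
	out->config_nr = 0;

	while (i < argc) {
		if (!strcmp(argv[i], "-C")) {
			if (i + 1 >= argc)
				return -1; /* missing <dir> */
			out->dir = argv[i + 1];
		} else if (!strcmp(argv[i], "-c")) {
			if (i + 1 >= argc || out->config_nr >= MAX_CONFIG_OPTS)
				return -1; /* missing value or too many */
			out->config[out->config_nr++] = argv[i + 1];
		} else {
			break; /* first non-option: the subcommand */
		}
		i += 2;
	}
	return i;
}
```

After parsing, the caller would change into `out->dir` (if set) and apply the collected config entries for the lifetime of the process before dispatching to the subcommand.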
Make `contrib/scalar/` work nicely with the built-in FSMonitor.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Document the whole thing.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Allow concurrent `scalar register` and `scalar unregister` calls to be
more collaborative when trying to lock the global Git config at the very
same time.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This topic branch offers to include `scalar` in a regular Git build,
simply by setting `INCLUDE_SCALAR=YesPlease`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
For ease of development, build and test `scalar`, too.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Prepare `scalar` to use the GVFS protocol instead of partial clone
(required to support Azure Repos).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@derrickstolee derrickstolee deleted the branch microsoft:tentative/vfs-2.32.0 June 7, 2021 13:13
3 participants