
Add GNU make jobserver client support #1139

Open
stefanb2 opened this issue Apr 27, 2016 · 104 comments · May be fixed by #2260, #2263 or #2474

@stefanb2
Contributor

As long as ninja is the only build execution tool, the current ninja -jN implementation works fine.

But when you try to convert parts of an existing recursive GNU-make-based software build system to ninja, you end up with the following situation:

  • top-level GNU Make (with -jX, acts as job server)
  • M instances of GNU make (with -j, act as job server clients)
  • N instances of ninja (don't know anything about job server)

Simply calling `ninja -jY` isn't enough, because then the ninja instances will try to run Y*N jobs, plus the X jobs from the GNU make instances, overloading the build host. Relying on -lZ to fix this is sub-optimal, because the load average is sometimes too slow to reflect the actual situation on the build host.

It would be nice if GNU make jobserver client support could be added to Ninja. Then the N ninja instances would cooperate with the M GNU make instances, and at most X jobs would be executed on the build host at any one time.

stefanb2 pushed a commit to stefanb2/ninja that referenced this issue Apr 27, 2016
- add new TokenPool interface
- GNU make implementation for TokenPool parses and verifies the magic
  information from the MAKEFLAGS environment variable
- RealCommandRunner tries to acquire TokenPool
  * if no token pool is available then there is no change in behaviour
- When a token pool is available then RealCommandRunner behaviour
  changes as follows
  * CanRunMore() only returns true if TokenPool::Acquire() returns true
  * StartCommand() calls TokenPool::Reserve()
  * WaitForCommand() calls TokenPool::Release()

Documentation for GNU make jobserver

  http://make.mad-scientist.net/papers/jobserver-implementation/

Fixes ninja-build#1139
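(For illustration only: a rough sketch, not taken from the PR, of what an interface along the lines described in the commit message above might look like. The Get() factory and its exact semantics are assumptions.)

// Hypothetical sketch of the TokenPool idea described in the commit message
// above; this is not the PR's code.
struct TokenPool {
  virtual ~TokenPool() {}
  // Returns true if a job slot is currently available (probe used by CanRunMore()).
  virtual bool Acquire() = 0;
  // Marks the probed slot as consumed by a started command (StartCommand()).
  virtual void Reserve() = 0;
  // Returns the slot to the pool when a command finishes (WaitForCommand()).
  virtual void Release() = 0;
  // Hypothetical factory: returns NULL when MAKEFLAGS carries no jobserver
  // information, in which case ninja behaves exactly as before.
  static TokenPool* Get();
};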
@stefanb2
Contributor Author

I have tested this implementation over the last few weeks in two different recursive GNU make based build systems that originally had M+1 GNU make instances:

  • use case A: top-level GNU make, 1 ninja instance, M-1 GNU make instances
  • use case B: top-level GNU make, N ninja instances, M-N GNU make instances

FYI: google/kati was used to convert existing single-makefile GNU make parts to Ninja build files.

@nico
Collaborator

nico commented Apr 27, 2016

Thanks for the patch!

We've discussed this on the mailing list a few times (e.g. here: https://groups.google.com/forum/#!searchin/ninja-build/jobserver/ninja-build/PUlsr7-jpI0/Ga19TOg1c14J). Ninja works best if it knows about the whole build. Now that kati exists, one can convert those makefiles to ninja files and merge them into a single build manifest. That's Android's transition strategy from Make to Ninja: they use kati to convert everything to Ninja files, and they are incrementally converting directories to use something-not-make, so kati produces parts of their Ninja files and the new tool produces the other parts.

Is your use case that you have recursive makefiles?

@stefanb2
Contributor Author

I could have guessed that this has been discussed before, because I'm surely not the first person facing such a situation.

Here are my reasons for requesting this:

  1. recursion: kati currently can't translate recursive GNU-make-based build systems, like Linux kernel kbuild. IMHO that is a major effort, and unfortunately I can't wait for kati to provide it, hence such sub-component builds will have to stay with GNU make for the time being.
  2. missing features: kati currently can't translate fully modularized GNU-make-based build systems, i.e. where each component is built in isolation and in a separate build directory, in a way that would let all build.ninja files be merged into a single one. While IMHO not as major an issue as (1), it is much simpler to replace the lowest-level $(MAKE) recipe with a kati/ninja recipe. Parsing + merging might also introduce unnecessary build delay (it remains to be seen what would happen in real life).
  3. technical barriers: e.g. sub-component builds that run behind a "chroot firewall". Even if everything moves to Ninja, you would still need 1 (main) + N (one per chroot) ninja instances that need to cooperate. Ninja doesn't offer anything like that.
  4. too-simple workarounds: AOSP makeparallel + kati/ninja runs all $(MAKE) instances hard-coded with "make -j4", with no cooperation between any of the GNU make instances. That is only acceptable if you have no, few, or only small $(MAKE) invocations in the build.ninja file.
  5. organizational barriers: even if it might be possible to use kati/ninja to convert an existing GNU-make-based sub-part of the system, you might not be allowed to do so. Such sub-component builds need to stay with GNU make.
  6. You ask: why not split the build up and run them as separate builds? Goto (5)...

IMHO my patch provides a good solution, considering

  • how small the required changes to ninja are,
  • that the default behaviour is completely unchanged, and
  • that this will make life easier for many other ninja users who face the same issues

@ghost

ghost commented May 23, 2016

wow +1

stefanb2 pushed a commit to stefanb2/ninja that referenced this issue May 23, 2016
stefanb2 pushed a commit to stefanb2/ninja that referenced this issue May 25, 2016
stefanb2 pushed a commit to stefanb2/ninja that referenced this issue May 25, 2016
stefanb2 pushed a commit to stefanb2/ninja that referenced this issue May 26, 2016
stefanb2 pushed a commit to stefanb2/ninja that referenced this issue May 27, 2016
stefanb2 pushed a commit to stefanb2/ninja that referenced this issue May 27, 2016
stefanb2 pushed a commit to stefanb2/ninja that referenced this issue May 28, 2016
stefanb2 pushed a commit to stefanb2/ninja that referenced this issue May 30, 2016
@maximuska
Contributor

Another possible reason for having jobserver support in ninja is LTO support in GCC: -flto=jobserver tells GCC to use GNU make's jobserver mode to determine the number of parallel jobs. The alternative is to spawn a fixed number of jobs with, e.g., -flto=16.

@fabio-porcedda

I would like to have this feature merged too; I simply cannot convert all projects to ninja-build because I'm not allowed to do that.

@stefanb2 Thanks a lot for your work

@dublet

dublet commented Apr 12, 2017

Can I just add my voice to the list of people who would like this to be merged? At my company we also use a nested build system, and this patch makes ninja behave very nicely indeed. We're not yet in a position to make ninja build everything.

@glandium

Please note that from a quick glance at the commit on @stefanb2's branch, I expect it doesn't work on Windows, where Make uses a different setup.

@stefanb2
Contributor Author

@glandium correct, the Windows build includes a no-op token pool implementation. But I fail to see why this would be a relevant reason for rejecting this pull request.

That said, I'm pretty sure it would be possible to provide an update that implements the token protocol used by Windows GNU make 4.x. Probably tokenpool-gnu-make.cc could be refactored into system-agnostic and UNIX-dependent parts.
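(For illustration only, and as far as I understand the Windows scheme: GNU make 4.x on Windows appears to share job slots through a named Win32 semaphore announced in MAKEFLAGS, so a client could look roughly like the following sketch. The names and the exact MAKEFLAGS handling are assumptions, not code from any PR.)

// Rough illustration under the assumption that the jobserver is a named
// Win32 semaphore whose name is extracted from MAKEFLAGS elsewhere.
#include <windows.h>

HANDLE OpenJobserverSemaphore(const char* name) {
  // |name| would come from a (hypothetical) MAKEFLAGS parser.
  return OpenSemaphoreA(SEMAPHORE_MODIFY_STATE | SYNCHRONIZE, FALSE, name);
}

bool AcquireToken(HANDLE sem) {
  // Non-blocking probe for one free job slot.
  return WaitForSingleObject(sem, 0) == WAIT_OBJECT_0;
}

void ReleaseToken(HANDLE sem) {
  // Return the slot when the command finishes.
  ReleaseSemaphore(sem, 1, NULL);
}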

stefanb2 pushed a commit to stefanb2/ninja that referenced this issue Nov 7, 2017
@nox

nox commented Nov 11, 2017

This would also be really useful when invoking ninja as part of another build tool, such as cargo.

@comicfans

This would be very useful for super-project builds. In our large code base, due to different compiler/environment configurations, we cannot include all projects in one single ninja build, so we have 1 top-level build and N sub-projects built by ninja; this configuration triggers the Y*N problem.

@xqms

xqms commented Dec 6, 2017

+1 - this is highly interesting for parallel builds with catkin_tools (https://catkin-tools.readthedocs.io/en/latest/). A catkin_tools workspace consists of separate CMake projects which are built in isolation. To control the CPU consumption of parallel make runs, catkin_tools contains a GNU Make jobserver implementation.
In this way, the make jobserver is starting to become a standard "protocol" for controlling resource consumption of parallel builds.

Note that in the catkin_tools scenario, it is not easy to merge the individual build.ninja files into a hierarchy of subninja files, because

  • Targets/individual rules will clash; it would need CMake changes to keep them apart.
  • We would need some way of encoding inter-package dependencies (build this subninja before that).
  • catkin_tools needs to perform additional installation steps after a package has been built.
  • Also, catkin_tools provides many nice features which would be defeated by a merged build (package-level monitoring, build output grouped by packages, ...).

@yann-morin-1998

@nico I would like to add my voice in favor of GNU make jobserver client support in ninja.

Meta-buildsystems like OpenEmbedded (Yocto), OpenWRT, Buildroot and a lot of others are tasked with generating systems by building a lot of packages from various sources, all using various buildsystems. I'll mostly use Buildroot as an example, as I'm very familiar with it, but the following is in principle applicable to all the other buildsystems as well.

Such build systems will typically have this sequence per package they build:

  1. download sources of a package
  2. extract the sources
  3. configure the package
  4. build the package
  5. install it in a staging location

And they will repeat that sequence for every package that is needed to build the target system:

  1. build busybox
  2. build coreutils
  3. build foo
  4. build bar
  5. etc...

Once all packages have been built and installed in the staging location, a system image (e.g. a bootloader + Linux kernel + root filesystem) is generated from that staging location. That system image can then be directly flashed onto a device.

Now, that was the quick overview.

Since a system can be made of a lot of packages, we want to build as many packages as possible in parallel (respecting the dependency chain, of course). But for each package, we also want to take advantage of parallel compilation, in case no other package is being built at the same time.

So, if we have an 8-core machine, we want to run up to 8 jobs in parallel, which means we have to distribute those jobs across the various packages that need to be built at any point in time, so that we maximize the number of jobs but do not overshoot the 8-CPU limit.

For example, if 8 ninja-based packages are built in parallel and they do not share a jobserver, they will each run 8 jobs, which is a total of 64 parallel jobs. On the other hand, limiting the ninja builds to a single job is a waste of time when only a single package is being built at some point in time (e.g. because the other ones have already finished building, or because the dependency chain needs that one package before continuing).

And as has already been explained in previous posts in this thread, not every package is based on ninja, and not every package is even conceivably switchable to ninja. And even if every package were using ninja, we couldn't simply aggregate all the ninja definitions into a super-build, because everything would end up clashing with everything else... So we still need to be able to cooperate with the rest of the world, especially when that rest of the world has been established for decades now... ;-)

Thanks for reading so far! :-)

nashif added a commit to nashif/zephyr that referenced this issue Mar 6, 2018
This reverts commit 0e6689d.

Parallel builds are broken due to a mix of Make/Ninja and the job server
not being operational.

See ninja-build/ninja#1139

Signed-off-by: Anas Nashif <anas.nashif@intel.com>
@ihnorton

+1. We also face this issue of Y*N ninjas while using CMake ExternalProject functionality.

nashif added a commit to zephyrproject-rtos/zephyr that referenced this issue Mar 20, 2018
@degasus

degasus commented Feb 28, 2024

The PR is still being worked on. Not sure what you want us to "fix".

@jhasse After reading all of the comments here, I'm under the impression that pull request #1140 has been finished for more than half a decade and has just been rebased every year.

@stefanb2 Please correct me if I'm wrong, but it seems like you are waiting for a decision on one of the following:

  • ninja does not want any jobserver client support, and this issue and these three PRs shall be closed
  • ninja does want jobserver support, but not the GNU Make jobserver protocol, so we shall look for alternatives
  • ninja does want to have GNU Make jobserver client support, but you don't like the implementation in Add GNU make jobserver client support #1140. So what shall be modified?
  • ninja does want to have GNU Make jobserver client support, and there is nothing wrong in Add GNU make jobserver client support #1140. So how many more years shall this wait?

@digit-google
Contributor

I do not know @jhasse's exact point of view on the topic, but I can see several issues with the PR:

  • There is no regression test suite for what is a major change to Ninja's behavior. While there are unit-tests that verify some parts of the implementation, a real regression test suite that can be run on CI would verify that the Ninja binary works as expected, either as a client, a server, or both at the same time. This requires writing new Python tests under misc/ that simulate a jobserver-enabled build with multiple scenarios.

  • The code is hard to understand and maintain. In particular the way Posix signals are used is scary and brittle. It will very likely lead to flaky and unexpected failures under heavy loads and non-conventional runtime environments (think containers or qemu user emulation). The Win32 part writes directly to the completion port of SubprocessSet. At a minimum, all signal-twiddling and completion-related code should be part of subprocess-posix.cc or subprocess-win32.cc, which would provide a sane API for the token pool implementation.

Ideally, Ninja would implement an asynchronous loop API that would allow waiting for several I/O conditions and timers concurrently in the main thread and acting upon them; the SubprocessSet and TokenPool classes would both be users of it. That's probably for another PR (see the sketch after this list).

  • Minor: I recommend reworking the commits in the PR to ensure that each one of them is final, correct, individually testable, and updates both configure.py and CMakeLists.txt at the same time.
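(Purely illustrative: one possible shape for the asynchronous loop API suggested above; none of these types or methods exist in Ninja today.)

// Hypothetical event-loop interface; SubprocessSet and a TokenPool could both
// register the descriptors they care about instead of running their own waits.
#include <functional>

struct EventLoop {
  using Callback = std::function<void()>;
  // Invoke |cb| when |fd| becomes readable (a subprocess pipe, a jobserver
  // FIFO, ...).
  virtual void OnReadable(int fd, Callback cb) = 0;
  // Invoke |cb| once after |ms| milliseconds (status refresh, timeouts, ...).
  virtual void OnTimeout(int ms, Callback cb) = 0;
  // Dispatch ready callbacks in the main thread until Quit() is called.
  virtual void Run() = 0;
  virtual void Quit() = 0;
  virtual ~EventLoop() {}
};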

@eli-schwartz

@digit-google it would be productive to make review comments directly on the PR.

Preferably any time in the past 8 years, but no time like the present! :)

Note that regression testing can be somewhat accounted for by the fact that an extremely large number of people have been running the patchset in production for years now.

@digit-google
Contributor

I agree, but I was responding to @degasus who was asking in this thread what could be changed in the PR. However, I'll add similar comments there too.

It is great that the current patchset has been working well, and I encourage putting actual metrics, like the number of users and build performance improvements, in the actual PR description and final patchset.

However, unit and regression testing is about ensuring that future Ninja changes do not break its behavior unexpectedly. Given the complexity of the feature and the fact that it changes how Ninja interacts with its runtime environment, unit testing alone is not enough. But that's just my humble opinion.

@eli-schwartz

You can say the same thing about all the existing functionality ninja has.

My opinion is that it isn't fair or reasonable to ask this PR to be a special exception, but it would be fair iff someone wrote an end-to-end testing suite, then asked the jobserver PR to include jobserver coverage in it.

@digit-google
Contributor

Frankly, that PR would be fine with me, even without a full regression test suite, if it didn't spread tricky signal-handling code into what look like unrelated parts of the source tree. This is a hackish design that is bound to be a maintenance nightmare for anyone who accepts it into their git repository. I assume that's why @jhasse, who has very, very limited bandwidth to maintain Ninja, has not felt confident accepting it.

And for full disclosure, I am not an official Ninja maintainer in any way, but I maintain my own Ninja fork for the Fuchsia project in order to support a number of important additional features.

While I do plan to implement jobserver support there too, it will not be based on this PR, for exactly this reason.

@stefanb2
Contributor Author

stefanb2 commented Mar 2, 2024

@stefanb2 Please correct me if I'm wrong, but it seems like you are waiting for a decision for either

The short answer: I'm not waiting for anything.

The long answer:

This contribution is a side product of migrating the internal code base at my former workplace to Android N. The Android N build system introduced the kati/ninja combo, which had severe negative impacts on build performance. Those were not acceptable to the company, so I looked into adding jobserver client support to ninja. This turned out to be rather simple, and it solved the build performance problem. As the resulting changes were already paid for, I requested permission to contribute them upstream.

IMHO there is nothing for me to do. Either

  • the project makes a decision about the contribution, or
  • my former employer requests me to withdraw it.

@segevfiner

Kitware (CMake's authors) also maintain https://github.com/Kitware/ninja, which is a fork/build with this PR applied, and the ninja you can install from PyPI (https://pypi.org/project/ninja/) is actually this fork.

@jcfr

jcfr commented Mar 21, 2024

Kitware (CMake's authors) also maintain https://github.com/Kitware/ninja

Ditto. We have been using our fork as both (1) a staging area for features in review and (2) the version built and distributed [1] on PyPI.

For context, the distribution of both cmake and ninja on PyPI was motivated by supporting the scikit-build [2] initiative.

Footnotes

  1. https://pypi.org/project/ninja/

  2. https://github.com/scikit-build/scikit-build-core

edtanous added a commit to edtanous/openbmc that referenced this issue May 8, 2024
Ninja has a PR for adding make jobserver support [1] that has been a widely
debated PR for many... many years.  Given that many people have forked to
incorporate this PR, and it claims to solve a problem we have (OOM on gcc
processes), it seems worthwhile to use a well-maintained fork instead of
the main project.

This is not a one way door.  If we find that the project goes
unmaintained, doesn't build, or otherwise has problems, we can always go
back to using mainline.

Of the forks that have pulled this in, there are:
The Fuchsia project [2]
Their targets seem more specific and less generic, although their
improvements seem more extensive.

Kitware [3]
Maintains a fork of ninja

Docker [4]

[1] ninja-build/ninja#1139
[2] https://fuchsia.googlesource.com/third_party/github.com/ninja-build/ninja/+/refs/heads/main/README.fuchsia
[3] https://github.com/Kitware/ninja
[4] https://github.com/dockbuild/ninja-jobserver

'''
EXTRA_OEMESON_COMPILE:append = " \
--ninja-args='--tokenpool-master=fifo' \
"

PARALLEL_MAKE = "-j 20"
BB_NUMBER_THREADS = "20"
'''

Signed-off-by: Ed Tanous <ed@tanous.net>
@mortie

mortie commented Jun 3, 2024

What's the current status on this? I'm interested in it from the meta build system perspective, where many different projects written in different languages and using different build systems are all compiled in a coordinated manner. Without make jobserver client support in ninja, meta build systems are forced to make one of the following terrible trade-offs:

  • Build the projects sequentially, relying on the individual build systems' concurrency support. This is bad, since significant parts of from-scratch build times are single-threaded (especially the configure at the beginning and the linking at the end). AFAIK, this is what Buildroot does by default.
  • Build the projects concurrently, but limit each individual project to use one core. This works well if there are many small projects, but terrible if there are one or two projects which are significantly bigger than the others. You do not want to build Chromium with only one core.
  • Build the projects concurrently and let each project use many cores. This is probably the fastest if you have enough RAM, but if you have one of those 32-thread systems, it means you're running 32*32=1024 compiler processes at the same time in the worst case. This requires an immense amount of RAM. This is what Bitbake does by default.

If Ninja and other build systems supported the jobserver protocol, there would be another option:

  • Build the projects concurrently, and let each project use multiple cores, but run one central job server which limits the total concurrency across all projects.

To my knowledge, Ninja is the only real hold-out keeping this from being a practical possibility. GNU Make and Rust's Cargo already support acting as jobserver clients.

@eli-schwartz

The current status is that after @stefanb2's PR died the death of eternally pending review, @hundeboll reimplemented it two weeks ago in #2450 and it has been approved and scheduled for inclusion in ninja 1.13.0 (but the merge button hasn't been hit).

No jobserver master support, only client support, but this is probably not a worry for you.

It would be nice if the new PR had linked to this issue as well, but it is what it is.

@mcprat
Contributor

mcprat commented Aug 30, 2024

Hi all,

I opened PR #2474 as a heavy rework of #2450 by @hundeboll, making us co-authors. This is an attempt to expedite the process of achieving a simple, efficient, style-compliant, specification-precise, and therefore acceptable implementation of a client for the GNU jobserver in order to finally have solid upstream support for it.

In the PR, support for Windows is excluded, and preprocessor macros are in place to handle that exclusion and to mark the spots where Windows-specific declarations would go if or when Windows support is added later. There are multiple reasons for this, including the fact that I am unable to build and test on Windows, and the desire for a shorter PR that's easier to review. I'm also aware that unit tests are missing at the moment.

Also, my version of the implementation supports older versions of Make, like 4.3 and 4.2, including the simple (anonymous) pipe method. After reading through this entire thread I have noticed a lot of confusion regarding that original method, including some false claims. However, I do understand that there was a reason to add the fifo method, even though it may be a minor or insignificant difference for most use cases. The code that adds support for previous versions is a very small diff compared to a version with only fifo support, and I made sure to write the client so that parsing the newer method is attempted first. I have mostly tested my own use case, which does not exercise environment/FD propagation to a subprocess of a subprocess, but I did at least one test that does, with a Makefile that calls ninja on a build.ninja that calls ninja again to run a different build.ninja.
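(For reference, a minimal sketch, not the PR code, of how a client might pick the jobserver details out of MAKEFLAGS, trying the newer fifo form before falling back to the older pipe-fd forms; error handling is intentionally simplistic.)

// Illustrative MAKEFLAGS parsing: --jobserver-auth=fifo:PATH (make 4.4+),
// then --jobserver-auth=R,W (make 4.2/4.3), then the legacy --jobserver-fds=R,W.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

struct JobserverConfig {
  std::string fifo_path;            // set for the fifo form
  int read_fd = -1, write_fd = -1;  // set for the pipe forms
  bool valid = false;
};

JobserverConfig ParseMakeflags() {
  JobserverConfig cfg;
  const char* flags = getenv("MAKEFLAGS");
  if (!flags) return cfg;
  // Newer form first, as described above.
  if (const char* p = strstr(flags, "--jobserver-auth=fifo:")) {
    cfg.fifo_path.assign(p + strlen("--jobserver-auth=fifo:"));
    cfg.fifo_path = cfg.fifo_path.substr(0, cfg.fifo_path.find(' '));
    cfg.valid = !cfg.fifo_path.empty();
    return cfg;
  }
  // Older pipe-based forms.
  const char* p = strstr(flags, "--jobserver-auth=");
  if (!p) p = strstr(flags, "--jobserver-fds=");
  if (p && sscanf(strchr(p, '=') + 1, "%d,%d", &cfg.read_fd, &cfg.write_fd) == 2)
    cfg.valid = (cfg.read_fd >= 0 && cfg.write_fd >= 0);
  return cfg;
}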

There are many other small changes that I outlined in the first post of the PR that deserve attention, so adding support for older versions of Make is NOT the only goal here.

I have spent the last week or so self-reviewing and fixing minor issues found by the CI, so now would be the time to ask for as much review, testing and criticism as possible as the next step.

Thanks in advance

@digit-google
Contributor

digit-google commented Oct 1, 2024

I just uploaded PR #2506 as a complete rework of previous PRs.

This was motivated by the fact that the latest version of #2474 degraded performance significantly for my own builds (htop showed that everything was serialized to a single job after a short amount of time), and there were still other significant issues with the code, hence the complete rewrite. This new PR includes both client and server mode, for Posix and Windows.

With my version, some of my builds go from 5m10s to 4m40s, and some others from 22m49s to 11m56s.
However, these are Fuchsia build plans which require a massive source checkout, and these times correspond to running builds on very powerful machines, with or without a vast array of remote builders. I.e. this is not something that can be reproduced easily by other developers.

My PR includes unit tests and even a small regression test suite. However, I think it would be important to get a reference real-world multi-build that could be used to verify that whatever implementation ends up in the Ninja source tree works correctly, while still being buildable in only a few minutes on a low-end or moderate developer machine (let's go for 8 cores and 16 GiB of RAM).

Any suggestions for such a multi-build? Any recursive Ninja invocation (either from Meson or CMake) would fit the bill, I think. Or maybe something that uses Make at the top level, but then builds multiple things with Ninja.

@mcprat
Contributor

mcprat commented Oct 1, 2024

so this "degraded performance"... is it just that I came up with a scaling method for capacity instead of "infinite"? then the performance would be the same if I put it back to infinite perhaps?

maybe I should have mentioned my system has only 4 cores

The jobserver protocol is supposed to be simple, but what I'm seeing instead is as verbose and unreadable as you could possibly make it, pulling out all the tricks you have ever learned; for what benefit, honestly?

don't get me wrong, I'm sure it works great, it just looks vastly over-engineered at first glance

@mathstuf
Contributor

mathstuf commented Oct 1, 2024

Any suggestions for such a multi-build? Any recursive Ninja invocations (either from Meson or CMake) would fit the build I think. Or maybe something that uses Make at the top-level, but then builds multiple things with Ninja.

This project is a collection of inner projects driven by an outer project: https://gitlab.kitware.com/paraview/common-superbuild

You'll want to use the selftest subdirectory and enable some selection of projects (the compression libraries are probably a good candidate set). To get mismatched generators (i.e., different generators for the "superbuild" compared to the inner builds), specify CMAKE_GENERATOR "inner name" in the _superbuild_ExternalProject_add call in cmake/SuperbuildMacros.cmake (line 1550 or so). There's also some logic that may need tweaking to understand that ninja can also provide a jobserver, around line 30 in cmake/SuperbuildExternalProject.cmake.

@digit-google
Contributor

so this "degraded performance"... is it just that I came up with a scaling method for capacity instead of "infinite"? then the performance would be the same if I put it back to infinite perhaps?

That's what I tried initially with your PR, but the resulting binary didn't provide any improvements to my build, except very reduced volatility. Here are the corresponding raw numbers I had at the time for a tiny Fuchsia build, in case you are interested:

ninja-upstream & -j128

4:35.20elapsed 8249%CPU
4:45.41elapsed 7914%CPU
4:46.71elapsed 7957%CPU
4:49.69elapsed 7829%CPU
5:01.97elapsed 7394%CPU

ninja-digit & jobserver_pool.py --token-count=128

3:55.67elapsed 10459%CPU
3:59.62elapsed 10247%CPU
4:01.02elapsed 10235%CPU
4:05.92elapsed 9984%CPU
4:17.52elapsed 9524%CPU

jobserver-final & jobserver_pool.py --token-count=j128

4:47.13elapsed 8187%CPU
4:47.43elapsed 8130%CPU
4:47.79elapsed 8153%CPU
4:48.15elapsed 8134%CPU
4:48.28elapsed 8204%CPU

But note that with your latest commit, things are far worse, unfortunately. You didn't even explain why you decided to change the capacity logic, nor provide any meaningful numbers for its usefulness. And again, you didn't provide any tests verifying the basic correctness of your code.

I know you have worked considerably on this, and addressed a lot of review comments, but you didn't address all the issues. The sad truth is that this was still not good enough, even performance-wise, at least for my use case. Hence the motivation to start something else.

maybe I should have mentioned my system has only 4 cores

the jobserver protocol is supposed to be simple but what I'm seeing instead is as verbose and unreadable as you can possibly make it, pulling out all the tricks you have ever learned, for what benefit honestly?

don't get me wrong, I'm sure it works great, it just looks vastly over-engineered at first glance

That's just your opinion.

At this point, I am more interested in finding a common ground that we can all use to verify that whatever lands into the final Ninja tree works correctly for everyone.

Just because PR #2506 works for me doesn't mean it would work for you or other people, hence why it is so important to get good common regression tests here, and to fix whatever needs to be fixed otherwise.

@mcprat
Contributor

mcprat commented Oct 1, 2024

That's what I tried initially with your PR, but the resulting binary didn't provide any improvements to my build, except very reduced volatility

Ah ok... so there is much more going on here...

You didn't even explain why you decided to change the capacity logic, nor provide any meaningful numbers for its usefulness

I was following the initial complaint or concern about the value being "infinite", so I tried to determine a working finite value (or at least I thought it was working) that matches the whole reason the "capacity" value was set up in the first place in 8a2575e: to gently reach maximum parallelism without an instantaneous load that can shock the system. What you're calling "volatility" in this exact case may be on purpose... or am I misunderstanding something?

The value I put in is supposed to represent a capacity of 1 + 2^n, with n incrementing on every loop of the builder per finished edge, until the next step reaches the maximum. Because my ability to test high parallelism is limited, it could be that it's actually 1 + 2*n instead...

So do you know which optimization is improving the performance, and where? I would assume it has something to do with how the "acquire" and "release" functions handle the data. I would like to learn something and be corrected and informed instead of just being bypassed. Perhaps extreme optimizations are necessary in one or a few spots, but it would be nice if we could determine where they actually matter, so other spots in the code can return to simplicity. I understand that in the world of optimization, speed requires a sacrifice of otherwise-saved code size, but it just seems a bit overboard.

But note that with your latest commit, things are far worse,

It would have been nice if someone had let me know.

you didn't provide any tests

I was waiting for a sign that anything I was doing had even a chance of being accepted, something like "ok, everything looks good, now we need ____", before spending significantly more time on the finishing requirements beyond what downstream projects need, including documentation. The age of this thread was kept in mind...

That's just your opinion

Without understanding the difference or having explanations to consider, opinions are all that remain.

I am more interested in finding a common ground that we can all use

as was I, and still am. looking forward to trying it and working with you despite mixed feelings so far

@digit-google
Contributor

After experimenting today, I have good and bad news :-)

The good news is that, thanks to @mathstuf, I have been able to experiment with https://gitlab.kitware.com/paraview/common-superbuild, which shows a deterministic though modest improvement when --jobserver is used.

Here are my reproductions steps, where I am using the hyperfine tool to benchmark Ninja with and without a jobserver pool:

# Clone superbuild repository.
git clone https://gitlab.kitware.com/paraview/common-superbuild.git

# Configure a self-test build with a few projects with CMake in the `_build` directory.
cd common-superbuild/selftest

cmake -B _build -S. \
  -DENABLE_zlib=1 \
  -DENABLE_zstd=1 \
  -DENABLE_szip=1 \
  -DENABLE_snappy=1 \
  -DENABLE_ninja=1 \
  -DENABLE_jsoncpp=1 \
  -DENABLE_freetype=1 \
  -DENABLE_imath=1 \
  -DENABLE_bzip2=1 \
  -DENABLE_lz4=1 \
  -GNinja

# Run Ninja to download and extract all project sources (this can be *long* and not parallel)
/usr/bin/time ninja -C _build download-all

# There is no configure-all target, so let's do it manually.
# This is also *long* and not parallel.
# NOTE: Adding certain projects (e.g. `llvm` or `meson`) requires building + installing
# python3, as well as all its dependencies, during their configure step. This skews benchmarked
# numbers tremendously, so avoid them if possible, or just disable the following two
# lines (benchmarks will include the times for the long non-parallel `configure` steps though :-()
CONFIGURE_TARGETS=$(ninja -C _build -t inputs all | grep -e '-configure$')
/usr/bin/time ninja -C _build ${CONFIGURE_TARGETS}

# Rename _build to _build0, so we can reuse its content in multiple builds.
# Unfortunately, `ninja -C _build clean` or `ninja -C _build -t clean` do not clean properly.
mv _build _build0

# Now benchmark Ninja without and with --jobserver
# The preparation step restores the _build directory to its pristine configured state.
hyperfine --prepare "rm -rf _build && cp -rp _build0 _build" \
  "ninja -C _build" \
  "ninja -C _build --jobserver"

At first, naively running this on my powerful work machine doesn't show any significant difference:

# Running on a machine with 96 cores: no meaningful difference
Benchmark 1: ninja -C _build
  Time (mean ± σ):      7.751 s ±  0.092 s    [User: 73.560 s, System: 3.978 s]
  Range (min … max):    7.615 s …  7.904 s    10 runs

Benchmark 2: ninja -C _build --jobserver
  Time (mean ± σ):      7.743 s ±  0.070 s    [User: 73.627 s, System: 3.832 s]
  Range (min … max):    7.621 s …  7.826 s    10 runs

Summary
  ninja -C _build --jobserver ran
    1.00 ± 0.01 times faster than ninja -C _build

But this is not surprising. This build is simply not very parallel and can use all the CPU cores it needs on this machine, with or without a jobserver. Hence the same overall build times.

The interesting part is using Linux cgroups to restrict the number of CPU cores available to the commands, and here we see a difference. I am on a Debian system and using systemd-run --scope --property AllowedCPUs=XX-YY <command> to run hyperfine, where XX and YY are the starting and ending CPU indices.

When using 8 cores only (e.g. AllowedCPUs=0-7), the --jobserver version is x1.18 faster.

When using 4 cores only (e.g. AllowedCPUs=0-3), the --jobserver version is x1.06 faster only.

Here are the benchmark results for the 8-core case:

# Running in a Linux cgroup with only 8 CPUs (systemd-run --scope --property AllowedCPUs=0-7 hyperfine ...)
Benchmark 1: ninja -C _build
  Time (mean ± σ):     13.263 s ±  0.198 s    [User: 68.994 s, System: 2.825 s]
  Range (min … max):   12.900 s … 13.576 s    10 runs
 
Benchmark 2: ninja -C _build --jobserver
  Time (mean ± σ):     11.283 s ±  1.040 s    [User: 64.962 s, System: 2.887 s]
  Range (min … max):   10.176 s … 12.802 s    10 runs
 
Summary
  ninja -C _build --jobserver ran
    1.18 ± 0.11 times faster than ninja -C _build

And for the 4-core case:

# Running in a Linux cgroup with only 4 CPUs (systemd-run --scope --property AllowedCPUs=0-3 hyperfine ...)
Benchmark 1: ninja -C _build
  Time (mean ± σ):     20.981 s ±  0.217 s    [User: 68.251 s, System: 2.707 s]
  Range (min … max):   20.614 s … 21.396 s    10 runs

Benchmark 2: ninja -C _build --jobserver
  Time (mean ± σ):     19.728 s ±  1.111 s    [User: 64.685 s, System: 2.815 s]
  Range (min … max):   17.838 s … 21.483 s    10 runs

Summary
  ninja -C _build --jobserver ran
    1.06 ± 0.06 times faster than ninja -C _build

That's it for the good news.

@digit-google
Contributor

Notice however that the volatility of the jobserver builds in the previous section is high, but I can reproduce the same mean over several hyperfine runs locally :-/

Now, I also wanted to get a better understanding of what's going on, and also try to get something that is more parallel, so I ended up creating my own tiny super-build system. Its main benefit is the ability to generate a build_trace.json file that describes the execution of all build tasks for analysis.

You can find it at https://github.com/digit-google/ninja-recursive-build, the README.md contains all instructions you need to replicate my results and play with it.

Now for the bad news: I can see a jobserver-related reproducible performance regression with this new build system.

Similarly to before, running with all cores shows no significant difference (see below). However, when limiting to 8 CPUs, the jobserver build is noticeably slower (x1.08, with high volatility, but this can be reproduced consistently).

# Running with full 96 cores
Benchmark 1: ninja
  Time (mean ± σ):      8.963 s ±  0.168 s    [User: 247.292 s, System: 14.446 s]
  Range (min … max):    8.677 s …  9.233 s    10 runs
 
Benchmark 2: ninja --jobserver
  Time (mean ± σ):      8.897 s ±  0.197 s    [User: 257.761 s, System: 15.928 s]
  Range (min … max):    8.666 s …  9.341 s    10 runs
 
Summary
  ninja --jobserver ran
    1.01 ± 0.03 times faster than ninja
# Limited to 8 cores
Benchmark 1: ninja
  Time (mean ± σ):     22.953 s ±  0.395 s    [User: 160.829 s, System: 7.868 s]
  Range (min … max):   22.494 s … 23.702 s    10 runs
 
Benchmark 2: ninja --jobserver
  Time (mean ± σ):     24.821 s ±  1.531 s    [User: 152.558 s, System: 8.320 s]
  Range (min … max):   20.797 s … 25.800 s    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  ninja ran
    1.08 ± 0.07 times faster than ninja --jobserver

@stefanb2
Contributor Author

stefanb2 commented Oct 2, 2024

@digit-google After experimenting today, I have good and bad news :-)

While those numbers are interesting, IMHO your test case does not really represent why people need jobserver support in ninja.

There are multi-level build systems out there that blindly translate "component sub-build invocation" into a new independent parallel build, e.g. make -j4. AFAIR kati did this for every $(MAKE) inside an Android build. So with N components you end up with 4*N parallel build steps, leading to a complete overload of the build machine.

The final goal is that the top-level build tool provides a jobserver with T tokens, and every build tool instance, including the top-level one, adheres to the limit, i.e. in total there are never more than T build steps running in parallel (not counting the build tool instances themselves, because each of them should be waiting for its children to complete).

@digit-google
Contributor

I think I have an explanation for this phenomenon though.

As far as I know, none of the proposed implementations changes the fact that Ninja only decides to launch new commands when at least one of them has completed. In other words, if a job slot becomes available in the pool, Ninja will not see it until one of its currently running subprocesses ends, even if it has capacity to launch more of them.

(This is not how GNU Make works, by the way: it is capable of waiting for both process completion and a new token in the FIFO at the same time, at least on Posix; not sure about Windows.)

Here's an example to illustrate this. Let's imagine a top-level build that invokes two sub-builds, each one of them has two parallel tasks, but one of them has short ones, and the other has long ones. If there are enough cores for everyone, everything can be dispatched in parallel, which would look like:

top-level
|___subbuild_A
|   |_____[short1]
|   |_____[short2]
|
|___subbuild_B
    |_____[long___________________1]
    |_____[long___________________2]

          ^                         ^
          |                         |
        build starts             build ends

Which can be summarized as:

--------------------------------------------------> timeline
[short1]
[short2]
[long___________________1]
[long___________________2]

Now let's imagine that there are only 2 cores available on the machine this runs on. Without a jobserver, 4 processes are started at the same time, and the OS will control parallelism by preempting them as it sees fit. Assuming these are purely CPU-bound tasks, the most optimal scheduling would end up with a build time similar to this timeline:

[short1][long___________________1]
[short2][long___________________2]

Now, if a jobserver is used to synchronize the parallelism, it is possible to end up in a situation where each sub-build only gets one token initially:

[short1]
[long___________________1]

The first sub-build completes, and reuses its token to finish its second task:

[short1][short2]
[long___________________1]

At the end of [short2] the first sub-build releases its token to the pool, but the second sub-build doesn't know about it, and waits for the completion of its first task, before launching the second one:

[short1][short2]
[long___________________1][long___________________2]

And a long build time is the result, due to poor scheduling.
It would be better if Ninja could detect the new token immediately to launch the second long task asap, as in:

[short1][short2]
[long___________________1]
                [long___________________2]

This would end up with a better build time, even though not the optimal one that can be achieved without a jobserver.
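(To make the missing piece concrete: a minimal Posix sketch, assuming a fifo-style jobserver and not taken from any of the PRs, of how a build tool can wait for subprocess output and for a newly released token in a single call, so that a freed slot wakes it up immediately.)

// Illustrative only: wait for either subprocess activity or a jobserver token.
#include <poll.h>
#include <unistd.h>
#include <vector>

// Returns true if a token was read from the jobserver FIFO, false if the
// wake-up came from one of the subprocess pipes (or poll() failed).
bool WaitForTokenOrSubprocess(int jobserver_fd,
                              const std::vector<int>& subprocess_fds) {
  std::vector<pollfd> fds;
  fds.push_back({jobserver_fd, POLLIN, 0});
  for (int fd : subprocess_fds)
    fds.push_back({fd, POLLIN, 0});

  if (poll(fds.data(), fds.size(), -1) <= 0)
    return false;

  if (fds[0].revents & POLLIN) {
    char token;
    // A real implementation would read non-blocking: another client may win
    // the race for this token.
    if (read(jobserver_fd, &token, 1) == 1)
      return true;  // One extra job slot acquired; write it back when done.
  }
  return false;  // Subprocess activity: let the normal completion path run.
}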

@digit-google
Contributor

Some feedback, since I was able to roll this feature into the Ninja binary used by the Fuchsia build, and changed our builder configurations to use it.

The Fuchsia CI bots build around 90 different build configurations of the project, only a few of them actually launch Ninja sub-builds from the top-level one (to generate SDKs that require binaries from different CPU architectures and API levels, then merging the results into a final SDK archive, in case you're curious :-))

I could enable the jobserver feature for all build configurations and perform clean build time comparisons:

  • One SDK build configuration saves 12 minutes (from 1h to 48 minutes) due to it.
  • Other SDK build configurations save only 6 minutes (e.g. from 42 to 36 minutes), which is still a significant win.
  • Other build configurations, which do not use sub-builds, are simply not affected (as expected, but it's still nice to check).

These confirm the savings I could observe locally on my own workstation (these are builds running on CI bots of various CPU/RAM dimensions, with remote builds enabled to boot, so we run everything with -j CPU_COUNT*10 on these).

The jobserver feature is now enabled by default on all Fuchsia SDK build configurations.

I addressed the few review comments that were sent on PR #2506, and would welcome any more feedback.

I would be grateful if anyone could use the latest binaries generated by the Github CI and see if this improves their own multi-build times, or report issues they are facing.

Regarding the sub-optimal scheduling described in my previous comment: solving it will require changing the way Subprocess::DoWork() works in order to detect (and report) new types of events. I think this is better left for another PR and discussion. What do you think?

@thesamesam

I would be grateful if anyone could use the latest binaries generated by the Github CI and see if this improves their own multi-build times, or report issues they are facing.

I've been using it for the last week or so. The only quirk was the interaction between NINJA_JOBSERVER=1 and --jobserver, but that wasn't a big deal at all, I just hadn't clocked it at first. I've been able to increase parallelism and not have LTO bomb the system on large world rebuilds.

I think this is better left for another PR and discussion. What do you think?

Yeah, I'd hate to see it delayed further, especially given that nobody so far (hopefully more can test) has raised issues with the changes. It also means that we're not fighting for interest/attention on a topic that deserves its own space, I think?
