
Implement GNU jobserver posix client support #2474

Open
wants to merge 1 commit into
base: master

Conversation

mcprat
Contributor

@mcprat mcprat commented Aug 10, 2024

a rework of #2450 supporting all versions of GNU Make, but without Windows support
(I'm not able to test for Windows, and I have doubts with proposed Windows support)

resolves #1139 for posix systems

thanks to @hundeboll for much of the work with this newer implementation

ping @jhasse @digit-google

significant differences:

  • no changes to any function parameters
  • no new intermediary functions
  • instantiate client support in real_main() instead of Plan
  • pass pointers to the Jobserver class into other classes
  • use a constructor to initialize jobserver client
  • release all tokens on any fatal error
  • calculate a value for "load capacity" instead of returning SIZE_MAX
  • supports both fifo and simple pipe file descriptors from Make
  • detect invalid or closed pipe and inform user about the most likely reason

@mcprat
Contributor Author

mcprat commented Aug 10, 2024

I forgot I have to adapt to a windows build even if I'm not going to support it on windows...

@mcprat mcprat force-pushed the jobserver-final branch 4 times, most recently from c1c6829 to 8530799 Compare August 10, 2024 06:09
@mcprat
Contributor Author

mcprat commented Aug 10, 2024

The CI for Windows is happy now, but it would be nice to have a tester for Windows...

src/jobserver.h Outdated
Comment on lines 81 to 71
/// The number of currently acquired tokens, or the jobserver status if negative.
/// Used to verify that all acquired tokens have been released before exiting,
/// and when the implicit (first) token has been acquired (initialization).
/// -1: initialized without a token
/// 0: uninitialized or disabled
/// +n: number of tokens in use
int token_count_ = 0;
Contributor Author

I want to point out this concept and ask whether or not this usage of an int is too unusual or non-standard.
This coincides with the last line of the constructor Jobserver::Jobserver() and the value of capacity in CanRunMore() using the absolute value function.

It's pretty easy to rework this. I just happened to have this idea first, to use the token number in place of Enabled() when there actually are no tokens yet.

@mcprat
Contributor Author

mcprat commented Aug 11, 2024

found and fixed some minor mistakes, added some commit tags

@mcprat mcprat force-pushed the jobserver-final branch 3 times, most recently from da3903d to dd128f2 Compare August 12, 2024 19:48
@mcprat
Contributor Author

mcprat commented Aug 12, 2024

sorry I didn't realize that there were "builder" constructors for the test suite, I built and ran the test suite this time.

@mcprat
Contributor Author

mcprat commented Aug 13, 2024

oops... I went too fast and everything is still "green" on segfault...

@jhasse
Collaborator

jhasse commented Aug 13, 2024

We're very wary of changes that increase the complexity of Ninja, so a PR that implements both methods while one of them is technically superior and results in less code in Ninja (and to my understanding that's the case for fifo) is very unlikely to get merged.

@mcprat
Contributor Author

mcprat commented Aug 13, 2024

The previous struggle was with the creation of the Jobserver object in real_main(); it had nothing to do with the new files and new functionality. I was trying to avoid making functions that call Jobserver functions through another class and instead pass references to the object wherever it's needed, but I can always go back to the other way of creating the object within the Plan struct.

@mcprat
Contributor Author

mcprat commented Aug 13, 2024

ah, I see what you mean, it errored on readability...

but the other thing I have doubts about

Run ctest -C Release -vv
CMake Error: Unknown argument: -vv
CMake Error: Run 'ctest --help' for all supported options.
Error: Process completed with exit code 1.

@mcprat
Contributor Author

mcprat commented Aug 13, 2024

the bool now defaults to false, and set true with a simple if instead of a ternary, like in the rest of the project 👍🏼

@mcprat
Contributor Author

mcprat commented Aug 14, 2024

I believe that all the minor issues caught by the CI are handled now...

@mcprat
Contributor Author

mcprat commented Aug 15, 2024

some simplification:

I was reading the Google Style Guide and saw this

...we never allow non-const reference parameters.

so changes in the last push are:

  1. Converted all new references to pointers, to comply with the style guide for Jobserver being a non-const object.

Then I realized that I no longer need to create a Jobserver object for build_test.cc (at least in this commit), so

  1. Pass NULL for Jobserver* in Builder instantiations in build_test.cc, and check for NULL before dereferencing Jobserver in build.cc

then finally, another opportunity to save lines:

  1. Create Jobserver object in NinjaMain instead of real_main() since I realized that only 1 NinjaMain is created in a single process run.

@mcprat
Contributor Author

mcprat commented Aug 15, 2024

updated commit message

@mcprat
Contributor Author

mcprat commented Aug 17, 2024

@jhasse can we run the CI workflow again?

@mcprat
Contributor Author

mcprat commented Aug 29, 2024

small organization update:

  • comment rewriting
  • style formatting
  • save some more lines
  • line wrapping
  • moved new const functions to header

@hwti

hwti commented Aug 31, 2024

  • relaxed the errors in the cases of being passed an invalid set of FDs (-2,-2) and failing to read from the pipe when it is non-fifo. the build continues as non-parallel, and a new bool facilitates that.

This means that with make 4.3, when ninja launches itself (like what I found with WebKit, when a CMake project reconfigures itself), there is a ninja: warning: pipe closed: 3 (mark the command as recursive) and the build is non-parallel.
The pipe shouldn't be closed, or if there is no other way maybe the --jobserver-auth= / --jobserver-fds= should be removed from MAKEFLAGS.

@mcprat
Contributor Author

mcprat commented Sep 1, 2024

I know, but with -j1 or -j, make doesn't start a jobserver.

I understand what you mean now... I had been used to testing with a build system that already uses ninja with -j 1, and I have been looking for parallelism when adding jobserver support. However, the opposite case is also possible: without any options, if the jobserver client stops initialization early because of a problem, ninja will use the default parallelism of nproc + 2.

This is a tricky situation but I think I got it... I'm going to review what I just pushed again for a possible regression, and then do the variable renames later...

summary:

  • Most return statements are replaced with more uses of the new bool for falling back to non-parallel building.
  • "invalid value" warning now only applies to non-empty strings
  • specifically set FDs to 0 when falling back to non-parallel build (but FD 0 will never be read from or written to)

@mcprat
Contributor Author

mcprat commented Sep 1, 2024

if there is no other way maybe the --jobserver-auth= / --jobserver-fds= should be removed from MAKEFLAGS.

It is the responsibility of the job token server to adjust the environment including the MAKEFLAGS variable. Make versions 4.3.90+ and 4.4.x do this by appending --jobserver-auth=-2,-2 to the end of MAKEFLAGS. The warning is appropriate for the cases where this does not happen, as the environment variable suggests that the pipe will be open when it is never actually opened.

@hwti

hwti commented Sep 1, 2024

if there is no other way maybe the --jobserver-auth= / --jobserver-fds= should be removed from MAKEFLAGS.

It is the responsibility of the job token server to adjust the environment including the MAKEFLAGS variable. Make versions 4.3.90+ and 4.4.x do this by appending --jobserver-auth=-2,-2 to the end of MAKEFLAGS. The warning is appropriate for the cases where this does not happen, as the environment variable suggests that the pipe will be open when it is never actually opened.

Here make is fine, it's about the make -j2 => ninja (OK, runs a rule regenerating build.ninja, then later pipe closed).
I thought ninja executed a second instance, since I see ninja: using jobserver: 3,4 twice, but it doesn't seem to be the case.
It seems to "restart" after having regenerated the build.ninja, closing the pipe fds before the second using jobserver trace.

The same case (ninja restarting after the CMake reconfigure) works when the jobserver uses a fifo (make 4.4).

When trying to reproduce with a manually written ninja file (simulating a buildsystem which generates build.ninja from depfile), I managed to get a case which even fails writing to the pipe.

build.ninja :

rule DUMMY
  command = echo dummy

rule REGENERATE
  command = touch build.ninja

build all: phony all1 all2

build all1: DUMMY
build all2: DUMMY

build build.ninja: REGENERATE depfile

Makefile :

all:
        touch depfile
        +ninja

=>

$ make -j2
touch depfile
ninja
ninja: using jobserver: 3,4
[1/1] touch build.ninja
ninja: using jobserver: 3,4
[1/3] echo dummy
dummy
ninja: fatal: failed to write to jobserver: 4: Bad file descriptor
make: *** [Makefile:3: all] Error 1

@avikivity

Suggest to update the manual, so users are aware that the feature exists and how to operate it.

@mcprat
Contributor Author

mcprat commented Sep 2, 2024

I thought ninja executed a second instance, since I see ninja: using jobserver: 3,4 twice, but it doesn't seem to be the case.
It seems to "restart" after having regenerated the build.ninja, closing the pipe fds before the second using jobserver trace

Earlier, I did a small refactor:

Create Jobserver object in NinjaMain instead of real_main() since I realized that only 1 NinjaMain is created in a single process run.

It sounds like that observation I made is wrong, so I'll put the client instantiation back in real_main(). I should have known that's the reason why you sometimes see 2 subprocesses that are numbered as 1/N targets...

@mcprat
Contributor Author

mcprat commented Sep 2, 2024

update summary:

  • small regression: Acquire() must return false when pipe is closed
  • moved instantiation of the client back to real_main()

Jobserver::Jobserver() {
assert(!Enabled());

// Return early if no makeflags are passed in the environment.
Contributor

Please move all parsing logic to a separate static method that can be unit-tested properly. Also be aware that the GNU Make documentation states explicitly: Be aware that the MAKEFLAGS variable may contain multiple instances of the --jobserver-auth= option. Only the last instance is relevant. Hence this should be implemented properly (and tested).
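As a rough illustration of that request, here is a minimal sketch of a standalone parser that keeps only the last matching flag. The names `JobserverAuth` and `ParseJobserverAuth` are hypothetical and not taken from the PR; the sketch also assumes words in MAKEFLAGS are separated by spaces or tabs and does not interpret the value itself (fd pair versus `fifo:` path).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type, for illustration only.
struct JobserverAuth {
  bool found = false;  // true if any jobserver flag was seen
  std::string value;   // text after '=' of the LAST matching flag
};

// Scans every word of `makeflags` and remembers the value of the last
// --jobserver-auth= or --jobserver-fds= word. No logging, no globals,
// so it can be called directly from a unit test.
inline JobserverAuth ParseJobserverAuth(const std::string& makeflags) {
  static const std::string kAuth = "--jobserver-auth=";
  static const std::string kFds = "--jobserver-fds=";
  JobserverAuth result;
  std::size_t pos = 0;
  while (pos < makeflags.size()) {
    // Skip separators, then find the end of the current word.
    std::size_t start = makeflags.find_first_not_of(" \t", pos);
    if (start == std::string::npos)
      break;
    std::size_t end = makeflags.find_first_of(" \t", start);
    if (end == std::string::npos)
      end = makeflags.size();
    const std::string word = makeflags.substr(start, end - start);
    if (word.compare(0, kAuth.size(), kAuth) == 0) {
      result.found = true;
      result.value = word.substr(kAuth.size());  // later words overwrite
    } else if (word.compare(0, kFds.size(), kFds) == 0) {
      result.found = true;
      result.value = word.substr(kFds.size());
    }
    pos = end;
  }
  return result;
}
```

Because later words overwrite earlier ones, a value such as `--jobserver-auth=3,4 --jobserver-auth=-2,-2` yields `-2,-2`, which matches the "only the last instance is relevant" rule quoted above.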

Contributor

Btw, see also to implement the following requirement:

Your tool may also examine the first word of the MAKEFLAGS variable and look for the character n. If this character is present then make was invoked with the ‘-n’ option and your tool may want to stop without performing any operations

From https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html
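For reference, a minimal sketch of what such a check could look like. `MakeflagsRequestsDryRun` is a hypothetical name, and the sketch assumes that a first word starting with `-` (or a value starting with a blank) means there is no group of single-letter options to inspect.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if the first word of MAKEFLAGS looks like a group of
// single-letter options that includes 'n' (dry run).
inline bool MakeflagsRequestsDryRun(const std::string& makeflags) {
  // The first word ends at the first space or tab.
  std::size_t end = makeflags.find_first_of(" \t");
  if (end == std::string::npos)
    end = makeflags.size();
  const std::string first = makeflags.substr(0, end);
  // Assumption: an empty first word or one starting with '-' carries
  // no single-letter options.
  if (first.empty() || first[0] == '-')
    return false;
  return first.find('n') != std::string::npos;
}
```

Keeping this separate from the jobserver parser leaves room to wire it into the build configuration in a different change, if it is wanted at all.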

Contributor Author

Btw, see also to implement the following requirement:

Your tool may also examine the first word of the MAKEFLAGS variable and look for the character n. If this character is present then make was invoked with the ‘-n’ option and your tool may want to stop without performing any operations

From https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html

I wish they didn't mention this in that part of the manual. It's completely out of scope for the jobserver. The jobserver client should not be responsible for determining whether ninja does a "dry run" even though it happens to be parsing MAKEFLAGS which can have many different flags for many different reasons...

This should be implemented in a separate commit (and separate PR) and directly with the BuildConfig object instead of involving the Jobserver objects. Traditionally, ninja does not rely on the environment for any of its configuration flags, so that's another conversation as well.

// Tokenize string to characters in flag_, then words in flags_.
while (flag_char_ < strlen(makeflags)) {
while (flag_char_ < strlen(makeflags) &&
!isblank(static_cast<unsigned char>(makeflags[flag_char_]))) {
Contributor

nit: isblank() is locale-dependent, which leads to surprises and is generally slow. It is easier to just compare with ' ' and '\t' in this case.
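A small sketch of the locale-independent alternative, with hypothetical helper names (`IsBlank`, `SplitOnBlanks`) that are not part of the PR:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Locale-independent replacement for isblank(): only space and tab.
inline bool IsBlank(char c) {
  return c == ' ' || c == '\t';
}

// Splits `input` into words separated by runs of spaces or tabs.
inline std::vector<std::string> SplitOnBlanks(const std::string& input) {
  std::vector<std::string> words;
  std::string current;
  for (char c : input) {
    if (!IsBlank(c)) {
      current.push_back(c);
    } else if (!current.empty()) {
      words.push_back(current);
      current.clear();
    }
  }
  if (!current.empty())
    words.push_back(current);
  return words;
}
```

This also walks the string once instead of re-evaluating strlen() on every loop iteration.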


#include "util.h"

Jobserver::Jobserver() {
Contributor

I would recommend splitting this PR into multiple commits, i.e.:

  1. One that adds the Jobserver class, and its Posix implementation + appropriate unit-tests for it. Also try to make the class as independent from the rest of Ninja as possible (e.g. to not call Warning() or Info() in the parser function, leave that to clients).

  2. One that adds usage of the class to build.cc / ninja.cc.

Contributor Author

I would recommend splitting this PR into multiple commits, i.e.:

In my opinion, the first commit should be functional so that other projects can easily pull the patch while waiting for a release, and also for the "bisect rule", so each individual checkout is functional on its own. I'm planning on Windows implementation and tests to be separate commits while trying to keep this one small (except for the new files)...

if (!jobserver_fifo_)
Warning("pipe closed: %d (mark the command as recursive)", rfd_);
else
Fatal("failed to read from jobserver: %d: %s", rfd_, strerror(errno));
Contributor

Is there a way to fall back gracefully to the usual mode when this happens instead?

In general, it is better to avoid calling Fatal() in methods like these, because these conditions cannot be properly unit-tested.

Contributor Author

the remaining calls to Fatal() are for the following cases:

  1. MAKEFLAGS from the environment told ninja that a fifo object needs to have file descriptors opened for it. The open() syscall was run and failed.
  2. File descriptors have been opened to the fifo object, but reading from it has failed not due to blocking. This likely means that the fifo object has been deleted while we have file descriptors that point to nothing, or maybe a problem with the filesystem itself.
  3. File descriptors have been opened to the fifo object, but writing to it has failed, again suggesting that the fifo object has been deleted while we have file descriptors that point to nothing or a problem with the filesystem.

They can become warnings, but I think they protect execution from continuing when something is extremely wrong. It should be exceedingly rare for the program to step into these Fatal() calls and they are probably out of scope for unit tests anyway, but I'll let you decide.
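One way to keep both positions compatible is to have the low-level read report a status and let the caller decide between a warning, a fallback, or Fatal(). The following is only a sketch under that assumption; `AcquireStatus` and `TryAcquireToken` are hypothetical names, and treating EBADF like a closed pipe is an illustrative choice rather than what the PR does.

```cpp
#include <cassert>
#include <cerrno>
#include <unistd.h>

// Hypothetical status codes, so tests can assert on outcomes instead of
// the process aborting inside the reader.
enum class AcquireStatus { kAcquired, kNoToken, kPipeClosed, kError };

// Tries to read one token byte from `read_fd` into `*token`.
inline AcquireStatus TryAcquireToken(int read_fd, unsigned char* token) {
  for (;;) {
    ssize_t n = read(read_fd, token, 1);
    if (n == 1)
      return AcquireStatus::kAcquired;
    if (n == 0)
      return AcquireStatus::kPipeClosed;  // EOF: every writer has closed
    if (errno == EINTR)
      continue;  // interrupted by a signal, retry
    if (errno == EAGAIN || errno == EWOULDBLOCK)
      return AcquireStatus::kNoToken;  // non-blocking fd, pool is empty
    if (errno == EBADF)
      return AcquireStatus::kPipeClosed;  // fd was never open or is closed
    return AcquireStatus::kError;
  }
}
```

A caller could then map kPipeClosed to the non-parallel fallback and reserve Fatal() for kError, which keeps the reader itself testable with an ordinary pipe().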

if (flags_[n].find(AUTH_KEY) == 0)
flag_ = flags_[n].substr(strlen(AUTH_KEY));

// --jobserver-fds=<val>
Contributor

Note that this option is never going to work unless you can modify subprocess-posix.cc and subprocess-win32.cc to pass these extra file descriptors to all child processes. Which is a very non-trivial change. Do you have a commit planned for this? Otherwise, I would leave this out of the current PR, explaining why it's not implemented yet.

(Alternatively, support it in the parser function, but ensure that the client recognizes this is not implemented yet, and ignore it).

Contributor Author

Note that this option is never going to work unless you can modify subprocess-posix.cc and subprocess-win32.cc to pass these extra file descriptors to all child processes.

Are you sure about that? I don't see a difference in functionality between Make 4.1 and Make 4.2 for the jobserver. There was a lot of refactoring going on, but the commit that renames the flag is literally just a rename. I'll do some testing regarding this though.

Do you have a commit planned for this? Otherwise, I would leave this out of the current PR, explaining why it's not implemented yet.

No, I don't have a commit for that. I imagine that any problem with the simple pipe with invalid FDs passed will fall back to non-parallel building.

Contributor Author

@mcprat mcprat Sep 5, 2024

@digit-google funny enough, instead of finding a problem between Make 4.1 and Make 4.2, I found a problem between Make 4.2.1 and Make 4.3. This bug makes any process into a zombie if and only if they try to read a token from the token server and the token is not available AND there is no other subprocess that will finish so that a token is delivered back into the pool to be redirected to the zombie process so it can run. In my testing use case this makes make -j 3 fail while make -j without a number works.

This is because the pipe used to be blocking on the write side of the server, the case described by POSIX as:

If some process has the pipe open for writing and O_NONBLOCK is clear, read() will block the calling thread until some data is written or the pipe is closed by all processes that had the pipe open for writing.

One way we can adapt is to check the NONBLOCK flag and force non-parallel if it is cleared. However, most of the responsibility here should be to the user who should know the limitations of Make 4.2.1 and earlier.

bug:
https://savannah.gnu.org/bugs/?51159

fixed with commit "[SV 51159] Use a non-blocking read with pselect to avoid hangs.":
https://git.savannah.gnu.org/cgit/make.git/commit/?id=b552b05251980f693c729e251f93f5225b400714
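The O_NONBLOCK check mentioned above could look roughly like this. `IsNonBlocking` is a hypothetical helper; it relies on the fact that file status flags belong to the open file description, so a client that inherits the descriptor sees the mode the server set.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Returns 1 if `fd` has O_NONBLOCK set, 0 if it is in blocking mode,
// and -1 if fcntl() fails (for example because `fd` is not open).
inline int IsNonBlocking(int fd) {
  int flags = fcntl(fd, F_GETFL);
  if (flags == -1)
    return -1;
  return (flags & O_NONBLOCK) ? 1 : 0;
}
```

A result of 0 on the read side would then be the signal to force a non-parallel build instead of risking a blocking read that never returns.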

Contributor Author

A little more context, the way the client is currently written, ninja tries to get a second token before returning the first one it just got, so this Make bug causes a ninja build to hang even if it is not multiple recursive levels deep.

@mcprat
Contributor Author

mcprat commented Sep 5, 2024

big update...

summary:

  • the destructor has been removed. I realized that it never gets run at all. Also, it is not the responsibility of the client to close the pipe. Emergency return of tokens before exiting still happens during call to Clear().
  • the constructor has been split into a Parse() function for the majority of the parsing again.
  • the Jobserver class is split into base and derived classes, reducing preprocessor #if's to 1.
  • Enabled() uses jobserver_closed_ instead of using jobserver_closed_ to put fake FDs.
  • macros for constant strings have been converted to static constexpr's as suggested.
  • removed underscores from local variables that were previously member variables.
  • moved <vector> include from the header to the source.
  • small rewrites to comments and warnings.

@mcprat
Contributor Author

mcprat commented Sep 9, 2024

smaller update, this might be considered mergeable now.
hopefully all major issues are handled so we can focus on style and nitpicking...

summary:

  • now handling and storing the actual token character instead of just a count. this allowed for some simplification so there is a line decrease. this is written assuming a char is 1 byte.
  • added a case to fall back to a non-parallel build when the token server provides an FD to read tokens from that is blocking (Make versions 4.2.1 and earlier).
  • some light cleanup and rewording
  • tested all the way back to Make 4.0, which is when Windows support for jobserver was added. anything earlier than that can be considered "ancient" enough to ignore...

/// It must be called for each successful call to Acquire() after the command
/// even if subprocesses fail or in the case of errors causing Ninja to exit.
/// Ninja is aborted on write errors, and otherwise calls always succeed.
virtual void Release(unsigned char*) {};
Contributor

Please explain what the pointer being passed as argument here should point to. E.g. is it ok to call this with a pointer that points to a value that was never the result of a previous Acquire() call (as suggested by your Clear() function implementation)? In which case, what is the default value to be used (apparently \0 from the code, but you should make that clear in the documentation)? Also, will the function modify the pointed value or not?

Ideally, users of the API should not have to guess these details by looking at the source code.

I suggest writing a dedicated move-only Token class, even if trivial, to better encapsulate these semantics.

/// A wrapper for token values acquired or released to the pool.
/// A default instance has no value, and is used to indicate that
/// no token is available.
struct Token {
  // Default constructor builds a value-less token.
  Token() = default;

  // Explicit constructor for a Token with a value from the pipe.
  explicit Token(uint8_t value) : value_(static_cast<int>(value)) {}

  // Move operations are allowed. The moved-from token loses its value,
  // so a token can never be released twice.
  Token(Token&& other) noexcept : value_(other.value_) { other.value_ = -1; }
  Token& operator=(Token&& other) noexcept {
    if (this != &other) {
      value_ = other.value_;
      other.value_ = -1;
    }
    return *this;
  }

  // Copy operations are forbidden.
  Token(const Token&) = delete;
  Token& operator=(const Token&) = delete;

  /// Returns true if this instance contains a value received from the pipe.
  bool HasValue() const { return value_ != -1; }

  /// Return underlying value. It is a runtime error to call this method
  /// if HasValue() returns false.
  uint8_t GetValue() const {
    assert(HasValue());
    return static_cast<uint8_t>(value_);
  }

  int value_ = -1;
};

Then you can have Acquire() return a Token by value, and Release() take a token by value, and forget about pointers entirely, e.g.:

/// Try to acquire a token from the pool. A value-less token instance is returned
/// if no token is available.
virtual Token Acquire() { return Token(); }

/// Release a previously acquired token to the pool. Does nothing if the
/// token argument has no value.
virtual void Release(Token token) {}
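To show how the by-value API reads at a call site, here is a self-contained sketch. It restates the Token idea in compact form and pairs it with `FakeTokenPool`, a hypothetical in-memory stand-in for the jobserver pipe that exists only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Compact restatement of the move-only token idea.
class Token {
 public:
  Token() = default;
  explicit Token(std::uint8_t value) : value_(value) {}
  Token(Token&& other) noexcept : value_(other.value_) { other.value_ = -1; }
  Token& operator=(Token&& other) noexcept {
    if (this != &other) {
      value_ = other.value_;
      other.value_ = -1;
    }
    return *this;
  }
  Token(const Token&) = delete;
  Token& operator=(const Token&) = delete;

  bool HasValue() const { return value_ != -1; }
  std::uint8_t GetValue() const {
    assert(HasValue());
    return static_cast<std::uint8_t>(value_);
  }

 private:
  int value_ = -1;  // -1 means "no token"
};

// Hypothetical in-memory pool; a real client would read and write a pipe.
class FakeTokenPool {
 public:
  explicit FakeTokenPool(std::size_t count)
      : bytes_(count, static_cast<std::uint8_t>('+')) {}

  // Returns a value-less Token when the pool is empty.
  Token Acquire() {
    if (bytes_.empty())
      return Token();
    std::uint8_t byte = bytes_.back();
    bytes_.pop_back();
    return Token(byte);
  }

  // Returns the token's byte to the pool; value-less tokens are ignored.
  void Release(Token token) {
    if (token.HasValue())
      bytes_.push_back(token.GetValue());
  }

  std::size_t available() const { return bytes_.size(); }

 private:
  std::vector<std::uint8_t> bytes_;
};
```

A caller would write `Token t = pool.Acquire();` and later `pool.Release(std::move(t));`. Since copies are deleted, releasing the same token twice does not compile without an explicit move, and a moved-from token is value-less.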

Contributor Author

I rewrote comments in the header to reflect the change in functionality and new returns and arguments.

Making a whole class just to manage the value of the tokens seems complicated and overkill to me. It makes the function declarations read better, but the Token class itself is not very readable and adds a lot of object constructs for not much benefit...

Is there any benefit to avoiding pointers here? or avoiding char?

What if we just made a simple typedef for the tokens?

I'm very confident that a jobserver will never return a NUL char unless it's to indicate "no tokens available" just as I am using it to mean. I'm also pretty confident that Ninja will never receive anything other than a '+' when being used with Make, probably even with forks of Make...

The core principle of a jobserver is simple:
before starting a new job (edge in ninja-speak),
a token must be acquired from an external entity as approval.

Once a job is finished, the token is returned to represent a free job slot.
In the case of GNU Make, this external entity is the parent process
which has executed Ninja and is managing the load capacity for
all subprocesses which it has spawned. Introducing client support
for this model allows Ninja to give load capacity management
to its parent process, allowing it to control the number of
subprocesses that Ninja spawns at any given time.

This functionality is desirable when Ninja is part of a bigger build,
such as Yocto/OpenEmbedded, Openwrt/Linux, Buildroot, and Android.
Here, multiple compile jobs are executed in parallel
in order to maximize cpu utilization, but if each compile job in Ninja
uses all available cores, the system is overloaded.

This implementation instantiates the client in real_main()
and passes pointers to the Jobserver class into other classes.
All tokens are returned whenever the CommandRunner aborts,
and the current number of tokens compared to the current number
of running subprocesses controls the available load capacity,
used to determine how many new tokens to attempt to acquire
in order to try to start another job for each loop to find work.

Jobserver related functions are defined as no-op for Windows
pending Windows-specific support for the jobserver.

Co-authored-by: Martin Hundebøll <martin@geanix.com>
Co-developed-by: Martin Hundebøll <martin@geanix.com>
Signed-off-by: Martin Hundebøll <martin@geanix.com>
Signed-off-by: Michael Pratt <mcpratt@pm.me>
@mcprat
Contributor Author

mcprat commented Sep 9, 2024

applied some review comments

summary:

  • moved new Clear() function to the RealCommandRunner class as ClearJobTokens()
  • ClearJobTokens() function uses range-based for loop statement
  • ClearJobTokens() function takes a const reference for the vector
  • removed headers for vectors and Edge class from jobserver.h
  • reworded comments in jobserver.h

Development

Successfully merging this pull request may close these issues.

Add GNU make jobserver client support
7 participants