
Provide resiliency against inputs changing during the build #1943

Merged: 1 commit into ninja-build:master on Jun 10, 2022

Conversation

@jdrouhard
Contributor

@jdrouhard jdrouhard commented Mar 26, 2021

This is a better, faster approach than #1753 for providing robustness against inputs changing during the build. It fixes #1162 and #1940.

When the command finishes, record in the build log the remembered
temporary file's mtime from when the edge started, rather than the
outputs' on-disk mtimes. Subsequent runs will use that as the mtime
for the output.

This provides robustness against inputs changing while the command
itself is running. If an input is changed, the subsequent run will
detect the output as dirty since its recorded mtime reflects when the
build command began, not when the output was actually written to disk.
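
A minimal, self-contained sketch of the scheme (POSIX-only, and every name here is illustrative rather than ninja's actual internals):

#include <cstdio>
#include <ctime>
#include <sys/stat.h>

// Touch a sentinel file so its mtime reflects "now", then stat it.
// The returned time is cached when the edge starts running.
static time_t TouchAndStat(const char* path) {
  FILE* f = fopen(path, "w");  // creates or truncates; bumps the mtime
  if (!f)
    return -1;  // caller can fall back to the old behavior
  fclose(f);
  struct stat st;
  return stat(path, &st) == 0 ? st.st_mtime : -1;
}

int main() {
  // On edge start: capture the filesystem's notion of the start time.
  time_t start = TouchAndStat(".ninja_tmp");
  // ... the edge's command runs; inputs may change while it does ...
  // On edge finish: record `start` in the build log as the outputs'
  // mtime, so a later run sees any input modified mid-build as newer
  // than the recorded time and marks the output dirty.
  printf("mtime to record in the build log: %ld\n", (long)start);
  return 0;
}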

@jdrouhard
Contributor Author

cc: @bradking @jhasse

If we are more comfortable with this approach, we can just close #1936. I apologize for all the back-and-forth on the previous attempt to provide the input file robustness. This seems so much cleaner that I wish I had arrived here to begin with.

@jdrouhard jdrouhard force-pushed the output_time_before_command branch 4 times, most recently from 54b3de6 to 8bee1e5 on March 27, 2021 at 15:36
@jdrouhard jdrouhard changed the title from "When an edge starts to build, create a temporary file and remember its mtime" to "Provide resiliency against inputs changing during the build" on Mar 28, 2021
@jdrouhard
Contributor Author

@jhasse this is ready to merge and is far superior to the original approach in #1753; let me know if you have any questions about this. I will close #1936 since this supersedes it and is much improved over it.

@bradking having a positive CMake test run with this branch would be a good sanity check.

@bradking
Contributor

The CMake test suite fails several tests with this change:

  • RunCMake.Ninja
  • RunCMake.NinjaMultiConfig
  • RunCMake.file
  • RunCMake.try_compile

@jdrouhard I'll contact you externally about reproducing the failures.

@jdrouhard
Contributor Author

All CMake tests pass with this branch now. The failing tests were due to a slight hole in a regression test I added earlier in #1935. Generator rules should continue to record outputs' mtimes as they are on disk when a command finishes instead of using the time when the command started. This branch now makes generator rules use the existing behavior instead of the new behavior.

@bradking feel free to double check this. I ran the entire CMake test suite, including all Qt5 tests.
@jhasse this is ready for review and merging.

@bradking
Contributor

@jdrouhard the RunCMake.Ninja test still fails. Please run it with Qt 5.15 available.

@jhasse (Collaborator) left a comment

I'm wondering what the performance implications of touching a file are. I'm also scared about edge cases like NFS-mounted filesystems.

Shouldn't one file be enough? I know that this will result in unnecessary rebuilds if an input is changed before its edge has started, but better safe than sorry (and it would bring clarity to the performance implications of this PR).

I think one file might be a good idea. First of all, we can create it next to .ninja_deps and .ninja_log (a directory we definitely have write access to). Second, we might re-use it as a locking mechanism: currently it's possible to run ninja multiple times in a directory, resulting in all sorts of bugs. What do you think?


Another thought: What about the hash approach Evan suggested? While this would be a HUGE change, it might be something we want to do in the future anyway, see #1459

ASSERT_EQ(1u, fs_.files_created_.count("out 2.rsp"));
ASSERT_EQ(1u, fs_.files_created_.count("out 3.rsp"));
ASSERT_EQ(1u, fs_.files_created_.count(".ninja_tmp"));
Collaborator

IMHO this is an implementation detail and shouldn't be part of the test.

Contributor Author

The file is created by the virtual file system during the test; putting the assert here makes it clear why there were 3 files created.

Collaborator

Still, I would rather remove the explicit check for ASSERT_EQ(files_created + 2, fs_.files_created_.size());. This unit test shouldn't care.

Contributor Author

Ok, I'll remove it.

ASSERT_EQ(1u, fs_.files_removed_.count("out 2.rsp"));
ASSERT_EQ(1u, fs_.files_removed_.count("out 3.rsp"));
ASSERT_EQ(1u, fs_.files_removed_.count(".ninja_tmp"));
Collaborator

This should definitely be part of the test though. Maybe fs_ should have a method to count files and we should check that everything was cleaned up.
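
A toy illustration of that suggestion (entirely hypothetical; the real VirtualFileSystem API may differ):

#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// A fake file system that tracks live files, so a test can snapshot the
// count before a build and assert that everything was cleaned up after.
struct FakeFileSystem {
  std::map<std::string, std::string> files_;
  void Create(const std::string& path) { files_[path] = ""; }
  void Remove(const std::string& path) { files_.erase(path); }
  std::size_t CountFiles() const { return files_.size(); }
};

int main() {
  FakeFileSystem fs;
  std::size_t before = fs.CountFiles();
  fs.Create("out 2.rsp");  // response file created during the build
  fs.Remove("out 2.rsp");  // ...and removed when the edge finishes
  assert(fs.CountFiles() == before);  // nothing left behind
  return 0;
}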

@@ -198,7 +198,7 @@ TEST_F(DiskInterfaceTest, MakeDirs) {
   EXPECT_EQ(0, fclose(f));
 #ifdef _WIN32
   string path2 = "another\\with\\back\\\\slashes\\";
-  EXPECT_TRUE(disk_.MakeDirs(path2.c_str()));
+  EXPECT_TRUE(disk_.MakeDirs(path2));
Collaborator

Good catch! Too bad our CI doesn't run clang-tidy on Windows (yet).

@jdrouhard
Contributor Author

I'm wondering what the performance implications of touching a file are. I'm also scared about edge cases like NFS-mounted filesystems.

For any edge, it only touches a single file. I don't believe there will be much performance impact here since the edge command will be writing its outputs to the disk when it runs anyway.

Shouldn't one file be enough? I know that this will result in unnecessary rebuilds if an input is changed before its edge has started, but better safe than sorry (and it would bring clarity to the performance implications of this PR).

I think one file might be a good idea. First of all, we can create it next to .ninja_deps and .ninja_log (a directory we definitely have write access to). Second, we might re-use it as a locking mechanism: currently it's possible to run ninja multiple times in a directory, resulting in all sorts of bugs. What do you think?

As we've discovered in the previous PRs #1753 and #1932, we really do need to track the beginning of each edge separately. However, I can change it so the temp file we use lives next to .ninja_deps and .ninja_log, but each edge will still touch/stat it to get the current time. I could clean up a bit of the refactoring I did around DirName(...) that way.

Using it as a locking mechanism could also come in handy if we decide to clean the file up at the end of the run. I'm not going to tackle that in this PR, though.

Another thought: What about the hash approach Evan suggested? While this would be a HUGE change, it might be something we want to do in the future anyway, see #1459

This would be a massive change that would likely break many many things. There are many things to consider with a hash approach, and in general, using a hash instead of mtime has its own set of problems that are just different than mtimes. At least the approach in this PR makes it pretty resilient and is a definite improvement over current functionality.

@bradking
Contributor

I ran with Qt 5.14 and I cannot reproduce. If there was a specific change in Qt 5.15 that breaks this unrelated test, I'd be surprised. What's the specific condition that is failing?

@jdrouhard there is a section of the RunCMake.Ninja test that only runs with Qt 5.15, and not 5.14.

@jhasse
Collaborator

jhasse commented Mar 31, 2021

For any edge, it only touches a single file. I don't believe there will be much performance impact here since the edge command will be writing its outputs to the disk when it runs anyway.

What if the edge doesn't have any outputs? Also, the edge writes to its outputs, while you create, stat, and delete a file; depending on the filesystem, this might actually take longer.

Just speculation on my part, too; we should really have a good way to back this up with a benchmark, see #1733.

As we've discovered in the previous PRs #1753 and #1932, we really do need to track the beginning of each edge separately.

If I understood it correctly, it definitely makes things tricky, but not impossible. Can you elaborate?

@jdrouhard
Contributor Author

What if the edge doesn't have any outputs? Also, the edge writes to its outputs, while you create, stat, and delete a file; depending on the filesystem, this might actually take longer.

If the edge doesn't have any outputs, then it doesn't create or stat the temp file. For the edges that do produce outputs, I am updating the PR to simply use a sentinel .ninja_tmp file in the base directory. Each edge will simply touch the file and stat it, not add/remove it.

As we've discovered in the previous PRs #1753 and #1932, we really do need to track the beginning of each edge separately.

If I understood it correctly, it definitely makes things tricky, but not impossible. Can you elaborate?

I'm not sure what the alternative is: record the same mtime for every single output whose command was run during the course of the build?

@jdrouhard
Contributor Author

@bradking tested with Qt 5.15. restat rules must also restat the outputs when the command finishes (along with generator rules).
@jhasse I've changed it so the .ninja_tmp file is created once per build next to .ninja_log/.ninja_deps, and if it can't be touched/stat'd when an edge starts, it falls back to the old behavior.

Let me know what else you'd like to see.

@dmoody256

@jdrouhard could you try out my unit test from #1941? Whether this PR goes in or that one, I would like the unit test to be added to protect the hardlinking scenario that is currently causing me trouble in my builds.

@bradking
Contributor

bradking commented Apr 1, 2021

tested with Qt 5.15. restat rules must also restat the outputs

Thanks. With that fix, this PR as of commit 8d191a9550 passes CMake's entire test suite.

@jdrouhard
Contributor Author

jdrouhard commented Apr 1, 2021

@jdrouhard could you try out my unit test from #1941? Whether this PR goes in or that one, I would like the unit test to be added to protect the hardlinking scenario that is currently causing me trouble in my builds.

@dmoody256 I can try your test on my branch. I'd like to see this get merged as is, and then we can change your PR to just be a new regression test, if that works for you?

@dmoody256

@jdrouhard sure, that's fine, I'll try to follow up in my PR

@jdrouhard
Contributor Author

@jhasse this should be good to go. If you need more time to look it over, maybe you could at least add a label for the next ninja release?

@jdrouhard
Contributor Author

@jhasse ping

@jdrouhard
Contributor Author

If it doesn't know the inputs until after the command, [...]

It doesn't know all inputs. For the inputs it knows, it would work, wouldn't it?

Since the major use-case for this is source files being changed in an editor while running ninja, and the unknown inputs are generated files in the build directory most of the time (correct me if I'm wrong here), I think that would be good enough while not being 100% accurate.

It wouldn't work even with inputs it does know unless we stat every input right before every command. This was exactly what #1753 did, but it was rolled back due to performance concerns and CMake test suite failures. #1936 was the follow-up that did the same thing with the CMake tests fixed, but it still had the performance issue of way too many stat calls.

This PR does work 100% of the time with less performance impact since it doesn't require stat'ing every input before every command.

@jhasse
Collaborator

jhasse commented May 2, 2021

It wouldn't work even with inputs it does know unless we stat every input right before every command.

Why not?

@jdrouhard
Contributor Author

It wouldn't work even with inputs it does know unless we stat every input right before every command.

Why not?

This has been discussed at length in those other PRs I mentioned. I'm pretty sure you were part of those discussions.

In any case, the build log needs to remember an mtime it can use to compare against on subsequent runs. Where do we source that mtime? It has to come from something at the time the build command begins, or we lose the "anchor" time for knowing whether something was changed since the inputs were read by whatever command produced the output.

If we get it from the most recent mtime, then we have to know the most recent input's mtime when the build command starts. That's what #1753 and #1936 did. Finding the most recent input's mtime is a performance bottleneck, since we have to stat every known input. And if we don't know all the inputs, it won't be an accurate time anyway.

Currently ninja doesn't track anything like that at all, it simply records the output's mtime after the command finishes.

Like I said, this approach (most recent input mtime) was tried in those PRs. You were willing to give them a shot; I'm asking you to give this PR the same, as it solves the problem better. Let's just let it hit master so it gets more exposure for a while.

@jhasse
Collaborator

jhasse commented May 2, 2021

In any case, the build log needs to remember an mtime it can use to compare against on subsequent runs.

Yes, but only because we always compare against that mtime. Couldn't ninja use the output mtime instead if it was comparing against an (at the time of the previous run) unknown input?

@jdrouhard
Contributor Author

jdrouhard commented May 2, 2021

In any case, the build log needs to remember an mtime it can use to compare against on subsequent runs.

Yes, but only because we always compare against that mtime. Couldn't ninja use the output mtime instead if it was comparing against an (at the time of the previous run) unknown input?

You are now suggesting a build log version bump where we record both the output's actual mtime and a maybe-incorrect most recent input mtime in the build log? We would also have to be able to identify which of those two we should use based on context we don't have. Somehow we'd have to track whether the known input set changed for a particular output since the last run, and that gets ambiguous (not possible) with restat and generator rules which create inputs for other outputs, including order-only dependencies that create inputs for other edges when they run.

Doing it as you suggest is actually a more intrusive and scary change for a solution that wouldn't even provide the same safety as this PR.

src/build.cc (outdated):
// XXX: this will block; do we care?
for (vector<Node*>::iterator o = edge->outputs_.begin();
     o != edge->outputs_.end(); ++o) {
  if (!disk_interface_->MakeDirs((*o)->path()))
    return false;
  // Touch the temp file once so a stat of it yields the command start time.
  if (build_start == -1) {
    disk_interface_->WriteFile(tmp_mtime_file_path_, "");
Collaborator

Which command updates the mtime in RealDiskInterface::WriteFile? Couldn't fwrite just do nothing if it was passed an empty buffer?

Contributor Author

fwrite will update the file to be empty if it is passed an empty buffer; it can't simply decide to do nothing.

Collaborator

What if it is already empty?

Contributor Author

It will still write the empty string to the file and update the mtime.

Collaborator (@jhasse, May 2, 2021)

Is this documented / specified somewhere?

Wouldn't matter if we had real-world tests we could run on all systems :(

Contributor Author

https://en.cppreference.com/w/c/io/fwrite

While it's ultimately up to the filesystem implementation to update the mtime if a file is written to, fwrite() does guarantee that the file is written to. We specifically take advantage of using the filesystem for acquiring an mtime because we compare this acquired time to the input mtimes on subsequent runs, which are presumably residing on the same filesystem (or one with times that are closely in sync).

Collaborator

If size or count is zero, fwrite returns zero and performs no other action.

Sounds like it doesn't write to the file.

Contributor Author

Apologies, it's not fwrite() that causes the mtime to be updated, it's the fopen(path, "w") call that does so. The fact that we're writing 0 bytes is actually an optimization that means the fwrite() call itself is a no-op (like you say).

fopen() creates the file if it does not exist, or truncates it if it already exists. Either way, the mtime is updated.
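
A tiny self-contained demonstration of that point (POSIX stat is used here only to read the timestamp back):

#include <cstdio>
#include <sys/stat.h>

int main() {
  // Opening with "w" creates the file or truncates it to zero length;
  // either way the filesystem updates its mtime, even though nothing
  // calls fwrite (writing zero bytes would be a no-op anyway).
  FILE* f = fopen("touched.tmp", "w");
  if (!f)
    return 1;
  fclose(f);
  struct stat st;
  if (stat("touched.tmp", &st) == 0)
    printf("mtime after touch: %ld\n", (long)st.st_mtime);
  return 0;
}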

}

// Dirty if the output is older than the input.
if (!used_restat && most_recent_input && output->mtime() < most_recent_input->mtime()) {
Collaborator

output->mtime() is the command_start_time_ of the previous run, right?

Contributor Author

Here, output->mtime() is the actual mtime of the file on disk when this build was started (it's calculated the first time the output is considered when generating the build graph).

@jhasse
Collaborator

jhasse commented May 2, 2021

I'm reluctant to merge this PR because I think it's an invasive change to how ninja works: after this we won't be comparing output mtimes anymore. As you can tell, I haven't fully understood all the edge cases either, and I don't want to do a revert again.

There's also the problem that we don't have any good performance testing in place, so the implications are probably small. Hopefully. For every use case?

Another thing that would need to change, but it's a bigger issue in ninja: the unit tests only use the VirtualFileSystem, not the real thing. That's a problem especially for changes that rely on filesystem behavior, like this PR.

Last but not least, I've always thought about adding a locking mechanism to ninja, so that two runs don't interfere with each other. Maybe we could re-use .ninja_tmp for that (and probably rename it to ninja.lock or something like that).

Another idea to think about would be to re-use .ninja_log to get the mtime. Not sure if that is feasible.

Somehow we'd have to track whether the known input set changed for a particular output since the last run, and that gets ambiguous (not possible) with restat and generator rules which create inputs for other outputs, including order-only dependencies that create inputs for other edges when they run.

Why wouldn't it be possible?

@jdrouhard
Contributor Author

I'm reluctant to merge this PR because I think it's an invasive change to how ninja works: after this we won't be comparing output mtimes anymore. As you can tell, I haven't fully understood all the edge cases either, and I don't want to do a revert again.

I understand. I know it's a big change, but with any good improvement like this, there's a bit of risk involved. It solves a number of issues that have been opened about this problem and definitely makes ninja more resilient to real-world usage.

In this case, it will still be comparing output mtimes, but what it compares against is sometimes the time the command that produced the output began, and other times exactly what it is now (the mtime of the output when the command finished, which is the case for restat and generator rules).

There's also the problem that we don't have any good performance testing in place, so the implications are probably small. Hopefully. For every use case?

This is true. The only way to gauge the true performance impact is to let it be used in as many real-world situations as possible before a release is made with the change in it. If you have suggestions for how we can increase the confidence level in the performance impact, I'm all ears. My anecdotal evidence is from a Linux perspective, where I daily run builds with 4000 edges that each have hundreds of inputs. There's no discernible performance impact vs current master.

Another thing that would need to change, but it's a bigger issue in ninja: the unit tests only use the VirtualFileSystem, not the real thing. That's a problem especially for changes that rely on filesystem behavior, like this PR.

Agreed. We can unit test all day long, but if a real file system doesn't behave exactly the way a test author assumes, the unit tests won't capture real-world behavior.

System tests like these are typically more brittle as they will rely on factors outside the control of the test, but having some would be a good step forward in any case.

Last but not least, I've always thought about adding a locking mechanism to ninja, so that two runs don't interfere with each other. Maybe we could re-use .ninja_tmp for that (and probably rename it to ninja.lock or something like that).

That seems like a good change. Maybe a follow-up to this PR after it's merged?

Somehow we'd have to track whether the known input set changed for a particular output since the last run, and that gets ambiguous (not possible) with restat and generator rules which create inputs for other outputs, including order-only dependencies that create inputs for other edges when they run.

Why wouldn't it be possible?

Without actually trying this change, I can't say for certain, but after my experience with the various attempts to solve this problem, I know that it's not quite as simple as tracking the inputs for an edge. Some unusual rules just don't mesh with simply checking whether inputs have changed.

It might be possible. I do think, however, that this change would be even more invasive than what this PR does, and it would only sometimes help the resiliency.

@jhasse
Collaborator

jhasse commented May 2, 2021

Thank you for being so patient with me :) Sorry it all takes so long!

If you have suggestions for how we can increase the confidence level in the performance impact, I'm all ears.

Fix #1733 and let a script run on CI (although I'm not sure how consistent performance is during a run on GitHub Actions; between runs it's probably comparable).

@jdrouhard jdrouhard force-pushed the output_time_before_command branch from 3104540 to a746bb1 on May 2, 2021 at 18:42
@jdrouhard
Contributor Author

Thank you for being so patient with me :) Sorry it all takes so long!

No worries, thanks for seeing this through. I've been trying to get this problem fixed for about 5 years, so I'm glad there's finally traction for it.

If you have suggestions for how we can increase the confidence level in the performance impact, I'm all ears.

Fix #1733 and let a script run on CI (although I'm not sure how consistent performance is during a run on GitHub Actions; between runs it's probably comparable).

Are you making #1733 a prerequisite for getting this PR merged, or can we merge this, assuming we can get to #1733 before the next release?

@jdrouhard
Contributor Author

ping @jhasse -- any other comments before you get this merged in?

@jhasse
Collaborator

jhasse commented Jun 9, 2021

Are you making #1733 a prerequisite for getting this PR merged [...]

No, but I think I will release 1.11 first, then merge this PR into master to get some more testing. Maybe someone will fix #1733 along the way, but I doubt it :(

@jhasse jhasse added this to the 1.12.0 milestone Jun 9, 2021
@jdrouhard
Contributor Author

Ping @jhasse @nico

This has been thoroughly tested by the full CMake test suite, and we've been using this patch at my place of work for nearly a year with zero issues.

I know you wanted to release a new version before merging this so it could have maximal time in HEAD for testing, but it doesn't seem like a release is coming soon.

Can we get this merged so we can get more exposure? I don't think this PR is going to have any discernible performance impact.

@jdrouhard
Contributor Author

Are you making #1733 a prerequisite for getting this PR merged [...]

No, but I think I will release 1.11 first, then merge this PR into master to get some more testing. Maybe someone will fix #1733 along the way, but I doubt it :(

Looks like 1.11.0 was released recently. Congrats on the release! Looks like a great one. I rebased this PR on current master. Could we get it merged soon to let this bake as long as possible before the next release? Thanks @jhasse!

@jhasse
Collaborator

jhasse commented Jun 9, 2022

Yes :)

Let's get the filename right first, though. ninja.lock or .ninja_lock?

@jdrouhard
Contributor Author

I don’t have a preference whatsoever on the lock file name. Which do you prefer?

@jhasse
Collaborator

jhasse commented Jun 10, 2022

Let's use .ninja_lock, which fits in with the other files.

In a future PR we would then write the PID to this file to check if it's a zombie file, right? I don't have any experience in how to do directory locking :D

…he build runs

When an edge starts to run, create a temporary lock file in the build
directory, stat it, and cache its mtime. When the command finishes, use
the temporary lock file's mtime from when the edge started running as
the mtime that is recorded in the build log for each of the edge's
output(s). Subsequent runs will use that as the mtime for the output(s).

This provides robustness against inputs changing while the command
itself is running. If an input is changed, the subsequent run will
detect the output as dirty since its recorded mtime reflects when the
build command began, not when the output was actually written to disk.

Generator and restat rules are exempt from this and will continue to
record their actual mtime on disk at the time the command finished in
the build log (unless the restat rule cleans the output). This avoids
potential infinite loops when the generator rule touches input
dependencies of the output(s) or a restat rule intentionally changes
implicit dependencies of its output.
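
The recording policy the message describes, condensed into a hedged sketch (structure and names are illustrative, not the PR's actual code):

#include <cstdio>
#include <ctime>

struct Edge {
  bool is_generator;
  bool is_restat;
};

// start_time: captured from the lock file when the edge began running.
// disk_mtime: the output's real mtime after the command finished.
time_t MtimeToRecord(const Edge& e, time_t start_time, time_t disk_mtime) {
  // Generator and restat rules keep the old behavior and record the
  // on-disk mtime; recording the start time instead could rebuild
  // forever when such rules touch their own inputs or dependencies.
  if (e.is_generator || e.is_restat)
    return disk_mtime;
  // All other rules record the command start time, so inputs modified
  // mid-build look newer than the output on the next run.
  return start_time;
}

int main() {
  Edge plain = {false, false};
  printf("%ld\n", (long)MtimeToRecord(plain, 1000, 2000));  // prints 1000
  return 0;
}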
@jdrouhard
Contributor Author

Let's use .ninja_lock, which fits in with the other files.

Done.

In a future PR we would then write the PID to this file to check if it's a zombie file, right? I don't have any experience in how to do directory locking :D

On Linux at least, this is pretty easy. We can make ninja acquire a lock on this file before running, which would block other ninja processes from running concurrently. I'm not as familiar with the Windows side of things, but I'm sure something similar can be achieved.
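
On Linux, for example, that could be a simple flock on the new file (a sketch under that assumption, not something this PR implements):

#include <cstdio>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int main() {
  // Open (or create) the lock file in the build directory.
  int fd = open(".ninja_lock", O_CREAT | O_RDWR, 0644);
  if (fd < 0)
    return 1;
  // Take an exclusive lock without blocking; if another ninja process
  // already holds it, fail fast instead of racing the build.
  if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
    fprintf(stderr, "another ninja is already running here\n");
    return 1;
  }
  // ... run the build; the kernel releases the lock when the process
  // exits, so a crashed ninja never leaves a stale lock behind ...
  close(fd);
  return 0;
}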

@jhasse jhasse merged commit 9ae7926 into ninja-build:master Jun 10, 2022
@jdrouhard jdrouhard deleted the output_time_before_command branch June 10, 2022 22:16
Successfully merging this pull request may close these issues:

Long running compile commands can cause incorrect dirty status computation