Option to use file characteristics instead of timestamps #1459

greened · 2018-08-14T01:27:20Z

As noted in other issues, using timestamps to determine whether to rebuild something can be problematic. While timestamps are convenient and relatively fast, it is often desirable to key build decisions on some characteristic intrinsic to the file itself, such as a hash of its contents.

I use git a lot and it is annoying that simply changing branches triggers rebuilds. Ideally, I would be able to switch from my current work branch to some other branch, not touch any files and then switch back to the original branch and not have to rebuild anything at all. As far as I know that is not possible if the build system uses timestamps. Using file hashes would solve this particular issue.

I understand the using file hashes or other such intrisic file characteristics could slow down Ninja so using them should be an option.

metti · 2018-11-13T08:31:30Z

#929 has an implementation. Even though it is successfully used (in a fork) for thousands of builds daily, it was not considered for merging.

jhasse · 2018-11-13T11:42:53Z

#929 is single-threaded and can therefore be slower like ccache or other solutions. Also I think that this should be a command line flag instead, so that it doesn't require changes to the build definitions.

metti · 2018-11-13T12:15:15Z

It can't (or at least should not) be a command line flag, as then hashing would apply to all rules. Hashing e.g. all inputs of link rules is expensive and therefore not desired, while source files and their known dependencies are good candidates. For that distinction, it has to be part of the build description. In addition, to make use of the feature you should use the flag consistently all the times. Not just sometimes.

Responding to the single-threaded argument: Yes, it adds instructions to the single-thread loop. In reality that only matters if the single thread gets more work than it can work down (I.e. more rules finish than the one thread can process (depslog+hashlog+...)). Only then hashing hurts. Else the single thread loop waits for jobs to finish anyway. We never saw a busy ninja with hashing even with -j1000 experiments. (And for fast-finishing rules, hashing is anyway not interesting to safe time.)

Also consider: the hashing with murmur hash is considerably fast and even large source files take only some miliseconds to be hashed. In addition the hashing happens right after the source file (and the dependencies) have been seen by the compiler. Therefore they are usually read from filesystem cache.
As hashing happens during the build (in parallel to executed rules), the overall build time is usually not measureably affected.

Lastly, the implementation in #929 is opt-in and comes at no cost for people not using the feature (besides the if-statement).

jhasse · 2018-11-13T13:53:49Z

Hashing e.g. all inputs of link rules is expensive and therefore not desired, while source files and their known dependencies are good candidates.

I would say that hashing of inputs to the linker is especially desired as it can often result in the linking being skipped completely (e. g. for formatting changes or when comments are changed). As the compilation of object files finishes piece by piece, the hash calculation can happen while the build is running (as you pointed out).

If it's really too slow (e. g. with big static libraries), we could think about implementing hashes only for pure inputs and not intermediate files. That would solve the "switching Git branches causes full rebuilds" case at least.

In addition, to make use of the feature you should use the flag consistently all the times. Not just sometimes.

I would say that is an advantage: If I'm working on a single branch and want to iterate fast, I would not use hashes. If I'm comparing different features branches, I would use hashes.

Boddlnagg · 2018-11-14T07:14:45Z

In addition, to make use of the feature you should use the flag consistently all the times. Not just sometimes.

I would say that is an advantage: If I'm working on a single branch and want to iterate fast, I would not use hashes. If I'm comparing different features branches, I would use hashes.

But then there would need to be a way to switch from non-hashing to hashing, which means that the current state of files would need to be hashed, so the next rebuild can use the hashes (which probably didn't exist or are out of date if you didn't pass the flag).

jhasse · 2018-11-14T12:36:27Z

I would guess that hashing still uses timestamp first, so that if the timestamps match, there's no need to compare the hashes. It would mean, that the first few builds might unnecessarily recompile some files, but that shouldn't happen often (most of the time the timestamp heuristic is right after all).

moroten · 2018-11-14T14:24:42Z

I've also used hashing at work, with great success. It is based on #929, but with a bunch of patches, as can be seen in https://github.com/moroten/ninja/commits/hashed. hash_input = 1 on selected rules is very convenient. My branch still contains a bug where files are stat too often, O(n^2) instead of O(n). The bug is related to phony edges.

One problem is how to handle phony edges. I use phony edges to group e.g. header files. Therefore, my implementation iterates recursively through phony edges. It also relates to the bug in #1021.

Another thought is to make the hashing first class member of ninja, i.e. move the hashes into the build log. Using SHA256 would be one step towards adding support for Bazel's remote execution API. A C++ implementation can be found at https://gitlab.com/bloomberg/recc/. Wouldn't that be pretty nice?

Unfortunately, sorting out the semantics for phony edges will probably break backward compatibility.

csagan5 · 2018-11-17T10:08:15Z

Meanwhile I have concocted a solution for my use case of switching branches in Chromium (which is particularly painful); the small Go program and script I use is here: https://github.com/bromite/mtool

Feel free to adapt it to your use-cases, if it works for Chromium I bet it will work for smaller build projects as well (and it takes negligible time for my runs). Only downside is that if you use the script I published there as-is it will litter .mtool files all around in each git repository parent directory, but nothing a global gitignore cannot cure.

Interesting to note that I use the git ls-files --stage output for the hashing needs; it is possible (but less efficient) to ask git to hash also non-indexed files if your build depends on those.

Feature-wise one would expect that to implement the feature being discussed here ninja could do the same internally (without relying on git) and with similar performance results.

Trass3r · 2018-11-18T01:16:18Z

After a quick look those ninja patches don't seem to hash the compiler commandline in addition to the input files. Am I missing anything?

metti · 2018-11-18T04:06:23Z

The command line is already hashed by ninja and stored in the build log.

rulatir · 2020-12-04T06:23:01Z

I would guess that hashing still uses timestamp first, so that if the timestamps match, there's no need to compare the hashes. It would mean, that the first few builds might unnecessarily recompile some files, but that shouldn't happen often (most of the time the timestamp heuristic is right after all).

That would be very wrong. Comparing hashes can be elided if timestamps don't match; the build system can assume that the dependency was modified, and if it wasn't really modified (only touched) then the build will be suboptimal but correct. However if the timestamps match, it is still possible that the dependency was modified and its timestamp was forcibly reset (EDIT: or that the target was touched and thus made newer than its dependencies). Without double checking by comparing hashes this would result in incorrect build.

burdiyan · 2020-12-04T13:05:04Z

I guess Ninja already assumes and skips all the work when timestamps match. So in your example this would be an incorrect build anyways.

I'm not aware (probably due to my ignorance) of any tool that would produce a different output and would keep the previous timestamp. Why would one do that?

IMHO, skipping the hash check when the timestamps match is a very valid optimization.

Hash check would simply avoid some "false-dirty" rebuilds, while keeping the existing semantics already provided by Ninja.

rulatir · 2020-12-14T07:03:49Z

The problem is not false-dirty rebuilds, it's false-clean non-rebuilds. A git checkout touches everything it overwrites. It can make a target newer than a dependency (yes, people commit generated code for various valid reasons). A hash check would prevent a false-clean non-rebuild in this case.

burdiyan · 2020-12-16T10:50:30Z

@rulatir I think I mostly understand what you're saying :) I guess that's one of the reason why Bazel and other build systems based on hash-checks are really against in-tree target outputs.

Although, wouldn't this problem be solved if build system would check if targets are newer than previously known time, and rebuild if necessary?

rulatir · 2020-12-18T19:44:10Z

@rulatir I think I mostly understand what you're saying :) I guess that's one of the reason why Bazel and other build systems based on hash-checks are really against in-tree target outputs.

How does hash depend on where the file is?

Although, wouldn't this problem be solved if build system would check if targets are newer than previously known time, and rebuild if necessary?

I understand that the main benefit of using timestamps is in avoiding the need to maintain a separate database that keeps track of "previously known" version signatures. If you are willing to forego that benefit, why shouldn't those signatures be hashes?

seanmiddleditch · 2021-12-20T23:38:02Z

A strategy we employ in large distributed content pipelines (not using ninja, but it's all similar concepts) uses hash caches keyed off a triplet like (timestamp; file size; inode).

The chances in a typical and non-malicious environment where a file changes without one of those three changing is incredibly low... and ninja has no business trying to protect itself against malicious manipulation IMO, since users can just manipulate build files and ninja caches at that point; if they want to break things, they will.

I will note that the timestamp used in a hash cache should be considered out of date if it's at all different, not just newer-than. Otherwise rolling back a change to a character in a source file would go undetected.

For those possible cases where the triplet isn't reliable to detect file mutation (I've personally never encountered such a case, but of course I haven't encountered every build environment that exists!), a --rehash flag could force a rehash of every file considered by the build. I've always added those kinds of things to content pipeline tools, though I've never needed them outside of testing/verification of the hash cache itself.

We've in the past considered extending those triplets with additional information, e.g. other file attributes or permissions, as well as adding options to use underlying hash metadata on some esoteric file systems or some SCM backends, but the basic triplet has proven reliable for us so for on all systems we've cared about.

For content pipelines, the hash cashing is essential since the files are often very large and there's often very many of them (far more than there are source code files, generally) and content hashing scales with total file size, of course. I've never measured what the difference would be in source code builds since I've generally just relied on standard ninja/make/msbuild behavior for such things. I imagine the results would be less impactful than with content pipelines, but probably still worth it for large projects like Chromium or the typical AAA game's code base.

burdiyan · 2021-12-21T09:07:28Z

@seanmiddleditch the problem I find with anything timestamp is that a branch switch forces unnecessary rebuilds. Just touching a file marks the target dirty. Having a proper hashing at least as an opt-in strategy could save many people lots of unnecessary build time, IMHO.

seanmiddleditch · 2021-12-21T19:48:37Z

@burdiyan that's the whole point of the hashes. The timestamps are just an acceleration for the common case so you aren't needlessly rehashing all your files when nothing's changed on incremental builds.

(edit) I mean the timestamps in what I proposed are just an acceleration; status quo ninja today has the problems you describe, of course.

rulatir · 2022-02-10T12:42:28Z

@burdiyan that's the whole point of the hashes. The timestamps are just an acceleration for the common case so you aren't needlessly rehashing all your files when nothing's changed on incremental builds.

It's 2022. Version control systems that destroy all assumptions about timestamps are a reality now. THEY are the common case.

seanmiddleditch · 2022-04-20T01:13:27Z

Just to make it very clear: I am not arguing against hash-based builds. :) I'm just offering that adding a cache in front of the hashes helps maintain very fast local builds. It's a "best of both worlds" approach: hashes when the VCS stomps all over timestamps, and timestamps when files are completely untouched. Win-win.

burdiyan · 2022-04-21T20:05:29Z

I agree with that! In my wacky scripts I do something similar, i.e. check the timestamp first, and if it's different check the hashes. So indeed win-win. Pretty much always it's faster to hash a file even if it's unnecessary than to run an unnecessary build, IMHO.

jcelerier · 2022-06-09T12:55:53Z

I'm having a case where the build machine periodically resets its time to 00:00 which wreaks havok in ninja + cmake (it seems to make ninja call cmake's configure in a loop), it would be great to have support for hash-based rather than time-based things to overcome this.

mathstuf · 2022-06-09T14:53:44Z

While this might fix this particular fallout of that root problem, I suspect that you really need to get that fixed because this is not the only thing on a modern system that cares about the system clock being…useful.

jcelerier · 2022-06-09T14:57:42Z

It really can't - the system is synced to an external PTP clock which behaves like this.

mathstuf · 2022-06-09T15:52:29Z

I would doubt that such behavior is useful for a "Precision Time Protocol" device. I highly recommend you solve the root cause because this issue isn't moving anywhere near fast enough for your problem (cf. basically nowhere beyond ideation so far).

hadrielk · 2022-06-09T17:21:09Z

this issue isn't moving anywhere near fast enough for your problem

Don't worry, according to this PTP clock, there'll be plenty of time remaining to fix it.

(sorry, couldn't help it)

jcelerier · 2022-08-13T20:07:26Z

I highly recommend you solve the root cause

Not much possibility - those are proprietary audio devices which act as a master clock setting to synchronize themselves independently of real-world time ; since the linux driver is in-kernel this sets the whole machine to the same clock.

StevenShw · 2024-01-25T11:49:56Z

Sad, I want to know the final solution...

maxwell-bland · 2024-03-08T23:55:21Z

If someone wants to build a system based on (potentially) inotifywait that ties in as an easy to use plugin and for a given target simplifies the output of "-t commands" to just that set of commands necessary to recompile steps dependent on files written to since the last build in the directory, I will personally mail you 5 USD as a bounty. 10 USD and a can of room-temp/warm Chicago beer if you make it so that it "just works", where the next run of a target with the plugin enabled only runs the commands related to the changes.

No idea if this solves all the hashing nonesense or whatever Bazel is calling it nowadays (hermetic ... synonym "esoteric" ... at least they are up front about it (-; ). Anyways, I think inotifywait is a good approach, at least in some simple case. I'll try it once I get some time, but if it gets to that point I'll post results here and the bounty will be gone.

ericriff · 2024-03-14T16:41:49Z

I'm also interested on this.

greened changed the title ~~Feature requestion: Option to use file characteristics instead of timestamps~~ Feature request: Option to use file characteristics instead of timestamps Aug 14, 2018

jhasse added the feature label Oct 29, 2018

Trass3r mentioned this issue Nov 13, 2018

add a way to mark all outputs as up-to-date to avoid rebuilds #1493

Open

jhasse changed the title ~~Feature request: Option to use file characteristics instead of timestamps~~ Option to use file characteristics instead of timestamps Nov 13, 2018

jhasse mentioned this issue Jun 5, 2019

Having an option to ignore generator rule/statement #1585

Closed

jhasse mentioned this issue Feb 20, 2020

Targets compiled repeatedly on NTFS share #1740

Open

csagan5 mentioned this issue Jun 1, 2020

Docker build support nikolowry/bromite-builder#8

Open

csagan5 mentioned this issue Sep 20, 2020

Do not perform domain substitutions within comments ungoogled-software/ungoogled-chromium#1026

Open

greened mentioned this issue Oct 26, 2020

hash based input checking #929

Closed

jdrouhard mentioned this issue Mar 31, 2021

Provide resiliency against inputs changing during the build #1943

Merged

JanWielemaker mentioned this issue Aug 10, 2021

Avoid rerunning mkvmi repeatedly for no reason SWI-Prolog/swipl-devel#875

Closed

marc-hb mentioned this issue Nov 10, 2021

A cmake re-run after a git commit will cause all source files to re-compile. zephyrproject-rtos/zephyr#40167

Closed

This comment was marked as abuse.

Sign in to view

sunfishcode mentioned this issue Jan 19, 2023

Remove the device and inode numbers from the API. WebAssembly/wasi-filesystem#81

Merged

ekrich mentioned this issue May 19, 2023

Skip nativeLink if the input hasn't changed from the previous build scala-native/scala-native#3287

Merged

jhasse added this to the 2.0.0 milestone Apr 14, 2024

jhasse mentioned this issue May 24, 2024

Prevent unnecessary reconfig by not writing build.ninja to .ninja_log. #1580

Closed

krolli mentioned this issue Jun 13, 2024

Ninja build edge dirty only when timestamp is newer leads to build not being triggered in some cases #2463

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Option to use file characteristics instead of timestamps #1459

Option to use file characteristics instead of timestamps #1459

greened commented Aug 14, 2018

metti commented Nov 13, 2018

jhasse commented Nov 13, 2018

metti commented Nov 13, 2018

jhasse commented Nov 13, 2018

Boddlnagg commented Nov 14, 2018

jhasse commented Nov 14, 2018

moroten commented Nov 14, 2018

csagan5 commented Nov 17, 2018

Trass3r commented Nov 18, 2018 •

edited

Loading

metti commented Nov 18, 2018

rulatir commented Dec 4, 2020 •

edited

Loading

burdiyan commented Dec 4, 2020

rulatir commented Dec 14, 2020 •

edited

Loading

burdiyan commented Dec 16, 2020

rulatir commented Dec 18, 2020

seanmiddleditch commented Dec 20, 2021

burdiyan commented Dec 21, 2021

seanmiddleditch commented Dec 21, 2021 •

edited

Loading

rulatir commented Feb 10, 2022

seanmiddleditch commented Apr 20, 2022

burdiyan commented Apr 21, 2022 •

edited

Loading

This comment was marked as abuse.

jcelerier commented Jun 9, 2022

mathstuf commented Jun 9, 2022

jcelerier commented Jun 9, 2022

mathstuf commented Jun 9, 2022

hadrielk commented Jun 9, 2022

jcelerier commented Aug 13, 2022 •

edited

Loading

StevenShw commented Jan 25, 2024

maxwell-bland commented Mar 8, 2024

ericriff commented Mar 14, 2024

Option to use file characteristics instead of timestamps #1459

Option to use file characteristics instead of timestamps #1459

Comments

greened commented Aug 14, 2018

metti commented Nov 13, 2018

jhasse commented Nov 13, 2018

metti commented Nov 13, 2018

jhasse commented Nov 13, 2018

Boddlnagg commented Nov 14, 2018

jhasse commented Nov 14, 2018

moroten commented Nov 14, 2018

csagan5 commented Nov 17, 2018

Trass3r commented Nov 18, 2018 • edited Loading

metti commented Nov 18, 2018

rulatir commented Dec 4, 2020 • edited Loading

burdiyan commented Dec 4, 2020

rulatir commented Dec 14, 2020 • edited Loading

burdiyan commented Dec 16, 2020

rulatir commented Dec 18, 2020

seanmiddleditch commented Dec 20, 2021

burdiyan commented Dec 21, 2021

seanmiddleditch commented Dec 21, 2021 • edited Loading

rulatir commented Feb 10, 2022

seanmiddleditch commented Apr 20, 2022

burdiyan commented Apr 21, 2022 • edited Loading

This comment was marked as abuse.

jcelerier commented Jun 9, 2022

mathstuf commented Jun 9, 2022

jcelerier commented Jun 9, 2022

mathstuf commented Jun 9, 2022

hadrielk commented Jun 9, 2022

jcelerier commented Aug 13, 2022 • edited Loading

StevenShw commented Jan 25, 2024

maxwell-bland commented Mar 8, 2024

ericriff commented Mar 14, 2024

Trass3r commented Nov 18, 2018 •

edited

Loading

rulatir commented Dec 4, 2020 •

edited

Loading

rulatir commented Dec 14, 2020 •

edited

Loading

seanmiddleditch commented Dec 21, 2021 •

edited

Loading

burdiyan commented Apr 21, 2022 •

edited

Loading

jcelerier commented Aug 13, 2022 •

edited

Loading