
[WIP] mtime+content tracking #8623

Closed · gilescope wants to merge 41 commits

@gilescope (Contributor) commented Aug 16, 2020

This adds the unstable option -Zhash-tracking, which uses content hashes and file sizes as well as mtimes to prevent compile cascades when mtimes are disturbed but the content is unchanged (Fixes #6529). This paves the way towards fixing #5918.
It works even better with -Zbinary-dep-depinfo.

It depends on rust-lang/rust#75594, which makes the following enabling rustc changes:

  • .rmeta now has the SVH, as well as the Rust version, in its uncompressed header.
  • .rlib previously contained a lib.rmeta file; this is now renamed to lib-crate-name-svh.rmeta.
  • dylibs now have an additional SVH symbol.
  • .d dependency files now contain the expected file size and hash.

This is almost zero cost when mtimes are up to date (rustc already hashed everything it needed to, but didn't expose this to cargo). Cost is only incurred when mtime checks fail, and typically the file-size check means we can skip the more expensive hash check.

Worst case: if the files' mtimes are not up to date and the files differ but have the same size, then it will take cargo slightly longer to detect this and fall back to a full rebuild.
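
To make the check order concrete, here is a rough sketch of the per-file test (illustrative only; the names and signatures are made up for this example, not the actual fingerprint code):

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

/// Illustrative sketch of the cheapest-first staleness check described above.
/// The `expected_*` values would come from the stored fingerprint / dep-info.
fn file_is_dirty(
    path: &Path,
    expected_mtime: SystemTime,
    expected_len: u64,
    expected_hash: u64,
    hash_file: impl Fn(&Path) -> u64,
) -> bool {
    let meta = match fs::metadata(path) {
        Ok(m) => m,
        Err(_) => return true, // missing input: definitely dirty
    };
    // 1. mtime unchanged: fresh, exactly as today, with no extra cost.
    if meta.modified().map(|m| m == expected_mtime).unwrap_or(false) {
        return false;
    }
    // 2. mtime differs and the size changed: dirty, no hashing needed.
    if meta.len() != expected_len {
        return true;
    }
    // 3. mtime differs but the size matches: only now pay for a content hash.
    hash_file(path) != expected_hash
}
```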

I appreciate that not everyone might be sold on content hashing right now, but it would be great to let people try the functionality on nightly to see whether it makes their lives easier in the field.

Testing:

  • Seems to work on OSX: everything stays fresh after moving the target dir, tested with findshlibs (30 crates) and rust-analyzer (200 crates).

Still todo:

  • cache hashes with the mtimes for src files
  • put svh in rmeta
  • put svh in dylib
  • put svh in .rmeta in .rlib
  • read svh in .rmeta in .rlib
  • read svh in dylib
  • read svh in .rmeta
  • support build.rs
  • integrate awesome feedback
  • write additional tests
  • release notes
  • bikeshed name: -Zhash-tracking, -Zcontent-hash or merge in with track bin deps?

Unresolved questions:

  • Will cross compilation just work?
  • Should the additional crate dependencies be put under a default feature?

@rust-highfive

r? @Eh2406

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) label on Aug 16, 2020
(Two resolved review comments on src/cargo/core/compiler/fingerprint.rs, marked outdated.)
@Eh2406 (Contributor) commented Aug 17, 2020

This is a far more ambitious fix than what I had in mind for #6529. I can see advantages to this approach, but I want to make sure I understand the tradeoffs. The current mtime-based handling happens entirely in Cargo; this PR handles hashes using both Cargo and Rustc. What is the advantage of involving Rustc? Given the advantages of the new way, should we move the handling of mtime to it?

@gilescope (Contributor Author)

It turns out that rustc hashes all the source files anyway (to put those hashes in the debug info), so if we let rustc do the hashing then we get that initial hash (which I was expecting to be expensive) for free.

@bjorn3 mentioned that we should extend this so that we don't rely solely on mtime checking for binary dependencies either. This seems reasonable, though I've not yet figured out whether there's some cunning way those hashes can also be obtained for 'free'. That would seem too good to be true, but rustc might happen to have that data somewhere already.

@gilescope (Contributor Author) commented Aug 20, 2020

In terms of trade-offs it looks very positive so far. If we can get the initial hashes for free, and we only attempt to hash a file when its mtime differs but its size is unchanged, then there's a high probability that the file is exactly the same, and this would be a significant improvement in reducing recompiles.

  • I think the realistic worst case here is that cargo does the additional work of hashing one file before it marks a unit dirty, and only when that unit has an input file with a different mtime but the same size.

(I guess the very worst case is that all files have been touched but only the last file has been modified, in a way that didn't change its size. Thanks, that's a good point: we should try to dirty the unit cheaply via a file-size mismatch before having to do any hash checks; I'm assuming that hashing the input files is always going to be considerably cheaper than doing the compile.)
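
A minimal sketch of that ordering at the unit level, assuming we already have an expected size and hash per input (names hypothetical, not the actual fingerprint code):

```rust
use std::fs;
use std::path::PathBuf;

/// Illustrative: dirty a unit via the cheap size pass first, and only fall
/// back to hashing once every input's size matches the recorded one.
fn unit_is_dirty(
    inputs: &[(PathBuf, u64, u64)], // (path, expected_len, expected_hash)
    hash_file: impl Fn(&PathBuf) -> u64,
) -> bool {
    // Pass 1: file sizes only; a single mismatch (or missing file) is enough.
    for (path, expected_len, _) in inputs {
        match fs::metadata(path) {
            Ok(m) if m.len() == *expected_len => {}
            _ => return true,
        }
    }
    // Pass 2: all sizes match, so compare content hashes.
    inputs
        .iter()
        .any(|(path, _, expected_hash)| hash_file(path) != *expected_hash)
}
```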

@bjorn3 (Member) commented Aug 20, 2020

For binary dependencies there is already the SVH. This is not stored in a stable place, though. There are two ways I can think of to get a binary dependency hash. The first is to define that the SVH is stored in a stable location; this would require cargo to include an ar archive and object file reader, though. The second is to hash the whole file while reading the metadata; this would result in a performance decrease, though, as we currently use mmap to avoid reading the parts of the binary dependencies, and even parts of the actual metadata, that we don't need to compile the current crate.

@gilescope (Contributor Author)

OK, updated to take advantage of rustc's precomputed SourceFile hashes. Next up: exploring bin hashing.

@gilescope (Contributor Author)

I worry that reproducing an SVH would require quite a bit of parsing compared to md5.

Potentially we could turn the fact that cargo only references parts of files into a benefit. For the parts of the files that cargo does read, would there be things in there that are very likely to change if the other (unread) parts of those files changed? If so, we could store a hash plus the range(s) of the file that contributed to it, i.e. leave instructions for how to re-create the hash.
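
Just to make that idea concrete, a possible shape for such a record (purely hypothetical, not anything in this PR yet):

```rust
use std::ops::Range;

/// Hypothetical: a hash recorded together with the byte ranges that produced
/// it, so the same hash can be recomputed later by re-reading only those
/// ranges instead of the whole file.
struct PartialFileHash {
    /// Byte ranges of the file that were actually read and hashed.
    ranges: Vec<Range<u64>>,
    /// Hash over exactly those ranges, concatenated in order.
    hash: u64,
}
```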

@gilescope (Contributor Author) commented Aug 30, 2020

Ah, my bad: we don't need to recompute an SVH, just compare already-created SVHs. As long as we can fish the SVH out of a binary, we're good.

(I've found that hashes are inserted when linking with /Brepro for Mach-O/Windows, but not for Linux.)

If we embed the SVH, then we have a single, consistent mechanism. Comparing SVHs would be much cheaper than hashing files, and as a bonus it fixes rust-lang/rust#73917.

@bjorn3 (Member) commented Aug 30, 2020

You can use the ar crate for reading rlibs and object for reading object files. These are what I use for cg_clif.

@bjorn3 (Member) commented Sep 5, 2020

> Optimisation: No need to figure out if bin files are up to date if they have svh hashes in their filenames.

Svh are never part of the filename as far as I know.

@gilescope (Contributor Author) commented Sep 5, 2020

I might have got the wrong end of the stick, but it looks like the SVH is the hash part in quite a few of the filenames: https://github.com/rust-lang/rust/blob/02fe30971ef397bcff3460a9aaf175e0810c2c90/compiler/rustc_incremental/src/persist/fs.rs#L37. I'm thinking that if the SVH is in the dependency's filename, then as long as that file exists it really ought to contain the right contents. If we can rely on that, we only have to worry about the few bin files without hashes in their filenames.

EDIT: I had got the wrong end of the stick; instead, we can put the SVH inside the .rlib, in the lib.rmeta filename.
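
For what it's worth, with the ar crate @bjorn3 mentioned, fishing the SVH back out of the rmeta member name could look roughly like this (untested sketch; the lib-crate-name-svh.rmeta naming is the scheme this PR introduces, not what current rustc produces):

```rust
use std::fs::File;
use std::path::Path;

/// Untested sketch: find the rmeta member inside an .rlib and take the last
/// '-'-separated segment of its name as the SVH, assuming the renamed
/// `lib<crate-name>-<svh>.rmeta` member described in the PR description.
fn svh_from_rlib(rlib: &Path) -> std::io::Result<Option<String>> {
    let mut archive = ar::Archive::new(File::open(rlib)?);
    while let Some(entry) = archive.next_entry() {
        let entry = entry?;
        let name = String::from_utf8_lossy(entry.header().identifier()).into_owned();
        if let Some(stem) = name.strip_suffix(".rmeta") {
            return Ok(stem.rsplit('-').next().map(str::to_owned));
        }
    }
    Ok(None)
}
```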

@bjorn3 (Member) commented Sep 5, 2020

The SVH is only the hash part for incremental compilation directories. In all other cases it is a Cargo-calculated hash, passed to rustc using -Cextra-filename, that is unique for each crate + compilation options pair but does not depend on the source files.

@gilescope changed the title from "WIP: mtime+content tracking" to "[WIP] mtime+content tracking" on Sep 20, 2020
@bors (Contributor) commented Oct 14, 2020

☔ The latest upstream changes (presumably #8778) made this pull request unmergeable. Please resolve the merge conflicts.

Note that reviewers usually do not review pull requests until merge conflicts are resolved! Once you resolve the conflicts, you should change the labels applied by bors to indicate that your PR is ready for review. Post this as a comment to change the labels:

@rustbot modify labels: +S-waiting-on-review -S-waiting-on-author

@gilescope (Contributor Author)

Ah that's better: cx.add_used_global(llglobal); -- that's got me moving again.

@gilescope (Contributor Author)

Hmm, now I have more symbols than I can shake a stick at: one for each codegen unit of each crate. I can create it only for the first codegen unit, but there still seems to be one per crate; maybe I can restrict it to just the current crate.

@bjorn3 (Member) commented Oct 25, 2020

For dylibs there is a separate codegen unit for metadata. You can add the symbol there. For executables you can add it to the main shim generation. For rlibs you can add a separate archive member.

@gilescope (Contributor Author)

Tests-wise I think we want to run all those freshness tests with content hashing turned on.

@gilescope (Contributor Author)

With rust-analyzer, a build where the target dir has been moved to upset the mtimes still comes out fresh (which is great), but it takes 6-12s to figure out that everything is fresh. This is much better than the full 2m30s build that would otherwise happen. I suspect that where we do have to hash files, we should be doing it in parallel.
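
As a rough illustration of the parallel-hashing idea (rayon and sha2 are assumptions here, not dependencies this PR necessarily takes on):

```rust
use rayon::prelude::*;
use sha2::{Digest, Sha256};
use std::path::PathBuf;

/// Sketch only: hash the files whose mtime/size checks were inconclusive,
/// in parallel, returning each path with its SHA-256 digest (or I/O error).
fn hash_files(paths: &[PathBuf]) -> Vec<(PathBuf, std::io::Result<[u8; 32]>)> {
    paths
        .par_iter()
        .map(|path| {
            let digest: std::io::Result<[u8; 32]> =
                std::fs::read(path).map(|bytes| Sha256::digest(&bytes).into());
            (path.clone(), digest)
        })
        .collect()
}
```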

@gilescope (Contributor Author)

@bjorn3 both the rustc and cargo PRs are back in sync now. Seems like things are slightly twisted on the hashes front; I think reversing the slices will probably untwist it, but what's there is consistent and works for findshlibs. Now that SHA-256 support has been added (in master), I'm more minded to plump for 256 rather than 512, as that will tidy the code up a little.

@gilescope (Contributor Author) commented on the diff:

.files
.iter()
.find(|reference| *dep_in == reference.path);
if let Some(reference) = ref_file {

Bug: there's no else branch here marking it as stale.

@bors (Contributor) commented Dec 7, 2020

☔ The latest upstream changes (presumably #8954) made this pull request unmergeable. Please resolve the merge conflicts.

Note that reviewers usually do not review pull requests until merge conflicts are resolved! Once you resolve the conflicts, you should change the labels applied by bors to indicate that your PR is ready for review. Post this as a comment to change the labels:

@rustbot modify labels: +S-waiting-on-review -S-waiting-on-author

@ehuss (Contributor) commented Jul 1, 2021

Ping @gilescope, are you still interested in moving this forward? Is there somewhere you are stuck or need help with?

@gilescope (Contributor Author) commented Jul 1, 2021 via email

@gilescope (Contributor Author)

Sorry, I still intend to breathe life back into this; life is just getting in the way.

@ehuss (Contributor) commented Mar 22, 2022

I'm going to close this due to inactivity. Unfortunately at this time we don't have the capacity to review major changes like this, but this is still a feature that we would love to see happen someday. If you are interested in resuming the work, please reach out to see if we can accept help with working on it.

Labels: S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.)

Successfully merging this pull request may close these issues: (Option to) Fingerprint by file contents instead of mtime

7 participants