Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use stat() to avoid checking content hashes for repository up-to-dateness checks #21044

Open
lberki opened this issue Jan 26, 2024 · 11 comments
Open
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: feature request

Comments

@lberki
Copy link
Contributor

lberki commented Jan 26, 2024

Description of the feature request:

The way Bazel verifies up-to-dateness of source files is that if some select data from stat() is unchanged, it believes it, and if not, it does a secondary check by checksumming source files to avoid re-running actions after a simple touch.

Repository rules don't work like this: they simply checksum every file that the repository depends on, which makes up-to-dateness much slower than they would otherwise be.

It would be nice to make repository rules do the same thing as actions.

Which category does this issue belong to?

No response

What underlying problem are you trying to solve with this feature?

Make the repository up-to-dateness check more efficient.

Which operating system are you running Bazel on?

No response

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@fmeum
Copy link
Collaborator

fmeum commented Jan 26, 2024

Just want to note that Bazel doesn't check just mtime, but also ctime and other attributes, before computing a digest: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/vfs/DigestUtils.java;l=42;bpv=1;bpt=1

@lberki
Copy link
Contributor Author

lberki commented Jan 26, 2024

You're indeed right -- I used "mtime" as a shorthand for "whatever can be gleaned about a file short of looking at its contents".

@lberki lberki changed the title Use mtime to avoid checking content hashes for repository up-to-dateness checks Use stat() to avoid checking content hashes for repository up-to-dateness checks Jan 26, 2024
@iancha1992 iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Jan 26, 2024
@tjgq tjgq added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. and removed team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Jan 30, 2024
@meteorcloudy meteorcloudy added P1 I'll work on this now. (Assignee required) and removed untriaged labels Jan 30, 2024
@SalmaSamy
Copy link
Contributor

@lberki, @fmeum I guess storing only ctime should be enough? as it already tracks changes to the file content and the extra stuff

@lberki
Copy link
Contributor Author

lberki commented Feb 1, 2024

@SalmaSamy AFAIU the ctime always changes if mtime does, so using that instead of the mtime would cause suboptimal caching in the worst case, which is perfectly fine (if somewhat inconsistent with action inputs, but that's just a minor gripe)

@fmeum
Copy link
Collaborator

fmeum commented Feb 1, 2024

There is the (probably very theoretical) concern of having the filesystem backing a path change. By chance, the ctime could remain the same but the file contents could differ. Bazel uses more stat properties such as file size and inside for the digest cache key, but I could see that being overkill for repo rules.

@lberki
Copy link
Contributor Author

lberki commented Feb 2, 2024

Good point! TBH I think the best approach here is to mimic what FileContentsProxy does, which is a tuple of (mtime, ctime, inode number)., It's only very slightly more complicated than relying on ctime alone, and protects against a few odd scenarios like the one you describe above.

@meteorcloudy
Copy link
Member

@bazel-io fork 7.1.0

@meisterT
Copy link
Member

meisterT commented Feb 7, 2024

Note that timestamps on HFS+ have a granularity of one second.

@SalmaSamy SalmaSamy self-assigned this Feb 7, 2024
@lberki
Copy link
Contributor Author

lberki commented Feb 7, 2024

FWIW, that's what TimestampGranularityMonitor is for: it makes Bazel wait long enough at the end of the build so that Bazel can be sure that if something modifies an input after it has returned, its timestamp changes.

@SalmaSamy
Copy link
Contributor

I think just using mtime alone now provides sufficient information for our needs, as it directly reflects changes to the file's content, which is our primary concern. Storing ctime or inode may add unnecessary complexity without offering much additional value as I think the extra information they store would already invalidate the label Ex: file location or name or very rarely changed Ex: file owner.

Wyverald added a commit that referenced this issue Feb 15, 2024
- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.
Wyverald added a commit that referenced this issue Feb 15, 2024
- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.
Wyverald added a commit that referenced this issue Feb 15, 2024
- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.
Wyverald added a commit that referenced this issue Feb 15, 2024
- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
  - In the future we could add glob patterns to this method.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.
Wyverald added a commit that referenced this issue Feb 15, 2024
- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
  - In the future we could add glob patterns to this method.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.
Wyverald added a commit that referenced this issue Feb 16, 2024
- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
  - In the future we could add glob patterns to this method.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.
copybara-service bot pushed a commit that referenced this issue Feb 20, 2024
- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
  -  In the future we could add glob patterns to this method.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.

Closes #21362.

PiperOrigin-RevId: 608667062
Change-Id: Ibacbb7af4cf4d7628fe8fcf06e2c4fa50e811e4e
@Wyverald
Copy link
Member

Lowering the priority of this one due to the uncertainty of its correctness (and we're out of time for 7.1.0 anyway).

@Wyverald Wyverald added P2 We'll consider working on this in future. (Assignee optional) and removed P1 I'll work on this now. (Assignee required) labels Feb 20, 2024
Wyverald added a commit that referenced this issue Feb 20, 2024
- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
  -  In the future we could add glob patterns to this method.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.

Closes #21362.

PiperOrigin-RevId: 608667062
Change-Id: Ibacbb7af4cf4d7628fe8fcf06e2c4fa50e811e4e
Wyverald added a commit that referenced this issue Feb 20, 2024
- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
  -  In the future we could add glob patterns to this method.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.

Closes #21362.

PiperOrigin-RevId: 608667062
Change-Id: Ibacbb7af4cf4d7628fe8fcf06e2c4fa50e811e4e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: feature request
Projects
None yet
Development

No branches or pull requests

10 participants