Remote: Update FilesystemValueChecker to handle the case that a file was remote but is now staged. #14233

coeuvre · 2021-11-05T07:09:39Z

When using BwtB, intermediate outputs are not downloaded. When an action is going to execute locally (either is decided by dynamic execution or is forced to run locally), Bazel will download all the inputs which are outputs of dependent actions and the downloaded file are kept after build.

For a following build, Bazel doesn't realize the downloaded files are the same as remote ones. It marks those generating actions dirty and try to re-execute. Normally, this re-execution will be skipped due to the action cache. However, if only part of the outputs are downloaded, the action cache can't help.

This PR update FilesystemValueChecker to compare between previous remote metadata and current local file metadata so the generating action isn't invalidated if the output files are staged and are same with remote ones.

This PR only fixes output files. A following PR is needed to handle output tree artifacts.

Fixes #13912.

…was remote but is not staged.

alexjski

Would we need this change if we enabled action cache to store the remote metadata? My main worry is about the extra IO in case when we are not using an output service, namely in the case when a digest does not match -- will we end up computing a digest of of a stale output twice (once when checking the outputs, once when checking action cache for it)?

In particular, due to lack of sharing of these digest, we can imagine that 1 change to an output can trigger 2x digesting of N outputs for a rule which has N of them.

I can see that potentially helping a bit in the case of no changes to outputs, namely we would not dirty some ActionExecutionValues which later get change-pruned.

alexjski · 2021-11-05T14:48:59Z

src/test/shell/bazel/remote/remote_execution_test.sh

+    attrs = {
+        "producer": attr.label(providers = [[ProducerInfo]]),
+    },
+)


Could we not use genrules with srcs and outs of producer/{a,b}.txt?

The critical part of this test case is that consumer only need b.txt from producer (i.e. only part of outputs from its dependency). I don't know how to achieve that with genrules (sorry for my poor starlark skill).

No worries -- that is really fancy Starlark. I would think about something like this:

genrule(name="dep", outs=["a.txt", "b.txt"], cmd="touch $(SRCS)") genrule(name="top", srcs=[":b.txt"], outs=["top.out"], cmd="cp $< $@")

alexjski · 2021-11-05T15:29:43Z

src/main/java/com/google/devtools/build/lib/skyframe/FilesystemValueChecker.java

+        byte[] digest = DigestUtils.manuallyComputeDigest(file.getPath(), fileMetadata.getSize());
+        if (Arrays.equals(digest, lastKnownData.getDigest())
+            && fileMetadata.getSize() == lastKnownData.getSize()) {
+          // TODO(chiwang): Find a way to update lastKnownData to fileMetadata so we don't need to


One way to do this could be to attach a FileContentsProxy to a RemoteFileArtifactValue when we download the input. I think whether that idea makes sense would be informed by how we download the inputs -- how do we avoid re-downloading inputs today? By re-downloading, I mean a case of Action B depends on Action A and say we change something with Action B, but not A -- we would want not to materialize remote outputs of Action A again.

If we did that, we should see if the extra pointer takes more memory (for non-RBE case, we already have the actionId we don't use). Then, this code may not need any changes -- all you would need is to allow that content proxy comparison to work between RegularFileArtifactValue and RemoteFileArtifactValue.

how do we avoid re-downloading inputs today?

We guard the download with FileArtifactValue.isRemote.

After changing Action B, the following build will check Action A via action cache during which the RegularFileArtifactValues of staged outputs are generated. If Action A hit the cache, we will use these RegularFileArtifactValues for Action B. So there is no extra downloads for Action B in this run.

how do we avoid re-downloading inputs today?

Imagine we have action A executed remotely and action B locally.

Build 1 -- we run A remotely, download outputs of A to run action B.

We change a source of Action B so that it needs to be rerun.

Build 2 -- we run bazel build :b -- for A, we have a Skyframe cache hit, B we execute.

For A, we get an ActionExecutionValue from Skyframe, no action cache checks needed, hence the metadata for all its outputs should stay as remote. Will we download the inputs again in this case even though they are already available locally from prior build?

What happens if I change the downloaded file on disk between bazel builds?

Will we download the inputs again in this case even though they are already available locally from prior build?

In theory, we would download the inputs again because the ActionExecutionValue for Action A contains RemoteFileArtifactValue. However, in practice, and I have done some local tests, for some reason, Action A is invalidated and we check the action cache against it so the metadata becomes RegularFileArtifactValue when we run Action B. I don't totally understand the reason why Action A is invalidated but maybe related to the original issue.

What happens if I change the downloaded file on disk between bazel builds?

The generating action (in this case, Action A) is invalidated and re-executed. Before execution, the downloaded (and modified) file will be deleted. Action A is executed remotely and then we get the action result without downloading the outputs.

If the result of Action A is same as before, Action B won't be re-executed because of action cache.

Otherwise, execute Action B, download the updated outputs of Action A.

That's surprising -- I would think one should be able to invalidate a top-action without needing to change the base one. I would be more than happy to try to help you offline with constructing an example doing that.

coeuvre · 2021-11-08T05:55:03Z

Would we need this change if we enabled action cache to store the remote metadata?

If we store remote metadata in action cache, we don't need this change. But I am wondering do we want these extra steps in this case i.e. invalidating generating actions, checking action cache and change-pruning.

My main worry is about the extra IO in case when we are not using an output service, namely in the case when a digest does not match -- will we end up computing a digest of of a stale output twice (once when checking the outputs, once when checking action cache for it)?

True.

I think, the best we can do is, after downloading the inputs, replacing RemoteFileArtifactValue with RegularFileArtifactValue in the ActionExecutionValues. Is that possible?

coeuvre · 2021-11-08T06:54:05Z

I think, the best we can do is, after downloading the inputs, replacing RemoteFileArtifactValue with RegularFileArtifactValue in the ActionExecutionValues. Is that possible?

Let skyframe invalidate the generating actions and redo the action cache checking is actually an indirect way to update ActionExeciontValues with RegularFileArtifactValues. Do we have more direct way to achieve this immediately after downloads?

alexjski · 2021-11-09T03:34:45Z

Would we need this change if we enabled action cache to store the remote metadata?

If we store remote metadata in action cache, we don't need this change. But I am wondering do we want these extra steps in this case i.e. invalidating generating actions, checking action cache and change-pruning.

My main worry is about the extra IO in case when we are not using an output service, namely in the case when a digest does not match -- will we end up computing a digest of of a stale output twice (once when checking the outputs, once when checking action cache for it)?

True.

I think, the best we can do is, after downloading the inputs, replacing RemoteFileArtifactValue with RegularFileArtifactValue in the ActionExecutionValues. Is that possible?

The unfortunate part is that this would make ActionExecutionValue not-immutable anymore + you would need to synchronize that. It may be easier to do so at the level of the RemoteFileArtifactValue (imagine you add a transitive field there with the content proxy).

coeuvre · 2021-11-09T08:44:40Z

I think replacing RemoteFileArtifactValue with RegularFileArtifactValue after we downloaded the files (either with some hacks in prefetchInputs or via action rewinding) is a proper fix. This can fix the original issue and can prevent the same files be re-downloaded as well.

I can see a lot of issues will be fixed by implementing action rewinding but it's not that easy to do. I will try with some hacks in prefetchInputs first.

coeuvre · 2021-11-10T07:42:39Z

Closing since this change adds extra digest computations and I would like to explore other ways to fix the issue. In the meantime, a workaround is to use --experimental_action_cache_store_output_metadata.

alexjski · 2021-11-10T17:00:21Z

I think replacing RemoteFileArtifactValue with RegularFileArtifactValue after we downloaded the files (either with some hacks in prefetchInputs or via action rewinding) is a proper fix. This can fix the original issue and can prevent the same files be re-downloaded as well.

I can see a lot of issues will be fixed by implementing action rewinding but it's not that easy to do. I will try with some hacks in prefetchInputs first.

The problem there is that in case of rewinding, you would want a special case of that -- one which does know about the prior result to avoid re-digesting the downloaded files. I think the best bet may be to update the actionExcecutionValue in prefetchInputs. Actually, my favorite would be to add nullable contentproxy to RemoteActionFileArtifactValue, update that after a download and use that to avoid the re-download as well.

coeuvre added 2 commits November 5, 2021 13:05

WIP

4d98c39

Remote: Update FilesystemValueChecker to handle the case that a file …

a424a11

…was remote but is not staged.

coeuvre requested a review from alexjski November 5, 2021 07:09

coeuvre requested a review from a team as a code owner November 5, 2021 07:09

google-cla bot added the cla: yes label Nov 5, 2021

coeuvre changed the title ~~Remote: Update FilesystemValueChecker to handle the case that a file was remote but is not staged.~~ Remote: Update FilesystemValueChecker to handle the case that a file was remote but is now staged. Nov 5, 2021

alexjski suggested changes Nov 5, 2021

View reviewed changes

coeuvre closed this Nov 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remote: Update FilesystemValueChecker to handle the case that a file was remote but is now staged. #14233

Remote: Update FilesystemValueChecker to handle the case that a file was remote but is now staged. #14233

coeuvre commented Nov 5, 2021 •

edited

Loading

alexjski left a comment

alexjski Nov 5, 2021

coeuvre Nov 8, 2021

alexjski Nov 8, 2021

alexjski Nov 5, 2021

coeuvre Nov 8, 2021

alexjski Nov 8, 2021

coeuvre Nov 9, 2021

alexjski Nov 10, 2021

coeuvre commented Nov 8, 2021

coeuvre commented Nov 8, 2021

alexjski commented Nov 9, 2021

coeuvre commented Nov 9, 2021 •

edited

Loading

coeuvre commented Nov 10, 2021

alexjski commented Nov 10, 2021

Remote: Update FilesystemValueChecker to handle the case that a file was remote but is now staged. #14233

Remote: Update FilesystemValueChecker to handle the case that a file was remote but is now staged. #14233

Conversation

coeuvre commented Nov 5, 2021 • edited Loading

alexjski left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coeuvre commented Nov 8, 2021

coeuvre commented Nov 8, 2021

alexjski commented Nov 9, 2021

coeuvre commented Nov 9, 2021 • edited Loading

coeuvre commented Nov 10, 2021

alexjski commented Nov 10, 2021

coeuvre commented Nov 5, 2021 •

edited

Loading

coeuvre commented Nov 9, 2021 •

edited

Loading