Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote: Update FilesystemValueChecker to handle the case that a file was remote but is now staged. #14233

Closed
wants to merge 2 commits into from

Conversation

coeuvre
Copy link
Member

@coeuvre coeuvre commented Nov 5, 2021

When using BwtB, intermediate outputs are not downloaded. When an action is going to execute locally (either is decided by dynamic execution or is forced to run locally), Bazel will download all the inputs which are outputs of dependent actions and the downloaded file are kept after build.

For a following build, Bazel doesn't realize the downloaded files are the same as remote ones. It marks those generating actions dirty and try to re-execute. Normally, this re-execution will be skipped due to the action cache. However, if only part of the outputs are downloaded, the action cache can't help.

This PR update FilesystemValueChecker to compare between previous remote metadata and current local file metadata so the generating action isn't invalidated if the output files are staged and are same with remote ones.

This PR only fixes output files. A following PR is needed to handle output tree artifacts.

Fixes #13912.

@coeuvre coeuvre requested a review from alexjski November 5, 2021 07:09
@coeuvre coeuvre requested a review from a team as a code owner November 5, 2021 07:09
@google-cla google-cla bot added the cla: yes label Nov 5, 2021
@coeuvre coeuvre changed the title Remote: Update FilesystemValueChecker to handle the case that a file was remote but is not staged. Remote: Update FilesystemValueChecker to handle the case that a file was remote but is now staged. Nov 5, 2021
Copy link
Contributor

@alexjski alexjski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we need this change if we enabled action cache to store the remote metadata? My main worry is about the extra IO in case when we are not using an output service, namely in the case when a digest does not match -- will we end up computing a digest of of a stale output twice (once when checking the outputs, once when checking action cache for it)?

In particular, due to lack of sharing of these digest, we can imagine that 1 change to an output can trigger 2x digesting of N outputs for a rule which has N of them.

I can see that potentially helping a bit in the case of no changes to outputs, namely we would not dirty some ActionExecutionValues which later get change-pruned.

attrs = {
"producer": attr.label(providers = [[ProducerInfo]]),
},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we not use genrules with srcs and outs of producer/{a,b}.txt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The critical part of this test case is that consumer only need b.txt from producer (i.e. only part of outputs from its dependency). I don't know how to achieve that with genrules (sorry for my poor starlark skill).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries -- that is really fancy Starlark. I would think about something like this:

genrule(name="dep", outs=["a.txt", "b.txt"], cmd="touch $(SRCS)")
genrule(name="top", srcs=[":b.txt"], outs=["top.out"], cmd="cp $< $@")

byte[] digest = DigestUtils.manuallyComputeDigest(file.getPath(), fileMetadata.getSize());
if (Arrays.equals(digest, lastKnownData.getDigest())
&& fileMetadata.getSize() == lastKnownData.getSize()) {
// TODO(chiwang): Find a way to update lastKnownData to fileMetadata so we don't need to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to do this could be to attach a FileContentsProxy to a RemoteFileArtifactValue when we download the input. I think whether that idea makes sense would be informed by how we download the inputs -- how do we avoid re-downloading inputs today? By re-downloading, I mean a case of Action B depends on Action A and say we change something with Action B, but not A -- we would want not to materialize remote outputs of Action A again.

If we did that, we should see if the extra pointer takes more memory (for non-RBE case, we already have the actionId we don't use). Then, this code may not need any changes -- all you would need is to allow that content proxy comparison to work between RegularFileArtifactValue and RemoteFileArtifactValue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we avoid re-downloading inputs today?

We guard the download with FileArtifactValue.isRemote.

After changing Action B, the following build will check Action A via action cache during which the RegularFileArtifactValues of staged outputs are generated. If Action A hit the cache, we will use these RegularFileArtifactValues for Action B. So there is no extra downloads for Action B in this run.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we avoid re-downloading inputs today?

Imagine we have action A executed remotely and action B locally.

  1. Build 1 -- we run A remotely, download outputs of A to run action B.
  2. We change a source of Action B so that it needs to be rerun.
  3. Build 2 -- we run bazel build :b -- for A, we have a Skyframe cache hit, B we execute.

For A, we get an ActionExecutionValue from Skyframe, no action cache checks needed, hence the metadata for all its outputs should stay as remote. Will we download the inputs again in this case even though they are already available locally from prior build?

What happens if I change the downloaded file on disk between bazel builds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we download the inputs again in this case even though they are already available locally from prior build?

In theory, we would download the inputs again because the ActionExecutionValue for Action A contains RemoteFileArtifactValue. However, in practice, and I have done some local tests, for some reason, Action A is invalidated and we check the action cache against it so the metadata becomes RegularFileArtifactValue when we run Action B. I don't totally understand the reason why Action A is invalidated but maybe related to the original issue.

What happens if I change the downloaded file on disk between bazel builds?

The generating action (in this case, Action A) is invalidated and re-executed. Before execution, the downloaded (and modified) file will be deleted. Action A is executed remotely and then we get the action result without downloading the outputs.

  1. If the result of Action A is same as before, Action B won't be re-executed because of action cache.
  2. Otherwise, execute Action B, download the updated outputs of Action A.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's surprising -- I would think one should be able to invalidate a top-action without needing to change the base one. I would be more than happy to try to help you offline with constructing an example doing that.

@coeuvre
Copy link
Member Author

coeuvre commented Nov 8, 2021

Would we need this change if we enabled action cache to store the remote metadata?

If we store remote metadata in action cache, we don't need this change. But I am wondering do we want these extra steps in this case i.e. invalidating generating actions, checking action cache and change-pruning.

My main worry is about the extra IO in case when we are not using an output service, namely in the case when a digest does not match -- will we end up computing a digest of of a stale output twice (once when checking the outputs, once when checking action cache for it)?

True.

I think, the best we can do is, after downloading the inputs, replacing RemoteFileArtifactValue with RegularFileArtifactValue in the ActionExecutionValues. Is that possible?

@coeuvre
Copy link
Member Author

coeuvre commented Nov 8, 2021

I think, the best we can do is, after downloading the inputs, replacing RemoteFileArtifactValue with RegularFileArtifactValue in the ActionExecutionValues. Is that possible?

Let skyframe invalidate the generating actions and redo the action cache checking is actually an indirect way to update ActionExeciontValues with RegularFileArtifactValues. Do we have more direct way to achieve this immediately after downloads?

@alexjski
Copy link
Contributor

alexjski commented Nov 9, 2021

Would we need this change if we enabled action cache to store the remote metadata?

If we store remote metadata in action cache, we don't need this change. But I am wondering do we want these extra steps in this case i.e. invalidating generating actions, checking action cache and change-pruning.

My main worry is about the extra IO in case when we are not using an output service, namely in the case when a digest does not match -- will we end up computing a digest of of a stale output twice (once when checking the outputs, once when checking action cache for it)?

True.

I think, the best we can do is, after downloading the inputs, replacing RemoteFileArtifactValue with RegularFileArtifactValue in the ActionExecutionValues. Is that possible?

The unfortunate part is that this would make ActionExecutionValue not-immutable anymore + you would need to synchronize that. It may be easier to do so at the level of the RemoteFileArtifactValue (imagine you add a transitive field there with the content proxy).

@coeuvre
Copy link
Member Author

coeuvre commented Nov 9, 2021

I think replacing RemoteFileArtifactValue with RegularFileArtifactValue after we downloaded the files (either with some hacks in prefetchInputs or via action rewinding) is a proper fix. This can fix the original issue and can prevent the same files be re-downloaded as well.

I can see a lot of issues will be fixed by implementing action rewinding but it's not that easy to do. I will try with some hacks in prefetchInputs first.

@coeuvre
Copy link
Member Author

coeuvre commented Nov 10, 2021

Closing since this change adds extra digest computations and I would like to explore other ways to fix the issue. In the meantime, a workaround is to use --experimental_action_cache_store_output_metadata.

@coeuvre coeuvre closed this Nov 10, 2021
@alexjski
Copy link
Contributor

I think replacing RemoteFileArtifactValue with RegularFileArtifactValue after we downloaded the files (either with some hacks in prefetchInputs or via action rewinding) is a proper fix. This can fix the original issue and can prevent the same files be re-downloaded as well.

I can see a lot of issues will be fixed by implementing action rewinding but it's not that easy to do. I will try with some hacks in prefetchInputs first.

The problem there is that in case of rewinding, you would want a special case of that -- one which does know about the prior result to avoid re-digesting the downloaded files. I think the best bet may be to update the actionExcecutionValue in prefetchInputs. Actually, my favorite would be to add nullable contentproxy to RemoteActionFileArtifactValue, update that after a download and use that to avoid the re-download as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants