Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Execlog parser should output each action's digest #16741

Closed
tbaing opened this issue Nov 11, 2022 · 14 comments
Closed

Execlog parser should output each action's digest #16741

tbaing opened this issue Nov 11, 2022 · 14 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

@tbaing
Copy link
Contributor

tbaing commented Nov 11, 2022

Description of the feature request:

While trying to determine why a given target results in a remote cache miss, I used the execlog parser to gather detailed logs about what was happening. I wanted to get the action digest for the problematic action, to check for the existence of that object in the remote cache, but didn't see anything I recognized as that. The output contains hashes for each individual input and for the output, but nothing that appeared to be the digest for the aggregate set of all inputs, env variables, command, etc.

Is there some option for adding this information to the output that's not documented on https://bazel.build/remote/cache-remote? If not, can we please add that information to the output from the execlog parser?

What underlying problem are you trying to solve with this feature?

It's difficult to correlate the output from the execlog parser with the artifacts in the cache, which makes it harder to be certain whether a cache miss is due to the artifact not being present or is due to something else going wrong.

Which operating system are you running Bazel on?

gLinux

What is the output of bazel info release?

release 6.0.0-pre.20221012.2

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

fatal: not a git repository (or any of the parent directories): .git

Have you found anything relevant by searching the web?

Nothing relevant found.

Any other information, logs, or outputs that you want to share?

No response

@sgowroji sgowroji added type: feature request untriaged team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Nov 11, 2022
@tjgq
Copy link
Contributor

tjgq commented Nov 11, 2022

You're looking for the SpawnExec.digest field. This is a new addition in Bazel 6.

On a separate note: you might want to use --execution_log_json_file to produce the log in JSON format instead of binary and avoid using the execlog parser.

@tbaing
Copy link
Contributor Author

tbaing commented Nov 11, 2022

I switched to the JSON format (thanks for the suggestion!) and it's easier to read. I do see the SpawnExec.digest field there, but I'm noticing it's not present in some cases like local execution, which is exactly the situation where I'd want to see its value. Is there a reason it's not present for local execution?

@tjgq
Copy link
Contributor

tjgq commented Nov 14, 2022

You are correct - the field is missing for locally executed actions because the corresponding data structure (RemoteCacheClient.ActionKey) is only created for actions that run remotely.

I agree that producing a digest for locally executed actions would be helpful, but I'm not sure about the best way to do it; creating an ActionKey for an action that won't run remotely might be unnecessarily expensive.

@tbaing
Copy link
Contributor Author

tbaing commented Nov 14, 2022

For the action to run locally, wouldn't we have had to compute the ActionKey and use it to check the remote cache, and only execute locally after a cache miss? Shouldn't the ActionKey already exist by this point?

@tjgq
Copy link
Contributor

tjgq commented Nov 14, 2022

Yes, the ActionKey is definitely computed at some point if you have either remote caching or remote execution enabled. But those code paths are both separate from local execution, so it isn't necessarily straightforward to plumb it into the right place.

@tjgq
Copy link
Contributor

tjgq commented Nov 14, 2022

There's also the question of whether/how we should populate the field if neither remote caching nor remote execution are enabled.

@tbaing
Copy link
Contributor Author

tbaing commented Nov 14, 2022

The difficulty of "we have it but not here" makes sense.

Maybe alternatively we could include information about the cache interactions (checking and writing) in the raw log for the run? That would allow us to see the ActionKey being used and the success/failure results for each interaction with the cache, eliminating both the problem of not having the necessary info at the time we construct the content for the actual execution and the problem of what to do if no remote cache interaction is occurring.

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Nov 22, 2022
@tjgq
Copy link
Contributor

tjgq commented Nov 22, 2022

@tbaing Bazel has its own concept of an "action key" [1] which every action has, independently of the strategy it ends up running under. What do you think about adding it to a separate SpawnExec.action_key field? This would let you correlate executions of "the same action" across invocations; the drawback is that there's no way to convert it to and from the Remote Execution key (but it's not clear to me that's a requirement).

[1] https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/actions/ActionAnalysisMetadata.java;l=98;drc=14e32e7f5d1492f7ca4ffc17c40b913ec5799603

@tbaing
Copy link
Contributor Author

tbaing commented Nov 22, 2022

Would the remote cache key be 1:1 with that internal action key, even though not identical/computable from one another? Or does one of them consider information that the other does not, potentially resulting in multiple values on one side being mapped to a single value on the other?

If it's 1:1 then that sounds great. If it's many:1, I'd need more information about what differs between the keys to know whether it would give what I'm asking for here. If a single internal action key could map to multiple remote cache keys, then it doesn't help answer the question of "should I have found a remote cache hit for this action" because it might be that the remote cache key would be different from one invocation to the next and remote cache misses would be expected. (But if that's the case, I'd be very interested in understanding why those attributes aren't part of Bazel's internal action key.)

@tjgq
Copy link
Contributor

tjgq commented May 3, 2023

Would the remote cache key be 1:1 with that internal action key, even though not identical/computable from one another? Or does one of them consider information that the other does not, potentially resulting in multiple values on one side being mapped to a single value on the other?

It's definitely possible for two internally distinct actions to map to the same remote cache entry. I don't believe the reverse is possible.

@tbaing
Copy link
Contributor Author

tbaing commented May 3, 2023

Thanks for the confirmation that it's not 1:many; that would be very unexpected!

I wrote this FR because I wasn't seeing remote cache hits for some actions in our graph when used as inputs by other actions, and I wanted to confirm whether the problem was that Bazel wasn't writing the output to the cache or that Bazel was failing to find/use the remote content. I've since identified why that was happening so this isn't a current problem, but it did highlight to me that it's very hard to examine Bazel's interactions with the remote cache to understand why we're getting a particular behavior.

I think what would be most useful would be for the exec log to list:

  • for each input, what was the status of the search of each location (execroot, disk cache, remote cache, etc.) including a full identifier/path for the artifact within that location, including the HTTP response code/message where applicable
  • for each output, did we attempt to write it to remote caches and if so what was the status of the attempt to write the artifact to the remote location, again including a full identifier/path for the artifact within that location

From that info, you'd be able to see exactly what remote interactions occurred and whether there were problems with any of them, and you could see where to look for the artifact in remotes that allow browsing of the content (GS, --disk_cache). I recognize that that info may not be readily available in the parts of the code that will be writing this file, and that it might be more effort than it's worth since I've gotten past the problem that prompted me to open this FR and maybe most people don't need to troubleshoot remote/disk cache interactions.

@brentleyjones
Copy link
Contributor

@bazel-io flag

@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Oct 30, 2023
@keertk
Copy link
Member

keertk commented Oct 30, 2023

@bazel-io fork 7.0.0

@bazel-io bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Oct 30, 2023
bazel-io pushed a commit to bazel-io/bazel that referenced this issue Oct 30, 2023
Specifically, introduce a getter/setter pair to store the digest in the SpawnExecutionContext when looking up the disk or remote cache, and retrieve it later when constructing the SpawnResult.

Fixes bazelbuild#16741.

PiperOrigin-RevId: 577799982
Change-Id: I570db74f4dbeee95e00853883eda68b25a84934c
iancha1992 added a commit that referenced this issue Oct 30, 2023
Specifically, introduce a getter/setter pair to store the digest in the
SpawnExecutionContext when looking up the disk or remote cache, and
retrieve it later when constructing the SpawnResult.

Fixes #16741.

Commit
40d6780

PiperOrigin-RevId: 577799982
Change-Id: I570db74f4dbeee95e00853883eda68b25a84934c

Co-authored-by: Googler <tjgq@google.com>
Co-authored-by: Ian (Hee) Cha <heec@google.com>
tjgq added a commit to tjgq/bazel that referenced this issue Nov 6, 2023
…isses.

Specifically, introduce a getter/setter pair to store the digest in the SpawnExecutionContext when looking up the disk or remote cache, and retrieve it later when constructing the SpawnResult.

Fixes bazelbuild#16741.

PiperOrigin-RevId: 577799982
Change-Id: I570db74f4dbeee95e00853883eda68b25a84934c
@iancha1992
Copy link
Member

A fix for this issue has been included in Bazel 7.0.0 RC5. Please test out the release candidate and report any issues as soon as possible. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request
Projects
None yet
Development

No branches or pull requests

8 participants