Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FR?: ctx.actions.symlink action time scales w/ size of input; should be constant #14125

Closed
ghost opened this issue Oct 18, 2021 · 33 comments
Closed
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts type: feature request

Comments

@ghost
Copy link

ghost commented Oct 18, 2021

Description of the problem / feature request:

Action 1 generates a 2 GiB output, 2gb.out. Creating the output takes .75s, but postprocessing (checksumming?) via actuallyCompleteAction takes 5s
Action 2 is a ctx.actions.symlink with target_file = ctx.file.2gb_out. Creating the output is seemingly instantaneous, but postprocessing via actuallyCompleteAction costs another 5s.

Feature requests: what underlying problem are you trying to solve with this feature?

rules_pkg's pkg_zip deprecated the out attr and manages outputs internally w/ an implicit ctx.actions.symlink. That call to symlink adds 15s to my build's critical path.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

defs.bzl

def _dummy_impl(ctx):
    out_phys = ctx.actions.declare_file(ctx.outputs.out.basename + ".real")
    ctx.actions.run_shell(
        outputs = [out_phys],
        command = "dd if=/dev/zero of=$1 bs=1M count=2K",  # 2 GiB
        arguments = [out_phys.path],
        execution_requirements = {"local": ""},  # don't bother w/ sandboxing
    )

    ctx.actions.symlink(
        output = ctx.outputs.out,
        target_file = out_phys,
        # execution_requirements = {"local": ""},
    )

dummy = rule(
    implementation = _dummy_impl,
    attrs = {
        "out": attr.output(mandatory = True),
    },
)

BUILD

load(":defs.bzl", "dummy")

dummy(
    name = "dummy",
    out = "dummy.out",
    tags = ["no-cache"],
)
$ bazelisk build --profile=symlink.profile.gz :dummy
Starting local Bazel server and connecting to it...
INFO: Analyzed target //:dummy (4 packages loaded, 6 targets configured).
INFO: Found 1 target...
INFO: From Action dummy.out.real:
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.755435 s, 2.8 GB/s
Target //:dummy up-to-date:
  .bazel/bin/dummy.out
INFO: Elapsed time: 13.959s, Critical Path: 11.40s
INFO: 3 processes: 2 internal, 1 local.
INFO: Build completed successfully, 3 total actions

What operating system are you running Bazel on?

Linux, CentOS 8

What's the output of bazel info release?

release 4.2.1 and release 5.0.0-pre.20210929.1 (via USE_BAZEL_VERSION=rolling)

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

n/a

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

n/a

Have you found anything relevant by searching the web?

No.

(Edit: Found issue #12158 and discovered undocumented options digest_function and unix_digest_hash_attribute_name. Perhaps these options or code adjacent to them could be leveraged for this?)

Any other information, logs, or outputs that you want to share?

  • I assume that the cost of actuallyCompleteAction boils down to checksumming the target of the output symlink. Since Bazel already has a checksum for the input, I would hope that this value could be reused.
  • In addition to the rules_pkg use case, skylib's copy_file is optionally a wrapper for ctx.actions.symlink, so more users may encounter this than you'd think.
@tjgq
Copy link
Contributor

tjgq commented Oct 25, 2022

Repeating what I said in #10702:

Sadly, ctx.actions.symlink with a declare_file or declare_directory output parameter doesn't actually optimize out the copy. The action itself does create a symlink in the filesystem, but when the action outputs are collected, Skyframe follows the symlink and hashes the file it points to (ignoring the fact that the hash should be identical).

One possible solution is to generalize 32b0f5a to also short-circuit the filesystem traversal for symlinks in the local execution case. I know that this isn't trivial to do because I tried (and failed) to do that in an earlier iteration of that CL.

@brandjon brandjon added untriaged team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts and removed team-Build-Language labels Nov 4, 2022
@brentleyjones
Copy link
Contributor

@tjgq: you, @BalestraPatrick and I talked at BazelCon today about this, and you mentioned that we should be able to store the hash in metadata (or use the hash that is already in metadata?) to prevent having to follow the symlink and re-hash it. And if not, I suggested we could associate the hash with inodes, cache that, and do it that way instead.

@tjgq
Copy link
Contributor

tjgq commented Nov 17, 2022

Yes, my thinking is that #16283 could be adapted to do this - specifically, by having the symlink action itself inject the metadata for the symlink artifact into skyframe, instead of doing it only for BwoB. (I did try this at one point, but ran into some google3-specific issues and would have to look into it again.)

@BalestraPatrick
Copy link
Member

Hello! Just sending an update here that even in Bazel 6, our incremental build times are very much suffering from this issue. As you can see in the following annotated profile trace, over 50% of the build time is spent in Bazel's various steps:

  • "action dependency checking"
  • "complete action execution"
  • "Creating symlink"

All of the above steps account for more than 46s of this build out of 90s. I think this is a very common issue for all projects that have binaries that are in the order of hundreds of MBs (for us this binary is ~700MB). This is basically a blocking issue to adopt Bazel locally due to the overhead compared to other build systems.

Screenshot 2023-02-07 at 3 17 12 PM

@sventiffe
Copy link
Contributor

@coeuvre while we are working towards BwoB improvements for the next release, are there any workarounds that can help to improve the situation described by @BalestraPatrick in the last comment?

@coeuvre
Copy link
Member

coeuvre commented Feb 8, 2023

@BalestraPatrick Can you share a repro that can produce similar profile with the one you shared above? I know there is a repro in #14125 (comment), but it is only for "actuallyCompleteAction" and you mentioned 2.

For "actuallyCompleteAction", I believe most of the time was spend on calculating the digest of outputs (but I can be wrong with your specific case). I do have another "simple" improvement in my mind, let's whether I can create a PR for that.

@BalestraPatrick
Copy link
Member

@coeuvre There was a repro case here (not sure if you meant to link to that). I updated the same repro case with a new commit to inflate the binary a bit more (still 10x less than our binary though) here. You can use bazelisk build //iOSApp --ios_multi_cpus=sim_arm64,x86_64. I can try to find more ways to increase the binary size to see if I can get to a state similar to the profile I shared above. I wasn't able to track down what changes exactly cause those actions to take that amount of time. I'm happy to start using a custom version of Bazel with increased logging if that could help or anything like that.

@brentleyjones
Copy link
Contributor

I pushed another change to bring the binary up to the right size (over 700MB): https://app.buildbuddy.io/invocation/0374c0e7-c0a8-498c-9730-29e03fb8f920#timing. This clearly illustrates the duplicate time on symlinks. I'm still trying to repro the action dependency checking though.

@brentleyjones
Copy link
Contributor

This is an interesting comment:

// TODO(bazel-team): (2010) For RunfilesAction/SymlinkAction and similar actions that
// produce only symlinks we should not check whether inputs are valid at all - all that matters
// that inputs and outputs are still exist (and new inputs have not appeared). All other checks
// are unnecessary. In other words, the only metadata we should check for them is file existence
// itself.

@coeuvre
Copy link
Member

coeuvre commented Feb 13, 2023

@brentleyjones I have created #17478. With that, the time for Creating Symlink ... actions reduced from ~2000ms to ~1ms on my machine using your repro. Can you try that patch with your real world project?

@brentleyjones
Copy link
Contributor

@coeuvre It works ❤️! Would this be able to get into 6.1 as well? This is such a huge improvement for us.

@coeuvre
Copy link
Member

coeuvre commented Feb 13, 2023

I'm glad it works for you! Sure, I believe it's safe to cherrypick it into 6.1.

@BalestraPatrick
Copy link
Member

BalestraPatrick commented Feb 13, 2023

@coeuvre Thank you so much! I tested this in our project and indeed the Creating Symlink actions are much faster now.

Before After
Screenshot 2023-02-13 at 6 30 37 PM Screenshot 2023-02-13 at 6 33 00 PM

As you can see, the brown and blue lines disappeared in the second build.

Note: I had to cherry-pick your change locally to Bazel 6.0 because latest master causes our linking step to get into an infinite loop. I'll try to bisect exactly what change caused this.

Edit: found the cause for that infinite loop. It's 7b4acfe that required this change in rules_apple that we don't have yet, so nothing to worry about.

@chiragramani
Copy link
Contributor

Thanks a lot, @coeuvre! works great for our codebase as well...

@coeuvre coeuvre added P1 I'll work on this now. (Assignee required) and removed P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) untriaged labels Feb 14, 2023
@coeuvre coeuvre self-assigned this Feb 14, 2023
copybara-service bot pushed a commit that referenced this issue Feb 14, 2023
The cost of symlink action scales with the size of input because Bazel re-calculates the digest of the output by following the symlink in `actuallyCompleteAction` (#14125). However, the re-calculation is redundant because the digest was already computed by Bazel when checking the outputs of the generating action. Bazel should be smart enough to reuse the result.

There is a global cache in Bazel for digest computation. Symlink action didn't make use of the cache because it uses the path of symlink as key to look up the cache. This PR changes to use the path of input file (i.e. target path) to query the cache to avoid recalculation.

For a large target (700MB), the time for symlink action is reduced from 2000ms to 1ms.

Closes #17478.

PiperOrigin-RevId: 509524641
Change-Id: Id3c9dc07d68758770c092f6307e2433dad40ba10
hvadehra pushed a commit that referenced this issue Feb 14, 2023
The cost of symlink action scales with the size of input because Bazel re-calculates the digest of the output by following the symlink in `actuallyCompleteAction` (#14125). However, the re-calculation is redundant because the digest was already computed by Bazel when checking the outputs of the generating action. Bazel should be smart enough to reuse the result.

There is a global cache in Bazel for digest computation. Symlink action didn't make use of the cache because it uses the path of symlink as key to look up the cache. This PR changes to use the path of input file (i.e. target path) to query the cache to avoid recalculation.

For a large target (700MB), the time for symlink action is reduced from 2000ms to 1ms.

Closes #17478.

PiperOrigin-RevId: 509524641
Change-Id: Id3c9dc07d68758770c092f6307e2433dad40ba10
@coeuvre
Copy link
Member

coeuvre commented Feb 15, 2023

For "action dependency checking", it's probably there are too many actions are checking the action cache. We have added option to throttle the action cache check. 3d29b2e

@brentleyjones @BalestraPatrick If you are able to reproduce "action dependency checking" overhead, can you try whether that option improves it?

@BalestraPatrick
Copy link
Member

BalestraPatrick commented Feb 15, 2023

@coeuvre Thank you! For us "action dependency checking" doesn't happen always, but I can try to use that flag and report back if I see an improvement.

The other big (and last bottleneck in this series) is the actuallyCompleteAction. I did some digging and profiling and ended up tracking down that 25-30% of our build time is spent in this part (which is 100% of actuallyCompleteAction):

After adding some logging to that line, I see this:

getMetadata took: 1910ms for File:[[<execution_root>]bazel-out/ios-sim_arm64-min14.0-applebin_ios-ios_sim_arm64-fastbuild-ST-ec33f340e4be/bin]/App_bin
getMetadata took: 10964ms for File:[[<execution_root>]bazel-out/applebin_ios-ios_sim_arm64-fastbuild-ST-ec33f340e4be/bin]App.app

Screenshot 2023-02-15 at 2 55 29 PM

07a89-6028-4b55-b900-2e9db1374ff8.png">

The first is a binary and the second is a tree artifact with about 1000 files. Do you have any idea how we can either skip or make this part more performant? I see the docs mention this:

// getMetadata has the side effect of adding the artifact to the cache if it's not there already

Does no-cache come into play in any way here?

@BalestraPatrick
Copy link
Member

For the tree artifact case, this tree visiting logic is where the most time is spent:

TreeArtifactValue.visitTree(
treeDir,
(parentRelativePath, type) -> {
if (chmod && type != Dirent.Type.SYMLINK) {
setPathReadOnlyAndExecutable(treeDir.getRelative(parentRelativePath));
}
if (type == Dirent.Type.DIRECTORY) {
return; // The final TreeArtifactValue does not contain child directories.
}
TreeFileArtifact child = TreeFileArtifact.createTreeOutput(parent, parentRelativePath);
FileArtifactValue metadata;
try {
metadata = constructFileArtifactValueFromFilesystem(child);
} catch (FileNotFoundException e) {
String errorMessage =
String.format(
"Failed to resolve relative path %s inside TreeArtifact %s. "
+ "The associated file is either missing or is an invalid symlink.",
parentRelativePath, treeDir);
throw new IOException(errorMessage, e);
}
tree.putChild(child, metadata);
});

@coeuvre
Copy link
Member

coeuvre commented Feb 15, 2023

@BalestraPatrick Thanks for your investigation!

As you may already know, Bazel tracks files with their content not their path to ensure correctness. For every output an action generates, Bazel must know its digest to correctly track it. Essentially,

Time(action execution) = Time(execute action itself) + Time(compute digest of all outputs)

and the time for computing digests scales with the number and size of outputs. actuallyCompleteAction is where Bazel computes digests. In your case, Bazel is computing digests for App_bin which I assume is large and for all 1000 files inside the tree artifact.

Did you use local or remote execution? I am asking because for local execution, it's probably not much we can do except parallelize the computation.

For remote execution, on the other hand, when Bazel gets the execution result, it is able to get digest for each output from the result (because remote worker computed them).

However, currently, remote execution with all output mode doesn't use that information so actuallyCompleteAction can be optimized in this case.

For remote execution with toplevel or minimal (i.e. build without the bytes), Bazel already uses the pre-computed digests from remote worker so there shouldn't be overhead in actuallyCompleteAction.

no-cache wouldn't help here because it is a must for Bazel to know digest for every output regardless whether the action is cached or not.

@BalestraPatrick
Copy link
Member

@coeuvre Unfortunately these specific actions have to always run locally (we use --modify_execution_info to contain various mnemonics such as BundleTreeApp, ObjcLink, etc.) like many other projects with big outputs. The inputs and outputs are very big, and so the overhead would be even higher in that case. I can see that computing the digest takes about 2s for the big binary, but the remaining computation should be pretty cheap for each file (since they're all pretty small files). I guess parallelizing the computation would help here?

I've also noticed that possibly the digest computation is not reused across actions, is that correct? So for example you can see in my profile that the digest for the binary is computed once already (dark green line which takes 2s), and it likely takes again another 2s as part of the tree artifact computation.

@coeuvre
Copy link
Member

coeuvre commented Feb 15, 2023

@coeuvre Unfortunately these specific actions have to always run locally (we use --modify_execution_info to contain various mnemonics such as BundleTreeApp, ObjcLink, etc.) like many other projects with big outputs. The inputs and outputs are very big, and so the overhead would be even higher in that case.

I see. We have the same issue internally for our iOS builds. I don't have a clear solution yet but I will think about it.

I can see that computing the digest takes about 2s for the big binary, but the remaining computation should be pretty cheap for each file (since they're all pretty small files). I guess parallelizing the computation would help here?

Yes, I believe paralleilizing will help. Also @meisterT and I was profiling the overhead in "action dependency checking" today and found that it could be improved with paralleilizing as well. Our plan is to bring Loom into Bazel and do the I/O optimization with virtual thread. This is going to happen around Q2 (but I can't promise though).

I've also noticed that possibly the digest computation is not reused across actions, is that correct?

Bazel has a global cache for digest computation and the size is controlled by --cache_computed_file_digests. Can you try with a higher value to see whether Bazel is able to reuse the digest for the binary?

If Bazel cannot reuse it, something is definitely wrong. Otherwise, I can probably make the cache smarter to evict entries for small files more frequently.

@brentleyjones
Copy link
Contributor

Ahh, --cache_computed_file_digests. I forgot about that. The default is too small (here is me talking about it in 2020): https://speakerdeck.com/brentleyjones/using-bazel-for-ios-development.

@BalestraPatrick
Copy link
Member

@coeuvre Looks like --cache_computed_file_digests doesn't help directly here because the cache key is based on the path of the file (as noticed by @brentleyjones).

@coeuvre
Copy link
Member

coeuvre commented Feb 16, 2023

Yes, the key is based on the path. I re-read your case above, IIUC, the binary is generated by one action, and then is copied to the tree artifact by another action -- in this case, the cache won't help because the paths are different.

@coeuvre
Copy link
Member

coeuvre commented Feb 16, 2023

I spent some time on the idea of parallelizing digest computation. For my prototype, I was able to reduce the time of actuallyCompleteAction from ~100ms to ~10ms for tree artifact that has 300 files, and from ~600ms to ~100ms for tree artifact that has 3000 files. I am not satisfying with the result but we can see some improvement. I will spend more time on it once I am actually working on Loom related stuff in Q2.

For computing digest for large files, if I read the code correctly, Bazel is using Java code to do the SHA256. One optimization could be using native code instead and utilizing SIMD (e.g. use isa_l_crypto). We did this internally and we were able to see huge improvement for large iOS targets. I don't have time to do it yet, but I would love to review PRs.

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) and removed P1 I'll work on this now. (Assignee required) labels Feb 16, 2023
@sluongng
Copy link
Contributor

For large file digest, might worth take a look at the conversations over at bazelbuild/remote-apis#235, which I think @EdSchouten is planning to implement into Bazel after the PR is merged over at remote-apis.

@coeuvre
Copy link
Member

coeuvre commented Feb 16, 2023

@sluongng Thanks for the pointer! I didn't know PSHA2 is already made public. So my last point for large file digest is basically what is already discussed there.

keertk added a commit that referenced this issue Feb 20, 2023
* Fix symlink file creation overhead

The cost of symlink action scales with the size of input because Bazel re-calculates the digest of the output by following the symlink in `actuallyCompleteAction` (#14125). However, the re-calculation is redundant because the digest was already computed by Bazel when checking the outputs of the generating action. Bazel should be smart enough to reuse the result.

There is a global cache in Bazel for digest computation. Symlink action didn't make use of the cache because it uses the path of symlink as key to look up the cache. This PR changes to use the path of input file (i.e. target path) to query the cache to avoid recalculation.

For a large target (700MB), the time for symlink action is reduced from 2000ms to 1ms.

Closes #17478.

PiperOrigin-RevId: 509524641
Change-Id: Id3c9dc07d68758770c092f6307e2433dad40ba10

* Update ActionMetadataHandler.java

* Create OutputPermissions.java

---------

Co-authored-by: Chi Wang <chiwang@google.com>
Co-authored-by: keertk <110264242+keertk@users.noreply.github.com>
@brentleyjones
Copy link
Contributor

Seems the overhead might not be fully fixed @coeuvre? https://bazelbuild.slack.com/archives/C01E7TH8XK9/p1677829662738059

@coeuvre
Copy link
Member

coeuvre commented Mar 16, 2023

To close the loop, the overhead for symlink is fixed (and verified) in 6.1.

@chiragramani
Copy link
Contributor

Sharing some observations from 6.1 traces for our application:

When BundleTreeApp mnemonic output is only being cached on disk:
Screenshot 2023-03-28 at 1 23 43 PM
BundleTreeApp: 13s
upload outputs: 11s 889ms
actuallyCompleteAction: 1s 395ms

When BundleTreeApp mnemonic output is not cached at all:
Screenshot 2023-03-28 at 1 25 35 PM
BundleTreeApp: 13s
actuallyCompleteAction: 9s 744ms

I am curious to know if the parallelizing digest computation is a part of the plan for the 6.2 release as this would lead to a huge saving in our incremental build times.

@coeuvre
Copy link
Member

coeuvre commented Mar 31, 2023

Linking #17009 for tracking the effect of parallelizing digest computation for tree artifacts.

brentleyjones referenced this issue Apr 19, 2023
This makes it possible to use every core available for checksumming, which
makes a huge difference for large tree artifacts.

Fixes #17009.

RELNOTES: None.
PiperOrigin-RevId: 525085502
Change-Id: I2a995d3445940333c21eeb89b4ba60887f99e51b
@coeuvre
Copy link
Member

coeuvre commented May 2, 2023

The symlink issue has been fixed since 6.1. Parallelizing digest computation is submitted. Closing.

@coeuvre coeuvre closed this as completed May 2, 2023
@brentleyjones
Copy link
Contributor

@chiragramani @BalestraPatrick Can y'all verify that 6.2.0 fully fixes this?

@BalestraPatrick
Copy link
Member

I can bump our project to a version of Bazel that contains this change and report back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts type: feature request
Projects
None yet
Development

No branches or pull requests

9 participants