Symlinked sandbox is slow #16711

Open
lberki opened this issue Nov 9, 2022 · 15 comments
Assignees
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Local-Exec Issues and PRs for the Execution (Local) team type: bug

Comments

@lberki
Contributor

lberki commented Nov 9, 2022

Description of the bug:

The symlinked sandbox is slow when there is a large number of input files (I have seen reports of actions with up to 300K)

There are a number of ways one could improve this:

  1. Creating the input directories and symlinks on multiple threads (SandboxHelpers currently does this on one thread)
  2. Traversing the Java -> C++ boundary less frequently
  3. Using one symlink per large tree artifact instead of symlinking each file in it separately
  4. Using io_uring on Linux for more efficient data transfer to the kernel
  5. Keeping the file system created for an action around and re-using it if the same action (or a similar one) is executed again
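
As a rough illustration of option 1, the sketch below creates symlinks on a fixed-size thread pool using only the JDK. `ParallelSymlinker` and `createSymlinks` are hypothetical names chosen for this example, not Bazel APIs, and the code does not reflect how SandboxHelpers is actually structured; whether it helps in practice would need measuring.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper (not part of Bazel): creates sandbox symlinks on a thread pool. */
public final class ParallelSymlinker {
  private ParallelSymlinker() {}

  /**
   * Creates one symlink per entry of {@code links} (sandbox-relative path -> target)
   * under {@code sandboxRoot}, spreading the work over {@code threads} threads.
   */
  public static void createSymlinks(Path sandboxRoot, Map<Path, Path> links, int threads)
      throws IOException, InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (Map.Entry<Path, Path> e : links.entrySet()) {
        pending.add(pool.submit(() -> {
          Path link = sandboxRoot.resolve(e.getKey());
          try {
            // createDirectories tolerates directories that already exist,
            // so several tasks may share a parent directory.
            Files.createDirectories(link.getParent());
            Files.createSymbolicLink(link, e.getValue());
          } catch (IOException ex) {
            throw new UncheckedIOException(ex);
          }
        }));
      }
      // Wait for every task and surface the first I/O failure.
      for (Future<?> f : pending) {
        try {
          f.get();
        } catch (ExecutionException ex) {
          if (ex.getCause() instanceof UncheckedIOException) {
            throw ((UncheckedIOException) ex.getCause()).getCause();
          }
          throw new IOException(ex.getCause());
        }
      }
    } finally {
      pool.shutdownNow();
    }
  }
}
```

Each task still makes its own file system calls, so this only addresses option 1; it does nothing to reduce the number of Java -> C++ transitions mentioned in option 2.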

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

No response

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@lberki lberki added the team-Local-Exec Issues and PRs for the Execution (Local) team label Nov 9, 2022
@matthewjh

matthewjh commented Nov 9, 2022

Yes, please. It is known to be particularly slow on macOS:

#8230

In my (albeit limited) experience, the biggest barrier to integrating Bazel into local developer workflows (where quick feedback is paramount) is the sandboxing time, which in many cases far outstrips action exec time.

@brentleyjones
Contributor

> Using one symlink per large tree artifact instead of symlinking each file in it separately

This one would really help for Apple platform builds where they produce bundles of bundles.

@meisterT
Member

cc @larsrc-google

@lberki There is --experimental_reuse_sandbox_directories - have you tried that?

@larsrc-google
Contributor

--experimental_reuse_sandbox_directories is essentially your point #5, and it helped a lot. #3 (tree artifacts) does sound reasonable. The others I'd like to have some measurements for first, to see how much we can actually save.

@lberki
Contributor Author

lberki commented Nov 17, 2022

Learned today: https://github.com/ikorennoy/jasyncfio , io_uring in Java (I'm not sure if it's useful and it'd be an extra dependency, but I don't want this nugget of data to get lost)

@larsrc-google
Contributor

We'll need some reproducible examples. I tried compiling Bazel itself with and without sandbox in various worker/non-worker configurations, and the difference was minimal.

@lberki
Contributor Author

lberki commented Nov 17, 2022

Did you try synthetic loads? That would be a much easier avenue than testing on Chrome OS / Kleaf builds (AFAIU they have 300K/80K input files, but I don't know how many and how big TreeArtifacts there are in the former)
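
One possible shape for such a synthetic load, using only the JDK: generate a directory of empty files, then time symlinking every file individually against creating a single symlink to the whole directory. `SyntheticSandboxLoad` and its methods are hypothetical names for this sketch. It measures raw file system cost from Java only, not Bazel's actual sandbox code path, so any numbers it produces are at best a rough proxy.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical micro-benchmark; not related to any Bazel class. */
public final class SyntheticSandboxLoad {
  private SyntheticSandboxLoad() {}

  /** Creates {@code count} empty files in a new directory "tree" under {@code parent}. */
  public static Path createInputs(Path parent, int count) throws IOException {
    Path tree = Files.createDirectory(parent.resolve("tree"));
    for (int i = 0; i < count; i++) {
      Files.createFile(tree.resolve("input_" + i));
    }
    return tree;
  }

  /** Symlinks every file of {@code tree} into {@code sandboxDir}; returns the link count. */
  public static int linkEachFile(Path tree, Path sandboxDir) throws IOException {
    Files.createDirectories(sandboxDir);
    List<Path> files;
    try (Stream<Path> s = Files.list(tree)) {
      files = s.collect(Collectors.toList());
    }
    for (Path f : files) {
      Files.createSymbolicLink(sandboxDir.resolve(f.getFileName()), f);
    }
    return files.size();
  }

  /** Creates a single symlink {@code link} that points at the whole {@code tree}. */
  public static void linkWholeTree(Path tree, Path link) throws IOException {
    Files.createDirectories(link.getParent());
    Files.createSymbolicLink(link, tree);
  }

  public static void main(String[] args) throws IOException {
    int count = args.length > 0 ? Integer.parseInt(args[0]) : 10_000;
    Path work = Files.createTempDirectory("synthetic-sandbox");
    Path tree = createInputs(work, count);

    long t0 = System.nanoTime();
    linkEachFile(tree, work.resolve("sandbox-per-file/tree"));
    long t1 = System.nanoTime();
    linkWholeTree(tree, work.resolve("sandbox-per-tree/tree"));
    long t2 = System.nanoTime();

    System.out.printf("per-file symlinks:   %d ms%n", (t1 - t0) / 1_000_000);
    System.out.printf("single tree symlink: %d ms%n", (t2 - t1) / 1_000_000);
  }
}
```

Running it with the file count as the first argument (for example 300000) would approximate the input sizes mentioned above, though with a flat directory rather than a realistic source tree.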

@meisterT
Member

@lberki Can you share the build that actually triggers the slowness? From there we can work towards a more minimal repro.

@lberki
Contributor Author

lberki commented Nov 17, 2022

Plussed @larsrc-google into the pertinent threads (unfortunately, they are Google internal communications even though they are about the interaction between two Google open source projects...)

@meisterT meisterT added the P3 We're not considering working on this, but happy to review a PR. (No assignee) label Aug 3, 2023
@jacky8hyf

Kleaf uses sandbox builds by default (though we also encourage developers to disable the costly sandboxes for local development). This feature will greatly improve the build time for Kleaf.

I can provide some metrics for the time spent on sandbox creation for Kleaf builds on build bots (ci.android.com) upon request (the data is public but the dashboard is internal only).

@lberki
Contributor Author

lberki commented Sep 6, 2023

Ack. Numbers would be really helpful to aid in our prioritization decisions.

@metti
Contributor

metti commented Feb 29, 2024

I am currently looking into potential improvements for the SymlinkedSandbox. In particular, I am exploring pushing more work to JNI in batches and using io_uring for I/O.

@larsrc-google
Contributor

I looked a bunch at io_uring, but dropped it again when I heard it had several bugs, including security-critical ones. Doing a batch API that can be implemented with JNI or io_uring or Loom threads would be good, though.
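
To make the batch idea concrete, here is a minimal sketch of what a backend-neutral interface could look like. All names (`SymlinkBatches`, `Request`, `BatchSymlinker`, `SequentialSymlinker`) are hypothetical and nothing here exists in Bazel; only a plain sequential JDK backend is shown, on the assumption that a JNI, io_uring, or virtual-thread backend could implement the same interface.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical sketch of a batch symlink API; none of these types exist in Bazel. */
public final class SymlinkBatches {
  private SymlinkBatches() {}

  /** One requested link: create {@code link} pointing at {@code target}. */
  public static final class Request {
    final Path link;
    final Path target;

    public Request(Path link, Path target) {
      this.link = link;
      this.target = target;
    }
  }

  /** Callers hand over the whole batch at once, so a backend can cross into native code once. */
  public interface BatchSymlinker {
    void createAll(List<Request> requests) throws IOException;
  }

  /** Baseline backend: plain sequential JDK calls, useful as a reference for measurements. */
  public static final class SequentialSymlinker implements BatchSymlinker {
    @Override
    public void createAll(List<Request> requests) throws IOException {
      for (Request r : requests) {
        Files.createDirectories(r.link.getParent());
        Files.createSymbolicLink(r.link, r.target);
      }
    }
  }
}
```

Keeping the interface at the level of "a list of links" leaves the choice of mechanism open, which fits the request for measurements before committing to any one of them.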

@ismell

ismell commented Sep 17, 2024

  • Using one symlink per large tree artifact instead of symlinking each file in it separately

This one would have made the largest impact for ChromeOS.

@lberki
Contributor Author

lberki commented Sep 18, 2024

😢


9 participants