
bazel 0.10.0 too aggressive with memory, causes OOM-killer to be invoked #4616

Closed
mafanasyev-tri opened this issue Feb 9, 2018 · 10 comments
Labels: P2 (We'll consider working on this in future. Assignee optional), team-Local-Exec (Issues and PRs for the Execution (Local) team), type: support / not a bug (process)

Comments

@mafanasyev-tri

Description of the problem / feature request:

We switched bazel from 0.9.0 to 0.10.0 and our workers started to fail with out-of-memory errors.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Unfortunately, I do not have a simple reproduction recipe. We just need to build our project (bazel test //...), which has about 10K targets, many of them template-heavy C++ files.

Build machines: 32 cores, 64 GB of RAM, no swap, default memory-related settings.
We use the following configuration: --ram_utilization_factor 50, no -j option.
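In .bazelrc terms, the setup amounts to something like this (the file itself is just a sketch; only --ram_utilization_factor is something we actually set):

# sketch of a .bazelrc mirroring the configuration above
build --ram_utilization_factor=50
# no -j / --jobs override, so Bazel uses its default parallelism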

What operating system are you running Bazel on?

Ubuntu 16.04.3 LTS
Linux *** 4.4.0-1041-aws #50-Ubuntu SMP Wed Nov 15 22:18:17 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

What's the output of bazel info release?

Build label: 0.10.0- (@non-git)
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Sat Aug 11 16:17:21 +50074 (1518034925841)
Build timestamp: 1518034925841
Build timestamp as int: 1518034925841

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

Downloaded the source from the 0.10.0 release page.

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

This refers to a private git repo that I unfortunately cannot share.

Have you found anything relevant by searching the web?

There was an earlier discussion at https://groups.google.com/forum/#!searchin/bazel-discuss/josh$20pieper%7Csort:date/bazel-discuss/ujUkOus9g68/anihpWogDQAJ

I searched bazel-discuss and the bug tracker.
#3886 seems related, but no cgroups are involved here.
#3645 / #2946 describe a similar situation, but we are not concerned with Bazel hangs -- the OOM killer usually kills some other important process first.

Any other information, logs, or outputs that you want to share?

Happy to run any possible diagnostics.

@mafanasyev-tri
Author

Tried with --ram_utilization_factor 20 and --ram_utilization_factor 10; OOMs still happen.

@Steve-Munday

Steve-Munday commented Feb 13, 2018

I have had this problem as well while trying to run Bazel in a Docker container on CircleCI.

As mentioned in #3886, I tried to limit local resources using --local_resources=4096,4.0,1.0, which should have been plenty on a container limited to 8 GB of RAM; however, my build still fails intermittently with OOM. I suspect that Bazel may not be respecting these limits.
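For reference, the invocation looks roughly like this (the target pattern is a placeholder for our actual targets):

bazel build //... --local_resources=4096,4.0,1.0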

Snippet from the logs:

[74 / 578] Compiling app/muse-cpp/src/session/new_session.cpp; 18s local ... (32 actions, 6 running)

ERROR: /home/circleci/ix/app/BUILD:145:1: C++ compilation of rule '//app:muse-cpp' failed (Exit 1)

app/muse-cpp/src/user/content_manager_impl.cpp:1528:1: fatal error: error writing to /tmp/cchSeBMY.s: Cannot allocate memory
 }
 ^
compilation terminated.

@philwo
Member

philwo commented Feb 15, 2018

@mafanasyev-tri Can you reliably reproduce this with Bazel 0.10.0 but not with Bazel 0.9.0? That would be a very interesting case. Are your tools crashing with OOM or Bazel itself? Can you share what kind of compilers or tools are running out of memory?

Could you check whether limiting parallelism (via --jobs=N, maybe starting with N=16) helps? The problem is likely that you're running many actions in parallel that each consume a lot of RAM. If Bazel runs e.g. 32 JVM-based compilers in parallel, each with a 4 GB heap, you're already at twice your available system memory.
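For example (a sketch; the N=16 value is just a starting point):

bazel test //... --jobs=16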

Are you possibly using tmpfs on /tmp or where Bazel's output base is (check with bazel info output_base)? Or using --sandbox_tmpfs_path=/tmp and your actions write big outputs to /tmp? Or --experimental_sandbox_base=/dev/shm?
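A quick way to check (a sketch, assuming standard coreutils; df just shows which filesystem backs each path):

bazel info output_base
df -h "$(bazel info output_base)" /tmp /dev/shm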

I realize that this is a bad user experience and it should "just work" :( We want to make this better, but so far nobody has had a good idea of how to do resource estimation / scheduling for binaries that run as part of a build...

philwo self-assigned this on Feb 15, 2018
philwo added the P2 (We'll consider working on this in future. Assignee optional) and type: support / not a bug (process) labels on Feb 15, 2018
@mafanasyev-tri
Author

@philwo
(1) Yes, I can reliably reproduce it with 0.10.0 -- I change a single line (the Bazel version used) and run repeated clean/build cycles. With 0.9.0, Bazel builds 15 times with no problems (and then continues to build all day as part of other CI jobs). With 0.10.0, the script OOMs on the 3rd or 4th attempt.

(2) I am sure that limiting parallelism would help; unfortunately, it would also increase our CI build times a lot (they are already more than 30 minutes). This bug is mostly about a regression in functionality -- after all, Bazel 0.9.0 was working just fine, while 0.10.0 fails.

Our build relies on Bazel's memory-limiting logic: we have 32 cores, 64 GB of RAM, and no swap. Each C++ compile can take up to 2.5 GB of RAM, so if Bazel launches all C++ compiles at once, we get an OOM (this is what 0.10.0 seems to do). But we also have link steps, various generation scripts, and unit tests that take much less memory, so if you interleave them, you can still get close to full CPU utilization with no OOMs (I expect this is what 0.9.0 does).
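To put rough numbers on that (just restating the figures above): 32 concurrent C++ compiles at up to 2.5 GB each is about 80 GB, well over the 64 GB of physical RAM with no swap to absorb the overshoot, whereas a mix of, say, 16 compiles (~40 GB) plus 16 lighter link/test actions stays comfortably under it.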

If this bug cannot be fixed, my planned workaround is to limit the number of jobs (and maybe the available memory, if that tunable works). Unfortunately, we have 2 CI machine configurations and 3 dev machine configurations, which all differ in number of cores and RAM -- I would have to manually define the limits for each machine type.
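What I had in mind is roughly one ~/.bazelrc per machine type, e.g. for the 32-core / 64 GB workers (values are illustrative; --local_resources takes RAM in MB, CPUs, and I/O, as used earlier in this thread):

build --local_resources=49152,32.0,1.0
build --jobs=24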

If there are any specific tests that I could run, I am happy to help. I can probably add some debug prints around the scheduling code, if that helps.

(3) Yes, we are using --sandbox_tmpfs_path=/tmp, and our C++ compiler writes large object files (about 0.5 GB each). Bazel 0.9.0 used to handle this just fine, though.

@philwo
Member

philwo commented Feb 20, 2018

With 0.10.0, the script OOMs on the 3rd or 4th attempt.

This is interesting - I wonder whether we have a memory leak in the Bazel server, so that memory usage grows with each build until it's too much and the system runs out of memory?

Do you run bazel clean as part of your script? Can you try with bazel clean --expunge? That would terminate the server so each run should be isolated in terms of memory usage.

Can you log free -m before and after each run of your script?

@mafanasyev-tri
Author

I don't think it is a memory leak -- it looks more like a result of random scheduling.

Our test runs these two commands in a loop:

bazel --batch clean --expunge
bazel --batch test //...

I can tell that before the tests, free -m returns:

              total        used        free      shared  buff/cache   available
Mem:          60382         911       15800          48       43670       58800
Swap:             0           0           0

I will check this after repeated runs, but it might take some time -- our Jenkins workers do not handle OOMs well, so I am trying to do this in the evenings, when no one is using them.
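Roughly, the logging run would look like this (memlog.txt and the repeat count are arbitrary):

for i in $(seq 1 15); do
  free -m >> memlog.txt
  bazel --batch clean --expunge
  bazel --batch test //...
  free -m >> memlog.txt
done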

@philwo
Member

philwo commented Jul 19, 2018

@mafanasyev-tri Is this still an issue with recent Bazel versions?

@mafanasyev-tri
Author

We are using a workaround now: the --experimental_local_memory_estimate option, introduced in #4938 and merged into master.

AFAIK, the default scheduler is still too aggressive, but now that we can use the local memory estimator, we are good.
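Concretely, the change on our side is a single .bazelrc line (a sketch, assuming the flag is a plain boolean as in #4938):

build --experimental_local_memory_estimate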

@jmmv
Contributor

jmmv commented Mar 14, 2019

Well, a problem is that --ram_utilization_factor is a lie. Bazel has hardcoded heuristics to estimate how much RAM an action will need... but if you have template-heavy compilations as you say, Bazel's built-in values just don't work.

#4938 is likely the best you can do in an automated manner.

There is also #6477, which would let you specify resource requirements in the rules. But those would be estimates too, as you'd have to hardcode values that may or may not match reality (e.g. switching toolchains can change the memory profile of the actions).

@jmmv
Contributor

jmmv commented May 11, 2020

Closing given the claim that --experimental_local_memory_estimate is good enough.

jmmv closed this as completed on May 11, 2020