
bazel 0.10.0 too aggressive with memory, causes OOM-killer to be invoked #4616

Closed
mafanasyev-tri opened this issue Feb 9, 2018 · 10 comments
Labels: P2 (We'll consider working on this in future. Assignee optional), team-Local-Exec (Issues and PRs for the Execution (Local) team), type: support / not a bug (process)

Comments

@mafanasyev-tri

Description of the problem / feature request:

We switched bazel from 0.9.0 to 0.10.0 and our workers started to fail with out-of-memory errors.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Unfortunately, I do not have a simple reproduction recipe. We just need to build our project (bazel test //...), which has about 10K targets, many of them template-heavy C++ files.

Build machines: 32 cores, 64 GB of RAM, no swap, default memory-related settings.
We use the following configuration: --ram_utilization_factor 50, no -j option.
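In .bazelrc terms, the setup amounts to something like this (the file itself is just a sketch; only --ram_utilization_factor is something we actually set):

# sketch of a .bazelrc mirroring the configuration above
build --ram_utilization_factor=50
# no -j / --jobs override, so Bazel uses its default parallelism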

What operating system are you running Bazel on?

Ubuntu 16.04.3 LTS
Linux *** 4.4.0-1041-aws #50-Ubuntu SMP Wed Nov 15 22:18:17 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

What's the output of bazel info release?

Build label: 0.10.0- (@non-git)
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Sat Aug 11 16:17:21 +50074 (1518034925841)
Build timestamp: 1518034925841
Build timestamp as int: 1518034925841

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

Downloaded the source from the 0.10.0 release page.

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

This refers to a private git repo that I unfortunately cannot share.

Have you found anything relevant by searching the web?

There was an earlier discussion at https://groups.google.com/forum/#!searchin/bazel-discuss/josh$20pieper%7Csort:date/bazel-discuss/ujUkOus9g68/anihpWogDQAJ

I searched bazel-discuss and the bug tracker.
#3886 seems related, but no cgroups are involved here.
#3645 / #2946 describe a similar situation, but we are not concerned with Bazel hangs -- the OOM killer usually kills some other important process first.

Any other information, logs, or outputs that you want to share?

Happy to run any possible diagnostics.

@mafanasyev-tri
Author

Tried with --ram_utilization_factor 20 and --ram_utilization_factor 10; OOMs still happen.

@Steve-Munday

Steve-Munday commented Feb 13, 2018

I have had this problem as well while trying to run Bazel in a Docker container on CircleCI.

As mentioned in #3886, I tried to limit local resources using --local_resources=4096,4.0,1.0, which should have been plenty on a container limited to 8 GB of RAM; however, my build still fails intermittently with OOM. I suspect that Bazel may not be respecting these limits.
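For reference, the invocation looks roughly like this (the target pattern is a placeholder for our actual targets):

bazel build //... --local_resources=4096,4.0,1.0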

Snippet from the logs:

[74 / 578] Compiling app/muse-cpp/src/session/new_session.cpp; 18s local ... (32 actions, 6 running)

ERROR: /home/circleci/ix/app/BUILD:145:1: C++ compilation of rule '//app:muse-cpp' failed (Exit 1)

app/muse-cpp/src/user/content_manager_impl.cpp:1528:1: fatal error: error writing to /tmp/cchSeBMY.s: Cannot allocate memory
 }
 ^
compilation terminated.

@philwo
Member

philwo commented Feb 15, 2018

@mafanasyev-tri Can you reliably reproduce this with Bazel 0.10.0 but not with Bazel 0.9.0? That would be a very interesting case. Are your tools crashing with OOM or Bazel itself? Can you share what kind of compilers or tools are running out of memory?

Could you check whether limiting parallelism (via --jobs=N, maybe starting with N=16) helps? The problem is likely that you're running many actions in parallel that each consume a lot of RAM. If Bazel runs e.g. 32 JVM-based compilers in parallel, each with a 4 GB heap, you're already at twice your available system memory.
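For example (a sketch; the N=16 value is just a starting point):

bazel test //... --jobs=16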

Are you possibly using tmpfs on /tmp or where Bazel's output base is (check with bazel info output_base)? Or using --sandbox_tmpfs_path=/tmp and your actions write big outputs to /tmp? Or --experimental_sandbox_base=/dev/shm?
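A quick way to check (a sketch, assuming standard coreutils; df just shows which filesystem backs each path):

bazel info output_base
df -h "$(bazel info output_base)" /tmp /dev/shm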

I realize that this is a bad user experience and it should "just work" :( We want to make this better, but so far nobody has had a good idea of how to do resource estimation / scheduling for binaries that run as part of a build...

philwo self-assigned this on Feb 15, 2018
philwo added the P2 (We'll consider working on this in future. Assignee optional) and type: support / not a bug (process) labels on Feb 15, 2018
@mafanasyev-tri
Author

@philwo
(1) Yes, I can reliably reproduce it with 0.10.0 -- I change a single line (the Bazel version used) and run repeated clean/build cycles. With 0.9.0, Bazel builds 15 times with no problems (and then continues to build all day as part of other CI jobs). With 0.10.0, the script OOMs on the 3rd or 4th attempt.

(2) I am sure that limiting parallelism would help; unfortunately, it would also increase our CI build times a lot (they are already more than 30 minutes). This bug is mostly about a regression in functionality -- after all, Bazel 0.9.0 was working just fine, while 0.10.0 fails.

Our build relies on Bazel's memory-limiting logic: we have 32 cores, 64 GB of RAM, and no swap. Each C++ compile can take up to 2.5 GB of RAM, so if Bazel launches all C++ compiles at once, we get an OOM (this is what 0.10.0 seems to do). But we also have link steps, various generation scripts, and unit tests that take much less memory, so if you interleave them, you can still get close to full CPU utilization with no OOMs (I expect this is what 0.9.0 does).
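To put rough numbers on that (just restating the figures above): 32 concurrent C++ compiles at up to 2.5 GB each is about 80 GB, well over the 64 GB of physical RAM with no swap to absorb the overshoot, whereas a mix of, say, 16 compiles (~40 GB) plus 16 lighter link/test actions stays comfortably under it.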

If this bug cannot be fixed, my planned workaround is to limit the number of jobs (and maybe the available memory, if that tunable works). Unfortunately, we have 2 CI machine configurations and 3 dev machine configurations, which all differ in number of cores and RAM -- I would have to manually define the limits for each machine type.
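What I had in mind is roughly one ~/.bazelrc per machine type, e.g. for the 32-core / 64 GB workers (values are illustrative; --local_resources takes RAM in MB, CPUs, and I/O, as used earlier in this thread):

build --local_resources=49152,32.0,1.0
build --jobs=24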

If there are any specific tests that I could run, I am happy to help. I can probably add some debug prints around the scheduling code, if that helps.

(3) Yes, we are using --sandbox_tmpfs_path=/tmp, and our C++ compiler writes large object files (about 0.5 GB each). Bazel 0.9.0 used to handle this just fine, though.

@philwo
Member

philwo commented Feb 20, 2018

With 0.10.0, the script OOMs on the 3rd or 4th attempt.

This is interesting - I wonder whether we have a memory leak in the Bazel server, so that memory usage grows with each build until it's too much and the system runs out of memory?

Do you run bazel clean as part of your script? Can you try with bazel clean --expunge? That would terminate the server so each run should be isolated in terms of memory usage.

Can you log free -m before and after each run of your script?

@mafanasyev-tri
Author

I don't think it is a memory leak -- it looks more like a result of random scheduling.

Our test runs these two commands in a loop:

bazel --batch clean --expunge
bazel --batch test //...

I can tell that before the tests, free -m returns:

              total        used        free      shared  buff/cache   available
Mem:          60382         911       15800          48       43670       58800
Swap:             0           0           0

I will check this after repeated runs, but it might take some time -- our Jenkins workers do not handle OOMs well, so I am trying to do this in the evenings, when no one is using them.
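Roughly, the logging run would look like this (memlog.txt and the repeat count are arbitrary):

for i in $(seq 1 15); do
  free -m >> memlog.txt
  bazel --batch clean --expunge
  bazel --batch test //...
  free -m >> memlog.txt
done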

@philwo
Member

philwo commented Jul 19, 2018

@mafanasyev-tri Is this still an issue with recent Bazel versions?

@mafanasyev-tri
Author

We are using a workaround now: the --experimental_local_memory_estimate option, introduced in #4938 and merged into master.

AFAIK, the default scheduler is still too aggressive, but now that we can use the local memory estimator, we are good.
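Concretely, the change on our side is a single .bazelrc line (a sketch, assuming the flag is a plain boolean as in #4938):

build --experimental_local_memory_estimate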

@jmmv
Contributor

jmmv commented Mar 14, 2019

Well, a problem is that --ram_utilization_factor is a lie. Bazel has hardcoded heuristics to estimate how much RAM an action will need... but if you have template-heavy compilations as you say, Bazel's built-in values just don't work.

#4938 is likely the best you can do in an automated manner.

There is also #6477, which would let you specify resource requirements in the rules. But those would be estimates too, as you'd have to hardcode values that may or may not match reality (e.g. switching toolchains can change the memory profile of the actions).

@jmmv
Contributor

jmmv commented May 11, 2020

Closing given the claim that --experimental_local_memory_estimate is good enough.

jmmv closed this as completed on May 11, 2020