bazel 0.10.0 too aggressive with memory, causes OOM-killer to be invoked #4616
I have had this problem as well while trying to run Bazel in a Docker container on CircleCI. As was mentioned in #3886, I tried to limit local resources. Snippet from the logs:
@mafanasyev-tri Can you reliably reproduce this with Bazel 0.10.0 but not with Bazel 0.9.0? That would be a very interesting case. Are your tools crashing with OOM, or Bazel itself? Can you share what kind of compilers or tools are running out of memory? Could you try whether limiting parallelism (via --jobs=N, maybe starting with N=16) helps?

The problem is likely that you're running many actions in parallel that each consume a lot of RAM. If Bazel runs e.g. 32 JVM-based compilers in parallel, each with a 4 GB heap, you're already at twice your available system memory. Are you possibly using tmpfs on /tmp or wherever Bazel's output base is?

I realize that this is a bad user experience and it should "just work" :( We want to make this better, but so far nobody has had a good idea how to do resource estimation / scheduling for binaries that run as part of a build...
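The arithmetic in the comment above (parallelism times per-action memory must stay under physical RAM) can be turned into a rough rule of thumb for picking --jobs. A minimal sketch; the 2.5 GB per-action figure is an assumption taken from the C++ compile sizes mentioned in this thread, and you would need to measure your own build's actions:

```python
# Sketch: estimate a safe --jobs value so that concurrent actions
# cannot collectively exceed physical RAM. Numbers here mirror the
# machines in this thread (32 cores, 64 GB RAM, ~2.5 GB per compile);
# they are illustrative assumptions, not measured defaults.

def safe_jobs(total_ram_gb: float, per_action_gb: float, cores: int) -> int:
    """Cap parallelism by both core count and RAM headroom."""
    jobs_by_ram = int(total_ram_gb // per_action_gb)
    return max(1, min(cores, jobs_by_ram))

if __name__ == "__main__":
    # 64 GB / 2.5 GB per action allows 25 concurrent actions,
    # which is below the 32-core limit.
    print(safe_jobs(total_ram_gb=64, per_action_gb=2.5, cores=32))  # -> 25
```

This is intentionally conservative: it assumes every concurrent action uses the worst-case memory, which is exactly the heterogeneity (compiles vs. links vs. tests) the thread discusses.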
@philwo (2) I am sure that limiting parallelism would help; unfortunately, it would also increase our CI build times a lot (and they are already more than 30 minutes). This bug is mostly about a regression in functionality -- after all, Bazel 0.9.0 was working just fine, while 0.10.0 fails.

Our build relies on Bazel's memory limit logic: we have 32 cores, 64 GB of RAM, and no swap. Each C++ compile can take up to 2.5 GB of RAM, so if Bazel launches all C++ compiles at once, we get OOM (this is what 0.10.0 seems to do). But we also have link steps, random generation scripts, and unit tests which take much less memory, so if you interleave them, you can still get close to full CPU utilization with no OOMs (I expect this is what 0.9.0 does).

If this bug cannot be fixed, my planned workaround was to limit the number of jobs (and maybe available memory, if this tunable works). Unfortunately, we have 2 CI machine configurations and 3 dev machine configurations, which all differ in the number of cores and amount of RAM -- I would have to manually define the limits for each machine type.

If there are any specific tests that I could run, I am happy to help. I can probably add some debug prints around the scheduling code, if that helps?

(3) Yes, we are using --sandbox_tmpfs_path=/tmp, and our C++ compiler writes large object files (about 0.5 GB each). Bazel 0.9.0 used to handle it just fine, though.
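The per-machine limits described above do not have to be maintained by hand for each of the five machine types: the flag can be derived from the machine's own core count and RAM. A hedged sketch, assuming the Bazel 0.x --local_resources=RAM_MB,CPU,IO flag format; the 1.0 I/O figure and the script itself are illustrative, not something from this thread:

```python
# Sketch: generate a per-machine --local_resources line from
# /proc/meminfo and the core count, instead of hardcoding limits
# for every CI/dev machine configuration.
import os


def local_resources_flag(meminfo_text: str, cores: int) -> str:
    """Build a bazelrc line from the MemTotal entry (values are in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            ram_mb = int(line.split()[1]) // 1024
            break
    else:
        raise ValueError("MemTotal not found in meminfo")
    return "build --local_resources=%d,%d,1.0" % (ram_mb, cores)


if __name__ == "__main__":
    # On Linux, emit a line suitable for appending to a generated bazelrc.
    with open("/proc/meminfo") as f:
        print(local_resources_flag(f.read(), os.cpu_count()))
```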
This is interesting - I wonder if we have a memory leak in the Bazel server, so that memory usage grows with each build until it's too much and the system runs out of memory. Do you run repeated builds against the same Bazel server? Can you log the available memory between runs?
I don't think it is a memory leak -- it looks more like a result of random scheduling. Our test runs these two commands in a loop:
I can tell that before the tests,
I will test this after repeated runs, but it might take some time -- our Jenkins workers do not handle OOMs well, so I am trying to do it in the evenings, when no one is using them.
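One way to collect data for the leak hypothesis discussed above is to sample MemAvailable from /proc/meminfo around each iteration of the build loop, so growth across runs becomes visible. A sketch; the bazel command is a placeholder for the two commands the test actually runs in a loop:

```python
# Sketch: log available memory before each repeated build, to see
# whether the Bazel server's footprint grows run over run.
import subprocess


def mem_available_mb(meminfo_text: str) -> int:
    """Parse the MemAvailable line of /proc/meminfo (values are in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise RuntimeError("MemAvailable not found in meminfo")


if __name__ == "__main__":
    for i in range(10):
        with open("/proc/meminfo") as f:
            print("iteration %d: %d MB available" % (i, mem_available_mb(f.read())))
        # Placeholder: substitute the real commands run in the loop.
        subprocess.run(["bazel", "test", "//..."], check=False)
```

A steadily shrinking MemAvailable across iterations (with the machine otherwise idle) would point at the server; a flat baseline with spikes during builds would point at scheduling, as the reporter suspects.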
@mafanasyev-tri Is this still an issue with recent Bazel versions?
We are using a workaround now. AFAIK the default scheduler is still too aggressive, but now that we can use the local memory estimator, we are good.
Well, the problem is that #4938 is likely the best you can do in an automated manner. There is also #6477, which would let you specify resource requirements in the rules. But those would be estimates too, since you'd have to hardcode values that may or may not match reality (e.g. switching toolchains can change the memory profile of the actions).
Closing, given the claim that the local memory estimator workaround resolves this.
Description of the problem / feature request:
We switched Bazel from 0.9.0 to 0.10.0 and our workers started to fail with out-of-memory errors.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Unfortunately, I do not have a simple reproduction recipe. We just need to build our project (bazel test //...), which is about 10K targets, with many of them being template-heavy C++ files.

Build machines: 32 cores, 64 GB of RAM, no swap, default memory-related settings.
We use the following configuration: --ram_utilization_factor 50, no -j option.

What operating system are you running Bazel on?
Ubuntu 16.04.3 LTS
Linux *** 4.4.0-1041-aws #50-Ubuntu SMP Wed Nov 15 22:18:17 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
What's the output of bazel info release?
Build label: 0.10.0- (@non-git)
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Sat Aug 11 16:17:21 +50074 (1518034925841)
Build timestamp: 1518034925841
Build timestamp as int: 1518034925841
If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.
Downloaded the source from the 0.10.0 release page.
What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD?
This refers to a private git repo that I unfortunately cannot share.
Have you found anything relevant by searching the web?
There was an earlier discussion at https://groups.google.com/forum/#!searchin/bazel-discuss/josh$20pieper%7Csort:date/bazel-discuss/ujUkOus9g68/anihpWogDQAJ, found by searching bazel-discuss and the bug tracker.
#3886 seems related, but there are no cgroups involved.
#3645 / #2946 describe a similar situation, but we do not care about Bazel hangs -- the OOM killer usually kills some other important process first.
Any other information, logs, or outputs that you want to share?
Happy to run any possible diagnostics.