Memory usage regression in bazel 0.14.0 and 0.14.1 #5389

Closed
alexeagle opened this issue Jun 13, 2018 · 9 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) under investigation

Comments

@alexeagle
Contributor

On Angular and related projects, we run builds on CI in a Docker container. We use the --local_resources flag to work around #3645, which causes random failures where Bazel tries to allocate too much memory, because it asks the OS how much RAM is available rather than the containerization host.
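
The mismatch described above, host RAM versus the container's limit, can be inspected from inside the container. A minimal sketch, assuming a Linux container with cgroup v1 or v2 mounted at the usual paths; the paths and helper names are illustrative and not part of Bazel:

```python
from pathlib import Path
from typing import Optional

# Assumed locations; cgroup v2 and v1 expose the limit in different files.
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)


def parse_meminfo_total_bytes(meminfo_text: str) -> int:
    """Return MemTotal from /proc/meminfo content, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line format: "MemTotal:       16384256 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def parse_cgroup_limit_bytes(limit_text: str) -> Optional[int]:
    """Return the cgroup memory limit in bytes, or None if unlimited."""
    value = limit_text.strip()
    if value == "max":  # cgroup v2 spelling of "no limit"
        return None
    return int(value)


def effective_memory_bytes(host_total: int, cgroup_limit: Optional[int]) -> int:
    """Memory a process can really use: the smaller of host RAM and the limit."""
    if cgroup_limit is None:
        return host_total
    return min(host_total, cgroup_limit)


def read_effective_memory_bytes() -> int:
    """Read the live values; only meaningful on Linux."""
    host = parse_meminfo_total_bytes(Path("/proc/meminfo").read_text())
    for candidate in CGROUP_LIMIT_FILES:
        path = Path(candidate)
        if path.exists():
            return effective_memory_bytes(host, parse_cgroup_limit_bytes(path.read_text()))
    return host
```

If `read_effective_memory_bytes()` returns far less than MemTotal, any tool that sizes itself from the host figure will overshoot the container.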

After updating to 0.14.0 (also observed in 0.14.1), the problem is back: angular/angular#24484. It affects Angular users, whose PRs are failing on CI.

An example failure:
https://circleci.com/gh/gregmagolan/angular-bazel-example/442?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

There is not much indication of what is going wrong, but it looks the same as the errors we got before adding the --local_resources flag.

We are rolling back Bazel in Angular and related projects to 0.13.

@ulfjack
Contributor

ulfjack commented Jun 14, 2018

Please provide the contents of jvm.out.

laszlocsomor added the "under investigation" and "P1 I'll work on this now. (Assignee required)" labels Jun 14, 2018

@gregmagolan
Contributor

> Please provide the contents of jvm.out.

I ssh'd into CircleCI when this failure occurred and checked jvm.out; it is an empty file after the crash. The failure looks like this:

[1] Segmentation fault
[0] 
[0] Server terminated abruptly (error code: 14, error message: '', log file: '/home/circleci/.cache/bazel/_bazel_circleci/9ce5c2144ecf75d11717c0aa41e45a8d/server/jvm.out')
[0] 
[0] bazel run //src:prodserver exited with code 37

It doesn't always occur in the same place (there are a few steps in CI that I've seen it fail at) and on occasion it passes.

Is there anywhere else besides jvm.out that might give a hint about the crash?

@ulfjack
Contributor

ulfjack commented Jun 14, 2018

If the file is empty then that's a sign that the gRPC client/server connection crashed. We've seen a similar case before where we didn't implement flow control correctly for file uploads to remote machines.

Do you have remote caching or execution enabled? If yes, please provide the flags you are using.

@ulfjack
Contributor

ulfjack commented Jun 14, 2018

If these are the right flags, then it doesn't look like it:

[0] INFO: Options provided by the client:
[0]   Inherited 'common' options: --isatty=0 --terminal_columns=80

[0] INFO: Reading rc options for 'run' from /home/circleci/ng/tools/bazel.rc:
[0]   Inherited 'build' options: --strategy=TypeScriptCompile=worker --strategy=AngularTemplateCompile=worker --symlink_prefix=dist/

[0] INFO: Reading rc options for 'run' from /etc/bazel.bazelrc:
[0]   Inherited 'build' options: --noshow_progress --announce_rc --experimental_strict_action_env --experimental_repository_cache=/home/circleci/bazel_repository_cache --local_resources=3072,2.0,1.0
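
For readers unfamiliar with the last flag in that log: in this Bazel version, --local_resources takes three comma-separated numbers, which the documentation of the time describes as available RAM in MB, CPU cores, and relative I/O capacity (1.0 being an average workstation). The sketch below only decodes the value from the log; the class and function names are hypothetical and not part of Bazel.

```python
from typing import NamedTuple


class LocalResources(NamedTuple):
    ram_mb: float     # RAM Bazel assumes is available for local actions
    cpu_cores: float  # CPU cores it assumes are available
    io: float         # relative I/O capacity (1.0 = average workstation)


def parse_local_resources(value: str) -> LocalResources:
    """Parse the comma-separated triple passed to --local_resources."""
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected three comma-separated values, got {value!r}")
    ram_mb, cpu_cores, io = (float(p) for p in parts)
    return LocalResources(ram_mb, cpu_cores, io)


# The value from /etc/bazel.bazelrc in the log above.
print(parse_local_resources("3072,2.0,1.0"))
```

These numbers are the budget Bazel uses when scheduling local actions; they say nothing about the heap of the Bazel server itself.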

@ulfjack
Contributor

ulfjack commented Jun 14, 2018

Hmm, that's odd - this seems to be a null build. Here's a more complete snippet from a passing run:

$ bazel build test/...
...
INFO: Analysed target //test/e2e:e2e (1 packages loaded).
INFO: Found 1 target...
Target //test/e2e:e2e up-to-date:
  dist/bin/test/e2e/app.spec.d.ts
...
$ yarn e2e-prodserver && yarn e2e-devserver
$ concurrently "bazel run //src:prodserver" "while ! nc -z 127.0.0.1 5432; do sleep 1; done && protractor" --kill-others --success first
...
[0] INFO: Analysed target //src:prodserver (1 packages loaded).
[0] INFO: Found 1 target...
[0] Target //src:prodserver up-to-date:
[0]   dist/bin/src/prodserver_bin.sh
[0]   dist/bin/src/prodserver
[0] INFO: Elapsed time: 15.070s, Critical Path: 9.76s
[0] INFO: 2 processes: 1 processwrapper-sandbox, 1 worker.
[0] INFO: Running command line: dist/bin/src/prodserver src -p 5432
...
[0] history-server listening on port 5432; Ctrl+C to stop
...
[0] bazel run //src:prodserver exited with code null
$ concurrently "bazel run //src:devserver" "while ! nc -z 127.0.0.1 5432; do sleep 1; done && protractor" --kill-others --success first
...
[0] INFO: Analysed target //src:devserver (0 packages loaded).
[0] INFO: Found 1 target...
[0] Target //src:devserver up-to-date:
[0]   dist/bin/src/devserver.MF
[0]   dist/bin/src/scripts_devserver.MF
[0]   dist/bin/src/devserver
[0] INFO: Elapsed time: 4.219s, Critical Path: 0.07s
[0] INFO: 0 processes.
[0] INFO: Running command line: dist/bin/src/devserver
[0] Server listening on http://ad2f21ad8e4a:5432/

@ulfjack
Contributor

ulfjack commented Jun 14, 2018

The failing build fails to run //src:prodserver, but it's still just two actions.

@ulfjack
Contributor

ulfjack commented Jun 14, 2018

I think the reference to #3645 is a red herring.

@ulfjack
Contributor

ulfjack commented Jun 14, 2018

If I understand correctly, then you're running the containers with a 4 GB limit. --local_resources does not actually restrict Bazel's own memory usage, but maybe that's the intent?

IIRC, Bazel is set to 4 GB max memory, which is unaffected by --local_resources. Maybe Bazel is trying to allocate too much memory and the container is killing it? Technically, this doesn't imply that Bazel's memory use has actually increased - we might be using a slightly different GC configuration or allocating memory more rapidly, and that might push it over the limit.

If that's correct, then you should try adding --host_jvm_args=-Xmx2G to the Bazel invocation, like this:

bazel --host_jvm_args=-Xmx2G build ...

or like this in the bazelrc:

startup --host_jvm_args=-Xmx2G
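
The reasoning above comes down to simple arithmetic. A rough sketch, assuming the 4 GB container limit and the 4 GB default heap mentioned above (the latter is recalled from memory in this thread, not verified), and ignoring non-heap JVM memory, workers, and the actions themselves; the helper names are hypothetical:

```python
import re

# Multipliers for the common JVM size suffixes (no suffix means bytes).
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_xmx_bytes(flag: str) -> int:
    """Convert a JVM max-heap flag such as '-Xmx2G' to bytes."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", flag)
    if match is None:
        raise ValueError(f"not an -Xmx flag: {flag!r}")
    size, unit = match.groups()
    return int(size) * _UNITS[unit.lower()]


def heap_headroom_bytes(container_limit_bytes: int, xmx_flag: str) -> int:
    """Memory left in the container once the JVM heap is fully used."""
    return container_limit_bytes - parse_xmx_bytes(xmx_flag)


container = 4 * 1024**3
print(heap_headroom_bytes(container, "-Xmx4G"))  # nothing left for anything else
print(heap_headroom_bytes(container, "-Xmx2G"))  # 2 GiB left for workers and actions
```

With a heap cap equal to the container limit, any growth toward the cap leaves no room for the rest of the build, which is consistent with the container killing the server.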

@alexeagle
Contributor Author

Thanks for investigating, Ulf!

Your simple explanation is right. The build passes three times in a row with the 2G heap limit for the bazel JVM. https://circleci.com/gh/alexeagle/workflows/angular-bazel-example/tree/test-bazel-0.14.1

I also have a PR out to update Angular to use 0.14 again (angular/angular#24512), where I've increased the size of the VM used to run one of the jobs.

This should fix it for us. I'm not sure what else we could do on this issue, other than improve the "guard rails" here so it's harder to get the wrong memory limit, or easier to debug. Feel free to close if you don't want to take further action.
