
Flaky behavior on MacOS #17437

Closed

mkruskal-google opened this issue Feb 7, 2023 · 20 comments

Labels: help wanted, P3, platform: apple, potential 5.x cherry-picks, team-Rules-CPP, type: bug

Comments

@mkruskal-google (Contributor)

Description of the bug:

We recently switched from Kokoro to GitHub Actions in the protobuf repo. This substantially reduced our flakes, and there's really only one left that's causing issues. We only see it on macOS Bazel builds, at a rough rate somewhere between 1% and 5%. We use Bazel 5.1.1 in all our testing. One example of it is here: https://github.com/protocolbuffers/protobuf/actions/runs/4117212221/jobs/7108243227

The symptom is always the same, and looks like a toolchain resolution issue (see below). We have 12 macOS builds in our CI, so even a 1-5% failure rate is a pretty big issue for us.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run any Bazel build on macOS ~30 times and you're likely to see this failure at least once (see the sketch below).
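For illustration, a minimal repro loop along these lines (the target name is a placeholder; `bazel clean --expunge` drops the cached external repositories so the `local_config_cc` autoconfiguration, where the failure occurs, re-runs each time):

```sh
# Hypothetical repro loop: any target works, since the failure happens
# during C++ toolchain autoconfiguration, before the build itself.
for i in $(seq 1 30); do
  bazel clean --expunge                      # force local_config_cc to re-run
  bazel build //some:target || { echo "flaked on attempt $i"; break; }
done
```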

Which operating system are you running Bazel on?

macOS

What is the output of bazel info release?

development 5.1.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

We use the Bazelisk installed on our GitHub runners and set USE_BAZEL_VERSION=5.1.1 (see the snippet below).
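For reference, the pinning amounts to something like this (a sketch, not the exact CI step):

```sh
# Bazelisk reads this env var to decide which Bazel version to fetch and run.
export USE_BAZEL_VERSION=5.1.1
bazelisk version   # should download and report Bazel 5.1.1
```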

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

I found #11520, which looks like a similar issue dating back over 2 years.

Any other information, logs, or outputs that you want to share?

  INFO: Repository local_config_cc instantiated at:
    /DEFAULT.WORKSPACE.SUFFIX:519:13: in <toplevel>
    /private/var/tmp/_bazel_runner/88067fd1245450c0f2a417255e46b72b/external/bazel_tools/tools/cpp/cc_configure.bzl:184:16: in cc_configure
  Repository rule cc_autoconf defined at:
    /private/var/tmp/_bazel_runner/88067fd1245450c0f2a417255e46b72b/external/bazel_tools/tools/cpp/cc_configure.bzl:145:30: in <toplevel>
  ERROR: An error occurred during the fetch of repository 'local_config_cc':
     Traceback (most recent call last):
  	File "/private/var/tmp/_bazel_runner/88067fd1245450c0f2a417255e46b72b/external/bazel_tools/tools/cpp/cc_configure.bzl", line 125, column 32, in cc_autoconf_impl
  		configure_osx_toolchain(repository_ctx, cpu_value, overriden_tools)
  	File "/private/var/tmp/_bazel_runner/88067fd1245450c0f2a417255e46b72b/external/bazel_tools/tools/cpp/osx_cc_configure.bzl", line 211, column 25, in configure_osx_toolchain
  		_compile_cc_file(repository_ctx, libtool_check_unique_src_path, "libtool_check_unique")
  	File "/private/var/tmp/_bazel_runner/88067fd1245450c0f2a417255e46b72b/external/bazel_tools/tools/cpp/osx_cc_configure.bzl", line 136, column 37, in _compile_cc_file
  		_compile_cc_file_single_arch(repository_ctx, src_name, out_name)
  	File "/private/var/tmp/_bazel_runner/88067fd1245450c0f2a417255e46b72b/external/bazel_tools/tools/cpp/osx_cc_configure.bzl", line 83, column 13, in _compile_cc_file_single_arch
  		fail(out_name + " failed to generate. Please file an issue at " +
  Error in fail: libtool_check_unique failed to generate. Please file an issue at https://github.com/bazelbuild/bazel/issues with the following:
  return code 256, stderr: Timed out, stdout: 
@ShreeM01 added the type: bug, untriaged, and team-OSS (Issues for the Bazel OSS team: installation, release process, Bazel packaging, website) labels Feb 7, 2023
@Apollo9999

Please see the screenshot: [screenshot from the protocolbuffers CI run]

@mkruskal-google (Contributor, Author)

mkruskal-google commented Feb 8, 2023

Yeah, that's a consequence of the bazel run failing (we blindly assume it will succeed and try to run a generated executable). We've seen this root cause on pretty much every macOS Bazel test we have; this one just behaves a bit badly afterwards. See the earlier error:
[screenshot of the earlier toolchain error]

@fweikert added the team-Rules-CPP (Issues for C++ rules) label and removed the team-OSS (Issues for the Bazel OSS team: installation, release process, Bazel packaging, website) label Feb 14, 2023
@susinmotion (Contributor)

@susinmotion following.

@keith (Member)

keith commented Feb 14, 2023

@thomasvl was also looking at this. I do wonder, in this case: if it hadn't failed here, would the first compile just have failed instead? 60 seconds is a huge timeout for the single compile that's failing here.

@susinmotion (Contributor)

@googlewalt

copybara-service bot pushed a commit to protocolbuffers/protobuf that referenced this issue Feb 15, 2023 (the same change was pushed several times as it was iterated on)

Bazel has a 2 minute timeout for its internal `xcrun` call, which can be exceeded on our GitHub runners about 5% of the time. This leads to flakes and opaque errors, but the slow call is a one-time cost: subsequent xcruns finish in seconds, so we can just do an initial call w/o a timeout before running Bazel.

With this change our total flake rate drops from ~30% to nearly 0% for our full suite of tests.

See bazelbuild/bazel#17437 for background.

PiperOrigin-RevId: 509944178
@mkruskal-google (Contributor, Author)

Note: after a lot of debugging, we traced this down to the xcrun calls initiated from https://github.com/bazelbuild/bazel/blob/e8a69f5d5acaeb6af760631490ecbf73e8a04eeb/tools/cpp/osx_cc_configure.bzl. The xcode-locator can sometimes take over 2m, and even the faster calls can occasionally take over 1m. Pre-warming these by running them manually before Bazel speeds them up enough to fix our flakes. Ideally these timeouts would all be configurable, though, and the calls sped up if possible.

@keith (Member)

keith commented Feb 15, 2023

How many versions of Xcode are installed in that case? Do your pre-warm calls take the same long amount of time even when you run them first? If this is an OS cache issue you could probably just run one, throw away the result, and then run bazel normally, instead of having to try and reproduce what it's doing in your script.

@mkruskal-google (Contributor, Author)

mkruskal-google commented Feb 16, 2023

There are 8 versions of Xcode installed on our GitHub runners, but we've already pinned one of them using DEVELOPER_DIR. What I found is that the first run of the xcode-locator takes anywhere from 30s to 2m+. Subsequent runs finish in a few seconds. I tried clearing every cache I could track down (Bazel, Xcode, clang), but I couldn't get the second run to ever take longer.

My first attempt at a fix only ran the xcode-locator, but that led to a (slightly rarer) timeout from some of the other xcrun calls in that file. When I replaced this with a duplication of all 5 xcrun calls, the flakes disappeared (see the sketch below).
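A rough sketch of that workaround (the Xcode path is a placeholder, and the exact warm-up calls should mirror whatever osx_cc_configure.bzl runs, which can differ between Bazel versions):

```sh
# Pin a single Xcode so toolchain discovery has less work to do.
export DEVELOPER_DIR=/Applications/Xcode_14.2.app   # placeholder path

# Warm up the tools before Bazel's repository rule invokes them under
# its own (short) timeout; only the first invocations are slow.
xcodebuild -version
xcrun --sdk macosx --show-sdk-path
xcrun clang --version
```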

keith added a commit to keith/bazel that referenced this issue Feb 16, 2023
This switches all macOS toolchain setup compiles and executes to use the default timeout of 600s. This should help avoid issues on GitHub Actions where these time out and cause build failures. The common case shouldn't really be affected.

bazelbuild#17437
keith added a commit to bazelbuild/apple_support that referenced this issue Feb 16, 2023 (the same change, applied to apple_support; see bazelbuild/bazel#17437)
@keith (Member)

keith commented Feb 16, 2023

Takeaway from a meeting about this: we don't know why these things are so slow the first time. We're going to make a few changes to try and help things:

  1. Avoid running xcode-locator, with either an opt-in ("Add BAZEL_SKIP_XCODE_FETCH to reduce toolchain configuration time", apple_support#191) or opt-out ("Add BAZEL_ALLOW_NON_APPLICATIONS_XCODE to run xcode-locator", apple_support#197) env var, at least if you're also locking the Xcode version (see the sketch after this list)
  2. Potentially provide prebuilt binaries for the few tools the toolchain compiles, to avoid the clang invocations (tracking: "Precompile toolchain binaries to avoid issues", apple_support#202)
  3. Bump the timeouts for the compiles; realistically we'd rather things take longer than fail ("Increase timeouts for macOS toolchain setup", #17519 and apple_support#203)
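A hedged usage sketch for item 1 (the two PRs are competing variants; exact names and semantics are per the linked apple_support PRs and not verified here):

```sh
# Variant 1: opt in to skipping xcode-locator entirely (apple_support#191).
export BAZEL_SKIP_XCODE_FETCH=1

# Variant 2: xcode-locator only runs if you opt in, e.g. when you need
# Xcodes installed outside /Applications (apple_support#197).
export BAZEL_ALLOW_NON_APPLICATIONS_XCODE=1

bazel build //...
```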

@mkruskal-google (Contributor, Author)

Note: immediately after the meeting we hit another flake: https://github.com/protocolbuffers/protobuf/actions/runs/4196826466/jobs/7278362423. If you look at the timing, one of the multi-arch builds took over 2m and the other took just over 1m. So it looks like we still have an issue; it's just much rarer. Bumping the timeouts to 5m would be the quickest fix here, and I think it would fix the issue for us.

copybara-service bot pushed a commit that referenced this issue Feb 16, 2023
…nvironment variable

In certain setups, these calls to xcrun during bazel setup have been reported to
sometimes take more than 2 minutes (see
#17437). We have already bumped this
timeout multiple times; it is currently 60s for some calls and 120s for others.
Standardize the default timeout to 120s, and allow it to be overridden via
BAZEL_OSX_EXECUTE_TIMEOUT, so that individual environments can increase it even
more if needed.

PiperOrigin-RevId: 510200188
Change-Id: I664eb7979c4fd2b46ccc87d073f319c1e6041d77
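Usage of the override is then straightforward; a sketch (300 is an arbitrary value in seconds, and repository rules see the client environment, so an exported variable is enough):

```sh
# Raise the per-call timeout for the xcrun/compile invocations during
# C++ toolchain autoconfiguration.
export BAZEL_OSX_EXECUTE_TIMEOUT=300
bazel sync --configure   # re-runs configure-style repos like local_config_cc
```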
@keith (Member)

keith commented Feb 16, 2023

Definitely interesting that even after warming up clang it still timed out. But yeah, if we can get #17519 merged I think we can cherry-pick it into the LTS release.

keith pushed a commit to keith/bazel that referenced this issue Feb 16, 2023 (the same change as above)
@mkruskal-google (Contributor, Author)

Can we cherry-pick this into Bazel 5 as well?

mkruskal-google added a commit to mkruskal-google/protobuf that referenced this issue Feb 16, 2023 (the same message as the copybara commit above; PiperOrigin-RevId: 509944178)
ShreeM01 pushed a commit that referenced this issue Feb 16, 2023 (#17521, the same change as above; Co-authored-by: Googler <waltl@google.com>)
@keith (Member)

keith commented Feb 16, 2023

I don't know what the status of that older LTS is; someone from the Bazel team will have to chime in on whether there will be another release there.

@mkruskal-google (Contributor, Author)

@susinmotion what do we need to do to get this back-ported to Bazel 5? I could upgrade our macOS tests to run only on Bazel 6, but since we support both I'd prefer not to.

@susinmotion (Contributor)

Since Bazel 5 is in maintenance mode, we'd prefer not to backport if we can avoid it. What would it take to upgrade the tests?

@mkruskal-google (Contributor, Author)

I guess using exclusively Bazel 6 on macOS wouldn't be that bad, as long as we keep our Bazel 4/5 support for Linux tests. When can we expect the new release?

@susinmotion (Contributor)

I think we're aiming for early March.

@Wyverald added the potential 5.x cherry-picks (Potential cherry-picks for the next 5.x release) label Feb 17, 2023
@oquenchil added the P3 (We're not considering working on this, but happy to review a PR) and help wanted (Someone outside the Bazel team could own this) labels and removed the untriaged label Feb 20, 2023
@mkruskal-google (Contributor, Author)

Does adding the potential 5.x cherry-picks tag mean this is being considered? It would be really nice to have this back-ported, since we're going to be supporting Bazel 5 for a long time

@meteorcloudy (Member)

Sorry for missing this; I hope you have already migrated to Bazel 6?

@mkruskal-google (Contributor, Author)

Yep, we've migrated to Bazel 6 now

@meteorcloudy closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 21, 2023