Flaky behavior on MacOS #17437
Comments
@susinmotion following. |
@thomasvl was also looking at this. I do wonder in this case if it didn't fail here, would the first compile just fail instead? 60 seconds is a huge timeout for the single compile that's failing here. |
See bazelbuild/bazel#17437 for more details. PiperOrigin-RevId: 509869860
Bazel has a 2 minute timeout for their internal `xcrun` call, which can be exceeded on our github runners about 5% of the time. This leads to flakes and opaque errors, but is a one-time cost. Subsequent xcruns finish in seconds, so we can just do an initial call w/o a timeout before running Bazel. See bazelbuild/bazel#17437 for background. PiperOrigin-RevId: 509869860
Bazel has a 2 minute timeout for their internal `xcrun` call, which can be exceeded on our github runners about 5% of the time. This leads to flakes and opaque errors, but is a one-time cost. Subsequent xcruns finish in seconds, so we can just do an initial call w/o a timeout before running Bazel. With this change our total flake rate drops from ~30% to nearly 0% for our full suite of tests. See bazelbuild/bazel#17437 for background. PiperOrigin-RevId: 509869860
Bazel has a 2 minute timeout for their internal `xcrun` call, which can be exceeded on our github runners about 5% of the time. This leads to flakes and opaque errors, but is a one-time cost. Subsequent xcruns finish in seconds, so we can just do an initial call w/o a timeout before running Bazel. With this change our total flake rate drops from ~30% to nearly 0% for our full suite of tests. See bazelbuild/bazel#17437 for background. PiperOrigin-RevId: 509944178
Note: after a lot of debugging we traced this down to the xcrun calls initiated from https://github.com/bazelbuild/bazel/blob/e8a69f5d5acaeb6af760631490ecbf73e8a04eeb/tools/cpp/osx_cc_configure.bzl. The xcode locator can sometimes take over 2m and even the faster ones can rarely take over 1m. Pre-caching these in xcode by running them manually before Bazel speeds up the times enough to fix our flakes. Ideally these timeouts would all be configurable though, and maybe sped up if possible |
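The pre-warming described above can be sketched roughly as follows. This is a minimal Python sketch, not the script used in the protobuf CI: the two `xcrun` invocations are illustrative stand-ins (the thread does not list the exact commands Bazel issues), and `prewarm` is a hypothetical helper name. The only point it makes is that each call runs once with no timeout before Bazel starts.

```python
import shutil
import subprocess
import time

# Assumption: illustrative xcrun invocations, not the exact ones Bazel runs.
PREWARM_COMMANDS = (
    ("xcrun", "--show-sdk-path"),
    ("xcrun", "--find", "clang"),
)


def prewarm(commands=PREWARM_COMMANDS, run=subprocess.run, which=shutil.which):
    """Run each command once with no timeout; return (command, seconds) pairs."""
    timings = []
    for cmd in commands:
        cmd = list(cmd)
        if which(cmd[0]) is None:
            # Tool not installed (e.g. not a macOS machine): nothing to warm up.
            continue
        start = time.monotonic()
        # No timeout= argument is passed, so a slow first call is allowed to finish.
        run(cmd, check=False, capture_output=True)
        timings.append((cmd, time.monotonic() - start))
    return timings


if __name__ == "__main__":
    for cmd, seconds in prewarm():
        print(f"{' '.join(cmd)}: {seconds:.1f}s")
```

Printing the timings makes it easy to see in the CI log whether the first call really was the slow one.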
How many versions of Xcode are installed in that case? Do your pre-warm calls take the same long amount of time even when you run them first? If this is an OS cache issue you could probably just run one, throw away the result, and then run bazel normally, instead of having to try and reproduce what it's doing in your script. |
There are 8 versions of Xcode installed on our github runners, but we've already pinned 1 of them using My first attempt at a fix only ran the xcode locator, but that led to a (slightly rarer) timeout from some of the other |
This switches all macOS toolchain setup compiles and executes to use the default timeout of 600s. This should help avoid issues on GitHub actions where these time out and cause build failures. The common case shouldn't really be affected. bazelbuild#17437
This switches all macOS toolchain setup compiles and executes to use the default timeout of 600s. This should help avoid issues on GitHub actions where these time out and cause build failures. The common case shouldn't really be affected. bazelbuild/bazel#17437
Takeaway from a meeting about this: we don't know why these things are so slow the first time. We're going to make a few changes to try and help things:
Note immediately after the meeting we hit another flake: https://github.com/protocolbuffers/protobuf/actions/runs/4196826466/jobs/7278362423. If you look at the timing one of the multiarch builds took over 2m and the other took just over 1m. So it looks like we still have an issue, it's just much rarer. Bumping the timeouts to 5m would be the quickest fix here, and I think it would fix the issue for us |
…nvironment variable In certain setups, these calls to xcrun during bazel setup have been reported to sometimes take more than 2 minutes (see #17437). We have already bumped this timeout multiple times, which is currently 60 for some calls and 120 for others. Standardize the default timeout to 120, and allow the timeout to be overridden via BAZEL_OSX_EXECUTE_TIMEOUT, to allow individual environments to increase that even more if needed. PiperOrigin-RevId: 510200188 Change-Id: I664eb7979c4fd2b46ccc87d073f319c1e6041d77
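To illustrate the behavior this commit message describes, here is a minimal Python sketch of resolving the timeout: a default of 120 seconds, replaced by the value of `BAZEL_OSX_EXECUTE_TIMEOUT` when that variable is set. The real change lives in Bazel's toolchain setup code, not in Python; `resolve_timeout` is a hypothetical name, and treating an empty value as unset is an assumption.

```python
import os

# Default from the commit message: 120 (seconds) for all calls.
DEFAULT_TIMEOUT_SECONDS = 120


def resolve_timeout(environ, default=DEFAULT_TIMEOUT_SECONDS):
    """Return the timeout in seconds, honoring BAZEL_OSX_EXECUTE_TIMEOUT if set."""
    raw = environ.get("BAZEL_OSX_EXECUTE_TIMEOUT")
    if raw is None or raw.strip() == "":
        # Assumption: an unset or empty variable falls back to the default.
        return default
    return int(raw)


if __name__ == "__main__":
    print(resolve_timeout(os.environ))
```

A CI environment that still sees slow first calls could then export a larger value, such as the 5m suggested earlier in the thread, without waiting for another change to the default.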
Definitely interesting that even after warming up clang that it still timed out. But yea if we can get #17519 merged I think we can cherry pick it into the LTS release |
…nvironment variable In certain setups, these calls to xcrun during bazel setup have been reported to sometimes take more than 2 minutes (see bazelbuild#17437). We have already bumped this timeout multiple times, which is currently 60 for some calls and 120 for others. Standardize the default timeout to 120, and allow the timeout to be overridden via BAZEL_OSX_EXECUTE_TIMEOUT, to allow individual environments to increase that even more if needed. PiperOrigin-RevId: 510200188 Change-Id: I664eb7979c4fd2b46ccc87d073f319c1e6041d77
Can we cherry pick this into Bazel 5 as well? |
…nvironment variable (#17521) In certain setups, these calls to xcrun during bazel setup have been reported to sometimes take more than 2 minutes (see #17437). We have already bumped this timeout multiple times, which is currently 60 for some calls and 120 for others. Standardize the default timeout to 120, and allow the timeout to be overridden via BAZEL_OSX_EXECUTE_TIMEOUT, to allow individual environments to increase that even more if needed. PiperOrigin-RevId: 510200188 Change-Id: I664eb7979c4fd2b46ccc87d073f319c1e6041d77 Co-authored-by: Googler <waltl@google.com>
Idk what the status of that older LTS is, someone from the bazel team will have to chime in on if there will be another one there |
@susinmotion what do we need to do to get this back-ported to Bazel 5? I could upgrade our mac tests to only run over Bazel 6, but since we support both I'd prefer not to |
Since Bazel 5 is in maintenance mode, we'd prefer not to backport if we can avoid it. What would it take to upgrade the tests? |
I guess using exclusively Bazel 6 on mac wouldn't be that bad, as long as we keep our Bazel 4/5 support for linux tests. When can we expect the new release? |
I think we're aiming for early March. |
Does adding the |
Sorry for missing this, I hope you have already migrated to Bazel 6? |
Yep, we've migrated to Bazel 6 now |
Description of the bug:
We recently switched from Kokoro to github actions in the protobuf repo. This substantially reduced all our flakes, and there's really only one left that's causing issues. We only see this on Mac/Bazel builds, at a rough rate of somewhere between 1% and 5%. We use Bazel 5.1.1 in all our testing. One example of it is here: https://github.com/protocolbuffers/protobuf/actions/runs/4117212221/jobs/7108243227
The symptom is always the same, and looks like a toolchain resolution issue (see below). We have 12 macOS builds in our CI, so even a 1-5% failure rate is a pretty big issue for us.
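To make the impact concrete, a small Python calculation shows how a 1-5% per-build failure rate compounds across 12 builds. It assumes the builds fail independently of one another, which the thread does not establish; `any_failure_probability` is a name chosen for illustration.

```python
def any_failure_probability(per_build_rate, builds):
    """Chance that at least one of `builds` independent builds fails."""
    return 1.0 - (1.0 - per_build_rate) ** builds


if __name__ == "__main__":
    for rate in (0.01, 0.05):
        chance = any_failure_probability(rate, builds=12)
        print(f"per-build {rate:.0%} -> any of 12 failing {chance:.1%}")
```

Under that independence assumption, a 1% per-build rate gives roughly an 11% chance that a CI run has at least one failing macOS build, and 5% gives roughly 46%.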
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Run any bazel build on mac ~30 times and you're likely to see this failure at least once
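A repeat-run harness for this could look like the sketch below. It is only an illustration: `count_failures` is a hypothetical helper, the `//...` target pattern in the usage is a placeholder, and since the slowness is reported as a first-call cost, repeated runs on an already-warm machine may well not reproduce the failure.

```python
import subprocess


def count_failures(command, attempts=30, run=subprocess.run):
    """Run `command` `attempts` times and count non-zero exit codes."""
    failures = 0
    for _ in range(attempts):
        result = run(command, capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures


if __name__ == "__main__":
    # Placeholder target pattern; substitute a real target from your workspace.
    failed = count_failures(["bazel", "build", "//..."])
    print(f"{failed} of 30 runs failed")
```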
Which operating system are you running Bazel on?
macOS
What is the output of `bazel info release`?
development 5.1.1
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
We use the bazelisk installed on our github runners and set `USE_BAZEL_VERSION=5.1.1`
What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD`?
No response
Have you found anything relevant by searching the web?
I found #11520, which looks like a similar issue dating back over 2 years.
Any other information, logs, or outputs that you want to share?