Bazel 5.1.0 toolchain resolution fails on my ARM64 Mac #15175
Comments
You likely need to upgrade your reference to the platforms repo, which might be pulled in transitively somehow today in your project.
We also verified that this problem occurs with TensorFlow by itself, not just JAX. I'm honestly not sure how TF gets the platforms repo.
Can you test with Bazel HEAD? The platforms definitions are vendored in Bazel if nothing includes them, and they were updated recently in https://github.com/bazelbuild/bazel//commit/676a0c8dea0e7782e47a386396e386a51566087f (probably not in 5.x). Either way, the fastest fix is likely to pin the newer version for now.
Some news: Bazel built from source at HEAD on my machine with the updated platforms repo and #14995 fails with the same toolchain mismatch errors. Bazel built from source at HEAD on my machine with the updated platforms repo but with #14995 reverted builds successfully. I did not change anything else, just branched off of master and did the revert. The gist of this issue is, in my opinion, the following:
This was even mentioned in the original PR thread: #14844 (review), but that line of inquiry was abandoned. Happy to hear your thoughts.
@Wyverald Could we look into addressing this for 5.1.1, please?
@bazel-io fork 5.1.1
@katre @keith what do you recommend we should do here? Following @nicholasjng's analysis above, do we just change LocalConfigPlatformFunction to return
@nicholasjng how did you update the platforms repo? I can reproduce the issue with jax HEAD, but if I add this to the
to a local checkout of platforms at HEAD, it works as expected. Is it possible that the way you updated it didn't "stick"? For example, if you put it at the bottom of the WORKSPACE instead of at the top (potentially above the TF loading).
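Why the position of the pin in the WORKSPACE can matter is easier to see with a toy model. The sketch below is plain Python, not Bazel code, and it rests on one loud assumption: that every declaration of `platforms` is guarded the way the `maybe` helper guards declarations, so a name that is already taken is left untouched. The version labels are hypothetical.

```python
# Toy model of WORKSPACE repository declarations (not Bazel code).
# Assumption: every declaration is guarded like `maybe`, i.e. it is
# skipped when a repository with that name has already been declared.


def maybe(registry, name, version):
    """Declare `name` only if no repository with that name exists yet."""
    registry.setdefault(name, version)


def evaluate(declarations):
    """Apply declarations in file order and return the resulting registry."""
    registry = {}
    for name, version in declarations:
        maybe(registry, name, version)
    return registry


# Hypothetical version labels, for illustration only.
pin_first = evaluate([("platforms", "new"), ("platforms", "old-transitive")])
pin_last = evaluate([("platforms", "old-transitive"), ("platforms", "new")])

print(pin_first["platforms"])  # new
print(pin_last["platforms"])   # old-transitive
```

Under that assumption, a pin placed after the dependency macros have run is silently ignored, which would match an update that does not "stick".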
Since aarch64 is defined as an alias of arm64 (https://github.com/bazelbuild/platforms/blob/fbd0d188dac49fbcab3d2876a2113507e6fc68e9/cpu/BUILD#L16-L20), I would expect them to be virtually interchangeable. If that's not the case, maybe that is what we should attempt to fix instead? Otherwise you could always have this mismatch, just potentially in the opposite direction.
In an entirely empty project I am able to see that since bazel 5.x the version of platforms has contained this fix by using:
which makes me think that somehow TF, and transitively jax, are just pulling in older versions through some other transitive dependency.
With this branch of jax where I bump platforms the issue goes away: https://github.com/keith/jax/tree/platforms-update
I get a slightly different output, using Bazel
Nevertheless, your patch works for me as well. Does this, in turn, mean that if one target pins the Good job on finding that. I had no idea about the implications of different versions of
I believe the old platforms repository is being included via https://github.com/tensorflow/runtime/blob/ed92908bf93f09db579f4be41e8f4ae567bce0e1/third_party/rules_cuda/cuda/dependencies.bzl#L61 so I'll manually override it in JAX for now. Thanks!
Nice. So to confirm, is this no longer a 5.1.1 blocker?
I sent jax-ml/jax#10164 to apply the
Can confirm, it works with a normal Bazel release download from GitHub now, no patches required. Thanks for the help!
The goal of this change is to fix build failures on Mac ARM apparently caused by an outdated copy of @platforms. See bazelbuild/bazel#15175 PiperOrigin-RevId: 439858709
For future reference for readers, you can see where
And you can see the stack trace that led to including it, and the version being used. You can also query the CPU definition directly to see if you have the old one:
Or the new one:
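The exact queries are not preserved above. As a rough sketch, assuming you have captured the text printed by `bazel query --output=build` for `@platforms//cpu:aarch64` and `@platforms//cpu:arm64`, a small Python helper can report whether each label is a real constraint value or an alias. The helper name `rule_kind` and both sample strings are hypothetical and hand-written for illustration; they are not captured Bazel output.

```python
import re


def rule_kind(build_output):
    """Return the rule kind (e.g. 'alias') found in --output=build text, or None."""
    match = re.search(r"^\s*([A-Za-z_]\w*)\(", build_output, re.MULTILINE)
    return match.group(1) if match else None


# Hand-written samples, assumed to resemble the two situations described above.
SEPARATE_VALUE = '''# /external/platforms/cpu/BUILD
constraint_value(
  name = "aarch64",
  constraint_setting = "@platforms//cpu:cpu",
)'''

ALIASED_VALUE = '''alias(
  name = "aarch64",
  actual = "@platforms//cpu:arm64",
)'''

print(rule_kind(SEPARATE_VALUE))  # constraint_value
print(rule_kind(ALIASED_VALUE))   # alias
```

If both labels come back as `constraint_value`, the two names are distinct values and cannot match each other. If one of them is an `alias` of the other, they should resolve to the same value.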
In Bazel 5.1.0 and newer, there is a change [1] which prevents copybara from compiling on Apple Silicon by default. The solution recommended in [2] is to pin the `platforms` repository to a newer version in which the constraint value for the C++ toolchain resolves correctly. Without this change, we would need to use Bazel 5.0.0 or older to compile copybara successfully on Apple Silicon. [1]: bazelbuild/bazel#14844 [2]: bazelbuild/bazel#15175
Fixes #207 Change-Id: I8f71518f3c569de794fd60acb899d835323fccc9
Description of the problem / feature request:
Crosspost from this JAX issue on building `jaxlib` from source.
I have been using Bazel 5.0.0 to (successfully) build `jaxlib` from source on my machine (Apple M1 Pro, macOS 12.3.1) in the past. My last successful build was about two weeks ago.
When pulling in JAX main today at HEAD and trying to build from source, the build script downloaded Bazel version `5.1.0` for macOS ARM64 from GitHub (this is important), started the build, and failed.
Setting the toolchain debug flag `--toolchain_resolution_debug=@bazel_tools//tools/cpp:toolchain_type` reveals that the problem is a toolchain resolution failure (see the error below).
The reason that Bazel `5.1.0` became necessary to use now is that there were some changes in Tensorflow's `BUILD` definitions, and thus, as JAX depends on Tensorflow for XLA, also of JAX's build process.
I already talked to @hawkinsp, a JAX core developer who typically answers build-related questions on the project. He suspects that it is a Bazel issue, based on the following observations:
In the `local_config_cc_toolchains` targets change (#14995), the `aarch64` macOS CPU constraint value was renamed to `arm64`.
Bazel nevertheless still records the host CPU as `aarch64`, as evidenced by the autogenerated platform config file. The logic for this is apparently found here (quoting directly from the discussion thread linked above): bazel/src/main/java/com/google/devtools/build/lib/bazel/repository/LocalConfigPlatformFunction.java, lines 116 to 117 in af56aec.
So, at first glance, it looks like Bazel identifies and saves the platform CPU name as `aarch64`, and later fails to match any macOS toolchains to it, because all of them have been renamed as a result of the above pull request. For more information, see the build log attached at the end of this message.
I would be happy about some feedback and/or guidance on resolving this issue. Please let me know if there is more information I can provide that could be helpful in the resolution of this problem.
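The suspected mismatch can be illustrated with a toy model. This is a minimal sketch in plain Python and not Bazel's actual resolution algorithm: it assumes a toolchain is compatible when its CPU constraint, after following aliases, equals the host CPU. The toolchain names and the alias table are hypothetical.

```python
def resolve(label, aliases):
    """Follow alias links until a label that is not an alias is reached."""
    seen = set()
    while label in aliases and label not in seen:
        seen.add(label)
        label = aliases[label]
    return label


def select_toolchain(host_cpu, toolchains, aliases):
    """Return the first toolchain whose CPU constraint matches the host CPU."""
    host = resolve(host_cpu, aliases)
    for name, cpu in toolchains:
        if resolve(cpu, aliases) == host:
            return name
    return None


HOST_CPU = "@platforms//cpu:aarch64"  # what the host platform reports
TOOLCHAINS = [  # hypothetical toolchain names
    ("cc-darwin-x86_64", "@platforms//cpu:x86_64"),
    ("cc-darwin-arm64", "@platforms//cpu:arm64"),
]

# Case 1: aarch64 and arm64 are unrelated constraint values.
no_alias = select_toolchain(HOST_CPU, TOOLCHAINS, aliases={})
# Case 2: aarch64 is an alias of arm64.
with_alias = select_toolchain(
    HOST_CPU, TOOLCHAINS, aliases={"@platforms//cpu:aarch64": "@platforms//cpu:arm64"}
)

print(no_alias)    # None
print(with_alias)  # cc-darwin-arm64
```

In the first case no toolchain is selected, which corresponds to the resolution failure reported here. In the second case the two names resolve to the same value and the arm64 toolchain is picked.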
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Prerequisites:
Steps to reproduce:
5. `git clone https://github.com/google/jax`, cd into the resulting `jax` directory.
6. Create a virtual environment (`python3 -m venv venv --system-site-packages --upgrade-deps` is what I like to use, this works on Python >=3.9, but YMMV).
7. `source venv/bin/activate` followed by `python -m pip install -e .` to install the package itself (and dependencies) in developer mode.
8. Run `python build/build.py`. This prompts a Python script downloading Bazel `5.1.0` directly from the Bazel GitHub release for the macOS arm64 architecture, and invokes it for a Python wheel build of `jaxlib`, a companion package of JAX.
What operating system are you running Bazel on?
macOS 12.3.1, Apple M1 Pro Macbook Pro 14".
What's the output of `bazel info release`?
If `bazel info release` returns "development version" or "(@non-git)", tell us how you built Bazel.
(Not applicable, as the Bazel binary was downloaded directly off this GitHub repo's `5.1.0` release.)
What's the output of `git remote get-url origin ; git rev-parse master ; git rev-parse HEAD`?
Have you found anything relevant by searching the web?
No.
Any other information, logs, or outputs that you want to share?