Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update JAX and fix aarch64-darwin build #219778

Merged
merged 4 commits into from
Apr 19, 2023
Merged

Conversation

samuela
Copy link
Member

@samuela samuela commented Mar 6, 2023

Description of changes

This PR does a few things:

  • Update jaxlib(-bin): 0.3.22 -> 0.4.4
  • Update jax: 0.4.1 -> 0.4.5
  • Fix the rev that JAX was building on which was messed up by 0d6a071 which was merged without approval.
  • Unbreak JAX on aarch64-darwin. The jaxlib source build still doesn't work on this platform but that doesn't mean we can use the jaxlib-bin package in its place, esp. considering it's only a checkPhase dependency of jax.
Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.05 Release Notes (or backporting 22.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@samuela
Copy link
Member Author

samuela commented Mar 6, 2023

There are a few TODOs before this can be merged:

@samuela
Copy link
Member Author

samuela commented Mar 6, 2023

I should add that although we may drop support for from-source jaxlib builds for platforms, they don't necessarily need to be dropped from jaxlib-bin since that's easier to maintain. So support would not be dropped entirely.

@samuela samuela marked this pull request as ready for review March 9, 2023 18:31
@samuela
Copy link
Member Author

samuela commented Mar 9, 2023

This PR should be ready to go now. The JAX team was kind enough to hook us up with a new snappy release in order to fix the build (s/o @hawkinsp! 🚀)

I've tested that jaxlib builds with and without CUDA support. I'm running a nixpkgs-review now...

@samuela
Copy link
Member Author

samuela commented Mar 9, 2023

cc @NixOS/cuda-maintainers @uri-canva @ndl

@uri-canva
Copy link
Contributor

Unfortunately I now do not have access to x86_64-darwin anymore either.

@uri-canva
Copy link
Contributor

You can run the build on ofborg with fake shas to get the hash of the fetch derivations.

@samuela
Copy link
Member Author

samuela commented Mar 9, 2023

If we don't have anyone with access to a machine to test it out conveniently I vote that we drop support from the source build and x86_64-darwin users can use the jaxlib-bin package instead. Otherwise the maintenance load is just too high IMHO.

@SomeoneSerge
Copy link
Contributor

Feels reasonable. For the sake of documentation, I suggest that if one later wants to reintroduce x8664Darwin support they also set up a GitHub Actions workflow that runs over a matrix of platforms and tries to update the derivation (version and hashes) via something like nix-update. And I don't expect nix-update to just work here. The whole business of conditional hashes is too bearing

@uri-canva
Copy link
Contributor

We could also fix it so the fetch hashes are not system dependent, though I'm not sure how hard it would be. I'll take a look. Are there any other reasons to keep jax both built from source and as a binary?

@samuela
Copy link
Member Author

samuela commented Mar 11, 2023

We could also fix it so the fetch hashes are not system dependent, though I'm not sure how hard it would be. I'll take a look. Are there any other reasons to keep jax both built from source and as a binary?

jaxlib-bin comes in handy since the builds are so simple and fast. So eg that makes it a lot easier to support more architectures and lets users more easily override to get different versions of jaxlib if necessary. The full jaxlib source builds are just so looooooong

@samuela
Copy link
Member Author

samuela commented Mar 11, 2023

snappy is fairly popular, so this actually results in a few hundred rebuilds. I was hoping to run a nixpkgs-review and be done with it, but I actually ran out of disk space on my 200Gb drive. Would someone else be able to nixpkgs-review it? If not, I could create a snappy-only PR targeting staging.

@uri-canva
Copy link
Contributor

I hit another error in snappy on aarch64-darwin:

/tmp/nix-build-snappy-1.1.10.drv-0/source/snappy.cc:1293:37: error: comparison of integers of different signs: 'unsigned long' and 'ptrdiff_t' (aka 'long') [-Werror,-Wsign-compare]
             (op + deferred_length) < op_limit_min_slop);
              ~~~~~~~~~~~~~~~~~~~~  ^ ~~~~~~~~~~~~~~~~~
/tmp/nix-build-snappy-1.1.10.drv-0/source/snappy.cc:1293:37: error: comparison of integers of different signs: 'unsigned long' and 'ptrdiff_t' (aka 'long') [-Werror,-Wsign-compare]
             (op + deferred_length) < op_limit_min_slop);
              ~~~~~~~~~~~~~~~~~~~~  ^ ~~~~~~~~~~~~~~~~~
/tmp/nix-build-snappy-1.1.10.drv-0/source/snappy.cc:1405:15: note: in instantiation of function template specialization 'snappy::DecompressBranchless<char *>' requested here
              DecompressBranchless(reinterpret_cast<const uint8_t*>(ip),
              ^
/tmp/nix-build-snappy-1.1.10.drv-0/source/snappy.cc:1600:17: note: in instantiation of function template specialization 'snappy::SnappyDecompressor::DecompressAllTags<snappy::SnappyIOVecWriter>' requested here
  decompressor->DecompressAllTags(writer);
                ^
/tmp/nix-build-snappy-1.1.10.drv-0/source/snappy.cc:1587:10: note: in instantiation of function template specialization 'snappy::InternalUncompressAllTags<snappy::SnappyIOVecWriter>' requested here
  return InternalUncompressAllTags(&decompressor, writer, r->Available(),
         ^
/tmp/nix-build-snappy-1.1.10.drv-0/source/snappy.cc:1936:10: note: in instantiation of function template specialization 'snappy::InternalUncompress<snappy::SnappyIOVecWriter>' requested here
  return InternalUncompress(compressed, &output);
         ^
/tmp/nix-build-snappy-1.1.10.drv-0/source/snappy.cc:1293:37: error: comparison of integers of different signs: 'unsigned long' and 'ptrdiff_t' (aka 'long') [-Werror,-Wsign-compare]
             (op + deferred_length) < op_limit_min_slop);
              ~~~~~~~~~~~~~~~~~~~~  ^ ~~~~~~~~~~~~~~~~~
/tmp/nix-build-snappy-1.1.10.drv-0/source/snappy.cc:1405:15: note: in instantiation of function template specialization 'snappy::DecompressBranchless<unsigned long>' requested here
              DecompressBranchless(reinterpret_cast<const uint8_t*>(ip),
              ^
/tmp/nix-build-snappy-1.1.10.drv-0/source/snappy.cc:1600:17: note: in instantiation of function template specialization 'snappy::SnappyDecompressor::DecompressAllTags<snappy::SnappyDecompressionValidator>' requested here
  decompressor->DecompressAllTags(writer);
                ^
/tmp/nix-build-snappy-1.1.10.drv-0/source/snappy.cc:1587:10: note: in instantiation of function template specialization 'snappy::InternalUncompressAllTags<snappy::SnappyDecompressionValidator>' requested here
  return InternalUncompressAllTags(&decompressor, writer, r->Available(),
         ^
/tmp/nix-build-snappy-1.1.10.drv-0/source/snappy.cc:2101:10: note: in instantiation of function template specialization 'snappy::InternalUncompress<snappy::SnappyDecompressionValidator>' requested here
  return InternalUncompress(&reader, &writer);
         ^

@uri-canva
Copy link
Contributor

Yeah we should probably do the snappy upgrade separately first.

@uri-canva
Copy link
Contributor

I can try to nixpkgs-review the snappy PR once you put it up.

@uri-canva
Copy link
Contributor

jaxlib-bin comes in handy since the builds are so simple and fast. So eg that makes it a lot easier to support more architectures and lets users more easily override to get different versions of jaxlib if necessary. The full jaxlib source builds are just so looooooong

Are there things the source build supports that the binary doesn't? I'm confused why we need both, I thought we had both as a workaround for some issue that prevented us from building from source, but it doesn't sound like it's a transitional thing. Do we still need both?

@uri-canva
Copy link
Contributor

I worked around the snappy break with:

diff --git a/pkgs/development/libraries/snappy/default.nix b/pkgs/development/libraries/snappy/default.nix
index 24c3e7bb7df..d33e5755084 100644
--- a/pkgs/development/libraries/snappy/default.nix
+++ b/pkgs/development/libraries/snappy/default.nix
@@ -29,6 +29,8 @@ stdenv.mkDerivation rec {

   nativeBuildInputs = [ cmake ];

+  env.NIX_CFLAGS_COMPILE = "-Wno-sign-compare";
+
   cmakeFlags = [
     "-DBUILD_SHARED_LIBS=${if static then "OFF" else "ON"}"
     "-DSNAPPY_BUILD_TESTS=OFF"

but I'm having trouble getting the diff of the fetch derivation, I'll have to come back to it another time.

@samuela samuela mentioned this pull request Mar 14, 2023
12 tasks
@samuela
Copy link
Member Author

samuela commented Mar 14, 2023

Spun off the snappy change into #221215. @uri-canva could you ptal?

We should be able to rebase and merge this once that chance lands on master.

@samuela
Copy link
Member Author

samuela commented Mar 14, 2023

but I'm having trouble getting the diff of the fetch derivation, I'll have to come back to it another time.

wdym? it built fine for me on aarch64-darwin

Are there things the source build supports that the binary doesn't? I'm confused why we need both, I thought we had both as a workaround for some issue that prevented us from building from source, but it doesn't sound like it's a transitional thing. Do we still need both?

Are you proposing removing the source build or removing the binary build? i think the main selling point for the source build is just that nixpkgs generally prefers source builds 🤷

@uri-canva
Copy link
Contributor

wdym? it built fine for me on aarch64-darwin

Yeah but I want to build it on a different system too so I can diff them. Usually if different systems need different shas it means the deps have system specific files in them, which we can delete and rebuild during the build. Ideally the output of the fetch only has system independent files that were downloaded.

Are you proposing removing the source build or removing the binary build? i think the main selling point for the source build is just that nixpkgs generally prefers source builds 🤷

I'm proposing keeping only one of the two, so we don't duplicate the maintenance effort. Ideally it would be the source one, that's why I'm trying to figure out what is causing the diff in the deps shas, so we can develop the derivation for all systems more easily.

@samuela
Copy link
Member Author

samuela commented Mar 14, 2023

Yeah but I want to build it on a different system too so I can diff them. Usually if different systems need different shas it means the deps have system specific files in them, which we can delete and rebuild during the build. Ideally the output of the fetch only has system independent files that were downloaded.

Ohhhh this is in regards to fetchAttrs? Yeah I don't really know anything about that. It'd be really great if we could get all platforms on the same fetchAttrs hash!

I'm proposing keeping only one of the two, so we don't duplicate the maintenance effort. Ideally it would be the source one, that's why I'm trying to figure out what is causing the diff in the deps shas, so we can develop the derivation for all systems more easily.

There are still benefits to having the binary build around. It's a lot easier to hack on, and users (like myself) occasionally need to override things in order to upgrade/downgrade versions, play around with settings, etc. Luckily the maintenance burden of the binary build is much lower.

@uri-canva
Copy link
Contributor

There are still benefits to having the binary build around. It's a lot easier to hack on, and users (like myself) occasionally need to override things in order to upgrade/downgrade versions, play around with settings, etc. Luckily the maintenance burden of the binary build is much lower.

Right then maybe we should keep only the binary build?

@samuela
Copy link
Member Author

samuela commented Mar 14, 2023

Right then maybe we should keep only the binary build?

I would be sympathetic to such a change, but IIUC there are a number of people who really like to have the source build around

@uri-canva
Copy link
Contributor

Ok 😢 .

@samuela
Copy link
Member Author

samuela commented Apr 7, 2023

Note to self: We should prob merge #221390 before this

@GaetanLepage
Copy link
Contributor

Thank you for this updating effort !
I would suggest that we directly go to jaxlib v0.4.7 and jax v0.4.8.

I think that we might need to change the prefetch.sh script as is:

version="$1"
nix hash to-sri --type sha256 "$(nix-prefetch-url https://storage.googleapis.com/jax-releases/nocuda/jaxlib-${version}-cp310-cp310-manylinux2014_x86_64.whl)"
nix hash to-sri --type sha256 "$(nix-prefetch-url https://storage.googleapis.com/jax-releases/nocuda/jaxlib-${version}-cp311-cp311-manylinux2014_x86_64.whl)"
nix hash to-sri --type sha256 "$(nix-prefetch-url https://storage.googleapis.com/jax-releases/mac/jaxlib-${version}-cp310-cp310-macosx_10_14_x86_64.whl)"
nix hash to-sri --type sha256 "$(nix-prefetch-url https://storage.googleapis.com/jax-releases/mac/jaxlib-${version}-cp311-cp311-macosx_10_14_x86_64.whl)"
nix hash to-sri --type sha256 "$(nix-prefetch-url https://storage.googleapis.com/jax-releases/mac/jaxlib-${version}-cp310-cp310-macosx_11_0_arm64.whl)"
nix hash to-sri --type sha256 "$(nix-prefetch-url https://storage.googleapis.com/jax-releases/mac/jaxlib-${version}-cp311-cp311-macosx_11_0_arm64.whl)"
nix hash to-sri --type sha256 "$(nix-prefetch-url https://storage.googleapis.com/jax-releases/cuda12/jaxlib-${version}+cuda12.cudnn88-cp310-cp310-manylinux2014_x86_64.whl)"
nix hash to-sri --type sha256 "$(nix-prefetch-url https://storage.googleapis.com/jax-releases/cuda12/jaxlib-${version}+cuda12.cudnn88-cp311-cp311-manylinux2014_x86_64.whl)"

@@ -50,21 +50,21 @@ let
cpuSrcs = {
"x86_64-linux" = fetchurl {
url = "https://storage.googleapis.com/jax-releases/nocuda/jaxlib-${version}-cp310-cp310-manylinux2014_x86_64.whl";
Copy link
Contributor

@SomeoneSerge SomeoneSerge Apr 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cp310-cp310

Should we also account for pythonVersion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mm not a bad idea....

@samuela
Copy link
Member Author

samuela commented Apr 18, 2023

Every time I try to run nixpkgs-review it hangs, apparently on the cvxpy test suite. Not sure why as the cvxpy build does not seem to hang either on this branch or on master.

@SomeoneSerge
Copy link
Contributor

Every time I try to run nixpkgs-review it hangs, apparently on the cvxpy test suite. Not sure why as the cvxpy build does not seem to hang either on this branch or on master.

I've been continuously suffering from cvxpy tests when running parallel builds, and I haven't quite figured out why. Upstream github actions reveal that their tests take at most two minutes

@GaetanLepage
Copy link
Contributor

Every time I try to run nixpkgs-review it hangs, apparently on the cvxpy test suite. Not sure why as the cvxpy build does not seem to hang either on this branch or on master.

Could you try to update this branch ? I can give it a shot on my machine if you want.

@SomeoneSerge
Copy link
Contributor

@samuela if you get bored, I think it's OK to

❯ sudo kill -9 $(pgrep -f cvxpy)

The failures we're looking for are likely unrelated to cvxpy...

@samuela
Copy link
Member Author

samuela commented Apr 18, 2023

I've been continuously suffering from cvxpy tests when running parallel builds, and I haven't quite figured out why. Upstream github actions reveal that their tests take at most two minutes

Glad to hear I'm not alone here!

If you get bored, I think it's OK to [...]

Brilliant! I'll see how far I can get with that. I really wish nixpkgs-review had an option to skip packages...

@GaetanLepage
Copy link
Contributor

@samuela When trying to build jaxlib 0.4.7 from source I faced a build failure at the preBuild phase.
More precisely, linking the protobuf lib in ../output/external/org_tensorflow was failing because this folder doesn't exist.
I removed this patch and it built succesfully.

It seems to come from this upstream commit.

@samuela
Copy link
Member Author

samuela commented Apr 18, 2023

Result of nixpkgs-review pr 219778 run on x86_64-linux 1

43 packages failed to build:
  • python310Packages.arviz
  • python310Packages.arviz.dist
  • python310Packages.bambi
  • python310Packages.bambi.dist
  • python310Packages.blackjax
  • python310Packages.blackjax.dist
  • python310Packages.dalle-mini
  • python310Packages.dalle-mini.dist
  • python310Packages.distrax
  • python310Packages.distrax.dist
  • python310Packages.dm-sonnet
  • python310Packages.dm-sonnet.dist
  • python310Packages.elegy
  • python310Packages.elegy.dist
  • python310Packages.flax
  • python310Packages.flax.dist
  • python310Packages.jaxopt
  • python310Packages.jaxopt.dist
  • python310Packages.numpyro
  • python310Packages.numpyro.dist
  • python310Packages.pymc
  • python310Packages.pymc.dist
  • python310Packages.pytensor
  • python310Packages.pytensor.dist
  • python310Packages.rlax
  • python310Packages.rlax.dist
  • python310Packages.tensorflow-datasets
  • python310Packages.tensorflow-datasets.dist
  • python310Packages.treex
  • python310Packages.treex.dist
  • python310Packages.vqgan-jax
  • python310Packages.vqgan-jax.dist
  • python311Packages.blackjax
  • python311Packages.blackjax.dist
  • python311Packages.dm-haiku
  • python311Packages.dm-haiku.dist
  • python311Packages.dm-haiku.testsout
  • python311Packages.jaxopt
  • python311Packages.jaxopt.dist
  • python311Packages.jmp
  • python311Packages.jmp.dist
  • python311Packages.numpyro
  • python311Packages.numpyro.dist
45 packages built:
  • python310Packages.aeppl
  • python310Packages.aeppl.dist
  • python310Packages.aesara
  • python310Packages.aesara.dist
  • python310Packages.augmax
  • python310Packages.augmax.dist
  • python310Packages.chex
  • python310Packages.chex.dist
  • python310Packages.dm-haiku
  • python310Packages.dm-haiku.dist
  • python310Packages.dm-haiku.testsout
  • python310Packages.jax
  • python310Packages.jax.dist
  • python310Packages.jaxlib (python310Packages.jaxlib-build ,python310Packages.jaxlibWithoutCuda)
  • python310Packages.jaxlib-bin
  • python310Packages.jaxlib-bin.dist
  • python310Packages.jaxlib.dist (python310Packages.jaxlib-build.dist ,python310Packages.jaxlibWithoutCuda.dist)
  • python310Packages.jaxlibWithCuda
  • python310Packages.jaxlibWithCuda.dist
  • python310Packages.jmp
  • python310Packages.jmp.dist
  • python310Packages.objax
  • python310Packages.objax.dist
  • python310Packages.optax
  • python310Packages.optax.dist
  • python310Packages.optax.testsout
  • python310Packages.tensorflow-bin
  • python310Packages.tensorflow-bin.dist
  • python310Packages.treeo
  • python310Packages.treeo.dist
  • python311Packages.augmax
  • python311Packages.augmax.dist
  • python311Packages.chex
  • python311Packages.chex.dist
  • python311Packages.jax
  • python311Packages.jax.dist
  • python311Packages.jaxlib (python311Packages.jaxlib-build ,python311Packages.jaxlibWithoutCuda)
  • python311Packages.jaxlib.dist (python311Packages.jaxlib-build.dist ,python311Packages.jaxlibWithoutCuda.dist)
  • python311Packages.jaxlibWithCuda
  • python311Packages.jaxlibWithCuda.dist
  • python311Packages.optax
  • python311Packages.optax.dist
  • python311Packages.optax.testsout
  • python311Packages.treeo
  • python311Packages.treeo.dist

@samuela
Copy link
Member Author

samuela commented Apr 18, 2023

Failed derivations

@samuela
Copy link
Member Author

samuela commented Apr 18, 2023

Doesn't look like any of these failures are JAX version related AFAICT. I propose we go ahead and merge this and then address the python version-specific downloads and 0.4.7/8 upgrade in separate PR(s).

@SomeoneSerge
Copy link
Contributor

We're still at 80| src = if !cudaSupport then cpuSrcs."${stdenv.hostPlatform.system}" else gpuSrc;?

@samuela
Copy link
Member Author

samuela commented Apr 19, 2023

We're still at 80| src = if !cudaSupport then cpuSrcs."${stdenv.hostPlatform.system}" else gpuSrc;?

You mean we're not yet branching on the python version? Or something else?

@SomeoneSerge
Copy link
Contributor

You mean we're not yet branching on the python version? Or something else?

No, the ofborg thing?

@samuela
Copy link
Member Author

samuela commented Apr 19, 2023

Ohhh, I totally forgot about that! Should be addressed in 3fa9f1f.

@samuela
Copy link
Member Author

samuela commented Apr 19, 2023

Looks like all checks are green now! proceeding to merge

@samuela samuela merged commit 8faef6d into NixOS:master Apr 19, 2023
@samuela samuela deleted the samuela/jax branch April 19, 2023 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants