
Update cudnn from v8 to v9 across CUDA versions and x86/arm #1847

Merged
merged 9 commits into pytorch:main on Jun 4, 2024

Conversation

nWEIdia (Collaborator) commented May 30, 2024

Re-land #1822

Supporting pytorch/pytorch#123475

Reference PR: #1271

cc @eqy @tinglvv @ptrblck @atalman @malfet

tinglvv (Collaborator) commented May 31, 2024

Thanks for preparing this! Let me test locally to see whether the upgrade breaks anything on ARM.

tinglvv (Collaborator) commented May 31, 2024

Suggested one change.
I built the wheel with cuDNN v9 and ran into this error when running the tests:

Unable to load any of {libcudnn_engines_precompiled.so.9.1.0, libcudnn_engines_precompiled.so.9.1, libcudnn_engines_precompiled.so.9, libcudnn_engines_precompiled.so}
 File "/test-arm/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDNN_BACKEND_TENSOR_DESCRIPTOR cudnnFinalize failed cudnn_status: CUDNN_STATUS_NOT_INITIALIZED

@nWEIdia We will need to resolve this before merging the change.
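
(For context: cuDNN v9 splits the old monolithic libcudnn into a small dispatcher, libcudnn.so.9/libcudnn_graph.so.9, which dlopens engine sublibraries such as libcudnn_engines_precompiled.so.9 at runtime, so this error means the dynamic loader cannot locate those sublibraries. Below is a minimal sketch to check whether they resolve from Python; treat the exact sublibrary list as an assumption based on the error message and the cuDNN 9 documentation.)

import ctypes

# Engine sublibraries that cuDNN v9 loads lazily at runtime (assumed list);
# if the loader cannot find them via RPATH/RUNPATH or LD_LIBRARY_PATH,
# cudnnFinalize fails with CUDNN_STATUS_NOT_INITIALIZED as seen above.
for name in [
    "libcudnn_engines_precompiled.so.9",
    "libcudnn_engines_runtime_compiled.so.9",
    "libcudnn_heuristic.so.9",
]:
    try:
        ctypes.CDLL(name)
        print(f"OK   {name}")
    except OSError as err:
        print(f"FAIL {name}: {err}")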

nWEIdia (Collaborator, Author) commented Jun 1, 2024

> Built the wheel with cuDNN v9 and ran into this error when running the tests: "Unable to load any of {libcudnn_engines_precompiled.so.9.1.0, …}". We will need to resolve this before merging the change.

Thanks @tinglvv! Could you please briefly describe the reproducer steps? As far as I know, we have not yet succeeded in building a v9-based wheel; did you manually build a v9-based ARM CUDA wheel?

@@ -103,7 +103,7 @@ def update_wheel(wheel_path) -> None:
     os.system(f"unzip {wheel_path} -d {folder}/tmp")
     libs_to_copy = [
         "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
-        "/usr/local/cuda/lib64/libcudnn.so.8",
+        "/usr/local/cuda/lib64/libcudnn.so.9",
Contributor commented on this diff:

Out of curiosity, do we plan to carry both the legacy API and the new API here so that we can migrate PyTorch over to the new graph API? The size impact would roughly double, and as things currently stand the overall final artifact size is already quite large.

ref https://docs.nvidia.com/deeplearning/cudnn/latest/api/overview.html

nWEIdia (Collaborator, Author) replied:

Good point, and I think this question applies to x86 as well.
This PR was created without considering the size impact; it ships both the legacy and the new API .so files, for both x86 and ARM.

cc @ptrblck @eqy @malfet @atalman @tinglvv for additional inputs
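
(To put a rough number on the size concern, here is a quick sketch that sums the on-disk size of the bundled libraries; paths are taken from the diff above, and the list should be extended to whatever update_wheel actually copies in your builder image.)

import os

# Mirror the libs_to_copy list from update_wheel(); run inside the
# manylinux builder image to estimate how much the bundled libraries add.
libs_to_copy = [
    "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
    "/usr/local/cuda/lib64/libcudnn.so.9",
]
total = sum(os.path.getsize(p) for p in libs_to_copy if os.path.exists(p))
print(f"bundled size: {total / 2**20:.1f} MiB")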

tinglvv (Collaborator) commented Jun 3, 2024

> Could you please briefly describe the reproducer steps? … did you manually build a v9-based ARM CUDA wheel?

Yes, the wheel builds OK; I think we just need to edit the CMake rules that @eqy mentioned in Slack so that the build recognizes cuDNN v9.

The required CMake changes seem to come from pytorch/pytorch#123475, which is failing its cuda-aarch64 job (because this change, #1847, is missing). Given this inter-dependency, it might be okay to ignore the cuda-aarch64 failures and merge the pytorch/pytorch change first, then merge this change to fix the cuda-aarch64 failure.

Reproducer steps:

1. Build the image: GPU_ARCH_TYPE=cuda-aarch64 GPU_ARCH_VERSION=12.4 manywheel/build_docker.sh
2. Run the image: docker run --gpus all -it pytorch/manylinuxaarch64-builder:<tag generated in step 1>
3. Clone the pytorch repo inside the container: cd / && git clone https://github.com/pytorch/pytorch.git
4. Build the wheel: cd /builder/aarch64_linux && DESIRED_PYTHON=3.10 DESIRED_CUDA=12.4 ./aarch64_ci_build.sh
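
(With the wheel from step 4 installed, a minimal script along these lines reproduces the failure, assuming a CUDA GPU is visible in the container; any op that initializes cuDNN should do.)

import torch
import torch.nn.functional as F

# A single conv2d is enough to force cuDNN initialization; with the engine
# sublibraries unresolvable, this is where CUDNN_STATUS_NOT_INITIALIZED
# surfaced in the traceback above.
x = torch.randn(1, 3, 32, 32, device="cuda")
w = torch.randn(8, 3, 3, 3, device="cuda")
print(F.conv2d(x, w).shape)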

@atalman atalman merged commit 5783bcc into pytorch:main Jun 4, 2024
26 checks passed
atalman added a commit that referenced this pull request Jun 5, 2024
atalman added a commit that referenced this pull request Jun 17, 2024
* Remove triton constraint for py312 (#1846)

* Cache OpenBLAS to docker image for SBSA builds (#1842)

* apply openblas cache for cpu-aarch64

* reapply for cuda-aarch64

* [MacOS] Don't build wheel while building libtorch

Not sure why this was ever done twice

* Allow validate docker images to be called from different workflow (#1850)

* Allow validate docker images to be called from different workflow

* Revert "[MacOS] Don't build wheel while building libtorch"

This reverts commit d88495a.

* [MacOS] Don't build libtorch twice (take 2)

By not invoking `tools/build_libtorch.py`, as it's not done on Linux

* [MacOs][LibTorch] Copy libomp.dylib into libtorch package

* Update cudnn from v8 to v9 across CUDA versions and x86/arm (#1847)

* Update cudnn to v9.1.0.70 for cuda11.8, cuda12.1, and cuda12.4

* Add CUDNN_VERSION variable

* Remove 2 spaces for install_cu124

* trivial fix

* Fix DEPS_LIST and DEPS_SONAME for x86
Update cudnn to v9 for arm cuda binary as well

* libcudnn_adv_infer/libcudnn_adv_train becomes libcudnn_adv

* Change DEPS due to cudnn v9 libraries name changes (and additions)

* Fix lint

* Add missing changes to cu121/cu124

* Change OpenSSL URL (#1854)

* Change OpenSSL URL

* Change to use openssl URL (but no longer ftp!)

* Update build-manywheel-images.yml - Add a note about manylinux_2_28 state

* Revert "Update cudnn from v8 to v9 across CUDA versions and x86/arm" (#1855)

This reverts commit 5783bcc.

* Don't run torch.compile on runtime images in docker validations (#1858)

* Don't run torch.compile on runtime images

* test

* Don't run torch.compile on runtime images in docker validations

* Update cudnn from v8 to v9 across CUDA versions and x86/arm (#1857)

* Update cudnn to v9.1.0.70 for cuda11.8, cuda12.1, and cuda12.4

* Add CUDNN_VERSION variable

* Remove 2 spaces for install_cu124

* trivial fix

* Fix DEPS_LIST and DEPS_SONAME for x86
Update cudnn to v9 for arm cuda binary as well

* libcudnn_adv_infer/libcudnn_adv_train becomes libcudnn_adv

* Change DEPS due to cudnn v9 libraries name changes (and additions)

* Fix lint

* Add missing changes to cu121/cu124

* Fix aarch64 cuda typos

* Update validate-docker-images.yml - disable runtime error check for now

* Update validate-docker-images.yml - use validation_runner rather than a hardcoded one

* Update validate-docker-images.yml - fix MATRIX_GPU_ARCH_TYPE setting for cpu only workflows

* [aarch64 cuda cudnn] Add RUNPATH to libcudnn_graph.so.9 (#1859)

* Add executorch to pypi prep, promotion and validation scripts (#1860)

* Add AOTriton install step for ROCm manylinux images (#1862)

* Add AOTriton install step for ROCm

* No common_utils.sh needed

* temporary disable runtime error check

* Add python 3.13 builder (#1845)

---------

Co-authored-by: Ting Lu <92425201+tinglvv@users.noreply.github.com>
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Co-authored-by: Wei Wang <143543872+nWEIdia@users.noreply.github.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
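
(The "Add RUNPATH to libcudnn_graph.so.9" commit above, #1859, is what resolves the engine-sublibrary load failure on aarch64: once libcudnn_graph.so.9 carries a RUNPATH of $ORIGIN, the engine libraries it dlopens are found next to it inside the wheel. Below is a hypothetical sketch of that post-processing step in the style of update_wheel(); it assumes patchelf is available in the builder image.)

import os

# Hypothetical path inside the unpacked wheel; give the cuDNN dispatcher a
# RUNPATH of $ORIGIN so its dlopened engine sublibraries resolve in place.
lib = "tmp/torch/lib/libcudnn_graph.so.9"
os.system(f"patchelf --set-rpath '$ORIGIN' {lib}")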
PaliC pushed a commit that referenced this pull request Jun 18, 2024