
[Refiled] IFU-main-2023-07-31 #36

Merged
merged 99 commits into from
Aug 7, 2023
Conversation

@jithunnair-amd jithunnair-amd commented Aug 7, 2023

Re-attempt for #35

Mistakenly hit "Squash and merge" when it should have been a regular "Merge" to maintain individual commits.

This IFU PR brings in the following notable changes for ROCm:

- Using `-complete` base images for CentOS and Ubuntu
- Removing the `install_rocm.sh` step from Dockerfiles, since it's not needed when using the `-complete` images
- A fix for erroneous logic that caused msccl-algorithm files not to be included for ROCm 5.6 and above
Tested successfully for wheels via http://rocmhead.amd.com:8080/job/pytorch/job/dev/job/manylinux_rocm_wheels/243 using ROCm5.7 RC1 (build 7) and PyTorch 2.0

ptrblck and others added 30 commits March 28, 2023 14:12
* add 12.1 workflow for docker image build

* add github workflow

* update cuDNN to 8.8.1 and location for archive
* Do not use ftp

`s#ftp://#https://#`
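The substitution above is `sed` syntax using `#` as the delimiter, which avoids escaping the slashes in the URLs. A minimal sketch of applying it in place (the script name and URL are hypothetical):

```shell
# Demo: rewrite every ftp:// URL to https:// in place.
printf 'curl -O ftp://download.example.com/cudnn.tgz\n' > fetch_cudnn.sh
sed -i 's#ftp://#https://#g' fetch_cudnn.sh
cat fetch_cudnn.sh
```

Any character can follow `s` as the delimiter; `#` is chosen here so `ftp://` needs no backslashes.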

* Remove no-longer relevant comment
* add magma build for CUDA12.1

* copy and fix CMake.patch; drop sm_37 for CUDA 12.1
* remove CUDA 11.6 builds

* remove more 11.6 builds
* enable nightly CUDA 12.1 builds

* fix version typo
* Remove special case for Python 3.11

* Remove install torch script
* Windows CUDA 12.1 changes

* add CUDA version checks for Windows MAGMA builds

* use magma branch without fermi arch
And add `12.1` to the matrix

Test plan: `conda build . -c nvidia` and observe https://anaconda.org/malfet/pytorch-cuda/files
As 10.9 was released a decade ago and therefore does not support the C++17 standard.

Similar to pytorch/pytorch#99857
To fix builds, though we should really target 11.0 at the very least
* Fix nvjitlink inclusion in 12.1 wheels

* Fix typo
As it should be part of the AMI
atalman and others added 29 commits June 29, 2023 19:39
* Fix wheel validations

* Try using upgrade flag instead

* try uninstall

* test

* Try using python3

* use python3 vs python for validation

* Fix windows vs other os python execution

* Uninstall fix
…torch#1444)

More arm64 changes

test run under environment

sleep 15min allow investigate

add sleep

test

test

Test

test

test

Arm64 use python

fix

test

testing

test

tests

testing

test

test
Use [`nvidia/cuda:11.4.3-devel-centos7`](https://hub.docker.com/layers/nvidia/cuda/11.4.3-devel-centos7/images/sha256-e2201a4954dfd65958a6f5272cd80b968902789ff73f26151306907680356db8?context=explore) because `nvidia/cuda:10.2-devel-centos7` was deleted in accordance with [Nvidia's Container Support Policy](https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md):
> After a period of Six Months time, the EOL tags WILL BE DELETED from Docker Hub and Nvidia GPU Cloud (NGC). This deletion ensures unsupported tags (and image layers) are not left lying around for customers to continue using after they have long been abandoned.

Also delete redundant DEVTOOLSET=7 clause
Followup after pytorch#1446

CUDA-10.2 and moreover CUDA-9.2 docker images are gone per [Nvidia's Container Support Policy](https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md):
> After a period of Six Months time, the EOL tags WILL BE DELETED from Docker Hub and Nvidia GPU Cloud (NGC). This deletion ensures unsupported tags (and image layers) are not left lying around for customers to continue using after they have long been abandoned.

Also, as all our Docker scripts install the CUDA toolkit anyway, what's the point of using `nvidia/cuda` images at all instead of the `centos:7`/`ubuntu:18.04` images the former are based on, according to https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/11.4.3/centos7/base/Dockerfile

Explicitly install `g++` to `libtorch/Docker` base image, as it's needed by `patchelf`

Please note that `libtorch/Docker` cannot be built without BuildKit, as the `rocm` step depends on `python3`, which is not available in the `cpu` image
Not sure what weird version of `wget` is getting installed, but an attempt to download https://anaconda.org/pytorch/magma-cuda121/2.6.1/download/linux-64/magma-cuda121-2.6.1-1.tar.bz2 fails with:
```
--2023-07-06 03:18:38--  https://anaconda.org/pytorch/magma-cuda121/2.6.1/download/linux-64/magma-cuda121-2.6.1-1.tar.bz2
Resolving anaconda.org (anaconda.org)... 104.17.93.24, 104.17.92.24, 2606:4700::6811:5d18, ...
Connecting to anaconda.org (anaconda.org)|104.17.93.24|:443... connected.
ERROR: cannot verify anaconda.org's certificate, issued by ‘/C=US/O=Let's Encrypt/CN=E1’:
  Issued certificate has expired.
To connect to anaconda.org insecurely, use `--no-check-certificate'.
```

Also, switch from NVIDIA container to a stock `centos:7` one, to make containers slimmer and fit on standard GitHub Actions runners.
And add `nvcc` to path

Regression introduced by pytorch#1447 when NVIDIA image was dropped in favor of base `centos` image
As NNC is dead and the LLVM dependency has not been updated in the last 4 years

First step towards fixing pytorch/pytorch#103756
As [`pytorch/manylinux-builder`](https://hub.docker.com/r/pytorch/manylinux-builder) containers have only one version of CUDA, there is no need to select one

Nor set up `LD_LIBRARY_PATH`, as it does not match the setup users might have on their systems (but keep it for libtorch tests)

Should fix a crash due to a different minor version of cuDNN being installed in the Docker container than the one specified as a dependency of the small wheel package, seen here https://github.com/pytorch/pytorch/actions/runs/5478547018/jobs/9980463690
* Rebuild docker images on release

* Include with-push
I.e. applying the same changes as in pytorch@4a7ed14 to libtorch docker builds
…rch#1452)

This reverts commit 2ba03df as it essentially broke all the builds on trunk (fix is coming)
This is a reland of pytorch#1451 with an important fix to the branches filter: entries in a multi-line array definition should start with `-`, otherwise it was attempting to match the branch name `main\nrelease/*`. I.e., just copy-n-paste the example from https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#using-filters
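Following the example from the GitHub Actions docs linked above, the corrected filter would look like this (a sketch, not the exact workflow file):

```yaml
on:
  push:
    branches:
      - main        # each entry starts with '-' ...
      - release/*   # ... so the two patterns are separate array items
```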

Test plan: actionlint .github/workflows/build-manywheel-images.yml

Original PR description:
Rebuild docker images on release builds. It should also tag images for release here: https://github.com/pytorch/builder/blob/3fc310ac21c9ede8d0ce13ec71096820a41eb9f4/conda/build_docker.sh#L58-L60
This is first step in pinning docker images for release.
As [`pytorch/manylinux-builder`](https://hub.docker.com/r/pytorch/manylinux-builder) containers have only one version of CUDA, there is no need to select one

Nor set up `LD_LIBRARY_PATH`, as it does not match the setup users might have on their systems (but keep it for libtorch tests for now)

Should fix a crash due to a different minor version of cuDNN being installed in the Docker container than the one specified as a dependency of the small wheel package, seen here https://github.com/pytorch/pytorch/actions/runs/5478547018/jobs/9980463690
* Update manywheel and libtorch images to rocm5.6
* Add MIOpen branch for ROCm5.6
* Add msccl-algorithms directory to PyTorch wheel

* Bundle msccl-algorithms into wheel

* Use correct src path for msccl-algorithms

(cherry picked from commit 95b5af3)
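Bundling a directory like this typically amounts to copying it into the package tree before the wheel is built. A minimal sketch under that assumption; the paths below are illustrative, not the actual builder-script locations:

```shell
# Copy an msccl-algorithms directory into the wheel's lib tree.
mkdir -p rccl/msccl-algorithms torch/lib              # demo layout only
printf '<algo/>\n' > rccl/msccl-algorithms/allreduce.xml
cp -r rccl/msccl-algorithms torch/lib/
ls torch/lib/msccl-algorithms
```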

* Add hipblaslt dependency for ROCm5.6 onwards

* Update build_all_docker.sh to ROCm5.6
…#1462)

* Fix lapack missing and armcl update

* update ARMCL version
…pdate msccl path for ROCm5.7

(cherry picked from commit 36c10cc)
@jithunnair-amd jithunnair-amd merged commit 6987207 into main Aug 7, 2023
4 of 68 checks passed