
[Refiled] IFU-main-2023-07-31 #36

Merged
merged 99 commits into from
Aug 7, 2023
Conversation

@jithunnair-amd jithunnair-amd commented Aug 7, 2023

Re-attempt for #35

Mistakenly hit "Squash and merge" when it should have been a regular "Merge" to maintain individual commits.

This IFU PR brings in the following notable changes for ROCm:

- Using `-complete` base images for CentOS and Ubuntu
- Removing the `install_rocm.sh` step from Dockerfiles, since it's not needed when using the `-complete` images
- A fix for erroneous logic that caused msccl-algorithm files not to be included for ROCm 5.6 and above
Tested successfully for wheels via http://rocmhead.amd.com:8080/job/pytorch/job/dev/job/manylinux_rocm_wheels/243 using ROCm5.7 RC1 (build 7) and PyTorch 2.0

ptrblck and others added 30 commits March 28, 2023 14:12
* add 12.1 workflow for docker image build

* add github workflow

* update cuDNN to 8.8.1 and location for archive
* Do not use ftp

`s#ftp://#https://#`
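The substitution above is `sed` syntax using `#` as the delimiter, which avoids escaping the slashes in the URLs. A minimal sketch of applying it in place (the script name and URL are hypothetical):

```shell
# Demo: rewrite every ftp:// URL to https:// in place.
printf 'curl -O ftp://download.example.com/cudnn.tgz\n' > fetch_cudnn.sh
sed -i 's#ftp://#https://#g' fetch_cudnn.sh
cat fetch_cudnn.sh
```

Any character can follow `s` as the delimiter; `#` is chosen here so `ftp://` needs no backslashes.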

* Remove no-longer relevant comment
* add magma build for CUDA12.1

* copy and fix CMake.patch; drop sm_37 for CUDA 12.1
* remove CUDA 11.6 builds

* remove more 11.6 builds
* enable nightly CUDA 12.1 builds

* fix version typo
* Remove special case for Python 3.11

* Remove install torch script
* Windows CUDA 12.1 changes

* add CUDA version checks for Windows MAGMA builds

* use magma branch without fermi arch
And add `12.1` to the matrix

Test plan: `conda build . -c nvidia` and observe https://anaconda.org/malfet/pytorch-cuda/files
As 10.9 was released a decade ago and therefore does not support the C++17 standard.

Similar to pytorch/pytorch#99857
To fix builds, though we should really target 11.0 at the very least
* Fix nvjitlink inclusion in 12.1 wheels

* Fix typo
As it should be part of the AMI
atalman and others added 29 commits June 29, 2023 19:39
* Fix wheel validations

* Try using upgrade flag instead

* try uninstall

* test

* Try using python3

* use python3 vs python for validation

* Fix windows vs other os python execution

* Uninstall fix
…torch#1444)

More arm64 changes

test run under environment

sleep 15min allow investigate

add sleep

test

test

Test

test

test

Arm64 use python

fix

test

testing

test

tests

testing

test

test
Use [`nvidia/cuda:11.4.3-devel-centos7`](https://hub.docker.com/layers/nvidia/cuda/11.4.3-devel-centos7/images/sha256-e2201a4954dfd65958a6f5272cd80b968902789ff73f26151306907680356db8?context=explore) because `nvidia/cuda:10.2-devel-centos7` was deleted in accordance with [Nvidia's Container Support Policy](https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md):
> After a period of Six Months time, the EOL tags WILL BE DELETED from Docker Hub and Nvidia GPU Cloud (NGC). This deletion ensures unsupported tags (and image layers) are not left lying around for customers to continue using after they have long been abandoned.

Also delete redundant DEVTOOLSET=7 clause
Followup after pytorch#1446

CUDA-10.2 and moreover CUDA-9.2 docker images are gone per [Nvidia's Container Support Policy](https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md):
> After a period of Six Months time, the EOL tags WILL BE DELETED from Docker Hub and Nvidia GPU Cloud (NGC). This deletion ensures unsupported tags (and image layers) are not left lying around for customers to continue using after they have long been abandoned.

Also, as all our Docker scripts install the CUDA toolkit anyway, what's the point of using `nvidia/cuda` images at all instead of the `centos:7`/`ubuntu:18.04` images the former are based on, according to https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/11.4.3/centos7/base/Dockerfile

Explicitly install `g++` to `libtorch/Docker` base image, as it's needed by `patchelf`

Please note that `libtorch/Docker` cannot be built without BuildKit, as the `rocm` step depends on `python3`, which is not available in the `cpu` image
Not sure what weird version of `wget` is getting installed, but an attempt to download https://anaconda.org/pytorch/magma-cuda121/2.6.1/download/linux-64/magma-cuda121-2.6.1-1.tar.bz2 fails with:
```
--2023-07-06 03:18:38--  https://anaconda.org/pytorch/magma-cuda121/2.6.1/download/linux-64/magma-cuda121-2.6.1-1.tar.bz2
Resolving anaconda.org (anaconda.org)... 104.17.93.24, 104.17.92.24, 2606:4700::6811:5d18, ...
Connecting to anaconda.org (anaconda.org)|104.17.93.24|:443... connected.
ERROR: cannot verify anaconda.org's certificate, issued by ‘/C=US/O=Let's Encrypt/CN=E1’:
  Issued certificate has expired.
To connect to anaconda.org insecurely, use `--no-check-certificate'.
```

Also, switch from NVIDIA container to a stock `centos:7` one, to make containers slimmer and fit on standard GitHub Actions runners.
And add `nvcc` to path

Regression introduced by pytorch#1447 when NVIDIA image was dropped in favor of base `centos` image
As NNC is dead and the LLVM dependency has not been updated in the last 4 years

First step towards fixing pytorch/pytorch#103756
As [`pytorch/manylinux-builder`](https://hub.docker.com/r/pytorch/manylinux-builder) containers have only one version of CUDA, there is no need to select one

Nor set up `LD_LIBRARY_PATH`, as it does not match the setup users might have on their systems (but keep it for libtorch tests)

Should fix a crash due to a different minor version of cuDNN being installed in the Docker container than the one specified as a dependency of the small wheel package, seen here https://github.com/pytorch/pytorch/actions/runs/5478547018/jobs/9980463690
* Rebuild docker images on release

* Include with-push
I.e. applying the same changes as in pytorch@4a7ed14 to libtorch docker builds
…rch#1452)

This reverts commit 2ba03df as it essentially broke all the builds on trunk (fix is coming)
This is a reland of pytorch#1451 with an important fix to the branches filter: entries in a multi-line array definition should start with `-`, otherwise it was attempting to match the branch name `main\nrelease/*`. I.e., just copy-n-paste the example from https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#using-filters
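Following the example from the GitHub Actions docs linked above, the corrected filter would look like this (a sketch, not the exact workflow file):

```yaml
on:
  push:
    branches:
      - main        # each entry starts with '-' ...
      - release/*   # ... so the two patterns are separate array items
```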

Test plan: actionlint .github/workflows/build-manywheel-images.yml

Original PR description:
Rebuild docker images on release builds. It should also tag images for release here: https://github.com/pytorch/builder/blob/3fc310ac21c9ede8d0ce13ec71096820a41eb9f4/conda/build_docker.sh#L58-L60
This is first step in pinning docker images for release.
As [`pytorch/manylinux-builder`](https://hub.docker.com/r/pytorch/manylinux-builder) containers have only one version of CUDA, there is no need to select one

Nor set up `LD_LIBRARY_PATH`, as it does not match the setup users might have on their systems (but keep it for libtorch tests for now)

Should fix a crash due to a different minor version of cuDNN being installed in the Docker container than the one specified as a dependency of the small wheel package, seen here https://github.com/pytorch/pytorch/actions/runs/5478547018/jobs/9980463690
* Update manywheel and libtorch images to rocm5.6
* Add MIOpen branch for ROCm5.6
* Add msccl-algorithms directory to PyTorch wheel

* Bundle msccl-algorithms into wheel

* Use correct src path for msccl-algorithms

(cherry picked from commit 95b5af3)
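Bundling a directory like this typically amounts to copying it into the package tree before the wheel is built. A minimal sketch under that assumption; the paths below are illustrative, not the actual builder-script locations:

```shell
# Copy an msccl-algorithms directory into the wheel's lib tree.
mkdir -p rccl/msccl-algorithms torch/lib              # demo layout only
printf '<algo/>\n' > rccl/msccl-algorithms/allreduce.xml
cp -r rccl/msccl-algorithms torch/lib/
ls torch/lib/msccl-algorithms
```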

* Add hipblaslt dependency for ROCm5.6 onwards

* Update build_all_docker.sh to ROCm5.6
…#1462)

* Fix lapack missing and armcl update

* update ARMCL version
…pdate msccl path for ROCm5.7

(cherry picked from commit 36c10cc)
@jithunnair-amd jithunnair-amd merged commit 6987207 into main Aug 7, 2023
4 of 68 checks passed