Add TransformerEngine to PT 2.0 training images #3315

Merged Sep 26, 2023: 40 commits, merged into master from arjkesh:tf_engine
Commits (40)
813bfe9  Add TransformerEngine to PT 2.0 training images (arjkesh, Sep 7, 2023)
9a50f6b  Merge branch 'master' into tf_engine (arjkesh, Sep 7, 2023)
4653068  Update Dockerfile.gpu (arjkesh, Sep 8, 2023)
227daaa  Update buildspec.yml (arjkesh, Sep 8, 2023)
efe2170  Update buildspec.yml (arjkesh, Sep 8, 2023)
97d3440  install cudnn (arjkesh, Sep 11, 2023)
d5d0314  Update Dockerfile.gpu (arjkesh, Sep 12, 2023)
e1d10c8  update (arjkesh, Sep 19, 2023)
d9d742d  Merge branch 'tf_engine' of https://github.com/arjkesh/deep-learning-… (arjkesh, Sep 19, 2023)
ce8d087  update (arjkesh, Sep 19, 2023)
ee98782  Update Dockerfile.gpu (arjkesh, Sep 20, 2023)
22d7d60  Update Dockerfile.gpu (arjkesh, Sep 20, 2023)
af662fd  Update Dockerfile.gpu (arjkesh, Sep 20, 2023)
c97541b  Merge branch 'master' of https://github.com/aws/deep-learning-contain… (arjkesh, Sep 22, 2023)
6cb71c8  save progress (arjkesh, Sep 23, 2023)
d5626a4  skip efa (arjkesh, Sep 23, 2023)
3d83645  run TE test (arjkesh, Sep 23, 2023)
15df3aa  update formatting (arjkesh, Sep 23, 2023)
e91071d  update formatting (arjkesh, Sep 23, 2023)
02c9187  update (arjkesh, Sep 25, 2023)
6f1caca  rebuild image (arjkesh, Sep 25, 2023)
95a9003  update cudnn (arjkesh, Sep 25, 2023)
7530535  update cudnn to 8.9.4.25 for fused attn fix (arjkesh, Sep 25, 2023)
51594ab  try cudnn 8.9.5 (arjkesh, Sep 25, 2023)
91285fe  install TE v12 (arjkesh, Sep 25, 2023)
f6976e5  revert to 8.9.3, upgrade TE (arjkesh, Sep 25, 2023)
c410e1d  add cudnn match test (arjkesh, Sep 25, 2023)
edd5550  add cudnn test (arjkesh, Sep 26, 2023)
6c91e67  python formatting (arjkesh, Sep 26, 2023)
799e144  add hide=true for ease of debug (arjkesh, Sep 26, 2023)
7a8f34b  docstring update (arjkesh, Sep 26, 2023)
5e14ead  patch cryptography (arjkesh, Sep 26, 2023)
6368fc0  revert temp changes (arjkesh, Sep 26, 2023)
8ae87c2  update skip condition (arjkesh, Sep 26, 2023)
4295532  update (arjkesh, Sep 26, 2023)
52c8b86  typo fix (arjkesh, Sep 26, 2023)
b2787cd  add docker pull cmd (arjkesh, Sep 26, 2023)
2c43c26  update test, format (arjkesh, Sep 26, 2023)
da102bb  Update testPTTransformerEngine (arjkesh, Sep 26, 2023)
7fb2477  Update Dockerfile.gpu (arjkesh, Sep 26, 2023)
60 changes: 30 additions & 30 deletions pytorch/training/buildspec.yml
@@ -41,21 +41,21 @@ images:
  #   target: ec2
  #   context:
  #     <<: *TRAINING_CONTEXT
-  # BuildEC2GPUPTTrainPy3cu121DockerImage:
-  #   <<: *TRAINING_REPOSITORY
-  #   build: &PYTORCH_GPU_TRAINING_PY3 false
-  #   image_size_baseline: 19700
-  #   device_type: &DEVICE_TYPE gpu
-  #   python_version: &DOCKER_PYTHON_VERSION py3
-  #   tag_python_version: &TAG_PYTHON_VERSION py310
-  #   cuda_version: &CUDA_VERSION cu121
-  #   os_version: &OS_VERSION ubuntu20.04
-  #   tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
-  #   docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile.,
-  #     *DEVICE_TYPE ]
-  #   target: ec2
-  #   context:
-  #     <<: *TRAINING_CONTEXT
+  BuildEC2GPUPTTrainPy3cu121DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_GPU_TRAINING_PY3 false
+    image_size_baseline: 19700
+    device_type: &DEVICE_TYPE gpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py310
+    cuda_version: &CUDA_VERSION cu121
+    os_version: &OS_VERSION ubuntu20.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile.,
+      *DEVICE_TYPE ]
+    target: ec2
+    context:
+      <<: *TRAINING_CONTEXT
  # BuildEC2GPUPTTrainPy3cu118DockerImage:
  #   <<: *TRAINING_REPOSITORY
  #   build: &PYTORCH_GPU_TRAINING_PY3 false
@@ -84,21 +84,21 @@ images:
  #   target: sagemaker
  #   context:
  #     <<: *TRAINING_CONTEXT
-  BuildSageMakerGPUPTTrainPy3DockerImage:
-    <<: *TRAINING_REPOSITORY
-    build: &PYTORCH_GPU_TRAINING_PY3 false
-    image_size_baseline: 21500
-    device_type: &DEVICE_TYPE gpu
-    python_version: &DOCKER_PYTHON_VERSION py3
-    tag_python_version: &TAG_PYTHON_VERSION py310
-    cuda_version: &CUDA_VERSION cu118
-    os_version: &OS_VERSION ubuntu20.04
-    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
-    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile.,
-      *DEVICE_TYPE ]
-    target: sagemaker
-    context:
-      <<: *TRAINING_CONTEXT
+  # BuildSageMakerGPUPTTrainPy3DockerImage:
+  #   <<: *TRAINING_REPOSITORY
+  #   build: &PYTORCH_GPU_TRAINING_PY3 false
+  #   image_size_baseline: 21500
+  #   device_type: &DEVICE_TYPE gpu
+  #   python_version: &DOCKER_PYTHON_VERSION py3
+  #   tag_python_version: &TAG_PYTHON_VERSION py310
+  #   cuda_version: &CUDA_VERSION cu118
+  #   os_version: &OS_VERSION ubuntu20.04
+  #   tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+  #   docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile.,
+  #     *DEVICE_TYPE ]
+  #   target: sagemaker
+  #   context:
+  #     <<: *TRAINING_CONTEXT
  # BuildPyTorchExampleGPUTrainPy3cu121DockerImage:
  #   <<: *TRAINING_REPOSITORY
  #   build: &PYTORCH_GPU_TRAINING_PY3 false
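The buildspec composes image tags and Dockerfile paths from YAML anchors (&NAME), aliases (*NAME), merge keys (<<:), and a custom !join tag. A minimal sketch of how such a tag can be resolved with PyYAML; the constructor below and the 2.0.1 version string are assumptions for illustration, not the repo's actual build code:

import yaml

# Hypothetical stand-in for the buildspec's "!join" tag: concatenate the
# items of the tagged YAML sequence into a single string.
def join_constructor(loader, node):
    return "".join(str(item) for item in loader.construct_sequence(node))

yaml.SafeLoader.add_constructor("!join", join_constructor)

snippet = """
version: &VERSION 2.0.1
device_type: &DEVICE_TYPE gpu
tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-py310-cu121-ubuntu20.04-ec2" ]
"""
print(yaml.safe_load(snippet)["tag"])  # 2.0.1-gpu-py310-cu121-ubuntu20.04-ec2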
15 changes: 13 additions & 2 deletions pytorch/training/docker/2.0/py3/cu121/Dockerfile.gpu
@@ -46,6 +46,7 @@ ENV PATH /opt/conda/bin:$PATH
# 5.2 is G3 EC2 instance, 7.5 is G4*, 7.0 is p3*, 8.0 is P4*, 8.6 is G5* and 9.0 is P5*
ENV TORCH_CUDA_ARCH_LIST="5.2;7.0+PTX;7.5;8.0;8.6;9.0"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CUDNN_VERSION=8.9.3.28
ENV NCCL_VERSION=2.18.3
ENV EFA_VERSION=1.24.1
ENV GDRCOPY_VERSION=2.3.1
@@ -68,6 +69,8 @@ RUN apt-get update \
    build-essential \
    ca-certificates \
    cmake \
    libcudnn8=$CUDNN_VERSION-1+cuda12.1 \
    libcudnn8-dev=$CUDNN_VERSION-1+cuda12.1 \
    curl \
    emacs \
    git \
@@ -133,7 +136,7 @@ RUN /opt/conda/bin/mamba install -y -c conda-forge \
    # Adding package for studio kernels
    ipykernel \
    # patch CVE
-    "cryptography>=41.0.2" \
+    "cryptography>=41.0.4" \
    # patch CVE
    "pillow>=9.4" \
    "mpi4py>=3.1.4,<3.2" \
@@ -268,7 +271,7 @@ RUN /opt/conda/bin/mamba install -y -c conda-forge \
    && /opt/conda/bin/mamba clean -afy

# Patches
-RUN pip install "pillow>=9.5" opencv-python
+RUN pip install "pillow>=9.5" opencv-python huggingface_hub
RUN /opt/conda/bin/mamba install -y -c conda-forge \
    "requests>=2.31.0" \
    && /opt/conda/bin/mamba clean -afy
@@ -292,6 +295,14 @@ RUN pip install packaging \
    && cd .. \
    && rm -rf apex

# Install flash attn and NVIDIA transformer engine
ENV NVTE_FRAMEWORK=pytorch
# Install flash-attn using instructions from https://github.com/Dao-AILab/flash-attention#installation-and-features
# Set MAX_JOBS=4 to avoid OOM issues in installation process
RUN MAX_JOBS=4 pip install flash-attn==2.0.4 --no-build-isolation
# Install TE using instructions from https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html
RUN pip install git+https://github.com/NVIDIA/TransformerEngine.git@release_v0.12

RUN HOME_DIR=/root \
    && curl -o ${HOME_DIR}/oss_compliance.zip https://aws-dlinfra-utilities.s3.amazonaws.com/oss_compliance.zip \
    && unzip ${HOME_DIR}/oss_compliance.zip -d ${HOME_DIR}/ \
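With flash-attn 2.0.4, TransformerEngine release_v0.12, and cuDNN pinned to 8.9.3.28 in the image, a quick way to sanity-check a built image is an import-and-forward smoke test. A minimal sketch, assuming it runs inside the container on a GPU instance (this is not part of the PR's test suite):

import torch
import flash_attn
import transformer_engine.pytorch as te  # the image sets NVTE_FRAMEWORK=pytorch

print("torch:", torch.__version__)
print("flash-attn:", flash_attn.__version__)
print("cuDNN seen by torch:", torch.backends.cudnn.version())  # e.g. 8903 for cuDNN 8.9.3

# One TransformerEngine layer forward pass to confirm the extension loads and runs.
if torch.cuda.is_available():
    layer = te.Linear(128, 128).cuda()
    out = layer(torch.randn(4, 128, device="cuda"))
    print("te.Linear output shape:", tuple(out.shape))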
10 changes: 10 additions & 0 deletions test/dlc_tests/conftest.py
@@ -1025,6 +1025,11 @@ def skip_pt110():
    pass


@pytest.fixture(scope="session")
def pt21_and_above_only():
    pass


@pytest.fixture(scope="session")
def pt18_and_above_only():
    pass
@@ -1154,6 +1159,10 @@ def framework_version_within_limit(metafunc_obj, image):
        "skip_pt110" in metafunc_obj.fixturenames
        and is_equal_to_framework_version("1.10.*", image, image_framework_name)
    )
    pt21_requirement_failed = (
        "pt21_and_above_only" in metafunc_obj.fixturenames
        and is_below_framework_version("2.1", image, image_framework_name)
    )
    pt18_requirement_failed = (
        "pt18_and_above_only" in metafunc_obj.fixturenames
        and is_below_framework_version("1.8", image, image_framework_name)
@@ -1181,6 +1190,7 @@ def framework_version_within_limit(metafunc_obj, image):
        or below_pt113_requirement_failed
        or pt111_requirement_failed
        or not_pt110_requirement_failed
        or pt21_requirement_failed
        or pt18_requirement_failed
        or pt17_requirement_failed
        or pt16_requirement_failed
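The new fixture body is intentionally empty: merely requesting pt21_and_above_only in a test's signature puts it in metafunc_obj.fixturenames, which framework_version_within_limit uses to deselect the test for images below PyTorch 2.1. A sketch of a hypothetical test gated this way (the test name is illustrative):

def test_some_pt21_feature(pytorch_training, pt21_and_above_only):
    # Never collected for images whose PyTorch version is below 2.1;
    # the fixture itself does nothing at runtime.
    ...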
12 changes: 12 additions & 0 deletions transformerengine/testPTTransformerEngine (new container test script)
@@ -0,0 +1,12 @@
#!/bin/bash

set -ex

# Clone the TE branch matching the version installed in the image and run
# its PyTorch test suite from inside the container.
git clone --branch release_v0.12 https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine/tests/pytorch

pip install pytest==6.2.5 onnxruntime==1.13.1 onnx
pytest -v -s test_sanity.py
# Force deterministic algorithms and disable the TorchScript JIT for the numerics suite.
PYTORCH_JIT=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 pytest -v -s test_numerics.py
# Disable TE's torch.compile path for the ONNX export suite.
NVTE_TORCH_COMPILE=0 pytest -v -s test_onnx_export.py
pytest -v -s test_jit.py
44 changes: 44 additions & 0 deletions test/dlc_tests/ec2/pytorch/training/test_pytorch_training.py
@@ -620,3 +620,47 @@ def test_pytorch_standalone_hpu(
        container_name="ec2_training_habana_pytorch_container",
        enable_habana_async_execution=True,
    )


@pytest.mark.usefixtures("feature_aws_framework_present")
@pytest.mark.usefixtures("sagemaker")
@pytest.mark.integration("cudnn")
@pytest.mark.model("N/A")
@pytest.mark.parametrize("ec2_instance_type", PT_EC2_SINGLE_GPU_INSTANCE_TYPE, indirect=True)
def test_pytorch_cudnn_match_gpu(
    pytorch_training, ec2_connection, region, gpu_only, ec2_instance_type, pt21_and_above_only
):
"""
PT 2.1 reintroduces a dependency on CUDNN to support NVDA TransformerEngine. This test is to ensure that torch CUDNN matches system CUDNN in the container.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no PT 2.1 yet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no PT 2.1 yet, this is an anticipatory test that we are adding to ensure that torch binaries are compiled with the same cudnn as exists in the container

"""
    container_name = "pt_cudnn_test"
    ec2_connection.run(f"$(aws ecr get-login --no-include-email --region {region})", hide=True)
    ec2_connection.run(f"docker pull -q {pytorch_training}", hide=True)
    ec2_connection.run(
        f"nvidia-docker run --name {container_name} -itd {pytorch_training}", hide=True
    )
    major_cmd = "cat /usr/include/cudnn_version.h | grep '#define CUDNN_MAJOR'"
    minor_cmd = "cat /usr/include/cudnn_version.h | grep '#define CUDNN_MINOR'"
    patch_cmd = "cat /usr/include/cudnn_version.h | grep '#define CUDNN_PATCHLEVEL'"
    major = ec2_connection.run(
        f"nvidia-docker exec --user root {container_name} bash -c '{major_cmd}'", hide=True
    ).stdout.split()[-1]
    minor = ec2_connection.run(
        f"nvidia-docker exec --user root {container_name} bash -c '{minor_cmd}'", hide=True
    ).stdout.split()[-1]
    patch = ec2_connection.run(
        f"nvidia-docker exec --user root {container_name} bash -c '{patch_cmd}'", hide=True
    ).stdout.split()[-1]

    cudnn_from_torch = ec2_connection.run(
        f"nvidia-docker exec --user root {container_name} python -c 'from torch.backends import cudnn; print(cudnn.version())'",
        hide=True,
    ).stdout.strip()

Review comment (Contributor): this cuDNN comes from PyTorch, not from the installed OS package, right?

Reply (Contributor Author): This cuDNN represents the version that torch is compiled with, not the DLC cuDNN version. Torch essentially links cuDNN statically; while slightly different compile-time and system versions don't appear to be a big issue, this test is added for future safety so that the versions don't go out of sync.

    if len(patch) == 1:
        patch = f"0{patch}"

    system_cudnn = f"{major}{minor}{patch}"
    assert (
        system_cudnn == cudnn_from_torch
    ), f"System cuDNN {system_cudnn} and torch cuDNN {cudnn_from_torch} do not match. Please downgrade system cuDNN or recompile torch with the correct cuDNN version."
33 changes: 33 additions & 0 deletions test/dlc_tests/ec2/test_transformerengine.py
@@ -0,0 +1,33 @@
import os

import pytest

import test.test_utils.ec2 as ec2_utils
from test.test_utils import CONTAINER_TESTS_PREFIX, is_pr_context, is_efa_dedicated
from test.test_utils.ec2 import get_efa_ec2_instance_type, filter_efa_instance_type

PT_TE_TESTS_CMD = os.path.join(
    CONTAINER_TESTS_PREFIX, "transformerengine", "testPTTransformerEngine"
)


EC2_EFA_GPU_INSTANCE_TYPE_AND_REGION = get_efa_ec2_instance_type(
    default="p4d.24xlarge",
    filter_function=filter_efa_instance_type,
)


@pytest.mark.processor("gpu")
@pytest.mark.model("N/A")
@pytest.mark.integration("transformerengine")
@pytest.mark.usefixtures("sagemaker")
@pytest.mark.allow_p4de_use
@pytest.mark.parametrize("ec2_instance_type,region", EC2_EFA_GPU_INSTANCE_TYPE_AND_REGION)
@pytest.mark.skipif(
    is_pr_context() and not is_efa_dedicated(),
    reason="Skip heavy instance test in PR context unless explicitly enabled",
)
def test_pytorch_transformerengine(
    pytorch_training, ec2_connection, region, ec2_instance_type, gpu_only, py3_only
):
    ec2_utils.execute_ec2_training_test(ec2_connection, pytorch_training, PT_TE_TESTS_CMD)
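execute_ec2_training_test runs the container test script inside the training image on the remote instance. A rough sketch of the equivalent manual steps over the same Fabric connection, mirroring the docker pull / nvidia-docker run pattern of the cuDNN test above (the helper and container name are illustrative, not the harness's actual implementation):

def run_te_suite_sketch(ec2_connection, image_uri, test_cmd):
    container = "te_test_sketch"  # hypothetical container name
    ec2_connection.run(f"docker pull -q {image_uri}", hide=True)
    ec2_connection.run(f"nvidia-docker run --name {container} -itd {image_uri}", hide=True)
    # Execute testPTTransformerEngine inside the running container.
    return ec2_connection.run(f"nvidia-docker exec {container} bash -c '{test_cmd}'")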