feat(ansible): upgrade for CUDA, TensorRT and CUDNN #5733

Merged
merged 2 commits into from
Feb 13, 2025

Conversation

amadeuszsz
Contributor

Description

Copy of #5608 with the fix from #5730.

#5608 was reverted by #5729 because the upgraded dependencies were incompatible with the 0.40.0 autoware.universe release.

This PR can be opened after the Autoware 0.41.0 release.

How was this PR tested?

Notes for reviewers

None.

Effects on system behavior

None.


github-actions bot commented Feb 5, 2025

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

@amadeuszsz amadeuszsz self-assigned this Feb 5, 2025
@amadeuszsz amadeuszsz added the type:containers Docker containers, containerization of components, or container orchestration. label Feb 5, 2025
@amadeuszsz amadeuszsz mentioned this pull request Feb 5, 2025
19 tasks
@amadeuszsz amadeuszsz marked this pull request as ready for review February 12, 2025 07:32
@amadeuszsz amadeuszsz added the tag:run-health-check Run health-check label Feb 12, 2025
@xmfcx
Contributor

xmfcx commented Feb 13, 2025

Without the PR

Rosbag Replay Sim Demo

AWSIM Labs Demo

Had to merge these 2 PRs:

With the PR

First removed the following files to regenerate:

~/autoware_data/traffic_light_classifier/ped_traffic_light_classifier_mobilenetv2_batch_4.engine
~/autoware_data/traffic_light_classifier/traffic_light_classifier_mobilenetv2_batch_6.engine
~/autoware_data/lidar_centerpoint/pts_backbone_neck_head_centerpoint_tiny.engine
~/autoware_data/lidar_centerpoint/pts_voxel_encoder_centerpoint_tiny.engine
~/autoware_data/traffic_light_fine_detector/tlr_car_ped_yolox_s_batch_6.engine
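
TensorRT engine files are built for a specific TensorRT version and GPU, which is why the cached ones listed above have to be regenerated after the upgrade. A minimal sketch for finding and deleting them in bulk, assuming all cached engines live under ~/autoware_data and end in .engine (the function names are hypothetical, not part of Autoware):

```shell
# List cached TensorRT engine files under a data directory.
# $1: data directory (assumed default: ~/autoware_data).
list_engine_files() {
  find "${1:-$HOME/autoware_data}" -type f -name '*.engine' | sort
}

# Delete the cached engine files and print each removed path.
# Other model files (for example .onnx) are left untouched.
remove_engine_files() {
  find "${1:-$HOME/autoware_data}" -type f -name '*.engine' -print -delete
}
```

Running list_engine_files first gives a chance to review what would be deleted before calling remove_engine_files.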

Installed:

sudo apt-get -y install cuda-toolkit-12-4
sudo apt-get install -y cuda-drivers-550

For cuDNN and TensorRT, I followed https://github.com/autowarefoundation/autoware/blob/c4a8feff4e59dcce34d3005b3c5e7ed89f98501c/ansible/roles/tensorrt/README.md, but used the amd64.env file from this version.

Since I had the previous versions installed, I first had to unhold them with:

sudo apt-mark unhold \
libcudnn8 \
libnvinfer10 \
libnvinfer-plugin10 \
libnvonnxparsers10 \
libcudnn8-dev \
libnvinfer-dev \
libnvinfer-plugin-dev \
libnvonnxparsers-dev \
libnvinfer-headers-dev \
libnvinfer-headers-plugin-dev
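
Rather than typing the package list by hand, the held cuDNN and TensorRT packages can be discovered from `apt-mark showhold`. A rough sketch, assuming the relevant package names start with the prefixes below (the function name is hypothetical, and the libnvparsers prefix is included only because that package shows up in the error further down):

```shell
# Keep only package names that look like cuDNN / TensorRT packages.
# Reads one package name per line on stdin, e.g. from `apt-mark showhold`.
# Prefix list is an assumption; adjust it to the packages on your machine.
filter_nvidia_holds() {
  grep -E '^(libcudnn|libnvinfer|libnvonnxparsers|libnvparsers)' || true
}

# Example usage (not executed here; requires apt and sudo):
#   apt-mark showhold | filter_nvidia_holds | xargs -r sudo apt-mark unhold
```

This also catches held packages that are missing from a hand-written list.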

I still got an error:

$ sudo apt-get install -y libcudnn8=${cudnn_version} libnvinfer10=${tensorrt_version} libnvinfer-plugin10=${tensorrt_version} libnvonnxparsers10=${tensorrt_version} libcudnn8-dev=${cudnn_version} libnvinfer-dev=${tensorrt_version} libnvinfer-plugin-dev=${tensorrt_version} libnvinfer-headers-dev=${tensorrt_version} libnvinfer-headers-plugin-dev=${tensorrt_version} libnvonnxparsers-dev=${tensorrt_version}
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libnvparsers-dev : Depends: libnvinfer-dev (= 8.6.1.6-1+cuda12.0) but 10.8.0.43-1+cuda12.8 is to be installed
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages

To resolve this, I installed aptitude (sudo apt install -y aptitude) and then ran:
sudo aptitude install -y libcudnn8=${cudnn_version} libnvinfer10=${tensorrt_version} libnvinfer-plugin10=${tensorrt_version} libnvonnxparsers10=${tensorrt_version} libcudnn8-dev=${cudnn_version} libnvinfer-dev=${tensorrt_version} libnvinfer-plugin-dev=${tensorrt_version} libnvinfer-headers-dev=${tensorrt_version} libnvinfer-headers-plugin-dev=${tensorrt_version} libnvonnxparsers-dev=${tensorrt_version}
This installed the packages successfully.
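
The unmet-dependency line in the apt-get error names the package that blocks the install: libnvparsers-dev requires libnvinfer-dev 8.6.1.6, so it is most likely a leftover from the previous TensorRT version that was not part of the install command. A small sketch for pulling such package names out of saved apt output, assuming the output format shown above (the function name is hypothetical):

```shell
# Print the packages that apt reports as having unmet dependencies.
# Reads apt-get output on stdin and matches lines of the form
#   " <package> : Depends: ..."
blocking_packages() {
  sed -n 's/^ *\([^ ]*\) : Depends: .*/\1/p' | sort -u
}

# Example usage (not executed here):
#   sudo apt-get install -y ... 2>&1 | blocking_packages
#   dpkg -l libnvparsers-dev   # inspect the installed version
```

Knowing the blocking package makes it easier to judge the resolution that aptitude proposes before accepting it.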

Rosbag Replay Sim Demo

AWSIM Labs Demo


It all works 🎊

Signed-off-by: Amadeusz Szymko <amadeusz.szymko.2@tier4.jp>
Signed-off-by: Amadeusz Szymko <amadeusz.szymko.2@tier4.jp>
@xmfcx xmfcx force-pushed the feat/cuda-deps-aw-0.41.0 branch from 9550b20 to c4a8fef Compare February 13, 2025 05:21
@xmfcx xmfcx merged commit daba3d9 into autowarefoundation:main Feb 13, 2025
18 checks passed
@xmfcx
Contributor

xmfcx commented Feb 13, 2025

I also tested with nvidia-driver-570 (sudo apt install nvidia-driver-570), and it works without issues.

According to https://docs.nvidia.com/deploy/cuda-compatibility/, newer drivers now support older CUDA versions:

Backwards compatibility ensures that a newer NVIDIA driver can be used with an older CUDA Toolkit. This is implicit and most simple way of doing upgrades.
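
A quick way to act on this is to compare the installed driver's major version with the branch the toolkit was tested against. A minimal sketch, assuming 550 (the driver branch installed above alongside cuda-toolkit-12-4) as the threshold; the function name is hypothetical, and NVIDIA's compatibility table remains the authoritative source for the real minimum:

```shell
# Succeed if the driver version in $1 (e.g. "570.86.15") has a
# major version greater than or equal to $2 (e.g. 550).
driver_major_at_least() {
  major="${1%%.*}"   # strip everything after the first dot
  [ "$major" -ge "$2" ]
}

# Example usage (not executed here; requires an NVIDIA driver):
#   ver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
#   driver_major_at_least "$ver" 550 && echo "driver is new enough"
```

With this check, both the 550 and 570 drivers mentioned in this thread pass for a threshold of 550.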

@amadeuszsz
Contributor Author

amadeuszsz commented Feb 13, 2025

@xmfcx
The autoware-base build failed, but the manual trigger I ran earlier didn't fail that early 😢

FYI, the error dpkg: error processing package libc-bin (--configure): also occurred on my local amd64 machine. I solved it with apt install qemu-system and apt install qemu-user-static, but I don't know how that applies to our runner, since QEMU seems to be configured there.
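
One way to check whether QEMU user emulation is actually available on a machine is to look for the registered binfmt_misc handler for the target architecture. A minimal sketch, assuming the handler is registered under the usual name qemu-aarch64 for linux/arm64 builds (the function name is hypothetical):

```shell
# Succeed if a binfmt_misc handler for the given QEMU target exists.
# $1: target name, e.g. aarch64
# $2: binfmt_misc directory (defaults to the standard kernel path)
has_qemu_binfmt() {
  [ -f "${2:-/proc/sys/fs/binfmt_misc}/qemu-$1" ]
}

# Example usage (not executed here):
#   has_qemu_binfmt aarch64 || echo "no arm64 emulation registered"
```

If the handler is missing on an amd64 host, arm64 binaries in the image cannot run during the build, which would be consistent with package configure steps failing.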

I can confirm successful image build for linux/arm64 on my machine.

Potential fix for our workflow. If this is the case, please feel free to open and review this PR.

@xmfcx
Contributor

xmfcx commented Feb 13, 2025

Isn't this a cache issue?

It's just doing some basic apt package installs, not even related to this PR. Or am I missing something?

@youtalk -san, maybe we should disable caching for that command?

# Install apt packages and add GitHub to known hosts for private repositories
RUN rm -f /etc/apt/apt.conf.d/docker-clean \
  && echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' >/etc/apt/apt.conf.d/keep-cache
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
  apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y install --no-install-recommends \
  gosu \
  ssh \
  && /autoware/cleanup_apt.sh \
  && mkdir -p ~/.ssh \
  && ssh-keyscan github.com >> ~/.ssh/known_hosts

@amadeuszsz
Contributor Author

@xmfcx

They upgraded the latest tag recently. Initially I thought the tag upgrade might resolve the issue, but now I'm inclined to roll it back.
