kmod-5.10-nvidia: add remaining libraries #1928

Merged
arnaldo2792 merged 2 commits into bottlerocket-os:develop from the nvidia-integration branch
Feb 4, 2022

arnaldo2792
Contributor

@arnaldo2792 arnaldo2792 commented Jan 26, 2022

Issue number:
Closes #1822

Description of changes:

kmod-5.10-nvidia: add remaining libraries
kmod-5.10-nvidia: add releases url

The NVIDIA sources provide user-space libraries that will be mounted into the containers, depending on the set of driver capabilities configured for the workload.

Testing done:

I ran a daemonset in a p3.2xlarge instance with the following image definition:

FROM nvidia/cuda:11.4.3-devel-ubuntu20.04 as cuda-samples

RUN apt update
RUN apt install git build-essential -y
RUN git clone https://github.com/NVIDIA/cuda-samples.git

# Compute samples
RUN mkdir -p /samples
RUN cd /cuda-samples/Samples/0_Introduction/vectorAdd && make -j && [ -f vectorAdd ] && cp vectorAdd /samples/
RUN cd /cuda-samples/Samples/1_Utilities/bandwidthTest && make -j && [ -f bandwidthTest ] && cp bandwidthTest /samples/
RUN cd /cuda-samples/Samples/1_Utilities/deviceQuery && make -j && [ -f deviceQuery ] && cp deviceQuery /samples/
RUN cd /cuda-samples/Samples/1_Utilities/topologyQuery && make -j && [ -f topologyQuery ] && cp topologyQuery /samples/

FROM alpine as builder
RUN apk update \
  && apk add --update git

FROM builder as benchmarks
RUN git clone https://github.com/tensorflow/benchmarks.git \
  && cd benchmarks \
  && git checkout cnn_tf_v1.15_compatible

FROM tensorflow/tensorflow:1.15.2-gpu
ENV SAMPLES="vectorAdd bandwidthTest deviceQuery topologyQuery"
COPY ./entrypoint.sh /
COPY --from=benchmarks /benchmarks /opt/benchmarks
COPY --from=cuda-samples /samples/* /usr/bin/
RUN chmod +x ./entrypoint.sh && mkdir -p /opt
ENTRYPOINT ["sh", "-c", "/entrypoint.sh"]

Entrypoint:

#! /usr/bin/env bash

# Cuda samples:

for sample in $SAMPLES; do
  $sample
done

# GPU benchmark:
python3 /opt/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --data_name=imagenet                                                 \
  --model=resnet50                                                     \
  --num_batches=100                                                    \
  --batch_size=4                                                       \
  --num_gpus=1                                                         \
  --gpu_memory_frac_for_testing=0.2

The containers ran successfully.
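
The entrypoint above runs each sample without checking its exit status, so a failing sample would not stop the container. A minimal sketch of a stricter loop follows. The helper name `run_samples` is hypothetical and was not part of the tested image:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): run every command given as an
# argument, report PASS/FAIL for each, and return non-zero if any failed.
run_samples() {
  local failed=0 sample
  for sample in "$@"; do
    if "$sample"; then
      echo "PASS: $sample"
    else
      echo "FAIL: $sample"
      failed=1
    fi
  done
  return "$failed"
}

# Possible use inside the entrypoint, relying on the SAMPLES variable that the
# image sets (left unquoted on purpose so it splits into separate names):
#   run_samples $SAMPLES || exit 1
```

With this variant the container exits with a non-zero status as soon as the sample pass is over and at least one sample has failed, which makes a broken library mount easier to spot in the pod status.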

TODO:

  • Same test for aarch64

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@arnaldo2792 arnaldo2792 marked this pull request as ready for review January 28, 2022 01:43
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#driver-capabilities

# Utility libs
install -m755 libnvidia-ml.so.%{nvidia_tesla_470_version} %{buildroot}%{_cross_libdir}/nvidia/tesla/%{nvidia_tesla_470_version}
Member

Suggested change
- install -m755 libnvidia-ml.so.%{nvidia_tesla_470_version} %{buildroot}%{_cross_libdir}/nvidia/tesla/%{nvidia_tesla_470_version}
+ install -m 755 libnvidia-ml.so.%{nvidia_tesla_470_version} %{buildroot}%{_cross_libdir}/nvidia/tesla/%{nvidia_tesla_470_version}

nit. You might want to add a space to be consistent with the preexisting code, but it seems to work either way.

packages/kmod-5.10-nvidia/kmod-5.10-nvidia.spec (6 outdated review threads, resolved)
@arnaldo2792
Contributor Author

Forced push includes:

  • Install all libraries, and explicitly include and exclude the libraries in the %files section
  • Short global variable nvidia_tesla_470_version
  • Create only the required symlinks, based on the output of readelf -a <lib> | grep SONAME
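
The SONAME-based symlink step can be sketched roughly as below. This is an illustration under assumptions, not the code in the spec file: the function names are hypothetical, and it uses `readelf -d` (dynamic section only) rather than `readelf -a`, since the SONAME entry appears in both.

```shell
#!/usr/bin/env bash
# Extract the SONAME from `readelf -d` output supplied on stdin.
# Prints nothing if the input has no SONAME entry.
soname_from_readelf() {
  sed -n 's/.*(SONAME).*\[\(.*\)].*/\1/p' | head -n 1
}

# Hypothetical helper: create <dir>/<SONAME> -> <file name> next to the library.
link_by_soname() {
  local lib="$1" dir base soname
  dir="$(dirname "$lib")"
  base="$(basename "$lib")"
  soname="$(readelf -d "$lib" | soname_from_readelf)"
  # Nothing to do if there is no SONAME or the file is already named after it.
  if [ -z "$soname" ] || [ "$soname" = "$base" ]; then
    return 0
  fi
  ln -sf "$base" "$dir/$soname"
}

# Possible use (the path is a placeholder):
#   link_by_soname /path/to/libnvidia-ml.so.<version>
```

Linking only the name recorded in the SONAME entry avoids creating symlinks that nothing resolves at load time.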

Contributor

@etungsten etungsten left a comment

I tested both aarch64 and x86_64 builds with benchmarks and samples and they pass.

@arnaldo2792
Contributor Author

Forced push includes:

  • Automated symlink creation
  • Explanation on why some libraries are excluded

@arnaldo2792
Contributor Author

Forced push includes:

  • Remove compat32 libs

packages/kmod-5.10-nvidia/kmod-5.10-nvidia.spec (5 outdated review threads, resolved)
@bcressey
Contributor

bcressey commented Feb 3, 2022

Still trying to come up with better advice for what to include, since our current method seems pretty high-touch and error-prone.

I'd like to err on the side of including everything and letting libnvidia-container sort it out, with the possible exception of the Gtk and Wayland stuff that we know is excluded.

It seems like the only problem with that plan is what to do about the libEGL.so.1 symlink, which should point to one of the two libraries with that SONAME, and perhaps one of them should be excluded.

@bcressey
Contributor

bcressey commented Feb 3, 2022

It seems like the only problem with that plan is what to do about the libEGL.so.1 symlink, which should point to one of the two libraries with that SONAME, and perhaps one of them should be excluded.

Let's point libEGL.so.1 to libEGL.so.1.1.0 since that seems to be how the other libglvnd libraries are treated. We can still include both. On a running instance afterwards, you can check to see whether ldconfig --print-cache agrees with that resolution.

If we could get the output from libnvidia-container when it's checking the compat libraries for inclusion, that might help determine whether either or both of them is OK. Or check how the other driver container images handle this.
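
One way to run the `ldconfig --print-cache` check on an instance is sketched below. The helper name `cache_target` is hypothetical. The cache lists the path registered for each library name, which is usually the SONAME symlink itself, so `readlink -f` is needed to see the file it finally resolves to.

```shell
#!/usr/bin/env bash
# Print the path that the linker cache records for a library name.
# Reads `ldconfig --print-cache` output on stdin. If the name is listed more
# than once, only the first entry is printed.
cache_target() {
  awk -v name="$1" '$1 == name { print $NF; exit }'
}

# Possible use on a running instance:
#   path="$(ldconfig --print-cache | cache_target libEGL.so.1)"
#   readlink -f "$path"   # shows the real file behind the symlink
```

If the resolved file is libEGL.so.1.1.0, the cache agrees with the symlink choice proposed above.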

The NVIDIA sources provide user-space libraries that will be mounted
into the containers, depending on the set of driver capabilities
configured for the workload.

Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
@arnaldo2792
Contributor Author

If we could get the output from libnvidia-container when it's checking the compat libraries for inclusion, that might help determine whether either or both of them is OK. Or check how the other driver container images handle this.

libnvidia-container will complain when a library is missing, with a message like this, visible in the journal:

missing <compat32> library: <library>

With these changes, I didn't see any complaints:

bash-5.0# uname -a
Linux ip-192-168-74-162.us-west-2.compute.internal 5.10.93 #1 SMP Wed Jan 26 19:56:51 UTC 2022 x86_64 GNU/Linux
bash-5.0# journalctl | grep missing
Feb 03 23:12:00 ip-192-168-74-162.us-west-2.compute.internal kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
bash-5.0# uname -a
Linux ip-192-168-78-160.us-west-2.compute.internal 5.10.93 #1 SMP Wed Jan 26 19:54:26 UTC 2022 aarch64 GNU/Linux
bash-5.0# journalctl | grep missing
Feb 03 23:12:20 ip-192-168-78-160.us-west-2.compute.internal kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
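
The only hits from `grep missing` above are the kernel's module signature messages, which are unrelated to libnvidia-container. A narrower filter could look like the sketch below. The pattern is an assumption based on the message format quoted earlier in this comment, and the helper name is hypothetical:

```shell
#!/usr/bin/env bash
# Keep only journal lines that look like a "missing library" report.
# Assumption: the message contains "missing library" or
# "missing compat32 library", as in the format quoted above.
missing_library_lines() {
  # `|| true` keeps the exit status at zero when nothing matches.
  grep -E 'missing (compat32 )?library' || true
}

# Possible use on an instance:
#   journalctl | missing_library_lines
```

Empty output from this filter would correspond to the "no complaints" result reported for both architectures.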

@arnaldo2792 arnaldo2792 merged commit 0440bcb into bottlerocket-os:develop Feb 4, 2022
@arnaldo2792 arnaldo2792 deleted the nvidia-integration branch March 31, 2022 20:55
Development

Successfully merging this pull request may close these issues.

Add additional NVIDIA libraries to aws-k8s-1.21-nvidia variant
4 participants