Mount NVIDIA device files and shared libraries directly rather than delegating this to nvidia-container-cli #11224

Open
EtiennePerot opened this issue Nov 26, 2024 · 0 comments
Labels
type: enhancement New feature or request

Comments

@EtiennePerot
Contributor

EtiennePerot commented Nov 26, 2024

Description

Currently, runsc's NVIDIA GPU support relies on nvidia-container-cli's configure subcommand to mount NVIDIA device files and shared libraries (*.so's) into the container's root filesystem during container setup. In runc containers, this runs as a "prestart hook" in the OCI spec. In gVisor, prestart hooks run at a time when the gVisor sandbox filesystem is already isolated from the host using pivot_root(2), so runsc has logic to unconditionally skip the NVIDIA prestart hook and to instead run the nvidia-container-cli configure subcommand earlier in the sandbox startup sequence, within the context of the Gofer process.
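
To illustrate the "skip the NVIDIA prestart hook" step, here is a minimal sketch in Go. It is not runsc's actual code: the `Hook` type is a simplified stand-in for an OCI hook entry, and matching on the binary name `nvidia-container-runtime-hook` is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Hook is a simplified, hypothetical stand-in for an OCI runtime-spec hook entry.
type Hook struct {
	Path string
	Args []string
}

// nvidiaHookName is an assumed hook binary name, used here for illustration only.
const nvidiaHookName = "nvidia-container-runtime-hook"

// isNVIDIAHook reports whether the hook's binary name matches the assumed name.
func isNVIDIAHook(h Hook) bool {
	return filepath.Base(h.Path) == nvidiaHookName
}

// splitPrestartHooks drops any NVIDIA hook from the list and reports whether
// one was seen, so the caller can do the equivalent setup itself.
func splitPrestartHooks(hooks []Hook) (kept []Hook, sawNVIDIA bool) {
	for _, h := range hooks {
		if isNVIDIAHook(h) {
			sawNVIDIA = true
			continue
		}
		kept = append(kept, h)
	}
	return kept, sawNVIDIA
}

func main() {
	hooks := []Hook{
		{Path: "/usr/bin/nvidia-container-runtime-hook", Args: []string{"nvidia-container-runtime-hook", "prestart"}},
		{Path: "/usr/local/bin/other-hook"},
	}
	kept, saw := splitPrestartHooks(hooks)
	fmt.Println(len(kept), saw)
}
```

The boolean return value is the interesting part: it is the signal that GPU setup has to happen somewhere else in the startup sequence.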

runsc also has GKE-specific code that detects a GPU request for the container by scanning the list of devices to be mounted in the container, and manually mounts those device files if such a request is present. In this environment, shared libraries are already mounted by another GKE component that inserts a bind mount of shared libraries into the container spec, so runsc doesn't need specific code to mount those shared libraries itself.
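
The device scan described above might look roughly like the following Go sketch. The matching rule (any entry directly under `/dev` whose name starts with `nvidia`) and the function name `gpuDevicePaths` are assumptions for illustration; the real GKE-specific code may match differently.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// gpuDevicePaths returns the subset of device paths that look like NVIDIA
// device files, e.g. /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm.
// The prefix match is an assumed heuristic, not runsc's actual logic.
func gpuDevicePaths(devicePaths []string) []string {
	var gpus []string
	for _, p := range devicePaths {
		clean := filepath.Clean(p)
		if filepath.Dir(clean) == "/dev" && strings.HasPrefix(filepath.Base(clean), "nvidia") {
			gpus = append(gpus, clean)
		}
	}
	return gpus
}

func main() {
	devices := []string{"/dev/null", "/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"}
	gpus := gpuDevicePaths(devices)
	// A non-empty result is treated as "the container requested GPUs".
	fmt.Println(len(gpus) > 0, gpus)
}
```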

This issue tracks the removal of the first codepath. runsc should do all the work and not invoke nvidia-container-cli configure. This has the following advantages:

  • Faster container startup (no fork/exec required)
  • Fewer dependencies
  • No inheritance of security vulnerabilities in nvidia-container-cli

... at the cost of more brittleness when the behavior of nvidia-container-cli changes. But given that we already maintain a codepath that avoids nvidia-container-cli entirely, this doesn't seem like a large incremental cost. The main cost seems to be adding logic to mount the required shared libraries into the container's root filesystem.

Is this feature related to a specific bug?

Perhaps.

Do you have a specific solution in mind?

Remove the use of nvidia-container-cli configure in runsc. Replace it with manual mounting of device files and shared libraries.
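
As a rough sketch of the shared-library half of that work, the Go snippet below turns a list of host library paths into read-only bind mount entries. Everything here is hypothetical: the `Mount` type is a simplified stand-in for an OCI mount entry, `planLibraryMounts` is not an existing runsc function, and the host and container directories in `main` are example values only. Discovering which libraries to mount is out of scope for the sketch.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Mount is a simplified, hypothetical stand-in for an OCI runtime-spec mount entry.
type Mount struct {
	Source      string
	Destination string
	Type        string
	Options     []string
}

// planLibraryMounts maps each host library to a read-only bind mount placed
// under containerLibDir, keeping the library's file name.
func planLibraryMounts(hostLibs []string, containerLibDir string) []Mount {
	mounts := make([]Mount, 0, len(hostLibs))
	for _, lib := range hostLibs {
		mounts = append(mounts, Mount{
			Source:      lib,
			Destination: filepath.Join(containerLibDir, filepath.Base(lib)),
			Type:        "bind",
			Options:     []string{"bind", "ro"},
		})
	}
	return mounts
}

func main() {
	// Example paths only; real locations vary by distribution and driver install.
	hostLibs := []string{
		"/usr/lib/x86_64-linux-gnu/libcuda.so.1",
		"/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1",
	}
	for _, m := range planLibraryMounts(hostLibs, "/usr/lib64") {
		fmt.Printf("%s -> %s %v\n", m.Source, m.Destination, m.Options)
	}
}
```

Separating the planning step from the actual mount calls keeps the logic testable without privileges, which may matter given that the real mounting would happen inside the sandbox startup sequence.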

@EtiennePerot EtiennePerot added the type: enhancement New feature or request label Nov 26, 2024