support for nvidia-docker GPU container sandboxing #14
Comments
@thomasjungblut If you are using the latest Device Plugin + CRI-based Kubernetes GPU support (e.g. 1.10), nvidia-docker should not be a dependency, so docker + gVisor or even cri-containerd + gVisor would be the solution. Though it seems the current gVisor sandbox does not work well with GPU devices (correct me if I'm wrong).
We don't expose access to GPUs at the moment. It's an open problem for us, too.
We have an OCI prestart hook here: https://github.com/NVIDIA/nvidia-container-runtime/tree/master/hook
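For context, an OCI runtime wires such a hook in through the `hooks` section of the container's `config.json`. A minimal sketch of that shape (the hook binary path and arguments below are illustrative assumptions, not taken from this thread):

```json
{
  "hooks": {
    "prestart": [
      {
        "path": "/usr/bin/nvidia-container-runtime-hook",
        "args": ["nvidia-container-runtime-hook", "prestart"]
      }
    ]
  }
}
```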
Supporting GPUs would be a really nice feature.
Any updates? Thanks!
Yes, support for GPUs would be really nice! The work here involves exposing a pass-through device directly from the host, as there are no device drivers in gVisor; the challenge is how to make that access secure. At the moment, though, we have a few other things to work on, and GPU support is not on our short list (yet).
Hi, I see this is tagged with a priority label now, and I am wondering whether GPU support is on the roadmap. Thanks!
* Update containerd to 1.2.2

  Signed-off-by: Lantao Liu <lantaol@google.com>

* Port containerd/containerd#2803.

  Signed-off-by: Lantao Liu <lantaol@google.com>
/cc @zvonkok
Found this project recently and thought it might be helpful to reference: https://docs.vaccel.org/
Updates #14 PiperOrigin-RevId: 529411547
Updates #14 PiperOrigin-RevId: 529803365
Very few ioctls are initially implemented. Updates #14 PiperOrigin-RevId: 529511917
Currently, version 525.60.13 of the open-source driver is required; each driver version needs to be individually qualified since the kernel driver's ABI is unstable. In conjunction with cl/529511919, on T4, A100, or L4 GPUs:

```
$ sudo docker run --gpus all --runtime=runsc nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
$ sudo docker run --gpus all --runtime=runsc -it nvcr.io/nvidia/pytorch:23.04-py3
...
root@ca01b7709883:/workspace# cd examples/upstream/word_language_model/  # see https://github.com/pytorch/examples/tree/main/word_language_model
root@ca01b7709883:/workspace/examples/upstream/word_language_model# python main.py --cuda --epochs 6 --model Transformer --lr 5
| epoch 1 | 200/ 2983 batches | lr 5.00 | ms/batch 10.52 | loss 7.60 | ppl 2003.10
| epoch 1 | 400/ 2983 batches | lr 5.00 | ms/batch 5.66 | loss 6.80 | ppl 895.15
| epoch 1 | 600/ 2983 batches | lr 5.00 | ms/batch 5.66 | loss 6.50 | ppl 664.17
| epoch 1 | 800/ 2983 batches | lr 5.00 | ms/batch 5.66 | loss 6.36 | ppl 576.66
| epoch 1 | 1000/ 2983 batches | lr 5.00 | ms/batch 5.61 | loss 6.26 | ppl 522.67
| epoch 1 | 1200/ 2983 batches | lr 5.00 | ms/batch 5.66 | loss 6.22 | ppl 504.51
| epoch 1 | 1400/ 2983 batches | lr 5.00 | ms/batch 5.65 | loss 6.15 | ppl 466.58
| epoch 1 | 1600/ 2983 batches | lr 5.00 | ms/batch 5.65 | loss 6.15 | ppl 470.48
| epoch 1 | 1800/ 2983 batches | lr 5.00 | ms/batch 5.68 | loss 6.03 | ppl 415.41
| epoch 1 | 2000/ 2983 batches | lr 5.00 | ms/batch 5.72 | loss 6.02 | ppl 412.43
| epoch 1 | 2200/ 2983 batches | lr 5.00 | ms/batch 5.93 | loss 5.93 | ppl 374.53
| epoch 1 | 2400/ 2983 batches | lr 5.00 | ms/batch 5.80 | loss 5.93 | ppl 377.23
| epoch 1 | 2600/ 2983 batches | lr 5.00 | ms/batch 5.74 | loss 5.93 | ppl 375.84
| epoch 1 | 2800/ 2983 batches | lr 5.00 | ms/batch 5.65 | loss 5.84 | ppl 343.92
-----------------------------------------------------------------------------------------
| end of epoch 1 | time: 19.08s | valid loss 5.75 | valid ppl 313.70
-----------------------------------------------------------------------------------------
| epoch 2 | 200/ 2983 batches | lr 5.00 | ms/batch 5.61 | loss 5.80 | ppl 329.43
| epoch 2 | 400/ 2983 batches | lr 5.00 | ms/batch 5.67 | loss 5.77 | ppl 319.79
| epoch 2 | 600/ 2983 batches | lr 5.00 | ms/batch 5.62 | loss 5.62 | ppl 276.16
| epoch 2 | 800/ 2983 batches | lr 5.00 | ms/batch 5.72 | loss 5.63 | ppl 277.32
| epoch 2 | 1000/ 2983 batches | lr 5.00 | ms/batch 5.68 | loss 5.60 | ppl 270.96
| epoch 2 | 1200/ 2983 batches | lr 5.00 | ms/batch 5.68 | loss 5.61 | ppl 273.71
| epoch 2 | 1400/ 2983 batches | lr 5.00 | ms/batch 5.64 | loss 5.62 | ppl 275.38
| epoch 2 | 1600/ 2983 batches | lr 5.00 | ms/batch 5.70 | loss 5.66 | ppl 286.58
| epoch 2 | 1800/ 2983 batches | lr 5.00 | ms/batch 5.74 | loss 5.54 | ppl 255.62
| epoch 2 | 2000/ 2983 batches | lr 5.00 | ms/batch 5.66 | loss 5.58 | ppl 264.36
| epoch 2 | 2200/ 2983 batches | lr 5.00 | ms/batch 5.65 | loss 5.48 | ppl 240.27
| epoch 2 | 2400/ 2983 batches | lr 5.00 | ms/batch 5.64 | loss 5.52 | ppl 248.69
| epoch 2 | 2600/ 2983 batches | lr 5.00 | ms/batch 5.62 | loss 5.53 | ppl 251.46
| epoch 2 | 2800/ 2983 batches | lr 5.00 | ms/batch 5.78 | loss 5.45 | ppl 233.75
-----------------------------------------------------------------------------------------
| end of epoch 2 | time: 18.00s | valid loss 5.53 | valid ppl 252.16
-----------------------------------------------------------------------------------------
| epoch 3 | 200/ 2983 batches | lr 5.00 | ms/batch 5.72 | loss 5.46 | ppl 235.25
| epoch 3 | 400/ 2983 batches | lr 5.00 | ms/batch 5.69 | loss 5.46 | ppl 234.59
| epoch 3 | 600/ 2983 batches | lr 5.00 | ms/batch 5.68 | loss 5.29 | ppl 197.90
| epoch 3 | 800/ 2983 batches | lr 5.00 | ms/batch 5.68 | loss 5.32 | ppl 204.71
| epoch 3 | 1000/ 2983 batches | lr 5.00 | ms/batch 5.66 | loss 5.31 | ppl 201.70
| epoch 3 | 1200/ 2983 batches | lr 5.00 | ms/batch 5.70 | loss 5.33 | ppl 205.88
| epoch 3 | 1400/ 2983 batches | lr 5.00 | ms/batch 5.59 | loss 5.35 | ppl 211.48
| epoch 3 | 1600/ 2983 batches | lr 5.00 | ms/batch 5.68 | loss 5.40 | ppl 220.79
| epoch 3 | 1800/ 2983 batches | lr 5.00 | ms/batch 6.03 | loss 5.29 | ppl 198.28
| epoch 3 | 2000/ 2983 batches | lr 5.00 | ms/batch 5.63 | loss 5.33 | ppl 206.45
| epoch 3 | 2200/ 2983 batches | lr 5.00 | ms/batch 5.62 | loss 5.23 | ppl 186.28
| epoch 3 | 2400/ 2983 batches | lr 5.00 | ms/batch 5.77 | loss 5.27 | ppl 194.13
| epoch 3 | 2600/ 2983 batches | lr 5.00 | ms/batch 5.62 | loss 5.29 | ppl 199.08
| epoch 3 | 2800/ 2983 batches | lr 5.00 | ms/batch 5.75 | loss 5.22 | ppl 184.77
-----------------------------------------------------------------------------------------
| end of epoch 3 | time: 18.10s | valid loss 5.45 | valid ppl 232.50
-----------------------------------------------------------------------------------------
| epoch 4 | 200/ 2983 batches | lr 5.00 | ms/batch 5.71 | loss 5.24 | ppl 189.07
| epoch 4 | 400/ 2983 batches | lr 5.00 | ms/batch 5.65 | loss 5.25 | ppl 190.61
| epoch 4 | 600/ 2983 batches | lr 5.00 | ms/batch 5.67 | loss 5.07 | ppl 159.83
| epoch 4 | 800/ 2983 batches | lr 5.00 | ms/batch 5.62 | loss 5.13 | ppl 168.20
| epoch 4 | 1000/ 2983 batches | lr 5.00 | ms/batch 5.64 | loss 5.12 | ppl 166.87
| epoch 4 | 1200/ 2983 batches | lr 5.00 | ms/batch 5.61 | loss 5.13 | ppl 169.07
| epoch 4 | 1400/ 2983 batches | lr 5.00 | ms/batch 5.60 | loss 5.17 | ppl 175.87
| epoch 4 | 1600/ 2983 batches | lr 5.00 | ms/batch 5.70 | loss 5.22 | ppl 184.63
| epoch 4 | 1800/ 2983 batches | lr 5.00 | ms/batch 5.69 | loss 5.12 | ppl 166.77
| epoch 4 | 2000/ 2983 batches | lr 5.00 | ms/batch 5.65 | loss 5.16 | ppl 173.80
| epoch 4 | 2200/ 2983 batches | lr 5.00 | ms/batch 5.71 | loss 5.05 | ppl 155.82
| epoch 4 | 2400/ 2983 batches | lr 5.00 | ms/batch 5.76 | loss 5.10 | ppl 163.49
| epoch 4 | 2600/ 2983 batches | lr 5.00 | ms/batch 5.71 | loss 5.12 | ppl 167.32
| epoch 4 | 2800/ 2983 batches | lr 5.00 | ms/batch 5.67 | loss 5.05 | ppl 155.76
-----------------------------------------------------------------------------------------
| end of epoch 4 | time: 18.03s | valid loss 5.42 | valid ppl 225.19
-----------------------------------------------------------------------------------------
| epoch 5 | 200/ 2983 batches | lr 5.00 | ms/batch 5.83 | loss 5.08 | ppl 160.77
| epoch 5 | 400/ 2983 batches | lr 5.00 | ms/batch 5.70 | loss 5.09 | ppl 163.02
| epoch 5 | 600/ 2983 batches | lr 5.00 | ms/batch 5.60 | loss 4.92 | ppl 137.13
| epoch 5 | 800/ 2983 batches | lr 5.00 | ms/batch 5.58 | loss 4.97 | ppl 143.72
| epoch 5 | 1000/ 2983 batches | lr 5.00 | ms/batch 5.64 | loss 4.96 | ppl 142.78
| epoch 5 | 1200/ 2983 batches | lr 5.00 | ms/batch 5.76 | loss 4.98 | ppl 146.04
| epoch 5 | 1400/ 2983 batches | lr 5.00 | ms/batch 5.67 | loss 5.03 | ppl 153.23
| epoch 5 | 1600/ 2983 batches | lr 5.00 | ms/batch 5.67 | loss 5.08 | ppl 160.29
| epoch 5 | 1800/ 2983 batches | lr 5.00 | ms/batch 5.67 | loss 4.98 | ppl 145.06
| epoch 5 | 2000/ 2983 batches | lr 5.00 | ms/batch 5.66 | loss 5.02 | ppl 151.17
| epoch 5 | 2200/ 2983 batches | lr 5.00 | ms/batch 5.66 | loss 4.90 | ppl 134.86
| epoch 5 | 2400/ 2983 batches | lr 5.00 | ms/batch 5.61 | loss 4.96 | ppl 142.85
| epoch 5 | 2600/ 2983 batches | lr 5.00 | ms/batch 5.66 | loss 4.98 | ppl 145.94
| epoch 5 | 2800/ 2983 batches | lr 5.00 | ms/batch 5.64 | loss 4.92 | ppl 136.60
-----------------------------------------------------------------------------------------
| end of epoch 5 | time: 17.99s | valid loss 5.39 | valid ppl 218.33
-----------------------------------------------------------------------------------------
| epoch 6 | 200/ 2983 batches | lr 5.00 | ms/batch 5.60 | loss 4.95 | ppl 140.86
| epoch 6 | 400/ 2983 batches | lr 5.00 | ms/batch 5.64 | loss 4.97 | ppl 143.35
| epoch 6 | 600/ 2983 batches | lr 5.00 | ms/batch 5.64 | loss 4.79 | ppl 120.55
| epoch 6 | 800/ 2983 batches | lr 5.00 | ms/batch 5.65 | loss 4.85 | ppl 127.48
| epoch 6 | 1000/ 2983 batches | lr 5.00 | ms/batch 5.64 | loss 4.84 | ppl 126.87
| epoch 6 | 1200/ 2983 batches | lr 5.00 | ms/batch 5.60 | loss 4.86 | ppl 129.41
| epoch 6 | 1400/ 2983 batches | lr 5.00 | ms/batch 5.66 | loss 4.91 | ppl 135.84
| epoch 6 | 1600/ 2983 batches | lr 5.00 | ms/batch 5.82 | loss 4.96 | ppl 143.08
| epoch 6 | 1800/ 2983 batches | lr 5.00 | ms/batch 5.68 | loss 4.86 | ppl 129.64
| epoch 6 | 2000/ 2983 batches | lr 5.00 | ms/batch 5.57 | loss 4.91 | ppl 134.98
| epoch 6 | 2200/ 2983 batches | lr 5.00 | ms/batch 5.80 | loss 4.79 | ppl 120.01
| epoch 6 | 2400/ 2983 batches | lr 5.00 | ms/batch 5.89 | loss 4.84 | ppl 126.87
| epoch 6 | 2600/ 2983 batches | lr 5.00 | ms/batch 5.79 | loss 4.87 | ppl 130.53
| epoch 6 | 2800/ 2983 batches | lr 5.00 | ms/batch 5.62 | loss 4.81 | ppl 122.77
-----------------------------------------------------------------------------------------
| end of epoch 6 | time: 18.09s | valid loss 5.37 | valid ppl 214.45
-----------------------------------------------------------------------------------------
| End of training | test loss 5.28 | test ppl 195.78
root@ca01b7709883:/workspace/examples/upstream/word_language_model# python generate.py --cuda
| Generated 0/1000 words
| Generated 100/1000 words
| Generated 200/1000 words
| Generated 300/1000 words
| Generated 400/1000 words
| Generated 500/1000 words
| Generated 600/1000 words
| Generated 700/1000 words
| Generated 800/1000 words
| Generated 900/1000 words
```

Updates #14

PiperOrigin-RevId: 534515559
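Since each driver version has to be qualified individually, it can help to confirm the host driver before trying this under runsc. A hedged one-liner (assumes `nvidia-smi` is installed on the host):

```bash
# Print the host's NVIDIA kernel driver version; at this point nvproxy
# expects exactly version 525.60.13 of the open-source driver.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```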
With this change, we can now run simple CUDA applications on H100 GPUs.

```
$ docker run --runtime=runsc --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Note that this was tested with driver version 525.60.13.

Updates #14

PiperOrigin-RevId: 534607359
The `--nvproxy` flag allows container GPU usage to be specified via device nodes and mounts provided in the runtime spec, as when using Kubernetes with GKE's Nvidia GPU device plugin (https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu). The `--nvproxy-docker` flag additionally allows container GPU usage to be specified via the `NVIDIA_VISIBLE_DEVICES` container environment variable, as when using `docker --gpus`. This does not require the Nvidia Container Toolkit (or the Nvidia Container Runtime [Hook], which are part of the Toolkit), but does require libnvidia-container, which is typically installed as a dependency of the Nvidia Container Toolkit.

Updates #14

PiperOrigin-RevId: 535002602
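As a sketch of how these flags are typically passed to runsc under Docker, the runtime entry in `/etc/docker/daemon.json` can carry them as `runtimeArgs` (the same shape as the daemon config used in the reproduction steps quoted below; the runsc path is the usual install location, not a requirement):

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--nvproxy", "--nvproxy-docker"]
    }
  }
}
```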
Distributed training isn't working with PyTorch on certain A100 nodes. Adds the missing ioctl `UVM_UNMAP_EXTERNAL`, allowing certain NCCL operations to succeed when using [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html) and fixing distributed training.

## Reproduction

This affects numerous A100 40GB and 80GB instances in our fleet. This reproduction requires 4 A100 GPUs, either 40GB or 80GB.

- **NVIDIA Driver Version**: 550.54.15
- **CUDA Version**: 12.4
- **NVIDIA device**: NVIDIA A100 80GB PCIe

### Steps

1. **Install gvisor**

```bash
URL="https://storage.googleapis.com/gvisor/releases/master/latest/${ARCH}"
wget -nc "${URL}/runsc" "${URL}/runsc.sha512"
chmod +x runsc
sudo cp runsc /usr/local/bin/runsc
sudo /usr/local/bin/runsc install
sudo systemctl reload docker
```

2. **Add GPU enabling gvisor options**

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    },
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
    }
  }
}
```

Reload configs with `sudo systemctl reload docker`.

3. **Run reproduction NCCL test**

This test creates one main process and N peer processes. Each peer process sends a torch `Tensor` to the main process using NCCL.

```Dockerfile
# Dockerfile
FROM python:3.9.15-slim-bullseye

RUN pip install torch numpy

COPY <<EOF repro.py
import argparse
import datetime
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size, timeout=datetime.timedelta(seconds=600))
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def send_tensor(rank, world_size):
    try:
        setup(rank, world_size)

        # rank receiving all tensors
        target_rank = world_size - 1

        dist.barrier()

        tensor = torch.ones(5).cuda(rank)
        if rank < target_rank:
            print(f"[RANK {rank}] sending tensor: {tensor}")
            dist.send(tensor=tensor, dst=target_rank)
        elif rank == target_rank:
            for other_rank in range(target_rank):
                tensor = torch.zeros(5).cuda(target_rank)
                dist.recv(tensor=tensor, src=other_rank)
                print(f"[RANK {target_rank}] received tensor from rank={other_rank}: {tensor}")
            print("PASS: NCCL working.")
    except Exception as e:
        print(f"[RANK {rank}] error in send_tensor: {e}")
        raise
    finally:
        cleanup()

def main(world_size: int = 2):
    mp.spawn(send_tensor, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run torch-based NCCL tests")
    parser.add_argument("world_size", type=int, help="number of GPUs to run test on")
    args = parser.parse_args()
    if args.world_size < 2:
        raise RuntimeError(f"world_size needs to be larger than 1 {args.world_size}")
    main(args.world_size)
EOF

ENTRYPOINT ["python", "repro.py", "4"]
```

Build image with:

```
docker build -f Dockerfile .
```

Then run it with:

```
sudo docker run -it --shm-size=2.00gb --runtime=runsc --gpus='"device=GPU-742ea7fc-dd4f-612c-e860-499bf200a815,GPU-94a801d8-7713-acf6-337d-338b7cfdf19e,GPU-0d19cef2-10ce-e445-a0be-3d330e36c1fd,GPU-ac5046fb-020c-93e8-2784-f44aedbc5bbd"' 040a44863fb1
```

#### Failure (truncated)

```
...
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7edda14cf897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5b3a23e (0x7edd8d73a23e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7edd8d734c87 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7edd8d734f82 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7edd8d735fd1 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7edd54da9189 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7edd54db0610 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7edd54dcf978 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5adc309 (0x7edd8d6dc309 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x5ae6f10 (0x7edd8d6e6f10 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5ae6fa5 (0x7edd8d6e6fa5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x5124446 (0x7edd8cd24446 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x1acf4b8 (0x7edd896cf4b8 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x5aee004 (0x7edd8d6ee004 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x5af36b5 (0x7edd8d6f36b5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0xd2fe8e (0x7edda032fe8e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x47f074 (0x7edd9fa7f074 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #35: <unknown function> + 0x29d90 (0x7edda2029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #36: __libc_start_main + 0x80 (0x7edda2029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #37: <unknown function> + 0x108e (0x55f950b0c08e in /usr/local/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.
...
```

### Fix

gvisor debug logs show:

```
W0702 20:36:17.577055 445833 uvm.go:148] [ 22: 84] nvproxy: unknown uvm ioctl 66 = 0x42
```

I've implemented that ioctl in this PR. This is the output after the fix.

```
[RANK 2] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:2')
[RANK 0] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:0')
[RANK 1] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:1')
[RANK 3] received tensor from rank=0: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=1: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=2: tensor([1., 1., 1., 1., 1.], device='cuda:3')
PASS: NCCL working.
```

FUTURE_COPYBARA_INTEGRATE_REVIEW=#10610 from luiscape:master ee88734

PiperOrigin-RevId: 649146570
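For anyone hitting similar NCCL failures, the runsc debug logs enabled above (`-debug-log=/tmp/runsc/`) are the quickest way to spot ioctls that nvproxy does not implement yet. A small sketch, assuming the same log directory as in the repro:

```bash
# List nvproxy messages about unimplemented ioctls in the runsc debug logs
# (the directory matches the -debug-log flag used in the daemon config above).
grep -rh "nvproxy: unknown" /tmp/runsc/
```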
In order to expose GPUs in Kubernetes, you have to install nvidia-docker as an additional container runtime. A lot of people would surely love to run sandboxed containers with GPU support, though.
Do you guys see an easy way to layer one over the other, maybe?