{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":131212638,"defaultBranch":"master","name":"gvisor","ownerLogin":"google","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2018-04-26T21:28:49.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/1342004?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1720322544.0","currentOid":""},"activityList":{"items":[{"before":null,"after":"74d96af5a220d0da86c35ff49f27d5553ca8a441","ref":"refs/heads/test/cl649924384","pushedAt":"2024-07-07T03:22:24.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"copybara-service[bot]","name":null,"path":"/apps/copybara-service","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/44061?s=80&v=4"},"commit":{"message":"Reword \"Supported GPUs\" documentation\n\nThis notes which GPU architecture each of the supported GPU types belong to, and some other consumer-oriented GPUs that also belong to these architectures. If a workload is broken on them, it is likely also broken on a supported GPU, and therefore is a valid bug report to improve compatibility of supported GPU types in gVisor.\n\nIt also mentions the existence of the `--nvproxy-driver-version` flag and the `ioctl_sniffer` tool.\n\nFixes #10624\n\nFUTURE_COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/10626 from EtiennePerot:gpu-support-note f9649574959134f137aa35e5320357b437327ab1\nPiperOrigin-RevId: 649924384","shortMessageHtmlLink":"Reword \"Supported GPUs\" documentation"}},{"before":"222258a585465cdc0072dc44e676d8b5e48304b2","after":null,"ref":"refs/heads/test/cl648255905","pushedAt":"2024-07-06T07:50:00.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"copybara-service[bot]","name":null,"path":"/apps/copybara-service","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/44061?s=80&v=4"}},{"before":"8820fde4eaa4dcd75feab248a73c93f57fcf9af3","after":"222258a585465cdc0072dc44e676d8b5e48304b2","ref":"refs/heads/master","pushedAt":"2024-07-06T07:49:58.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"copybara-service[bot]","name":null,"path":"/apps/copybara-service","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/44061?s=80&v=4"},"commit":{"message":"Support RTM_SETLINK in gVisor.\n\nRTM_NEWLINK is the preferred way to change a link's configs. RTM_SETLINK is\nneeded by setting up Docker in gVisor.\n\nPiperOrigin-RevId: 649789500","shortMessageHtmlLink":"Support RTM_SETLINK in gVisor."}},{"before":"981324854676fc760e770c4ce8574b9a1e626196","after":"222258a585465cdc0072dc44e676d8b5e48304b2","ref":"refs/heads/test/cl648255905","pushedAt":"2024-07-06T07:49:57.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"copybara-service[bot]","name":null,"path":"/apps/copybara-service","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/44061?s=80&v=4"},"commit":{"message":"Support RTM_SETLINK in gVisor.\n\nRTM_NEWLINK is the preferred way to change a link's configs. RTM_SETLINK is\nneeded by setting up Docker in gVisor.\n\nPiperOrigin-RevId: 649789500","shortMessageHtmlLink":"Support RTM_SETLINK in gVisor."}},{"before":"7217354a2ea8327b334f2c55b00549315f6ee109","after":"981324854676fc760e770c4ce8574b9a1e626196","ref":"refs/heads/test/cl648255905","pushedAt":"2024-07-06T07:19:35.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"copybara-service[bot]","name":null,"path":"/apps/copybara-service","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/44061?s=80&v=4"},"commit":{"message":"Support RTM_SETLINK in gVisor.\n\nRTM_NEWLINK is the preferred way to change a link's configs. RTM_SETLINK is\nneeded by setting up Docker in gVisor.\n\nPiperOrigin-RevId: 648255905","shortMessageHtmlLink":"Support RTM_SETLINK in gVisor."}},{"before":null,"after":"4b17be32946988b943e826ea7982b54535f68e31","ref":"refs/heads/test/cl649679106","pushedAt":"2024-07-05T18:44:20.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"copybara-service[bot]","name":null,"path":"/apps/copybara-service","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/44061?s=80&v=4"},"commit":{"message":"Add restore support to runsc shim.\n\nPiperOrigin-RevId: 649679106","shortMessageHtmlLink":"Add restore support to runsc shim."}},{"before":"8820fde4eaa4dcd75feab248a73c93f57fcf9af3","after":null,"ref":"refs/heads/test/cl648836201","pushedAt":"2024-07-05T15:26:11.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"copybara-service[bot]","name":null,"path":"/apps/copybara-service","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/44061?s=80&v=4"}},{"before":"388ad3c640c22f105f2dd6ff36da1c676ef95aa9","after":"8820fde4eaa4dcd75feab248a73c93f57fcf9af3","ref":"refs/heads/master","pushedAt":"2024-07-05T15:26:09.000Z","pushType":"push","commitsCount":2,"pusher":{"login":"copybara-service[bot]","name":null,"path":"/apps/copybara-service","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/44061?s=80&v=4"},"commit":{"message":"Merge pull request #10603 from thundergolfer:jonathon/mod-3218-uvm-peer-access\n\nPiperOrigin-RevId: 649656506","shortMessageHtmlLink":"Merge pull request #10603 from thundergolfer:jonathon/mod-3218-uvm-pe…"}},{"before":"f65101581ba796f7252cb6019e681548eea5f200","after":"8820fde4eaa4dcd75feab248a73c93f57fcf9af3","ref":"refs/heads/test/cl648836201","pushedAt":"2024-07-05T15:26:08.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"copybara-service[bot]","name":null,"path":"/apps/copybara-service","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/44061?s=80&v=4"},"commit":{"message":"Merge pull request #10603 from thundergolfer:jonathon/mod-3218-uvm-peer-access\n\nPiperOrigin-RevId: 649656506","shortMessageHtmlLink":"Merge pull request #10603 from thundergolfer:jonathon/mod-3218-uvm-pe…"}},{"before":"cf44a6b2e35ad8481e7ca1c7c939490000609b84","after":"f65101581ba796f7252cb6019e681548eea5f200","ref":"refs/heads/test/cl648836201","pushedAt":"2024-07-05T14:47:04.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"copybara-service[bot]","name":null,"path":"/apps/copybara-service","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/44061?s=80&v=4"},"commit":{"message":"nvproxy: add missing UVM peer access ioctls\n\nAdding ioctls to provide only a partial fix to a simple Huggingface `accelerate` program that does not work on H100s.\n\n---\n\n### Reproduction\n\n**Machine info**\n\n* **NVIDIA driver:** `Driver Version: 550.54.15 CUDA Version: 12.4`\n* **NVIDIA device:** NVIDIA H100 PCIe\n* **uname -a:** `Linux worker-prod-lat-h100-dal-0-d22375e2688d4e5caa12deca256c79ec 5.15.0-113-generic #123-Ubuntu SMP Mon Jun 10 08:16:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux`\n\n**Steps**\n\n**1. Install gvisor**\n\n**2. Add GPU enabling gvisor options**\n\nIn `/etc/docker/daemon.json`:\n\n```json\n{\n \"runtimes\": {\n \"nvidia\": {\n \"path\": \"nvidia-container-runtime\",\n \"runtimeArgs\": []\n },\n \"runsc\": {\n \"path\": \"/usr/local/bin/runsc\",\n\t \"runtimeArgs\": [\"--nvproxy\", \"--nvproxy-docker\", \"-debug-log=/tmp/runsc/\", \"-debug\", \"-strace\"]\n\n }\n }\n}\n```\n\nthen run `sudo systemctl reload docker`\n\n**3. Run the reproducing `accelerate` application**\n\n```Dockerfile\n# Dockerfile\nFROM python:3.9.15-slim-bullseye\n\nRUN pip install accelerate\nCOPY <\n------------------------------------------------------------\nRoot Cause (first observed failure):\n[0]:\n time : 2024-07-02_18:25:56\n host : 07de9e138901\n rank : 1 (local_rank: 1)\n exitcode : 1 (pid: 68)\n error_file: /tmp/torchelastic_d5gyii1n/none_swdujvh0/attempt_0/1/error.json\n traceback : Traceback (most recent call last):\n File \"/usr/local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 347, in wrapper\n return f(*args, **kwargs)\n File \"//repro_notebook_launch.py\", line 6, in train_loop\n with accelerator.main_process_first():\n File \"/usr/local/lib/python3.9/contextlib.py\", line 119, in __enter__\n return next(self.gen)\n File \"/usr/local/lib/python3.9/site-packages/accelerate/accelerator.py\", line 882, in main_process_first\n with self.state.main_process_first():\n File \"/usr/local/lib/python3.9/contextlib.py\", line 119, in __enter__\n return next(self.gen)\n File \"/usr/local/lib/python3.9/site-packages/accelerate/state.py\", line 1053, in main_process_first\n with PartialState().main_process_first():\n File \"/usr/local/lib/python3.9/contextlib.py\", line 119, in __enter__\n return next(self.gen)\n File \"/usr/local/lib/python3.9/site-packages/accelerate/state.py\", line 499, in main_process_first\n yield from self._goes_first(self.is_main_process)\n File \"/usr/local/lib/python3.9/site-packages/accelerate/state.py\", line 384, in _goes_first\n self.wait_for_everyone()\n File \"/usr/local/lib/python3.9/site-packages/accelerate/state.py\", line 378, in wait_for_everyone\n torch.distributed.barrier()\n File \"/usr/local/lib/python3.9/site-packages/torch/distributed/c10d_logger.py\", line 75, in wrapper\n return func(*args, **kwargs)\n File \"/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py\", line 3683, in barrier\n work = default_pg.barrier(opts=opts)\n torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5\n ncclUnhandledCudaError: Call to CUDA function failed.\n Last error:\n Cuda failure 1 'invalid argument'\n```\n\n---\n\n## Fix (partial)\n\nThis change set provides only a _**partial fix**_. By adding the unimplemented UVM ioctls we get at least the main process to print `hello 0` but the program still crashes with no logging and exit code 159.\n\n```\n$ sudo docker run -it --runtime=runsc2 --gpus='\"device=GPU-f24cdb48-8af0-33a3-bff1-83c72aa5e460,GPU-d5140682-a598-a7f2-23d4-29dcb83bb30f,GPU-556fcf94-b6b7-550d-cfc3-60b9bfbe112e\"' 8df8be98ce1ffa22baacf3910e2d22fceeaa003d44bbc384a2df9c971f1829a0\nLaunching training on 3 GPUs.\nDetected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.\nhello! 0\n$ echo $?\n159\n```\n\nI suspected that the issue may be due to the `fork` vs `spawn` multiprocessing strategy and saw tried the library's notebook launcher:\n\n```\nCOPY <