Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include versioned so files in cuda_runtime #114

Merged
merged 1 commit into from
Jul 6, 2023

Conversation

garymm
Copy link
Contributor

@garymm garymm commented Jun 29, 2023

libcudart.so is a symlink to a versioned .so file, and the SONAME
specifies the versioned file, which means that at runtime the dynamic
linker will look for the versioned file, so we need to import the
versioned files.

Fixes: #113

@garymm
Copy link
Contributor Author

garymm commented Jun 29, 2023

I guess this might be an issue for the other .so files as well. LMK if you want me to do the same for all of them.

@garymm
Copy link
Contributor Author

garymm commented Jun 29, 2023

I'm guessing the checks failed due to a GitHub outage. Please retry them if you're able.

Comment on lines 42 to 54
# libcudart.so is a symlink to a versioned .so file, and the SONAME
# specifies the versioned file, so we need to import the versioned files.
cudart_so_paths = glob(["cuda/lib64/libcudart.so*"])

[cc_import(
name = paths.basename(p),
shared_library = p,
target_compatible_with = ["@platforms//os:linux"],
) for p in cudart_so_paths]

cc_library(
name = "cudart_so",
deps = [":%s" % paths.basename(p) for p in cudart_so_paths],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it shows the following:

> ldd bazel-bin/basic/main
        linux-vdso.so.1 (0x00007ffff4bfd000)
        libcudart.so.12 => /home/cloud/rules_cuda/examples/bazel-bin/basic/../_solib_k8/_U@local_Ucuda_S_S_Clibcudart.so.12___Ucuda_Slib64/libcudart.so.12 (0x00007f0066c00000)
        ...

It starts to use the full mangled path again. This dismissed the purpose to use cc_import, acts the same as cuda_library(srcs=[<lib_path>]).

This is not something reasonable path outside of bazel environment, better avoid this case totally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look.
Prior to this PR, I believe the way things are working is the following (please correct me if I'm wrong):

  • repositories.local_cuda() creates a bazel repository which is a symlink to some absolute path, default /usr/local/cuda.
  • A cuda_library depends on @local_cuda//:cuda_runtime.
  • In its sandbox, bazel creates a directory which only contains the declared dependencies and passes it to the linker with -L. In this case, that means it includes libcudart.so, but NOT libcudart.so.12 (replace 12 with whatever version you want) is being used. You can see this in the linker params with a line like: -Lbazel-out/k8-fastbuild/bin/_solib_k8/_U@local_Ucuda_S_S_Ccudart_Uso___Ucuda_Slib64, and
➜  ls bazel-out/k8-fastbuild/bin/_solib_k8/_U@local_Ucuda_S_S_Ccudart_Uso___Ucuda_Slib64
libcudart.so@
  • Similarly the linker specifies the rpath to include a directory inside runfiles, which also contains just libcdart.so: -rpath $ORIGIN/main.runfiles/rules_cuda_examples/_solib_k8/_U@local_Ucuda_S_S_Ccudart_Uso___Ucuda_Slib64.
  • The linker opens libcudart.so, which is a symlink (possibly to another symlink). The final non-link target file has SONAME set to a versioned name like libcudart.so.12. So the binary is linked against libcudart.so.12.
  • At runtime (including bazel test), the dynamic linker ldd will look in the runfiles for libcudart.so.12 and fail to find it. Then it will look in system search paths. On my machine it will look in /lib/x86_64-linux-gnu, amongst other paths.
  • If there is a libcudart.so.12 in any of the system paths, then it works. Else, it fails.

The situation that triggered the failure for me was:

  1. Multiple versions of CUDA installed, 11 and 12.
  2. /lib/x86_64_linux-gnu/libcudart.so is a symlink to /lib/x86_64_linux-gnu/libcudart.so.11.
  3. In my WORKSPACE.bazel I have: local_cuda(name = "local_cuda", toolkit_path = "/usr/local/cuda-12")

In this case the binary gets linked against libcudart.so.12 but the rpath does NOT contain /usr/local/cuda-12, and the system's default lib paths also don't contain it, so ldd fails to find it at runtime.

So I think this proposed change will not actually make things any worse as to your concern:

This is not something reasonable path outside of bazel environment

The only way this works at runtime before this PR is if ldd searched the system library search paths. It will continue to do that in the case that the libcudart.so.12 is missing from runfiles (I tested this by rm'ing the .so file from the runfiles and confirmed that ldd then finds the one that it used to find before this PR).

So I'm pretty sure this will not cause any regressions at all in things that currently work, but it will fix things that currently do not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, my bad, it seem I am blind at some point. xxxD

> patchelf --print-needed bazel-bin/basic/main
libcudart.so.12
libstdc++.so.6
libm.so.6
libc.so.6

They are correctly resolved.

I am wondering is it willing for you to design a macro get_versioned_so("cuda/lib64/libcudart.so", n=1) and it should resolve to "cuda/lib64/libcudart.so.12" and get_versioned_so("cuda/lib64/libcudart.so", n=2) and resolve to "cuda/lib64/libcudart.so.12.1" so that we can apply it to other libs.

(So that I don't need to do it myself in the future.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a macro and applied to tall of the libraries that appear to benefit from it. PTAL.

Prior to this PR, I believe the way things are working is the following:
* `repositories.local_cuda()` creates a bazel repository which is a symlink to some absolute path, default  `/usr/local/cuda`.
* A `cuda_library` depends on `@local_cuda//:cuda_runtime`.
* In its sandbox, bazel creates a directory which only contains the declared dependencies and passes it to the linker with `-L`. In this case, that means it includes `libcudart.so`, but NOT `libcudart.so.12` (replace 12 with whatever version you want) is being used. You can see this in the linker params with a line like: `-Lbazel-out/k8-fastbuild/bin/_solib_k8/_U@local_Ucuda_S_S_Ccudart_Uso___Ucuda_Slib64`, and

```sh
➜  ls bazel-out/k8-fastbuild/bin/_solib_k8/_U@local_Ucuda_S_S_Ccudart_Uso___Ucuda_Slib64
libcudart.so@
```
* Similarly the linker specifies the `rpath` to include a directory inside `runfiles`, which also contains just libcdart.so: `-rpath $ORIGIN/main.runfiles/rules_cuda_examples/_solib_k8/_U@local_Ucuda_S_S_Ccudart_Uso___Ucuda_Slib64`.
* The linker opens `libcudart.so`, which is a symlink (possibly to another symlink). The final non-link target file has SONAME set to a versioned name like `libcudart.so.12`. So the binary is linked against `libcudart.so.12`.
* At runtime (including `bazel test`), the dynamic linker `ldd` will look in the runfiles for `libcudart.so.12` and fail to find it. Then it will look in system search paths. On my machine it will look in `/lib/x86_64-linux-gnu`, amongst other paths.
* If there is a libcudart.so.12 in any of the system paths, then it works. Else, it fails.

The situation that triggered the failure for me was:

1. Multiple versions of CUDA installed, 11 and 12.
2. `/lib/x86_64_linux-gnu/libcudart.so` is a symlink to `/lib/x86_64_linux-gnu/libcudart.so.11`.
3. In my WORKSPACE.bazel I have: `local_cuda(name = "local_cuda", toolkit_path = "/usr/local/cuda-12")`

In this case the binary gets linked against `libcudart.so.12` but the `rpath` does NOT contain `/usr/local/cuda-12`, and the system's default lib paths also don't contain it, so ldd fails to find it at runtime.

So I think this proposed change will not actually make things any worse as to your concern:

> This is not something reasonable path outside of bazel environment

The only way this works at runtime before this PR is if `ldd` searched the system library search paths. It will continue to do that in the case that the `libcudart.so.12` is missing from runfiles (I tested this by `rm`'ing the .so file from the runfiles and confirmed that `ldd` then finds the one that it used to find before this PR).

So I'm pretty sure this will not cause any regressions at all in things that currently work, but it will fix things that currently do not.

Fixes: bazel-contrib#113
Copy link
Collaborator

@cloudhan cloudhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

@cloudhan
Copy link
Collaborator

cloudhan commented Jul 6, 2023

@jsharpe Any concern for you?

@jsharpe
Copy link
Member

jsharpe commented Jul 6, 2023

@jsharpe Any concern for you?

No this looks good to me - in fact with the hermetic toolchain setup this is also an issue; I work around this in our internal build by fixing it up in a rules_pkg setup.

@cloudhan cloudhan merged commit 22578d7 into bazel-contrib:main Jul 6, 2023
11 checks passed
@garymm garymm deleted the garymm/versioned-so branch July 6, 2023 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

libcudart.so not found on system with multiple CUDA versions
3 participants