Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bazel does not handle symlinks in system include paths for CUDA #10167

Open
Flamefire opened this issue Nov 5, 2019 · 7 comments
Open

Bazel does not handle symlinks in system include paths for CUDA #10167

Flamefire opened this issue Nov 5, 2019 · 7 comments
Labels
not stale Issues or PRs that are inactive but not considered stale P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Rules-CPP Issues for C++ rules type: bug

Comments

@Flamefire
Copy link
Contributor

Description of the problem / feature request:

Very similar to #7315:
If the compilers system includes are installed in a Path which is referenced by a symlink then the dependency check will fail.

Example:

  • Compiler includes in real path /software/haswell/GCCcore/8.2.0/include/c++/8.2.0/x86_64-pc-linux-gnu /8.2.0/include
  • Symlink exists so user sees /sw/installed/GCCcore/8.2.0/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/
  • Toolchain file (e.g. created by TensorFlow for CUDA) will contain the 2nd one (resolved path, /sw/installed)
  • Check will use unresolved path and check only the prefix in
    if (FileSystemUtils.startsWithAny(execPath, permittedSystemIncludePrefixes)) {
  • Error will be something like:
ERROR: /tmp/easybuild-tmp/eb-5QGVSJ/tmpMZolLj-bazel-build/external/fft2d/BUILD.bazel:27:1: undeclared inclusion(s) in rule '@fft2d//:fft2d':
this rule is missing dependency declarations for the following files included by 'external/fft2d/fft2d/fftsg2d.c':
  '/sw/installed/GCCcore/8.2.0/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/stddef.h'
  '/sw/installed/GCCcore/8.2.0/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/stdarg.h'
Target //tensorflow/tools/pip_package:build_pip_package failed to build

Feature requests: what underlying problem are you trying to solve with this feature?

Building TensorFlow on a HPC cluster. See related issue tensorflow/tensorflow#33975

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

See above: Setup the compiler with a symlink path and try to build tensorflow 2 with CUDA

What operating system are you running Bazel on?

RHEL 7.4

What's the output of bazel info release?

0.26.1 (@non-git)

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

With EasyBuild recipe

Have you found anything relevant by searching the web?

Possible solutions:

  1. patch TensorFlow to include non-resolved paths: https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/t/TensorFlow/TensorFlow-1.14.0_fix-cuda-build.patch
  2. Unify logic of https://github.com/tensorflow/tensorflow/blob/ce69fe54695361d88c3b73d18e0d647baeb50c49/third_party/gpus/cuda_configure.bzl#L280 with
    def get_escaped_cxx_inc_directories(repository_ctx, cc, lang_flag, additional_flags = []):
    as indicated at https://github.com/tensorflow/tensorflow/blob/ce69fe54695361d88c3b73d18e0d647baeb50c49/third_party/gpus/cuda_configure.bzl#L221 and handle it in bazel (resolved and non-resolved paths added)
  3. Check resolved dependencies against each other in https://github.com/bazelbuild/bazel/blob/c42f93f2285e0d8f1ea70ccf3168200db34235f5/src/main/java/com/google/devtools/build/lib/rules/cpp/HeaderDiscovery.java
@Flamefire
Copy link
Contributor Author

Update: Proposed solution 1 does not work: It relies on the compiler being where the includes are. This works for e.g.:

  • /sw/installed/GCCcore/8.3.0/lib/gcc/x86_64-pc-linux-gnu/8.3.0/include/stdarg.h
  • /sw/installed/GCCcore/8.3.0/bin/g++

But it fails when e.g. using ccache:

  • /sw/installed/GCCcore/8.3.0/lib/gcc/x86_64-pc-linux-gnu/8.3.0/include/stdarg.h
  • /sw/installed/GCCcore/8.3.0/bin/g++
  • /usr/bin/g++ -> /usr/bin/ccache g++

So the only real solution is to compare realpaths as suggested in 3

@scentini
Copy link
Contributor

cc @oquenchil

@scentini scentini added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Dec 11, 2019
@oquenchil oquenchil added P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) type: bug and removed P2 We'll consider working on this in future. (Assignee optional) labels Nov 19, 2020
@sgowroji sgowroji added the stale Issues or PRs that are stale (no activity for 30 days) label Feb 10, 2023
@sgowroji
Copy link
Member

Hi there! We're doing a clean up of old issues and will be closing this one. Please reopen if you’d like to discuss anything further. We’ll respond as soon as we have the bandwidth/resources to do so.

@oquenchil oquenchil closed this as not planned Won't fix, can't repro, duplicate, stale Feb 16, 2023
@Flamefire
Copy link
Contributor Author

This is still an issue in Bazel. However current TensorFlow has a patch that resolves symlinks to the compiler before passing it to Bazel which works but is only a workaround.

@sgowroji sgowroji added not stale Issues or PRs that are inactive but not considered stale and removed stale Issues or PRs that are stale (no activity for 30 days) labels Feb 16, 2023
@sgowroji sgowroji reopened this Feb 16, 2023
@vors
Copy link

vors commented Feb 24, 2023

Hello! We are hitting the same problem when trying to make nvcc+sccache work in https://github.com/pytorch/pytorch
Any workarounds are highly appreciated.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Feb 28, 2023
Fixes #79348

This change is mostly focused on enabling nvcc+sccache in the PyTorch CI.

Along the way we had to do couple tweaks:
1.  Split the rules_cc from the rules_cuda that embeeded them before. This is needed in order to apply a different patch to the rules_cc compare to the one that rules_cuda does by default. This is in turn needed because we need to workaround an nvcc behavior where it doesn't send `-iquote xxx` to the host compiler, but it does send `-isystem xxx`. So we workaround this problem with (ab)using `-isystem` instead. Without it we are getting errors like `xxx` is not found.

2. Workaround bug in bazel bazelbuild/bazel#10167 that prevents us from using a straightforward and honest `nvcc` sccache wrapper. Instead we generate ad-hock bazel specific nvcc wrapper that has internal knowledge of the relative bazel paths to local_cuda. This allows us to workaround the issue with CUDA symlinks. Without it we are getting `undeclared inclusion(s) in rule` all over the place for CUDA headers.

## Test plan

Green CI build https://github.com/pytorch/pytorch/actions/runs/4267147180/jobs/7428431740

Note that now it says "CUDA" in the sccache output

```
+ sccache --show-stats
Compile requests                    9784
Compile requests executed           6726
Cache hits                          6200
Cache hits (C/C++)                  6131
Cache hits (CUDA)                     69
Cache misses                         519
Cache misses (C/C++)                 201
Cache misses (CUDA)                  318
Cache timeouts                         0
Cache read errors                      0
Forced recaches                        0
Cache write errors                     0
Compilation failures                   0
Cache errors                           7
Cache errors (C/C++)                   7
Non-cacheable compilations             0
Non-cacheable calls                 2893
Non-compilation calls                165
Unsupported compiler calls             0
Average cache write                0.116 s
Average cache read miss           23.722 s
Average cache read hit             0.057 s
Failed distributed compilations        0
```
Pull Request resolved: #95528
Approved by: https://github.com/huydhn
cyyever pushed a commit to cyyever/pytorch_private that referenced this issue Mar 2, 2023
Fixes #79348

This change is mostly focused on enabling nvcc+sccache in the PyTorch CI.

Along the way we had to do couple tweaks:
1.  Split the rules_cc from the rules_cuda that embeeded them before. This is needed in order to apply a different patch to the rules_cc compare to the one that rules_cuda does by default. This is in turn needed because we need to workaround an nvcc behavior where it doesn't send `-iquote xxx` to the host compiler, but it does send `-isystem xxx`. So we workaround this problem with (ab)using `-isystem` instead. Without it we are getting errors like `xxx` is not found.

2. Workaround bug in bazel bazelbuild/bazel#10167 that prevents us from using a straightforward and honest `nvcc` sccache wrapper. Instead we generate ad-hock bazel specific nvcc wrapper that has internal knowledge of the relative bazel paths to local_cuda. This allows us to workaround the issue with CUDA symlinks. Without it we are getting `undeclared inclusion(s) in rule` all over the place for CUDA headers.

## Test plan

Green CI build https://github.com/pytorch/pytorch/actions/runs/4267147180/jobs/7428431740

Note that now it says "CUDA" in the sccache output

```
+ sccache --show-stats
Compile requests                    9784
Compile requests executed           6726
Cache hits                          6200
Cache hits (C/C++)                  6131
Cache hits (CUDA)                     69
Cache misses                         519
Cache misses (C/C++)                 201
Cache misses (CUDA)                  318
Cache timeouts                         0
Cache read errors                      0
Forced recaches                        0
Cache write errors                     0
Compilation failures                   0
Cache errors                           7
Cache errors (C/C++)                   7
Non-cacheable compilations             0
Non-cacheable calls                 2893
Non-compilation calls                165
Unsupported compiler calls             0
Average cache write                0.116 s
Average cache read miss           23.722 s
Average cache read hit             0.057 s
Failed distributed compilations        0
```
Pull Request resolved: pytorch/pytorch#95528
Approved by: https://github.com/huydhn
cyyever pushed a commit to cyyever/pytorch_private that referenced this issue Mar 5, 2023
Fixes #79348

This change is mostly focused on enabling nvcc+sccache in the PyTorch CI.

Along the way we had to do couple tweaks:
1.  Split the rules_cc from the rules_cuda that embeeded them before. This is needed in order to apply a different patch to the rules_cc compare to the one that rules_cuda does by default. This is in turn needed because we need to workaround an nvcc behavior where it doesn't send `-iquote xxx` to the host compiler, but it does send `-isystem xxx`. So we workaround this problem with (ab)using `-isystem` instead. Without it we are getting errors like `xxx` is not found.

2. Workaround bug in bazel bazelbuild/bazel#10167 that prevents us from using a straightforward and honest `nvcc` sccache wrapper. Instead we generate ad-hock bazel specific nvcc wrapper that has internal knowledge of the relative bazel paths to local_cuda. This allows us to workaround the issue with CUDA symlinks. Without it we are getting `undeclared inclusion(s) in rule` all over the place for CUDA headers.

## Test plan

Green CI build https://github.com/pytorch/pytorch/actions/runs/4267147180/jobs/7428431740

Note that now it says "CUDA" in the sccache output

```
+ sccache --show-stats
Compile requests                    9784
Compile requests executed           6726
Cache hits                          6200
Cache hits (C/C++)                  6131
Cache hits (CUDA)                     69
Cache misses                         519
Cache misses (C/C++)                 201
Cache misses (CUDA)                  318
Cache timeouts                         0
Cache read errors                      0
Forced recaches                        0
Cache write errors                     0
Compilation failures                   0
Cache errors                           7
Cache errors (C/C++)                   7
Non-cacheable compilations             0
Non-cacheable calls                 2893
Non-compilation calls                165
Unsupported compiler calls             0
Average cache write                0.116 s
Average cache read miss           23.722 s
Average cache read hit             0.057 s
Failed distributed compilations        0
```
Pull Request resolved: pytorch/pytorch#95528
Approved by: https://github.com/huydhn
cyyever pushed a commit to cyyever/pytorch_private that referenced this issue Mar 5, 2023
Fixes #79348

This change is mostly focused on enabling nvcc+sccache in the PyTorch CI.

Along the way we had to do couple tweaks:
1.  Split the rules_cc from the rules_cuda that embeeded them before. This is needed in order to apply a different patch to the rules_cc compare to the one that rules_cuda does by default. This is in turn needed because we need to workaround an nvcc behavior where it doesn't send `-iquote xxx` to the host compiler, but it does send `-isystem xxx`. So we workaround this problem with (ab)using `-isystem` instead. Without it we are getting errors like `xxx` is not found.

2. Workaround bug in bazel bazelbuild/bazel#10167 that prevents us from using a straightforward and honest `nvcc` sccache wrapper. Instead we generate ad-hock bazel specific nvcc wrapper that has internal knowledge of the relative bazel paths to local_cuda. This allows us to workaround the issue with CUDA symlinks. Without it we are getting `undeclared inclusion(s) in rule` all over the place for CUDA headers.

## Test plan

Green CI build https://github.com/pytorch/pytorch/actions/runs/4267147180/jobs/7428431740

Note that now it says "CUDA" in the sccache output

```
+ sccache --show-stats
Compile requests                    9784
Compile requests executed           6726
Cache hits                          6200
Cache hits (C/C++)                  6131
Cache hits (CUDA)                     69
Cache misses                         519
Cache misses (C/C++)                 201
Cache misses (CUDA)                  318
Cache timeouts                         0
Cache read errors                      0
Forced recaches                        0
Cache write errors                     0
Compilation failures                   0
Cache errors                           7
Cache errors (C/C++)                   7
Non-cacheable compilations             0
Non-cacheable calls                 2893
Non-compilation calls                165
Unsupported compiler calls             0
Average cache write                0.116 s
Average cache read miss           23.722 s
Average cache read hit             0.057 s
Failed distributed compilations        0
```
Pull Request resolved: pytorch/pytorch#95528
Approved by: https://github.com/huydhn
@lgulich
Copy link

lgulich commented Oct 30, 2023

This is still an issue in Bazel. However current TensorFlow has a patch that resolves symlinks to the compiler before passing it to Bazel which works but is only a workaround.

@Flamefire Could you share a link to this patch? I am facing the same problem (for a non-tensorflow workspace) and would like to investigate if I could reuse your fix. Thanks in advance!

@Flamefire
Copy link
Contributor Author

@lgulich The patch I was referring to is tensorflow/tensorflow@58b236b
However that doesn't fully solve the issue when compiler wrappers (e.g. ccache) are used. We use a patch I added as a PR there but it is pending for ages: tensorflow/tensorflow#60668

jhavukainen pushed a commit to kulinseth/pytorch that referenced this issue Mar 15, 2024
Fixes pytorch#79348

This change is mostly focused on enabling nvcc+sccache in the PyTorch CI.

Along the way we had to do couple tweaks:
1.  Split the rules_cc from the rules_cuda that embeeded them before. This is needed in order to apply a different patch to the rules_cc compare to the one that rules_cuda does by default. This is in turn needed because we need to workaround an nvcc behavior where it doesn't send `-iquote xxx` to the host compiler, but it does send `-isystem xxx`. So we workaround this problem with (ab)using `-isystem` instead. Without it we are getting errors like `xxx` is not found.

2. Workaround bug in bazel bazelbuild/bazel#10167 that prevents us from using a straightforward and honest `nvcc` sccache wrapper. Instead we generate ad-hock bazel specific nvcc wrapper that has internal knowledge of the relative bazel paths to local_cuda. This allows us to workaround the issue with CUDA symlinks. Without it we are getting `undeclared inclusion(s) in rule` all over the place for CUDA headers.

## Test plan

Green CI build https://github.com/pytorch/pytorch/actions/runs/4267147180/jobs/7428431740

Note that now it says "CUDA" in the sccache output

```
+ sccache --show-stats
Compile requests                    9784
Compile requests executed           6726
Cache hits                          6200
Cache hits (C/C++)                  6131
Cache hits (CUDA)                     69
Cache misses                         519
Cache misses (C/C++)                 201
Cache misses (CUDA)                  318
Cache timeouts                         0
Cache read errors                      0
Forced recaches                        0
Cache write errors                     0
Compilation failures                   0
Cache errors                           7
Cache errors (C/C++)                   7
Non-cacheable compilations             0
Non-cacheable calls                 2893
Non-compilation calls                165
Unsupported compiler calls             0
Average cache write                0.116 s
Average cache read miss           23.722 s
Average cache read hit             0.057 s
Failed distributed compilations        0
```
Pull Request resolved: pytorch#95528
Approved by: https://github.com/huydhn
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
not stale Issues or PRs that are inactive but not considered stale P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Rules-CPP Issues for C++ rules type: bug
Projects
None yet
Development

No branches or pull requests

7 participants