
Cannot compile PyTorch/XLA master. #6564

Closed
ysiraichi opened this issue Feb 17, 2024 · 2 comments · Fixed by #6569

@ysiraichi
Collaborator

ysiraichi commented Feb 17, 2024

🐛 Bug

I'm trying to compile the PyTorch/XLA master branch (see the commands below); however, I'm getting the following error:

export CUDA_HOME="/usr/local/cuda"
export XLA_CUDA=1
python setup.py develop --user
ERROR: xla/service/gpu/kernels/BUILD:156:13: Compiling xla/service/gpu/kernels/topk_kernel_bfloat16.cu.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF bazel-out/k8-opt/bin/external/xla/xla/service/gpu/kernels/_objs/topk_kernel_cuda/topk_kernel_bfloat16.cu.pic.d ... (remaining 110 arguments skipped)

external/com_google_absl/absl/strings/internal/str_format/bind.h: In constructor ‘absl::lts_20230802::str_format_internal::FormatSpecTemplate<Args>::FormatSpecTemplate(const absl::lts_20230802::str_format_internal::ExtendedParsedFormat<absl::lts_20230802::FormatConversionCharSet(C)...>&)’:
external/com_google_absl/absl/strings/internal/str_format/bind.h:172:1: error: parse error in template argument list
  172 |     CheckArity<sizeof...(C), sizeof...(Args)>();
      | ^   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
external/com_google_absl/absl/strings/internal/str_format/bind.h:172:63: error: expected ‘;’ before ‘)’ token
  172 |     CheckArity<sizeof...(C), sizeof...(Args)>();
      |                                                               ^
      |                                                               ;
external/com_google_absl/absl/strings/internal/str_format/bind.h:173:147: error: template argument 1 is invalid
  173 |     CheckMatches<C...>(absl::make_index_sequence<sizeof...(C)>{});
      |                                                                                                                                                   ^
external/com_google_absl/absl/strings/internal/str_format/bind.h:173:151: error: expected primary-expression before ‘{’ token
  173 |     CheckMatches<C...>(absl::make_index_sequence<sizeof...(C)>{});
      |                                                                                                                                                       ^
external/com_google_absl/absl/strings/internal/str_format/bind.h:173:150: error: expected ‘;’ before ‘{’ token
  173 |     CheckMatches<C...>(absl::make_index_sequence<sizeof...(C)>{});
      |                                                                                                                                                      ^
      |                                                                                                                                                      ;
external/com_google_absl/absl/strings/internal/str_format/bind.h:173:153: error: expected primary-expression before ‘)’ token
  173 |     CheckMatches<C...>(absl::make_index_sequence<sizeof...(C)>{});
      |                                                                                                                                                         ^
Target //:_XLAC.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 882.296s, Critical Path: 94.86s
INFO: 8108 processes: 259 internal, 7849 local.
FAILED: Build did NOT complete successfully
error: command 'bazel' failed with exit status 1

Environment

cc @miladm @lezcano
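
The log above already points at the next diagnostic step (`--verbose_failures`). A minimal sketch of how one might gather more detail, assuming the same checkout and that Bazel can be invoked directly on the target named in the log:

```
# Check which host toolchain the crosstool wrapper will drive
# (CUDA_HOME=/usr/local/cuda as exported above).
gcc --version
/usr/local/cuda/bin/nvcc --version

# Re-run only the failing target with full command lines, as the log suggests.
# Invoking bazel directly like this is an assumption; setup.py normally drives
# the bazel build, but the target name //:_XLAC.so is taken from the log above.
bazel build //:_XLAC.so --verbose_failures
```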

@ysiraichi
Collaborator Author

After investigating a bit, I see that the last XLA pin update (#6530) is the culprit. Replacing the updated XLA pin b166243711f71b0a55daa1eda36b1dc745886784 with the former pin c08cfb0377e4e33a21bde65950f986a21c8a8199 makes the error go away.
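
For anyone who needs the same workaround locally, here is a minimal sketch, assuming the pinned hash can simply be swapped wherever the build configuration references it (the exact file holding the pin is not shown here and may differ):

```
# Both hashes are taken from the comment above.
OLD_PIN=b166243711f71b0a55daa1eda36b1dc745886784   # pin introduced by #6530
PREV_PIN=c08cfb0377e4e33a21bde65950f986a21c8a8199  # former pin that still builds

# Replace the pin wherever the build configuration references it (assumption:
# WORKSPACE and/or setup.py hold it), then rebuild as before.
grep -rl "$OLD_PIN" WORKSPACE setup.py 2>/dev/null | xargs -r sed -i "s/$OLD_PIN/$PREV_PIN/g"

export CUDA_HOME="/usr/local/cuda"
export XLA_CUDA=1
python setup.py develop --user
```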

cota added a commit that referenced this issue Feb 20, 2024
Bump the pinned XLA version to fix GPU builds with CUDA11.
Note that there are only 13 commits between the new pin and
the previous one:

```
$ git log --oneline b1662437^..419a3d73
419a3d736 [xla] Do not include absl headers into xla/types.h
1a4ec9190 [xla:gpu] Add initialization guard to make sure we have exactly one NCCL clique initialization in progress
1365d31a8 [xla] Fix test compilation for environments without cuda
86e231a58 [xla:gpu] Add support for legacy API custom calls in AddressComputationFusionRewriter
82e775381 Fix broken build for convert_memory_placement_to_internal_annotations_test
db973b7fb Integrate LLVM at llvm/llvm-project@bc66e0cf9feb
09c7c0818 Fix gcd simplification of div.
04af47afd PR #9400: Move Gt(Max) optimization after all other HandleCompare optimizations
06c8c19d8 Fix pad indexing map with interior padding.
a27177d76 [XLA:GPU] Implement GpuPriorityFusion::Run instead of calling InstructionFusion::Run.
8a5491aa8 Don't require the argument of ReducePrecision to be a tensor.
50b3b8c40 [XLA] Add a way for an HLO runner to run instructions in isolation.
e020e2e9b [XLA:GPU] Add coalescing heuristic.
b16624371 Add support for unpinned_host for host memory offloading. XLA does not currently differentiate between pinned and unpinned.
```

Fixes #6564.
cota added a commit that referenced this issue Feb 21, 2024
Bump the pinned XLA version to fix GPU builds with CUDA11.
@yeounoh
Contributor

yeounoh commented Mar 21, 2024

> After investigating a bit, I see that the last XLA pin update (#6530) is the culprit. Replacing the updated XLA pin b166243711f71b0a55daa1eda36b1dc745886784 with the former pin c08cfb0377e4e33a21bde65950f986a21c8a8199 makes the error go away.

Hi @ysiraichi, does this mean we have to revert the latest pin update? Is it resolved at HEAD? We are moving to HEAD soon.
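
One way to answer the "is it resolved at HEAD?" question is to check whether the currently pinned XLA commit already contains the bumped pin from #6569 (419a3d736). A minimal sketch, where CURRENT_PIN is a placeholder for whatever hash the repository pins at HEAD:

```
# Clone openxla/xla and test ancestry of the bumped pin against the current pin.
git clone https://github.com/openxla/xla.git && cd xla
CURRENT_PIN=<hash pinned by pytorch/xla at HEAD>   # placeholder, fill in manually
git merge-base --is-ancestor 419a3d736 "$CURRENT_PIN" \
  && echo "current pin already includes the bumped pin (issue fixed)" \
  || echo "current pin predates the bumped pin"
```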
