
Cannot compile PyTorch/XLA master. #6564

Closed
ysiraichi opened this issue Feb 17, 2024 · 2 comments · Fixed by #6569

@ysiraichi
Collaborator

ysiraichi commented Feb 17, 2024

🐛 Bug

I'm trying to compile the PyTorch/XLA master branch (see the commands below); however, I'm getting the following error:

export CUDA_HOME="/usr/local/cuda"
export XLA_CUDA=1
python setup.py develop --user
ERROR: xla/service/gpu/kernels/BUILD:156:13: Compiling xla/service/gpu/kernels/topk_kernel_bfloat16.cu.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF bazel-out/k8-opt/bin/external/xla/xla/service/gpu/kernels/_objs/topk_kernel_cuda/topk_kernel_bfloat16.cu.pic.d ... (remaining 110 arguments skipped)

external/com_google_absl/absl/strings/internal/str_format/bind.h: In constructor ‘absl::lts_20230802::str_format_internal::FormatSpecTemplate<Args>::FormatSpecTemplate(const absl::lts_20230802::str_format_internal::ExtendedParsedFormat<absl::lts_20230802::FormatConversionCharSet(C)...>&)’:
external/com_google_absl/absl/strings/internal/str_format/bind.h:172:1: error: parse error in template argument list
  172 |     CheckArity<sizeof...(C), sizeof...(Args)>();
      | ^   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
external/com_google_absl/absl/strings/internal/str_format/bind.h:172:63: error: expected ‘;’ before ‘)’ token
  172 |     CheckArity<sizeof...(C), sizeof...(Args)>();
      |                                                               ^
      |                                                               ;
external/com_google_absl/absl/strings/internal/str_format/bind.h:173:147: error: template argument 1 is invalid
  173 |     CheckMatches<C...>(absl::make_index_sequence<sizeof...(C)>{});
      |                                                                                                                                                   ^
external/com_google_absl/absl/strings/internal/str_format/bind.h:173:151: error: expected primary-expression before ‘{’ token
  173 |     CheckMatches<C...>(absl::make_index_sequence<sizeof...(C)>{});
      |                                                                                                                                                       ^
external/com_google_absl/absl/strings/internal/str_format/bind.h:173:150: error: expected ‘;’ before ‘{’ token
  173 |     CheckMatches<C...>(absl::make_index_sequence<sizeof...(C)>{});
      |                                                                                                                                                      ^
      |                                                                                                                                                      ;
external/com_google_absl/absl/strings/internal/str_format/bind.h:173:153: error: expected primary-expression before ‘)’ token
  173 |     CheckMatches<C...>(absl::make_index_sequence<sizeof...(C)>{});
      |                                                                                                                                                         ^
Target //:_XLAC.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 882.296s, Critical Path: 94.86s
INFO: 8108 processes: 259 internal, 7849 local.
FAILED: Build did NOT complete successfully
error: command 'bazel' failed with exit status 1

Environment

cc @miladm @lezcano
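
The log above already points at the next diagnostic step (`--verbose_failures`). A minimal sketch of how one might gather more detail, assuming the same checkout and that Bazel can be invoked directly on the target named in the log:

```
# Check which host toolchain the crosstool wrapper will drive
# (CUDA_HOME=/usr/local/cuda as exported above).
gcc --version
/usr/local/cuda/bin/nvcc --version

# Re-run only the failing target with full command lines, as the log suggests.
# Invoking bazel directly like this is an assumption; setup.py normally drives
# the bazel build, but the target name //:_XLAC.so is taken from the log above.
bazel build //:_XLAC.so --verbose_failures
```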

@ysiraichi
Collaborator Author

After investigating a bit, I see that the last XLA pin update (#6530) is the culprit. Replacing the updated XLA pin b166243711f71b0a55daa1eda36b1dc745886784 with the former pin c08cfb0377e4e33a21bde65950f986a21c8a8199 makes the error go away.
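
For anyone who needs the same workaround locally, here is a minimal sketch, assuming the pinned hash can simply be swapped wherever the build configuration references it (the exact file holding the pin is not shown here and may differ):

```
# Both hashes are taken from the comment above.
OLD_PIN=b166243711f71b0a55daa1eda36b1dc745886784   # pin introduced by #6530
PREV_PIN=c08cfb0377e4e33a21bde65950f986a21c8a8199  # former pin that still builds

# Replace the pin wherever the build configuration references it (assumption:
# WORKSPACE and/or setup.py hold it), then rebuild as before.
grep -rl "$OLD_PIN" WORKSPACE setup.py 2>/dev/null | xargs -r sed -i "s/$OLD_PIN/$PREV_PIN/g"

export CUDA_HOME="/usr/local/cuda"
export XLA_CUDA=1
python setup.py develop --user
```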

cota added a commit that referenced this issue Feb 20, 2024
Bump the pinned XLA version to fix GPU builds with CUDA11.
Note that there are only 13 commits between the new pin and
the previous one:

```
$ git log --oneline b1662437^..419a3d73
419a3d736 [xla] Do not include absl headers into xla/types.h
1a4ec9190 [xla:gpu] Add initialization guard to make sure we have exactly one NCCL clique initialization in progress
1365d31a8 [xla] Fix test compilation for environments without cuda
86e231a58 [xla:gpu] Add support for legacy API custom calls in AddressComputationFusionRewriter
82e775381 Fix broken build for convert_memory_placement_to_internal_annotations_test
db973b7fb Integrate LLVM at llvm/llvm-project@bc66e0cf9feb
09c7c0818 Fix gcd simplification of div.
04af47afd PR #9400: Move Gt(Max) optimization after all other HandleCompare optimizations
06c8c19d8 Fix pad indexing map with interior padding.
a27177d76 [XLA:GPU] Implement GpuPriorityFusion::Run instead of calling InstructionFusion::Run.
8a5491aa8 Don't require the argument of ReducePrecision to be a tensor.
50b3b8c40 [XLA] Add a way for an HLO runner to run instructions in isolation.
e020e2e9b [XLA:GPU] Add coalescing heuristic.
b16624371 Add support for unpinned_host for host memory offloading. XLA does not currently differentiate between pinned and unpinned.
```

Fixes #6564.
cota added a commit that referenced this issue Feb 21, 2024
Bump the pinned XLA version to fix GPU builds with CUDA11.
@yeounoh
Contributor

yeounoh commented Mar 21, 2024

> After investigating a bit, I see that the last XLA pin update (#6530) is the culprit. Replacing the updated XLA pin b166243711f71b0a55daa1eda36b1dc745886784 with the former pin c08cfb0377e4e33a21bde65950f986a21c8a8199 makes the error go away.

Hi @ysiraichi, does this mean we have to revert the latest pin update? Is it resolved at HEAD? We are moving to HEAD soon.
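
One way to answer the "is it resolved at HEAD?" question is to check whether the currently pinned XLA commit already contains the bumped pin from #6569 (419a3d736). A minimal sketch, where CURRENT_PIN is a placeholder for whatever hash the repository pins at HEAD:

```
# Clone openxla/xla and test ancestry of the bumped pin against the current pin.
git clone https://github.com/openxla/xla.git && cd xla
CURRENT_PIN=<hash pinned by pytorch/xla at HEAD>   # placeholder, fill in manually
git merge-base --is-ancestor 419a3d736 "$CURRENT_PIN" \
  && echo "current pin already includes the bumped pin (issue fixed)" \
  || echo "current pin predates the bumped pin"
```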
