Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI][Windows] Workaround for error in Findzstd.cmake #17283

Draft
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

Lunderberg
Copy link
Contributor

This is a workaround for an upstream LLVM issue [0], which looks to be caused by the CMAKE_INSTALL_LIBDIR variable is used before definition. While there is an LLVM PR to resolve this fix [1], as of 2024-08-19 it has not yet been merged to LLVM.

This change is intended to resolve the following error, which occurs during the CI build of TVM on Windows.

The system cannot find the file specified.
CMake Error at C:/Miniconda/envs/tvm-build/conda-bld/tvm-package_1723747883202/_h_env/Library/lib/cmake/llvm/Findzstd.cmake:39 (string):
  string sub-command REGEX, mode REPLACE: regex "$" matched an empty string.
Call Stack (most recent call first):
  C:/Miniconda/envs/tvm-build/conda-bld/tvm-package_1723747883202/_h_env/Library/lib/cmake/llvm/LLVMConfig.cmake:277 (find_package)
  cmake/utils/FindLLVM.cmake:47 (find_package)
  cmake/modules/LLVM.cmake:31 (find_llvm)
  CMakeLists.txt:565 (include)

[0] llvm/llvm-project#83802
[1] llvm/llvm-project#83807

This is a workaround for an upstream LLVM issue [0], which looks to be
caused by the `CMAKE_INSTALL_LIBDIR` variable is used before
definition.  While there is an LLVM PR to resolve this fix [1], as of
2024-08-19 it has not yet been merged to LLVM.

This change is intended to resolve the following error, which occurs
during the CI build of TVM on Windows.

```
The system cannot find the file specified.
CMake Error at C:/Miniconda/envs/tvm-build/conda-bld/tvm-package_1723747883202/_h_env/Library/lib/cmake/llvm/Findzstd.cmake:39 (string):
  string sub-command REGEX, mode REPLACE: regex "$" matched an empty string.
Call Stack (most recent call first):
  C:/Miniconda/envs/tvm-build/conda-bld/tvm-package_1723747883202/_h_env/Library/lib/cmake/llvm/LLVMConfig.cmake:277 (find_package)
  cmake/utils/FindLLVM.cmake:47 (find_package)
  cmake/modules/LLVM.cmake:31 (find_llvm)
  CMakeLists.txt:565 (include)
```

[0] llvm/llvm-project#83802
[1] llvm/llvm-project#83807
@Lunderberg
Copy link
Contributor Author

And confirming that the error in the Windows build in CI is resolved, as the CI has passed the location where Findzstd.cmake was imported, and which would have caused the error.

@Lunderberg Lunderberg marked this pull request as draft August 19, 2024 20:50
@Lunderberg
Copy link
Contributor Author

Even though the CI build was able to complete, the unit test tests/python/all-platform-minimal-test/test_runtime_ndarray.py::test_fp16_conversion failed with the error message Windows fatal exception: access violation. This error isn't reproducible on Linux, so for now I've added some debug print statements to the PR. To make sure these are removed before landing this change, this PR is currently marked as a draft.

If pytest captures the output, segfaults in a unit test prevent any
output from being printed.
@tqchen
Copy link
Member

tqchen commented Aug 21, 2024

They may have to do with some of the functions defined in https://github.com/apache/tvm/blob/main/src/runtime/builtin_fp16.cc#L46, although i am not sure which one

@Lunderberg
Copy link
Contributor Author

Hmm. I'm seeing local __truncsfhf2 and __extendhfsf2 functions, which are defined here along with a comment saying that they were based on the functions in buildin_fp16.cc.

However, I don't see any calls to these local functions in the LLVM IR. It looks like the generated LLVM IR instead uses fpext (here) to convert from float16 to float32.

@Lunderberg
Copy link
Contributor Author

I think I've tracked down the problem. Writing down the steps to record it, and to collect all the links in one spot.

  1. The __truncsfhf2 and __extendhfsf2 are builtins provided by LLVM. Calls to these builtins are generated from the LLVM IR instructions fptrunc and fpext.

  2. In LLVM 15, the ABI of these builtins was changed from accepting uint16 arguments to accepting _Float16 arguments (see this thread. As a result, TVM-generated code that was compiled under LLVM 15 would be incompatible with the LLVM 14 runtime.

  3. As a result of (2), TVM PR#12877 would inject local overrides of __truncsfhf2 and __extendhfsf2. That way, when LLVM lowers fpext to __extendhfsf2, it uses our local override of __extendhfsf2. The choice of which __extendhfsf2 is based on whether the target supports SSE2, matching the decision made by LLVM.

  4. After (this commit), LLVM performs a per-architecture check to determine whether the compiler supports the use of _Float16 use of float16. The first release containing the commit was LLVM 17.

    However, this commit also switches from using the LLVM builtin_check_c_compiler_source to the cmake check_c_source_compiles. The former only attempts to compile a string (source link), while the latter attempts to compile and link the string (doc link). If I understand correctly, since the string doesn't define int main, this check would always return false.

    This looks like a bug in the upstream LLVM implementation, but the relevance here is that it would mean that LLVM 17 would produce calls to the uint32 ABI, while our replacement function would expect calls with the _Float16 ABI.

  5. In June, the Windows CI runners updated from using LLVM 16 to LLVM 18 (link). The timing of it is a point against this hypothesis, since the ~2 month gap between when the new version of LLVM was rolled out in Github and when the issue started occurring in TVM. It's possible that that was just their gradual rollout, but that would be a stretch.

To test whether there's an incompatibility between the ABI expected by LLVM, and the ABI that we provide, I've added more debug statements and commented out the call to EmitFloat16ConversionBuiltins. If commenting out that line avoids the issue on Windows, then that would at least point to some sort of incompatibility.

@tqchen
Copy link
Member

tqchen commented Aug 22, 2024

i see, we can try to move forward and be compatible with later LLVM ver if that is something we can do

@Lunderberg
Copy link
Contributor Author

Well, it was a theory, but it doesn't seem to have panned out. Even with the custom conversions in EmitFloat16ConversionBuiltins commented out, the same access violation occurs.

For now, I'm out of ideas. This may need to be debugged by somebody with access to a Windows development environment.

@Lunderberg
Copy link
Contributor Author

With a few more debug print statements, the error appears unrelated to the use of f16 primitives. The last print statement occurs in LLVMModule, just before the PackedFunc call to (*faddr)(arg_values, arg_type_codes, ...). The print statement at the start of the TIR implementation ("Start of function\n") does not get printed, so the compiled PrimFunc must not be entered at all. To verify this, I'm running a version of the test where both the input and output are float32.

If this is the case, it may be related to this issue in the github runners for Windows. From what I can tell, the Windows image shipped with a MSVC compiler newer than its MSVC runtime, causing incompatibilities between generated code and the runtime. The issue has a workaround in some cases, but several users have reported that the workaround is very fragile, and depends on (1) DLL load order, (2) whether any other program provided an older version of the MSVC runtime, and (3) whether the C:\hostedtoolcache\windows path contains leftover files from a previous CI run.

The f32-to-f32 test case passed, so it's not an issue with
all generated code.  Trying a f16-to-f16 conversion to see if it's a
problem with the existence of f16 arguments at all.
@Lunderberg
Copy link
Contributor Author

And after a couple more test cases, it's back to looking like the conversion functions between float16 and float32 are the issue, since the Windows CI can pass when running either f32-to-f32 or f16-to-f16 test cases.

Though, that doesn't explain why the print statements from inside the TIR aren't showing up in the output, even when fflush is immediately called. With the access violation only occurring when the conversions are enabled, I would have expected the start of function print statements to appear in the output. This may point to the error occurring during the PackedFunc interface, which would be before the first print TIR print statement. Running another test, where the test case specifies the lowered form of the PrimFunc, with print statements prior to the PackedFunc argument unpacking.

@mshr-h
Copy link
Contributor

mshr-h commented Sep 16, 2024

Just FYI, we can ssh to the GitHub Actions runner.
https://github.com/mxschmitt/action-tmate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants