[CI][Windows] Workaround for error in Findzstd.cmake #17283

Lunderberg · 2024-08-19T13:45:40Z

This is a workaround for an upstream LLVM issue [0], which appears to be caused by the CMAKE_INSTALL_LIBDIR variable being used before it is defined. While there is an LLVM PR to resolve this issue [1], as of 2024-08-19 it has not yet been merged into LLVM.

This change is intended to resolve the following error, which occurs during the CI build of TVM on Windows.

The system cannot find the file specified.
CMake Error at C:/Miniconda/envs/tvm-build/conda-bld/tvm-package_1723747883202/_h_env/Library/lib/cmake/llvm/Findzstd.cmake:39 (string):
  string sub-command REGEX, mode REPLACE: regex "$" matched an empty string.
Call Stack (most recent call first):
  C:/Miniconda/envs/tvm-build/conda-bld/tvm-package_1723747883202/_h_env/Library/lib/cmake/llvm/LLVMConfig.cmake:277 (find_package)
  cmake/utils/FindLLVM.cmake:47 (find_package)
  cmake/modules/LLVM.cmake:31 (find_llvm)
  CMakeLists.txt:565 (include)

[0] llvm/llvm-project#83802
[1] llvm/llvm-project#83807

Lunderberg · 2024-08-19T14:41:27Z

Confirming that the error in the Windows CI build is resolved: the CI has now passed the point where Findzstd.cmake is imported, which is where the error would have occurred.

Lunderberg · 2024-08-19T20:52:33Z

Even though the CI build was able to complete, the unit test tests/python/all-platform-minimal-test/test_runtime_ndarray.py::test_fp16_conversion failed with the error message Windows fatal exception: access violation. This error isn't reproducible on Linux, so for now I've added some debug print statements to the PR. To make sure these are removed before landing this change, this PR is currently marked as a draft.

If pytest captures the output, segfaults in a unit test prevent any output from being printed.
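The general effect can be illustrated outside of pytest. The sketch below is not the change made in this PR; it is a minimal stand-in that assumes a child Python process whose stdout is a pipe, and uses os._exit() to imitate a crash that skips normal shutdown. The helper name run_child is hypothetical. For pytest itself, the -s option (equivalent to --capture=no) disables output capturing.

```python
import os
import subprocess
import sys


def run_child(flush_before_crash: bool) -> str:
    """Run a child Python process that prints and then dies abruptly.

    os._exit() skips normal interpreter shutdown, so it stands in here for a
    segfault: any text still sitting in the stdout buffer is lost.
    """
    child_source = "\n".join(
        [
            "import os, sys",
            "sys.stdout.write('reached the crash site\\n')",
            "sys.stdout.flush()" if flush_before_crash else "pass",
            "os._exit(1)",
        ]
    )
    # Drop PYTHONUNBUFFERED so that the child's piped stdout stays buffered.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    result = subprocess.run(
        [sys.executable, "-c", child_source],
        capture_output=True,
        text=True,
        env=env,
    )
    return result.stdout


if __name__ == "__main__":
    print("without flush:", repr(run_child(flush_before_crash=False)))
    print("with flush:   ", repr(run_child(flush_before_crash=True)))
```

Under these assumptions, the message only survives when it is flushed before the abrupt exit, which is why debug prints used to locate a crash are usually flushed immediately and run with capturing disabled.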

tqchen · 2024-08-21T13:08:25Z

It may have to do with some of the functions defined in https://github.com/apache/tvm/blob/main/src/runtime/builtin_fp16.cc#L46, although I am not sure which one.

Lunderberg · 2024-08-21T19:50:31Z

Hmm. I'm seeing local __truncsfhf2 and __extendhfsf2 functions, which are defined here along with a comment saying that they were based on the functions in builtin_fp16.cc.

However, I don't see any calls to these local functions in the LLVM IR. It looks like the generated LLVM IR instead uses fpext (here) to convert from float16 to float32.

Lunderberg · 2024-08-22T15:36:02Z

I think I've tracked down the problem. Writing down the steps to record it, and to collect all the links in one spot.

1. The __truncsfhf2 and __extendhfsf2 functions are builtins provided by LLVM. Calls to these builtins are generated from the LLVM IR instructions fptrunc and fpext.
2. In LLVM 15, the ABI of these builtins was changed from accepting uint16 arguments to accepting _Float16 arguments (see this thread). As a result, TVM-generated code that was compiled under LLVM 15 would be incompatible with the LLVM 14 runtime.
3. As a result of (2), TVM PR#12877 injected local overrides of __truncsfhf2 and __extendhfsf2. That way, when LLVM lowers fpext to __extendhfsf2, it uses our local override of __extendhfsf2. The choice of which version of __extendhfsf2 to emit is based on whether the target supports SSE2, matching the decision made by LLVM.
4. After (this commit), LLVM performs a per-architecture check to determine whether the compiler supports the use of _Float16. The first release containing the commit was LLVM 17.

   However, this commit also switches from using the LLVM builtin_check_c_compiler_source to the cmake check_c_source_compiles. The former only attempts to compile a string (source link), while the latter attempts to compile and link the string (doc link). If I understand correctly, since the string doesn't define int main, this check would always return false.

   This looks like a bug in the upstream LLVM implementation, but the relevance here is that it would mean that LLVM 17 would produce calls to the uint16 ABI, while our replacement function would expect calls with the _Float16 ABI.
5. In June, the Windows CI runners updated from LLVM 16 to LLVM 18 (link). The timing is a point against this hypothesis, since there is a ~2 month gap between when the new version of LLVM was rolled out in GitHub and when the issue started occurring in TVM. It's possible that this was just their gradual rollout, but that would be a stretch.

To test whether there's an incompatibility between the ABI expected by LLVM, and the ABI that we provide, I've added more debug statements and commented out the call to EmitFloat16ConversionBuiltins. If commenting out that line avoids the issue on Windows, then that would at least point to some sort of incompatibility.
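For readers unfamiliar with the distinction, the uint16 ABI passes the raw 16-bit pattern of the half-precision value in an integer argument, and the callee decodes it. The sketch below is a plain-Python illustration of that decoding only; it is not LLVM's compiler-rt code nor TVM's builtin_fp16.cc, and the function names are hypothetical. It compares a manual decode of the bits against Python's own half-precision support in the struct module.

```python
import math
import struct


def extend_half_bits(bits: int) -> float:
    """Decode a uint16 bit pattern as an IEEE 754 binary16 value."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF

    if exponent == 0:
        # Zero or subnormal: no implicit leading one, fixed scale of 2**-24.
        magnitude = mantissa * 2.0**-24
    elif exponent == 0x1F:
        # All-ones exponent encodes infinity (mantissa 0) or NaN (otherwise).
        magnitude = math.inf if mantissa == 0 else math.nan
    else:
        # Normal number: implicit leading one, exponent bias of 15.
        magnitude = (1.0 + mantissa / 1024.0) * 2.0 ** (exponent - 15)

    return -magnitude if sign else magnitude


def reinterpret_as_half(bits: int) -> float:
    """Reinterpret the same 16 bits as a half float via struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


if __name__ == "__main__":
    for bits in (0x3C00, 0xC000, 0x0001, 0x7C00):
        print(hex(bits), extend_half_bits(bits), reinterpret_as_half(bits))
```

If a caller and callee disagree on whether those 16 bits arrive in an integer register or a floating-point register, the callee reads an unrelated value, which is the kind of mismatch being hypothesized here.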

tqchen · 2024-08-22T17:15:07Z

I see. We can try to move forward and be compatible with later LLVM versions, if that is something we can do.

Lunderberg · 2024-08-22T17:18:31Z

Well, it was a theory, but it doesn't seem to have panned out. Even with the custom conversions in EmitFloat16ConversionBuiltins commented out, the same access violation occurs.

For now, I'm out of ideas. This may need to be debugged by somebody with access to a Windows development environment.

Lunderberg · 2024-08-26T18:43:18Z

With a few more debug print statements, the error appears unrelated to the use of f16 primitives. The last print statement occurs in LLVMModule, just before the PackedFunc call to (*faddr)(arg_values, arg_type_codes, ...). The print statement at the start of the TIR implementation ("Start of function\n") does not get printed, so the compiled PrimFunc must not be entered at all. To verify this, I'm running a version of the test where both the input and output are float32.

If this is the case, it may be related to this issue in the GitHub runners for Windows. From what I can tell, the Windows image shipped with an MSVC compiler newer than its MSVC runtime, causing incompatibilities between generated code and the runtime. The issue has a workaround in some cases, but several users have reported that the workaround is very fragile, and depends on (1) DLL load order, (2) whether any other program provided an older version of the MSVC runtime, and (3) whether the C:\hostedtoolcache\windows path contains leftover files from a previous CI run.

The f32-to-f32 test case passed, so it's not an issue with all generated code. Trying a f16-to-f16 conversion to see if it's a problem with the existence of f16 arguments at all.

Lunderberg · 2024-08-29T13:59:59Z

And after a couple more test cases, it's back to looking like the conversion functions between float16 and float32 are the issue, since the Windows CI can pass when running either f32-to-f32 or f16-to-f16 test cases.

Though, that doesn't explain why the print statements from inside the TIR aren't showing up in the output, even when fflush is immediately called. With the access violation only occurring when the conversions are enabled, I would have expected the start-of-function print statements to appear in the output. This may point to the error occurring during the PackedFunc interface, which would be before the first TIR print statement. I'm running another test, where the test case specifies the lowered form of the PrimFunc, with print statements prior to the PackedFunc argument unpacking.

mshr-h · 2024-09-16T16:00:10Z

Just FYI, we can ssh to the GitHub Actions runner.
https://github.com/mxschmitt/action-tmate

Add debug print statements for failing Windows unit test

6788d48

Lunderberg marked this pull request as draft August 19, 2024 20:50

Lunderberg added 2 commits August 20, 2024 09:18

Do not capture output in CI Windows pytest

f784949

More debug print statements

af3073b

Add another debug print statement

fa5ee2b

Add more debug print statements

4252cec

Lunderberg added 3 commits August 22, 2024 15:29

Add even more debug print statements

ce8fc4a

Add another debug print

da1022c

Added more debug prints

3c2bc5a

Lunderberg mentioned this pull request Aug 26, 2024

[Relax] Require correct input/output shapes R.call_tir #17285

Merged

More debugging, try a float32 to float32 (no-op) function

64bb4c5

Lunderberg added 3 commits August 27, 2024 10:10

Debugging, ensure that the f32-to-f32 test case runs first

eeee23c

Try a f16-to-f16 test case

a2bd7de

Add more debug print statements, in PackedFunc argument unpacking

1a4216f

Lunderberg added 2 commits September 16, 2024 08:17

Lint fix

43498c4

Another attempt, to see if using JIT vs saved DLL makes a difference

7b7cef4
