SYCL telescope Kalman fitter tests fail in SYCL with OneAPI 2024.2 #655

Open
stephenswat opened this issue Jul 26, 2024 · 1 comment
Labels
bug Something isn't working sycl Changes related to SYCL

Comments

@stephenswat
Member

Here is one of the most profoundly bewildering bugs I have ever seen. The Kalman fitter tests in telescope geometries don't work in SYCL with OneAPI 2024.2.

Reproduction

To reproduce the bug, run the following commands:

# Can also be done without the Docker container, but this is easier
$ docker run -it ghcr.io/acts-project/ubuntu2404_oneapi:55
$ git clone https://github.com/acts-project/traccc.git
$ (cd traccc; git checkout f7d9df8)
$ source /opt/intel/oneapi/setvars.sh --include-intel-llvm
# Building for the spir64_x86_64 target causes the compiler to crash, which is a whole different issue
$ export SYCLFLAGS="-fsycl -fsycl-targets=spir64"
$ cmake -S traccc -B build -DCMAKE_BUILD_TYPE=Debug -DTRACCC_BUILD_TESTING=ON -DTRACCC_BUILD_SYCL=ON
$ cmake --build build -- -j $(nproc) traccc_test_sycl
$ build/bin/traccc_test_sycl --gtest_filter="SYCLKalmanFitTelescopeValidation/KalmanFittingTelescopeTests.Run/*"

This will produce the following error:

Running main() from /build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = SYCLKalmanFitTelescopeValidation/KalmanFittingTelescopeTests.Run/*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from SYCLKalmanFitTelescopeValidation/KalmanFittingTelescopeTests
[ RUN      ] SYCLKalmanFitTelescopeValidation/KalmanFittingTelescopeTests.Run/0
Running Seeding on device: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics  
WARNING: No entries in volume finder

Detector check: OK

 *** Break *** segmentation violation

So, we have a segmentation fault in this executable.

Diagnostics

At this point you may, just like I did, naively assume that this is some memory error in our code. Wouldn't that be nice and easy to fix? But nothing could be less true, as gdb shows us:

$ apt install -y gdb
$ gdb -ex run --args build/bin/traccc_test_sycl --gtest_filter="SYCLKalmanFitTelescopeValidation/KalmanFittingTelescopeTests.Run/*"
Thread 1 "traccc_test_syc" received signal SIGSEGV, Segmentation fault.
0x00007f51d97e14f8 in llvm::vpo::VPlanTTICostModel::getLoadStoreIndexSize(llvm::vpo::VPLoadStoreInst const*) const () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
(gdb) bt
...
#0  0x00007f51d97e14f8 in llvm::vpo::VPlanTTICostModel::getLoadStoreIndexSize(llvm::vpo::VPLoadStoreInst const*) const () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#1  0x00007f51d8d8f7e7 in llvm::vpo::VPlanTTICostModel::getLoadStoreCost(llvm::vpo::VPLoadStoreInst const*, llvm::Align, unsigned int, bool) const () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#2  0x00007f51d8fdd45d in llvm::vpo::VPlanTTICostModel::getTTICostForVF(llvm::vpo::VPInstruction const*, unsigned int) () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#3  0x00007f51d8fdd2bc in llvm::vpo::VPlanTTICostModel::getTTICost(llvm::vpo::VPInstruction const*) () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#4  0x00007f51da8ee56f in llvm::vpo::VPlanCostModelWithHeuristics<llvm::vpo::HeuristicsList<llvm::vpo::VPInstruction const>, llvm::vpo::HeuristicsList<llvm::vpo::VPBasicBlock const>, llvm::vpo::HeuristicsList<llvm::vpo::VPlanVector const, llvm::vpo::VPlanCostModelHeuristics::HeuristicSpillFill, llvm::vpo::VPlanCostModelHeuristics::HeuristicUnroll> >::getCostImpl(llvm::vpo::VPInstruction const*, llvm::raw_ostream*) ()
   from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#5  0x00007f51da8ee427 in llvm::vpo::VPlanCostModelWithHeuristics<llvm::vpo::HeuristicsList<llvm::vpo::VPInstruction const>, llvm::vpo::HeuristicsList<llvm::vpo::VPBasicBlock const>, llvm::vpo::HeuristicsList<llvm::vpo::VPlanVector const, llvm::vpo::VPlanCostModelHeuristics::HeuristicSpillFill, llvm::vpo::VPlanCostModelHeuristics::HeuristicUnroll> >::getCostImpl(llvm::vpo::VPBasicBlock const*, llvm::raw_ostream*) ()
   from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
...

So the issue is not really on our end per se: the crash happens inside libintelocl.so, i.e. in Intel's SPIR compiler, while it is running inside our test process. Aight.
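As a rough cross-check of that reading, the frames shown above can be tallied by the shared library they belong to. This is only a sketch: it covers frames #0 to #3 as copied from the backtrace (the elided "..." parts are not reconstructed), and the `frames` helper and its regular expression are my own, not part of gdb or traccc.

```python
import re
from collections import Counter

# Frames #0-#3, copied verbatim from the gdb backtrace above.
BACKTRACE = """\
#0  0x00007f51d97e14f8 in llvm::vpo::VPlanTTICostModel::getLoadStoreIndexSize(llvm::vpo::VPLoadStoreInst const*) const () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#1  0x00007f51d8d8f7e7 in llvm::vpo::VPlanTTICostModel::getLoadStoreCost(llvm::vpo::VPLoadStoreInst const*, llvm::Align, unsigned int, bool) const () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#2  0x00007f51d8fdd45d in llvm::vpo::VPlanTTICostModel::getTTICostForVF(llvm::vpo::VPInstruction const*, unsigned int) () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#3  0x00007f51d8fdd2bc in llvm::vpo::VPlanTTICostModel::getTTICost(llvm::vpo::VPInstruction const*) () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
"""

# Hypothetical helper pattern: frame number, symbol, and owning library.
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (.+?) \(\)\s+from (\S+)$")


def frames(text):
    """Yield (frame number, symbol, library) for each frame that names a library."""
    # gdb wraps long frames and puts "from <library>" on its own line; rejoin them.
    joined = re.sub(r"\n\s+from ", " from ", text)
    for line in joined.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            yield int(match.group(1)), match.group(2), match.group(3)


# Count how many of the visible frames live in each shared library.
for library, count in Counter(lib for _, _, lib in frames(BACKTRACE)).items():
    print(count, library)
```

For the four frames above, every frame resolves to libintelocl.so and none to the traccc test binary, which is what the conclusion relies on. It says nothing about the frames that were elided.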

Workarounds

This is where it gets truly spicy. I've been able to identify two different ways to avoid the segmentation fault. Of course, both of them break the actual test, but they do make it run:

  1. In simulation/include/traccc/simulation/simulator.hpp, comment out line 97 (p.propagate(propagation, actor_states);).
  2. In core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp, comment out lines 185 (propagator.propagate(propagation, fitter_state());) and 188 (smooth(fitter_state);).

These functions are completely independent: one of them runs on the host, the other on the device. Lmao.

Conclusion

I don't even know at this point, but most certainly there is something very funky happening in OneAPI right now. It could also be some subtle bug in our code, but I haven't been able to find it.

@stephenswat stephenswat added bug Something isn't working sycl Changes related to SYCL labels Jul 26, 2024
stephenswat added a commit to stephenswat/traccc that referenced this issue Jul 26, 2024
As shown in acts-project#655, this is creating a lot of headache. I am looking for a
fix but in the meanwhile this is holding up acts-project#628, so I want to
temporarily disable these tests.
@niermann999
Contributor

niermann999 commented Nov 13, 2024

This sounds similar to what I now see in algebra-plugins. I followed the same instructions, but this time for algebra-plugins, and I see the following:

Thread 1 "algebra_test_ar" received signal SIGSEGV, Segmentation fault.
0x00007f18b96874f8 in llvm::vpo::VPlanTTICostModel::getLoadStoreIndexSize(llvm::vpo::VPLoadStoreInst const*) const ()
   from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
(gdb) bt
#0  0x00007f18b96874f8 in llvm::vpo::VPlanTTICostModel::getLoadStoreIndexSize(llvm::vpo::VPLoadStoreInst const*) const ()
   from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#1  0x00007f18b8c357e7 in llvm::vpo::VPlanTTICostModel::getLoadStoreCost(llvm::vpo::VPLoadStoreInst const*, llvm::Align, unsigned int, bool) const ()
   from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#2  0x00007f18b8e8345d in llvm::vpo::VPlanTTICostModel::getTTICostForVF(llvm::vpo::VPInstruction const*, unsigned int) ()
   from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#3  0x00007f18b8e832bc in llvm::vpo::VPlanTTICostModel::getTTICost(llvm::vpo::VPInstruction const*) ()
...

So, the problem might already be in the linear algebra implementation? Interestingly, only the array and vecmem plugins produce a segmentation fault, while the eigen plugin runs but produces incorrect results that fail the host comparison test.
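One way to check that the two crashes really are the same crash, and not just similar-looking ones, is to compare the frame addresses of the traccc backtrace with those of the algebra-plugins backtrace. The sketch below assumes that both runs used the same build of libintelocl.so (both paths point at the 2024.2 install) and that the library is loaded at a page-aligned address. If so, the per-frame address differences should all equal the difference between the two load addresses.

```python
# Addresses of frames #0-#3, copied from the two gdb backtraces in this issue.
TRACCC = [0x00007F51D97E14F8, 0x00007F51D8D8F7E7, 0x00007F51D8FDD45D, 0x00007F51D8FDD2BC]
ALGEBRA = [0x00007F18B96874F8, 0x00007F18B8C357E7, 0x00007F18B8E8345D, 0x00007F18B8E832BC]

PAGE_SIZE = 0x1000  # assumption: 4 KiB pages, so load addresses are multiples of this

# Same code location in the same library means every frame differs by the same
# amount, namely the difference between the two load addresses.
deltas = {a - b for a, b in zip(TRACCC, ALGEBRA)}
same_site = len(deltas) == 1 and all(d % PAGE_SIZE == 0 for d in deltas)

print("per-frame address differences:", sorted(hex(d) for d in deltas))
print("same crash site:", same_site)
```

All four frames differ by the same page-aligned offset, 0x392015a000, so under the assumptions above both reports hit the same instruction in the same library. That points at the code being compiled (or at the compiler itself) rather than at anything specific to the traccc test.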
