Merge #2982

2982: Reduce excessive loop unrolling in lbgpu velocity interpolation r=KaiSzuttor a=mkuron

This caused excessive register usage, especially when combined with thrust. Issue discovered by @fweik in #2878.

It turns out that this is a problem for CUDA too, it just exhibits a different behavior. Instead of crashing like on HIP, CUDA just produces a large binary and slower code. In a perfect world, the compiler should display a warning, but I guess neither AMD nor Nvidia operate in a perfect world.

Co-authored-by: Michael Kuron <mkuron@users.noreply.github.com>
espressomd · Jul 10, 2019 · 326c261
2 parents d404075 + f6acc47
Showing 1 changed file with 3 additions and 1 deletion.
diff --git a/src/core/grid_based_algorithms/lbgpu_cuda.cu b/src/core/grid_based_algorithms/lbgpu_cuda.cu
@@ -1418,9 +1418,11 @@ velocity_interpolation(LB_nodes_gpu n_a, float *particle_position,
 
   int cnt = 0;
   float3 interpolated_u{0.0f, 0.0f, 0.0f};
-#pragma unroll
+#pragma unroll 1
   for (int i = 0; i < 3; ++i) {
+#pragma unroll 1
     for (int j = 0; j < 3; ++j) {
+#pragma unroll 3
       for (int k = 0; k < 3; ++k) {
         auto const x =
             fold_if_necessary(center_node_index[0] - 1 + i, para->dim_x);
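
The unrolling pattern in the diff above can be sketched in isolation. This is a minimal illustration, not ESPResSo code: `weighted_sum_27`, `LOOP_UNROLL_1`, `LOOP_UNROLL_3` and `HOST_DEVICE` are hypothetical names, and the loop body is a stand-in for the real interpolation. The macros assume the `_Pragma("unroll N")` form is accepted by nvcc and clang; on other compilers they expand to nothing, so the sketch also builds as plain C++. `unroll 1` asks the compiler to keep a loop rolled, while `unroll 3` on a three-iteration loop requests that it be fully unrolled, so only 3 copies of the body are generated instead of 27.

```cpp
// Hypothetical portability macros: emit the unroll pragmas only for
// compilers assumed to understand them (nvcc, clang).
#if defined(__CUDACC__) || defined(__clang__)
#define LOOP_UNROLL_1 _Pragma("unroll 1")
#define LOOP_UNROLL_3 _Pragma("unroll 3")
#else
#define LOOP_UNROLL_1
#define LOOP_UNROLL_3
#endif

// Make the function callable from both host and device when built with nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical stand-in for the interpolation loop nest: a weighted sum
// over a 3x3x3 neighbourhood stored as 27 consecutive floats.
HOST_DEVICE inline float weighted_sum_27(const float *values,
                                         const float *weights) {
  int cnt = 0;
  float sum = 0.0f;
  // Outer two loops: request no unrolling.
  LOOP_UNROLL_1
  for (int i = 0; i < 3; ++i) {
    LOOP_UNROLL_1
    for (int j = 0; j < 3; ++j) {
      // Innermost loop: unroll by its full trip count of 3.
      LOOP_UNROLL_3
      for (int k = 0; k < 3; ++k) {
        sum += weights[cnt] * values[cnt];
        ++cnt;
      }
    }
  }
  return sum;
}
```

The pragmas are only hints about code generation and do not change the result. Whether the register count and binary size actually drop would need to be checked for the target compiler and architecture, for example by comparing the resource usage that nvcc reports with `-Xptxas -v` before and after the change.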