Loop Blocking for fn GPU Backend #1787
Conversation
// Offset of the first loop iteration handled by this thread along dimension I:
// each GPU thread block covers ThreadBlockSize * LoopBlockSize iterations, and
// each thread within it handles LoopBlockSize consecutive iterations.
return index_at_dim<I>(blockIdx) * (ThreadBlockSize::value * LoopBlockSize::value) +
    index_at_dim<I>(threadIdx) * LoopBlockSize::value;
} else {
    // Dimension I is not blocked: the offset is a compile-time zero.
    return integral_constant<int, 0>();
I am a bit lost here, maybe you can add a few comments.
Added some at various places, let me know if that’s enough.
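For readers following the thread, a minimal standalone CUDA sketch of the indexing scheme discussed above may help. The kernel name `blocked_copy` and the flat 1D layout are illustrative assumptions, not the GridTools code: each thread block covers `ThreadBlockSize * LoopBlockSize` iterations, and each thread processes a contiguous run of `LoopBlockSize` of them.

```cuda
// Hypothetical illustration of the loop-blocking index computation above;
// not the GridTools implementation.
template <int ThreadBlockSize, int LoopBlockSize>
__global__ void blocked_copy(float const *in, float *out, int n) {
    // First element of this thread's chunk, mirroring the snippet:
    // blockIdx * (ThreadBlockSize * LoopBlockSize) + threadIdx * LoopBlockSize.
    int first = blockIdx.x * (ThreadBlockSize * LoopBlockSize)
              + threadIdx.x * LoopBlockSize;
    // Each thread then loops over its own block of LoopBlockSize iterations.
    for (int i = 0; i < LoopBlockSize; ++i) {
        int idx = first + i;
        if (idx < n)
            out[idx] = in[idx];
    }
}
```

Launched as `blocked_copy<128, 4><<<blocks, 128>>>(in, out, n)`, each thread block would cover 512 contiguous elements.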
lgtm
Implements loop blocking for the GPU fn backend. Thread block size (that is, CUDA/HIP threads per block) and loop block size (that is, loop iterations per CUDA/HIP thread) can now be specified as template parameters.

Further changes:
- Set `__launch_bounds__` in the fn GPU kernel based on the thread block size.
- Activate vertical loop blocking in the fn nabla kernels on newer CUDA versions that support `GT_PROMISE`.

Performance changes:
- `__launch_bounds__` affects performance of the `fn_cartesian_vertical_advection` benchmark significantly (positively or negatively, depending on domain size).
- Performance of fn nabla benchmarks improves significantly on newer CUDA versions.
- Performance on Daint is currently reduced because the CUDA version installed there is too old.
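To make the `__launch_bounds__` point concrete: the qualifier promises the compiler an upper bound on threads per block, which it uses when allocating registers per thread, so occupancy and thus performance can shift in either direction. Below is a hedged sketch of deriving the bound from the thread-block-size template parameter; the kernel `axpy_blocked` and the launch snippet are illustrative assumptions, not the fn backend code.

```cuda
// Illustrative only: tie __launch_bounds__ to a thread-block-size template
// parameter, in the spirit of what this PR does for the fn GPU kernel.
template <int ThreadBlockSize, int LoopBlockSize>
__global__ void __launch_bounds__(ThreadBlockSize)
    axpy_blocked(float a, float const *x, float *y, int n) {
    // Same blocked indexing as above: LoopBlockSize iterations per thread.
    int first = blockIdx.x * (ThreadBlockSize * LoopBlockSize)
              + threadIdx.x * LoopBlockSize;
    for (int i = 0; i < LoopBlockSize; ++i) {
        int idx = first + i;
        if (idx < n)
            y[idx] = a * x[idx] + y[idx];
    }
}

// Usage sketch: the runtime block size must not exceed the promised bound.
// constexpr int tbs = 128, lbs = 4;
// int blocks = (n + tbs * lbs - 1) / (tbs * lbs);
// axpy_blocked<tbs, lbs><<<blocks, tbs>>>(a, x, y, n);
```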